# Fork Long Training Runs with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/forking-long-runs/fork_long_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction
Training large models can take weeks or months. During these long runs, you may need to:
- Resume from checkpoints after interruptions 
- Fork runs to try different hyperparameters
- Monitor total training progress and lineage

This tutorial shows you how to:
1. **Initialize Neptune** for long runs
2. Create a **forked training run** when needed
3. **Restart training** from checkpoints

Step through a pre-configured report:
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
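Before logging anything, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch; `missing_neptune_env` is a hypothetical helper name used for illustration and is not part of the Neptune client.

```python
import os

# Variable names as used in this tutorial
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_neptune_env()
if missing:
    print(f"Set these variables before creating a run: {', '.join(missing)}")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try out without touching your real environment.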

### Install dependencies and import libraries

In [None]:
# Install dependencies
! pip install -qU neptune_scale numpy

### Setup training simulation

This tutorial uses a simulated training scenario to demonstrate Neptune's forking capabilities. The simulation generates synthetic training metrics that mimic real model training, including:
- Gradually decreasing loss
- Increasing accuracy 
- Random noise and occasional spikes

In the next cell, we'll:
1. Import required libraries (numpy, json, etc.)
2. Create a `TrainingMetrics` class that simulates training progression
3. Define utility functions for saving model state (`save_checkpoint()` and `load_checkpoint()`)

_You can replace this simulation with your actual model training code._

In [None]:
import numpy as np
import json
from typing import Dict, Any
import math


class TrainingMetrics:
    def __init__(
        self,
        initial_loss: float = 1.0,
        initial_accuracy: float = 0.0,
        noise_scale: float = 0.1,
        loss_trend: float = -0.1,
        accuracy_trend: float = 0.1,
    ):
        self.previous_loss = initial_loss
        self.previous_accuracy = initial_accuracy
        self.noise_scale = noise_scale
        self.loss_trend = loss_trend
        self.accuracy_trend = accuracy_trend
        self.step_count = 1

        # Random convergence points
        self.target_loss = np.random.uniform(0.0, 1.0)
        self.target_accuracy = np.random.uniform(0.0, 1.0)

    def update_metrics(self, spikes=True) -> tuple[float, float]:
        """Update loss and accuracy using a random walk process with logarithmic trends"""
        self.step_count += 1

        # Base loss update with normal progression
        decay_factor = math.log(1 + abs(self.previous_loss - self.target_loss))
        loss_step = self.noise_scale * np.random.randn() + self.loss_trend * decay_factor
        loss_step *= 1 + 0.1 * math.log(self.step_count)

        # Check for spike/anomaly (0.01% chance per step) once past step 10,000
        if spikes and np.random.random() < 0.0001 and self.step_count > 10000:
            # Generate a sudden spike, independent of the current loss.
            # The spike is transient: previous_loss is left unchanged.
            spike_magnitude = np.random.uniform(1, 10)  # Random spike value between 1 and 10
            current_loss = spike_magnitude
        else:
            # Normal progression
            current_loss = max(0.0, self.previous_loss + loss_step)
            self.previous_loss = current_loss

        current_accuracy = 1 - current_loss

        return current_loss, current_accuracy


def save_checkpoint(metrics: TrainingMetrics, epoch: int, loss: float, accuracy: float) -> None:
    """Save training state to disk"""
    checkpoint = {
        "epoch": epoch,
        "loss": loss,
        "accuracy": accuracy,
        "step_count": metrics.step_count,
        "noise_scale": metrics.noise_scale,
        "loss_trend": metrics.loss_trend,
        "accuracy_trend": metrics.accuracy_trend,
        "target_loss": metrics.target_loss,
        "target_accuracy": metrics.target_accuracy,
    }
    with open(f"checkpoint_epoch_{epoch}.json", "w") as f:
        json.dump(checkpoint, f)

    print(f"Checkpoint saved; epoch - {epoch}, step - {metrics.step_count}")


def load_checkpoint(epoch: int) -> Dict[str, Any]:
    """Load training state from disk"""
    with open(f"checkpoint_epoch_{epoch}.json", "r") as f:
        checkpoint = json.load(f)
    return checkpoint


# Training parameters
params = {
    "batch_size": 32,
    "epochs": 100,
    "noise_scale": np.random.uniform(0.0005, 0.002),
    "loss_trend": -np.random.uniform(0, 0.0002),
    "accuracy_trend": np.random.uniform(0, 0.0002),
}

# Initialize metrics
metrics = TrainingMetrics(
    initial_loss=np.random.uniform(3, 7),
    initial_accuracy=0,
    noise_scale=params["noise_scale"],
    loss_trend=params["loss_trend"],
    accuracy_trend=params["accuracy_trend"],
)
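The checkpoint helpers above rely on a plain JSON round trip. The sketch below illustrates that idea in isolation, writing to a temporary directory instead of the working directory. The `state` values are made up for illustration; in the tutorial they come from `TrainingMetrics`.

```python
import json
import tempfile
from pathlib import Path

# Illustrative state only, not real training output
state = {"epoch": 10, "loss": 0.8, "accuracy": 0.2, "step_count": 3131}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / f"checkpoint_epoch_{state['epoch']}.json"
    path.write_text(json.dumps(state))  # save the state as JSON
    restored = json.loads(path.read_text())  # load it back

# The restored dictionary should match what was saved
print(restored == state)
```

JSON is sufficient here because the simulated state consists only of numbers. A real model checkpoint would normally use the framework's own serialization.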

## Forking run steps

### Step 1: _Create initial base experiment_

Create a base experiment that represents a long training process. This could be your best-performing model from hyperparameter optimization that you want to train further.

The training loop will:
1. Log training metrics (`loss` and `accuracy`) at each step
2. Save model state every 10 epochs

By saving checkpoints (or model state) periodically, we can restart training from any saved state if needed. Neptune's forking capability allows us to create new runs that inherit the training history, making it easy to visualize the complete training progression across multiple runs.
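Checkpoints are keyed by epoch while Neptune metrics are keyed by step, so it is useful to know how the two relate. A small sketch, assuming the loop shape used in this tutorial (10,000 samples per epoch, epochs counted from 1, and a step counter that starts at 0); the function names are illustrative:

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches produced by range(0, num_samples, batch_size)."""
    return math.ceil(num_samples / batch_size)


def last_logged_step(epoch: int, num_samples: int, batch_size: int) -> int:
    """Last step logged by the end of `epoch`, with steps counted from 0."""
    return epoch * steps_per_epoch(num_samples, batch_size) - 1


print(steps_per_epoch(10000, 32))       # 313
print(last_logged_step(50, 10000, 32))  # 15649
```

Note that the simulation's own `step_count` starts at 1 and is incremented before each update, so in the base run it reads 15651 at the end of epoch 50 rather than 15649. Keep that offset in mind when choosing a step to fork from.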

See our [Quickstart](https://docs.neptune.ai/quickstart) if you have not created a Neptune run before.

In [None]:
from neptune_scale import Run

run = Run(
    experiment_name="forking-long",  # Create a run that is the head of an experiment
)

run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
    }
)

run.add_tags(tags=["forking", "base run"])

print(f"See configuration parameters:\n{run.get_experiment_url() + '&detailsTab=charts'}")

step_counter = 0
# Training loop
for epoch in range(1, params["epochs"]):
    # Process batches
    for batch in range(0, 10000, params["batch_size"]):
        # Update metrics for this batch
        batch_loss, batch_accuracy = metrics.update_metrics()

        run.log_metrics(
            data={
                "metrics/train/loss": batch_loss,
                "metrics/train/accuracy": batch_accuracy,
                "epoch": epoch,
            },
            step=step_counter,
        )
        step_counter += 1

    if epoch % 10 == 0:
        save_checkpoint(metrics, epoch, loss=batch_loss, accuracy=batch_accuracy)

run.close()

### Step 2: Load model state and create forked run

As training instabilities start occurring after 50 epochs, we may decide to restart training from one of our saved model states. We pick a model state to restart from and tell Neptune to inherit the training history from the previous run.

To create a forked run, we need to do the following:
1. Load our saved model state
2. Create a forked run using the Neptune `Run` object
3. Log metrics to the forked run

#### _Step 2.1: Load model state_

In [None]:
# Load checkpoint that you want to fork from
checkpoint = load_checkpoint(epoch=50)

# Initialize metrics with checkpoint values
metrics = TrainingMetrics(
    initial_loss=checkpoint["loss"],
    initial_accuracy=checkpoint["accuracy"],
    noise_scale=checkpoint["noise_scale"],
    loss_trend=checkpoint["loss_trend"],
    accuracy_trend=checkpoint["accuracy_trend"],
)

metrics.step_count = checkpoint["step_count"]
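A checkpoint written by an older version of your code may lack some fields, so it can be worth checking its keys before restoring state. A minimal sketch, assuming the keys read back in this step; `missing_checkpoint_keys` is a hypothetical helper:

```python
# Keys this tutorial reads back from a checkpoint when restoring state
REQUIRED_KEYS = {
    "epoch",
    "loss",
    "accuracy",
    "step_count",
    "noise_scale",
    "loss_trend",
    "accuracy_trend",
}


def missing_checkpoint_keys(checkpoint: dict) -> list[str]:
    """Return the required keys that are absent from the checkpoint, sorted."""
    return sorted(REQUIRED_KEYS - set(checkpoint))


# Example with a deliberately incomplete checkpoint
print(missing_checkpoint_keys({"epoch": 50, "loss": 1.2}))
```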

#### _Step 2.2: Create a forked run_

To create a forked run, you need:
- The run ID of the original run to fork from. You can get it in any of these ways:
  - Read it from the `run._run_id` property of the original `Run` object
  - Look it up in the Neptune web app
  - Fetch it programmatically with the `neptune-fetcher` package
- The step number to fork from

Here's an example of how to create a forked run:
```python
forked_run = Run(
    ...
    fork_run_id="ID_OF_RUN_TO_FORK", # ID of the run to fork from
    fork_step=1234                   # Step at which to fork the run
)
```

See [fork an experiment](https://docs.neptune.ai/fork_experiment) for more details on forking.
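Since a wrong run ID or step is easy to pass by accident, you may want to validate the two fork arguments before creating the run. The sketch below is one way to do that. `fork_kwargs` is a hypothetical helper, and the restriction to non-negative integer steps is an assumption that matches this tutorial, not a rule taken from the Neptune docs.

```python
def fork_kwargs(parent_run_id: str, fork_step: int) -> dict:
    """Validate fork arguments and return them as keyword arguments for Run."""
    if not parent_run_id:
        raise ValueError("parent_run_id must be a non-empty string")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(fork_step, bool) or not isinstance(fork_step, int) or fork_step < 0:
        raise ValueError("fork_step must be a non-negative integer")
    return {"fork_run_id": parent_run_id, "fork_step": fork_step}


kwargs = fork_kwargs("ID_OF_RUN_TO_FORK", 1234)
print(kwargs)

# Usage (requires neptune_scale and valid credentials):
# forked_run = Run(experiment_name="forking-long", **kwargs)
```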

In [None]:
# Create a forked run
print(f"Forking run with ID: {run._run_id}")

forked_run = Run(
    experiment_name="forking-long",  # Becomes new head of the experiment
    fork_run_id=run._run_id,
    fork_step=checkpoint["step_count"],
)

# Log the forked configuration
forked_run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
    }
)

forked_run.add_tags(tags=["fork"])

print(forked_run.get_run_url() + "&detailsTab=charts")

#### _Step 2.3: Log forked run metrics_

Log metrics as normal using the `log_metrics()` method. 
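The forked run shares the parent's step axis, so its step counter continues from the fork point instead of restarting at 0. The following sketch shows that numbering on its own, without any Neptune calls; `resumed_steps` and its parameters are illustrative names.

```python
from typing import Iterator, Tuple


def resumed_steps(
    start_epoch: int, end_epoch: int, first_step: int, batches_per_epoch: int
) -> Iterator[Tuple[int, int]]:
    """Yield (epoch, step) pairs for a loop that resumes at first_step."""
    step = first_step
    for epoch in range(start_epoch, end_epoch):
        for _ in range(batches_per_epoch):
            yield epoch, step
            step += 1


# Two short epochs resuming at step 100
for epoch, step in resumed_steps(51, 53, first_step=100, batches_per_epoch=2):
    print(epoch, step)
```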

In [None]:
# Training loop
print(f"Forking from epoch - {checkpoint['epoch']} and step - {checkpoint['step_count']}")

step_counter = checkpoint["step_count"]
for epoch in range(checkpoint["epoch"] + 1, params["epochs"]):
    for batch in range(0, 10000, params["batch_size"]):
        # Update metrics for this batch
        batch_loss, batch_accuracy = metrics.update_metrics(spikes=False)

        # Log batch metrics to Neptune
        forked_run.log_metrics(
            data={
                "metrics/train/loss": batch_loss,
                "metrics/train/accuracy": batch_accuracy,
                "epoch": epoch,
            },
            step=step_counter,
        )
        step_counter += 1

forked_run.close()

### Step 3: _Analyzing forked experiments_

When you fork a run in Neptune, the system automatically tracks key relationships in the `sys/forking` namespace:

| Property | Description | Example |
|----------|-------------|---------|
| `depth` | Number of forks from original run | 1 |
| `parent` | ID of immediate parent run | expansive-strategy-20250428124036156-fzp9y |
| `step` | Training step where fork occurred | 9704 |
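With the `parent` value of each run, you can reconstruct the full chain of forks back to the original run. The sketch below works on a hand-written dictionary; the run IDs are hypothetical, and how you obtain the parent values (for example, from the web app or a fetched table) is left open.

```python
from typing import Dict, List, Optional


def lineage(run_id: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of run IDs from the root run down to run_id."""
    chain: List[str] = []
    current: Optional[str] = run_id
    while current is not None:
        if current in chain:
            raise ValueError(f"Cycle detected at run {current!r}")
        chain.append(current)
        current = parents.get(current)
    return chain[::-1]


# Hypothetical map: run ID -> value of its sys/forking/parent attribute
parents = {"base": None, "fork-1": "base", "fork-2": "fork-1"}

chain = lineage("fork-2", parents)
print(chain)           # ['base', 'fork-1', 'fork-2']
print(len(chain) - 1)  # 2, the number of forks between base and fork-2
```

The length of the chain minus one should correspond to the `depth` property described in the table.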

#### Key Analysis Capabilities
1. **Compare Runs**: Use Neptune's comparison mode to analyze base and forked runs side-by-side
2. **Visualize Lineage**: Track relationships and evolution between forked runs
3. **Iterate & Improve**: Create additional forks to explore different approaches

For a practical example, check out this [training report](https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6) showcasing forking in action.

#### Next Steps
- **Experiment with Multiple Forks**: Create forks from different checkpoints to explore various training paths
- **Parameter Exploration**: Test different hyperparameters and architectures in separate forks
- **Comparative Analysis**: Leverage Neptune's UI and run tables for in-depth result comparison
- **Collaboration**: Document your findings in detailed reports to share insights with your team
- **Version Control**: Use forks as checkpoints to maintain a history of model improvements
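To fork from several checkpoints, you first need to know which ones exist on disk. A minimal sketch, assuming the `checkpoint_epoch_<N>.json` naming used by `save_checkpoint()` in this tutorial; `available_checkpoint_epochs` is a hypothetical helper.

```python
import re
from pathlib import Path


def available_checkpoint_epochs(directory: str = ".") -> list[int]:
    """Return the epochs that have a checkpoint_epoch_<N>.json file, sorted."""
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.json")
    epochs = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)


print(available_checkpoint_epochs())
```

Each returned epoch can then be passed to `load_checkpoint()` to restore the state for a new fork.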

