# Fork Training Runs with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/forking-long-runs/fork_long_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/forking-long-runs/fork_long_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

## Introduction
Training large models can take weeks or months. During these long runs, you may need to:
- Resume from checkpoints after training instabilities
- Fork runs to try different hyperparameters
- Monitor total training progress and lineage

This tutorial shows you how to:
1. **Initialize Neptune** for long runs
2. Create a **forked training run** when needed
3. **Restart training** from checkpoints

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
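
Before starting a long run, it can help to confirm that both variables are actually visible to the Python process. The following is a minimal sketch that uses only the standard library. `missing_neptune_env` is a hypothetical helper written for this tutorial, not part of `neptune-scale`:

```python
import os

# Names of the environment variables this tutorial relies on
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_env(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ, but any mapping works.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_neptune_env()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Neptune credentials found in the environment.")
```

If the check reports a missing variable, set it with `%env` as shown above and rerun the cell.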

### Install dependencies and import libraries

In [None]:
# Install dependencies
! pip install -qU neptune-scale torch torchvision "numpy<2.0"

### Set up training code

This tutorial demonstrates Neptune's forking capabilities using a simple MNIST neural network. The code includes:
- A configurable multi-layer model (`SimpleModel`)
- Training loop with metrics
- Checkpoint management (saving and loading methods)

The next cell sets up the model architecture, checkpoint utilities, and training components.

_You can replace this example with your actual model training code._

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class SimpleModel(nn.Module):
    def __init__(
        self,
        input_size: int = 784,
        hidden_size: int = 128,
        output_size: int = 10,
        num_layers: int = 10,
    ):
        super().__init__()

        layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]

        for _ in range(num_layers):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.ReLU())

        layers.append(nn.Linear(hidden_size, output_size))

        # Combine all layers into a sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)


# Function to save the checkpoint
import os


def save_checkpoint(epoch, global_step, model, optimizer, run):
    # Function saves the checkpoint locally
    checkpoint_dir = "./checkpoints"

    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_epoch_{epoch}.pth")
    torch.save(
        {
            "run_id": run._run_id,
            "epoch": epoch,
            "global_step": global_step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        checkpoint_path,
    )
    print(f"Checkpoint saved at epoch {epoch}")


# Function to load the checkpoint
def load_checkpoint(model, optimizer, checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    epoch = checkpoint["epoch"]
    global_step = checkpoint["global_step"]
    print(
        f"Checkpoint created by run ID {checkpoint['run_id']}, loaded from step {global_step}, epoch {epoch}"
    )
    return checkpoint


# Define a generic training loop with Neptune logging
from neptune_scale import Run


def train(
    run: Run,
    model: nn.Module,
    params,
    train_loader,
    optimizer,
    epoch_start=0,
    step_start=0,
    forked=False,
):
    criterion = nn.CrossEntropyLoss()

    # Training loop
    step_counter = step_start
    for epoch in range(epoch_start, epoch_start + params["epochs"]):
        model.train()
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            # Move data to device
            data, target = data.to(params["device"]), target.to(params["device"])
            data = data.view(data.size(0), -1)  # Flatten the images

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            batch_loss = loss.item()

            running_loss += batch_loss

            run.log_metrics(
                data={
                    "metrics/train/loss": batch_loss,
                    "epoch": epoch,
                },
                step=step_counter,
            )
            step_counter += 1

        # Save checkpoint after each epoch
        if not forked:
            save_checkpoint(epoch, step_counter, model, optimizer, run)

    run.close()


# Training parameters
params = {
    "batch_size": 256,
    "epochs": 15,
    "lr": 0.001,
    "num_layers": 20,  # Configurable number of layers
    "seed": 42,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

# Data transformations
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=False)

# Make experiments reproducible
torch.manual_seed(params["seed"])
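
The training code writes one file per epoch to `./checkpoints`, named `checkpoint_epoch_{epoch}.pth`. When choosing a state to restart from, it may be useful to find the most recent file without hard-coding the epoch. The following is a sketch under that naming assumption. `checkpoint_epoch` and `latest_checkpoint` are hypothetical helpers, not part of PyTorch or Neptune:

```python
import re
from pathlib import Path

# Matches the file names produced by save_checkpoint() in this tutorial
CHECKPOINT_PATTERN = re.compile(r"checkpoint_epoch_(\d+)\.pth")


def checkpoint_epoch(filename):
    """Extract the epoch number from a checkpoint file name, or None if it does not match."""
    match = CHECKPOINT_PATTERN.fullmatch(filename)
    return int(match.group(1)) if match else None


def latest_checkpoint(checkpoint_dir="./checkpoints"):
    """Return the path of the highest-epoch checkpoint, or None if none exists."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = [p for p in directory.iterdir() if checkpoint_epoch(p.name) is not None]
    # Compare epochs numerically so that epoch 10 sorts after epoch 9
    return max(candidates, key=lambda p: checkpoint_epoch(p.name), default=None)


print(latest_checkpoint())
```

The returned path can be passed to `load_checkpoint()` in place of a hard-coded file name.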

## Create forked runs

### Step 1: _Create initial base experiment_

Create a base experiment that represents a long training process. This could be your best performing model from hyperparameter optimization that you want to train further.

The training loop will:
1. Log training metrics (`loss`) at each step
2. Save model state (checkpoint) every epoch

By saving checkpoints (or model state) periodically, we can restart training from any saved state. Neptune's forking capability allows us to create new runs that inherit the training history, making it easy to visualize the complete training progression across multiple runs.

See our [Quickstart](https://docs.neptune.ai/quickstart) if you have not created a Neptune run before.

In [None]:
from neptune_scale import Run

run = Run(
    experiment_name="forking-example-initial",  # Create a run that is the head of an experiment
)

run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"],
        "config/num_layers": params["num_layers"],
        "config/seed": params["seed"],
    }
)

run.add_tags(tags=["forking"])

# Initialize model and optimizer
model = SimpleModel(num_layers=params["num_layers"]).to(params["device"])
optimizer = optim.Adam(model.parameters(), lr=params["lr"])

# Execute training loop for initial experiment
train(run, model, params, train_loader, optimizer)

### Step 2: _Load model state and create forked run_

If training becomes unstable, we may decide to restart from one of our saved model states. We pick the model state to restart from and tell Neptune to inherit the training history of the previous experiment up to that point.

To create a forked run, we need to do the following:
1. Load our saved model state
2. Create a forked run using the Neptune `Run` object
3. Log metrics to the forked run

#### _Step 2.1: Load model state (checkpoint)_

In [None]:
# Load checkpoint that you want to fork from
checkpoint = load_checkpoint(model, optimizer, f"./checkpoints/checkpoint_epoch_{3}.pth")

params = {
    "batch_size": 512,
    "epochs": 10,
    "lr": 0.0001,
    "num_layers": 20,  # Configurable number of layers
    "seed": 42,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

optimizer = optim.Adam(model.parameters(), lr=params["lr"])

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=params["batch_size"], shuffle=True
)

#### _Step 2.2: Create a forked run_

To create a forked run, you need:
- The **run ID** of the original run to fork from. It can be:
  - Read from the `run._run_id` attribute (this tutorial saves it to the checkpoint file)
  - Found in the Neptune web app
  - Fetched programmatically with the `neptune-fetcher` package
- The **step** number to fork from

Here's an example of how to create a forked run:
```python
forked_run = Run(
    ...
    fork_run_id="ID_OF_RUN_TO_FORK", # ID of the run to fork from
    fork_step=1234                   # Step at which to fork the run
)
```

See [fork an experiment](https://docs.neptune.ai/fork_experiment) for more details on forking.
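
The step to fork from depends on how the training loop counts steps. As a rough sketch, assuming the loop in this tutorial (one step per batch, a counter that starts at 0, and no dropped final batch), the `global_step` stored in the checkpoint for a given epoch can be estimated with simple arithmetic. `steps_per_epoch` and `checkpoint_step` are hypothetical helpers, and 60,000 is the size of the MNIST training split:

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def checkpoint_step(epoch, num_samples, batch_size, step_start=0):
    """Step counter value when the checkpoint for `epoch` (0-indexed) is saved.

    Assumes training started at epoch 0 with the counter at `step_start`.
    """
    return step_start + (epoch + 1) * steps_per_epoch(num_samples, batch_size)


print(steps_per_epoch(60_000, 256))     # 235
print(checkpoint_step(3, 60_000, 256))  # 940
```

In practice, reading `global_step` from the checkpoint file, as the next cell does, is more reliable than recomputing it, because it reflects what the run actually logged.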

In [None]:
# Create a forked run
print(f"Forking run with ID: {checkpoint['run_id']}")

forked_run = Run(
    experiment_name="forking-single",  # Becomes new head of the experiment
    fork_run_id=checkpoint["run_id"],
    fork_step=checkpoint["global_step"],  # Fork from the step where the checkpoint was saved
)

# Log the forked configuration
forked_run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"],
        "config/num_layers": params["num_layers"],
        "config/seed": params["seed"],
    }
)

forked_run.add_tags(tags=["fork"])

#### _Step 2.3: Log forked run metrics_

Log metrics as normal using the `log_metrics()` method. 

In [None]:
# Training loop
print(f"Forking from epoch {checkpoint['epoch']}, step {checkpoint['global_step']}")

train(
    forked_run,
    model,
    params,
    train_loader,
    optimizer,
    step_start=checkpoint["global_step"] + 1,
    epoch_start=checkpoint["epoch"],
    forked=True,
)

### Advanced: Launch multiple forks from a single parent

In this section, we create multiple forks from a single parent to test different parameters and compare how each one converges. Running the experiments in parallel is straightforward with the code we set up previously.

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from neptune_scale import Run


def run_single_experiment(params, train_loader):
    """Function to run a single experiment"""

    # Create a fresh model instance for this experiment
    model = SimpleModel(num_layers=20).to(params["device"])
    optimizer = optim.Adam(model.parameters(), lr=params["lr"])

    # Load the checkpoint into this fresh model
    checkpoint_data = torch.load(f"./checkpoints/checkpoint_epoch_{3}.pth")
    model.load_state_dict(checkpoint_data["model_state_dict"])
    # optimizer.load_state_dict(checkpoint_data['optimizer_state_dict'])

    with Run(
        experiment_name=f"forking-parallel-lr={params['lr']}-bs={params['batch_size']}",
        fork_run_id=checkpoint_data["run_id"],
        fork_step=checkpoint_data["global_step"],
    ) as run:

        run.add_tags(["parallel", "forked"])
        run.log_configs(
            {
                "config/batch_size": params["batch_size"],
                "config/epochs": params["epochs"],
                "config/lr": params["lr"],
            }
        )

        print("Training forked run")
        # Your training code here
        train(
            run,
            model,
            params,
            train_loader,
            optimizer,
            step_start=checkpoint_data["global_step"] + 1,
            epoch_start=checkpoint_data["epoch"],
            forked=True,
        )

        return run.get_run_url()


# Launch multiple runs in parallel
trial_configs = [
    {"lr": 0.001, "batch_size": 256, "epochs": 10, "device": "cpu"},
    {"lr": 0.0001, "batch_size": 256, "epochs": 10, "device": "cpu"},
    {"lr": 0.00001, "batch_size": 256, "epochs": 10, "device": "cpu"},
]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit all experiments
    futures = [
        executor.submit(run_single_experiment, config, train_loader) for config in trial_configs
    ]

    # Wait for all to complete
    for future in as_completed(futures):
        try:
            future.result()  # This will raise any exceptions
        except Exception as e:
            print(f"Experiment failed: {e}")

print("All experiments completed!")
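
The list of trial configurations does not have to be written by hand. As a sketch, a small grid can be generated with `itertools.product`. `build_trial_configs` and `experiment_name` are hypothetical helpers. The naming pattern mirrors the one used in `run_single_experiment`, and the default values are assumptions taken from the example above:

```python
from itertools import product


def build_trial_configs(learning_rates, batch_sizes, epochs=10, device="cpu"):
    """Build one config dict per (learning rate, batch size) combination."""
    return [
        {"lr": lr, "batch_size": bs, "epochs": epochs, "device": device}
        for lr, bs in product(learning_rates, batch_sizes)
    ]


def experiment_name(config):
    """Same naming pattern as the parallel example in this tutorial."""
    return f"forking-parallel-lr={config['lr']}-bs={config['batch_size']}"


configs = build_trial_configs([0.001, 0.0001], [256, 512])
for config in configs:
    print(experiment_name(config))
```

Each generated dictionary has the same keys as the entries in `trial_configs`, so the list can be passed to the executor unchanged. Keep the grid small, since every entry starts a full training run.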

### Step 3: _Analyzing forked experiments_

When you fork a run in Neptune, the system automatically tracks key relationships in the `sys/forking` namespace:

| Property | Description | Example |
|----------|-------------|---------|
| `depth` | Number of forks from original run | 1 |
| `parent` | ID of immediate parent run | expansive-strategy-20250428124036156-fzp9y |
| `step` | Training step where fork occurred | 9704 |
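
To illustrate how these three properties describe a lineage, the sketch below walks `parent` links in a plain dictionary. The run IDs and values are made up for illustration, and the code does not call the Neptune API. In a real project, the same information would come from the `sys/forking` fields of each run:

```python
def lineage(run_id, forking):
    """Follow parent links from `run_id` back to the root run.

    `forking` maps a run ID to its forking record; root runs have no entry.
    """
    chain = [run_id]
    seen = {run_id}
    while run_id in forking:
        run_id = forking[run_id]["parent"]
        if run_id in seen:
            raise ValueError(f"Cycle detected at run {run_id}")
        seen.add(run_id)
        chain.append(run_id)
    return chain


# Hypothetical records shaped like the table above
forking = {
    "fork-a": {"parent": "base-run", "step": 940, "depth": 1},
    "fork-b": {"parent": "fork-a", "step": 1500, "depth": 2},
}

print(lineage("fork-b", forking))  # ['fork-b', 'fork-a', 'base-run']
```

In this toy data, the number of parent links followed equals the `depth` value of the starting run.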

#### Key Analysis Capabilities
1. **Compare Runs**: Use Neptune's comparison mode to analyze base and forked runs side-by-side
2. **Visualize Lineage**: Track relationships and evolution between forked runs
3. **Iterate & Improve**: Create additional forks to explore different approaches

For a practical example, check out this [training report](https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6) showcasing forking in action.
