# Fork Long Training Runs with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/forking-long-runs/fork_long_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/lb/forking-long-runs/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

## Introduction
Training large models can take weeks or months. During these long runs, you may need to:
- Resume from checkpoints after training instabilities
- Fork runs to try different hyperparameters
- Monitor total training progress and lineage

This tutorial shows you how to:
1. **Initialize Neptune** for long runs
2. Create a **forked training run** when needed
3. **Restart training** from checkpoints

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
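
Before starting a long run, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch rather than part of the Neptune API: the helper name `missing_neptune_vars` is hypothetical, and it only checks that the two variables named above are set and non-empty.

```python
import os

# The two variables this tutorial asks you to set
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_neptune_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Neptune credentials found in the environment.")
```

The check does not validate the token itself; it only catches the common case where the cell above was left commented out.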

### Install dependencies and import libraries

In [None]:
# Install dependencies
! pip install -qU neptune_scale numpy torch torchvision

### Set up training code

This tutorial demonstrates Neptune's forking capabilities using a simple MNIST neural network. The code includes:
- A configurable multi-layer model (`SimpleModel`)
- Training loop with metrics
- Checkpoint management (saving and loading methods)

The next cell sets up the model architecture, checkpoint utilities, and training components.

_You can replace this example with your own model training code._

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class SimpleModel(nn.Module):
    def __init__(
        self,
        input_size: int = 784,
        hidden_size: int = 128,
        output_size: int = 10,
        num_layers: int = 10,
    ):
        super().__init__()

        layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]

        for _ in range(num_layers):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.ReLU())

        layers.append(nn.Linear(hidden_size, output_size))

        # Combine all layers into a sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)
    
# Function to save the checkpoint
import os
def save_checkpoint(epoch, global_step, model, optimizer, loss, run):
    # Function saves the checkpoint locally
    checkpoint_dir = './checkpoints'

    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f'checkpoint_epoch_{epoch}.pth')
    torch.save({
        'run_id': run._run_id,
        'epoch': epoch,
        'global_step': global_step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, checkpoint_path)
    print(f"Checkpoint saved at epoch {epoch}")

# Function to load the checkpoint
def load_checkpoint(model, optimizer, checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    global_step = checkpoint['global_step']
    loss = checkpoint['loss']
    print(f"Checkpoint loaded from step {global_step}, epoch {epoch}, loss: {loss:.4f}")
    return checkpoint

# Training parameters
params = {
    "batch_size": 256,
    "epochs": 10,
    "lr": 0.001,
    "num_layers": 20,  # Configurable number of layers
}

# Data transformations
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

# Load MNIST dataset
train_dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize model, loss function, and optimizer
model = SimpleModel(num_layers=params["num_layers"]).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=params["lr"])

In [7]:
# Define a generic training loop with Neptune logging
from neptune_scale import Run
def train(run: Run, model: nn.Module, params, train_loader, optimizer, epoch_start = 0, step_start = 0, forked = False):
    # Training loop
    step_counter = step_start
    for epoch in range(epoch_start, epoch_start + params["epochs"]):
        model.train()
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            # Move data to device
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)  # Flatten the images

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            batch_loss = loss.item()

            running_loss += batch_loss

            run.log_metrics(
                data={
                    "metrics/train/loss": batch_loss,
                    "epoch": epoch,
                },
                step=step_counter,
            )
            step_counter += 1

        # Print average loss for the epoch
        avg_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch} average loss: {avg_loss:.4f}")

        # Save checkpoint after each epoch
        if not forked:
            save_checkpoint(epoch, step_counter, model, optimizer, avg_loss, run)

    run.close()

## Forking run steps

### Step 1: _Create initial base experiment_

Create a base experiment that represents a long training process. This could be your best performing model from hyperparameter optimization that you want to train further.

The training loop will:
1. Log the training `loss` and the current `epoch` at each step
2. Save model state every epoch

By saving checkpoints (or model state) periodically, we can restart training from any saved state. Neptune's forking capability allows us to create new runs that inherit the training history, making it easy to visualize the complete training progression across multiple runs.
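
To relate a saved checkpoint to a step number, it helps to work out how many steps one epoch contains. The sketch below assumes the setup used in this tutorial: the MNIST training split of 60,000 images, a batch size of 256, a `DataLoader` that keeps the last partial batch, and a step counter that starts at 0 and is incremented once per batch. The helper names are hypothetical.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def global_step_after_epoch(epoch: int, num_samples: int, batch_size: int) -> int:
    """Value of the step counter once `epoch` (0-indexed) has finished."""
    return (epoch + 1) * steps_per_epoch(num_samples, batch_size)


# MNIST training split (60,000 images) with the batch size from `params`
print(steps_per_epoch(60_000, 256))             # 235
print(global_step_after_epoch(2, 60_000, 256))  # 705
print(global_step_after_epoch(9, 60_000, 256))  # 2350
```

Under these assumptions, the checkpoint written after epoch 2 stores a global step of 705, and a 10-epoch run logs steps 0 through 2349.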

See our [Quickstart](https://docs.neptune.ai/quickstart) if you have not created a Neptune run before.

In [14]:
from neptune_scale import Run

run = Run(
    experiment_name="forking-long",  # Create a run that is the head of an experiment
)

run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"],
    }
)

run.add_tags(tags=["forking", "base_run"])


# Execute training loop for initial experiment
train(run, model, params, train_loader, optimizer)

Step 0 logged
Step 1 logged
Step 2 logged
...

2025-07-03 12:11:17,218 neptune:INFO: Waiting for all operations to be processed


Step 2348 logged
Step 2349 logged
Epoch 9 average loss: 0.5954
Checkpoint saved at epoch 9


2025-07-03 12:11:18,295 neptune:INFO: Waiting for remaining 42 operation(s) to be processed
2025-07-03 12:11:22,601 neptune:INFO: All operations were processed
2025-07-03 12:11:22,601 neptune:INFO: Waiting for all files to be uploaded
2025-07-03 12:11:22,611 neptune:INFO: All files were uploaded


### Step 2: Load model state and create forked run

Suppose training instabilities start occurring partway through a long run, for example after 50 epochs. We can restart training from one of our saved model states: pick a checkpoint to resume from and tell Neptune to inherit the training history from the previous experiment.

To create a forked run, we need to do the following:
1. Load our saved model state
2. Create a forked run using the Neptune `Run` object
3. Log metrics to the forked run

#### _Step 2.1: Load model state (checkpoint)_

In [33]:
# Load checkpoint that you want to fork from
checkpoint = load_checkpoint(model, optimizer, f"./checkpoints/checkpoint_epoch_{2}.pth")
print(checkpoint["run_id"])

params = {
    "batch_size": 256,
    "epochs": 8,
    "lr": 0.00001,
}

optimizer = optim.Adam(model.parameters(), lr=params["lr"])

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=True)


Checkpoint loaded from step 705, epoch 2, loss: 0.5598
protected-accuracy-20250703100844084-lbs66


#### _Step 2.2: Create a forked run_

To create a forked run, you need:
- The **run ID** of the original run to fork from
  - Can be accessed via the `run._run_id` property (saved in the checkpoint file)
  - Can be found in the Neptune web app
  - Can be fetched programmatically using the `neptune-fetcher` package
- The **step** number to fork from

Here's an example of how to create a forked run:
```python
forked_run = Run(
    ...
    fork_run_id="ID_OF_RUN_TO_FORK", # ID of the run to fork from
    fork_step=1234                   # Step at which to fork the run
)
```

See [Fork an experiment](https://docs.neptune.ai/fork_experiment) for more details on forking.
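
Because the checkpoint saved in this tutorial already stores the run ID and the global step, the two fork arguments can be derived from it. The helper below is a hypothetical sketch, not part of `neptune_scale`: it assumes the checkpoint is a dictionary with the `run_id` and `global_step` keys written by `save_checkpoint()` and returns them under the `fork_run_id` and `fork_step` names shown above.

```python
def fork_kwargs_from_checkpoint(checkpoint: dict) -> dict:
    """Build the fork-related keyword arguments for `Run` from a checkpoint dict."""
    missing = [key for key in ("run_id", "global_step") if key not in checkpoint]
    if missing:
        raise KeyError(f"Checkpoint is missing required keys: {missing}")

    step = checkpoint["global_step"]
    if not isinstance(step, int) or step < 0:
        raise ValueError(f"global_step must be a non-negative integer, got {step!r}")

    return {"fork_run_id": checkpoint["run_id"], "fork_step": step}


# Placeholder values, mirroring the example above
example_checkpoint = {"run_id": "ID_OF_RUN_TO_FORK", "global_step": 1234, "epoch": 2}
print(fork_kwargs_from_checkpoint(example_checkpoint))
```

The result could then be unpacked into the constructor, for example `Run(experiment_name="...", **fork_kwargs_from_checkpoint(checkpoint))`, so that a malformed checkpoint fails early instead of after the run is created.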

In [31]:
# Create a forked run
print(f"Forking run with ID: {checkpoint['run_id']}")

forked_run = Run(
    experiment_name="forking-long-forked",  # Becomes new head of the experiment
    fork_run_id=checkpoint["run_id"],
    fork_step=checkpoint["global_step"], # Fork from the step where the checkpoint was saved
)

# Log the forked configuration
forked_run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"],
    }
)

forked_run.add_tags(tags=["fork"])

print(forked_run.get_run_url() + "&detailsTab=charts")

Forking run with ID: protected-accuracy-20250703100844084-lbs66
https://scale.neptune.ai/leo/pytorch-tutorial/-/run/?customId=complete-generator-20250703113739424-yscqs&detailsTab=charts


#### _Step 2.3: Log forked run metrics_

Log metrics as normal using the `log_metrics()` method. 

In [32]:
# Training loop
print(f"Forking from epoch - {checkpoint['epoch']} and, step - {checkpoint['global_step']}")

train(forked_run, model, params, train_loader, optimizer, step_start=checkpoint["global_step"]+1, epoch_start=checkpoint["epoch"], forked=True)

Forking from epoch - 2 and, step - 705
Step 706 logged
Step 707 logged
Step 708 logged
...

2025-07-03 13:40:05,629 neptune:INFO: Waiting for all operations to be processed
2025-07-03 13:40:08,273 neptune:INFO: All operations were processed
2025-07-03 13:40:08,273 neptune:INFO: Waiting for all files to be uploaded
2025-07-03 13:40:08,278 neptune:INFO: All files were uploaded


### Advanced: Launch multiple forks from a single parent

This section shows how to create multiple forks from a single parent run to test different hyperparameters and compare convergence. Reusing the code set up earlier, we can run the forked experiments in parallel.
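
When the number of forks grows, writing each configuration by hand becomes error-prone. As a rough sketch, the configurations and experiment names can be generated from lists of candidate values. The helper names `build_trial_configs` and `experiment_name` are hypothetical, and the name format simply follows the `lr=...-bs=...` pattern used in this tutorial.

```python
from itertools import product


def build_trial_configs(learning_rates, batch_sizes, epochs):
    """Create one config dict per (learning rate, batch size) combination."""
    return [
        {"lr": lr, "batch_size": bs, "epochs": epochs}
        for lr, bs in product(learning_rates, batch_sizes)
    ]


def experiment_name(config, prefix="forking-long-forked"):
    """Name a forked experiment after the hyperparameters it varies."""
    return f"{prefix}-lr={config['lr']}-bs={config['batch_size']}"


configs = build_trial_configs([0.001, 0.0001], [128, 256], epochs=5)
for config in configs:
    print(experiment_name(config), config)
```

Each generated dictionary has the same keys as the hand-written trial configurations, so it can be passed to the same per-experiment function.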

In [10]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from neptune_scale import Run

def run_single_experiment(params, train_loader):
    """Function to run a single experiment"""

    # Create a fresh model instance for this experiment
    model = SimpleModel(num_layers=20).to(device)
    optimizer = optim.Adam(model.parameters(), lr=params['lr'])
    
    # Load the checkpoint into this fresh model
    checkpoint_data = torch.load(f"./checkpoints/checkpoint_epoch_{2}.pth")
    model.load_state_dict(checkpoint_data['model_state_dict'])
    # optimizer.load_state_dict(checkpoint_data['optimizer_state_dict'])

    with Run(
        experiment_name=f"forking-long-forked-lr={params['lr']}-bs={params['batch_size']}",
        fork_run_id=checkpoint_data["run_id"],
        fork_step=checkpoint_data["global_step"],
    ) as run:
        
        run.add_tags(["parallel", "forked"])
        run.log_configs(
            {
                "config/batch_size": params["batch_size"],
                "config/epochs": params["epochs"],
                "config/lr": params["lr"],
            }
        )
        
        print("Training forked run")
        # Your training code here
        train(run, model, params, train_loader, optimizer, step_start=checkpoint_data["global_step"]+1, epoch_start=checkpoint_data["epoch"], forked=True)
        
        return run.get_run_url()

# Launch multiple runs in parallel
trial_configs = [
    {"lr": 0.001, "batch_size": 256, "epochs": 5},
    {"lr": 0.0001, "batch_size": 256, "epochs": 5},
    {"lr": 0.00001, "batch_size": 256, "epochs": 5}
]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit all experiments
    futures = [
        executor.submit(run_single_experiment, config, train_loader) 
        for config in trial_configs
    ]
    
    # Wait for all to complete
    for future in as_completed(futures):
        try:
            future.result()  # This will raise any exceptions
        except Exception as e:
            print(f"Experiment failed: {e}")

print("All experiments completed!")

Training forked run
Training forked run
Training forked run
Epoch 2 average loss: 0.5064
Epoch 2 average loss: 0.4153
Epoch 2 average loss: 0.3638
Epoch 3 average loss: 0.4216
Epoch 3 average loss: 0.3749
Epoch 3 average loss: 0.3352
Epoch 4 average loss: 0.5564
Epoch 4 average loss: 0.3593
Epoch 4 average loss: 0.3364
Epoch 5 average loss: 0.6132
Epoch 5 average loss: 0.3502
Epoch 5 average loss: 0.3055


2025-07-03 15:49:11,502 neptune:INFO: Waiting for all operations to be processed


Epoch 6 average loss: 0.4832


2025-07-03 15:49:13,503 neptune:INFO: Waiting for remaining 3 operation(s) to be processed
2025-07-03 15:49:17,749 neptune:INFO: All operations were processed
2025-07-03 15:49:17,752 neptune:INFO: Waiting for all files to be uploaded
2025-07-03 15:49:17,756 neptune:INFO: All files were uploaded
2025-07-03 15:50:06,914 neptune:INFO: Waiting for all operations to be processed
2025-07-03 15:50:06,943 neptune:INFO: Waiting for remaining 13 operation(s) to be processed


Epoch 6 average loss: 0.3422


2025-07-03 15:50:07,200 neptune:INFO: Waiting for all operations to be processed


Epoch 6 average loss: 0.2892


2025-07-03 15:50:10,165 neptune:INFO: All operations were processed
2025-07-03 15:50:10,166 neptune:INFO: Waiting for all files to be uploaded
2025-07-03 15:50:10,170 neptune:INFO: All files were uploaded
2025-07-03 15:50:10,449 neptune:INFO: All operations were processed
2025-07-03 15:50:10,449 neptune:INFO: Waiting for all files to be uploaded
2025-07-03 15:50:10,449 neptune:INFO: All files were uploaded


All experiments completed!


### Step 3: _Analyzing forked experiments_

When you fork a run in Neptune, the system automatically tracks key relationships in the `sys/forking` namespace:

| Property | Description | Example |
|----------|-------------|---------|
| `depth` | Number of forks from original run | 1 |
| `parent` | ID of immediate parent run | expansive-strategy-20250428124036156-fzp9y |
| `step` | Training step where fork occurred | 9704 |
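
To illustrate how the `parent` and `depth` properties relate, the sketch below walks a chain of forks back to the original run. It does not call Neptune: the forking metadata is assumed to have been fetched already and is represented as a plain dictionary, and the run IDs and step values are made up for the example.

```python
def lineage(run_id, forking_info):
    """Return the chain of run IDs from `run_id` back to the original run."""
    chain = [run_id]
    seen = {run_id}
    while run_id in forking_info and forking_info[run_id].get("parent"):
        run_id = forking_info[run_id]["parent"]
        if run_id in seen:
            raise ValueError(f"Cycle detected in forking metadata at {run_id!r}")
        seen.add(run_id)
        chain.append(run_id)
    return chain


# Hypothetical metadata: run ID -> parent run ID and the step of the fork
forking_info = {
    "fork-b": {"parent": "fork-a", "step": 1200},
    "fork-a": {"parent": "base-run", "step": 705},
}

chain = lineage("fork-b", forking_info)
print(" <- ".join(chain))         # fork-b <- fork-a <- base-run
print("depth:", len(chain) - 1)   # depth: 2
```

In this toy data, the depth computed from the chain matches the definition in the table: the number of forks between a run and the original run.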

#### Key Analysis Capabilities
1. **Compare Runs**: Use Neptune's comparison mode to analyze base and forked runs side-by-side
2. **Visualize Lineage**: Track relationships and evolution between forked runs
3. **Iterate & Improve**: Create additional forks to explore different approaches

For a practical example, check out this [training report](https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ec24024-08dd-45c4-8d82-51b916054fb6) showcasing forking in action.

#### Next Steps
- **Experiment with Multiple Forks**: Create forks from different checkpoints to explore various training paths
- **Parameter Exploration**: Test different hyperparameters and architectures in separate forks
- **Comparative Analysis**: Leverage Neptune's UI and run tables for in-depth result comparison
- **Collaboration**: Document your findings in detailed reports to share insights with your team
- **Version Control**: Use forks as checkpoints to maintain a history of model improvements

