# Debug Model Training Runs with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb/debugging_model_training/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/lb/debugging_model_training/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction
Training large models requires careful monitoring of layer-wise metrics to catch issues early. 

Neptune makes it easy to track and visualize metrics such as gradient norms across all layers of your model, helping you quickly identify problems like vanishing or exploding gradients.

In this tutorial, you'll learn how to:
1. **Initialize Neptune** and **log configuration parameters**
2. Track **layer-wise gradient norms** during training 
3. Analyze the metrics in Neptune's UI to **debug training issues**

Step through a pre-configured report:
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ed051fc-9b16-4c31-a2da-925230c9c360">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
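
Outside a notebook, where the `%env` magic is unavailable, the same two variables can be set in your shell or through `os.environ`. The sketch below is one possible pre-flight check that both variables are present before a run is created. The helper name `missing_neptune_vars` is hypothetical and not part of Neptune.

```python
import os

# The two variables this tutorial expects to find in the environment.
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(env=os.environ) -> list:
    """Return the names of required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a plain dict standing in for the real environment:
# only the project is set, so the token is reported as missing.
example_env = {"NEPTUNE_PROJECT": "WORKSPACE_NAME/PROJECT_NAME"}
print(missing_neptune_vars(example_env))
```

Calling `missing_neptune_vars()` without arguments checks the actual process environment.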

### Install dependencies and import libraries

This tutorial uses the MNIST dataset, which is downloaded and loaded with the `torchvision` package.

In [None]:
# Install dependencies
! pip install -qU neptune_scale numpy torch torchvision

### Set up the training simulation

This tutorial uses a simple PyTorch model trained on the MNIST dataset to demonstrate Neptune's debugging capabilities. The training code tracks real training metrics including:
- Loss values from the CrossEntropyLoss function
- Gradient norms for each layer to monitor optimization

In the next cell, we'll:
1. Import PyTorch libraries and utilities
2. Create a SimpleModel class with 20 layers
3. Add gradient norm tracking to identify potential issues during training

The setup is deliberately not tuned for model quality. Its purpose is to illustrate how Neptune helps you debug the training runs we create.

_You can replace this simulation with your actual model training code._

In [28]:
# Pure PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import time

class SimpleModel(nn.Module):
    def __init__(self, input_size: int = 784, hidden_size: int = 128, output_size: int = 10):
        super().__init__()
        
        layers = []
        layers.append(nn.Linear(input_size, hidden_size))
        layers.append(nn.ReLU())
        
        for _ in range(18):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.ReLU())

        layers.append(nn.Linear(hidden_size, output_size))
        
        # Combine all layers into a sequential model
        self.model = nn.Sequential(*layers)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)
    
    def get_gradient_norms(self) -> dict:
        """
        Calculate the L2 norm of gradients for each layer.
        
        Returns:
            dict: Dictionary containing gradient norms for each layer
        """
        gradient_norms = {}
        
        # Iterate through all named parameters
        for name, param in self.named_parameters():
            if param.grad is not None:
                # Calculate L2 norm of gradients
                norm = param.grad.norm(2).item()
                # Store in dictionary with a descriptive key
                gradient_norms[f"debug/L2_grad_norm/{name}"] = norm
                
        return gradient_norms
    

# Training parameters
params = {
    "batch_size": 512,
    "epochs": 10,
    "lr": 0.001,
    "num_layers": 20,  # Linear layers in SimpleModel: 1 input + 18 hidden + 1 output
}

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
# test_dataset = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=True)
# test_loader = DataLoader(test_dataset, batch_size=params["batch_size"])

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize model, loss function, and optimizer
model = SimpleModel().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=params["lr"])

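
For reference, `get_gradient_norms` calls `param.grad.norm(2)`, which treats each gradient tensor as one flat vector and returns the square root of the sum of its squared entries. The dependency-free sketch below reproduces that calculation on a plain list. The function name `l2_norm` is made up for illustration.

```python
import math


def l2_norm(values) -> float:
    """L2 (Euclidean) norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for v in values))


# A toy "gradient" with two entries: sqrt(3^2 + (-4)^2) = sqrt(25).
toy_gradient = [3.0, -4.0]
print(l2_norm(toy_gradient))  # 5.0
```

A norm that stays close to zero for a layer suggests its weights are barely updated, while a norm that grows by orders of magnitude suggests unstable updates.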

## Debug model training run with Neptune

In this section, we'll walk through 4 key steps to effectively debug your training runs:

1. **Initialize Neptune** - set up the Neptune environment to track your training metrics
2. **Log configuration parameters** - record your model's hyperparameters and setup
3. **Track layer-wise metrics** - monitor gradient norms across all model layers
4. **Analyze training behavior** - use Neptune's visualization tools to identify and diagnose training issues

### Step 1: _Initialize Neptune Run object_

The `Run` object is used to log configuration parameters and metrics. 

In [29]:
from neptune_scale import Run

run = Run(
    experiment_name="debugging-gradient-norms",  # Create a run that is the head of an experiment.
)

print(run.get_experiment_url())

https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=debugging-gradient-norms&type=experiment


### Step 2: _Log configuration parameters_

In [30]:
run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"]
    }
)

run.add_tags(tags=["debug", "gradient-norm"])

print(f"See configuration parameters:\n{run.get_experiment_url() + '&detailsTab=metadata'}")

See configuration parameters:
https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=debugging-gradient-norms&type=experiment&detailsTab=metadata
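
The configuration keys above are written out by hand. As an alternative sketch, a small helper could build the `config/...` keys from the `params` dictionary, so that new hyperparameters such as `num_layers` are picked up automatically. The helper name `to_config_keys` is hypothetical, and the call to `run.log_configs` is left commented out so the snippet runs without a Neptune connection.

```python
def to_config_keys(params: dict, prefix: str = "config") -> dict:
    """Flatten a (possibly nested) dict into 'prefix/key' style entries."""
    flat = {}
    for key, value in params.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the path.
            flat.update(to_config_keys(value, prefix=path))
        else:
            flat[path] = value
    return flat


# Same values as the `params` dictionary defined earlier in this tutorial.
example_params = {"batch_size": 512, "epochs": 10, "lr": 0.001, "num_layers": 20}
configs = to_config_keys(example_params)
print(configs)

# With the `run` object from Step 1:
# run.log_configs(configs)
```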


### Step 3: _Track gradient norms during training_

In this training loop, we:
1. Compute the `loss` and run the backward pass to populate the gradients.
2. Collect the L2 gradient norm of each entry in `named_parameters()` into a dictionary called `gradient_norms`, which helps reveal issues such as vanishing or exploding gradients.
3. Log the loss and the gradient norms to Neptune with the `log_metrics` method for visualization and analysis.

This approach allows you to monitor the learning dynamics across your entire model architecture in near real-time.

_You can also use hooks to capture layer-wise training dynamics and log to Neptune_. 

In [32]:
step_counter = 0
for epoch in range(1, params["epochs"] + 1):  # Run all configured epochs, numbered from 1
    model.train()
    running_loss = 0.0
    
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move data to device
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)  # Flatten the images
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        gradient_norms = model.get_gradient_norms()
        batch_loss = loss.item()
        
        run.log_metrics(
            data={
                "metrics/train/loss": batch_loss,
                "epoch": epoch,
                **gradient_norms,
            },
            step=step_counter,
        )
        step_counter += 1

run.close()

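
As mentioned before the training loop, hooks are an alternative to calling `get_gradient_norms()` after each backward pass. The sketch below uses PyTorch's `Tensor.register_hook` on every trainable parameter, so that each gradient norm is recorded while `backward()` runs. The helper `attach_grad_norm_hooks`, the `tiny_model` network, and the random input are illustrative assumptions, not part of the tutorial's model or of Neptune.

```python
import torch
import torch.nn as nn


def attach_grad_norm_hooks(model: nn.Module, store: dict,
                           prefix: str = "debug/L2_grad_norm") -> list:
    """Register a tensor hook on every trainable parameter.

    Each hook runs during backward() and writes the L2 norm of that
    parameter's gradient into `store`. The returned handles can be used
    to remove the hooks later.
    """
    handles = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        # Bind the key as a default argument so each hook keeps its own name.
        def hook(grad: torch.Tensor, key: str = f"{prefix}/{name}") -> None:
            store[key] = grad.norm(2).item()

        handles.append(param.register_hook(hook))
    return handles


torch.manual_seed(0)
tiny_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
hook_norms = {}
handles = attach_grad_norm_hooks(tiny_model, hook_norms)

loss = tiny_model(torch.randn(16, 4)).sum()
loss.backward()  # The hooks fire here and fill hook_norms

print(sorted(hook_norms))

# Remove the hooks once they are no longer needed.
for handle in handles:
    handle.remove()
```

In a training loop, the filled dictionary could be passed to `run.log_metrics` in the same way as `gradient_norms`. Because `tiny_model` is itself an `nn.Sequential`, its parameter names start with `0.` and `2.` instead of the `model.0.` prefix produced by `SimpleModel`.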

### Step 4: _Analyze training behavior_
While your model trains, use Neptune's web interface to monitor and analyze metrics in near real-time:

**1. Real-time metric visualization**
- Navigate to the _Charts_ tab to view live training metrics
- Monitor multiple metrics simultaneously
- Track training progress in real-time

**2. Advanced metric filtering**
- Focus on specific metrics using [advanced regex search](https://docs.neptune.ai/charts#filtering-charts)
- Example: `norm & layer_\d` matches the gradient norms of all layers, provided your metric names contain `layer_<n>`
- Perfect for isolating problematic layers or components
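
Before typing a filter into the UI, it can help to check locally which metric names a pattern would select. The sketch below approximates an AND-combined regex query with Python's `re` module. It is not Neptune's implementation, and `matches_all` is a hypothetical helper. The metric names follow the `debug/L2_grad_norm/<parameter name>` format produced by `get_gradient_norms()` in this tutorial, shown for a small subset of layers.

```python
import re

# Names in the format this tutorial logs (illustrative subset of layers).
metric_names = [
    "metrics/train/loss",
    "debug/L2_grad_norm/model.0.weight",
    "debug/L2_grad_norm/model.0.bias",
    "debug/L2_grad_norm/model.2.weight",
    "debug/L2_grad_norm/model.38.weight",
]


def matches_all(name: str, query: str) -> bool:
    """Return True if `name` matches every sub-pattern joined by '&'."""
    return all(re.search(part.strip(), name) for part in query.split("&"))


# The example query expects names that contain "layer_<digit>",
# which none of the names logged in this tutorial do.
layer_style = [n for n in metric_names if matches_all(n, r"norm & layer_\d")]
print(layer_style)

# A pattern adapted to the names logged here, restricted to weights.
weights_only = [n for n in metric_names
                if matches_all(n, r"grad_norm & model\.\d+\.weight")]
print(weights_only)
```

The first query selects nothing for these names, which is a reminder to adapt the pattern to the naming scheme of your own metrics.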

**3. Create custom dashboards**
- Save filtered metrics to a [custom dashboard](https://docs.neptune.ai/custom_dashboard) for continuous monitoring
- Automatically updates during training
- Share with team members
- Ideal for tracking known problematic layers

**4. Dynamic metric analysis**
- Group related metrics (e.g., all layer gradients)
- Create automatically updating charts by [selecting metrics dynamically](https://docs.neptune.ai/chart_widget/#dynamic-metric-selection)
- Quickly identify vanishing/exploding gradients
- Example query: `layer_\d`

_Example dashboard:_
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?viewId=standard-view&detailsTab=dashboard&dashboardId=9e9e6760-9e80-433a-867f-1a7d1ff40d30&runIdentificationKey=PYTOR-444&type=run">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>
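
The charts make vanishing or exploding gradients visible at a glance, and a similar check can be applied in code to the `gradient_norms` dictionary before it is logged. The sketch below is one possible approach. The function name `flag_gradient_issues`, both thresholds, and the sample values are assumptions chosen for illustration and would need tuning for a real model.

```python
import math


def flag_gradient_issues(gradient_norms: dict,
                         vanish_below: float = 1e-6,
                         explode_above: float = 1e3) -> dict:
    """Label each gradient norm that looks suspicious.

    Returns a dict mapping metric name to 'non-finite', 'vanishing',
    or 'exploding'. Norms inside the thresholds are omitted.
    """
    issues = {}
    for name, norm in gradient_norms.items():
        if not math.isfinite(norm):
            issues[name] = "non-finite"
        elif norm < vanish_below:
            issues[name] = "vanishing"
        elif norm > explode_above:
            issues[name] = "exploding"
    return issues


# Made-up values in the key format used by this tutorial.
sample_norms = {
    "debug/L2_grad_norm/model.0.weight": 3.2e-9,
    "debug/L2_grad_norm/model.18.weight": 0.41,
    "debug/L2_grad_norm/model.38.weight": 5.7e4,
}
issues = flag_gradient_issues(sample_norms)
print(issues)
```

The result could be printed during training or used to decide when to stop a run early, while the full set of norms is still logged for inspection in the charts.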

**5. Document training insights**
- Create [custom reports](https://docs.neptune.ai/reports) to track training progress
- Document anomalies and successful configurations
- Maintain training history
- Share insights with team members

_Example report:_
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ed051fc-9b16-4c31-a2da-925230c9c360">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

---

**Additional resources:**
- [PyTorch Layer-wise Tracking Package](TODO:Link to integration for tracking layer-wise metrics)
