# Log and visualize debugging metrics in Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/main/how-to-guides/debug-model-training-runs/notebooks/debug_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/main/how-to-guides/debug-model-training-runs/notebooks/debug_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction
Training large models requires careful monitoring of layer-wise metrics to catch issues early. 

Neptune makes it easy to track and visualize metrics like gradient norms across all layers of your model – and helps you quickly identify problems such as vanishing or exploding gradients.

In this tutorial, you'll learn how to:
1. **Initialize Neptune** and **log configuration parameters**.
2. Track **layer-wise gradient norms** during training.
3. Analyze the metrics in Neptune's UI to **debug training issues**.

See a pre-configured report in the Neptune app:
<a target="_blank" href="https://scale.neptune.ai/examples/debug-training-metrics/reports/Analyze-debugging-metrics-9f0f017e-c95a-4347-8bd4-2c50120f8315">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
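
Outside a notebook, where the `%env` magic is not available, you can set the same variables in your shell and check them from Python before creating a run. The following is a minimal sketch of such a check. The helper name `missing_neptune_env` is hypothetical and not part of the Neptune client.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_env(environ=None) -> list:
    """Return the names of required Neptune variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_neptune_env()
if missing:
    print(f"Set these environment variables before logging: {', '.join(missing)}")
```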

### Install dependencies and import libraries

This tutorial uses PyTorch and the MNIST dataset for model training. 

In [None]:
# Install dependencies
! pip install -qU neptune_scale torch torchvision "numpy<2"

### Set up training simulation

This tutorial uses a simple PyTorch model trained on the MNIST dataset to demonstrate Neptune's debugging capabilities. The training code tracks real training metrics including:
- Loss values from the `CrossEntropyLoss` function
- Gradient norms for each layer to monitor optimization

In the next cell, we will:
1. Import PyTorch libraries and utilities.
2. Create a `SimpleModel` class with a configurable number of hidden layers (20 in this example).
3. Add gradient norm tracking to identify potential issues during training.

Note that this setup is not tuned for model quality. Its purpose is to illustrate how to use Neptune to debug the training runs that we create.

_You can replace this simulation with your actual model training code._

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class SimpleModel(nn.Module):
    def __init__(
        self,
        input_size: int = 784,
        hidden_size: int = 128,
        output_size: int = 10,
        num_layers: int = 10,
    ):
        super().__init__()

        layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]

        for _ in range(num_layers):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.ReLU())

        layers.append(nn.Linear(hidden_size, output_size))

        # Combine all layers into a sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

    def get_gradient_norms(self) -> dict:
        """
        Calculate the L2 norm of gradients for each layer.

        Returns:
            dict: Dictionary containing gradient norms for each layer
        """
        gradient_norms = {}

        # Iterate through all named parameters
        for name, param in self.named_parameters():
            if param.grad is not None:
                # Calculate L2 norm of gradients
                norm = param.grad.norm(2).item()
                # Store in dictionary with a descriptive key
                gradient_norms[f"debug/L2_grad_norm/{name}"] = norm

        return gradient_norms


# Training parameters
params = {
    "batch_size": 512,
    "epochs": 10,
    "lr": 0.001,
    "num_layers": 20,  # Configurable number of layers
}

# Data transformations
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

# Load MNIST dataset
train_dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=params["batch_size"], shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize model, loss function, and optimizer
model = SimpleModel(num_layers=params["num_layers"]).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=params["lr"])
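
`get_gradient_norms()` returns one L2 norm per parameter tensor. If you also want a single number for the whole model, you can combine the per-parameter norms: the L2 norm of the full gradient vector is the square root of the sum of the squared per-tensor norms. The sketch below uses plain Python and made-up values. `global_grad_norm` is a hypothetical helper, not part of the original recipe.

```python
import math


def global_grad_norm(gradient_norms: dict) -> float:
    """Combine per-parameter L2 norms into the L2 norm of the full gradient vector."""
    return math.sqrt(sum(norm**2 for norm in gradient_norms.values()))


# Made-up per-parameter norms, shaped like the output of get_gradient_norms()
example_norms = {
    "debug/L2_grad_norm/model.0.weight": 3.0,
    "debug/L2_grad_norm/model.0.bias": 4.0,
}
print(global_grad_norm(example_norms))  # 5.0
```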

## Debug model training runs with Neptune

In this section, we'll walk through four key steps to effectively debug your training runs:

1. **Initialize Neptune** – set up the Neptune environment to track your training metrics.
2. **Log configuration parameters** – record your model's hyperparameters and setup.
3. **Track layer-wise metrics** – monitor gradient norms across all model layers.
4. **Analyze training behavior** – use Neptune's visualization tools to identify and diagnose training issues.

### Step 1: Initialize Neptune Run object

Use the `Run` object to log configuration parameters and metrics. 

In [None]:
from neptune_scale import Run

run = Run(
    experiment_name="debugging-gradient-norms",  # Create a run that is the head of an experiment
)

print(run.get_experiment_url())

### Step 2: Log configuration parameters

In [None]:
run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/lr": params["lr"],
        "config/num_layers": params["num_layers"],
    }
)

run.add_tags(tags=["debug", "gradient-norm"])

print(f"See configuration parameters:\n{run.get_experiment_url() + '&detailsTab=metadata'}")
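
The `log_configs` call above lists each key by hand. As an alternative, a small helper can namespace the whole `params` dictionary in one step, so new hyperparameters are logged without editing the call. `with_prefix` is a hypothetical helper for illustration, not part of the Neptune client.

```python
def with_prefix(values: dict, prefix: str = "config") -> dict:
    """Return a copy of `values` with every key namespaced under `prefix/`."""
    return {f"{prefix}/{key}": value for key, value in values.items()}


# Same values as the `params` dictionary defined earlier in this tutorial
params = {"batch_size": 512, "epochs": 10, "lr": 0.001, "num_layers": 20}
configs = with_prefix(params)
print(configs)
```

You could then pass the result to `run.log_configs(configs)`.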

### Step 3: Track gradient norms during training

In this training loop, we will:
1. Calculate the loss and run the backward pass to populate the gradients.
2. Collect the L2 gradient norm of each parameter returned by `named_parameters()` in a dictionary called `gradient_norms`, so that issues like vanishing or exploding gradients become visible.
3. Use the `log_metrics` method to log the loss and the gradient norms to Neptune for visualization and analysis.

This approach allows you to monitor the learning dynamics across your entire model architecture in near real-time.

_You can also use hooks to capture layer-wise training dynamics and log them to Neptune._
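
As one possible take on the hook-based approach, the sketch below registers a tensor hook on every trainable parameter with PyTorch's `Tensor.register_hook`. Each hook stores the gradient's L2 norm in a dictionary during `backward()`. The helper `attach_grad_norm_hooks` and the toy model are assumptions made for illustration and are not part of the original recipe.

```python
import torch
import torch.nn as nn


def attach_grad_norm_hooks(model: nn.Module, store: dict) -> list:
    """Register a hook per trainable parameter that records its gradient L2 norm."""
    handles = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        # Bind the key as a default argument so each hook keeps its own name
        def hook(grad: torch.Tensor, key: str = f"debug/L2_grad_norm/{name}") -> None:
            store[key] = grad.norm(2).item()

        handles.append(param.register_hook(hook))
    return handles


# Tiny stand-in model, used only to keep the example fast
toy_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
hook_norms = {}
handles = attach_grad_norm_hooks(toy_model, hook_norms)

# The hooks fire during the backward pass and fill `hook_norms`
toy_model(torch.randn(8, 4)).sum().backward()
print(sorted(hook_norms))

# Remove the hooks when they are no longer needed
for handle in handles:
    handle.remove()
```

In a training loop, the filled dictionary could be unpacked into `run.log_metrics` in the same way as `gradient_norms`.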

In [None]:
step_counter = 0
for epoch in range(params["epochs"]):
    model.train()

    for batch_idx, (data, target) in enumerate(train_loader):
        # Move data to device
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)  # Flatten the images

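        optimizer.zero_grad()  # Clear gradients from the previous step
        output = model(data)
        loss = criterion(output, target)
        loss.backward()  # Populate param.grad for every layer
        optimizer.step()

        # Gradients stay in param.grad until the next zero_grad() call
        gradient_norms = model.get_gradient_norms()
        batch_loss = loss.item()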

        run.log_metrics(
            data={
                "metrics/train/loss": batch_loss,
                "epoch": epoch,
                **gradient_norms,
            },
            step=step_counter,
        )
        step_counter += 1

run.close()
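
Besides inspecting charts, you can screen the `gradient_norms` dictionary programmatically. The sketch below labels norms that fall outside a chosen range. The function name and the `low` and `high` thresholds are arbitrary assumptions for illustration and should be tuned for your model.

```python
import math


def flag_gradient_issues(gradient_norms: dict, low: float = 1e-6, high: float = 1e2) -> dict:
    """Label norms outside [low, high]; norms inside the range are omitted."""
    flags = {}
    for name, norm in gradient_norms.items():
        if not math.isfinite(norm):
            flags[name] = "non-finite"
        elif norm > high:
            flags[name] = "exploding"
        elif norm < low:
            flags[name] = "vanishing"
    return flags


# Made-up values, shaped like the output of get_gradient_norms()
sample_norms = {
    "debug/L2_grad_norm/model.0.weight": 2.5e-9,
    "debug/L2_grad_norm/model.20.weight": 0.31,
    "debug/L2_grad_norm/model.42.weight": 4.7e3,
}
print(flag_gradient_issues(sample_norms))
```

Inside the training loop, such a check could run on `gradient_norms` before the call to `run.log_metrics`.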

### Step 4: Analyze training behavior
While your model trains, use Neptune's web interface to monitor and analyze metrics in near real-time:

**1. Real-time metric visualization**
- To view live training, navigate to the _Charts_ tab 
- Monitor multiple metrics simultaneously
- Track training progress in real-time

**2. Advanced metric filtering**
- Focus on specific metrics using [advanced regex search](https://docs.neptune.ai/charts#filtering-charts)
- Example: `norm | \d.weight` shows the gradient norm metrics for all layers
- Perfect for isolating problematic layers or components

**3. Create custom dashboards**
- Save filtered metrics to a [custom dashboard](https://docs.neptune.ai/custom_dashboard) for continuous monitoring
- Automatically updates during training
- Share dashboards with team members
- Ideal for tracking known problematic layers

**4. Dynamic metric analysis**
- Group related metrics, for example all layer gradients
- Create automatically updating charts by [selecting metrics dynamically](https://docs.neptune.ai/chart_widget/#dynamic-metric-selection)
- Quickly identify vanishing or exploding gradients
- Example query: `\d.weight$`

_Example dashboard:_
<a target="_blank" href="https://scale.neptune.ai/o/examples/org/debug-training-metrics/runs/compare?viewId=standard-view&dash=dashboard&dashboardId=9f0f002f-1852-448b-813d-b230de12b0b5&compare=uXbDuUl1H5I-cMeJj_n1iXy_roGF0tTL3IHVit0zU8Gs">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>
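
To see which metrics a query such as `\d.weight$` selects, you can try it locally against the metric names that this tutorial logs. The sketch below uses Python's `re` module and assumes that Neptune's search treats the pattern in a comparable way, which may not hold for every pattern. The names are rebuilt by hand from the `SimpleModel` structure with 20 hidden layers.

```python
import re

# Linear layers sit at even indices 0, 2, ..., 42 of the nn.Sequential stored in
# `self.model`, so named_parameters() yields names like "model.0.weight".
metric_names = [
    f"debug/L2_grad_norm/model.{index}.{kind}"
    for index in range(0, 43, 2)
    for kind in ("weight", "bias")
]

pattern = re.compile(r"\d.weight$")
weight_metrics = [name for name in metric_names if pattern.search(name)]

print(len(metric_names), len(weight_metrics))  # 44 22
print(weight_metrics[0])  # debug/L2_grad_norm/model.0.weight
```

The query keeps the weight gradient norm of each of the 22 linear layers and leaves out the bias metrics.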

**5. Document training insights**
- Create [custom reports](https://docs.neptune.ai/reports) to track training progress
- Document anomalies and successful configurations
- Maintain training history
- Share insights with team members

_Example report:_
<a target="_blank" href="https://scale.neptune.ai/examples/debug-training-metrics/reports/Analyze-debugging-metrics-9f0f017e-c95a-4347-8bd4-2c50120f8315">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>
