# Debug Model Training Runs with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb/debugging_model_training/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>
<a target="_blank" href="https://github.com/neptune-ai/scale-examples/blob/lb/debugging_model_training/how-to-guides/debug-model-training-runs/debug_training_runs.ipynb">
  <img alt="Open in GitHub" src="https://img.shields.io/badge/Open_in_GitHub-blue?logo=github&labelColor=black">
</a>
<a target="_blank" href="https://docs-beta.neptune.ai/tutorials/">
  <img alt="View tutorial in docs" src="https://neptune.ai/wp-content/uploads/2024/01/docs-badge-2.svg">
</a>

## Introduction
Training large models requires careful monitoring of layer-wise metrics to catch issues early. 

Neptune lets you track and visualize metrics such as gradient norms across all layers of your model, which helps you spot problems like vanishing or exploding gradients quickly.

In this tutorial, you'll learn how to:
1. **Initialize Neptune** and **log configuration parameters**
2. Track **layer-wise gradient norms** during training 
3. Analyze the metrics in Neptune's UI to **debug training issues**

Step through a pre-configured report:
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9ed051fc-9b16-4c31-a2da-925230c9c360">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

_Note: This is a code recipe that you can adapt for your own model training needs._

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
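
Outside a notebook, where the `%env` magic is not available, the same two variables can be set from Python before the `Run` object is created. This is a minimal sketch with placeholder values; `set_neptune_credentials` is a hypothetical helper name and not part of Neptune's API.

```python
import os


def set_neptune_credentials(api_token: str, project: str) -> None:
    """Export Neptune credentials as environment variables (hypothetical helper)."""
    os.environ["NEPTUNE_API_TOKEN"] = api_token
    os.environ["NEPTUNE_PROJECT"] = project


# Placeholder values: replace with your own credentials.
set_neptune_credentials("YOUR_API_TOKEN", "WORKSPACE_NAME/PROJECT_NAME")
```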

### Install dependencies and import libraries

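The training metrics in this tutorial are simulated, so no dataset is required. The only package installed below is `neptune_scale`; `numpy`, which the simulation also uses, is assumed to be available in your environment.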

In [None]:
# Install dependencies
! pip install -qU neptune_scale

### Set up training simulation

This tutorial uses a simulated training scenario to demonstrate Neptune's debugging capabilities. The simulation generates synthetic training metrics that mimic real model training, including:
- Gradually decreasing loss
- Increasing accuracy
- Random noise and occasional spikes to mimic gradient norm explosions
- Per-layer gradient norms that fluctuate naturally and spike during loss anomalies

In the next cell, we'll:
1. Import required libraries (numpy, json, etc.) 
2. Create a `TrainingMetrics` class that simulates training progression
3. Track gradient norms for 20 layers to help identify optimization issues

_You can replace this simulation with your actual model training code._

In [None]:
import numpy as np
import json
from typing import Dict, Any
import math


class TrainingMetrics:
    def __init__(
        self,
        initial_loss: float = 1.0,
        initial_accuracy: float = 0.0,
        noise_scale: float = 0.1,
        loss_trend: float = -0.1,
        accuracy_trend: float = 0.1,
    ):
        self.previous_loss = initial_loss
        self.previous_accuracy = initial_accuracy
        self.noise_scale = noise_scale
        self.loss_trend = loss_trend
        self.accuracy_trend = accuracy_trend
        self.step_count = 1

        # Random convergence points
        self.target_loss = np.random.uniform(0.0, 1.0)
        self.target_accuracy = np.random.uniform(0.0, 1.0)

    def update_metrics(self, spikes=True) -> tuple[float, float, dict]:
        """Update loss and accuracy using a random walk process with logarithmic trends"""
        self.step_count += 1

        # Base loss update with normal progression
        decay_factor = math.log(1 + abs(self.previous_loss - self.target_loss))
        loss_step = self.noise_scale * np.random.randn() + self.loss_trend * decay_factor
        loss_step *= 1 + 0.1 * math.log(self.step_count)

        # Initialize gradient norms for 20 layers with small random values
        gradient_norms = {
            f"debug/grad_norm/layer_{i}": np.random.uniform(0.01, 0.1) for i in range(20)
        }

        # Check for spike/anomaly (0.01% chance on every step)
        if spikes and np.random.random() < 0.0001:
            # Generate a sudden spike, independent of current loss
            spike_magnitude = np.random.uniform(1, 10)  # Random spike magnitude between 1 and 10
            current_loss = spike_magnitude

            # When there's a loss spike, create a corresponding spike in a random layer
            spike_layer = np.random.randint(0, 20)
            gradient_norms[f"debug/grad_norm/layer_{spike_layer}"] = (
                spike_magnitude * np.random.uniform(0.8, 1.2)
            )
        else:
            # Normal progression
            current_loss = max(0.0, self.previous_loss + loss_step)
            self.previous_loss = current_loss

        current_accuracy = 1 - current_loss

        return current_loss, current_accuracy, gradient_norms


# Training parameters
params = {
    "batch_size": 32,
    "epochs": 100,
    "noise_scale": np.random.uniform(0.0005, 0.002),
    "loss_trend": -np.random.uniform(0, 0.0002),
    "accuracy_trend": np.random.uniform(0, 0.0002),
}

# Initialize metrics
metrics = TrainingMetrics(
    initial_loss=np.random.uniform(3, 7),
    initial_accuracy=0,
    noise_scale=params["noise_scale"],
    loss_trend=params["loss_trend"],
    accuracy_trend=params["accuracy_trend"],
)
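
When you swap the simulation for real training code, the `gradient_norms` dictionary has to be built from the model's actual gradients. The sketch below shows the shape of that computation, using plain Python lists in place of framework tensors; `layer_grad_norms` and the toy layer names are hypothetical. In PyTorch, for instance, the per-layer values would come from each parameter's `.grad` attribute instead.

```python
import math
from typing import Dict, Sequence


def layer_grad_norms(grads: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Return the L2 norm of each layer's gradient, keyed like the simulated metrics."""
    return {
        f"debug/grad_norm/{name}": math.sqrt(sum(g * g for g in values))
        for name, values in grads.items()
    }


# Toy gradients for two layers (illustrative values only)
example_grads = {"layer_0": [3.0, 4.0], "layer_1": [0.0, 0.0]}
norms = layer_grad_norms(example_grads)
```

Keeping the same `debug/grad_norm/...` key layout means the charts and filters described later in this tutorial apply unchanged.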

## Debug model training run with Neptune

In this section, we'll walk through 4 key steps to effectively debug your training runs:

1. **Initialize Neptune** - set up the Neptune environment to track your training metrics
2. **Log configuration parameters** - record your model's hyperparameters and setup
3. **Track layer-wise metrics** - monitor gradient norms across all model layers
4. **Analyze training behavior** - use Neptune's visualization tools to identify and diagnose training issues

### Step 1: _Initialize Neptune Run object_

The `Run` object is used to log configuration parameters and metrics. 

In [None]:
from neptune_scale import Run

run = Run(
    experiment_name="debugging-gradient-norms",  # Create a run that is the head of an experiment.
)

print(run.get_experiment_url())

### Step 2: _Log configuration parameters_

In [None]:
run.log_configs(
    {
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
    }
)

run.add_tags(tags=["debug", "gradient-norm"])

print(f"See configuration parameters:\n{run.get_experiment_url() + '&detailsTab=metadata'}")
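
Step 2 passes a flat dictionary whose keys use `/` to group fields under `config`. If your hyperparameters live in a nested dictionary, they can be flattened into that shape first. This is a minimal sketch; `flatten_config` is a hypothetical helper and not part of `neptune_scale`, and the optimizer settings are made-up example values.

```python
from typing import Any, Dict


def flatten_config(nested: Dict[str, Any], prefix: str = "config") -> Dict[str, Any]:
    """Flatten a nested dict into 'prefix/key/subkey' paths."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the path
            flat.update(flatten_config(value, prefix=path))
        else:
            flat[path] = value
    return flat


flat_params = flatten_config({"batch_size": 32, "optimizer": {"name": "adam", "lr": 1e-3}})
```

The resulting dictionary has the same flat shape as the one passed to `run.log_configs` above.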

### Step 3: _Track gradient norms during training_

In this training loop, we:
1. Simulate the calculation of `loss`, `accuracy` and `gradient norms`
2. Collect the per-layer gradient norms in a dictionary called `gradient_norms`, which helps identify issues like vanishing or exploding gradients
3. Log the gradient norms to Neptune for visualization and analysis using the `log_metrics` method

This approach allows you to monitor the learning dynamics across your entire model architecture in near real-time.

In [None]:
step_counter = 0
# Training loop
for epoch in range(1, params["epochs"] + 1):  # inclusive, so all configured epochs run
    # Process batches
    for batch in range(0, 10000, params["batch_size"]):
        # Update metrics for this batch
        batch_loss, batch_accuracy, gradient_norms = metrics.update_metrics()

        run.log_metrics(
            data={
                "metrics/train/loss": batch_loss,
                "metrics/train/accuracy": batch_accuracy,
                "epoch": epoch,
                **gradient_norms,
            },
            step=step_counter,
        )
        step_counter += 1

run.close()
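
It helps to know how much data the loop above produces. The arithmetic below uses the values from this tutorial (10,000 samples, batch size 32, 20 layers, spike probability 0.0001) to estimate the logged steps per epoch, the values logged per step, and how many spikes to expect; the variable names are illustrative.

```python
import math

num_samples = 10_000
batch_size = 32
num_layers = 20
spike_probability = 0.0001

# Same count as len(range(0, num_samples, batch_size))
steps_per_epoch = math.ceil(num_samples / batch_size)
# loss + accuracy + epoch + one gradient norm per layer
values_per_step = 3 + num_layers
expected_spikes_per_epoch = steps_per_epoch * spike_probability

print(steps_per_epoch, values_per_step, expected_spikes_per_epoch)
```

With roughly 0.03 expected spikes per epoch, it takes about 32 epochs' worth of steps to see one spike on average, so short runs may show none.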

### Step 4: _Analyze training behavior_
While your model trains, use Neptune's web interface to monitor and analyze metrics in near real-time:

**1. Real-time metric visualization**
- Navigate to the _Charts_ tab to view live training metrics
- Monitor multiple metrics simultaneously
- Track training progress in real-time

**2. Advanced metric filtering**
- Focus on specific metrics using [advanced regex search](https://docs.neptune.ai/charts#filtering-charts)
- Example: `norm & layer_\d` filters the charts down to the gradient norms of all layers
- Perfect for isolating problematic layers or components
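
To preview which metric names a filter such as `norm & layer_\d` would select, you can approximate it locally with Python's `re` module. This sketch assumes that `&` means every sub-pattern must match somewhere in the metric name; it is an approximation for illustration, not Neptune's own implementation, and `filter_metric_names` is a hypothetical helper.

```python
import re
from typing import Iterable, List


def filter_metric_names(names: Iterable[str], query: str) -> List[str]:
    """Keep names in which every '&'-separated regex finds a match (assumed semantics)."""
    patterns = [re.compile(part.strip()) for part in query.split("&")]
    return [name for name in names if all(p.search(name) for p in patterns)]


# The metric names logged in this tutorial
metric_names = [
    "metrics/train/loss",
    "metrics/train/accuracy",
    "epoch",
    *[f"debug/grad_norm/layer_{i}" for i in range(20)],
]
selected = filter_metric_names(metric_names, r"norm & layer_\d")
```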

**3. Create custom dashboards**
- Save filtered metrics to a [custom dashboard](https://docs.neptune.ai/custom_dashboard) for continuous monitoring
- Automatically updates during training
- Share with team members
- Ideal for tracking known problematic layers

**4. Dynamic metric analysis**
- Group related metrics (e.g., all layer gradients)
- Create automatically updating charts by [selecting metrics dynamically](https://docs.neptune.ai/chart_widget/#dynamic-metric-selection)
- Quickly identify vanishing/exploding gradients
- Example query: `layer_\d`

_Example dashboard:_
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/runs/compare?viewId=standard-view&dash=dashboard&dashboardId=9e9e6760-9e80-433a-867f-1a7d1ff40d30&compare=u4ir2KsDA_hFibTp9gV7tFm8XD-q2pARCUx-tpoKiKFc&compareChartsFilter-compound=uYOMsIHr6vYTYKmC9EoPmJB2Ry6L9tcXGcV5uA99vgRg">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>
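
The visual check for vanishing or exploding gradients can be complemented by a programmatic one that runs on the `gradient_norms` dictionary before it is logged. This is a minimal sketch; `flag_gradient_anomalies` is a hypothetical helper, and both thresholds are assumptions chosen to bracket the simulated norms (0.01 to 0.1 in normal steps), so tune them for a real model.

```python
from typing import Dict, List, Tuple


def flag_gradient_anomalies(
    gradient_norms: Dict[str, float],
    vanish_below: float = 1e-3,  # assumed threshold, tune for your model
    explode_above: float = 0.5,  # assumed threshold, tune for your model
) -> List[Tuple[str, str]]:
    """Return (metric_name, kind) pairs for norms outside the expected band."""
    flagged = []
    for name, value in gradient_norms.items():
        if value < vanish_below:
            flagged.append((name, "vanishing"))
        elif value > explode_above:
            flagged.append((name, "exploding"))
    return flagged


# Illustrative values only
sample = {
    "debug/grad_norm/layer_0": 0.05,
    "debug/grad_norm/layer_1": 4.2,
    "debug/grad_norm/layer_2": 0.0001,
}
anomalies = flag_gradient_anomalies(sample)
```

The flagged names point to the layers worth inspecting first in the charts.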

**5. Document training insights**
- Create [custom reports](https://docs.neptune.ai/reports) to track training progress
- Document anomalies and successful configurations
- Maintain training history
- Share insights with team members

_Example report:_
<a target="_blank" href="https://scale.neptune.ai/leo/pytorch-tutorial/reports/9e79d952-272a-4a38-83e5-27df4dd225ec">
  <img alt="Explore in Neptune" src="https://neptune.ai/wp-content/uploads/2024/01/neptune-badge.svg">
</a>

---

**Additional resources:**
- [PyTorch Layer-wise Tracking Package](TODO:Link to integration for tracking layer-wise metrics)
