# Debug Model Training with Neptune

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lb%2Fdebugging_model_training/how-to-guides/debug-model-training-runs/debug_trainng_runs.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>

## Introduction
Training large models requires careful monitoring of layer-wise metrics to catch issues early. 

Neptune makes it easy to track and visualize metrics such as gradient norms across all layers of your model, which helps you quickly identify problems like vanishing or exploding gradients.

In this tutorial, you'll learn how to:
1. **Initialize Neptune** and **log configuration parameters**
2. Track **layer-wise gradient norms** during training 
3. Analyze the metrics in Neptune's UI to **debug training issues**

To see a finished version, step through this [pre-configured report](https://scale.neptune.ai/leo/pytorch-tutorial/reports/9e79d952-272a-4a38-83e5-27df4dd225ec).

_Note: This is a code recipe that you can adapt for your own model training needs._

 ![Layer-wise gradient norms visualization in Neptune](tutorial-images/debugging_report.png)

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
Set your project name and API token as environment variables to log to your Neptune Scale project.

Uncomment the code block below and replace placeholder values with your own credentials:

In [20]:
# Set Neptune credentials as environment variables
# %env NEPTUNE_API_TOKEN = YOUR_API_TOKEN
# %env NEPTUNE_PROJECT = WORKSPACE_NAME/PROJECT_NAME
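
If you run this recipe as a plain Python script rather than a notebook, the `%env` magics are not available. The following is a minimal sketch, assuming the same two variable names as above, that checks whether they are set before training starts. `missing_neptune_vars` is a hypothetical helper and not part of Neptune.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_neptune_vars()
    if missing:
        print(f"Set these environment variables before logging: {', '.join(missing)}")
```

Passing a dictionary instead of relying on `os.environ` keeps the check easy to try out without touching your real credentials.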

### Install dependencies and import libraries

In [19]:
# Install dependencies
! pip install -qU neptune_scale torch datasets

import torch

### Initialize parameters

In [21]:
params = {
    "optimizer": "Adam",
    "batch_size": 8,
    "learning_rate": 0.01,
    "epochs": 5,
    "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    "input_features": 256,
    "embed_size": 1000,
    "hidden_size": 256,  # hidden size for the LSTM
    "dropout_prob": 0.3,
    "num_lstm_layers": 3,
}

### Set up data, model, and other required PyTorch functions

The `setup.py` script wraps the data and model creation for use in this tutorial. You can use your own data and model setup if required. 

In [22]:
from setup import setup_training
   
# Set up the complete training environment
model, optimizer, criterion, train_dataloader, val_dataloader, vocab_size = setup_training(params, use_multilayer=True)

Training samples: 81926
Validation samples: 935
Vocabulary size: 128257
Model created: MultilayerModel
Optimizer: Adam
Criterion: CrossEntropyLoss


## Debug model training run with Neptune

### Step 1: _Initialize Neptune Run object_

The `Run` object is used to log configuration parameters and metrics. 

In [27]:
from neptune_scale import Run

run = Run(
    experiment_name="pytorch-text", # Create a run that is the head of an experiment. This is also used for forking.
)

print(run.get_experiment_url())

https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=pytorch-text&type=experiment


### Step 2: _Log configuration parameters and tags_

In [31]:
run.log_configs(
    {
        "config/learning_rate": params["learning_rate"],
        "config/optimizer": params["optimizer"],
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/num_lstm_layers": params["num_lstm_layers"],
        "data/embed_size": params["embed_size"],
    }
)

run.add_tags(tags=["text", "LLM", "Simple", params["optimizer"]])

print(f"See configuration parameters:\n{run.get_experiment_url() + '&detailsTab=metadata'}")

See configuration parameters:
https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=pytorch-text&type=experiment&detailsTab=metadata
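
The explicit mapping above can get long as `params` grows. The following is a minimal sketch of a helper that builds the same kind of flat, namespaced dictionary from any parameter dictionary. `flatten_config` is a hypothetical name, and the sketch assumes that values which are not plain numbers, booleans, or strings (such as the `torch.device` entry in `params`) are best stored as strings.

```python
def flatten_config(params, prefix="config"):
    """Build a flat {"<prefix>/<key>": value} dictionary (hypothetical helper)."""
    flat = {}
    for key, value in params.items():
        # Assumption: keep simple scalar types as-is and stringify everything else
        if not isinstance(value, (bool, int, float, str)):
            value = str(value)
        flat[f"{prefix}/{key}"] = value
    return flat


# Made-up input that mixes scalar values with a non-scalar one
example = flatten_config({"learning_rate": 0.01, "optimizer": "Adam", "device": ("cuda", 0)})
print(example)
```

The result could then be passed to `run.log_configs()` in place of the hand-written dictionary. Note that this sketch puts every key under one prefix, whereas the cell above logs `embed_size` under `data/`.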


### Step 3: _Execute model training loop_

In this training loop, we:
1. Register backward hooks to capture gradient norms from all model layers with `register_full_backward_hook()`
2. Track these norms during training in a dictionary called `debugging_gradient_norms` to identify potential issues like vanishing or exploding gradients
3. Log the gradient norms to Neptune for visualization and analysis using the `log_metrics` method

This approach allows you to monitor the learning dynamics across your entire model architecture in near real-time.
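
Besides charting the norms, a dictionary shaped like `debugging_gradient_norms` can also be screened in code. The following is a minimal sketch in which `flag_gradient_issues`, both thresholds, and the layer names in the sample data are hypothetical and would need adapting to your model.

```python
import math


def flag_gradient_issues(grad_norms, vanish_below=1e-7, explode_above=1e3):
    """Label suspicious entries in a {metric_name: norm} dictionary.

    The thresholds are illustrative assumptions, not recommended values.
    """
    issues = {}
    for name, norm in grad_norms.items():
        if math.isnan(norm) or math.isinf(norm):
            issues[name] = "non-finite"
        elif norm < vanish_below:
            issues[name] = "vanishing"
        elif norm > explode_above:
            issues[name] = "exploding"
    return issues


# Made-up values that follow the debug/gradient/<layer>/norm key format
sample = {
    "debug/gradient/lstm/norm": 3.2e-9,
    "debug/gradient/fc/norm": 0.8,
    "debug/gradient/embedding/norm": 4.1e4,
}
print(flag_gradient_issues(sample))
```

A check like this could run every few hundred steps and complements, rather than replaces, the visual inspection in Neptune.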

In [None]:
# Dictionary of layer-wise gradient norms, refreshed on every backward pass
debugging_gradient_norms = {}


def make_hook(layer_name):
    """Return a backward hook that records the gradient norm for `layer_name`."""

    def hook_fn(module, grad_input, grad_output):
        if grad_input[0] is not None:  # Skip layers that receive no input gradient
            grad_norm = grad_input[0].norm().item()
            debugging_gradient_norms[f"debug/gradient/{layer_name}/norm"] = grad_norm

    return hook_fn


# Register hooks once before training, skipping the unnamed root module
for name, module in model.named_modules():
    if name:
        module.register_full_backward_hook(make_hook(name))

# Print a direct link to the Charts tab of the run
print(f"View charts in near real-time:\n{run.get_experiment_url() + '&detailsTab=charts'}")

step_counter = 0
# Training loop
for epoch in range(params["epochs"]):
    model.train()  # Enable training behavior such as dropout once per epoch
    for batch in train_dataloader:
        step_counter += 1
        input_ids = batch["input_ids"].to(params["device"])
        labels = batch["labels"].to(params["device"])
        optimizer.zero_grad()
        logits = model(input_ids)
        loss = criterion(logits.view(-1, vocab_size), labels.view(-1))
        loss.backward()
        optimizer.step()

        # Log global training loss and layer-wise gradient norms
        run.log_metrics(
            data={
                "metrics/train/loss": loss.item(),
                **debugging_gradient_norms,
            },
            step=step_counter,
        )

# Close run to ensure all operations are processed
run.close()

View charts in near real-time:
https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=pytorch-text&type=experiment&detailsTab=charts


2025-04-08 09:18:43,794 neptune:ERROR: 

NeptuneSeriesStepNonIncreasing: Subsequent steps of a series must be increasing.

This can be caused by either:
- The step of a series value is smaller than the most recently logged step for this series
- the step is exactly the same but the value is different

For help, see https://docs-beta.neptune.ai/log_metrics
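
The error above means Neptune received a step that did not move forward for a series. One way this can happen, offered here as an assumption rather than a confirmed cause, is re-running the training cell against the same run, which resets `step_counter` to 0. The following is a minimal sketch of a local guard that catches such a step before it is logged. `StepGuard` is a hypothetical helper, and it is stricter than the rule quoted above because it also rejects a repeated step with an identical value.

```python
class StepGuard:
    """Track the last logged step and reject steps that do not increase (hypothetical helper)."""

    def __init__(self):
        self.last_step = None

    def check(self, step):
        # Stricter than the quoted rule: equal steps are rejected too
        if self.last_step is not None and step <= self.last_step:
            raise ValueError(
                f"Step {step} does not exceed the last logged step {self.last_step}"
            )
        self.last_step = step
        return step


guard = StepGuard()
for step in (1, 2, 3):
    guard.check(step)  # Passes because the steps increase

try:
    guard.check(1)  # Fails because the counter went backwards
except ValueError as err:
    print(err)
```

Calling `guard.check(step_counter)` right before `run.log_metrics(...)` would surface the problem locally instead of in the Neptune error log.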





### Step 4: _Analyze and debug training_
While the model is training, you can start using the Neptune web app to browse your metrics and create custom analyses and visualizations:
1. To visualize the large number of metrics being logged in near real time, navigate to the **Charts** tab of the active run (_or select link above_).
2. Filter the metrics using the [advanced regex searching capabilities](https://docs-beta.neptune.ai/charts#filtering-charts). For example, enter `gradient & fc & layers.[0-5] & norm` in the search bar. This query shows only the gradient norms of the first six fully connected layers. You can narrow the query down to the exact metric name you want.

![Filtering charts with a regex search in Neptune](tutorial-images/debugging_regex_search.png)

3. Export the filter to a [dashboard](https://docs-beta.neptune.ai/custom_dashboard). The saved dashboard then displays only these metrics during training, which is useful if you know that certain layers tend to be troublesome.
4. Alternatively, use [dynamic metric selection](https://docs-beta.neptune.ai/chart_widget#dynamic-metric-selection) to create a chart widget that displays the gradient norms of all LSTM layers in one chart. For example, use the `(.*gradient)(.*lstm)(.*norm)` query. This gives you an automatically updating chart that shows all layers at once, so you can quickly spot vanishing or exploding gradients.

![Dashboard of layer-wise gradient norms in Neptune](tutorial-images/debugging_dashboard.png)

5. To document this behavior, create a [custom report](https://docs-beta.neptune.ai/reports) that outlines the model training, global metrics, and debugging metrics for the model you're training. A report lets you keep track of anomalies and record what did or did not work during training.

![Layer-wise gradient norms visualization in Neptune](tutorial-images/debugging_report.png)

See the pre-configured [example of the training report](https://scale.neptune.ai/leo/pytorch-tutorial/reports/9e79d952-272a-4a38-83e5-27df4dd225ec).
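
To get a feel for which series a query selects before typing it into the search bar, you can try the patterns offline with Python's `re` module. This is a rough sketch: the metric names are made up to follow the `debug/gradient/<layer>/norm` format used in Step 3, and treating `&` as "every part must match" is an assumption about the search syntax, so Neptune's own matching may differ.

```python
import re

# Hypothetical metric names; real names depend on your model's layer names
metric_names = [
    "debug/gradient/lstm/norm",
    "debug/gradient/fc.layers.2/norm",
    "debug/gradient/fc.layers.7/norm",
    "metrics/train/loss",
]


def match_regex(names, pattern):
    """Return the names that contain a match for a single regular expression."""
    return [name for name in names if re.search(pattern, name)]


def match_all_parts(names, query):
    """Approximate an `a & b & c` query: every part must match somewhere in the name."""
    parts = [part.strip() for part in query.split("&")]
    return [name for name in names if all(re.search(part, name) for part in parts)]


print(match_regex(metric_names, r"(.*gradient)(.*lstm)(.*norm)"))
print(match_all_parts(metric_names, "gradient & fc & layers.[0-5] & norm"))
```

Trying a query against a handful of known metric names first can make it easier to tell whether an empty chart view comes from the pattern or from missing data.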

See also: PyTorch layer-wise tracking package [here](TODO:Link to integration for tracking layer-wise metrics)
