# Neptune + PyTorch

<a target="_blank" href="https://colab.research.google.com/github/neptune-ai/scale-examples/blob/lr%2Fpytorch_example/integrations-and-supported-tools/pytorch/notebooks/pytorch_text_model_debugging.ipynb"> 
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> 
</a>

## Introduction
Global metrics such as loss or accuracy provide a high-level performance snapshot and ensure training is on course.

However, for large or foundation models, monitoring layer-wise metrics such as gradients and activations shows how each layer learns. This level of detail helps you identify issues and fine-tune individual layers for better overall performance.
The main challenge is the volume of data generated by layer-wise logging.

Neptune is built for tracking at this scale. It lets you capture, organize, and analyze metrics from every layer without disrupting the training process, no matter how large your model is.

This guide will show you how to:
- Initialize the **Neptune Run** object and log configuration parameters
- Create a **reusable class** to hook layer-wise metrics (`HookManager`)
- Log **aggregated metrics** such as loss and accuracy
- Log **layer-wise metrics** to debug model training such as:

| **Metric**                        | **Demonstrated** | **What it shows**                                                                                             | **How to capture**                                             |
|-----------------------------------|--------------------------------------|--------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| **Activations**                   | Yes                                  | Dead or exploding activations can indicate issues with training stability. | `HookManager`            |
| **Gradients**                     | Yes                                  | Essential for diagnosing vanishing or exploding gradients. Small gradients may indicate vanishing gradients, while large ones can signal instability. | `HookManager`        |
| **Parameters**            | Yes                                  | Tracks how the model’s parameters evolve during training. Large or small weights may indicate the need for better regularization or adjustments in learning rate. | Extract directly from the model’s parameters.                  |
| **Loss**               | No                                   | Identifies which parts of the network contribute more to the overall loss, aiding debugging and optimization. | Monitor outputs from each layer and compare with the target.   |
| **Learning rate**       | No                                   | Helpful if using techniques like Layer-wise Learning Rate Decay (LLRD). Tracking this can provide insight into the layer-specific learning rate. | Manually track based on optimizer settings.                    |
| **Output norms**            | No                                   | The L2-norm of layer outputs can highlight issues like gradient explosion or vanishing gradients. | Compute the L2-norm for each layer’s output.                   |
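
As an illustration of the "Output norms" row above, which this guide does not demonstrate, the following minimal sketch records the L2 norm of each leaf layer's output with forward hooks. The helper name `attach_output_norm_hooks` and the toy model are assumptions made for this example and are not part of the notebook.

```python
import torch
import torch.nn as nn


def attach_output_norm_hooks(model: nn.Module, store: dict) -> list:
    """Register forward hooks that record the L2 norm of each leaf module's output."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers; only hook leaf layers

        def hook(_module, _inputs, output, name=name):
            # Some layers (e.g. nn.LSTM) return a tuple; the first item is the output tensor
            tensor = output[0] if isinstance(output, tuple) else output
            store[name] = tensor.detach().norm(p=2).item()

        handles.append(module.register_forward_hook(hook))
    return handles


# Hypothetical toy model, used only for illustration
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
output_norms = {}
handles = attach_output_norm_hooks(toy_model, output_norms)

toy_model(torch.randn(3, 4))
print(output_norms)  # one entry per leaf layer, keyed by module name ("0", "1", "2")

for handle in handles:
    handle.remove()  # detach the hooks when you no longer need them
```

The resulting dictionary could be logged in the same way as the other debugging metrics in this guide.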

Use this notebook as a code recipe. Add your own code and adapt the sections to your own model training needs.

## Before you start

  1. Create a Neptune Scale account. [Register &rarr;](https://neptune.ai/early-access)
  2. Create a Neptune project for tracking metadata. For instructions, see [Projects](https://docs-beta.neptune.ai/projects/) in the Neptune Scale docs.
  3. Install and configure Neptune Scale for logging metadata. For instructions, see [Get started](https://docs-beta.neptune.ai/setup) in the Neptune Scale docs.

### Set environment variables
By setting your project name and API token as environment variables, you can use them throughout this notebook.

Uncomment the code block below and replace placeholder values with your own credentials:

In [None]:
# Set Neptune credentials as environment variables
# Note: %env keeps quotation marks as part of the value, so write the values without quotes
# %env NEPTUNE_API_TOKEN=your_api_token
# %env NEPTUNE_PROJECT=your_workspace_name_here/your_project_name
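
The `%env` magic only works inside IPython or Jupyter. If you adapt this recipe to a plain Python script, one possible alternative is to set the same variables through `os.environ`. The sketch below uses the placeholder values from above; `setdefault()` leaves a value untouched if the variable is already exported in your shell.

```python
import os

# Placeholder values for illustration; replace them with your own credentials
os.environ.setdefault("NEPTUNE_API_TOKEN", "your_api_token")
os.environ.setdefault("NEPTUNE_PROJECT", "your_workspace_name_here/your_project_name")

# Confirm that both variables are visible to the current process
missing = [
    name for name in ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT") if not os.environ.get(name)
]
print("Missing variables:", missing)
```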

### Install dependencies and import libraries

In [None]:
# Install dependencies
! pip install -qU neptune_scale torch datasets

In [1]:
# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from collections import Counter
from datasets import load_dataset

from neptune_scale import Run
import os

from typing import Literal


### Initialize parameters

In [2]:
# Define the hyperparameters for the model, optimizer, and training loop

params = {
    "optimizer": "Adam",
    "batch_size": 8,
    "learning_rate": 0.01,
    "epochs": 5,
    "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    "input_features": 256,
    "embed_size": 1000,
    "hidden_size": 256,  # hidden size for the LSTM
    "dropout_prob": 0.3,
    "num_lstm_layers": 3,
}

## Download or use next token prediction dataset
This example uses the dataset from [HuggingFace](https://huggingface.co/datasets/Na0s/Next_Token_Prediction_dataset) (HF).

You can increase the size of the dataset to test the logging capabilities of Neptune. Note that a larger dataset takes longer to download. The current setup downloads only a single training parquet file from the Hugging Face public dataset.

The validation dataset is also reduced to shorten the training loop execution time. To increase the validation size, change the `test_size` argument of the `train_test_split()` method.

In [3]:
# For the example, download a single training file and the validation file
base_url = "https://huggingface.co/datasets/Na0s/Next_Token_Prediction_dataset/resolve/main/data/"
data_files = {
    "train": base_url + "train-00001-of-00067.parquet",  # one of the 67 training parquet files
    "validation": base_url + "validation-00000-of-00001.parquet",  # the complete validation file
}

data_subset = load_dataset("parquet", data_files=data_files, num_proc=4)
# validation_subset = load_dataset("parquet", data_files = {"validation": base_url + "validation-00000-of-00001.parquet"}, num_proc=4, split=["validation[:5%]"])
# Keep 10% of the validation data (the "test" split) as the evaluation subset
validation_subset = data_subset.get("validation").train_test_split(test_size=0.1)
print(
    f"Training samples: {data_subset['train'].num_rows} \nValidation samples: {validation_subset['test'].num_rows}"
)

Training samples: 81926 
Validation samples: 935


## Create `DataLoader` objects
To train the model with PyTorch, convert the training and validation datasets to tensors. Then, set up a `DataLoader` for each dataset to simplify batching in the training loop.

The model architecture requires the vocabulary size as an input, so we derive it as the largest token ID in the training data plus one.

In [4]:
train_subset = data_subset["train"].with_format(
    type="torch", columns=["text", "input_ids", "labels"]
)  # HF provides methods to convert data types to tensors
validation_subset = validation_subset["test"].with_format(
    type="torch", columns=["text", "input_ids", "labels"]
)  # HF provides methods to convert data types to tensors

train_dataloader = DataLoader(train_subset, batch_size=params["batch_size"], shuffle=True)
val_dataloader = DataLoader(validation_subset, batch_size=params["batch_size"], shuffle=True)

# Determine the vocab size of the dataset
# Flatten the list of tokenized sentences into one long list of token IDs
vocab_size = (
    max([token for sentence in data_subset["train"]["input_ids"] for token in sentence]) + 1
)
params["vocab_size"] = vocab_size
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 128257
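
To illustrate why the vocabulary size is the largest token ID plus one, here is a small sketch on made-up token IDs. The `tokenized_sentences` list is hypothetical; `itertools.chain` is used here as one way to scan the nested lists without building an intermediate flat list.

```python
from itertools import chain

# Hypothetical tokenized sentences; each inner list holds token IDs
tokenized_sentences = [[5, 2, 9], [1, 7], [9, 3, 0]]

# Token IDs start at 0, so the largest ID plus one is the number of embedding rows needed
largest_token_id = max(chain.from_iterable(tokenized_sentences))
toy_vocab_size = largest_token_id + 1
print(toy_vocab_size)  # 10
```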


### Define PyTorch model architecture and helpers
Define a simple language model architecture using PyTorch. Since this is a text-based example, we use an embedding layer, an LSTM layer, and a fully connected layer.

You can adjust this architecture to your needs and increase its size when testing the workflow:
- To increase the number of LSTM layers, change the `num_lstm_layers` value in the `params` dictionary.
- To increase the number of fully connected layers, update the model architecture itself.
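
For the second option, a variant with a configurable stack of fully connected layers might look like the sketch below. It follows the same embedding, LSTM, and linear structure described above. The class name `DeeperLLM`, the `num_fc_layers` parameter, and the ReLU between layers are assumptions for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn


class DeeperLLM(nn.Module):
    """Embedding + LSTM model with a configurable number of fully connected layers."""

    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_fc_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers, batch_first=True)
        # Hidden FC layers keep the hidden size; they are registered as fc1, fc2, ...
        for i in range(1, num_fc_layers):
            setattr(self, f"fc{i}", nn.Linear(hidden_size, hidden_size))
        # The final FC layer projects to the vocabulary
        setattr(self, f"fc{num_fc_layers}", nn.Linear(hidden_size, vocab_size))
        self.num_fc_layers = num_fc_layers

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        for i in range(1, self.num_fc_layers):
            x = torch.relu(getattr(self, f"fc{i}")(x))
        return getattr(self, f"fc{self.num_fc_layers}")(x)


# Small sizes chosen only to keep the example fast
toy = DeeperLLM(vocab_size=50, embed_size=16, hidden_size=32, num_layers=2, num_fc_layers=3)
logits = toy(torch.randint(0, 50, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 50])
```

Naming the layers `fc1`, `fc2`, and so on keeps them easy to select with a single regex when you filter layer-wise metrics later.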

In [5]:
# Define the simple LLM model with LSTM
class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(SimpleLLM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        lstm_out, _ = self.lstm(x)  # LSTM returns the output sequence and a (hidden, cell) state tuple
        out = self.fc1(lstm_out)  # Project every time step to vocabulary logits
        return out


# Function to evaluate the model after each epoch/step
def evaluate(model, val_dataloader, criterion, device, vocab_size):
    model.eval()  # Set the model to evaluation mode
    total_loss = 0
    with torch.no_grad():  # Disable gradient calculation for validation
        for batch in val_dataloader:
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)

            # Forward pass for validation
            logits = model(input_ids)  # Shape: (batch_size, seq_len, vocab_size)

            # Calculate the loss
            loss = criterion(logits.view(-1, vocab_size), labels.view(-1))
            total_loss += loss.item()

    avg_val_loss = total_loss / len(val_dataloader)
    return avg_val_loss
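
The `evaluate()` function flattens the logits and labels before computing the loss. The following sketch shows what that reshaping does on toy tensors, assuming the criterion is `nn.CrossEntropyLoss(ignore_index=-100)` as in the training setup later in this guide. The tensor sizes and label values are made up for illustration.

```python
import torch
import torch.nn as nn

batch_size, seq_len, toy_vocab = 2, 3, 5
logits = torch.randn(batch_size, seq_len, toy_vocab)  # stand-in for the model output
labels = torch.tensor([[1, 4, -100], [0, -100, -100]])  # -100 marks positions to skip

criterion = nn.CrossEntropyLoss(ignore_index=-100)

# Flatten so that every token position becomes one row of class scores
flat_logits = logits.view(-1, toy_vocab)  # shape: (6, 5)
flat_labels = labels.view(-1)  # shape: (6,)
loss = criterion(flat_logits, flat_labels)

# Equivalent manual computation over the three positions that are not ignored
keep = flat_labels != -100
manual = nn.functional.cross_entropy(flat_logits[keep], flat_labels[keep])
print(torch.allclose(loss, manual))  # True
```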

### Create tracking class

This section creates a `HookManager` class that:
- Accepts a PyTorch model object as input.
- Registers hooks that capture the **activations** of every module and the **gradients** of the LSTM and linear layers during the forward and backward passes.

A `TorchWatcher` class then wraps `HookManager` and logs summary statistics of the captured values to Neptune.

In pseudocode, the `HookManager` workflow looks like this:

```python
# Initialize model
model = your_ModelClass()
# Register hooks
hm = HookManager(model)
hm.register_hooks()

# Training loop
for epoch in range(3):
    
    # Forward pass, e.g. output = model(inputs)
    # Backward pass, e.g. loss.backward()
    
    activations = hm.get_activations()
    gradients = hm.get_gradients()

    # Log values (mean, std, etc.) to Neptune
```

_Important_: You can use the `HookManager` class in your own training script, as it requires only a model object as input.
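
To see what the two hook types receive, here is a standalone sketch on a single linear layer. It uses only `register_forward_hook` and `register_full_backward_hook` from PyTorch; the `captured` dictionary and the toy input are assumptions for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
captured = {}


def forward_hook(module, inputs, output):
    # Called after forward(); `output` is the layer's activation
    captured["activation"] = output.detach()


def backward_hook(module, grad_input, grad_output):
    # Called during backward(); grad_output[0] is d(loss)/d(layer output)
    captured["gradient"] = grad_output[0].detach()


handles = [
    layer.register_forward_hook(forward_hook),
    layer.register_full_backward_hook(backward_hook),
]

x = torch.randn(3, 4, requires_grad=True)
loss = layer(x).sum()
loss.backward()

print(captured["activation"].shape)  # torch.Size([3, 2])
print(captured["gradient"])  # all ones, because the loss is the sum of the outputs

for handle in handles:
    handle.remove()  # remove the hooks once you are done
```

Storing detached tensors, as this sketch does, avoids keeping the autograd graph alive between steps.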

In [None]:
# A class to manage hooks for activations and gradients
class HookManager:
    def __init__(self, model):
        if not isinstance(model, nn.Module):
            raise TypeError("The model must be a PyTorch model")

        self.model = model
        self.hooks = []
        self.activations = {}
        self.gradients = {}

    # Function to save activations
    def save_activation(self, name):
        def hook(module, input, output):
            self.activations[name] = output

        return hook

    # Function to save gradients (used with module-level backward hooks)
    def save_gradient(self, name):
        def hook(module, grad_input, grad_output):
            self.gradients[name] = grad_output[0]

        return hook

    # Function to register hooks for activations and gradients
    def register_hooks(self):
        # Register forward hooks for activations
        for name, module in self.model.named_modules():
            self.hooks.append(module.register_forward_hook(self.save_activation(name)))

        # Register backward hooks for gradients
        for name, module in self.model.named_modules():
            if isinstance(module, (nn.LSTM, nn.Linear)):  # You can add more layer types here
                self.hooks.append(module.register_full_backward_hook(self.save_gradient(name)))

    # Function to clear activations and gradients after use
    def clear(self):
        self.activations = {}
        self.gradients = {}

    # Function to get activations
    def get_activations(self):
        return self.activations

    # Function to get gradients
    def get_gradients(self):
        return self.gradients


from typing import Literal, List, Optional  
class TorchWatcher:
    def __init__(self, 
                 model: nn.Module, 
                 run: Run, 
                 ) -> None:
        
        self.model = model
        self.run = run
        self.hm = HookManager(model)
        self.hm.register_hooks()
        self.debug_metrics = {}

    def track_activations(self): 
        # Track activations
        activations = self.hm.get_activations()
        for layer, activation in activations.items():
            if layer is not None:
                self.debug_metrics[f"debug/activation/{layer}_mean"] = activation[0].mean().item()
                self.debug_metrics[f"debug/activation/{layer}_std"] = activation[0].std().item()

    def track_gradients(self):
        # Track gradients with hooks
        gradients = self.hm.get_gradients()
        for layer, gradient in gradients.items():
            self.debug_metrics[f"debug/gradient/{layer}_mean"] = gradient.mean().item()

    def track_parameters(self):
        # Track gradient statistics for each parameter tensor
        for layer, param in self.model.named_parameters():
            if param.grad is not None:  # Skip parameters that have no gradient yet
                self.debug_metrics[f"debug/parameters/{layer}_std"] = param.grad.std().item()
                self.debug_metrics[f"debug/parameters/{layer}_mean"] = param.grad.mean().item()
                self.debug_metrics[f"debug/parameters/{layer}_norm"] = (
                    param.grad.norm().item()
                )  # L2 norm (Euclidean norm) of the gradients

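    def watch(
        self,
        step: int | float,
        log: Literal["gradients", "parameters", "activations", "all"] = "all",
    ):
        # Start from an empty dict so that stale values from earlier steps are not re-logged
        self.debug_metrics = {}

        match log:  # Structural pattern matching requires Python 3.10+
            case "gradients":
                self.track_gradients()
            case "parameters":
                self.track_parameters()
            case "activations":
                self.track_activations()
            case "all":
                self.track_gradients()
                self.track_parameters()
                self.track_activations()
            case _:
                raise ValueError(f"Unsupported value for log: {log!r}")

        if self.debug_metrics:
            self.run.log_metrics(data=self.debug_metrics, step=step)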

        self.hm.clear()
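
The per-parameter statistics can also be computed without the watcher class. The sketch below is a standalone helper that returns the mean, standard deviation, and L2 norm of each parameter's gradient as a dictionary that you could pass to `run.log_metrics()`. The function name `gradient_stats` and the toy model are assumptions for illustration.

```python
import torch
import torch.nn as nn


def gradient_stats(model: nn.Module, prefix: str = "debug/parameters") -> dict:
    """Collect the mean, std, and L2 norm of each parameter's gradient."""
    stats = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue  # the parameter has not received a gradient yet (or is frozen)
        grad = param.grad.detach()
        stats[f"{prefix}/{name}_mean"] = grad.mean().item()
        stats[f"{prefix}/{name}_std"] = grad.std().item()
        stats[f"{prefix}/{name}_norm"] = grad.norm().item()
    return stats


toy_model = nn.Linear(4, 2)
print(gradient_stats(toy_model))  # {} because backward() has not run yet

toy_model(torch.randn(8, 4)).sum().backward()
stats = gradient_stats(toy_model)
print(sorted(stats))  # three entries each for "weight" and "bias"
```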

## Set up model training

### Initialize Neptune run object and log hyperparameters

In [17]:
from neptune_scale import Run
from uuid import uuid4

custom_run_id = f"pytorch-text-{uuid4()}"  # Create your own custom run_id
experiment_name = "pytorch-text"  # Create a run that is the head of an experiment. This will also be used for forking.

run = Run(
    run_id=custom_run_id,
    experiment_name=experiment_name,
)

run.log_configs(
    {
        "config/learning_rate": params["learning_rate"],
        "config/optimizer": params["optimizer"],
        "config/batch_size": params["batch_size"],
        "config/epochs": params["epochs"],
        "config/num_lstm_layers": params["num_lstm_layers"],
        "data/vocab_size": params["vocab_size"],
        "data/embed_size": params["embed_size"],
    }
)

run.add_tags(tags=[params["optimizer"]], group_tags=True)
run.add_tags(tags=["text", "LLM", "Simple"])

print(run.get_experiment_url())

https://scale.neptune.ai/leo/pytorch-tutorial/runs/details?runIdentificationKey=pytorch-text&type=experiment


### Execute model training loop
In this loop, the `TorchWatcher` object creates a `HookManager` and registers the hooks on the model.

After the forward and backward passes of each step, call `watcher.watch()` to log summary statistics of the values captured by the hooks.

In [None]:
# Training setup
debug_metrics = {}  # Optional extra metrics to log alongside the loss; empty in this example

# Initialize model and optimizer
model = SimpleLLM(
    params["vocab_size"], params["embed_size"], params["hidden_size"], params["num_lstm_layers"]
)
optimizer = optim.Adam(model.parameters(), lr=params["learning_rate"])
criterion = nn.CrossEntropyLoss(
    ignore_index=-100
)  # Skip positions labeled -100 in the dataset (e.g. padding)

# Define watcher class
watcher = TorchWatcher(model=model, run=run)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
step_counter = 0

# Training loop
for epoch in range(params["epochs"]):
    total_loss = 0
    for batch in train_dataloader:
        model.train()
        step_counter += 1

        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()

        # Forward pass
        logits = model(input_ids)

        # Compute the loss (ignore padding tokens by masking labels)
        loss = criterion(logits.view(-1, vocab_size), labels.view(-1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        print(f"Step {step_counter} / {len(train_dataloader)}, Loss: {loss.item()}")

        # Call watch() in the loop and choose which layer-wise metrics to log
        watcher.watch(step=step_counter, log="gradients")

        if step_counter % 50 == 0:  # Log validation loss at every 50 steps
            val_loss = evaluate(model, val_dataloader, criterion, device, vocab_size)

            run.log_metrics(
                data={
                    "metrics/train/loss": loss.item(),
                    "metrics/validation/loss": val_loss,
                    "epoch/value": epoch,
                    **debug_metrics,
                },
                step=step_counter,
            )
        else:  # Log training loss and debugging metrics for each step
            run.log_metrics(
                data={"metrics/train/loss": loss.item(), "epoch/value": epoch, **debug_metrics},
                step=step_counter,
            )

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_dataloader)}")

# Close run to ensure all operations are processed
run.close()

Step 1 / 10241, Loss: 11.762494087219238
Step 2 / 10241, Loss: 11.673644065856934
Step 3 / 10241, Loss: 10.27919864654541
Step 4 / 10241, Loss: 9.545785903930664
Step 5 / 10241, Loss: 9.699625968933105
Step 6 / 10241, Loss: 9.690692901611328
Step 7 / 10241, Loss: 10.060093879699707

## What's next?
While the model is training, you can start using the Neptune web app to browse your metrics and create custom analyses and visualizations:
1. To visualize the large number of metrics being logged in near real time, navigate to the **Charts** tab of the active run.
2. Filter the metrics using the [advanced regex searching capabilities](https://docs-beta.neptune.ai/charts#filtering-charts). For example, enter `.*gradient+.*fc\d` in the search bar. This query filters all metrics for the gradients of the fully connected layers. The more FC layers, the more charts will appear.
3. Export the filter to a [dashboard](https://docs-beta.neptune.ai/custom_dashboard). The saved dashboard will now only display these metrics during training.
4. Use the [dynamic metric selection](https://docs-beta.neptune.ai/chart_widget#dynamic-metric-selection) and update the chart widget to display the gradients of all fully connected layers in one chart. Again, use the `.*gradient+.*fc\d` query.
5. Create a [custom report](https://docs-beta.neptune.ai/reports) to outline the model training, global metrics, debugging metrics, and more.
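
To preview which metrics the `.*gradient+.*fc\d` query selects, you can test it locally. This sketch assumes the search bar follows regex semantics similar to Python's `re` module, and the metric names are hypothetical examples that follow the naming used for the debug metrics in this guide.

```python
import re

pattern = re.compile(r".*gradient+.*fc\d")

# Hypothetical metric names, following the naming used in this guide
metric_names = [
    "debug/gradient/fc1_mean",
    "debug/gradient/lstm_mean",
    "debug/activation/fc1_mean",
    "metrics/train/loss",
]

matches = [name for name in metric_names if pattern.search(name)]
print(matches)  # ['debug/gradient/fc1_mean']
```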

See also [a generic example of the training result](https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908).
