# Chapter 7. Debugging PyTorch Models

We’ve created a lot of models so far in this book, but in this chapter, we have a brief look at interpreting them and working out what’s going on underneath the covers. 

We take a look at using class activation mapping with PyTorch hooks to determine the focus of a model’s decision, and at how to connect PyTorch to Google’s TensorBoard for debugging purposes.

I show how to use flame graphs to identify the bottlenecks in transforms and training pipelines, as well as provide a worked example of speeding up a slow transformation. Finally, we look at how to trade compute for memory when working with larger models using checkpointing.

## TensorBoard

TensorBoard is a web application designed for visualizing various aspects of neural networks. It allows for easy, real-time viewing of statistics such as accuracy, losses, activation values, and really anything you want to send across the wire.

Although it was written with TensorFlow in mind, its API is framework-agnostic and fairly straightforward, so working with it in PyTorch is not that different from how you’d use it in TensorFlow.

TensorBoard can be started from the command line:

tensorboard --logdir=runs

You can then go to http://[your-machine]:6006, where you’ll see the welcome screen shown in Figure 7-1. We can now send data to the application.

Excerpt From: Ian Pointer. “Programming PyTorch for Deep Learning”. Apple Books. 

## Sending Data to TensorBoard

The module for using TensorBoard with PyTorch is located in torch.utils.tensorboard:

from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
writer.add_scalar('example', 3)

We use the SummaryWriter class to talk to TensorBoard using the standard location for logging output, ./runs, and we can send a scalar by using add_scalar with a tag.

In [1]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
writer.add_scalar('example', 3)

In [2]:
import random

# Simulate a noisy metric with a random walk and log one point per step.
value = 10
writer.add_scalar('test_loop', value, 0)
for i in range(1, 10000):
    value += random.random() - 0.5
    writer.add_scalar('test_loop', value, i)

We can use this to replace our print statements in the training loop. We can also send the model itself to get a representation in TensorBoard!

In [3]:
import torch
import torch.nn.functional as F
import torchvision
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms, models

writer = SummaryWriter()
model = models.resnet18()  # randomly initialized weights
# Trace the model with a dummy batch so TensorBoard can draw its graph.
writer.add_graph(model, torch.rand([1, 3, 224, 224]))

def train(model, optimizer, loss_fn, train_loader, val_loader, epochs=20):
    iteration = 0  # global batch counter, used as the step for the loss curve

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            inputs, target = batch
            output = model(inputs)
            loss = loss_fn(output, target)
            writer.add_scalar('loss', loss.item(), iteration)
            loss.backward()
            optimizer.step()
            iteration += 1

        model.eval()
        num_correct = 0
        num_examples = 0
        with torch.no_grad():  # no gradients are needed for evaluation
            for batch in val_loader:
                inputs, target = batch
                output = model(inputs)
                correct = torch.eq(torch.max(F.softmax(output, dim=1), dim=1)[1], target).view(-1)
                num_correct += torch.sum(correct).item()
                num_examples += correct.shape[0]
        # Report accuracy once per epoch, after the whole validation set has been seen.
        print("Epoch {}, accuracy = {:.2f}".format(epoch, num_correct / num_examples))
        writer.add_scalar('accuracy', num_correct / num_examples, epoch)

We now have the ability to send accuracy and loss information as well as model structure to TensorBoard. By aggregating multiple runs of accuracy and loss information, we can see whether anything is different in a particular run compared to others, which is a useful clue when trying to work out why a training run produced poor results. We return to TensorBoard shortly, but first let’s look at other features that PyTorch makes available for debugging.
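One way to keep runs apart for comparison is to give each run its own subdirectory under ./runs. The sketch below is only an illustration: run_name() and make_writer() are hypothetical helpers, and the hyperparameter names (lr, batch_size) are assumptions rather than anything from the training code above. The log_dir argument itself is part of SummaryWriter.

```python
import os


def run_name(**hparams):
    """Build a deterministic run name, e.g. 'batch_size-64_lr-0.01'."""
    return "_".join(f"{key}-{hparams[key]}" for key in sorted(hparams))


def make_writer(base_dir="runs", **hparams):
    """Create a SummaryWriter that logs to its own subdirectory of base_dir."""
    # Imported lazily so run_name() can be used without TensorBoard installed.
    from torch.utils.tensorboard import SummaryWriter

    return SummaryWriter(log_dir=os.path.join(base_dir, run_name(**hparams)))


print(run_name(lr=0.01, batch_size=64))  # batch_size-64_lr-0.01
```

With TensorBoard started using --logdir=runs, each subdirectory shows up as a separate run that can be toggled on and off in the UI.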

## PyTorch Hooks

PyTorch has hooks, which are functions that can be attached to either a tensor or a module on the forward or backward pass. 

When PyTorch encounters a module with a hook during a pass, it will call the registered hooks. A hook registered on a tensor will be called when its gradient is being calculated.
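As a minimal sketch of the tensor case, the following registers a hook on an intermediate tensor and records the gradient that reaches it during the backward pass. The tensors and the captured list are made up for illustration.

```python
import torch

captured = []

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2  # intermediate tensor

# The hook is called with the gradient of the loss with respect to y.
handle = y.register_hook(lambda grad: captured.append(grad.clone()))

loss = (y ** 2).sum()  # d(loss)/dy = 2 * y
loss.backward()
handle.remove()

print(captured[0])  # tensor([ 4.,  8., 12.])
print(x.grad)       # tensor([ 8., 16., 24.]), via the chain rule
```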

Hooks are potentially powerful ways of manipulating modules and tensors because you can completely replace the output of what comes into the hook if you so desire. You could change the gradient, mask off activations, replace all the biases in the module, and so on. In this chapter, though, we’re just going to use them as a way of obtaining information about the network as data flows through.
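To illustrate the “replace the output” idea, here is a small sketch in which a forward hook masks off negative activations by returning a new tensor. The layer sizes and the mask_negative name are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


def mask_negative(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output.clamp(min=0.0)


layer = nn.Linear(4, 3)
x = torch.randn(2, 4)

raw = layer(x)                                       # unmodified output
handle = layer.register_forward_hook(mask_negative)
hooked = layer(x)                                    # output rewritten by the hook
handle.remove()
restored = layer(x)                                  # original behaviour is back
```

A hook that returns nothing (None) leaves the output untouched, which is how the observation-only hooks in the rest of this chapter behave.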

Given a ResNet-18 model, we can attach a forward hook on a particular part of the model by using register_forward_hook:

In [4]:
def print_hook(module, input, output):
    # A forward hook receives the module, a tuple of its inputs, and its output.
    print(f"Shape of input is {input[0].shape}")

model = models.resnet18()
hook_ref = model.fc.register_forward_hook(print_hook)
model(torch.rand([1,3,224,224]))  # hook fires and prints
hook_ref.remove()
model(torch.rand([1,3,224,224]))  # hook removed, nothing is printed

Shape of input is torch.Size([1, 512])


tensor([[ 1.4940e-01,  2.5545e-01, -1.3384e-01, -6.8795e-01, -3.1285e-01,
         -2.4005e-01, -7.4302e-02,  4.4725e-01, -5.0306e-01,  7.5905e-01,
         ...

If you run this code you should see text printed out showing the shape of the input to the linear classifier layer of the model. Note that the second time you pass a random tensor through the model, you shouldn’t see the print statement. 

When we add a hook to a module or tensor, PyTorch returns a reference to that hook. We should always save that reference (here we do it in hook_ref) and then call remove() when we’re finished. If you don’t store the reference, you have no way to remove the hook, so it stays attached, taking up valuable memory (and potentially wasting compute resources during a pass). Backward hooks work in the same way, except you call register_backward_hook() instead; recent PyTorch releases deprecate that method in favor of register_full_backward_hook().
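A backward hook can be sketched in the same style. This example assumes a PyTorch version that provides register_full_backward_hook(); the layer and the grad_shapes list are illustrative only.

```python
import torch
import torch.nn as nn

grad_shapes = []


def record_grad(module, grad_input, grad_output):
    # grad_output is a tuple of gradients with respect to the module's outputs.
    grad_shapes.append(grad_output[0].shape)


layer = nn.Linear(4, 3)
handle = layer.register_full_backward_hook(record_grad)

x = torch.randn(2, 4, requires_grad=True)
layer(x).sum().backward()  # hook fires once
handle.remove()
layer(x).sum().backward()  # hook has been removed, nothing is recorded

print(grad_shapes)  # [torch.Size([2, 3])]
```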

Of course, if we can print() something, we can certainly send it to TensorBoard! Let’s see how to use both hooks and TensorBoard to get important stats on our layers during training.

In [5]:
def send_stats(i, module, input, output):
    writer.add_scalar(f"layer {i}-mean", output.data.mean())
    writer.add_scalar(f"layer {i}-stddev", output.data.std())

We can’t use this by itself as a forward hook, because PyTorch calls hooks with only the module, input, and output. Using Python’s functools.partial(), though, we can create a series of forward hooks, each with its own fixed i value, so that the correct values are routed to the right graphs in TensorBoard:

In [6]:
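from functools import partial

# MNIST images have a single channel, so swap ResNet's 3-channel input layer for a 1-channel one.
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Keep the returned references so the hooks can be removed later with remove().
hook_refs = []
for i, m in enumerate(model.children()):
    hook_refs.append(m.register_forward_hook(partial(send_stats, i)))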
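What partial() does here can be seen in isolation, without PyTorch at all. In this sketch the send_stats body is a stand-in that records its arguments, and the strings passed in are placeholders for a real module, input, and output.

```python
from functools import partial

calls = []


def send_stats(i, module, input, output):
    # Stand-in for the real hook body: record which layer index fired.
    calls.append((i, output))


# partial() fixes the first argument, leaving the (module, input, output)
# signature that PyTorch expects from a forward hook.
hook_for_layer_3 = partial(send_stats, 3)
hook_for_layer_3("module", ("input",), "output")

print(calls)  # [(3, 'output')]
```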

Note that we’re using model.children(), which will attach only to each top-level block of the model, so if we have an nn.Sequential() layer (which we will have in a ResNet-based model), we’ll attach a hook to only that block and not one for each individual module within the nn.Sequential list.
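The difference between top-level children and individual modules is easy to check on a toy model. The network below is a made-up stand-in for a ResNet-style layout, with one nested nn.Sequential block.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),
    nn.Sequential(  # nested block, analogous to a ResNet "layer"
        nn.Conv2d(8, 8, kernel_size=3),
        nn.ReLU(),
    ),
    nn.Flatten(),
)

# children() yields only the top-level blocks.
top_level = [name for name, _ in model.named_children()]
# modules() walks the whole tree; leaves are modules with no children of their own.
leaves = [name for name, m in model.named_modules() if len(list(m.children())) == 0]

print(top_level)  # ['0', '1', '2']
print(leaves)     # ['0', '1.0', '1.1', '2']
```

Iterating over the leaves instead of model.children() is one way to attach a hook to every individual layer, at the cost of many more graphs in TensorBoard.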

If we train our model with our usual training function, we should see the activations start streaming into TensorBoard. You’ll have to switch the horizontal axis to wall-clock time within the UI, because the hook doesn’t send step information to TensorBoard: it sees the module’s data only when PyTorch calls it and knows nothing about the training step.
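If step-based plots are preferred, one option is to let the hook keep its own counter. The StatsHook class below is a hypothetical sketch, not part of PyTorch; log_fn is assumed to be any callable taking (tag, value, step), such as writer.add_scalar, and a plain list stands in for it here.

```python
import torch
import torch.nn as nn


class StatsHook:
    """Forward hook that logs activation statistics with its own step counter."""

    def __init__(self, name, log_fn):
        self.name = name
        self.log_fn = log_fn  # e.g. writer.add_scalar
        self.step = 0

    def __call__(self, module, inputs, output):
        out = output.detach()
        self.log_fn(f"{self.name}-mean", out.mean().item(), self.step)
        self.log_fn(f"{self.name}-stddev", out.std().item(), self.step)
        self.step += 1


logged = []
layer = nn.Linear(4, 3)
handle = layer.register_forward_hook(
    StatsHook("layer 0", lambda tag, value, step: logged.append((tag, step)))
)
for _ in range(2):
    layer(torch.randn(5, 4))
handle.remove()

print(logged)
```

Note that the counter advances on every forward pass, so validation batches are counted along with training batches.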

In [7]:
import torch
import torch.nn as nn
import torch.utils.data
import torchvision
from functools import partial
from torch import optim
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)
trainset = datasets.MNIST("mnist_train", train=True, download=True, transform=transform)
train_data_loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

images, labels = next(iter(train_data_loader))

grid = torchvision.utils.make_grid(images)

writer.add_image("images", grid, 0)
writer.add_graph(model, images)

optimizer = optim.Adam(model.parameters(), lr=2e-2)
criterion = nn.CrossEntropyLoss()


def train(
    model, optimizer, loss_fn, train_loader, val_loader, epochs=20, device="cuda:0"
):
    model.to(device)
    for epoch in range(epochs):
        print(f"epoch {epoch+1}")
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            inputs, target = batch
            inputs = inputs.to(device)
            target = target.to(device)
            output = model(inputs)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()

        model.eval()
        num_correct = 0
        num_examples = 0
        with torch.no_grad():  # no gradients are needed for evaluation
            for batch in val_loader:
                inputs, target = batch
                inputs = inputs.to(device)
                target = target.to(device)
                output = model(inputs)
                correct = torch.eq(torch.max(output, dim=1)[1], target).view(-1)
                num_correct += torch.sum(correct).item()
                num_examples += correct.shape[0]
        print("Epoch {}, accuracy = {:.2f}".format(epoch+1, num_correct / num_examples))


# For brevity, the training loader doubles as the validation loader here,
# so the reported accuracy is measured on the training data.
train(model, optimizer, criterion, train_data_loader, train_data_loader, epochs=5)

epoch 1
Epoch 1, accuracy = 0.96
epoch 2
Epoch 2, accuracy = 0.98
epoch 3
Epoch 3, accuracy = 0.98
epoch 4
Epoch 4, accuracy = 0.98
epoch 5
Epoch 5, accuracy = 0.99


In the resulting TensorBoard plots, the mean of our activations is close to zero, but the standard deviation is also pretty close to zero. If this is happening in many layers of your network, it may be a sign that your activation functions (e.g., ReLU) are not quite suited to your problem domain. It might be worth experimenting with other functions to see if they improve the model’s performance; PyTorch’s LeakyReLU is a good alternative that offers similar activations to the standard ReLU but lets more information through, which might help in training.
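As a rough sketch of such an experiment, the helper below swaps every nn.ReLU module in a model for nn.LeakyReLU. swap_relu_for_leaky() is a hypothetical function written for this example, and the slope value is an arbitrary choice; it only affects ReLUs that exist as modules, not calls to F.relu inside a forward() method.

```python
import torch
import torch.nn as nn


def swap_relu_for_leaky(module, negative_slope=0.01):
    """Recursively replace every nn.ReLU in module with nn.LeakyReLU (in place)."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope))
        else:
            swap_relu_for_leaky(child, negative_slope)
    return module


x = torch.tensor([-2.0, 0.0, 3.0])
print(nn.ReLU()(x))          # negative inputs are zeroed
print(nn.LeakyReLU(0.1)(x))  # negative inputs are scaled by the slope instead

net = swap_relu_for_leaky(nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2)))
print(net)
```

Comparing the per-layer mean and standard deviation graphs before and after the swap is then a matter of training each variant as its own run.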