# Going deeper with our models

Now that we have a good overview of the steps involved in building a model, we will look at some **tips** and **tools** for training it.

## Epochs and loss

We have covered the different loss functions and the general idea behind them: we need to minimize the loss to get better results.
At the end of `Quickstart.ipynb` we introduced the concept of epochs. We defined an epoch as one pass through the entire training dataset, so the number of epochs is the number of such passes.
However, we simply picked a fixed number of epochs without explaining why we chose it or whether it was really the best one.

Finding the right number of epochs is experimental: as long as the validation loss keeps improving, we can keep training for more epochs.
If the validation loss starts increasing while the training loss keeps decreasing, we should stop training, because the model is **overfitting**.
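The stopping rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `best_epoch_with_patience`, the `patience` and `min_delta` parameters, and the loss values are hypothetical and do not come from the notebook. The idea is to tolerate a few noisy epochs before deciding that the validation loss has really stopped improving.

```python
def best_epoch_with_patience(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) as 0-based indices.

    Training is considered finished once the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Ran out of epochs without triggering the stopping rule.
    return best_epoch, len(val_losses) - 1


# Illustrative (made-up) validation losses: they improve, then drift upward.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
best, stopped = best_epoch_with_patience(val_losses, patience=3)
print(f"best epoch: {best + 1}, stopped after epoch: {stopped + 1}")
# prints "best epoch: 4, stopped after epoch: 7"
```

In a real training loop the same counter would typically live inside the epoch loop, and the model weights from the best epoch would be saved so they can be restored after stopping.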

## Tools for visualization

To watch the training and validation loss in real time, **TensorBoard** is often used.

In [1]:
# pip install tensorboard

In [6]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime

torch.set_num_threads(16)

# -----------------------------
# Define model
# -----------------------------
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


# -----------------------------
# Define pipeline
# -----------------------------
class ModelPipeline:
    def __init__(self):
        # Data
        self.training_data = datasets.FashionMNIST(
            root="data", train=True, download=True, transform=ToTensor()
        )
        self.test_data = datasets.FashionMNIST(
            root="data", train=False, download=True, transform=ToTensor()
        )
        self.batch_size = 64

        self.train_dataloader = DataLoader(self.training_data, batch_size=self.batch_size, shuffle=True)
        self.test_dataloader = DataLoader(self.test_data, batch_size=self.batch_size)

        # Device
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using {self.device} device")

        # Model, loss, optimizer
        self.model = NeuralNetwork().to(self.device)
        self.loss_fn = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=1e-3)

    def train(self):
        self.model.train()
        total_loss, correct = 0, 0

        for X, y in self.train_dataloader:
            X, y = X.to(self.device), y.to(self.device)

            pred = self.model(X)
            loss = self.loss_fn(pred, y)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        avg_loss = total_loss / len(self.train_dataloader)
        accuracy = correct / len(self.training_data)
        return avg_loss, accuracy

    def test(self):
        self.model.eval()
        total_loss, correct = 0, 0

        with torch.no_grad():
            for X, y in self.test_dataloader:
                X, y = X.to(self.device), y.to(self.device)
                pred = self.model(X)
                total_loss += self.loss_fn(pred, y).item()
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        avg_loss = total_loss / len(self.test_dataloader)
        accuracy = correct / len(self.test_data)
        return avg_loss, accuracy

    def run(self, epochs=20):
        run_name = f"fashionMNIST_SGD_lr1e-3_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
        writer = SummaryWriter(log_dir=f"runs/{run_name}")
        for epoch in range(epochs):
            print(f"Epoch {epoch+1}/{epochs}")
            train_loss, train_acc = self.train()
            val_loss, val_acc = self.test()

            writer.add_scalar("Loss/train", train_loss, epoch)
            writer.add_scalar("Loss/validation", val_loss, epoch)
            writer.add_scalar("Accuracy/train", train_acc, epoch)
            writer.add_scalar("Accuracy/validation", val_acc, epoch)

            print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.3f}")
            print(f"  Val   Loss: {val_loss:.4f} | Val   Acc: {val_acc:.3f}\n")

        writer.close()
        print("Training complete!")


In [7]:
%load_ext tensorboard
%tensorboard --logdir runs


The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 243876), started 0:07:27 ago. (Use '!kill 243876' to kill it.)

In [8]:
test = ModelPipeline()
test.run(epochs=80)


Using cpu device
Epoch 1/80
  Train Loss: 2.2365 | Train Acc: 0.372
  Val   Loss: 2.1636 | Val   Acc: 0.519

Epoch 2/80
  Train Loss: 2.0485 | Train Acc: 0.590
  Val   Loss: 1.9094 | Val   Acc: 0.618

Epoch 3/80
  Train Loss: 1.7246 | Train Acc: 0.633
  Val   Loss: 1.5477 | Val   Acc: 0.616

Epoch 4/80
  Train Loss: 1.3941 | Train Acc: 0.634
  Val   Loss: 1.2722 | Val   Acc: 0.638

Epoch 5/80
  Train Loss: 1.1715 | Train Acc: 0.650
  Val   Loss: 1.0991 | Val   Acc: 0.647

Epoch 6/80
  Train Loss: 1.0307 | Train Acc: 0.664
  Val   Loss: 0.9885 | Val   Acc: 0.661

Epoch 7/80
  Train Loss: 0.9385 | Train Acc: 0.677
  Val   Loss: 0.9137 | Val   Acc: 0.671

Epoch 8/80
  Train Loss: 0.8738 | Train Acc: 0.690
  Val   Loss: 0.8604 | Val   Acc: 0.684

Epoch 9/80
  Train Loss: 0.8263 | Train Acc: 0.701
  Val   Loss: 0.8203 | Val   Acc: 0.702

Epoch 10/80
  Train Loss: 0.7897 | Train Acc: 0.715
  Val   Loss: 0.7885 | Val   Acc: 0.705

Epoch 11/80
  Train Loss: 0.7598 | Train Acc: 0.725
  Val   Lo

### Explanations about the code

As we can see, the `ModelPipeline` class reuses the code from the Quickstart, but with more modularity and with TensorBoard integrated. Note that the FashionMNIST test split serves as the validation set here.

Here are the steps to use TensorBoard:

1. Install it with pip: `pip install tensorboard`
2. Import the `SummaryWriter` class: `from torch.utils.tensorboard import SummaryWriter`
3. Create the writer object. You can specify the folder in which the run logs will be saved: `writer = SummaryWriter(log_dir=f"runs/{run_name}")`
4. Log each metric with `writer.add_scalar(tag, scalar_value, global_step)`, for example `writer.add_scalar("Loss/train", train_loss, epoch)`
5. Close the writer at the end of training: `writer.close()`

In a Jupyter notebook, before running the model pipeline, we must run `%load_ext tensorboard` to load the TensorBoard extension and `%tensorboard --logdir runs` to tell it which folder contains the metrics to display.

## Comparing different loss functions and optimizers

Because the `--logdir` parameter points TensorBoard at a folder, that folder can hold the logs of several runs with different model configurations, and TensorBoard displays them together. For instance, the loss function or the optimizer can vary between runs.

To do so, we need to make our pipeline class more modular.

In [15]:
from torch import optim
from torchvision import transforms
from datetime import datetime

class NewModelPipeline:
    def __init__(
        self,
        model: nn.Module,
        dataset_fn,
        optimizer: torch.optim.Optimizer,
        loss_fn: nn.Module,
        batch_size: int = 64,
        transform=None,
        log_root: str = "runs",
    ):
        # Dataset configuration
        transform = transform or transforms.ToTensor()
        self.train_data = dataset_fn(root="data", train=True, download=True, transform=transform)
        self.test_data = dataset_fn(root="data", train=False, download=True, transform=transform)
        self.train_loader = DataLoader(self.train_data, batch_size=batch_size, shuffle=True)
        self.test_loader = DataLoader(self.test_data, batch_size=batch_size)

        # Device setup
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using {self.device} device")

        # Model, optimizer, and loss
        self.model = model.to(self.device)
        self.optimizer = optimizer
        self.loss_fn = loss_fn

        # Auto naming for TensorBoard log
        dataset_name = dataset_fn.__name__
        optimizer_name = self.optimizer.__class__.__name__
        loss_name = self.loss_fn.__class__.__name__

        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        log_name = f"{dataset_name}_{optimizer_name}_{loss_name}_{timestamp}"
        self.writer = SummaryWriter(log_dir=f"{log_root}/{log_name}")

    # Training step
    def train_one_epoch(self):
        self.model.train()
        total_loss, correct = 0, 0
        for X, y in self.train_loader:
            X, y = X.to(self.device), y.to(self.device)

            pred = self.model(X)
            loss = self.loss_fn(pred, y)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        avg_loss = total_loss / len(self.train_loader)
        accuracy = correct / len(self.train_loader.dataset)
        return avg_loss, accuracy

    # Evaluation step
    def evaluate(self):
        self.model.eval()
        total_loss, correct = 0, 0
        with torch.no_grad():
            for X, y in self.test_loader:
                X, y = X.to(self.device), y.to(self.device)
                pred = self.model(X)
                total_loss += self.loss_fn(pred, y).item()
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        avg_loss = total_loss / len(self.test_loader)
        accuracy = correct / len(self.test_loader.dataset)
        return avg_loss, accuracy

    # Full training loop
    def fit(self, epochs=10, interactive=False):
        for epoch in range(epochs):
            print(f"\nEpoch {epoch + 1}/{epochs}")
            train_loss, train_acc = self.train_one_epoch()
            val_loss, val_acc = self.evaluate()

            print(f"Train | loss: {train_loss:.4f}, acc: {train_acc:.3f}")
            print(f"Val   | loss: {val_loss:.4f}, acc: {val_acc:.3f}")

            self.writer.add_scalar("Loss/train", train_loss, epoch)
            self.writer.add_scalar("Loss/validation", val_loss, epoch)
            self.writer.add_scalar("Accuracy/train", train_acc, epoch)
            self.writer.add_scalar("Accuracy/validation", val_acc, epoch)

            if interactive:
                cont = input("Continue to next epoch? (y/n): ")
                if cont.lower() != "y":
                    print("Training stopped by user.")
                    break

        self.writer.close()


This class now lets us choose the **dataset**, **optimizer** and **loss function**, and the name of the log folder is built accordingly.
We can also pass in the **model** instance to use and the **batch size**.
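One limitation of the automatic naming is that two runs with the same dataset, optimizer and loss but different hyperparameters can only be told apart by their timestamp. The sketch below shows one possible way to extend the naming scheme. The `build_run_name` helper and its `hparams` and `now` parameters are hypothetical and not part of `NewModelPipeline`. In the pipeline, the learning rate could be read from `optimizer.param_groups[0]["lr"]`.

```python
from datetime import datetime


def build_run_name(dataset_name, optimizer_name, loss_name, hparams=None, now=None):
    """Build a descriptive run name from the run's components.

    `hparams` is an optional dict of hyperparameters to embed in the name;
    `now` allows the timestamp to be fixed (useful for reproducible examples).
    """
    now = now or datetime.now()
    parts = [dataset_name, optimizer_name, loss_name]
    # Hypothetical extension: append hyperparameters in a stable (sorted) order.
    for key, value in sorted((hparams or {}).items()):
        parts.append(f"{key}{value}")
    parts.append(now.strftime("%Y-%m-%d_%H-%M-%S"))
    return "_".join(parts)


name = build_run_name(
    "FashionMNIST",
    "Adam",
    "CrossEntropyLoss",
    hparams={"lr": 1e-3, "bs": 64},
    now=datetime(2024, 1, 1, 12, 0, 0),  # fixed timestamp for the example
)
print(name)
# prints "FashionMNIST_Adam_CrossEntropyLoss_bs64_lr0.001_2024-01-01_12-00-00"
```

With names like this, runs that differ only in learning rate or batch size appear as separate, readable entries in TensorBoard's run list.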

I also added an optional **interactive mode**, which asks after each epoch whether to continue, and gave some parameters default values.
Below is an instance of our pipeline with the **Adam** optimizer.

In [17]:
# Instantiate components outside (for full flexibility)
model = NeuralNetwork()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3) 
# Create pipeline
pipeline = NewModelPipeline(
    model=model,
    dataset_fn=datasets.FashionMNIST,
    optimizer=optimizer,
    loss_fn=loss_fn,
    batch_size=64,
)

# Initialize the tensorboard 

%load_ext tensorboard
%tensorboard --logdir runs

# Run training (interactive=False for normal training)
pipeline.fit(epochs=50)

Using cpu device
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 243876), started 16:26:24 ago. (Use '!kill 243876' to kill it.)


Epoch 1/50
Train | loss: 0.4852, acc: 0.824
Val   | loss: 0.4082, acc: 0.851

Epoch 2/50
Train | loss: 0.3568, acc: 0.869
Val   | loss: 0.3702, acc: 0.862

Epoch 3/50
Train | loss: 0.3202, acc: 0.881
Val   | loss: 0.3515, acc: 0.869

Epoch 4/50
Train | loss: 0.2988, acc: 0.888
Val   | loss: 0.3748, acc: 0.866

Epoch 5/50
Train | loss: 0.2769, acc: 0.896
Val   | loss: 0.3488, acc: 0.876

Epoch 6/50
Train | loss: 0.2643, acc: 0.901
Val   | loss: 0.3342, acc: 0.884

Epoch 7/50
Train | loss: 0.2488, acc: 0.906
Val   | loss: 0.3271, acc: 0.878

Epoch 8/50
Train | loss: 0.2372, acc: 0.910
Val   | loss: 0.3347, acc: 0.881

Epoch 9/50
Train | loss: 0.2264, acc: 0.913
Val   | loss: 0.3536, acc: 0.882

Epoch 10/50
Train | loss: 0.2153, acc: 0.918
Val   | loss: 0.3160, acc: 0.890

Epoch 11/50
Train | loss: 0.2045, acc: 0.922
Val   | loss: 0.3320, acc: 0.891

Epoch 12/50
Train | loss: 0.1968, acc: 0.924
Val   | loss: 0.3531, acc: 0.882

Epoch 13/50
Train | loss: 0.1884, acc: 0.928
Val   | loss: 0

We can now easily compare the two optimizers in *TensorBoard*!

With **SGD** and a learning rate of 0.001, performance seems to cap at around **84%** accuracy, with no sign of overfitting even after 80 epochs. With **Adam** and the same learning rate, the model reaches more than **90%** accuracy but starts to overfit after only about 10 epochs.
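The overfitting observation for Adam can also be checked numerically instead of by eye. The sketch below copies the train and validation losses of the 12 complete epochs printed in the Adam log above (the rest of that output is truncated) and looks at where the validation loss is lowest and how the gap between the two curves evolves. The variable names are chosen for this illustration only.

```python
# Losses copied from the Adam run output above (epochs 1 to 12).
train_loss = [0.4852, 0.3568, 0.3202, 0.2988, 0.2769, 0.2643,
              0.2488, 0.2372, 0.2264, 0.2153, 0.2045, 0.1968]
val_loss = [0.4082, 0.3702, 0.3515, 0.3748, 0.3488, 0.3342,
            0.3271, 0.3347, 0.3536, 0.3160, 0.3320, 0.3531]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss; a steadily widening gap
# while the training loss keeps falling is the usual sign of overfitting.
gaps = [v - t for t, v in zip(train_loss, val_loss)]

print(f"lowest validation loss: {min(val_loss):.4f} at epoch {best_epoch}")
# prints "lowest validation loss: 0.3160 at epoch 10"
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch:2d}: val - train = {gap:+.4f}")
```

On these values the training loss decreases at every epoch, while the validation loss reaches its minimum at epoch 10 and is higher in the two epochs that follow, which is consistent with the reading from TensorBoard.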

