# Metrics

> PyTorch, out of the box, __does not provide any metrics for us to use__

The statement above means that either:
- We have to create our own metrics from scratch
- We can use a third-party module to do that for us

The first option is __error-prone and time-consuming__, and requires close attention to detail (as the metrics could be used in various settings).

Let's see how one could design a metric-like API:

In [None]:
import abc
import torch

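class Metric(torch.nn.Module, abc.ABC):
    def __init__(self):
        # Required so that torch.nn.Module internals are set up
        super().__init__()
        self.cache = 0
        self.i = 0

    @abc.abstractmethod
    def forward(self, logits, labels):
        """Return the metric value for a single batch."""

    def __call__(self, logits, labels):
        # Accumulate the per-batch value; detach so no graph is kept alive
        self.i += 1
        self.cache += self.forward(logits.detach(), labels)

    def evaluate(self):
        # Average over the batches seen so far, then reset the state
        result = self.cache / self.i
        self.cache = 0
        self.i = 0
        return result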


class CrossEntropyLoss(Metric):
    def forward(self, logits, labels):
        return torch.nn.functional.cross_entropy(logits, labels, reduction="mean")

class Accuracy(Metric):
    def forward(self, logits, labels):
        return torch.mean((torch.argmax(logits, dim=-1) == labels).float())

As one can see, there are a few quirks, namely:
- __accumulating values__ - the outputs of the neural network have to be gathered across batches
- __creating a generic interface__ - so users can easily extend it with their own metrics
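To illustrate why accumulation needs care, here is a minimal sketch (the `RunningAccuracy` helper and the hand-picked batches are assumptions made up for this example, not part of any library). It compares averaging per-batch accuracies with accumulating raw counts when batches differ in size:

```python
import torch


class RunningAccuracy:
    """Hypothetical helper: accumulates counts instead of per-batch means."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, logits, labels):
        predictions = torch.argmax(logits, dim=-1)
        self.correct += (predictions == labels).sum().item()
        self.total += labels.numel()

    def compute(self):
        return self.correct / self.total


# Two batches of different sizes with hand-picked logits
batches = [
    # 4 examples, all classified correctly
    (torch.tensor([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 0.0]]), torch.tensor([0, 1, 0, 0])),
    # 1 example, classified incorrectly
    (torch.tensor([[2.0, 0.0]]), torch.tensor([1])),
]

metric = RunningAccuracy()
batch_means = []
for logits, labels in batches:
    metric.update(logits, labels)
    batch_means.append((torch.argmax(logits, dim=-1) == labels).float().mean().item())

mean_of_means = sum(batch_means) / len(batch_means)
print(mean_of_means)     # 0.5 (each batch weighted equally)
print(metric.compute())  # 0.8 (each example weighted equally)
```

Averaging per-batch values is only exact when all batches have the same size, which is one of the details a maintained library takes care of.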

Fortunately, tested, improved and maintained implementations are provided in the [__`torchmetrics`__](https://torchmetrics.readthedocs.io/en/latest/) package:

In [None]:
!pip install torchmetrics

Usage consists of the following steps:
1. Obtain outputs from neural network
2. Obtain targets (classification, regression or any other task)
3. Pass both through metric of choice
4. __Repeat above steps for the whole dataset__
5. __Obtain accumulated results__

Let's see how one could do that via `torchmetrics`:

In [None]:
import torch
import torchmetrics

# initialize metric (recent torchmetrics versions require the `task` argument)
metric = torchmetrics.Accuracy(task="multiclass", num_classes=5)

n_batches = 10
for i in range(n_batches):
    # simulate a classification problem
    preds = torch.randn(10, 5).softmax(dim=-1)
    target = torch.randint(5, (10,))
    # metric on current batch
    acc = metric(preds, target)
    print(f"Accuracy on batch {i}: {acc}")

# metric on all batches using custom accumulation
acc = metric.compute()
print(f"Accuracy on all data: {acc}")

Things to note:
- __All `torchmetrics` metrics are instances of `torch.nn.Module`__, so they have to be assigned as attributes when used within another `torch.nn.Module`
- `__call__` is used to calculate the __per-batch metric__ (it also updates the accumulated state)
- `compute` is used to calculate the __gathered metric__ across multiple batches

# Tensorboard

> __Tensorboard is a standalone GUI tool which allows us to visualize metrics (and other data)__

Originally created for TensorFlow, it has since seen widespread adoption and has an integration for PyTorch (the [`torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html) module).

First, one has to install `tensorboard` separately in order to use this module:

In [None]:
!pip install tensorboard

After that there are two main steps:
1. __Create a `torch.utils.tensorboard.SummaryWriter` instance__
2. __Use its specific methods to write data__

Most of the methods have the following signature:

```python
writer.add_{what}("name", data, step)
```

where
- `name` - the label under which the data should be logged. __Nested labels are allowed__, for example:
    - `loss/training` - loss but for training phase
    - `loss/validation` - loss but for validation phase
- `data` - usually `torch.Tensor` instances (`np.ndarray`s are also allowed)
- `step` - __global step under which this data will be saved__. __Should be incremented on a per-batch basis (or per-epoch, depending on what you want to do)__.
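As a rough sketch of per-batch step counting (the `logged` list is a stand-in assumed here so the snippet runs without `tensorboard` installed; with a real writer one would call `writer.add_scalar` at the marked line), the global step keeps increasing across epochs instead of restarting at zero:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 examples with batch_size=4 give 3 batches per epoch
dataset = TensorDataset(torch.randn(10, 3), torch.randn(10))
dataloader = DataLoader(dataset, batch_size=4)

logged = []  # stand-in for a SummaryWriter: (name, data, step) tuples
global_step = 0
for epoch in range(2):
    for inputs, targets in dataloader:
        value = targets.mean().item()  # placeholder for a real loss value
        # With a real writer: writer.add_scalar("loss/training", value, global_step)
        logged.append(("loss/training", value, global_step))
        global_step += 1

print([step for _, _, step in logged])  # [0, 1, 2, 3, 4, 5]
```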

Let's see an example:

In [None]:
from torch.utils.tensorboard import SummaryWriter
import numpy as np

# By default data will be saved in "./runs" folder
writer = SummaryWriter()

for n_iter in range(100):
    writer.add_scalar('Loss/train', np.random.random(), n_iter)
    writer.add_scalar('Loss/test', np.random.random(), n_iter)
    writer.add_scalar('Accuracy/train', np.random.random(), n_iter)
    writer.add_scalar('Accuracy/test', np.random.random(), n_iter)

After our data has been gathered one can visualize the results.

First, within Google Colab one can do:

In [None]:
# Google colab
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir runs

Locally, from the command line:

In [None]:
!tensorboard --logdir runs

# Saving data

> __PyTorch allows us to easily save data via `pickle` based interface__

Usually one would like to save:
- `torch.nn.Module` instances after training (for later re-use)
- `torch.optim.Optimizer` instances in order to restart the training

> __`.pt` is the preferred extension to save your data via PyTorch__

## torch.save & torch.load

> __Easiest (yet not the best) method to save our data__

A simple `torch.save(data, path)` call can be used to save:
- `torch.Tensor` instances (our metrics)

Let's see how one could do that:

In [1]:
import torch

data = torch.randn(20)
torch.save(data, "tensor.pt")

loaded = torch.load("tensor.pt")

data == loaded

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True])

## state_dict

> __In PyTorch, `state_dict` should be used to save models, optimizers and other "more stateful" objects (usually those inheriting from `torch.nn.Module`)__

Why?
- `torch.save` pickles the data, __but__ class definitions are not stored; they are imported from code when loading (i.e. you need access to the source code of your model before loading it)
- If we simply use `torch.save(model, "model.pt")`, __the saved parameters will be bound to a specific version of the code__
- If we were to change the model architecture or move the class (e.g. while refactoring the code), __`torch.load` could crash!__

> __`state_dict` is a Python dictionary containing ONLY PARAMETERS & BUFFERS (e.g. weights) which one can load into models WHICH HAVE THE SAME LAYER LAYOUT__
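To make this concrete, the sketch below (a small `torch.nn.Sequential` chosen only for illustration) prints what a `state_dict` contains and loads it into a second model with the same layer layout:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(20, 20), torch.nn.ReLU(), torch.nn.Linear(20, 1)
)

# Keys are parameter names, values are plain tensors (ReLU has no parameters)
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))
# 0.weight (20, 20)
# 0.bias (20,)
# 2.weight (1, 20)
# 2.bias (1,)

# A freshly created model with the same layout accepts the state_dict
clone = torch.nn.Sequential(
    torch.nn.Linear(20, 20), torch.nn.ReLU(), torch.nn.Linear(20, 1)
)
clone.load_state_dict(model.state_dict())
print(torch.equal(clone[0].weight, model[0].weight))  # True
```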

Let's see an example of how one could do that (including restoration):

In [None]:
class ExampleModel(torch.nn.Module):
    def __init__(self):
        # Must be called before assigning submodules
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(20, 20), torch.nn.ReLU(), torch.nn.Linear(20, 1)
        )


torch.save(
    ExampleModel().state_dict(), "model.pt"
)

model = ExampleModel()
model.load_state_dict(torch.load("model.pt"))

## Saving multiple objects

> __`torch.save` can be used to save multiple data points creating a generic CHECKPOINT__

In order to create a checkpoint we simply __save a dictionary containing our objects__.

Let's see how this looks in practice:

In [None]:
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    },
    PATH,
)

In [None]:
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
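The two skeleton cells above can be combined into a small round trip. This is a minimal sketch under assumptions: the tiny `torch.nn.Linear` model, the SGD settings, the epoch number and the temporary file path are all made up for illustration.

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One optimization step so that the optimizer has some state worth saving
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss.item(),
    },
    path,
)

# Later: recreate the objects with the same configuration, then restore state
restored_model = torch.nn.Linear(4, 1)
restored_optimizer = torch.optim.SGD(
    restored_model.parameters(), lr=0.1, momentum=0.9
)

checkpoint = torch.load(path)
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # resume from the next epoch

print(torch.equal(restored_model.weight, model.weight))  # True
```

Saving the loss as a plain Python number (`loss.item()`) keeps the checkpoint free of tensors that still reference a computational graph.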

# Training and Evaluation

We now have almost all of the necessary pieces to work with neural networks:
- __Basics of PyTorch__
- __Basic neural network for tabular data__ - `torch.nn.Module`, `torch.nn.Linear` and `torch.nn.Sequential`
- __Datasets & DataLoaders__ - how to create batches of examples for our neural network
- __Optimization procedure__ - how to optimize our neural network
- __Measuring its performance__ - using `torchmetrics`
- __Saving metric values__ - using `tensorboard`
- __Saving and loading checkpoints__ - using `torch.save` and `torch.load`

The last thing we're missing is how to actually `train` on our dataset and `evaluate` on validation and/or test data.

## model.train() and model.eval()

> `model.train()` changes mode of the model to training

What does that mean? 
- Some layers behave differently depending on whether we train them or use them for evaluation (we will see `Dropout` and `BatchNorm` in the next chapter)

Analogously `model.eval()` turns on evaluation mode.

> __Remember to always change model's mode before specific phase!__

In [3]:
import torch

data = torch.randn(3, 5)
model = torch.nn.Sequential(torch.nn.Linear(5, 20), torch.nn.Dropout(p=0.5))

model.train()  # Default mode
training = model(data)
model.eval()
evaluation = model(data)

print(training)
evaluation

tensor([[-2.4591, -0.0000,  2.3225,  1.5816,  0.0000, -2.1619, -0.8403,  0.0000,
          3.1002, -0.2020, -0.0000, -0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         -0.0000,  1.6939, -0.3198,  2.4890],
        [-0.0000,  0.0000,  0.0000, -0.0000, -0.9355, -0.0000,  0.0000,  2.0543,
          0.0000, -1.2924, -1.5273,  1.3543,  0.0000, -0.4998,  2.8455, -0.5934,
          0.0000, -0.0000, -0.0000, -0.0000],
        [-0.5197, -0.8565, -0.0144,  0.0000, -0.0000, -0.0000,  0.0000,  0.0000,
          0.0000,  0.1857, -0.5098,  0.0000, -0.0000,  1.6102,  0.0000, -0.0000,
         -0.0000, -0.9521,  0.1253, -0.4957]], grad_fn=<MulBackward0>)


tensor([[-1.2295, -0.3679,  1.1612,  0.7908,  0.0706, -1.0809, -0.4202,  1.1120,
          1.5501, -0.1010, -1.0078, -0.0088,  0.5752,  0.5799,  1.0626,  0.9444,
         -0.2436,  0.8470, -0.1599,  1.2445],
        [-1.0470,  0.0079,  0.5864, -0.2979, -0.4678, -0.2323,  1.0144,  1.0272,
          0.6973, -0.6462, -0.7636,  0.6771,  0.0994, -0.2499,  1.4227, -0.2967,
          0.1265, -0.1090, -0.8705, -0.0624],
        [-0.2599, -0.4282, -0.0072,  0.3843, -0.0788, -0.5478,  0.6889,  0.5713,
          0.6715,  0.0929, -0.2549,  0.3580, -0.7167,  0.8051,  0.1825, -0.8064,
         -0.5278, -0.4760,  0.0626, -0.2479]], grad_fn=<AddmmBackward>)
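The current mode can also be inspected through the `training` attribute, which `model.train()` and `model.eval()` set on the model and all of its submodules. A short sketch, reusing the same kind of model as above:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(5, 20), torch.nn.Dropout(p=0.5))
data = torch.randn(3, 5)

print(model.training, model[1].training)  # True True

model.eval()
print(model.training, model[1].training)  # False False

# In evaluation mode Dropout is the identity, so repeated calls agree
first, second = model(data), model(data)
print(torch.equal(first, second))  # True

model.train()  # switch back before continuing training
```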

## torch.no_grad

> `torch.no_grad()` is a context manager (and decorator) __used to turn off gradient tape recording__

While this does not change the computed values, it has additional benefits:
- Performance and memory improvements, as the operations are not recorded (__we won't backpropagate through them__)
- No computational graph is built, so there is nothing to accidentally backpropagate through a second time (which would raise an error)

> __Due to the above reasons, always use it when evaluating model performance!__

In [5]:
x = torch.tensor([1.], requires_grad=True)

with torch.no_grad():
    y = x * 2
    
y.requires_grad

False

And as a decorator:

In [6]:
@torch.no_grad()
def doubler(x):
    return x * 2

z = doubler(x)
z.requires_grad

False

## Basic training loop

Let's assume we have our data already in place (we will use `torch.randn` as input; __usually we would use `torch.utils.data.DataLoader` for that!__).

The basic idea is as follows:
- Set up the necessary objects (metrics, summary writer, criteria (loss functions), model, optimizer)
- For training:
    - Turn on `model.train()` before you start training (default mode, but good practice to be explicit about it)
    - Loop over `torch.utils.data.DataLoader` getting samples and targets
    - Pass them through neural network
    - Calculate `loss` based on `criterion`
    - `loss.backward()` to obtain gradient of loss w.r.t. model's parameters (weights)
    - `optimizer.step()` to apply gradient modifying weights based on optimizer's formula
    - __`optimizer.zero_grad()`__ - zeroes out gradient contained in parameters (otherwise it would be accumulated during next pass and would become too large and destroy network's parameters during update)
- For evaluation (__validation & test work the same but on different datasets!__) everything works the same __except__:
    - We turn on `model.eval()` at the beginning of evaluation
    - We use `torch.no_grad()` context manager __on whole `DataLoader` pass__
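The need for `optimizer.zero_grad()` in the list above can be seen on a single tensor. In this small sketch the gradient is reset by hand (setting `.grad` to `None`), which is roughly what `optimizer.zero_grad()` does for every parameter it manages:

```python
import torch

weight = torch.tensor([1.0], requires_grad=True)

(weight * 2).sum().backward()
print(weight.grad)  # tensor([2.])

# A second backward pass ADDS to the stored gradient
(weight * 2).sum().backward()
print(weight.grad)  # tensor([4.])

# After resetting, the gradient reflects only the latest backward pass
weight.grad = None
(weight * 2).sum().backward()
print(weight.grad)  # tensor([2.])
```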

Below is a "standard" (skeletonized) training loop for regression tasks.

In [None]:
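import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Example model (assumed here so that the snippet runs on its own)
model = torch.nn.Sequential(
    torch.nn.Linear(15, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
model = model.to(device)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Dummy data
# Usually we would iterate over a torch.utils.data.DataLoader instead
# Targets have shape (64, 1) to match the model's output shape
X, y = torch.randn(64, 15), torch.randn(64, 1)
X, y = X.to(device), y.to(device)

model.train()
# Iterate for 20 epochs over WHOLE DATASET
for epoch in range(20):
    outputs = model(X)
    loss = criterion(outputs, y)

    # Perform backpropagation
    loss.backward()

    # Perform optimization step & zero-out gradient
    optimizer.step()
    optimizer.zero_grad()

    print(f"EPOCH: {epoch} | LOSS: {loss.item()}")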

> __Please notice DATA has to be explicitly moved to the device of choice!__

> __Please notice MODEL has to be explicitly moved to the device of choice!__

> __`torch.utils.data.DataLoader` DOES NOT automatically move tensors to the device!__
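The evaluation counterpart described earlier (`model.eval()` plus `torch.no_grad()` around the whole `DataLoader` pass) could look roughly like the sketch below. The model, the random "validation" data and the batch size are assumptions made for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Sequential(
    torch.nn.Linear(15, 32), torch.nn.Dropout(p=0.5), torch.nn.Linear(32, 1)
).to(device)
# Sum per batch so we can divide by the number of examples at the end
criterion = torch.nn.MSELoss(reduction="sum")

# Stand-in validation data
dataset = TensorDataset(torch.randn(64, 15), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=16)

model.eval()  # Dropout is disabled in this mode
total_loss, n_samples = 0.0, 0
with torch.no_grad():  # no graph is recorded for the whole pass
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        outputs = model(X)
        total_loss += criterion(outputs, y).item()
        n_samples += y.shape[0]

validation_loss = total_loss / n_samples
print(f"VALIDATION LOSS: {validation_loss}")
```

Accumulating the summed loss and dividing by the number of examples keeps the result correct even if the last batch is smaller than the others.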

# Exercise

> __Create full training and evaluation system for PyTorch!__

## Data

- Use [`sklearn.datasets.load_digits`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits) and wrap it with appropriate `torch.utils.data.Dataset`
- Split dataset using `torch.utils.data.random_split`
- After that create three separate `torch.utils.data.DataLoader` instances

## Model

Create a neural network containing a few layers with appropriate shapes for this classification task.

Add the following methods:
- `forward` - returns logits
- `predict_proba` - returns probabilities
- `predict` - returns predicted class per-example

## Setup

Create necessary variables, namely:
- `torch.device` (gpu if available)
- `criterion` - appropriate loss function for our task
- `optimizer` - method to update our neural network

## Training & evaluation

### Abstract Base Class

Create an abstract base class called `System` (inherit from `abc.ABC` and mark methods to implement for users via `abc.abstractmethod` decorator) which:
- Defines `__init__` method which takes:
    - `model`
    - `optimizer`
    - `device`
    - `criterion`
    - `writer` - `tensorboard` SummaryWriter
    - `metrics` - a dictionary with three keys (`train`, `validation`, `test`), each holding a list of metrics which we will use for the corresponding pipeline step
- Defines `train` method which gets a single argument (`dataloader`) and:
    - sets up `model.train()`
    - iterates over `dataloader`
    - passes `batch` from dataloader to `train_step` function
    - gets output from `train_step` which is our `loss` value
    - performs `backward()` on loss
    - uses `optimizer` to perform updates
- Defines `validate` method which gets a single argument (`dataloader`) and:
    - sets up `model.eval()` and is wrapped with `torch.no_grad`
    - iterates over `dataloader`
    - passes `batch` from dataloader to `validation_step` function
- Defines `test` method which is the same as above but uses `test_step` (how to remove unnecessary code duplication?)
- Specifies `train_step`, `validation_step` and `test_step` as abstract methods (inheriting class needs to overwrite them)

### Concrete implementation

Inherit from `System` (name this class `ClassificationSystem`) and implement `train_step`, `validation_step` and `test_step`:
- Each of them has to use appropriate `self.metrics` key and calculate their metrics
- Log metrics to the summary writer based on its key (e.g. `train`) and the name of the metric (e.g. `accuracy`). This would become `train/accuracy`. __Tip:__ you can get the name of the class (metric) via `metric.__class__.__name__`


## Running the whole system

Instantiate `ClassificationSystem` with necessary arguments and run `train`, `validate` and `test` methods.

Verify your scores using tensorboard's GUI.
 
    
> __Additional:__ How to add `scheduler`s to our implementation?

In [None]:
# Your code here, have fun :)

# Challenges

## Assessment

- What are the other options to turn off gradient computations? See [PyTorch documentation](https://pytorch.org/docs/stable/notes/autograd.html#locally-disable-grad-doc). Why would you use one over the other?
- What is gradient accumulation and how could one program it using PyTorch?

## Non-assessment

- Check out [PyTorch Ignite](https://pytorch.org/ignite/index.html), a third-party framework used to remove some boilerplate code from your training loop