# Introduction to PyTorch
In this tutorial we will take a first look at the PyTorch library as well as some extensions. The official documents are much more detailed, so it is always a good idea to go through them:
* [Official tutorials](https://pytorch.org/tutorials/)
* [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
* [Examples](https://github.com/pytorch/examples)
* [PyTorch Lightning Documentation](https://pytorch-lightning.readthedocs.io/en/latest/)

Make sure you have a recent version of PyTorch installed. Installation instructions can be found [here](https://pytorch.org/get-started/locally/). Additionally, you need the following packages:
* `torchvision`
* `tqdm`
* `scikit-learn`
* `pytorch-lightning`
* `tensorboard`

If you use [Google Colab](https://colab.research.google.com/) (recommended), you should already have all packages except `pytorch-lightning`. You can install missing packages like this:

In [None]:
!pip install pytorch-lightning

If you use Jupyter Lab, make sure to activate extensions as described [here](https://github.com/tqdm/tqdm/issues/394#issuecomment-384743637), otherwise you won't be able to see progress bars.

This tutorial is made with PyTorch version 1.4.0. Below you can check the versions of your installed packages:

In [None]:
import torch
import torchvision
import tqdm
import sklearn
import pytorch_lightning
import tensorboard

for m in [torch, torchvision, tqdm, sklearn, pytorch_lightning, tensorboard]:
    print('{}: v{}'.format(m.__name__, m.__version__))

## PyTorch Basics
The first part of the tutorial covers the basics of PyTorch without the Lightning extension.

### Example: Digit Recognition
We will start by creating a feed-forward network for hand-written digit recognition using the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database). It contains grayscale images, where each image shows one hand-written digit. The task is to classify each image as one of ten classes, one for each digit. This dataset is included in `torchvision`. In order to use this dataset with PyTorch, the images have to be converted to Tensors, which we can achieve easily by using `torchvision.transforms.ToTensor`. Note that `ToTensor` also rescales the pixel values from $[0, 255]$ to $[0, 1]$. In many cases it is useful to additionally standardize the data (subtract the mean and divide by the standard deviation). We skip this step for simplicity here.

In [None]:
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

mnist_train = MNIST(root='data', download=True, train=True, transform=ToTensor())
mnist_test = MNIST(root='data', download=True, train=False, transform=ToTensor())

The data is already split into training and test sets. We can inspect the lengths of the datasets to find out the number of images. In this case we have $60000$ training examples and $10000$ testing examples. The images are $28 \times 28$ pixels with a single channel, so each image tensor has the shape $(1, 28, 28)$; the labels are integers in $[0, 9]$.

In [None]:
print('{} training images'.format(len(mnist_train)))
print('{} test images'.format(len(mnist_test)))

image, label = mnist_train[0]
print('image shape: {}'.format(image.shape))
print('label: {}'.format(label))

#### Model Definition
There are multiple ways of defining models in PyTorch. For simple problems, it is possible to define a model as a list of sequential layers. We start with a simple feed-forward network for this task. Using `torch.nn.Sequential`, it is as simple as combining a list of layers:

In [None]:
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=784, out_features=512),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(in_features=512, out_features=10))

Alternatively, it is possible to define more complex models by writing a custom module class. Each custom module should subclass `torch.nn.Module` and implement the `forward` method. The following model definition is equivalent to the first one:

In [None]:
class MNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_1 = nn.Linear(in_features=784, out_features=512)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.linear_2 = nn.Linear(in_features=512, out_features=10)

    def forward(self, inputs):
        x = self.flatten(inputs)
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear_2(x)
        return x

model = MNISTModel()

In detail, our new model has the following layers:
1. The first layer flattens the input. Right now our images have a shape of $(1, 28, 28)$ (one channel, $28 \times 28$ pixels). Since our model consists of regular fully-connected layers, we need to convert them to vectors of length $784$. This can be done using a `torch.nn.Flatten` layer, which is our model's input layer.
2. The second layer is a dense (fully connected) hidden layer with $512$ neurons. The number of input features is the number of pixels per image. We use ReLU activation.
3. The third layer adds some dropout regularization.
4. The fourth layer is another dense layer with $10$ neurons, where each neuron corresponds to one class.
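
As a quick sanity check of the layer sizes listed above, the following sketch rebuilds the same stack under a separate name and counts its trainable parameters. The helper `count_parameters` and the variable names are our own and not part of PyTorch; the input is a batch of made-up all-zero images.

```python
import torch
import torch.nn as nn

# Same layer stack as described above, under a separate name
sketch_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=784, out_features=512),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(in_features=512, out_features=10))

def count_parameters(module):
    """Hypothetical helper: number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Each linear layer has (inputs * outputs) weights plus one bias per output
expected = (784 * 512 + 512) + (512 * 10 + 10)
n_params = count_parameters(sketch_model)
print(n_params, expected)

# A batch of 4 fake single-channel images is mapped to 4 score vectors
fake_batch = torch.zeros(4, 1, 28, 28)
out_shape = tuple(sketch_model(fake_batch).shape)
print(out_shape)
```

Flatten, ReLU and Dropout have no parameters of their own, so only the two linear layers contribute to the count.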

Now that our model is complete, we need a loss function and an optimizer to train it. We choose cross entropy loss. Note that `torch.nn.CrossEntropyLoss` does more than just computing the loss:
1. It applies _LogSoftmax_ to the raw outputs (logits) of our model, which turns them into (log-)probabilities. Since we have $10$ classes, we need $10$ output neurons, where each neuron outputs the score of its class. Recall the definition of the Softmax function for $m$ classes:
\begin{equation}
	\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{m} e^{y_j}}
\end{equation}
The Softmax function produces a probability distribution over the classes, from which we can select the class with the highest probability. Because the loss function already takes care of this step, the model itself outputs raw scores and does not apply Softmax on its own.
2. It handles our labels correctly, even though they are simply class indices ($0$ to $9$) and not probability distributions.
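
To make the two points above concrete, the following sketch uses made-up scores for three classes (instead of ten) and checks that `nn.CrossEntropyLoss` gives the same value as applying log-softmax by hand and picking the entry of the true class. All numbers and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores (logits) for a batch of 2 examples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
# Labels are plain class indices, not probability distributions
labels = torch.tensor([0, 2])

# Built-in loss: expects raw scores and applies log-softmax internally
builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual version: log-softmax, then the negative log-probability
# of the true class, averaged over the batch
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

# Softmax turns each row of scores into a probability distribution
probs = F.softmax(logits, dim=-1)
print(builtin.item(), manual.item())
print(probs.sum(dim=-1))
```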

As our optimizer we choose `torch.optim.Adam`. It is possible to optimize all model parameters or just a subset of them.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
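
The remark about optimizing only a subset of the parameters can be sketched as follows. Here we (arbitrarily) freeze the first linear layer of a stand-in model and hand only the last layer's parameters to the optimizer; the model, the random data and all variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

subset_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 10))

# Freeze the first linear layer so that no gradients are computed for it
for p in subset_model[1].parameters():
    p.requires_grad = False

# Only the parameters passed here are updated by the optimizer
subset_optimizer = torch.optim.Adam(subset_model[3].parameters())

frozen_before = subset_model[1].weight.detach().clone()
trained_before = subset_model[3].weight.detach().clone()

# One training step on random data
outputs = subset_model(torch.randn(8, 1, 28, 28))
step_loss = nn.CrossEntropyLoss()(outputs, torch.randint(0, 10, (8,)))
subset_optimizer.zero_grad()
step_loss.backward()
subset_optimizer.step()

frozen_unchanged = torch.equal(frozen_before, subset_model[1].weight.detach())
trained_changed = not torch.equal(trained_before, subset_model[3].weight.detach())
print(frozen_unchanged, trained_changed)
```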

#### Training the Model
We are now ready to train our model. PyTorch offers utility classes that can (and should) be used for training. The class `torch.utils.data.DataLoader` provides useful features such as shuffling, batching, parallel processing and more. It expects a `torch.utils.data.Dataset`, which `torchvision.datasets.MNIST` is a subclass of, hence we can directly use it:

In [None]:
from torch.utils.data import DataLoader

train_dl = DataLoader(mnist_train, batch_size=32, shuffle=True)
test_dl = DataLoader(mnist_test, batch_size=32)

To train a single epoch, we iterate over the DataLoader. For each input batch, we first apply the model (forward pass) and calculate the loss. We then reset the gradients from the previous step, compute the new gradients (backward pass) and let the optimizer update the parameters.

In [None]:
from tqdm.notebook import tqdm

def train_epoch(model, train_dl, loss_fn, optimizer):
    epoch_loss = 0
    for inputs, labels in tqdm(train_dl):
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        epoch_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return epoch_loss / len(train_dl)

We can now train our model for a couple of epochs. Calling the `train` function ensures that the model is in training mode (more on that later).

In [None]:
model.train()
for epoch in range(5):
    loss = train_epoch(model, train_dl, loss_fn, optimizer)
    print('epoch {} -- training loss: {}'.format(epoch + 1, loss))

#### Testing the Model
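Now that our model is trained, we can use it to make some predictions. Let's take the first item in the test set and predict its class. Note that our model expects batches as inputs, i.e. the shape should be $(b, 1, 28, 28)$, where $b$ is the batch size (this can be variable) and $1$ is the number of channels. A single example therefore corresponds to a batch of size one with the shape $(1, 1, 28, 28)$. Our single image has the shape $(1, 28, 28)$; since the model flattens everything after the first dimension, the channel dimension is simply interpreted as a batch of size one here, so we can pass the image as it is.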

In [None]:
inputs, label = mnist_test[0]
print('input shape: {}'.format(inputs.shape))
print('real label: {}'.format(label))
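
For models that do not start with a flatten layer, the batch dimension usually has to be added explicitly. A minimal sketch, using a random tensor as a stand-in for one MNIST image (the variable names are illustrative):

```python
import torch

# Stand-in for one MNIST image after ToTensor: (channels, height, width)
single_image = torch.rand(1, 28, 28)

# Add a leading batch dimension of size 1
batch_of_one = single_image.unsqueeze(0)
print(batch_of_one.shape)

# Stacking several images yields a larger batch
batch_of_three = torch.stack([single_image, single_image, single_image])
print(batch_of_three.shape)
```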

Whenever we intend to use the model without training it, we do not need any gradients, so there is no reason to track them during the forward pass. We can disable gradient tracking using `torch.no_grad`, which saves memory and time. Before we use the model for predictions, we also switch to evaluation mode by calling the `eval` method, which disables dropout.

In [None]:
model.eval()
with torch.no_grad():
    predictions = model(inputs)

print('predictions: {}'.format(predictions))
print('class with highest score: {}'.format(torch.argmax(predictions)))

If the class with the highest score matches the real label printed above, our model assigned the image to the correct class.

Using the same technique, we can now evaluate our model by computing some metrics on the test set. We convert the labels and outputs to numpy arrays using `torch.Tensor.numpy` and use Scikit-learn to compute the standard classification metrics.

In [None]:
from sklearn.metrics import classification_report

def evaluate(model, test_dl):
    true_labels = []
    predicted_labels = []
    for inputs, labels in tqdm(test_dl):
        outputs = model(inputs)
        predictions = torch.argmax(outputs, -1)

        true_labels.extend(labels.numpy())
        predicted_labels.extend(predictions.numpy())
    return true_labels, predicted_labels

with torch.no_grad():
    true_labels, predicted_labels = evaluate(model, test_dl)
print(classification_report(true_labels, predicted_labels))

Congratulations, your model should reach an accuracy of roughly $98\%$ (the exact number varies between runs)! You can now save it using `torch.save`:

In [None]:
torch.save(model.state_dict(), 'mnist_model.pt')

Loading it back is just as easy:

In [None]:
state_dict = torch.load('mnist_model.pt')
model.load_state_dict(state_dict)

Note that this only saves the model weights, but not the structure. You can also save the whole model instead of just the state dict.
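
Since the state dict contains no information about the structure, the architecture has to be rebuilt in code before the weights can be loaded. The sketch below illustrates this with a hypothetical factory function `build_model` and an in-memory buffer in place of a file on disk.

```python
import io
import torch
import torch.nn as nn

def build_model():
    """Hypothetical factory that recreates the architecture."""
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(),
                         nn.Dropout(p=0.2), nn.Linear(512, 10))

trained = build_model()

# Save only the weights; the buffer stands in for a file here
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# The structure is not stored, so we rebuild it before loading the weights
restored = build_model()
restored.load_state_dict(torch.load(buffer))

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(),
                    restored.state_dict().values()))
print(same_weights)
```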

## PyTorch Lightning
[PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) is a PyTorch wrapper that implements many useful functions and saves us from writing a lot of code. We will introduce it (somewhat briefly) in this section.

### Lightning Modules
In order to use Lightning to train your model, it has to be defined as a subclass of `pytorch_lightning.LightningModule`. This class extends the standard `torch.nn.Module`, adding methods for datasets, optimizers, loss functions and more. Let's redefine our MNIST model using Lightning:

In [None]:
from pytorch_lightning import LightningModule
from sklearn.metrics import accuracy_score

class MNISTLightningModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_1 = nn.Linear(in_features=784, out_features=512)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.linear_2 = nn.Linear(in_features=512, out_features=10)
        
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, inputs):
        x = self.flatten(inputs)
        x = self.linear_1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear_2(x)
        return x

    def prepare_data(self):
        self.mnist_train = MNIST(root='data', download=True, train=True, transform=ToTensor())
        self.mnist_test = MNIST(root='data', download=True, train=False, transform=ToTensor())

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        return {'loss': self.loss_fn(outputs, labels)}

    def training_epoch_end(self, results):
        avg_loss = torch.stack([step['loss'] for step in results]).mean()
        return {'log': {'train_loss': avg_loss}}

    def test_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        predictions = torch.argmax(outputs, -1)
        return {'test_labels': labels, 'test_predictions': predictions}

    def test_epoch_end(self, results):
        true_labels = []
        predicted_labels = []
        for step in results:
            # move tensors to the CPU first so this also works when testing on a GPU
            true_labels.extend(step['test_labels'].cpu().numpy())
            predicted_labels.extend(step['test_predictions'].cpu().numpy())

        # all logs need to be tensors, not numbers
        acc = torch.as_tensor(accuracy_score(true_labels, predicted_labels))
        return {'log': {'test_accuracy': acc}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=32, shuffle=True)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=32)

The Lightning model above contains everything our previous model contained and is defined by a single well-organized class.  The `prepare_data` method gets called once before the training starts and takes care of dataset related tasks like downloads. The training is handled by the `training_step` method, which returns the loss for a given batch. `training_epoch_end` is called after one epoch and averages the training loss. Similarly, `test_step` applies the model to a batch of testing data and returns the labels and predictions, while `test_epoch_end` finally aggregates the results and returns the accuracy score for the whole test set.

Training and testing the model now becomes trivial:

In [None]:
from pytorch_lightning import Trainer

model = MNISTLightningModel()
trainer = Trainer(max_epochs=5)
trainer.fit(model)
trainer.test()

### TensorBoard
You might have noticed that our Lightning model did not output any numbers like training loss. We did, however, return logging dictionaries in our `training_epoch_end` and `test_epoch_end` methods that contain the values we would like to monitor during and after training. The logs are saved in a new folder (`lightning_logs` by default). We can use _TensorBoard_ to inspect and visualize these logs. TensorBoard is a very useful tool used to monitor the training process and inspect the model. It allows you to plot losses, metrics and other parameters like the learning rate in real time during the training and has many other cool features. You can launch TensorBoard either via the terminal or directly in the notebook:

In [None]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs

Above you should be able to see your training loss now. If you ran the training multiple times, you can enable and disable each run individually on the left. You can customize the values to be logged here to your liking by returning more values in the logging dictionaries within your Lightning model. For example, it is possible to log the training loss after each batch in `training_step` and omit the `training_epoch_end` function altogether. This can be useful if your dataset is very large and does not require even one full epoch of training.

Note that, if you do not see the test results in TensorBoard, it might be related to [this bug](https://github.com/PyTorchLightning/pytorch-lightning/issues/1447), which should be fixed in the next version of Lightning.

### Validation
So far we have only trained our model for a fixed number of epochs on our training data. However, this approach is very prone to overfitting. It is usually best to keep a smaller number of examples as a validation set (or dev set) and calculate the loss on this set after every $n$ epochs (or batches). As soon as we see the validation loss increase, we should stop the training to avoid overfitting.

In order to use early stopping with our dataset, we need to create a validation set first by taking a small number of instances from the training set. For simplicity, we create the split in the `prepare_data` method here; however, in practice it might be better to have a fixed split that never changes. Additionally, we implement the methods `validation_step`, which computes the loss on a batch of validation data, and `validation_epoch_end`, which averages the loss over the whole validation set.

In [None]:
from torch.utils.data import random_split

class MNISTLightningModelWithVal(MNISTLightningModel):
    def prepare_data(self):
        mnist_train = MNIST(root='data', download=True, train=True, transform=ToTensor())
        self.mnist_train, self.mnist_val = random_split(mnist_train, [55000, 5000])
        self.mnist_test = MNIST(root='data', download=True, train=False, transform=ToTensor())

    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        return {'loss': self.loss_fn(outputs, labels)}

    def validation_epoch_end(self, results):
        avg_loss = torch.stack([step['loss'] for step in results]).mean()
        return {'log': {'val_loss': avg_loss}}

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=32, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=32)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=32)
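
One way to obtain a split that never changes is to pass a seeded random generator to `random_split`. This is only a sketch: the `generator` argument is available in more recent PyTorch releases and, as far as we know, not in the version 1.4.0 used for this tutorial. The helper `fixed_split`, the seed and the toy dataset are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Small stand-in dataset: 100 fake images with fake labels
toy_data = TensorDataset(torch.rand(100, 1, 28, 28),
                         torch.randint(0, 10, (100,)))

def fixed_split(dataset, lengths, seed=42):
    """Hypothetical helper: split using a dedicated, seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, lengths, generator=generator)

train_a, val_a = fixed_split(toy_data, [90, 10])
train_b, val_b = fixed_split(toy_data, [90, 10])

# Both calls select exactly the same validation examples
same_split = list(val_a.indices) == list(val_b.indices)
print(len(train_a), len(val_a), same_split)
```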

After implementing the validation methods, you can use parameters such as `check_val_every_n_epoch` or `val_check_interval` to control the frequency of validation during training. It is also possible to perform validation more than once per epoch by setting `val_check_interval` to a value between $0$ and $1$.

In [None]:
model = MNISTLightningModelWithVal()
trainer = Trainer(max_epochs=5, check_val_every_n_epoch=1)
trainer.fit(model)

### Callbacks
Callback functions can be used to execute any arbitrary code during training or testing of your Lightning model. A callback class implements methods like
* `on_batch_end`
* `on_batch_start`
* `on_epoch_end`
* `on_epoch_start`

and so on. The complete list can be found [here](https://pytorch-lightning.readthedocs.io/en/latest/callbacks.html). These methods are then called at the appropriate times during the training/testing process. You can create your own callback classes by subclassing `pytorch_lightning.callbacks.base.Callback`. In this section we will look at some existing callbacks.
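
The general idea behind such hook methods can be sketched in plain Python. Note that this is not Lightning's implementation or API: the classes `SketchCallback` and `RecordingCallback`, the function `run_training` and the hook signatures are all hypothetical and only illustrate when each method would be called.

```python
class SketchCallback:
    """Hypothetical base class: every hook does nothing by default."""
    def on_epoch_start(self, epoch): pass
    def on_batch_start(self, epoch, batch_idx): pass
    def on_batch_end(self, epoch, batch_idx): pass
    def on_epoch_end(self, epoch): pass

class RecordingCallback(SketchCallback):
    """Overrides only the epoch hooks and records when they fire."""
    def __init__(self):
        self.events = []

    def on_epoch_start(self, epoch):
        self.events.append('epoch {} start'.format(epoch))

    def on_epoch_end(self, epoch):
        self.events.append('epoch {} end'.format(epoch))

def run_training(num_epochs, num_batches, callbacks):
    """Toy training loop that calls each hook at the matching point."""
    for epoch in range(num_epochs):
        for cb in callbacks:
            cb.on_epoch_start(epoch)
        for batch_idx in range(num_batches):
            for cb in callbacks:
                cb.on_batch_start(epoch, batch_idx)
            # ... the forward and backward pass would happen here ...
            for cb in callbacks:
                cb.on_batch_end(epoch, batch_idx)
        for cb in callbacks:
            cb.on_epoch_end(epoch)

recorder = RecordingCallback()
run_training(num_epochs=2, num_batches=3, callbacks=[recorder])
print(recorder.events)
```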

#### Early Stopping
In the previous section we implemented validation. Since the validation loss tells us when the model starts overfitting, it makes sense to stop the training process after reaching this point to avoid unnecessary computations. This technique is called _early stopping_. Lightning provides a pre-implemented callback `pytorch_lightning.callbacks.EarlyStopping` that monitors the validation loss and stops the training once it starts increasing:

In [None]:
from pytorch_lightning.callbacks import EarlyStopping

cb_early_stop = EarlyStopping(monitor='val_loss', patience=3)

Note that the name of the value we monitor (`val_loss`) needs to be identical to the name we assigned in our logs earlier. We can also set the `patience` parameter, which is the number of validation checks without improvement that are tolerated before stopping. Let's train the same model again with early stopping. During the training, you can scroll up and monitor the logs in real time using TensorBoard.

In [None]:
model = MNISTLightningModelWithVal()
trainer = Trainer(check_val_every_n_epoch=1, early_stop_callback=cb_early_stop)
trainer.fit(model)
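
The effect of `patience` can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not Lightning's actual implementation (which offers further options and may count slightly differently); the helper `epochs_until_stop` and the loss values are made up.

```python
def epochs_until_stop(val_losses, patience):
    """Hypothetical helper: index of the epoch after which training stops,
    or None if the patience is never exhausted."""
    best = float('inf')
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0      # improvement: reset the counter
        else:
            bad_epochs += 1     # no improvement in this epoch
            if bad_epochs >= patience:
                return epoch
    return None

# Made-up validation losses: best value at epoch 2, no improvement afterwards
losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.19]
stop_epoch = epochs_until_stop(losses, patience=3)
print(stop_epoch)
```

With a patience of $3$, training in this toy example continues for three epochs after the best validation loss before it is stopped.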

#### Checkpointing
Model checkpointing is an important technique when you train models over a longer period of time. You want to save your trained weights after every couple of epochs (or batches) for two reasons:
1. If the process is killed unexpectedly, you don't lose all of your progress.
2. If your model starts to overfit, you can use the last 'good' checkpoint.

Checkpointing is implemented as a callback function, `pytorch_lightning.callbacks.ModelCheckpoint`.

In [None]:
import os
from pytorch_lightning.callbacks import ModelCheckpoint

os.makedirs('ckpt', exist_ok=True)
ckpt_path = os.path.join('ckpt', '{epoch}-{val_loss:.2f}')
cb_checkpoint = ModelCheckpoint(filepath=ckpt_path, period=1, save_top_k=3, monitor='val_loss')

We configured the checkpointing callback above to save the checkpoints in a directory called `ckpt`. A checkpoint will be saved after every epoch, as controlled by `period`. Only the best $3$ checkpoints will be kept, based on the validation loss. The file name is important here, as it determines whether or not checkpoints will be overwritten. Placeholders in the file name will be replaced by the corresponding values.
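
The following plain Python sketch illustrates how placeholders and `save_top_k` interact. It uses `str.format` as a stand-in for Lightning's own naming logic, so the exact file names that Lightning writes may look different; the validation losses are made up.

```python
# Same template as above; str.format stands in for Lightning's naming logic
template = '{epoch}-{val_loss:.2f}'

# Made-up validation loss for each epoch
val_losses = {0: 0.412, 1: 0.287, 2: 0.301, 3: 0.254, 4: 0.263}

# Each epoch gets its own name because the placeholders are filled in
names = {epoch: template.format(epoch=epoch, val_loss=loss)
         for epoch, loss in val_losses.items()}

# Keep only the 3 epochs with the lowest validation loss (save_top_k=3)
top_k = sorted(val_losses, key=val_losses.get)[:3]
kept = [names[epoch] for epoch in top_k]
print(kept)
```

A template without any placeholders would produce the same name in every epoch, which is why the file name decides whether earlier checkpoints are overwritten.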

In [None]:
model = MNISTLightningModelWithVal()
trainer = Trainer(check_val_every_n_epoch=1, early_stop_callback=cb_early_stop, checkpoint_callback=cb_checkpoint)
trainer.fit(model)

## Further Reading
In this section we describe some more advanced topics which are not fully covered in this tutorial but well worth checking out.

### Training on GPUs
So far we have only used a small dataset, on which a model can be trained on CPUs in reasonable time. However, once you start working with large models or data, this becomes infeasible. GPUs (and TPUs) are optimized for matrix (or tensor) multiplications and are much better suited for training deep models. You have multiple options to use GPUs:

* If you have your own NVIDIA GPU, install [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn) to use it (note that Deep Learning libraries tend to require specific versions of these).
* If you are using Google Colab, you can activate GPU (or TPU) acceleration for free in the menu under `Runtime`, selecting `Change runtime type`. Restart the runtime afterwards.

Start by checking your CUDA installation:

In [None]:
print(torch.cuda.is_available())

#### PyTorch
In order to train on a GPU, you have to transfer both your parameters (the model) and your data to the GPU before (or during) training. Note that this is a small model, so the speed-up is not very big.

In [None]:
model = MNISTModel()
model.to('cuda:0')
optimizer = torch.optim.Adam(model.parameters())

def train_epoch_cuda(model, train_dl, loss_fn, optimizer):
    epoch_loss = 0
    for inputs, labels in tqdm(train_dl):
        outputs = model(inputs.to('cuda:0'))
        loss = loss_fn(outputs, labels.to('cuda:0'))
        epoch_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return epoch_loss / len(train_dl)

model.train()
for epoch in range(5):
    loss = train_epoch_cuda(model, train_dl, loss_fn, optimizer)
    print('epoch {} -- training loss: {}'.format(epoch + 1, loss))
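
The code above fails on machines without a GPU. A common pattern, sketched below with a small stand-in model and random inputs, is to select the device once and move both the model and the data to it. The variable names are illustrative.

```python
import torch
import torch.nn as nn

# Use the first GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

device_model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10)).to(device)

# Inputs must live on the same device as the model's parameters
device_inputs = torch.rand(4, 1, 28, 28).to(device)
device_outputs = device_model(device_inputs)

# Move results back to the CPU before processing them outside of PyTorch
device_predictions = torch.argmax(device_outputs, dim=-1).cpu().tolist()
print(device, device_predictions)
```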

#### Lightning
Lightning makes training on a GPU very easy. All we have to do is set the `gpus` parameter:

In [None]:
model = MNISTLightningModel()
trainer = Trainer(gpus=1, max_epochs=5)
trainer.fit(model)

Lightning also supports training on TPUs (available through Google Colab) out of the box. Check out the official documentation [here](https://pytorch-lightning.readthedocs.io/en/latest/tpu.html).

If you have multiple GPUs, you can use PyTorch's `DataParallel` or `DistributedDataParallel` mode. Check out the official guides for [PyTorch](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html) and [Lightning](https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html).

### Custom Datasets
Once you start using your own data for training it is very likely you have to create custom dataset classes. Refer to the [relevant PyTorch documentation](https://pytorch.org/docs/stable/data.html) on how to do this. In essence, you have to extend `torch.utils.data.Dataset` and implement the `__getitem__` and `__len__` methods. Depending on your use case, it might be required to implement a _collate_ function, which forms batches from your data instances.
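
As a minimal sketch of these ideas, the following example defines a dataset of variable-length integer sequences together with a collate function that pads each batch to the length of its longest sequence. The names `SequenceDataset` and `pad_collate` as well as the toy data are hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    """Hypothetical dataset of variable-length integer sequences with labels."""
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx]), self.labels[idx]

def pad_collate(batch):
    """Pads all sequences in a batch with zeros to the longest length."""
    sequences, labels = zip(*batch)
    max_len = max(len(s) for s in sequences)
    padded = torch.zeros(len(sequences), max_len, dtype=torch.long)
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s
    return padded, torch.tensor(labels)

toy_dataset = SequenceDataset([[1, 2, 3], [4, 5], [6]], [0, 1, 0])
toy_dl = DataLoader(toy_dataset, batch_size=3, collate_fn=pad_collate)
padded_batch, label_batch = next(iter(toy_dl))
print(padded_batch)
print(label_batch)
```

Without the custom collate function, the default batching would fail here because sequences of different lengths cannot be stacked into a single tensor.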

For very large datasets that do not fit in the main memory, it might be worth checking out `h5py` ([HDF5 for Python](http://docs.h5py.org/en/stable/)). HDF5 is a file format that allows non-sequential access directly from the disk, and `h5py` is its Python interface.