# Creating tensors and manipulating them

## Creating tensors and basic properties

**Create Tensors**
* `torch.tensor()`: create a tensor from the given data (e.g. nested lists of numbers)
* `torch.rand()`: create random tensor with given dimensions
* `torch.zeros()`: create a tensor filled with zeros
* `torch.ones()`: create a tensor filled with ones
* `torch.arange()`: create a range (similar to function `range` but output is a tensor)

**Attributes**
* `.dtype`: data type
* `.type()`: return the tensor cast to a new type (e.g. `.type(torch.float16)`)
* `.shape`: shape of the tensor
* `.device`: on which device the tensor lives

**Misc**
* `torch.manual_seed()`: to set the random seed (for reproducibility)


```python
import torch

print(torch.tensor([7, 7]))  # 1-dimensional tensor (vector)
print(torch.tensor([[1, 2], [3, 4]]))  # 2-dimensional tensor (matrix)
print(torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))  # 3-dimensional tensor
```
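
The other creation functions and the attributes listed above can be tried in the same way. The snippet below is a minimal sketch; the shapes and variable names are arbitrary choices for illustration and are not taken from the original notes.

```python
import torch

torch.manual_seed(42)                  # make the random tensor reproducible

pair = torch.tensor([7, 7])            # built from explicit data
random_t = torch.rand(2, 3)            # uniform values in [0, 1)
zeros_t = torch.zeros(2, 3)
ones_t = torch.ones(2, 3)
range_t = torch.arange(0, 10, 2)       # start, end (exclusive), step

print(pair.shape, pair.dtype, pair.device)
print(range_t)                         # tensor([0, 2, 4, 6, 8])

# .type() returns a converted tensor; the original keeps its dtype
as_float = pair.type(torch.float32)
print(as_float.dtype, pair.dtype)      # torch.float32 torch.int64
```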


## Manipulating tensors

* `.reshape()`: change the shape of the tensor
* `.view()`: create a new view of the same data with a different shape: the 2 tensors share memory, so changing one changes the other as well
* `torch.stack()`: stack tensors of the same shape along a new dimension
* `.permute()`: change order of dimensions, *useful to move colour channel first to last and vice versa*
* `.squeeze()` and `.unsqueeze()`: remove or add dimensions to a tensor
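
A short sketch of these operations on toy tensors (the sizes are assumptions chosen only to make the shape changes visible):

```python
import torch

x = torch.arange(6)                        # shape [6]

reshaped = x.reshape(2, 3)                 # shape [2, 3]
viewed = x.view(3, 2)                      # shares memory with x
viewed[0, 0] = 100                         # also changes x[0]

stacked = torch.stack([x, x], dim=0)       # new leading dimension: [2, 6]

image = torch.rand(3, 224, 224)            # [colour, height, width]
channels_last = image.permute(1, 2, 0)     # [height, width, colour]

with_batch = image.unsqueeze(dim=0)        # [1, 3, 224, 224]
without_batch = with_batch.squeeze(dim=0)  # back to [3, 224, 224]

print(x[0], stacked.shape, channels_last.shape, with_batch.shape)
```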


# PyTorch `nn.Module()`

* `nn.Module`: base class to define models: subclasses need an `__init__` method (calling `super().__init__()`) and a `forward` method
* `nn.Parameter()`: to manually define parameters
* loss functions:
  * `nn.L1Loss`: MAE (mean absolute error) loss for fitting linear models
  * `nn.CrossEntropyLoss`: cross entropy for multi-class classification
  * `nn.BCEWithLogitsLoss()`: for binary classification, includes the sigmoid activation function. It expects the model to output logits; use `torch.sigmoid()` to transform them into predictions/probabilities
* `torch.optim.SGD()`: Stochastic Gradient Descent algorithm
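
To make the relationship between logits and probabilities concrete, the sketch below compares `nn.BCEWithLogitsLoss` with `nn.BCELoss` applied to explicit sigmoid outputs, and checks that `nn.L1Loss` is a mean absolute error. The numbers are made-up toy values.

```python
import torch
from torch import nn

logits = torch.tensor([0.5, -1.0, 2.0])   # raw model outputs (toy values)
targets = torch.tensor([1.0, 0.0, 1.0])   # binary labels as floats

# BCEWithLogitsLoss applies the sigmoid internally...
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
# ...so it matches BCELoss computed on explicit probabilities
probs = torch.sigmoid(logits)
loss_on_probs = nn.BCELoss()(probs, targets)
print(loss_with_logits, loss_on_probs)

# L1Loss is the mean absolute error: (|1 - 2| + |2 - 4|) / 2
mae = nn.L1Loss()(torch.tensor([1.0, 2.0]), torch.tensor([2.0, 4.0]))
print(mae)                                # tensor(1.5000)
```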

**`nn.Module` possible transformations**
* `nn.Sequential()`: to put together several transformations
* `nn.Flatten()`: flatten a multi-dimensional tensor into a vector (by default the batch dimension is kept)
* `nn.Linear()`: for linear transformation (e.g. simple linear regression)
* `nn.Conv2d()`: convolution step
* `nn.MaxPool2d()`: take maximum over a square of pixels and reduce dimensions
* `nn.ReLU()`: rectified linear activation function $\max(0,x)$
* `nn.GELU()`: Gaussian Error Linear Units function
* `nn.MultiheadAttention()`: multihead self attention block
* `nn.Parameter()`: to create ad-hoc parameters

**Methods and Attributes for a model:**
* `a_model.state_dict()`: to get dictionary of parameters
* `a_model.eval()`, `a_model.train()`: eval and train status
* `with torch.inference_mode()`: to turn off gradient tracking, recommended when making predictions or calculating test performance
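
As an illustration, here is a minimal sketch of a subclassed model that uses these pieces together. The class name `LinearRegressionModel` and its two scalar parameters are hypothetical choices, not something defined in these notes.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):   # hypothetical example model
    def __init__(self):
        super().__init__()
        # parameters defined by hand with nn.Parameter
        self.weight = nn.Parameter(torch.randn(1))
        self.bias = nn.Parameter(torch.randn(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x + self.bias

torch.manual_seed(42)
model = LinearRegressionModel()
print(model.state_dict())                 # dictionary with 'weight' and 'bias'

model.eval()                              # switch to evaluation mode
with torch.inference_mode():              # no gradient tracking inside this block
    preds = model(torch.tensor([[1.0], [2.0]]))
print(preds.requires_grad)                # False
```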


## Fitting a model

1. Set up number of epochs (iterations over the whole dataset)
1. Set up epochs loop
1. Set up loop through batches in a DataLoader
1. `a_model.train()`: Put the model in train mode
1. `y_pred = a_model(X_data)`: Do a forward pass
1. `loss = loss_fn(y_pred, y_train)`: Calculate the train loss
  * It may be necessary to transform the output first: e.g. from logit to probability
1. `optimizer.zero_grad()`: Reset the gradients accumulated in the previous step
1. `loss.backward()`: Perform backpropagation of the loss
1. `optimizer.step()`: Perform optimizer step

```python
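from timeit import default_timer as timer
from tqdm.auto import tqdm
import torch

# model_0, loss_fn, optimizer, accuracy_fn and the two dataloaders are assumed to be defined already
# set the seed and start the timer
torch.manual_seed(42)
train_time_start_on_cpu = timer()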

# set number of epochs
epochs = 3

# create training and test loop
for epoch in tqdm(range(epochs)):
  print(f"Epoch: {epoch}\n---------")
  ### Training
  train_loss = 0 # cumulates loss per batch
  # Loop through batches
  for batch, (X, y) in enumerate(train_dataloader):
    model_0.train()
    # forward pass
    y_pred = model_0(X)
    # loss
    loss = loss_fn(y_pred, y)
    train_loss += loss.item() # accumulates the train loss as a plain float
    # optimizer reset
    optimizer.zero_grad()
    # loss backward
    loss.backward()
    # optimizer step: updating model parameters once per BATCH
    optimizer.step()
    if batch % 400 == 0:
      print(f"Looked at {batch * len(X)}/{len(train_dataloader.dataset)} samples.")

  # back to epoch loop
  # divide loss by length dataloader
  train_loss /= len(train_dataloader)

  # testing loop
  model_0.eval()
  test_loss, test_acc = 0, 0
  with torch.inference_mode():
    for X_test, y_test in test_dataloader:
      # forward pass
      test_pred = model_0(X_test)
      # loss
      test_loss += loss_fn(test_pred, y_test)
      # accuracy
      test_acc += accuracy_fn(y_true=y_test, y_pred=test_pred.argmax(dim=1))
    # calculate the test loss average per batch
    test_loss /= len(test_dataloader)
    # accuracy average
    test_acc /= len(test_dataloader)

  # train accuracy is not computed in this loop, so only the train loss is reported
  print(f"\nTrain loss: {train_loss:.4f} | Test loss: {test_loss:.4f} | Test acc: {test_acc:.2f}%")
```

# Model Architecture

## TinyVGG architecture model

```python
class TinyVGGArchitecture(nn.Module):
  """
  Model architecture replicating TinyVGG
  from CNN explainer website.
  """
  def __init__(self,
               input_shape: int,
               hidden_units: int, # number of hidden units (channels), not the size of each convolved picture
               output_shape: int):
    super().__init__()
    # architecture: multiple blocks
    # convolutional blocks: multiple layers
    self.conv_block_1 = nn.Sequential(
        nn.Conv2d(in_channels=input_shape, # convolutional 2 dimensional
                  out_channels=hidden_units,
                  kernel_size=3,
                  stride=1,
                  padding=1), # we set these values in NN
        nn.ReLU(),
        nn.Conv2d(in_channels=hidden_units, # convolutional 2 dimensional
                  out_channels=hidden_units,
                  kernel_size=3,
                  stride=1,
                  padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2,
                     stride=2) # by default same as kernel size
    )
    self.conv_block_2 = nn.Sequential(
        nn.Conv2d(in_channels=hidden_units,
                  out_channels=hidden_units,
                  kernel_size=3,
                  stride=1,
                  padding=1),
        nn.ReLU(),
        nn.Conv2d(in_channels=hidden_units,
                  out_channels=hidden_units,
                  kernel_size=3,
                  stride=1,
                  padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )
    # last block needs to output a classifier
    self.classifier = nn.Sequential(
        nn.Flatten(),
        # 7*7 is the feature map size left after the two 2x2 max pools (e.g. from 28x28 inputs)
        nn.Linear(in_features=hidden_units*7*7,
                  out_features=output_shape)
    )
  def forward(self, x):
    x = self.conv_block_1(x)
    # print(x.shape) # to help get the right size in the Linear layer
    x = self.conv_block_2(x)
    # print(x.shape)
    x = self.classifier(x)
    # print(x.shape)
    return x
```
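
The `hidden_units*7*7` value in the classifier depends on the input image size. The sketch below reproduces only the shape-changing layers to show where the `7` comes from, assuming single-channel 28x28 inputs and `hidden_units = 10` (both are assumptions for illustration, not values stated in these notes).

```python
import torch
from torch import nn

hidden_units = 10                       # assumed value, for illustration
features = nn.Sequential(
    nn.Conv2d(1, hidden_units, kernel_size=3, stride=1, padding=1),  # keeps 28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),        # 28x28 -> 14x14
    nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),        # 14x14 -> 7x7
)

dummy = torch.rand(1, 1, 28, 28)        # [batch, channels, height, width]
out = features(dummy)
print(out.shape)                        # torch.Size([1, 10, 7, 7])

# value to use as in_features of the final nn.Linear
in_features = out.flatten(start_dim=1).shape[1]
print(in_features)                      # 490 = hidden_units * 7 * 7
```

Passing a dummy tensor through the convolutional blocks like this is a quick way to find the right `in_features` when the input size changes.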

## VGG models' family

The VGG family is a series of models with 11 to 19 layers alternating:
* Convolutional layers
* Max pool layers
* A dense classifier at the end, made of 3 linear layers.

Usually there is an activation function (ReLU) after each convolutional layer and between the linear layers.

An example is the VGG-11 model, the smallest of the family:

**Conv layer -> Max pool -> Conv layer -> Max pool -> 2 Conv layers -> Max pool -> 2 Conv layers -> Max pool -> 2 Conv layers -> Max pool -> 3 Linear layers -> Soft-max**
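
The layer count can be checked with a few lines of plain Python. The channel widths below follow configuration "A" of the VGG paper; they are not given in these notes and are included only to make the sketch concrete.

```python
# Configuration "A" (VGG-11): numbers are conv output channels, "M" is a max pool
vgg11_cfg = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]

n_conv = sum(1 for layer in vgg11_cfg if layer != "M")
n_pool = vgg11_cfg.count("M")
n_linear = 3                                # the dense classifier at the end

# only layers with weights (conv and linear) count towards the "11"
print(n_conv, n_pool, n_conv + n_linear)    # 8 5 11
```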

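In ***PyTorch*** (`torchvision.models`) some examples:
* `torchvision.models.VGG11_Weights` and `torchvision.models.vgg11`
* `torchvision.models.VGG13_Weights` and `torchvision.models.vgg13`
* `torchvision.models.VGG16_Weights` and `torchvision.models.vgg16`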

https://pytorch.org/vision/stable/models/vgg.html

## Vision Transformer (ViT) architecture

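References:
* https://arxiv.org/abs/1706.03762: the original Transformer paper (*Attention Is All You Need*)
* https://arxiv.org/abs/2010.11929: the ViT paper (*An Image is Worth 16x16 Words*); the summary equations below follow this article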

Let's start with an image with $3$ colour channels and size $224\times224$ ($H\times W$).

### Embedding Patches step

The image is split into non-overlapping square patches of side $P$, where the image size must be divisible by $P$ (e.g. $224/16=14$). The split creates $H\times W/P^2$ patches: if $H=W=224$ and $P=16$ we have $224\times224/16^2 = 14^2 = 196$ patches.

A linear layer is applied to each flattened patch in order to obtain a vector of size $3\times P^2 = 3 \times 16^2 = 768$ corresponding to each image patch.

***With code*** this can be achieved with one convolutional layer (kernel size and stride equal to $P$, and $3\times P^2 = 768$ output channels) followed by a flatten layer: such a convolution applies the same linear layer to every patch.

Each patch embedding is $x^i_pE$, where $E$ is a matrix of learnable parameters, $i=1,\ldots,14^2$, and the subscript $p$ marks a flattened patch (a vector of size $3\times P^2$).

Dimensions:
$$[B, 3, 224, 224] \rightarrow [B, 196, 768]$$
where $B$ is the batch size.

A learnable vector ($x_{class}$) of size $[1, 768]$ for the class is stacked on top of the output, obtaining a matrix $[197, 768]$; a learnable matrix $E_{pos}$ of the same size is then added to it to track the position of each patch within the image.

***Final output***: $[B, 197, 768]$

***Summary Equation:***

$$
\mathbf{z}_0 = \left[x_{class};x^1_pE;\cdots;x^N_pE\right] + E_{pos}
$$
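
The summary equation can be sketched as a small module. This is one possible implementation under the assumptions used in this section ($224\times224$ images, $P=16$, embedding size $768$); the class and attribute names are illustrative, and the random initialisation of the class token and position embedding is an arbitrary choice.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Sketch of z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2           # 196
        # kernel_size = stride = P: one linear projection per patch
        self.patcher = nn.Conv2d(in_channels, embed_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):
        x = self.flatten(self.patcher(x))         # [B, 768, 196]
        x = x.permute(0, 2, 1)                    # [B, 196, 768]
        cls = self.class_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)            # [B, 197, 768]
        return x + self.pos_embedding             # position added by broadcasting

z0 = PatchEmbedding()(torch.rand(2, 3, 224, 224))
print(z0.shape)                                   # torch.Size([2, 197, 768])
```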

***Calculating the number of parameters***
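* For the convolutional layer we have $768$ filters of size $3\times16^2$ plus $768$ biases: $3\times16^2\times768 + 768 = 590592$ (the value reported for the `Conv2d (patcher)` layer in the Torchinfo summary further down)
* The convolutional layer already plays the role of the linear layer $E$, so the projection needs no additional parameters
* For the class token: $768$
* For the position embedding: $197*768$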
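
As a quick arithmetic check, the counts can be reproduced in plain Python. The sketch assumes the patch projection is a single convolution with $768$ filters of size $3\times16\times16$, which is the configuration that gives the 590,592 parameters shown in the Torchinfo summary later in these notes.

```python
P, C_in, D, N = 16, 3, 768, 196      # patch size, colour channels, embedding size, patches

conv_params = C_in * P * P * D + D   # 768 filters of size 3x16x16, plus 768 biases
class_token_params = D
position_params = (N + 1) * D        # one row per patch plus one for the class token

print(conv_params)                   # 590592
print(class_token_params)            # 768
print(position_params)               # 151296
print(conv_params + class_token_params + position_params)  # 742656
```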

### Transformer Encoder: MSA + MLP

The transformer encoder has an MSA layer (Multihead Self Attention) and an MLP (Multilayer Perceptron). Before each layer a Layernorm transformation is applied to the data.

***LayerNorm***: This is a normalisation process (from the PyTorch documentation of `torch.nn.LayerNorm`):
$$
y_{i,j,k} = \frac{x_{i,j,k}-E(x_{i,j,:})}{\sqrt{Var(x_{i,j,:})+\varepsilon}}*\gamma_{k}+\beta_{k}
$$
where $\gamma$ and $\beta$ are learnable vectors of size $768$, $k=1,\ldots,768$, and $x_{i,j,:}=X[i, j, :]$ is a vector of size $768$ from the input $[B, 197, 768]$.

Layernorm has no impact on the size of the data. Output is still $[B, 197, 768]$.
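
The formula can be compared against `torch.nn.LayerNorm` directly. This is a small sketch on random data; at initialisation $\gamma$ is all ones and $\beta$ all zeros, and the variance is the biased one.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(2, 197, 768)

layer_norm = nn.LayerNorm(normalized_shape=768)    # gamma = 1, beta = 0 at initialisation
y_torch = layer_norm(x)

# Same computation by hand over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as in LayerNorm
y_manual = (x - mean) / torch.sqrt(var + layer_norm.eps) * layer_norm.weight + layer_norm.bias

print(torch.allclose(y_torch, y_manual, atol=1e-5))
print(y_torch.shape)                               # torch.Size([2, 197, 768])
```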

***MSA (Multihead Self Attention)***: The output of the previous step is processed by $H$ attention heads (here $H$ is the number of heads, not the image height); for simplicity let's assume $12$. The embedding dimension $768$ must be divisible by the number of heads $12$. Each head works with queries, keys and values of size $[B, 197, 64]$ where $64=768/12$. Also the number of heads does not influence the number of parameters for this step.

Writing $\mathbf{x}_l = LN(\mathbf{z}_{l-1})$ for the normalised input of layer $l$, each head $h$ computes:
$$
f\left((\mathbf{x}_{l}\cdot W_{h,q}) (\mathbf{x}_{l}\cdot W_{h,k})^T\right) (\mathbf{x}_{l}\cdot W_{h,v})
$$
where $h=1,\ldots,H$, $l=1,\ldots,L$ with $L$ the number of transformer encoder layers in the model, and $W_{h,q}$, $W_{h,k}$, and $W_{h,v}$ are all learnable matrices of the same size ($768\times64$). The function $f(\cdot)$ applies 3 transformations to the product: scale (division by $\sqrt{64}$), mask (only in some models) and softmax.

MSA can be coded with the dedicated module `torch.nn.MultiheadAttention()`.

The outputs of the heads are concatenated back to size $[B, 197, 768]$ and passed through a fully connected layer; the residual term $\mathbf{z}_{l-1}$ is then added and the result is ready to input into the next step.
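
A minimal sketch of this step with `torch.nn.MultiheadAttention`, using the sizes from this section. It also illustrates that the number of heads does not change the parameter count; the variable names are illustrative.

```python
import torch
from torch import nn

msa = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

x = torch.rand(2, 197, 768)                        # [B, tokens, embedding]
# self-attention: query, key and value are all the same tensor
attn_output, _ = msa(query=x, key=x, value=x, need_weights=False)
print(attn_output.shape)                           # torch.Size([2, 197, 768])

n_params = sum(p.numel() for p in msa.parameters())
print(n_params)                                    # 2362368

# Same embedding size with 6 heads: identical number of parameters
msa_6 = nn.MultiheadAttention(embed_dim=768, num_heads=6, batch_first=True)
n_params_6 = sum(p.numel() for p in msa_6.parameters())
print(n_params_6 == n_params)                      # True
```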

***MLP (Multilayer Perceptron)***: this layer consists of 2 linear layers with an activation function in between (e.g. GELU): in the first layer the hidden units are quadrupled from $768$ to $3072$ and in the second layer they are projected back to $768$.

In code this is translated as a sequence: Linear -> GELU -> (Dropout ->) Linear.

The *Transformer Encoder* has its own modules in PyTorch: `torch.nn.TransformerEncoderLayer` for 1 layer and `torch.nn.TransformerEncoder` to create a sequence of transformer encoder layers.

https://pytorch.org/docs/stable/nn.html#transformer-layers

***Summary Equation***

$$
\mathbf{z}_l'=MSA(LN(\mathbf{z}_{l-1}))+\mathbf{z}_{l-1}
$$
$$
\mathbf{z}_l=MLP(LN(\mathbf{z}_l'))+\mathbf{z}_l'
$$
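
The two equations can be written out as a small module. This is a sketch, taking the second sub-layer to be the MLP as in the ViT paper; the class name `EncoderBlock` and the dropout rate are assumptions made for illustration.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):   # hypothetical name
    def __init__(self, embed_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_size),    # 768 -> 3072
            nn.GELU(),
            nn.Dropout(dropout),               # assumed dropout rate
            nn.Linear(mlp_size, embed_dim),    # 3072 -> 768
        )

    def forward(self, z):
        normed = self.norm1(z)
        attn_out, _ = self.msa(normed, normed, normed, need_weights=False)
        z = attn_out + z                       # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        return self.mlp(self.norm2(z)) + z     # z_l  = MLP(LN(z'_l)) + z'_l

out = EncoderBlock()(torch.rand(2, 197, 768))
print(out.shape)                               # torch.Size([2, 197, 768])
```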


***Calculating the number of parameters***
* For the layernorm we have $768*2$ (this happens twice: before MSA and before MLP)
* For the MSA: $12$ heads, each with $3$ matrices of size $768*768/12$ plus biases $3*768/12$, and a matrix $768*768$ for a fully connected layer plus the fully connected bias $768$: $3*768*768+3*768+768*768+768$
* For the MLP: $768*3072+3072$ for the first layer and $3072*768+768$ for the second linear layer
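
These counts can be checked against `torch.nn.TransformerEncoderLayer`. The sketch assumes the ViT-Base sizes used in this section and a pre-norm layer with GELU activation.

```python
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                                   activation="gelu", batch_first=True, norm_first=True)

layernorm = 2 * (768 * 2)                              # two layernorms
msa = 3 * 768 * 768 + 3 * 768 + 768 * 768 + 768        # q, k, v projections + output layer
mlp = (768 * 3072 + 3072) + (3072 * 768 + 768)         # two linear layers
expected = layernorm + msa + mlp

actual = sum(p.numel() for p in layer.parameters())
print(expected, actual)                                # 7087872 7087872
```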

### MLP Head

The last layer contains the classifier, projecting the output of the last transformer encoder layer onto the number of classes. It takes as input $\mathbf{z}_L^0$, the first row of the output (the class token), and it applies a layernorm and a linear layer:
$$
\mathbf{y}=LN\left(\mathbf{z}^0_L\right)
$$

***Calculating number of parameters***

* For the layernorm it's $768*2$ parameters
* For the linear layer the parameters are $768*C+C$, where $C$ is the number of classes
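
A minimal sketch of the head, assuming $C=3$ classes (an arbitrary value chosen for illustration) and a random tensor standing in for the encoder output:

```python
import torch
from torch import nn

num_classes = 3                                   # assumed value of C
head = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, num_classes))

z_L = torch.rand(2, 197, 768)                     # stand-in for the last encoder output
class_token = z_L[:, 0]                           # first row for each image: [B, 768]
logits = head(class_token)                        # [B, C]

n_params = sum(p.numel() for p in head.parameters())
print(logits.shape, n_params)                     # torch.Size([2, 3]) 3843
```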


## Transfer Learning

`PyTorch` offers a wide range of model architectures with pre-trained weights. It is possible to use these models by loading the pre-trained weights and adapting them to the problem we are trying to solve.

### Load a model

There are 3 important steps in preparing an instance of a pre-trained model:
1. Load the weights
2. Extract the appropriate transformation: it is important that our data is transformed in the same way as the images used to train the model
3. Create an instance of the model and load the pre-trained weights

```python
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. weights
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # .DEFAULT = best available weights from pretraining on ImageNet
# 2. transform
model_transforms = weights.transforms()
# 3. model
model = torchvision.models.efficientnet_b0(weights=weights).to(device)
```

### Fitting a pre-trained model

The steps are:
1. Create DataLoaders with the appropriate transform
2. Freeze the gradient of all parameters in the feature extraction layers of the model so they do not get updated when fitting the model
3. Replace the classifier to adapt to the right number of classes (basically change the very last linear layer)

```python
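# 1. dataloaders (train_dir is assumed to point to a folder of images, one sub-folder per class)
from torchvision import datasets
from torch.utils.data import DataLoader

train_data = datasets.ImageFolder(root=train_dir, # target folder of images
                                  transform=model_transforms) # from the pre-trained weights
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)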
# 2. freeze parameters
for param in model.features.parameters():
    param.requires_grad = False
# 3. replace classifier
model.classifier = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2, inplace=True), 
    torch.nn.Linear(in_features=1280, 
                    out_features=output_shape, # same number of output units as our number of classes
                    bias=True)).to(device)
```
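
The effect of freezing can be checked without downloading any weights. The sketch below uses a tiny stand-in model (hypothetical, not EfficientNet) that only mimics the `features` / `classifier` layout, and lists which parameters remain trainable.

```python
import torch
from torch import nn

class StandInModel(nn.Module):     # hypothetical, mimics the features/classifier layout
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(8, 1000)

    def forward(self, x):
        return self.classifier(self.features(x))

model = StandInModel()
for param in model.features.parameters():
    param.requires_grad = False                 # freeze the feature extractor
model.classifier = nn.Linear(8, 3)              # new head for 3 classes

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)                                # ['classifier.weight', 'classifier.bias']

# Only the parameters that still require gradients are passed to the optimizer
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
```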

# Loading and creating datasets

## DataLoader

`from torch.utils.data import DataLoader` to create batches of data, as it is usually impractical to process all images at the same time. Common batch sizes are powers of 2, like 32 or 64.
* use `next(iter(aDataLoader))` to access one batch of data/images
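
A minimal sketch with random tensors standing in for an image dataset (the sizes and the use of `TensorDataset` are assumptions for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 random "images" of size 3x32x32 with labels 0..9
images = torch.rand(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

train_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

X_batch, y_batch = next(iter(train_dataloader))   # first batch only
print(X_batch.shape, y_batch.shape)     # torch.Size([32, 3, 32, 32]) torch.Size([32])
print(len(train_dataloader))            # 4 batches: 32 + 32 + 32 + 4
```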

# Experiment tracking

# Evaluating models

## Torchvision

* `from torchvision import datasets`: contains datasets
* `from torchvision import transforms`: contains transformations to adapt images to the correct format/size or to augment data

## Torchinfo 

* `from torchinfo import summary`: nice summary of model

```python
summary(model=a_model,
        input_size=(32, 3, 224, 224), # (batch_size, color_channels, height, width)
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"])
```
```
========================================================================================================================
Layer (type (var_name))                  Input Shape          Output Shape         Param #              Trainable
========================================================================================================================
PatchEmbedding (PatchEmbedding)          [32, 3, 224, 224]    [32, 196, 768]       --                   True
├─Conv2d (patcher)                       [32, 3, 224, 224]    [32, 768, 14, 14]    590,592              True
├─Flatten (flatten)                      [32, 768, 14, 14]    [32, 768, 196]       --                   --
========================================================================================================================
Total params: 590,592
Trainable params: 590,592
Non-trainable params: 0
Total mult-adds (G): 3.70
========================================================================================================================
Input size (MB): 19.27
Forward/backward pass size (MB): 38.54
Params size (MB): 2.36
Estimated Total Size (MB): 60.17
========================================================================================================================
```

## Torchmetrics

```python
try:
  import torchmetrics
except ImportError:
  !pip install -q torchmetrics  # notebook-only syntax (IPython shell escape)
  import torchmetrics
```

Contains functions to help evaluate models
* `Accuracy()`

### Confusion matrix

```python
from torchmetrics import ConfusionMatrix
from mlxtend.plotting import plot_confusion_matrix
```




# `sklearn` useful functions

* `from sklearn.datasets import make_circles, make_moons, make_blobs`: to create artificial datasets
* `from sklearn.model_selection import train_test_split`: to split dataset into train and test datasets

# Misc

* `from tqdm.auto import tqdm`: to have a progress bar when running a loop
* `from timeit import default_timer as timer`: to time how long a piece of code takes to run