In [1]:
# PyTorch Quickstart tutorial

In [6]:
# Import torch modules
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# I - Load a Dataset

In [13]:
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)


# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)





## FashionMNIST is a dataset of fashion article images

More information: [torchvision documentation](https://docs.pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html#torchvision.datasets.FashionMNIST)

It is a **dataset of Zalando's article images**: a training set of 60k examples and a test set of 10k examples.

Here is its GitHub repository: [link](https://github.com/zalandoresearch/fashion-mnist)

The `train` parameter shows that the PyTorch dataset **already comes split into a training set and a test set**, and we choose which one to load.

`download=True` downloads the data into the `root` directory (here `data`) if it is not already there.

`transform=ToTensor()` converts each image into a torch tensor (with pixel values scaled to [0, 1]) so that the model can process it.

In [14]:
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64


## DataLoader

This class wraps a PyTorch dataset in an iterable that yields batches.
Here we pass only two parameters:
- a dataset
- its batch size = the number of examples given to the model in one forward/backward pass.

Increasing the batch size increases the memory needed.

Shape NCHW:
- N = **number of samples in the batch** (here 64)
- C = **number of channels**: 3 for an RGB image, 1 for a grayscale image such as FashionMNIST
- H = **height**
- W = **width**

Those are the axes of a tensor containing a batch of images.

More information about DataLoader in the [documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
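A small sketch of how batching works, assuming a toy dataset of random tensors (`toy_dataset` and `toy_loader` are hypothetical names, not part of the tutorial). It shows that the last batch simply holds the leftover samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 150 random "images" of shape (1, 28, 28), labels in 0..9.
images = torch.rand(150, 1, 28, 28)
labels = torch.randint(0, 10, (150,))
toy_dataset = TensorDataset(images, labels)

toy_loader = DataLoader(toy_dataset, batch_size=64)

# Collect N (the first axis) of every batch.
batch_sizes = [X.shape[0] for X, y in toy_loader]
print(len(toy_loader))  # number of batches
print(batch_sizes)      # two full batches, then the remaining 22 samples
```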


# II - Create a Model

In [15]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"


print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

Using cpu device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


To create a neural network in PyTorch, we create a class that inherits from **`nn.Module`**.
[Here](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html) is the doc for `nn.Module`.

We can run the model on an accelerator such as CUDA when one is available, or stay on the CPU.

[source](https://docs.pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html)

In our class constructor, we first start with `nn.Flatten()`.

This class is a layer that converts each 2D 28x28 image into a contiguous array of 784 pixel values.

A contiguous array is an array stored in an **unbroken block of memory**, [link](https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays) for illustrated explanations.

Then we define a **`nn.Sequential`** attribute called `self.linear_relu_stack`.
A Sequential layer is a container that passes the data through several layers in order, here linear and ReLU layers.
[Here](https://docs.pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential) for more info about the Sequential module.


Inside is **`nn.Linear`**. The linear layer is a module that applies a linear transformation ($y = Wx + b$) to the input using its stored weights and biases.

Between linear layers, we also use **`nn.ReLU`**. It is a non-linear activation layer detailed [here](https://docs.pytorch.org/docs/stable/generated/torch.nn.ReLU.html) that just computes $y = \max(0, x)$.

The constructor `__init__` is called once when the neural network is created. At that time, all layers (the flatten and the sequential) are only defined. They are actually called in the **`forward(self, x)`** method.

This method is called every time data is sent through the network.
The data `x` is first flattened, then passed through the sequential layer (ReLU and linear layers; only the linear layers have trainable parameters).

The result returned by this method is called the logits, an unnormalized output of the model.

We often normalize them with a softmax function.

With those we will be able to define the notion of loss and backpropagation to train our parameters.
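The two ends of `forward` can be checked in isolation. This is a minimal sketch with random tensors standing in for a real batch and for the model output (both are assumptions for illustration only).

```python
import torch
from torch import nn

torch.manual_seed(0)

batch = torch.rand(64, 1, 28, 28)     # hypothetical batch in NCHW layout
flat = nn.Flatten()(batch)            # keeps axis 0 (the batch), flattens the rest
print(flat.shape)                     # 1 * 28 * 28 = 784 values per image

logits = torch.randn(64, 10)          # stand-in for the model output (10 classes)
probs = torch.softmax(logits, dim=1)  # normalize over the class axis
predicted_class = probs.argmax(dim=1) # most probable class for each sample
print(probs.sum(dim=1)[:3])           # each row sums to 1
```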


# III - Train a model

In [20]:
loss_fn = nn.CrossEntropyLoss()

To train the model's parameters, we need to define a **loss function** and an **optimizer**.

We choose the **cross entropy** loss function among the many [loss functions available](https://docs.pytorch.org/docs/stable/nn.html#loss-functions).

## A quick explanation about the most popular loss functions

For each mathematical formula, $y_i$ is the target value, $\hat{y}_i$ is the predicted value and $N$ is the number of samples.

### 1 - Mean Squared Error (MSELoss)

This function is used for regression tasks, when predicting continuous values (temperatures, house prices, ...).

$$
\mathrm{MSE}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - \hat{y}_i\bigr)^2
$$


It penalizes large errors more strongly because we take the square of the difference between the target and the predicted value.

![MSE](https://miro.medium.com/v2/resize:fit:640/format:webp/1*WfVDoLsarrM5HpO9sh_ZQQ.png)

### 2 - Mean Absolute Error (L1Loss)

It is used in regression tasks where you want robustness to outliers.

$$
\mathrm{MAE}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \bigl|y_i - \hat{y}_i \bigr|
$$

It has only a linear penalty, not a squared one like MSE, which makes it less sensitive to outliers. The trade-off is a less smooth optimization: the gradient has the same magnitude whatever the size of the error, and it is not defined at zero.

![L1_loss](https://miro.medium.com/v2/resize:fit:640/format:webp/1*0hbNOtpfr6aoR_Bmty-JkA.jpeg)
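To make the difference between MSE and MAE concrete, here is a small sketch with made-up numbers where a single prediction is off by 10.

```python
import torch
from torch import nn

target = torch.tensor([1.0, 2.0, 3.0, 4.0])
prediction = torch.tensor([1.0, 2.0, 3.0, 14.0])  # last prediction is off by 10

mse = nn.MSELoss()(prediction, target)  # (0 + 0 + 0 + 10^2) / 4 = 25
mae = nn.L1Loss()(prediction, target)   # (0 + 0 + 0 + 10) / 4 = 2.5
print(mse.item(), mae.item())
```

The single outlier dominates the MSE (25) much more than the MAE (2.5), which is what "robust to outliers" means here.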

### 3 - Cross Entropy Loss

The cross entropy loss is used for multi-class classification.

For each output (the result for each sample given to our model), we get a set of scores, one per possible class. Here $\hat{y}_{i,j}$ is the score (logit) of class $j$ for sample $i$.

Then we convert these scores into probabilities that sum to 1 via softmax (ex: 0.3 = 30% chance to be a dog, 0.5 = 50% to be a cat, ...).

The cross-entropy loss compares those predicted probabilities with the true class label, which acts as the "perfect" probability distribution: 1 for the correct class and 0 for the others (ex: 0% dog, 100% cat).

$$
\mathrm{CrossEntropy}(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log\left( \frac{e^{\hat{y}_{i, y_i}}}{\sum_{j} e^{\hat{y}_{i,j}}} \right)
$$

#### Explanation:

We first apply the **Softmax** function to convert logits into probabilities:
$$
p_{i,j} = \frac{e^{\hat{y}_{i,j}}}{\sum_{k} e^{\hat{y}_{i,k}}}
$$

Then the cross-entropy takes the logarithm of the probability assigned to the true class $y_i$ of each sample:

$$
\mathrm{CrossEntropy}(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log\left( p_{i,y_i} \right)
$$


![cross_entropy](https://ml-cheatsheet.readthedocs.io/en/latest/_images/cross_entropy.png)
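The formula can be checked against `nn.CrossEntropyLoss` on a tiny made-up example. This is only a sketch: the logits and targets below are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])  # 2 samples, 3 classes
targets = torch.tensor([0, 2])             # true class index of each sample

# Manual version: log-softmax, then pick the true class, then average.
log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)
manual_loss = -log_probs[torch.arange(2), targets].mean()

# Built-in version: takes raw logits and class indices directly.
builtin_loss = nn.CrossEntropyLoss()(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

Note that `nn.CrossEntropyLoss` expects raw logits, so the model should not apply softmax before it.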

### 4 - Binary cross-entropy (BCE)

It is used for binary classification (ex: spam vs not spam).

$$
\mathrm{BCE}(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$

The target $y_i$ is always 0 or 1, but the prediction $\hat{y}_i \in [0,1]$ is the probability of belonging to class 1 (ex: class 0 = not spam and class 1 = spam).

To obtain this probability we must pass the output of `nn.Linear`, which is a real number (positive or negative), through the sigmoid function.

Here is the formula of the sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$

If we want to process logits directly we should use `nn.BCEWithLogitsLoss` instead of `nn.BCELoss`, which requires applying `torch.sigmoid` to the logits beforehand.

![binary_Xentropy](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F54c97fda8af4dccc23d58bd14cd95802df6f1e49-393x272.png&w=640&q=75)
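A short sketch comparing the two ways of computing the binary cross-entropy, with arbitrary logits and targets chosen for illustration.

```python
import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.0])   # raw outputs of a linear layer
targets = torch.tensor([1.0, 0.0, 1.0])   # 1 = spam, 0 = not spam

# Option 1: apply the sigmoid ourselves, then BCELoss on probabilities.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Option 2: BCEWithLogitsLoss applies the sigmoid internally.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
print(loss_from_probs.item(), loss_from_logits.item())
```

Both give the same value here; the logits version is generally preferred because it is more stable numerically for large positive or negative logits.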

### 5 - Negative Log Likelihood Loss (NLLLoss)

It is used for multi-class classification when your model already outputs log-probabilities.

$$
\mathrm{NLL}(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}
$$

That is exactly the cross-entropy formula once the **LogSoftmax** function (not just the Softmax) has been applied to the logits.

In LogSoftmax, instead of computing softmax and then log (which is numerically unstable), PyTorch computes both in one go, with $C$ the number of classes:
$$
\log p_{i,j} = \hat{y}_{i,j} - \log \left( \sum_{k=1}^{C} e^{\hat{y}_{i,k}} \right)
$$

So the full formula is:

$$
\mathrm{NLL}(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{y}_{i,y_i} - \log \left( \sum_{k=1}^{C} e^{\hat{y}_{i,k}} \right) \right]
$$

In practice it is rarely used on its own: **we prefer to use CrossEntropyLoss** directly, which combines LogSoftmax and NLLLoss.
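The relation between the two losses can be verified in a few lines. This is a sketch with arbitrary logits.

```python
import torch
from torch import nn

logits = torch.tensor([[1.0, 2.0, 0.5],
                       [0.3, -1.0, 2.0]])  # 2 samples, 3 classes
targets = torch.tensor([1, 2])

# Two-step version: LogSoftmax followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# One-step version: CrossEntropyLoss on the raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)
print(nll.item(), ce.item())
```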

### 6 - Huber Loss (`nn.HuberLoss`)

Used for regression when the errors can be both small and large; it is a blend of MSE and MAE. `nn.SmoothL1Loss` is a closely related variant.

It is less sensitive to outliers than MSE but smoother than MAE.

It is a good general-purpose loss for regression tasks.

The threshold $\delta$ is a hyperparameter: it has to be found experimentally, so it can be tuned to get the best results.

$$
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2} (y - \hat{y})^2, & \text{if } |y - \hat{y}| \le \delta, \\
\delta \cdot \bigl(|y - \hat{y}| - \tfrac{1}{2}\delta \bigr), & \text{otherwise.}
\end{cases}
$$
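Here is a minimal sketch of the piecewise formula above, compared with `nn.HuberLoss`. The two predictions are made-up values chosen so that one error falls in each branch.

```python
import torch
from torch import nn

target = torch.tensor([0.0, 0.0])
prediction = torch.tensor([0.5, 3.0])  # errors 0.5 (quadratic branch) and 3.0 (linear branch)
delta = 1.0

error = (prediction - target).abs()
manual = torch.where(
    error <= delta,
    0.5 * error ** 2,               # small errors: MSE-like
    delta * (error - 0.5 * delta),  # large errors: MAE-like
).mean()

builtin = nn.HuberLoss(delta=delta)(prediction, target)
print(manual.item(), builtin.item())  # (0.125 + 2.5) / 2 = 1.3125
```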

### 7 - KL Divergence Loss (`nn.KLDivLoss`)

Used for comparing probability distributions (ex: in a Variational Autoencoder or in knowledge distillation).
It measures how one probability distribution diverges from another.
$$
D_{KL}(P||Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
$$
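A small sketch of the formula with two made-up distributions. One detail to keep in mind: the functional form `F.kl_div` (like `nn.KLDivLoss`) expects the model distribution as log-probabilities and the target distribution as probabilities.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.5, 0.5])    # target distribution P
q = torch.tensor([0.25, 0.75])  # model distribution Q

# Direct application of the formula: sum_i P(i) * log(P(i) / Q(i))
manual_kl = (p * (p / q).log()).sum()

# Built-in version: input is log Q, target is P.
builtin_kl = F.kl_div(q.log(), p, reduction="sum")
print(manual_kl.item(), builtin_kl.item())
```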



In [21]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

## A quick explanation about the optimizer

An optimizer is the algorithm that updates the model's parameters (weights and biases) based on the gradients computed during training.

In one training step we have:
- the forward method that generates the logits
- the loss function that computes the loss
- a backward pass from the loss (`loss.backward()`), which computes the gradients
- the weight update by the optimizer (`optimizer.step()`)
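The four steps above can be sketched as a loop. This is a minimal example on random data with a tiny hypothetical model (`toy_model`, `toy_loss_fn` and `toy_optimizer` are illustration names, not the tutorial's objects). It also resets the gradients with `zero_grad()` at each step, since PyTorch accumulates gradients by default.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_model = nn.Linear(4, 3)            # hypothetical tiny model: 4 features, 3 classes
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=1e-1)

X = torch.randn(8, 4)                  # 8 random samples
y = torch.randint(0, 3, (8,))          # 8 random class labels

losses = []
for step in range(20):
    logits = toy_model(X)              # 1. forward pass
    loss = toy_loss_fn(logits, y)      # 2. compute the loss
    toy_optimizer.zero_grad()          #    clear gradients from the previous step
    loss.backward()                    # 3. backward pass: compute gradients
    toy_optimizer.step()               # 4. update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])           # the loss should go down
```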

### 1 - Stochastic Gradient Descent (SGD)

This is the **simplest** and most classic optimizer.

**Stochastic** means it uses a random subset of the data (a batch) instead of the full dataset for each update.
Using all the data at each step is Batch Gradient Descent, not SGD.
Using one example at a time is pure SGD. In practice we use a small batch (like 64 samples) per step, which is called mini-batch SGD.

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)
$$
Where $\theta_t$ are the parameters (weights) at step $t$, $\eta$ is the learning rate and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters. Here is its formula:

$$
\nabla_\theta L(\theta_t) = 
\begin{bmatrix}
\dfrac{\partial L}{\partial \theta_1} \\
\dfrac{\partial L}{\partial \theta_2} \\
\vdots \\
\dfrac{\partial L}{\partial \theta_n}
\end{bmatrix}
$$


When using `torch.optim.SGD()`, the first argument is the model's parameters (`model.parameters()`) and we also define the learning rate (`lr`).

The learning rate controls how big each update step is. We have to find the best learning rate experimentally.

For instance, `lr = 1e-3` makes small, gentle updates whereas `lr = 1e-1` makes large, aggressive updates.

The limitations of this formula are:
- it has the same learning rate $\eta$ for all parameters
- it is sensitive to the scale of the gradients
- it can oscillate and converge slowly, especially in deep networks.

-> this optimizer is good for small or simple models
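To check that `torch.optim.SGD` applies exactly the update rule above, here is a sketch on a toy loss $L(\theta) = \sum \theta^2$ (an assumption chosen because its gradient, $2\theta$, is easy to verify by hand).

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = torch.optim.SGD([theta], lr=0.1)

loss = (theta ** 2).sum()   # toy loss, gradient is 2 * theta
loss.backward()

# Manual update from the formula, computed before the optimizer modifies theta.
expected = theta.detach() - 0.1 * theta.grad

sgd.step()
print(theta.detach(), expected)  # both are [0.8, -1.6]
```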

### 2 - SGD with momentum

We can enhance the SGD rule by adding a "momentum" term to make training faster and smoother.

$$
v_t = \beta v_{t-1} + (1 - \beta) \, \nabla_\theta L(\theta_t)
$$
$$
\theta_{t+1} = \theta_t - \eta v_t
$$

This velocity term $v_t$ accumulates previous gradients; it helps reduce oscillations by accelerating in directions where the gradient is consistent.

We usually set the momentum $\beta = 0.9$.

-> this optimizer is commonly used in CNNs
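A plain Python sketch of the two momentum equations above, on the toy loss $L(\theta) = \theta^2$ (an assumption for illustration; `momentum_sgd` is a hypothetical helper, not a PyTorch function). Note that `torch.optim.SGD(..., momentum=0.9)` uses a slightly different convention by default, $v_t = \beta v_{t-1} + \nabla_\theta L(\theta_t)$, without the $(1 - \beta)$ factor.

```python
def momentum_sgd(theta, lr=0.1, beta=0.9, steps=3):
    """Apply the momentum update rule to L(theta) = theta^2 and return the trajectory."""
    v = 0.0
    history = []
    for _ in range(steps):
        grad = 2.0 * theta                # gradient of theta^2
        v = beta * v + (1 - beta) * grad  # velocity: moving average of gradients
        theta = theta - lr * v            # parameter update
        history.append(theta)
    return history

trajectory = momentum_sgd(theta=1.0)
print(trajectory)  # approximately [0.98, 0.9424, 0.889712]
```

The steps get larger from one iteration to the next (0.02, then about 0.038, then about 0.053) because the velocity keeps accumulating gradients that point in the same direction.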

### 3 - Adagrad

$$
G_t = G_{t-1} + (\nabla_\theta L(\theta_t))^2
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \, \nabla_\theta L(\theta_t)
$$


-> this optimizer is used with sparse data or in Natural Language Processing (NLP)


### 4 - RMSProp

$$
\begin{aligned}
E[g^2]_t &= \alpha E[g^2]_{t-1} + (1 - \alpha) \, (\nabla_\theta L(\theta_t))^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \varepsilon}} \, \nabla_\theta L(\theta_t)
\end{aligned}
$$
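The Adagrad and RMSProp formulas differ only in how they accumulate squared gradients. Here is a plain Python sketch of that difference, assuming a constant gradient of 1 at every step (a simplification for illustration, with arbitrary values for $\eta$ and $\alpha$).

```python
import math

eta, alpha, eps = 0.1, 0.9, 1e-8
g = 1.0   # hypothetical constant gradient

G, E = 0.0, 0.0
adagrad_steps, rmsprop_steps = [], []
for t in range(1, 101):
    G = G + g ** 2                        # Adagrad: running sum, only grows
    E = alpha * E + (1 - alpha) * g ** 2  # RMSProp: moving average, stays bounded
    adagrad_steps.append(eta / math.sqrt(G + eps) * g)
    rmsprop_steps.append(eta / math.sqrt(E + eps) * g)

print(adagrad_steps[-1])  # about 0.01: the step keeps shrinking like eta / sqrt(t)
print(rmsprop_steps[-1])  # about 0.1: the step settles around eta
```

This illustrates why Adagrad's effective learning rate decays over time, while RMSProp keeps it roughly stable.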


### 5 - Adaptive Moment Estimation (Adam)
$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \, \nabla_\theta L(\theta_t) \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \, (\nabla_\theta L(\theta_t))^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
\end{aligned}
$$

-> this optimizer is a common default choice for most models
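As a sanity check of the Adam equations, here is a sketch that computes the first step by hand (with $m_0 = v_0 = 0$) and compares it with `torch.optim.Adam`. The toy loss $\sum \theta^2$ and the hyperparameter values are assumptions chosen for illustration.

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)
adam = torch.optim.Adam([theta], lr=0.1, betas=(0.9, 0.999), eps=1e-8)

(theta ** 2).sum().backward()   # toy loss, gradient is 2 * theta
g = theta.grad.clone()

# First step (t = 1) from the formulas, starting from m_0 = v_0 = 0.
m_hat = (1 - 0.9) * g / (1 - 0.9 ** 1)             # bias correction gives back g
v_hat = (1 - 0.999) * g ** 2 / (1 - 0.999 ** 1)    # bias correction gives back g^2
expected = theta.detach() - 0.1 * m_hat / (v_hat.sqrt() + 1e-8)

adam.step()
print(theta.detach(), expected)  # both close to [0.9, -1.9]
```

On the first step each parameter moves by about `lr` regardless of the size of its gradient, which shows how Adam rescales the update per parameter.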


### 6 - AdamW (Adam with Decoupled Weight Decay)

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda \theta_t \right)
$$

-> this optimizer is preferred for Transformers and LLMs
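To isolate what "decoupled" weight decay means, here is a sketch that takes one optimizer step with a loss whose gradient is exactly zero, so that only the weight decay acts. The helper `one_step` and the values of `lr` and `weight_decay` are hypothetical choices for illustration.

```python
import torch

def one_step(optimizer_cls):
    """Take one step with a zero loss gradient, so only weight decay moves theta."""
    theta = torch.tensor([1.0, -4.0], requires_grad=True)
    opt = optimizer_cls([theta], lr=0.1, weight_decay=0.5)
    (theta * 0.0).sum().backward()   # gradient of the loss is exactly zero
    opt.step()
    return theta.detach()

adamw_theta = one_step(torch.optim.AdamW)  # decay applied directly: theta * (1 - lr * lambda)
adam_theta = one_step(torch.optim.Adam)    # decay added to the gradient, then rescaled by Adam
print(adamw_theta)  # close to [0.95, -3.8]
print(adam_theta)   # close to [0.9, -3.9]
```

With AdamW each weight shrinks proportionally to its own value. With plain Adam and `weight_decay`, the decay term goes through the adaptive rescaling, so both weights move by about `lr` whatever their size.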

