# PyTorch Quickstart Tutorial

This notebook follows the official PyTorch tutorial, available [here](https://docs.pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html).

In [1]:
# Import torch modules
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# I - Load a Dataset

In [2]:
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)


# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)







## FashionMNIST Dataset

The **FashionMNIST** dataset is a modern replacement for the classic MNIST digits dataset. Instead of handwritten digits, it contains **fashion article images** from *Zalando* (clothing items, shoes, bags, etc.).

📚 More info: [PyTorch docs](https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html)  
👕 Official GitHub: [zalandoresearch/fashion-mnist](https://github.com/zalandoresearch/fashion-mnist)

The dataset contains:
- **60,000 training images**
- **10,000 test images**

Each image is a **grayscale 28×28 pixel** image belonging to one of **10 classes** (T-shirt, Trouser, Pullover, etc.).

The `train` parameter of `torchvision.datasets.FashionMNIST` tells PyTorch which split to load:
- `train=True` → training set  
- `train=False` → test set  

This means that **the dataset already comes pre-split**: we just choose which part to use.

Common arguments:
- `download=True` → downloads the dataset locally if it’s not already present.  
- `transform=ToTensor()` → converts the images (PIL format) into PyTorch tensors so they can be processed by the model.


In [3]:
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64


## DataLoader

The **DataLoader** class is used to load and iterate through a PyTorch dataset efficiently.

It mainly takes two parameters:
- a **dataset** (e.g. `training_data`)  
- a **batch size** = number of samples to feed into the model per forward/backward pass.

A larger batch size uses **more memory (RAM/VRAM)** but gives a less noisy gradient estimate, which can improve training stability.

The data is usually represented in the **NCHW format** (the shape of an image tensor):

- **N** → number of samples in the batch  
- **C** → number of channels (for an RGB image, C = 3; for grayscale, C = 1)  
- **H** → image height  
- **W** → image width  

Those four axes define the dimensions of the image tensors handled by PyTorch.

📖 More info on the [DataLoader documentation](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
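
The batching behaviour can be checked on a small synthetic dataset. In this sketch, `toy_dataset` and its 1000 random samples are assumptions made for illustration; the point is that the loader yields `ceil(1000 / 64)` batches and that the last batch holds only the leftover samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 1000 random "images" in NCHW layout with integer labels.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
toy_dataset = TensorDataset(images, labels)

loader = DataLoader(toy_dataset, batch_size=64)

# Record N (the first axis) of every batch.
batch_sizes = [X.shape[0] for X, y in loader]

print(len(loader))       # 16 batches
print(batch_sizes[0])    # 64 samples in a full batch
print(batch_sizes[-1])   # 40 samples left for the last batch (1000 - 15 * 64)
```

If the smaller final batch is a problem, `DataLoader` accepts `drop_last=True` to discard it.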


# II - Create a Model

In [4]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"


print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

Using cpu device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


To create a neural network in PyTorch, we **have to** create a class that inherits from **`nn.Module`**.
[Here](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html) is the official documentation for `nn.Module`.

We have the choice of using an accelerator such as **CUDA**, or sticking with the **CPU**.

---

### Defining the Neural Network

In our class constructor, we first define `nn.Flatten()`.

This layer converts each 2D 28×28 image into a contiguous array of 784 pixel values.
A **contiguous array** is an array stored in an **unbroken block of memory**. [Here](https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays) is an illustrated explanation.
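
As a quick check of what `nn.Flatten()` does to a batch, the sketch below uses a random batch of three images (an assumption for illustration). The batch axis is kept, and only the remaining axes are collapsed.

```python
import torch
from torch import nn

batch = torch.rand(3, 1, 28, 28)   # 3 grayscale images in NCHW layout
flat = nn.Flatten()(batch)         # default start_dim=1 keeps the batch axis

print(flat.shape)                  # torch.Size([3, 784])

# Flattening preserves pixel order, so it matches a plain reshape.
same_as_reshape = torch.equal(flat, batch.reshape(3, -1))
print(same_as_reshape)             # True
```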

Next, we define a **`nn.Sequential`** attribute called `self.linear_relu_stack`.

A **Sequential** layer is a container that allows data to pass sequentially through multiple layers, in our case a mix of linear and ReLU layers.
[Here](https://docs.pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential) is more information about the Sequential module.

---

### Layers Inside `nn.Sequential`

* **`nn.Linear`**: A linear layer applies an affine transformation $y = xA^T + b$ to the input, using its stored weights $A$ and biases $b$.
* **`nn.ReLU`**: A non-linear activation layer that applies $y = \max(0, x)$ element-wise, i.e. $y = 0$ if $x \le 0$, else $y = x$. [More details](https://docs.pytorch.org/docs/stable/generated/torch.nn.ReLU.html)
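
The two layers can be verified by hand on small tensors. In this sketch the layer sizes (4 inputs, 2 outputs) and the input values are arbitrary assumptions; it compares `nn.Linear` against the explicit matrix product and applies `nn.ReLU` to three sample values.

```python
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(in_features=4, out_features=2)
x = torch.randn(3, 4)              # batch of 3 input vectors

# nn.Linear stores its weight with shape (out_features, in_features),
# so the explicit computation is x @ W^T + b.
manual = x @ linear.weight.T + linear.bias
layer_out = linear(x)
print(torch.allclose(manual, layer_out, atol=1e-6))   # True

relu_out = nn.ReLU()(torch.tensor([-2.0, 0.0, 3.0]))
print(relu_out)                    # tensor([0., 0., 3.])
```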

---

### Constructor and Forward Method

The constructor `__init__` is called **once** when the neural network is created.
At this point, all layers (flatten and sequential) are only defined; they are not yet applied to any data.

The **`forward(self, x)`** method is called **every time** data is passed through the network.

* The input `x` is first **flattened**.
* Then it passes through `self.linear_relu_stack` (the linear and ReLU layers).

The output of this method is called **logits**, which are the **unnormalized outputs** of the model.
We often normalize them using a **softmax** function.

Logits are the model's predictions. They need to be compared to the target values in order to improve the model.

With logits, we can define a **loss function** and perform **backpropagation** to train the network's parameters.
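
To illustrate the relationship between logits and probabilities, here is a small sketch with made-up logits for a single sample and three classes. Softmax turns the scores into values that sum to 1 without changing which class has the highest score.

```python
import torch

# Hypothetical logits for one sample and three classes.
logits = torch.tensor([[2.0, 0.5, -1.0]])

probs = torch.softmax(logits, dim=1)
print(probs)
print(probs.sum(dim=1))            # sums to 1 along the class axis

# Softmax is monotonic, so the predicted class is the same either way.
print(logits.argmax(1).item(), probs.argmax(1).item())
```

This is why the predicted class can be read directly from the logits, without applying softmax first.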

# III - Train a Model

In [5]:
loss_fn = nn.CrossEntropyLoss()

To train model parameters, we need to define both a **loss function** and an **optimizer**.

In our case, we’ll use the **Cross Entropy** loss, one of the most common choices among the many different [loss functions available in PyTorch](https://pytorch.org/docs/stable/nn.html#loss-functions).

---

## A quick explanation about the most popular loss functions

For all the following formulas:
- $y_i$ = target value  
- $\hat{y}_i$ = predicted value  
- $N$ = number of samples

---

### 1 - Mean Squared Error (MSELoss)

Used for **regression tasks**, when predicting continuous values (e.g. temperature, house prices, etc.).

$$
\mathrm{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - \hat{y}_i\bigr)^2
$$

MSE penalizes large errors more strongly because we take into consideration the **square value** of the difference between prediction and target.

![MSE](https://miro.medium.com/v2/resize:fit:640/format:webp/1*WfVDoLsarrM5HpO9sh_ZQQ.png)

---

### 2 - Mean Absolute Error (L1Loss)

Used for **regression** when you want **robustness to outliers**.

$$
\mathrm{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \bigl|y_i - \hat{y}_i\bigr|
$$

Unlike MSE, it applies a **linear penalty** instead of a squared one.  
That makes it less sensitive to outliers, but its gradient has a constant magnitude and is undefined at zero error, which can make it harder to converge precisely near the minimum.

![L1_loss](https://miro.medium.com/v2/resize:fit:640/format:webp/1*0hbNOtpfr6aoR_Bmty-JkA.jpeg)
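
The difference between the two regression losses above is easiest to see with an outlier. The values in this sketch are made up for illustration: three predictions are close to their targets and one is off by 10.

```python
import torch
from torch import nn

target = torch.tensor([1.0, 2.0, 3.0, 4.0])
pred = torch.tensor([1.5, 2.0, 2.5, 14.0])   # last prediction is an outlier (error of 10)

mse = nn.MSELoss()(pred, target)   # (0.25 + 0 + 0.25 + 100) / 4
mae = nn.L1Loss()(pred, target)    # (0.5 + 0 + 0.5 + 10) / 4

print(mse.item())                  # 25.125
print(mae.item())                  # 2.75
```

The single outlier accounts for almost all of the MSE, while its contribution to the MAE grows only linearly.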

---

### 3 - Cross Entropy Loss

The **Cross Entropy Loss** is used for **multi-class classification** problems.

For each input, the model outputs a set of **scores (logits)**, one per class.  
We then convert those scores into probabilities that sum to 1 using the **Softmax** function (e.g. 0.3 → 30% dog, 0.5 → 50% cat, etc.).

The loss compares the predicted probabilities to the true class label, which is represented as a **“perfect” distribution** (1 for the correct class, 0 for the others).

$$
\mathrm{CrossEntropy}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log\left( \frac{e^{\hat{y}_{i, y_i}}}{\sum_{j} e^{\hat{y}_{i,j}}} \right)
$$

#### Explanation:

1. Apply **Softmax** to convert logits into probabilities:

$$
p_{i,j} = \frac{e^{\hat{y}_{i,j}}}{\sum_{k} e^{\hat{y}_{i,k}}}
$$

2. Then compute the average **negative log-likelihood** of the correct class:

$$
\mathrm{CrossEntropy}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log\left(p_{i, y_i}\right)
$$

This encourages the model to assign higher probabilities to the correct class.

![cross_entropy](https://ml-cheatsheet.readthedocs.io/en/latest/_images/cross_entropy.png)
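
The two steps above (softmax, then average negative log-likelihood of the correct class) can be reproduced by hand and compared with `nn.CrossEntropyLoss`. The logits and targets in this sketch are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

# Hypothetical logits for 2 samples and 3 classes, with their true class indices.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 1])

# Step 1: softmax over the class axis.
probs = torch.softmax(logits, dim=1)

# Step 2: pick the probability of the correct class and average the negative logs.
picked = probs[torch.arange(len(targets)), targets]
manual_ce = -picked.log().mean()

builtin_ce = nn.CrossEntropyLoss()(logits, targets)
print(manual_ce.item(), builtin_ce.item())
```

Note that `nn.CrossEntropyLoss` takes the raw logits as input, so the model should not apply softmax itself before the loss.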

---

### 4 - Binary Cross-Entropy (BCE)

Used for **binary classification** (e.g. spam vs. not spam).

$$
\mathrm{BCE}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$

Here:
- The target $y_i \in \{0, 1\}$  
- The prediction $\hat{y}_i \in [0, 1]$ represents the **probability** of belonging to class 1 (e.g. class spam = 1, not spam = 0).

To get $\hat{y}_i$, we apply a **sigmoid** to the raw model output (logit):

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

If you want to work directly with logits (without applying sigmoid yourself), use `nn.BCEWithLogitsLoss`, which combines both operations safely.

![binary_Xentropy](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F54c97fda8af4dccc23d58bd14cd95802df6f1e49-393x272.png&w=640&q=75)
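
As a sanity check on the two ways of computing BCE, the sketch below uses three made-up logits and labels. Applying a sigmoid followed by `nn.BCELoss` gives the same value as `nn.BCEWithLogitsLoss` on the raw logits.

```python
import torch
from torch import nn

z = torch.tensor([2.0, -1.0, 0.5])   # hypothetical raw model outputs (logits)
y = torch.tensor([1.0, 0.0, 1.0])    # binary targets as floats

bce_from_probs = nn.BCELoss()(torch.sigmoid(z), y)
bce_from_logits = nn.BCEWithLogitsLoss()(z, y)

print(bce_from_probs.item(), bce_from_logits.item())
```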

---

### 5 - Negative Log Likelihood Loss (NLLLoss)

Used for **multi-class classification** when your model already outputs **log-probabilities** instead of raw logits.

$$
\mathrm{NLL}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}
$$

This is mathematically equivalent to the **Cross Entropy Loss**, but expects the input to already be log-softmaxed.

In PyTorch, that’s done using `nn.LogSoftmax`:

$$
\log p_{i,j} = \hat{y}_{i,j} - \log \left( \sum_{k=1}^{C} e^{\hat{y}_{i,k}} \right)
$$

Substituting it gives:

$$
\mathrm{NLL}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{i, y_i} - \log \sum_{k=1}^{C} e^{\hat{y}_{i,k}} \right)
$$

In practice, we rarely use NLLLoss directly because **`CrossEntropyLoss` already combines `LogSoftmax` and `NLLLoss`** for better numerical stability.
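
This equivalence can be checked numerically. The logits and targets below are arbitrary illustrative values; the sketch chains `nn.LogSoftmax` and `nn.NLLLoss` and compares the result with `nn.CrossEntropyLoss` on the same logits.

```python
import torch
from torch import nn

logits = torch.tensor([[1.2, -0.3, 0.8],
                       [0.1, 0.4, 2.0]])
targets = torch.tensor([2, 2])

# Two-step version: log-probabilities first, then NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# One-step version on raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```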

---

### 6 - Huber Loss (`nn.HuberLoss`)

Used for **regression** tasks with both small and large errors.  
It’s a compromise between **MSE** (sensitive to outliers) and **MAE** (robust but non-smooth).

$$
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2} (y - \hat{y})^2, & \text{if } |y - \hat{y}| \le \delta, \\
\delta \cdot \bigl(|y - \hat{y}| - \tfrac{1}{2}\delta \bigr), & \text{otherwise.}
\end{cases}
$$

The parameter $\delta$ defines the transition point between quadratic and linear loss.  
You can tune it experimentally to get the best fit for your dataset.

**Huber Loss** is a good general-purpose choice when you want stability and robustness in regression.
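
The piecewise formula can be written out directly and compared with PyTorch. This sketch assumes `nn.HuberLoss`, which takes `delta` as a parameter (`nn.SmoothL1Loss` is parameterized by `beta` instead and coincides with it when both are 1). The predictions are made-up values chosen to hit both branches.

```python
import torch
from torch import nn

delta = 1.0
target = torch.tensor([0.0, 0.0, 0.0])
pred = torch.tensor([0.5, 1.0, 3.0])   # errors: 0.5 and 1.0 (quadratic), 3.0 (linear)

err = (pred - target).abs()
manual = torch.where(
    err <= delta,
    0.5 * err ** 2,                    # quadratic branch for small errors
    delta * (err - 0.5 * delta),       # linear branch for large errors
).mean()

builtin = nn.HuberLoss(delta=delta)(pred, target)
print(manual.item(), builtin.item())
```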

---

### 7 - KL Divergence Loss (`nn.KLDivLoss`)

Used for comparing **probability distributions** — for example, in **Variational Autoencoders** or **Knowledge Distillation**.

It measures how one probability distribution $P$ diverges from another $Q$:

$$
D_{KL}(P || Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
$$

In simple terms, it tells us how much information is lost when we use $Q$ to approximate $P$.
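
A small numerical sketch of the formula, with two made-up distributions over three outcomes. One detail to be aware of: `torch.nn.functional.kl_div` (the functional form of `nn.KLDivLoss`) expects its first argument in log-space, and that argument plays the role of $Q$.

```python
import torch
import torch.nn.functional as F

# Hypothetical distributions over 3 outcomes.
P = torch.tensor([0.7, 0.2, 0.1])   # reference distribution
Q = torch.tensor([0.5, 0.3, 0.2])   # approximation

manual_kl = (P * (P / Q).log()).sum()

# First argument: log of the approximating distribution; second: the target.
builtin_kl = F.kl_div(Q.log(), P, reduction="sum")

# KL divergence is not symmetric: swapping P and Q gives a different value.
reverse_kl = (Q * (Q / P).log()).sum()

print(manual_kl.item(), builtin_kl.item(), reverse_kl.item())
```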


In [6]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

## A quick explanation about the optimizer

An **optimizer** is the algorithm that updates the model's parameters (weights and biases) based on the gradients computed during training.

During one training pass we have:
- the **forward** method that generates the logits (the model’s predictions),
- the **loss** function that computes the error,
- a **backward pass** (`loss.backward()`) that calculates the gradients,
- and finally the **optimizer step** (`optimizer.step()`) that updates the weights.

We denote the loss by $L$. The partial derivative of the loss with respect to a parameter $\theta_i$ is written $\dfrac{\partial L}{\partial \theta_i}$; it tells us how much, and in which direction, the loss changes when $\theta_i$ changes. This notion is central to everything that follows.

---

### 1 - Stochastic Gradient Descent (SGD)

This is the **simplest** and most classic optimizer.

**Stochastic** means it uses a random subset of data (a *batch*) instead of the full dataset for each update.  
If we use the entire dataset at once, that’s **Batch Gradient Descent**.  
If we use only one example at a time, that’s **Pure SGD**.  
In practice, we use small batches (like 64 samples), so this is called **Mini-batch SGD**.

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)
$$

Where:
- $\theta_t$ are the model parameters (weights) at step *t*,
- $\eta$ is the **learning rate**,  
- and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters.

$$
\nabla_\theta L(\theta_t) = 
\begin{bmatrix}
\dfrac{\partial L}{\partial \theta_1} \\
\dfrac{\partial L}{\partial \theta_2} \\
\vdots \\
\dfrac{\partial L}{\partial \theta_n}
\end{bmatrix}
$$

In PyTorch, using `torch.optim.SGD()` requires:
- the model parameters (`model.parameters()`)
- and the learning rate `lr`.

The learning rate controls how big each update step is.  
We usually find the best `lr` experimentally:
- `lr = 1e-3` → small, gentle updates  
- `lr = 1e-1` → large, aggressive updates

**Limitations:**
- Same learning rate $\eta$ for all parameters  
- Sensitive to gradient scale  
- Can oscillate and converge slowly, especially in deep networks  

Because of those limitations, plain SGD is mostly used for **small or simple models**.
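
The update rule can be checked against `torch.optim.SGD` on a toy problem. In this sketch the loss $L(\theta) = \sum_i \theta_i^2$ and the starting values are assumptions chosen so that the gradient ($2\theta$) is easy to verify by hand.

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = torch.optim.SGD([theta], lr=lr)

loss = (theta ** 2).sum()          # toy loss, gradient is 2 * theta
loss.backward()

# theta_{t+1} = theta_t - eta * grad, computed by hand before the optimizer runs.
expected = theta.detach() - lr * theta.grad

optimizer.step()
print(theta.detach())              # tensor([ 0.8000, -1.6000])
```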

---

### 2 - SGD with Momentum

We can improve SGD by adding **momentum** to make training faster and smoother.

$$
v_t = \beta v_{t-1} + (1 - \beta) \, \nabla_\theta L(\theta_t)
$$

$$
\theta_{t+1} = \theta_t - \eta v_t
$$

Here, $v_t$ is a **velocity** term that accumulates previous gradients.  
It helps reduce oscillations and accelerates learning in consistent directions.

We usually set the momentum coefficient $\beta = 0.9$.

**Limitations:**
- Still uses a global learning rate for all parameters (no per-parameter adaptation)  
- Can be sensitive to the choice of learning rate and momentum  

**When to use it:**
- Common for **CNNs**  
  - Stable and smooth updates on dense loss surfaces  
  - Often generalizes better than adaptive methods  
  - Memory efficient (stores only one extra value per parameter)  
  - Works well with learning-rate schedules (step decay, cosine annealing)
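
The two momentum equations above translate almost line for line into code. This is a plain-Python sketch on the toy loss $L(\theta) = \theta^2$; the function name `momentum_descent` and its default values are hypothetical. Note that `torch.optim.SGD(momentum=...)` uses a closely related but not identical formulation, so this is an illustration of the equations rather than a reimplementation of that optimizer.

```python
def momentum_descent(grad_fn, theta, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with momentum, following the two equations above."""
    v = 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        v = beta * v + (1 - beta) * g   # velocity: moving average of gradients
        theta = theta - lr * v          # parameter update uses the velocity
    return theta


# Toy loss L(theta) = theta^2, whose gradient is 2 * theta and whose minimum is at 0.
theta_final = momentum_descent(lambda t: 2 * t, theta=5.0)
print(theta_final)
```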

---

### 3 - Adagrad

$$
G_t = G_{t-1} + (\nabla_\theta L(\theta_t))^2
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \, \nabla_\theta L(\theta_t)
$$

Here, $G_t$ is a **running sum of squared gradients**, and $\varepsilon$ is a small constant to prevent division by zero.

Each parameter keeps its own $G_t$, which means **each parameter has its own learning rate**.  
As $G_t$ accumulates over time, the denominator grows and the effective learning rate decreases.

This makes Adagrad great when gradients are **sparse**, such as in **NLP embeddings** or **recommendation systems**.

**Limitation:**  
Because $G_t$ keeps increasing, the learning rate can shrink too much, stopping training prematurely.
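
The shrinking step size is easy to see numerically. This plain-Python sketch follows the formula above for a single parameter; the helper name `adagrad_effective_lrs` and the constant gradient stream are assumptions made for illustration.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate lr / sqrt(G_t + eps) after each gradient."""
    G = 0.0
    rates = []
    for g in grads:
        G += g ** 2                     # running sum of squared gradients
        rates.append(lr / math.sqrt(G + eps))
    return rates


# With a constant gradient of 1.0, G_t = t and the rate decays like lr / sqrt(t).
rates = adagrad_effective_lrs([1.0, 1.0, 1.0, 1.0])
print(rates)
```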

---

### 4 - RMSProp

RMSProp fixes Adagrad’s limitation by keeping an **exponentially decaying average** of past squared gradients instead of a cumulative sum.

$$
v_t = \alpha v_{t-1} + (1 - \alpha) \, (\nabla_\theta L(\theta_t))^2
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \varepsilon}} \, \nabla_\theta L(\theta_t)
$$

Here:
- $v_t$ is the exponentially weighted moving average of squared gradients,  
- $\alpha$ (typically 0.9) controls the decay rate.

We keep about 90% of the old value and add 10% of the new squared gradient.  
This keeps $v_t$ stable and prevents learning rates from vanishing, unlike Adagrad.

RMSProp is often used in **RNNs**, where gradient magnitudes vary a lot.
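
For comparison with Adagrad, here is the same kind of plain-Python sketch for the RMSProp formula above. The helper name `rmsprop_effective_lrs` and the constant gradient stream are illustrative assumptions. With a constant gradient $g$, $v_t$ converges to $g^2$, so the effective learning rate levels off near $\eta / |g|$ instead of decaying toward zero.

```python
import math


def rmsprop_effective_lrs(grads, lr=0.1, alpha=0.9, eps=1e-8):
    """Return the effective learning rate lr / sqrt(v_t + eps) after each gradient."""
    v = 0.0
    rates = []
    for g in grads:
        v = alpha * v + (1 - alpha) * g ** 2   # decaying average of squared gradients
        rates.append(lr / math.sqrt(v + eps))
    return rates


rates = rmsprop_effective_lrs([1.0] * 200)
print(rates[0], rates[-1])   # starts higher (v_t starts at 0), then settles near lr / |g|
```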

---

### 5 - Adaptive Moment Estimation (Adam)

Adam combines the ideas of **Momentum** and **RMSProp**:
- it keeps a running average of past gradients $m_t$ (the **first moment**),
- and a running average of squared gradients $v_t$ (the **second moment**).

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, \nabla_\theta L(\theta_t)
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, (\nabla_\theta L(\theta_t))^2
$$

Both $m_t$ and $v_t$ start at zero, so early in training they are biased toward 0.  
We correct this with **bias-corrected estimates**:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad 
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

Then the parameters are updated as:

$$
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
$$

Here:
- The numerator $\hat{m}_t$ gives the update **direction** (the mean of past gradients),
- The denominator $\sqrt{\hat{v}_t}$ scales the step size according to the recent gradient **magnitude** (the uncentered second moment),
- $\varepsilon$ (usually $10^{-8}$) prevents division by zero.

Because $\beta_1$ and $\beta_2$ are close to 1 (usually 0.9 and 0.999),  
the bias correction is strong at the beginning and fades over time ($\beta^{t+1} < \beta^t$), letting $m_t$ and $v_t$ settle to their true averages.

Adam adapts the learning rate per parameter and smooths updates, making it robust to noisy gradients and suitable for most architectures.

This optimizer is the **default choice** for most deep learning models (CNNs, RNNs, Transformers) because it converges fast and needs little tuning.
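
The equations above can be followed step by step and compared with `torch.optim.Adam`. This sketch assumes a toy loss $L(\theta) = \sum_i \theta_i^2$ (gradient $2\theta$) and the usual default hyperparameters; it runs three updates with both the optimizer and a hand-written version.

```python
import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

theta = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=lr, betas=(beta1, beta2), eps=eps)

# State for the hand-written version.
manual_theta = theta.detach().clone()
m = torch.zeros_like(manual_theta)
v = torch.zeros_like(manual_theta)

for t in range(1, 4):
    # PyTorch update.
    optimizer.zero_grad()
    loss = (theta ** 2).sum()
    loss.backward()
    optimizer.step()

    # Manual update following the equations above.
    g = 2 * manual_theta                       # analytic gradient of the toy loss
    m = beta1 * m + (1 - beta1) * g            # first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    manual_theta = manual_theta - lr * m_hat / (v_hat.sqrt() + eps)

print(theta.detach(), manual_theta)
```

On the first step, $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so each parameter moves by roughly $\eta$ regardless of the gradient's scale.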

---

### 6 - AdamW (Adam with Decoupled Weight Decay)

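In the original Adam, **weight decay** is usually implemented as an L2 penalty added to the gradient, so it gets rescaled by the adaptive term $\sqrt{\hat{v}_t}$ and no longer acts as a uniform decay on the weights.  
**AdamW** decouples weight decay from the adaptive update: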

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, \nabla_\theta L(\theta_t)
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, (\nabla_\theta L(\theta_t))^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda \theta_t \right)
$$

Only the final update changes — the term $\eta \lambda \theta_t$ directly penalizes large weights.  
The weight-decay coefficient $\lambda$ is typically set around $10^{-2}$.

AdamW leads to **better generalization** for large models such as **Transformers**, **LLMs**, and **large-scale vision models**.
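
A single AdamW step can be checked against the final update equation above. The toy loss $L(\theta) = \sum_i \theta_i^2$ and the starting values are assumptions for illustration. On the first step the bias-corrected moments reduce to $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, which keeps the hand computation short.

```python
import torch

lr, beta1, beta2, eps, weight_decay = 1e-3, 0.9, 0.999, 1e-8, 1e-2

theta = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.AdamW(
    [theta], lr=lr, betas=(beta1, beta2), eps=eps, weight_decay=weight_decay
)

theta0 = theta.detach().clone()
loss = (theta ** 2).sum()          # toy loss, gradient is 2 * theta
loss.backward()
g = theta.grad.clone()
optimizer.step()

# First step: m_hat = g and v_hat = g^2 after bias correction.
m_hat, v_hat = g, g ** 2
expected = theta0 - lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * theta0)

print(theta.detach(), expected)
```

The decay term `weight_decay * theta0` is applied to the weights directly and is never divided by $\sqrt{\hat{v}_t}$, which is the decoupling described above.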


In [7]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

In the dataloader, we set the batch size to 64, so we are using **Mini-batch SGD**, as seen previously.

In the `train(dataloader, model, loss_fn, optimizer)` function, here’s what happens step by step:

* **`size = len(dataloader.dataset)`**
  Retrieves the total number of samples in the dataset.

* **`model.train()`**
  Sets the model in **training mode**.
  Some layers like `Dropout` or `BatchNorm` behave differently during training and evaluation, so this line is essential before starting a training pass.

* **`for batch, (X, y) in enumerate(dataloader):`**
  Iterates over each batch of data.
  Since we’re using `enumerate`, `batch` is just the batch index.
  `X` contains the input tensors (shape $[N, C, H, W]$ for images), and `y` contains the corresponding labels (shape $[N]$ for classification tasks).

* **`X, y = X.to(device), y.to(device)`**
  Moves the inputs and labels to the same device as the model (CPU or GPU).
  This step is mandatory if the model is on GPU, otherwise you’ll get a device mismatch error.

* **`pred = model(X)`**
  Performs a **forward pass** through the model.
  This line implicitly calls the model’s `forward(self, X)` method and returns the raw outputs (called **logits** in classification tasks).

* **`loss = loss_fn(pred, y)`**
  Computes the **loss value**, i.e. how far the predictions are from the true labels.
  For example, if we use cross-entropy, this compares the predicted probabilities to the true class labels.
  The result is a single scalar value.

* **`loss.backward()`**
  This triggers **automatic differentiation** using backpropagation.
  PyTorch computes the gradient of the loss with respect to each parameter in the model:

  $$
  \frac{\partial L}{\partial \theta_i}
  $$

  These gradients are then stored in the `.grad` attribute of each parameter that has `requires_grad=True`.

* **`optimizer.step()`**
  The optimizer uses those gradients to **update the model’s parameters** according to the optimization rule (for example, SGD or Adam).
  This is where learning actually happens.

* **`optimizer.zero_grad()`**
  Clears the previously stored gradients in `.grad`.
  This avoids **gradient accumulation** from multiple backward passes, which would otherwise cause incorrect updates.

Finally, the rest of the code simply **prints the loss every 100 batches** to monitor training progress.
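
The reason for clearing gradients can be demonstrated on a single tensor. In this sketch, `w` and the toy loss are assumptions for illustration; calling `backward()` twice without resetting adds the new gradient to the stored one.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()             # gradient is 2 * w

(w ** 2).sum().backward()          # no reset in between
accumulated = w.grad.clone()       # the second gradient was added to the first

w.grad.zero_()                     # manual reset; optimizer.zero_grad() serves the same purpose
(w ** 2).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```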

In [8]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

The `test(dataloader, model, loss_fn)` function is an evaluation loop. It checks how well the model performs on unseen data, i.e. the test dataset.

Some parts of this function need explanation:

- **`model.eval()`**: switches the model to evaluation mode, just as `model.train()` switches it to training mode. This matters for layers such as `Dropout` or `BatchNorm`, which behave differently in the two modes.

- **`with torch.no_grad()`**: tells PyTorch **not to track gradients** or store intermediate values for backpropagation. Since we are only evaluating and not training, we do not need gradients. Disabling them reduces memory usage and speeds up computation.

- **`test_loss += loss_fn(pred, y).item()`**: computes the loss as in the `train` function. `.item()` extracts the scalar value from the tensor. The per-batch losses are summed here and divided by `num_batches` at the end to get the average loss.

- **`correct += (pred.argmax(1) == y).type(torch.float).sum().item()`**:

    - `pred.argmax(1)` takes the index of the highest logit along dimension 1, i.e. the predicted class.
    - `== y` compares the predictions with the target values and returns a tensor of booleans.
    - `.type(torch.float)` converts the booleans to floats.
    - `.sum().item()` sums them to get the number of correct predictions in this batch.
    - `correct +=` accumulates the number of correct predictions over all batches.
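
The accuracy computation can be traced on a tiny hand-made batch. The logits and labels below are made up for illustration: four samples, three classes, and one deliberate mistake on the third sample.

```python
import torch

# Hypothetical logits for 4 samples and 3 classes, with their true labels.
pred = torch.tensor([[2.0, 0.1, 0.3],
                     [0.2, 1.5, 0.1],
                     [0.9, 0.8, 0.1],
                     [0.1, 0.2, 3.0]])
y = torch.tensor([0, 1, 1, 2])

with torch.no_grad():
    predicted_classes = pred.argmax(1)          # tensor([0, 1, 0, 2])
    matches = predicted_classes == y            # tensor([True, True, False, True])
    correct = matches.type(torch.float).sum().item()

accuracy = correct / len(y)
print(correct, accuracy)                        # 3.0 0.75
```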


In [9]:
epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 2.296837  [   64/60000]
loss: 2.292622  [ 6464/60000]
loss: 2.270679  [12864/60000]
loss: 2.273979  [19264/60000]
loss: 2.249237  [25664/60000]
loss: 2.230193  [32064/60000]
loss: 2.225190  [38464/60000]
loss: 2.196285  [44864/60000]
loss: 2.198832  [51264/60000]
loss: 2.181857  [57664/60000]
Test Error: 
 Accuracy: 43.4%, Avg loss: 2.165192 

Epoch 2
-------------------------------
loss: 2.169605  [   64/60000]
loss: 2.165305  [ 6464/60000]
loss: 2.109302  [12864/60000]
loss: 2.130728  [19264/60000]
loss: 2.080288  [25664/60000]
loss: 2.032074  [32064/60000]
loss: 2.043264  [38464/60000]
loss: 1.978059  [44864/60000]
loss: 1.987979  [51264/60000]
loss: 1.921017  [57664/60000]
Test Error: 
 Accuracy: 61.5%, Avg loss: 1.912943 

Epoch 3
-------------------------------
loss: 1.941181  [   64/60000]
loss: 1.917256  [ 6464/60000]
loss: 1.802822  [12864/60000]
loss: 1.838444  [19264/60000]
loss: 1.735825  [25664/60000]
loss: 1.692222  [32064/60000]
...

If needed, we can save the model with this command:

In [10]:
torch.save(model.state_dict(), "model.pth")
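
Loading the weights back follows the reverse path: build a model with the same architecture, then load the saved state dict into it. The sketch below is self-contained, so it uses a small hypothetical model (`make_model`) and a temporary file instead of the `NeuralNetwork` class and `model.pth` from this notebook.

```python
import os
import tempfile

import torch
from torch import nn


def make_model():
    # Hypothetical stand-in for the NeuralNetwork class defined earlier.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))


model = make_model()
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

# Same architecture, freshly initialized weights, then overwritten by the saved ones.
restored = make_model()
restored.load_state_dict(torch.load(path, weights_only=True))
restored.eval()

x = torch.rand(2, 1, 28, 28)
with torch.no_grad():
    same_output = torch.equal(model(x), restored(x))
print(same_output)   # True
```

Because only the state dict is stored, the class definition must be available when loading the file.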