![](images/pytorch-logo-dark.png)

# PyTorch 101: Building a Model Step-by-Step

## Introduction

**PyTorch** is the **fastest growing** Deep Learning framework and it is also used by **Fast.ai** in its MOOC, [Deep Learning for Coders](https://course.fast.ai/) and its [library](https://docs.fast.ai/).

PyTorch is also very *pythonic*, meaning, it feels more natural to use it if you already are a Python developer.

Besides, using PyTorch may even improve your health, according to [Andrej Karpathy](https://twitter.com/karpathy/status/868178954032513024) :-)

<p align="center">
<img src="images/tweet_karpathy.png">
</p>

## Motivation

There are *many many* PyTorch tutorials around and its documentation is quite complete and extensive. So, **why** should you keep reading this step-by-step tutorial?

Well, even though one can find information on pretty much anything PyTorch can do, I missed having a **structured, incremental and from first principles** approach to it.

In this tutorial, I will guide you through the *main reasons* why PyTorch makes it much **easier** and more **intuitive** to build a Deep Learning model in Python — **autograd, dynamic computation graph, model classes** and more.

## Agenda

<h3>
<ul>
    <li>A Simple Problem - Linear Regression</li>
</ul>
<ul>
    <li>PyTorch: tensors, tensors, tensors</li>
</ul>
<ul>
    <li>Gradient Descent in 5 easy steps!</li>
</ul>
<ul>
    <li>Autograd, your companion for all your gradient needs!</li>
</ul>
<ul>
    <li>Dynamic Computation Graph: what is that?</li>
</ul>
<ul>
    <li>Optimizer: learning the parameters step-by-step</li>
</ul>
<ul>
    <li>Loss: aggregating errors into a single value</li>
</ul>
<ul>
    <li>Model: making predictions</li>
</ul>
<ul>
    <li>Dataset</li>
</ul>
<ul>
    <li>DataLoader, splitting your data into mini-batches</li>
</ul>
<ul>
    <li>Evaluation: does it generalize?</li>
</ul>
<ul>
    <li>Saving (and loading) models: taking a break</li>
</ul>
</h3>

## A Simple Problem - Linear Regression

Most tutorials start with some nice and pretty *image classification problem* to illustrate how to use PyTorch. It may seem cool, but I believe it **distracts** you from the **main goal: how PyTorch works**?

For this reason, in this tutorial, I will stick with a **simple** and **familiar** problem: a **linear regression with a single feature x**! It doesn’t get much simpler than that…

$$
\large y = a + b x + \epsilon
$$

We can also think of it as the **simplest neural network**: one node, one input, one output, linear activation function.

<p align="center">
<img src="images/NNs_bias_2.png">
</p>

<p align="center">
Adapted from <a href="http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/">Source</a>
</p>

### Data Generation

Let’s start by **generating** some synthetic data: we start with a vector of 100 points for our **feature x** and create our **labels** using **a = 1, b = 2** and some Gaussian noise.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
true_a = 1
true_b = 2
N = 100

# Data Generation
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_a + true_b * x + .1 * np.random.randn(N, 1)

### Train / Validation Split

Next, let’s **split** our synthetic data into **train** and **validation** sets, shuffling the array of indices and using the first 80 shuffled points for training.

In [None]:
# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].scatter(x_train, y_train)
ax[0].set_xlabel('x')
ax[0].set_ylabel('y')
ax[0].set_ylim([1, 3])
ax[0].set_title('Generated Data - Train')
ax[1].scatter(x_val, y_val, c='r')
ax[1].set_xlabel('x')
ax[1].set_ylabel('y')
ax[1].set_ylim([1, 3])
ax[1].set_title('Generated Data - Validation')

## PyTorch: tensors, tensors, tensors

In [None]:
!pip install --quiet torchviz
import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot

First, we need to cover a **few basic concepts** that may throw you off-balance if you don’t grasp them well enough before going full-force on modeling.

In Deep Learning, we see **tensors** everywhere. Well, Google’s framework is called *TensorFlow* for a reason! *What is a tensor, anyway*?

### Tensors

In *Numpy*, you may have an **array** that has **three dimensions**, right? That is, technically speaking, a **tensor**.

A **scalar** (a single number) has **zero** dimensions, a **vector has one** dimension, a **matrix has two** dimensions and a **tensor has three or more dimensions**. That’s it!

But, to keep things simple, it is commonplace to call vectors and matrices tensors as well — so, from now on, **everything is either a scalar or a tensor**.

![alt text](images/linear_dogs.jpg)
Tensors are just higher-dimensional matrices :-) [Source](http://karlstratos.com)

You can create **tensors** in PyTorch pretty much the same way you create **arrays** in Numpy. Using [**tensor()**](https://pytorch.org/docs/stable/torch.html#torch.tensor) you can create either a scalar or a tensor.

PyTorch offers equivalents of many Numpy functions, like: [**ones()**](https://pytorch.org/docs/stable/torch.html#torch.ones), [**zeros()**](https://pytorch.org/docs/stable/torch.html#torch.zeros), [**rand()**](https://pytorch.org/docs/stable/torch.html#torch.rand), [**randn()**](https://pytorch.org/docs/stable/torch.html#torch.randn) and many more.

In [None]:
scalar = torch.tensor(3.14159)
vector = torch.tensor([1, 2, 3])
matrix = torch.ones((2, 3), dtype=torch.float)
tensor = torch.randn((2, 3, 4), dtype=torch.float)

print(scalar)
print(vector)
print(matrix)
print(tensor)

You can get the shape of a tensor using its [**size()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.size) method or its **shape** attribute.

In [None]:
print(tensor.size(), tensor.shape)

You can also reshape a tensor using its [**reshape()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.reshape) or [**view()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) methods.

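Beware: **view()** always returns a tensor that **shares the underlying data** with the original tensor, and **reshape()** does the same whenever it can, copying the data only when the requested shape cannot be expressed as a view!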

In [None]:
new_tensor1 = tensor.reshape(2, -1)
new_tensor2 = tensor.view(2, -1)
print(new_tensor1.shape, new_tensor2.shape)

If you want to copy all data for real, that is, duplicate it in memory, you should use [**clone()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.clone), which returns a brand new tensor. Its in-place cousin, [**copy_()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.copy_), copies the values of another tensor into a tensor that already exists.
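To make the difference between sharing and copying concrete, here is a minimal sketch. The variable names (`original`, `shared`, `duplicate`, `flat`) are just illustrative, and comparing `data_ptr()` values is one simple way to check whether two tensors start at the same memory address.

```python
import torch

original = torch.zeros(2, 3)

# view() never copies: writing through the view changes the original
shared = original.view(3, 2)
shared[0, 0] = 1.0

# clone() duplicates the data: writing to the clone leaves the original alone
duplicate = original.clone()
duplicate[0, 1] = 5.0

# reshape() on a non-contiguous tensor (here, a transpose) has to copy
flat = original.t().reshape(-1)

print(original)
print(shared.data_ptr() == original.data_ptr())  # same memory
print(flat.data_ptr() == original.data_ptr())    # different memory
```

The first element of `original` changes because of the write through `shared`, while the write to `duplicate` never shows up in it.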

### Loading Data, Devices and CUDA

“*How do we go from Numpy’s arrays to PyTorch’s tensors*”, you ask?

That’s what [**from_numpy()**](https://pytorch.org/docs/stable/torch.html#torch.from_numpy) is good for. It returns a **CPU tensor**, though.

You can also easily **cast** it to a lower precision (32-bit float) using [**float()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.float).

In [None]:
# Our data was in Numpy arrays, but we need to transform them into PyTorch's Tensors
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()

print(type(x_train), type(x_train_tensor))

“*But I want to use my fancy GPU…*”, you say.

No worries, that’s what [**to()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to) is good for. It sends your tensor to whatever **device** you specify, including your **GPU** (referred to as `cuda` or `cuda:0`).

“*What if I want my code to fall back to CPU if no GPU is available?*”, you may be wondering…

PyTorch got your back once more — you can use [**cuda.is_available()**](https://pytorch.org/docs/stable/cuda.html?highlight=is_available#torch.cuda.is_available) to find out if you have a GPU at your disposal and set your device accordingly.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays, but we need to transform them into PyTorch's Tensors
x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

print(type(x_train), type(x_train_tensor))

If you compare the **types** of both variables, you’ll get what you’d expect: `numpy.ndarray` for the first one and `torch.Tensor` for the second one.

But where does your nice tensor “live”? In your CPU or your GPU? You can’t say… but if you use PyTorch’s **type()**, it will reveal its **location**: `torch.cuda.FloatTensor` for a GPU tensor, or `torch.FloatTensor` if the code fell back to the CPU.

In [None]:
print(x_train_tensor.type())

We can also go the other way around, turning tensors back into Numpy arrays, using [**numpy()**](https://pytorch.org/docs/stable/tensors.html?highlight=numpy#torch.Tensor.numpy). It should be easy as `x_train_tensor.numpy()` but…

In [None]:
x_train_tensor.numpy()

Unfortunately, Numpy **cannot** handle GPU tensors… you need to make them CPU tensors first using [**cpu()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.cpu).

In [None]:
x_train_tensor.cpu().numpy()

### Creating Tensor for Parameters

What distinguishes a *tensor* used for *data* — like the ones we’ve just created — from a **tensor** used as a (*trainable*) **parameter/weight**?

The latter tensors require the **computation of their gradients**, so we can **update** their values (the parameters’ values, that is). That’s what the **`requires_grad=True`** argument is good for. It tells PyTorch we want it to compute gradients for us.

---

<h2><b><i>A tensor for a learnable parameter requires gradient!</i></b></h2>

---

You may be tempted to create a simple tensor for a parameter and, later on, send it to your chosen device, as we did with our data, right?

Actually, you should **assign** tensors to a **device** at the moment of their **creation** to avoid unexpected behaviors...

In [None]:
# We can specify the device at the moment of creation - RECOMMENDED!
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(a, b)
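One of those "unexpected behaviors" can be sketched as follows. I am assuming here that the surprise in question is the loss of *leaf* status: on a machine with a GPU, `to()` is itself an operation tracked by PyTorch, so its result is no longer the tensor we asked gradients for. The names `moved` and `created` are only illustrative.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Created on the CPU first, then moved: on a GPU machine, to() returns a NEW
# tensor produced by an operation, so it is no longer a leaf of the graph
moved = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)

# Created directly on the target device: always a leaf tensor
created = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

print(device, moved.is_leaf, created.is_leaf)
```

On a CPU-only machine both flags are `True`, because `to()` simply hands back the same tensor. On a GPU machine, `moved.is_leaf` is `False`, and calling `backward()` later would not populate `moved.grad`.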

Now that we know how to create tensors that require gradients, let’s step back and review the algorithm that actually uses those gradients. We’ll come back to how PyTorch computes them in the Autograd section.

## Gradient Descent in 5 easy steps!

Gradient descent is the most common **optimization algorithm** in Machine Learning and Deep Learning.

The purpose of using gradient descent is **to minimize the loss**, that is, **minimize the errors between predictions and actual values** (and sometimes some other term as well).

It goes beyond the scope of this tutorial to fully explain how gradient descent works, but I'll cover the **five basic steps** you'd need to go through to perform it, preceded by an initialization step, namely:

- Step 0: Randomly initialize parameters / weights
- Step 1: Compute model's predictions - forward pass
- Step 2: Compute loss
- Step 3: Compute the gradients
- Step 4: Update the parameters
- Step 5: Rinse and repeat!

---

If you want to learn more about gradient descent, check the following resources:
- [**Linear Regression Simulator**](https://www.mladdict.com/linear-regression-simulator), which goes through the very same steps listed here
- [**A Visual and Interactive Guide to the Basics of Neural Networks**](http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/)
- [**Gradient Descent Algorithms and Its Variants**](https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3)

---

### Step 0: Initialization

Technically, this step is not part of gradient descent, but it is an important step nonetheless.

For training a model, you need to **randomly initialize the parameters/weights** (we have only two, **a** and **b**).

Make sure to *always initialize your random seed* to ensure **reproducibility** of your results. As usual, the random seed is [42](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42)), the *least random* of all random seeds one could possibly choose :-)

**BTW: we are back to Numpy for a little while!**

In [None]:
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)

print(a, b)

### Step 1: Compute Model's Predictions

This is the **forward pass** - it simply *computes the model's predictions using the current values of the parameters/weights*. At the very beginning, we will be producing really bad predictions, as we started with random values from Step 0.

In [None]:
# Computes our model's predicted output
yhat = a + b * x_train

### Step 2: Compute Loss

There is a subtle but fundamental difference between **error** and **loss**. 

The **error** is the difference between the **actual** and the **predicted** value, computed for a single data point.

$$
\Large error_i = y_i - \hat{y_i}
$$

The **loss**, on the other hand, is some sort of **aggregation of errors for a set of data points**.

For a regression problem, a common choice of **loss** is the **Mean Squared Error (MSE)**, that is, the average of all squared differences between **actual values** (y) and **predictions** (a + bx).

$$
\large MSE = \frac{1}{N} \sum_{i=1}^N{error_i}^2
$$

$$
\large MSE = \frac{1}{N} \sum_{i=1}^N{(y_i - \hat{y_i})}^2
$$

$$
\large MSE = \frac{1}{N} \sum_{i=1}^N{(y_i - a - b x_i)}^2
$$

---

It is worth mentioning that, if we **compute the loss** using:
- **all points** in the training set (N), we are performing a **batch** gradient descent
- a **single point** at each time, it would be a **stochastic** gradient descent
- anything else (n) **in-between 1 and N** characterizes a **mini-batch** gradient descent

---

<p align="center">
<img src="images/batch_vs_stochastic.png">
</p>
<p align="center">
<a href="https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3">Source</a>
</p>


In [None]:
# How wrong is our model? That's the error! 
error = (y_train - yhat)

# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

print(loss)

### Step 3: Compute the Gradients

A **gradient** is a **partial derivative** — *why partial*? Because one computes it with respect to (w.r.t.) a **single parameter**. We have two parameters, **a** and **b**, so we must compute two partial derivatives.

A **derivative** tells you *how much* **a given quantity changes** when you *slightly* vary some **other quantity**. In our case, how much does our **MSE** **loss** change when we vary **each one of our two parameters**?

The *right-most* part of the equations below is what you usually see in implementations of gradient descent for a simple linear regression. In the **intermediate step**, I show you **all elements** that pop-up from the application of the [chain rule](https://en.wikipedia.org/wiki/Chain_rule), so you know how the final expression came to be.

---

<h3><i><b>Gradient = how much the LOSS changes if ONE parameter changes a little bit!</b></i></h3>

---

**Gradients**:

$$
\large \frac{\partial{MSE}}{\partial{a}} = \frac{\partial{MSE}}{\partial{\hat{y_i}}} \cdot \frac{\partial{\hat{y_i}}}{\partial{a}} = \frac{1}{N} \sum_{i=1}^N{2(y_i - a - b x_i) \cdot (-1)} = -2 \frac{1}{N} \sum_{i=1}^N{(y_i - \hat{y_i})}
$$ 

$$
\large \frac{\partial{MSE}}{\partial{b}} = \frac{\partial{MSE}}{\partial{\hat{y_i}}} \cdot \frac{\partial{\hat{y_i}}}{\partial{b}} = \frac{1}{N} \sum_{i=1}^N{2(y_i - a - b x_i) \cdot (-x_i)} = -2 \frac{1}{N} \sum_{i=1}^N{x_i (y_i - \hat{y_i})}
$$


In [None]:
# Computes gradients for both "a" and "b" parameters
a_grad = -2 * error.mean()
b_grad = -2 * (x_train * error).mean()
print(a_grad, b_grad)
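If you'd like to convince yourself that the formulas above are right, a quick numerical check helps. The sketch below regenerates the synthetic data (using all 100 points instead of the training split, for brevity) and compares the analytic gradients with central finite differences. The helper `mse` and the step size `eps` are my own illustrative choices, not part of the tutorial's code.

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)
a, b = np.random.randn(1), np.random.randn(1)

def mse(a, b):
    # Mean squared error of the line a + b * x over all points
    return ((y - (a + b * x)) ** 2).mean()

# Analytic gradients, straight from the equations above
error = y - (a + b * x)
a_grad = -2 * error.mean()
b_grad = -2 * (x * error).mean()

# Numerical gradients: nudge ONE parameter a little bit and watch the loss
eps = 1e-6
a_num = (mse(a + eps, b) - mse(a - eps, b)) / (2 * eps)
b_num = (mse(a, b + eps) - mse(a, b - eps)) / (2 * eps)

print(a_grad, a_num)
print(b_grad, b_num)
```

Each pair should agree to several decimal places, which is exactly the "how much the loss changes if one parameter changes a little bit" idea in action.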

### Step 4: Update the Parameters

In the final step, we **use the gradients to update** the parameters. Since we are trying to **minimize** our **losses**, we **reverse the sign** of the gradient for the update.

There is still another parameter to consider: the **learning rate**, denoted by the *Greek letter* **eta** (that looks like the letter **n**), which is the **multiplicative factor** that we need to apply to the gradient for the parameter update.

**Parameters**:

$$
\large a = a - \eta \frac{\partial{MSE}}{\partial{a}}
$$

$$
\large b = b - \eta \frac{\partial{MSE}}{\partial{b}}
$$

Let's start with a value of **0.1** (which is a relatively *big value*, as far as learning rates are concerned!).

In [None]:
# Sets learning rate
lr = 1e-1
print(a, b)

# Updates parameters using gradients and the learning rate
a = a - lr * a_grad
b = b - lr * b_grad

print(a, b)

---

<h2><b><i>"Choose your learning rate wisely..."</b></i></h2>

<h3><i><b>The learning rate is the single most important hyper-parameter to tune when you are using Deep Learning models!</b></i></h3>

What happens if I choose the learning rate **poorly**? Your model may **take too long to train** or **get stuck with a high loss** or, even worse, **diverge into an exploding loss**!

<p align="center">
<img src="images/learningrates.jpeg">
</p>
<p align="center">
<a href="http://cs231n.github.io/neural-networks-3/">Source</a>
</p>

---

### Playing with Learning Rates

Let's work through **an interactive example**!

We start at a (not so) **random initial value** of our **weight**, say, -1.5. It has a corresponding **loss** of 2.25.

You can choose between **two functions**:
- **convex**, meaning, its **loss is well-behaved** and **gradient descent is guaranteed to converge** (given a small enough learning rate)
- **non-convex**, meaning, **all bets are off**!

Every time you **take a step**, the plot gets updated:

- The **red vector** is our update to the **weight**, that is, **learning rate times gradient**.

- The **gray vector** shows **how much the cost changes** given our update.

- If you divide their lengths, **gray over red**, it will give you the **approximate gradient**.

In [None]:
# Uncomment and run once if you're in Google Colab
#!curl https://raw.githubusercontent.com/dvgodoy/PyTorch101_ODSC_London2019/master/gradient_descent.py --output gradient_descent.py

In [None]:
from plotly.offline import iplot, init_notebook_mode
from ipywidgets import VBox, IntSlider, FloatSlider, Dropdown
from gradient_descent import *

init_notebook_mode(connected=False)

w0 = FloatSlider(description='Start', value=-1.5, min=-2, max=2, step=.05)
functype = Dropdown(description='Function', options=['Convex', 'Non-Convex'], value='Convex')
lrate = FloatSlider(description='Learning Rate', value=.05, min=.05, max=1.1, step=.05)
n_steps = IntSlider(description='# updates', value=10, min=10, max=20, step=1)

In [None]:
# Uncomment if you're in Google Colab
#configure_plotly_browser_state()
VBox((w0, functype, lrate, n_steps))

In [None]:
# Uncomment if you're in Google Colab
#configure_plotly_browser_state()
fig = build_fig(functype.value, lrate.value, w0.value, n_steps.value)
iplot(fig)
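If the interactive widget is not available to you, the same behaviour can be sketched in a few lines of plain Python. I am assuming the convex function is `loss(w) = w ** 2`, which is consistent with a start of -1.5 giving a loss of 2.25, but the actual functions live in `gradient_descent.py` and may differ. The helper `descend` is hypothetical.

```python
def descend(w, lr, n_steps=10):
    """Runs gradient descent on loss(w) = w ** 2 and returns the losses."""
    losses = [w ** 2]
    for _ in range(n_steps):
        grad = 2 * w          # derivative of w ** 2
        w = w - lr * grad     # the usual update rule
        losses.append(w ** 2)
    return losses

for lr in (0.05, 0.5, 1.0, 1.1):
    losses = descend(w=-1.5, lr=lr)
    print(f'lr={lr}: first loss {losses[0]:.2f}, last loss {losses[-1]:.4f}')
```

With this assumed function, a small learning rate shrinks the loss slowly, 0.5 lands on the minimum in a single step, 1.0 bounces back and forth forever at the same loss, and 1.1 makes the loss grow at every step.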

### Step 5: Rinse and Repeat!

Now we use the **updated parameters** to go back to **Step 1** and restart the process.

Repeating this process over and over, for **many epochs**, is, in a nutshell, **training** a model.

---

An **epoch** is complete whenever **every point has already been used once for computing the loss**:
- **batch** gradient descent: this is trivial, as it uses all points for computing the loss — **one epoch** is the same as **one update**
- **stochastic** gradient descent: **one epoch** means **N updates**
- **mini-batch** (of size n): **one epoch** has **N/n updates**

---


Let's put the previous pieces of code together and loop over many epochs:

In [None]:
# Defines number of epochs
n_epochs = 1000

# Step 0
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)

for epoch in range(n_epochs):
    # Step 1
    # Computes our model's predicted output
    yhat = a + b * x_train
    
    # Step 2
    # How wrong is our model? That's the error! 
    error = (y_train - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3    
    # Computes gradients for both "a" and "b" parameters
    a_grad = -2 * error.mean()
    b_grad = -2 * (x_train * error).mean()
    
    # Step 4
    # Updates parameters using gradients and the learning rate
    a -= lr * a_grad
    b -= lr * b_grad

In [None]:
print(a, b)

Just keep in mind that, if you **don’t** use batch gradient descent (our example does), you’ll have to write an **inner loop** to perform the **training steps** for either each **individual point** (**stochastic**) or **n points** (**mini-batch**). We’ll see a mini-batch example later down the line.
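In the meantime, here is a bare-bones Numpy sketch of what such an inner loop could look like. It regenerates the synthetic data rather than reusing the training split, and the batch size of 16, the 200 epochs and the per-epoch reshuffling are my own assumptions for illustration.

```python
import numpy as np

np.random.seed(42)
N = 100
x = np.random.rand(N, 1)
y = 1 + 2 * x + .1 * np.random.randn(N, 1)

a, b = np.random.randn(1), np.random.randn(1)
lr, n_epochs, batch_size = 1e-1, 200, 16
n_updates = 0

for epoch in range(n_epochs):
    # Reshuffles the points so each epoch visits them in a different order
    idx = np.random.permutation(N)
    # Inner loop: one parameter update per mini-batch
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        x_b, y_b = x[batch], y[batch]
        # Steps 1 to 4, computed on the mini-batch only
        error = y_b - (a + b * x_b)
        a_grad = -2 * error.mean()
        b_grad = -2 * (x_b * error).mean()
        a = a - lr * a_grad
        b = b - lr * b_grad
        n_updates += 1

print(a, b, n_updates)
```

Since 16 does not divide 100 evenly, each epoch performs 7 updates (N/n rounded up), the last one on a smaller batch of 4 points.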

### Sanity Check

Just to make sure we haven’t made any mistakes in our code, we can use *Scikit-Learn’s Linear Regression* to fit the model and compare the coefficients.

In [None]:
# Sanity Check: do we get the same results as our gradient descent?
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])

They **match** up to 6 decimal places — we have a *fully working implementation of linear regression* using Numpy.

**Numpy?! Wait a minute… I thought this tutorial was about PyTorch!**

Yes, it is, but this served **two purposes**: *first*, to introduce the **structure** of our task, which will remain largely the same and, *second*, to show you the main **pain points** so you can fully appreciate how much PyTorch makes your life easier :-)

<h2><b><i>Numpy?! TORCH IT!</b></i></h2>


## Autograd, your companion for all your gradient needs!

Autograd is PyTorch’s *automatic differentiation package*. Thanks to it, we **don’t need to worry** about partial derivatives, chain rule or anything like it.

<h2><b><i>Computing gradients manually?! No way! Backward!</b></i></h2>


### backward

So, how do we tell PyTorch to do its thing and **compute all gradients**? That’s what [**backward()**](https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward) is good for.

Do you remember the **starting point** for **computing the gradients**? It was the **loss**, as we computed its partial derivatives w.r.t. our parameters. Hence, we need to invoke the `backward()` method from the corresponding Python variable, like, `loss.backward()`.

In [None]:
# Step 0
torch.manual_seed(42)

a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

In [None]:
# Step 1
# Computes our model's predicted output
yhat = a + b * x_train_tensor

# Step 2    
# How wrong is our model? That's the error! 
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3    
# No more manual computation of gradients! 
loss.backward()

# Computes gradients for both "a" and "b" parameters
# a_grad = -2 * error.mean()
# b_grad = -2 * (x_train_tensor * error).mean()

### grad / zero_


What about the **actual values** of the **gradients**? We can inspect them by looking at the [**grad**](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.grad) **attribute** of a tensor.

In [None]:
print(a.grad, b.grad)

If you check the method’s documentation, it clearly states that **gradients are accumulated**. 

You can check this out by running the two code cells above again.

So, every time we use the **gradients** to **update** the parameters, we need to **zero the gradients afterwards**. And that’s what [**zero_()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.zero_) is good for.

---

*In PyTorch, every method that **ends** with an **underscore (_)** makes changes **in-place**, meaning, it will **modify** the underlying variable.*

---

In [None]:
a.grad.zero_(), b.grad.zero_()
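Here is the accumulation in isolation, on a toy example that is small enough to check by hand. The tensor `w` and the expression `3 * w` are made up for illustration; the derivative of `3 * w` with respect to `w` is simply 3.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()     # d(3w)/dw = 3

(3 * w).backward()        # same computation, run a second time
second = w.grad.item()    # 3 + 3: the new gradient was ADDED to the old one

w.grad.zero_()            # in-place reset
third = w.grad.item()

print(first, second, third)
```

The second call to `backward()` does not overwrite the gradient, it piles on top of it, which is why we have to reset it ourselves.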

So, let’s **ditch** the **manual computation of gradients** and use both `backward()` and `zero_()` methods instead.

And, we are still missing **Step 4**, that is, **updating the parameters**. Let's include it as well...

In [None]:
# Step 0
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# Step 1
# Computes our model's predicted output
yhat = a + b * x_train_tensor

# Step 2    
# How wrong is our model? That's the error! 
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3    
# No more manual computation of gradients! 
loss.backward()
# Computes gradients for both "a" and "b" parameters
# a_grad = -2 * error.mean()
# b_grad = -2 * (x_train_tensor * error).mean()
print(a.grad, b.grad)

# Step 4
# Updates parameters using gradients and the learning rate
with torch.no_grad(): # what is that?!
    a -= lr * a.grad
    b -= lr * b.grad

# PyTorch is "clingy" to its computed gradients, we need to tell it to let it go...
a.grad.zero_()
b.grad.zero_()

print(a.grad, b.grad)

### no_grad

<h2><b><i>"One does not simply update parameters without no_grad"</b></i></h2>

Why do we need to use [**no_grad()**](https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad) to **update the parameters**?

The culprit is PyTorch’s ability to build a **dynamic computation graph** from every **Python operation** that involves any **gradient-computing tensor** or its **dependencies**.

---

**What is a dynamic computation graph?**

Don't worry, we’ll go deeper into the inner workings of the dynamic computation graph in the next section.

---

So, how do we tell PyTorch to “**back off**” and let us **update our parameters** without messing up with its **fancy dynamic computation graph**? 


That is the purpose of **no_grad()**: it allows us to **perform regular Python operations on tensors, independent of PyTorch’s computation graph**.
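To see what goes wrong without it, here is a small sketch with a single made-up parameter and a toy loss. It tries the update three ways: in-place without `no_grad()`, by reassignment, and in-place inside `no_grad()`. The variable names are illustrative only.

```python
import torch

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True)
loss = (a ** 2).sum()
loss.backward()
lr = 1e-1

# Attempt 1: in-place update in plain sight of autograd
try:
    a -= lr * a.grad
    in_place_error = None
except RuntimeError as err:
    in_place_error = str(err)

# Attempt 2: reassignment builds a new, non-leaf tensor
reassigned = a - lr * a.grad

# Attempt 3: the same in-place update, hidden from autograd
before = a.item()
with torch.no_grad():
    a -= lr * a.grad

print(in_place_error)
print(reassigned.is_leaf, a.is_leaf, a.requires_grad)
```

The first attempt raises an error, the second one quietly produces a tensor that is a result of an operation rather than a parameter, and only the third one updates the value while keeping `a` a gradient-requiring leaf.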

In [None]:
lr = 1e-1
n_epochs = 1000

# Step 0
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

for epoch in range(n_epochs):
    # Step 1
    # Computes our model's predicted output
    yhat = a + b * x_train_tensor

    # Step 2    
    # How wrong is our model? That's the error! 
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3    
    # No more manual computation of gradients! 
    loss.backward()

    # Step 4
    # Updates parameters using gradients and the learning rate
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad

    # PyTorch is "clingy" to its computed gradients, we need to tell it to let it go...
    a.grad.zero_()
    b.grad.zero_()

print(a, b)

Finally, we managed to successfully run our model and get the **resulting parameters**. Sure enough, they **match** the ones we got in our *Numpy*-only implementation.

Let's take a look at the **loss** at the end of the training...

In [None]:
loss

What if we wanted to have it as a *Numpy* array? I guess we could just use **numpy()** again, right? (and **cpu()** as well, since our *loss* is in the `cuda` device...)

In [None]:
loss.cpu().numpy()

What happened here? Unlike our *data tensors*, the **loss tensor** is actually computing gradients - and in order to use **numpy**, we need to [**detach()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.detach_) that tensor from the computation graph first:

In [None]:
loss.detach().cpu().numpy()

This seems like **a lot of work**, there must be an easier way! And there is one indeed: we can use [**item()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.item), for **tensors with a single element** or [**tolist()**](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist) otherwise.

In [None]:
print(loss.item(), loss.tolist())

## Dynamic Computation Graph: what is that?

<h2><b><i>"No one can be told what the dynamic computation graph is - you have to see it for yourself"</b></i></h2>

Jokes aside, I want **you** to **see the graph for yourself** too!

The [PyTorchViz](https://github.com/szagoruyko/pytorchviz) package and its `make_dot(variable)` function allow us to easily visualize a graph associated with a given Python variable.

So, let’s stick with the **bare minimum**: two (gradient computing) **tensors** for our parameters, predictions, errors and loss.

In [None]:
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = a + b * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

Now let's plot the **computation graph** for the **yhat** variable.

In [None]:
make_dot(yhat)

Let’s take a closer look at its components:

* **blue boxes**: these correspond to the **tensors** we use as **parameters**, the ones we’re asking PyTorch to **compute gradients** for;

* **gray box**: a **Python operation** that involves a **gradient-computing tensor or its dependencies**;

* **green box**: the same as the gray box, except it is the **starting point for the computation** of gradients (assuming the **`backward()`** method is called from the **variable used to visualize** the graph); gradients are computed from the **bottom up** in a graph.

Now, take a closer look at the **green box**: there are **two arrows** pointing to it, since it is **adding up two variables**, `a` and `b*x`. Seems obvious, right?

Then, look at the **gray box** of the same graph: it is performing a **multiplication**, namely, `b*x`. But there is only **one arrow** pointing to it! The arrow comes from the **blue box** that corresponds to our parameter `b`.

Why don’t we have a box for our **data x**? The answer is: **we do not compute gradients for it**! So, even though there are *more* tensors involved in the operations performed by the computation graph, it **only** shows **gradient-computing tensors and their dependencies**.

Try using the `make_dot` method to plot the **computation graph** of other variables, like `error` or `loss`.

The **only difference** between them and the first one is the number of **intermediate steps (gray boxes)**.



In [None]:
make_dot(loss)

What would happen to the computation graph if we set **`requires_grad`** to **`False`** for our parameter **`a`**?

In [None]:
a_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = a_nograd + b * x_train_tensor

In [None]:
make_dot(yhat)

Unsurprisingly, the **blue box** corresponding to the **parameter a** is no more! 

Simple enough: **no gradients, no graph**.
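You can also peek at the graph without any plotting library, through the `grad_fn` attribute that PyTorch attaches to the result of every tracked operation. The sketch below uses a small random `x` on the CPU instead of our training tensors, just to stay self-contained.

```python
import torch

torch.manual_seed(42)
x = torch.rand(5, 1)                       # data: no gradients required
a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

yhat = a + b * x
# The last operation recorded for yhat (the "green box")
print(yhat.grad_fn)
# The nodes feeding into it: one for the parameter a, one for the multiplication
print(yhat.grad_fn.next_functions)

# An operation involving only data records nothing at all
doubled = x * 2
print(doubled.requires_grad, doubled.grad_fn)
```

The addition has two inputs, matching the two arrows in the plot, while the data-only result has no `grad_fn` at all.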

The **best thing** about the *dynamic computing graph* is the fact that you can make it **as complex as you want it**. You can even use *control flow statements* (e.g., if statements) to **control the flow of the gradients** (obviously!) :-)

Let's build a nonsensical, yet complex, computation graph just to make a point!

In [None]:
yhat = a + b * x_train_tensor
error = y_train_tensor - yhat

loss = (error ** 2).mean()

if loss > 0:
    yhat2 = b * x_train_tensor
    error2 = y_train_tensor - yhat2
    # Kept inside the branch, since error2 only exists when the branch runs
    loss += error2.mean()

In [None]:
make_dot(loss)

## Optimizer:  learning the parameters step-by-step

So far, we’ve been **manually** updating the parameters using the computed gradients. That’s probably fine for *two parameters*… but what if we had a **whole lot of them**?! We use one of PyTorch’s **optimizers**, like [SGD](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) or [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam).

---

There are **many** optimizers: **SGD** is the most basic of them and **Adam** is one of the most popular. They achieve the same goal through, literally, **different paths**.

<p align="center">
<img src="images/opt2.gif">
</p>

<p align="center">
<a href="http://cs231n.github.io/neural-networks-3/">Source</a>
</p>

---

In the code below, we create a *Stochastic Gradient Descent* (SGD) optimizer to update our parameters **a** and **b**.

---

Don’t be fooled by the **optimizer’s** name: if we use **all training data** at once for the update (as we are actually doing in the code), the optimizer is performing a **batch** gradient descent, despite its name.

---

In [None]:
# Our parameters
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# Learning rate
lr = 1e-1

# Defines a SGD optimizer to update the parameters
optimizer = optim.SGD([a, b], lr=lr)

### step / zero_grad

An optimizer takes the **parameters** we want to update, the **learning rate** we want to use (and possibly many other hyper-parameters as well!) and **performs the updates** through its [**`step()`**](https://pytorch.org/docs/stable/optim.html#torch.optim.Optimizer.step) method.

Besides, we also don’t need to zero the gradients one by one anymore. We just invoke the optimizer’s [**`zero_grad()`**](https://pytorch.org/docs/stable/optim.html#torch.optim.Optimizer.zero_grad) method and that’s it!

In [None]:
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1
    yhat = a + b * x_train_tensor

    # Step 2
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    # Step 3
    loss.backward()    
    
    # Step 4
    # No more manual update!
    # with torch.no_grad():
    #     a -= lr * a.grad
    #     b -= lr * b.grad
    optimizer.step()
    
    # No more telling PyTorch to let gradients go!
    # a.grad.zero_()
    # b.grad.zero_()
    optimizer.zero_grad()

print(a, b)
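
For plain SGD (no momentum, no weight decay), `optimizer.step()` applies the same rule we wrote by hand, `param -= lr * param.grad`. The sketch below checks that on made-up data; the synthetic `x` and `y` and the `*_manual` / `*_optim` names are assumptions for illustration only.

```python
import torch
import torch.optim as optim

torch.manual_seed(42)
# synthetic data, not the tutorial's dataset
x = torch.rand(100, 1)
y = 1 + 2 * x + 0.1 * torch.randn(100, 1)

lr = 1e-1
a_manual = torch.zeros(1, requires_grad=True)
b_manual = torch.zeros(1, requires_grad=True)
a_optim = torch.zeros(1, requires_grad=True)
b_optim = torch.zeros(1, requires_grad=True)
optimizer = optim.SGD([a_optim, b_optim], lr=lr)

for epoch in range(10):
    # manual update, as in the earlier sections
    loss = ((y - (a_manual + b_manual * x)) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        a_manual -= lr * a_manual.grad
        b_manual -= lr * b_manual.grad
    a_manual.grad.zero_()
    b_manual.grad.zero_()

    # same update, delegated to the optimizer
    loss = ((y - (a_optim + b_optim * x)) ** 2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(torch.allclose(a_manual, a_optim), torch.allclose(b_manual, b_optim))
```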

Cool! We’ve *optimized* the **optimization** process :-) What’s left?

## Loss: aggregating errors into a single value

We now tackle the **loss computation**. As expected, PyTorch has us covered once again. There are many [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions) to choose from, depending on the task at hand. Since ours is a regression, we are using the [Mean Square Error (MSE)](https://pytorch.org/docs/stable/nn.html#torch.nn.MSELoss) loss.

---

Notice that `nn.MSELoss` actually **creates a loss function** for us: **it is NOT the loss function itself**. Moreover, you can specify a **reduction method** to be applied, that is, **how you want to aggregate the results for individual points**: you can average them (`reduction='mean'`) or simply sum them up (`reduction='sum'`).

---

In [None]:
# Defines a MSE loss function
loss_fn = nn.MSELoss(reduction='mean')

loss_fn

In [None]:
fake_labels = torch.tensor([1., 2., 3.])
fake_preds = torch.tensor([1., 3., 5.])

# nn.MSELoss expects (input, target), that is, predictions first
loss_fn(fake_preds, fake_labels)
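
To see what the `reduction` argument does, we can work through the same fake tensors by hand. The squared errors are 0, 1 and 4, so `'mean'` should give 5/3 and `'sum'` should give 5. The sketch follows the documented `(input, target)` argument order (predictions first) and also shows `reduction='none'`, which returns the individual squared errors.

```python
import torch
import torch.nn as nn

fake_labels = torch.tensor([1., 2., 3.])
fake_preds = torch.tensor([1., 3., 5.])

mean_loss = nn.MSELoss(reduction='mean')(fake_preds, fake_labels)
sum_loss = nn.MSELoss(reduction='sum')(fake_preds, fake_labels)
none_loss = nn.MSELoss(reduction='none')(fake_preds, fake_labels)

print(mean_loss)  # tensor(1.6667)
print(sum_loss)   # tensor(5.)
print(none_loss)  # tensor([0., 1., 4.])
```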

We then **use** the created loss function to compute the loss given our **predictions** and our **labels**.

In [None]:
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 1e-1
n_epochs = 1000

# Defines a MSE loss function
loss_fn = nn.MSELoss(reduction='mean')

optimizer = optim.SGD([a, b], lr=lr)

for epoch in range(n_epochs):
    # Step 1
    yhat = a + b * x_train_tensor
    
    # Step 2
    # No more manual loss!
    # error = y_train_tensor - yhat
    # loss = (error ** 2).mean()
    # Predictions go first, labels second
    loss = loss_fn(yhat, y_train_tensor)

    # Step 3
    loss.backward() 

    # Step 4
    optimizer.step()
    optimizer.zero_grad()
    
print(a, b)

At this point, there’s only one piece of code left to change: the **predictions**. It is then time to introduce PyTorch’s way of implementing a…

## Model: making predictions

In PyTorch, a **model** is represented by a regular **Python class** that inherits from the [**Module**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) class.

The most fundamental methods it needs to implement are:

* **`__init__(self)`**: **it defines the parts that make up the model**, in our case, two parameters, **a** and **b**.

* **`forward(self, x)`**: it performs the **actual computation**, that is, it **outputs a prediction**, given the input **x**.

Let’s build a proper (yet simple) model for our regression task. It should look like this:

In [None]:
class ManualLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        a = torch.randn(1, requires_grad=True, dtype=torch.float)
        b = torch.randn(1, requires_grad=True, dtype=torch.float)

        # To make "a" and "b" real parameters of the model, we need to wrap them with nn.Parameter
        self.a = nn.Parameter(a)
        self.b = nn.Parameter(b)
        
    def forward(self, x):
        # Computes the outputs / predictions
        return self.a + self.b * x

### Parameter


In the **\__init__** method, we define our **two parameters**, **a** and **b**, using the [**Parameter()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Parameter) class, to tell PyTorch these **tensors should be considered parameters of the model they are an attribute of**.

Why should we care about that? By doing so, we can use our model’s [**parameters()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.parameters) method to retrieve **an iterator over all of the model’s parameters**, even those of **nested models**, which we can then feed to our optimizer (instead of building a list of parameters ourselves!).

In [None]:
dummy = ManualLinearRegression()

list(dummy.parameters())
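
To see what wrapping a tensor with `nn.Parameter` changes, compare it with a plain tensor attribute. In the sketch below, `HalfRegistered` is a made-up class for illustration: only the wrapped tensor shows up in `parameters()`, even though both tensors require gradients.

```python
import torch
import torch.nn as nn

class HalfRegistered(nn.Module):
    def __init__(self):
        super().__init__()
        # wrapped: registered as a parameter of the model
        self.a = nn.Parameter(torch.randn(1))
        # plain tensor: a regular attribute, invisible to parameters()
        self.b = torch.randn(1, requires_grad=True)

    def forward(self, x):
        return self.a + self.b * x

half = HalfRegistered()
names = [name for name, _ in half.named_parameters()]
print(names)  # ['a']
```

An optimizer built from `half.parameters()` would therefore never update `b`.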

Moreover, we can get the **current values for all parameters** using our model’s [**state_dict()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.state_dict) method.

In [None]:
dummy.state_dict()

### state_dict

The **state_dict()** of a given model is simply a Python dictionary that **maps each layer / parameter to its corresponding tensor**. It includes the **learnable** parameters, that is, those that are going to be updated by the **optimizer**, plus any persistent buffers a layer registers (the running statistics of a batch normalization layer, for instance).

The **optimizer** itself also has a **state_dict()**, which contains its internal state, as well as the hyperparameters used.

---

It turns out **state_dicts** can also be used for **checkpointing** a model, as we will see later down the line.

---

In [None]:
optimizer.state_dict()
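
The optimizer's dictionary has two top-level keys, `state` and `param_groups`. The sketch below builds a throwaway optimizer (the tensors `a` and `b` here are stand-ins, and `opt_state` is an illustrative name) and reads the learning rate back from it.

```python
import torch
import torch.optim as optim

a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
optimizer = optim.SGD([a, b], lr=0.1)

opt_state = optimizer.state_dict()
print(sorted(opt_state.keys()))                # ['param_groups', 'state']
print(opt_state['param_groups'][0]['lr'])      # 0.1
# parameters are referenced by position, not by name
print(opt_state['param_groups'][0]['params'])  # [0, 1]
```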

### Device

**IMPORTANT**: we need to **send our model to the same device where the data is**. If our data is made of GPU tensors, our model must “live” inside the GPU as well.

In [None]:
torch.manual_seed(42)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Now we can create a model and send it at once to the device
model = ManualLinearRegression().to(device)

# We can also inspect its parameters using its state_dict
print(model.state_dict())

### Forward Pass

The **forward pass** is the moment when the model **makes predictions**.

---

You should **NOT call the `forward(x)`** method, though. You should **call the whole model itself**, as in **`model(x)`** to perform a forward pass and output predictions.

---


In [None]:
yhat = model(x_train_tensor)
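
One reason to prefer `model(x)` over `model.forward(x)` is that calling the model goes through `nn.Module.__call__`, which runs any registered hooks around `forward()`. Calling `forward()` directly skips them. A minimal sketch, using a forward hook on a throwaway `nn.Linear` layer (the `calls` list and the `count_call` function are illustrative names):

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
calls = []

def count_call(module, inputs, output):
    # records the shape of every output produced through __call__
    calls.append(output.shape)

handle = layer.register_forward_hook(count_call)
x = torch.rand(4, 1)

layer(x)          # goes through __call__: the hook runs
layer.forward(x)  # bypasses __call__: the hook does not run
handle.remove()

print(len(calls))  # 1
```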

### train

<h2><b><i>"What does train() do? It only sets the mode!"</i></b></h2>

In PyTorch, models have a [**train()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.train) method which, somewhat disappointingly, **does NOT perform a training step**. Its only purpose is to **set the model to training mode**. 

Why is this important? Some models may use mechanisms like [**Dropout**](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout), for instance, which have **distinct behaviors in training and evaluation phases**.
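
A minimal sketch of that difference, using a standalone `nn.Dropout` layer (the drop probability of 0.5 and the input of ones are arbitrary choices for illustration): in training mode, each element is either zeroed or scaled by 1 / (1 - p); in evaluation mode, the input passes through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
x = torch.ones(10)

dropout.train()
train_out = dropout(x)  # each element is either 0 or 2 (= 1 / (1 - 0.5))

dropout.eval()
eval_out = dropout(x)   # identical to the input

print(train_out)
print(eval_out)
```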

In [None]:
lr = 1e-1
n_epochs = 1000

loss_fn = nn.MSELoss(reduction='mean')
# Now the optimizer uses the parameters from the model
optimizer = optim.SGD(model.parameters(), lr=lr)

for epoch in range(n_epochs):
    # Sets model to training mode
    model.train()

    # Step 1
    # No more manual prediction!
    # yhat = a + b * x_train_tensor
    yhat = model(x_train_tensor)
    
    # Step 2
    loss = loss_fn(yhat, y_train_tensor)
    # Step 3
    loss.backward()
    # Step 4
    optimizer.step()
    optimizer.zero_grad()
    
print(model.state_dict())

Now, the printed statements will look like this — final values for parameters **a** and **b** are still the same, so everything is ok :-)

### Nested Models

In our model, we manually created two parameters to perform a linear regression. 

---

You are **not** limited to defining parameters, though… **models can contain other models as their attributes** as well, so you can easily nest them. We’ll see an example of this shortly.

---

Let’s use PyTorch’s [**Linear**](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) model as an attribute of our own, thus creating a nested model.

Even though this clearly is a contrived example, as we are pretty much wrapping the underlying model without adding anything useful (or, at all!) to it, it illustrates the concept well.

In the **`__init__`** method, we create an attribute that contains our **nested `Linear` model**.

In the **`forward()`** method, we **call the nested model itself** to perform the forward pass (notice, we are **not** calling `self.linear.forward(x)`!).

In [None]:
class LayerLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Instead of our custom parameters, we use a Linear layer with single input and single output
        self.linear = nn.Linear(1, 1)
                
    def forward(self, x):
        # Now it only takes a call to the layer to make predictions
        return self.linear(x)

Now, if we call the **parameters()** method of this model, **PyTorch will figure out the parameters of its attributes in a recursive way**.

You can also add new `Linear` attributes and, even if you don’t use them at all in the forward pass, they will **still** be listed under `parameters()`.

In [None]:
dummy = LayerLinearRegression()

list(dummy.parameters())

In [None]:
dummy.state_dict()
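
To check the claim about unused attributes, here is a sketch with a hypothetical `ExtraLinearRegression` class that holds a second `Linear` layer it never calls. Both layers contribute a weight and a bias, so `parameters()` yields four tensors.

```python
import torch
import torch.nn as nn

class ExtraLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        # never used in forward(), but still registered
        self.unused = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

extra = ExtraLinearRegression()
param_names = [name for name, _ in extra.named_parameters()]
print(param_names)
# ['linear.weight', 'linear.bias', 'unused.weight', 'unused.bias']
```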

### Layers

A **Linear** model can be seen as a **layer** in a neural network.

<p align="center">
<img src="images/layer.png">
</p>
<p align="center">
<a href="https://www.kdnuggets.com/2017/09/neural-network-foundations-explained-activation-function.html">Source</a>
</p>

In the example above, the **hidden layer** would be `nn.Linear(3, 4)` and the **output layer** would be `nn.Linear(4, 1)`.
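
As a sketch, those two layers can be chained and fed a batch of inputs to check the shapes. The `nn.ReLU` activation between them and the batch size of 5 are assumptions made for illustration; the figure does not prescribe either.

```python
import torch
import torch.nn as nn

hidden = nn.Linear(3, 4)   # 3 inputs -> 4 hidden units
output = nn.Linear(4, 1)   # 4 hidden units -> 1 output
activation = nn.ReLU()     # assumed activation, for illustration only

x = torch.rand(5, 3)       # a batch of 5 points with 3 features each
h = activation(hidden(x))
out = output(h)
print(h.shape, out.shape)
```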


There are **MANY** different layers that can be used in PyTorch:
- [Convolution Layers](https://pytorch.org/docs/stable/nn.html#convolution-layers)
- [Pooling Layers](https://pytorch.org/docs/stable/nn.html#pooling-layers)
- [Padding Layers](https://pytorch.org/docs/stable/nn.html#padding-layers)
- [Non-linear Activations](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity)
- [Normalization Layers](https://pytorch.org/docs/stable/nn.html#normalization-layers)
- [Recurrent Layers](https://pytorch.org/docs/stable/nn.html#recurrent-layers)
- [Transformer Layers](https://pytorch.org/docs/stable/nn.html#transformer-layers)
- [Linear Layers](https://pytorch.org/docs/stable/nn.html#linear-layers)
- [Dropout Layers](https://pytorch.org/docs/stable/nn.html#dropout-layers)
- [Sparse Layers (embeddings)](https://pytorch.org/docs/stable/nn.html#sparse-layers)
- [Vision Layers](https://pytorch.org/docs/stable/nn.html#vision-layers)
- [DataParallel Layers (multi-GPU)](https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed)
- [Flatten Layer](https://pytorch.org/docs/stable/nn.html#flatten)

We have just used a **Linear** layer.

### Sequential Models

<h2><b><i>Run-of-the-mill layers? Sequential model!</i></b></h2>

Our model was simple enough… You may be thinking: “*why even bother to build a class for it?!*” Well, you have a point…

For **straightforward models**, that use **run-of-the-mill layers**, where the output of a layer is sequentially fed as an input to the next, we can use a, er… [**Sequential**](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential) model :-)

In our case, we would build a Sequential model with a single argument, that is, the Linear layer we used to train our linear regression. The model would look like this:

In [None]:
model = nn.Sequential(nn.Linear(1, 1)).to(device)

Simple enough, right?

### Training Step

So far, we’ve defined:
* an **optimizer**

* a **loss function**

* a **model**

Scroll up a bit and take a quick look at the code inside the loop. Would it **change** if we were using a **different optimizer**, or **loss**, or even **model**? If not, how can we make it more generic?

Well, I guess we could say all these lines of code **perform a training step**, given those **three elements** (optimizer, loss and model), the **features** and the **labels**.

So, how about **writing a function that takes those three elements** and **returns another function that performs a training step**, taking a set of features and labels as arguments and returning the corresponding loss?

In [None]:
def make_train_step(model, loss_fn, optimizer):
    # Builds function that performs a step in the train loop
    def train_step(x, y):
        # Sets model to TRAIN mode
        model.train()
        # Step 1: Makes predictions
        yhat = model(x)
        # Step 2: Computes loss
        loss = loss_fn(yhat, y)
        # Step 3: Computes gradients
        loss.backward()
        # Step 4: Updates parameters and zeroes gradients
        optimizer.step()
        optimizer.zero_grad()
        # Returns the loss
        return loss.item()
    
    # Returns the function that will be called inside the train loop
    return train_step

Then we can use this general-purpose function to build a **train_step()** function to be called inside our training loop.

In [None]:
lr = 1e-1

# Create a MODEL, a LOSS FUNCTION and an OPTIMIZER
model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

# Creates the train_step function for our model, loss function and optimizer
train_step = make_train_step(model, loss_fn, optimizer)

train_step

Now our code should look like this… see how **tiny** the training loop is now?

In [None]:
n_epochs = 1000

losses = []
# For each epoch...
for epoch in range(n_epochs):
    # Performs one train step and returns the corresponding loss
    loss = train_step(x_train_tensor, y_train_tensor)
    losses.append(loss)
    
# Checks model's parameters
print(model.state_dict())

In [None]:
plt.plot(losses[:200])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.yscale('log')

Let’s give our training loop a rest and focus on our **data** for a while… so far, we’ve simply used our *Numpy arrays* turned **PyTorch tensors**. But we can do better, we can build a…

## Dataset

In PyTorch, a **dataset** is represented by a regular **Python class** that inherits from the [**Dataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class. You can think of it as a kind of a Python **list of tuples**, each tuple corresponding to **one point (features, label)**.

The most fundamental methods it needs to implement are:

* **`__init__(self)`**: it takes **whatever arguments** needed to build a **list of tuples** — it may be the name of a CSV file that will be loaded and processed; it may be two tensors, one for features, another one for labels; or anything else, depending on the task at hand.

* **`__getitem__(self, index)`**: it allows the dataset to be **indexed**, so it can work like a list (`dataset[i]`). It must **return a tuple (features, label)** corresponding to the requested data point. We can either return the **corresponding slices** of our **pre-loaded** dataset or tensors or, as mentioned above, **load them on demand** (like in this [example](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class)).

* **`__len__(self)`**: it should simply return the **size** of the whole dataset so, whenever it is sampled, its indexing is limited to the actual size.

---

There is **no need to load the whole dataset in the constructor method** (`__init__`). If your **dataset is big** (tens of thousands of image files, for instance), loading it at once would not be memory efficient. It is recommended to **load them on demand** (whenever `__getitem__` is called).

---
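
A sketch of what on-demand loading could look like. Everything here is hypothetical: the `LazyNpyDataset` class, the one-file-per-sample layout and the temporary directory only stand in for a real collection of files. The constructor stores file paths; data is read from disk only when an item is requested.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset

class LazyNpyDataset(Dataset):
    def __init__(self, paths, labels):
        # only paths and labels are kept in memory
        self.paths = paths
        self.labels = labels

    def __getitem__(self, index):
        # the file is read from disk only when it is requested
        features = torch.from_numpy(np.load(self.paths[index])).float()
        label = torch.tensor(self.labels[index]).float()
        return (features, label)

    def __len__(self):
        return len(self.paths)

# builds a few fake sample files for the sake of the example
tmp_dir = tempfile.mkdtemp()
paths, labels = [], []
for i in range(3):
    path = os.path.join(tmp_dir, 'sample_{}.npy'.format(i))
    np.save(path, np.full(2, i, dtype=np.float64))
    paths.append(path)
    labels.append(float(i))

lazy_data = LazyNpyDataset(paths, labels)
print(len(lazy_data), lazy_data[2])
```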

Let’s build a simple custom dataset that takes two tensors as arguments: one for the features, one for the labels. For any given index, our dataset class will return the corresponding slice of each of those tensors. It should look like this:

In [None]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, x_tensor, y_tensor):
        self.x = x_tensor
        self.y = y_tensor
        
    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)

In [None]:
# Wait, is this a CPU tensor now? Why? Where is .to(device)?
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()

train_data = CustomDataset(x_train_tensor, y_train_tensor)
print(train_data[0])

---

Did you notice we built our **training tensors** out of Numpy arrays but we **did not send them to a device**? So, they are **CPU** tensors now! **Why**?

We **don’t want our whole training data to be loaded into GPU tensors**, as we have been doing in our example so far, because **it takes up space** in our precious **graphics card’s RAM**.

---

### TensorDataset

Besides, you may be thinking “*why go through all this trouble to wrap a couple of tensors in a class?*”. And, once again, you do have a point… if a dataset is nothing else but a **couple of tensors**, we can use PyTorch’s [**TensorDataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) class, which will do pretty much what we did in our custom dataset above.

In [None]:
from torch.utils.data import TensorDataset
train_data = TensorDataset(x_train_tensor, y_train_tensor)
print(train_data[0])

OK, fine, but then again, **why** are we building a dataset anyway? We’re doing it because we want to use a…

## DataLoader, splitting your data into mini-batches

<h2><b><i>- Let's split data into mini-batches<br>- Use DataLoaders!</i></b></h2>

Until now, we have used the **whole training data** at every training step. It has been **batch gradient descent** all along. This is fine for our *ridiculously small dataset*, sure, but if we want to get serious about all this, we **must use mini-batch** gradient descent. Thus, we need mini-batches. Thus, we need to **slice** our dataset accordingly.

Do you want to do it *manually*?! Me neither!

So we use PyTorch’s [**DataLoader**](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class for this job. We tell it which **dataset** to use (the one we just built in the previous section), the desired **mini-batch size** and if we’d like to **shuffle** it or not. That’s it!

Our **loader** will behave like an **iterator**, so we can **loop over it** and **fetch a different mini-batch** every time.

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)

To retrieve a sample mini-batch, one can simply run the command below — it will return a list containing two tensors, one for the features, another one for the labels.

In [None]:
next(iter(train_loader))
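
The number of mini-batches follows from the dataset size and the mini-batch size. The sketch below uses made-up tensors with 100 points (an assumption; substitute your own dataset): with a mini-batch size of 16, the loader yields seven mini-batches per epoch, six full ones and a last one with the remaining 4 points.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# made-up data, 100 points
x_fake = torch.rand(100, 1)
y_fake = 1 + 2 * x_fake

fake_loader = DataLoader(dataset=TensorDataset(x_fake, y_fake),
                         batch_size=16, shuffle=True)

batch_sizes = [len(x_batch) for x_batch, y_batch in fake_loader]
print(len(fake_loader), batch_sizes)
```

If the smaller last mini-batch is a problem, `DataLoader` also takes a `drop_last=True` argument to discard it.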

How does this change our training loop? Let’s check it out!

In [None]:
lr = 1e-1

# Create a MODEL, a LOSS FUNCTION and an OPTIMIZER
model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

# Creates the train_step function for our model, loss function and optimizer
train_step = make_train_step(model, loss_fn, optimizer)

n_epochs = 1000

losses = []

for epoch in range(n_epochs):
    # inner loop
    for x_batch, y_batch in train_loader:
        # the dataset "lives" in the CPU, so do our mini-batches
        # therefore, we need to send those mini-batches to the
        # device where the model "lives"
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        
        loss = train_step(x_batch, y_batch)
        losses.append(loss)
        
print(model.state_dict())

In [None]:
plt.plot(losses)
plt.xlabel('Epochs (?)')
plt.ylabel('Loss')
plt.yscale('log')

Did you notice it is taking **longer** to train now? Can you guess **why**?

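Two things are different now: not only do we have an **inner loop** to load each and every **mini-batch** from our **DataLoader** but, more importantly, we are now **sending only one mini-batch to the device** at a time. Each epoch therefore performs several device transfers and parameter updates (one per mini-batch) instead of a single one.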

---

For bigger datasets, **loading data sample by sample** (into a **CPU** tensor) using **Dataset’s `__getitem__`** and then **sending all samples** that belong to the **same mini-batch at once to your GPU** (device) is the way to go in order to make the **best use of your graphics card’s RAM**.

Moreover, if you have **many GPUs** to train your model on, it is best to keep your dataset “agnostic” and assign the batches to different GPUs during training.

---

So far, we’ve focused on the **training data** only. We built a *dataset* and a *data loader* for it. We could do the same for the **validation** data, using the **split** we performed at the beginning of this post… or we could use **random_split** instead.

### random_split

<h2><b><i>- How did you split your data?<br>- I didn't...<br>- WHAT?</i></b></h2>

PyTorch’s [**random_split()**](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split) function is an easy and familiar way of performing a **training-validation split**. Just keep in mind that, in our example, we need to apply it to the **whole dataset** (not the *training* dataset we built a couple of sections ago).

Then, for each subset of data, we build a corresponding DataLoader, so our code looks like this:

In [None]:
from torch.utils.data.dataset import random_split

# builds tensors from numpy arrays BEFORE split
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# builds dataset containing ALL data points
dataset = TensorDataset(x_tensor, y_tensor)

# performs the split
train_dataset, val_dataset = random_split(dataset, [80, 20])

# builds a loader of each set
train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)
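
The split is random, so it changes from run to run unless the random number generator is seeded. One option, in reasonably recent PyTorch versions, is passing a seeded `torch.Generator` through the `generator` argument of `random_split()`. The sketch below uses made-up data with 100 points and checks that the two subsets have the expected sizes and share no indices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# made-up data, 100 points
x_fake = torch.rand(100, 1)
y_fake = 1 + 2 * x_fake
fake_dataset = TensorDataset(x_fake, y_fake)

# a dedicated, seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(fake_dataset, [80, 20], generator=generator)

train_idx = set(train_subset.indices)
val_idx = set(val_subset.indices)
print(len(train_subset), len(val_subset), len(train_idx & val_idx))
```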

Now we have a **data loader** for our **validation** set, so, it makes sense to use it for the…

## Evaluation: does it generalize?

Now, we need to change the training loop to include the **evaluation of our model**, that is, computing the **validation loss**. The first step is to include another inner loop to handle the *mini-batches* that come from the *validation loader*, sending them to the same *device* as our model. Next, we make **predictions** using our model and compute the corresponding **loss**.

That’s pretty much it, but there are **two small, yet important**, things to consider:

* [**torch.no_grad()**](https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad): even though it won’t make a difference in our simple model, it is a **good practice to wrap the validation inner loop with this context manager to disable any gradient calculation** that you may inadvertently trigger — **gradients belong in training**, not in validation steps;
    
* [**eval()**](https://pytorch.org/docs/stable/nn.html#torch.nn.Module.eval): the only thing it does is **setting the model to evaluation mode** (just like its `train()` counterpart did), so the model can adjust its behavior regarding some operations, like [**Dropout**](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout).
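
A short sketch of what `torch.no_grad()` changes, using a throwaway `nn.Linear` model and random inputs (both assumptions for illustration): outside the context manager the output is attached to a computation graph, inside it is not.

```python
import torch
import torch.nn as nn

tiny_model = nn.Linear(1, 1)
x_val = torch.rand(20, 1)

tiny_model.eval()
tracked = tiny_model(x_val)        # graph is built, even in eval mode

with torch.no_grad():
    untracked = tiny_model(x_val)  # no graph, no gradient bookkeeping

print(tracked.requires_grad, untracked.requires_grad)  # True False
```

Notice that `eval()` alone does not disable gradient tracking; the two mechanisms are independent.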

Now, our training loop should look like this:

In [None]:
torch.manual_seed(42)

# builds tensors from numpy arrays BEFORE split
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# builds dataset containing ALL data points
dataset = TensorDataset(x_tensor, y_tensor)

# performs the split
train_dataset, val_dataset = random_split(dataset, [80, 20])

# builds a loader of each set
train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)

# defines learning rate
lr = 1e-1

# Create a MODEL, a LOSS FUNCTION and an OPTIMIZER
model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

# Creates the train_step function for our model, loss function and optimizer
train_step = make_train_step(model, loss_fn, optimizer)

n_epochs = 1000

losses = []
val_losses = []

# Looping through epochs...
for epoch in range(n_epochs):
    # TRAINING
    batch_losses = []
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        loss = train_step(x_batch, y_batch)
        batch_losses.append(loss)

    losses.append(np.mean(batch_losses))
        
    # VALIDATION
    # sets model to EVAL mode (once per epoch is enough)
    model.eval()
    # no gradients in validation!
    with torch.no_grad():
        val_batch_losses = []
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)

            # make predictions
            yhat = model(x_val)
            val_loss = loss_fn(yhat, y_val)
            val_batch_losses.append(val_loss.item())

        val_losses.append(np.mean(val_batch_losses))

print(model.state_dict())

In [None]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

"*Wait, there is something weird with this plot...*", you say. You're right, the **validation loss** is **smaller** than the **training loss**. Shouldn't it be the other way around?! Well, generally speaking, *YES*, it should... but you can learn more about situations where this *swap* happens in this great [post](http://pyimg.co/kku35).

## Training Loop

The training loop should be a stable structure, so we can organize it into functions as well...
Let's build a function for **validation** and another one for the **training loop** itself, training step and all!

In [None]:
def make_train_step(model, loss_fn, optimizer):
    # Builds function that performs a step in the train loop
    def train_step(x, y):
        # Sets model to TRAIN mode
        model.train()
        # Step 1: Makes predictions
        yhat = model(x)
        # Step 2: Computes loss
        loss = loss_fn(yhat, y)
        # Step 3: Computes gradients
        loss.backward()
        # Step 4: Updates parameters and zeroes gradients
        optimizer.step()
        optimizer.zero_grad()
        # Returns the loss
        return loss.item()
    
    # Returns the function that will be called inside the train loop
    return train_step

def validation(model, loss_fn, val_loader):
    # Figures device from where the model parameters (hence, the model) are
    # (keeps the full device, index included, e.g. cuda:1)
    device = next(model.parameters()).device

    # sets model to EVAL mode (once per call is enough)
    model.eval()
    # no gradients in validation!
    with torch.no_grad():
        val_batch_losses = []
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)

            # make predictions
            yhat = model(x_val)
            val_loss = loss_fn(yhat, y_val)
            val_batch_losses.append(val_loss.item())

        val_losses = np.mean(val_batch_losses)

    return val_losses


def train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader=None):
    # Figures device from where the model parameters (hence, the model) are
    device = next(model.parameters()).device
    # Creates the train_step function for our model, loss function and optimizer
    train_step = make_train_step(model, loss_fn, optimizer)

    losses = []
    val_losses = []

    for epoch in range(n_epochs):
        # TRAINING
        batch_losses = []
        for x_batch, y_batch in train_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            loss = train_step(x_batch, y_batch)
            batch_losses.append(loss)

        losses.append(np.mean(batch_losses))

        # VALIDATION
        if val_loader is not None:
            val_loss = validation(model, loss_fn, val_loader)
            val_losses.append(val_loss)

        print("Epoch {} complete...".format(epoch))

    return losses, val_losses

## Final Code

We finally have an organized version of our code, consisting of the following steps:
- building a **Dataset**
- performing a **random split** into **train** and **validation** datasets
- building **DataLoaders**
- building a **model**
- defining a **loss function**
- specifying a **learning rate**
- defining an **optimizer**
- specifying the **number of epochs**

All the nitty-gritty details of performing the actual training are encapsulated inside the **`train_loop`** function.

In [None]:
torch.manual_seed(42)

# builds tensors from numpy arrays BEFORE split
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# builds dataset containing ALL data points
dataset = TensorDataset(x_tensor, y_tensor)

# performs the split
train_dataset, val_dataset = random_split(dataset, [80, 20])

# builds a loader of each set
train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)

# defines learning rate
lr = 1e-1

# Create a MODEL, a LOSS FUNCTION and an OPTIMIZER
model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

n_epochs = 1000

losses, val_losses = train_loop(model, loss_fn, optimizer, n_epochs, train_loader, val_loader)

print(model.state_dict())

In [None]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

## Saving (and Loading) Models: taking a break

<h2><b><i>"That would be great, to restart training later"</i></b></h2>

So, it is important to be able to **checkpoint** our model, in case we'd like to **restart training later**.

To checkpoint a model, we basically have to **save its state** into a file, to **load** it back later - nothing special, actually.

What defines the **state of a model**?
- **model.state_dict()**: kinda obvious, right?
- **optimizer.state_dict()**: remember optimizers had the `state_dict` as well?
- **loss**: after all, you should keep track of its evolution
- **epoch**: it is just a number, so why not? :-)
- **anything else you'd like to have restored**

Then, **wrap everything into a Python dictionary** and use [**torch.save()**](https://pytorch.org/docs/stable/torch.html?highlight=save#torch.save) to dump it all into a file! Easy peasy!

In [None]:
checkpoint = {'epoch': n_epochs,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
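              # plain Python floats: np.mean returns NumPy scalars, which
              # recent PyTorch versions may refuse to unpickle by default
              'loss': [float(loss) for loss in losses],
              'val_loss': [float(loss) for loss in val_losses]}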

torch.save(checkpoint, 'model_checkpoint.pth')

How would you **load** it back? Easy as well:
- load the dictionary back using [**torch.load()**](https://pytorch.org/docs/stable/torch.html?highlight=torch%20load#torch.load)
- load **model** and **optimizer** state dictionaries back using their [**load_state_dict()**](https://pytorch.org/docs/stable/nn.html?highlight=load_state_dict#torch.nn.Module.load_state_dict) methods
- load everything else into their corresponding variables

In [None]:
checkpoint = torch.load('model_checkpoint.pth')

model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

epoch = checkpoint['epoch']
losses = checkpoint['loss']
val_losses = checkpoint['val_loss']

You may save a model for **checkpointing**, like we have just done, or for **making predictions**, assuming training is finished.
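The save/load round trip can be checked end to end with a minimal sketch. It uses a tiny stand-in model, made-up loss values and a temporary file name, all of them assumptions for illustration only. The extra entries are kept as plain Python numbers because, depending on your PyTorch version, `torch.load()` may restrict which objects it is willing to unpickle.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# A tiny stand-in model and optimizer (hypothetical, for illustration only)
model = nn.Sequential(nn.Linear(1, 1))
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Made-up values; plain Python floats keep the checkpoint free of NumPy objects
checkpoint = {'epoch': 10,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'loss': [0.5, 0.25],
              'val_loss': [0.6, 0.3]}

path = os.path.join(tempfile.mkdtemp(), 'toy_checkpoint.pth')
torch.save(checkpoint, path)

# A brand new model and optimizer start from a different state...
restored_model = nn.Sequential(nn.Linear(1, 1))
restored_optimizer = optim.SGD(restored_model.parameters(), lr=0.5)

# ...until the saved state is loaded back into them
loaded = torch.load(path)
restored_model.load_state_dict(loaded['model_state_dict'])
restored_optimizer.load_state_dict(loaded['optimizer_state_dict'])

# Compares every saved tensor with its restored counterpart
same_weights = all(torch.equal(p, q) for p, q in
                   zip(model.state_dict().values(),
                       restored_model.state_dict().values()))
restored_lr = restored_optimizer.param_groups[0]['lr']
print(same_weights, restored_lr, loaded['epoch'])
```

Notice that the optimizer's learning rate is restored too: it is part of the optimizer's `state_dict`, so the value used to create the new optimizer gets overwritten.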

After loading the model, **DO NOT FORGET**:

---

**SET THE MODE** (not the mood!):
- **resuming training (checkpointing): model.train()**
- **predicting: model.eval()**

---
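
For the single linear layer used in this tutorial, the mode makes no practical difference, but layers such as Dropout behave differently in each mode. The sketch below uses a hypothetical model with a Dropout layer (not the tutorial's model) just to make the switch visible.

```python
import torch
import torch.nn as nn

# Hypothetical model with a Dropout layer, only to make the two modes visible
model = nn.Sequential(nn.Linear(1, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))

model.train()                 # mode for resuming training
train_flag = model.training   # True

model.eval()                  # mode for making predictions
eval_flag = model.training    # False

x_new = torch.tensor([[0.5], [1.0]])
with torch.no_grad():         # no gradients are needed for predictions
    first = model(x_new)
    second = model(x_new)

# In eval mode dropout is disabled, so repeated predictions match
print(train_flag, eval_flag, torch.equal(first, second))
```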

## BONUS: Further Improvements


Is there **anything else** we can improve or change? Sure, there is **always something else** to add to your model — using a [**learning rate scheduler**](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate), for instance.

### Learning Rate Scheduler

In the "Playing with the Learning Rate" section, we observed how different **learning rates** may be more useful at different moments of the optimization process.

PyTorch offers a long list of **learning rate schedulers** for all your learning rate needs:
- [**StepLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.StepLR)
- [**MultiStepLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.MultiStepLR)
- [**ReduceLROnPlateau**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau)
- [**LambdaLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.LambdaLR)
- [**ExponentialLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ExponentialLR)
- [**CosineAnnealingLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.CosineAnnealingLR)
- [**CyclicLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.CyclicLR)
- [**OneCycleLR**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.OneCycleLR)
- [**CosineAnnealingWarmRestarts**](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.CosineAnnealingWarmRestarts)

To include a scheduler into our workflow, we need to take two steps:
- create a **scheduler** and pass our **optimizer as argument**
- use our scheduler's **step()** method
    - **after the validation**, that is, **last thing before finishing an epoch**, for the first 6 schedulers on the list
    - **after every batch update** for the last 3 schedulers on the list
    
We also need to **pass an argument** to **step()** if we're using **ReduceLROnPlateau**: the **validation loss**, which is the quantity the scheduler monitors to decide whether the **current learning rate is still effective**.
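
To make the placement of **step()** concrete, here is a minimal sketch of a per-epoch scheduler. The toy data, the `step_size=2` / `gamma=0.5` settings and the bare-bones loop are assumptions for illustration, not the tutorial's training loop.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

torch.manual_seed(42)

# Toy model and data (hypothetical), just to have something to optimize
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=2, gamma=0.5)

x = torch.rand(8, 1)
y = 1 + 2 * x

lrs = []
for epoch in range(6):
    # Learning rate in use DURING this epoch
    lrs.append(optimizer.param_groups[0]['lr'])

    # Two mini-batches of four points each
    for i in range(0, 8, 4):
        optimizer.zero_grad()
        loss = loss_fn(model(x[i:i + 4]), y[i:i + 4])
        loss.backward()
        optimizer.step()

    # Per-epoch schedulers: step() is the last thing in the epoch
    scheduler.step()

print(lrs)
```

With these settings the learning rate is halved every two epochs. For the schedulers that are updated after every batch, the `scheduler.step()` call would move inside the inner loop, right after `optimizer.step()`.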

In [None]:
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau, MultiStepLR

optimizer = optim.SGD(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimizer, 'min')

#scheduler = StepLR(optimizer, step_size=30, gamma=0.5)
#scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)

We are focusing only on **ReduceLROnPlateau**, **StepLR** and **MultiStepLR** in this tutorial, so we'll change our training loop accordingly: adding the **scheduler's step()** as the **last thing before finishing an epoch**.

In [None]:
def train_loop_with_scheduler(model, loss_fn, optimizer, scheduler, n_epochs, train_loader, val_loader=None):
    # Figures device from where the model parameters (hence, the model) are
    device = next(model.parameters()).device.type
    # Creates the train_step function for our model, loss function and optimizer
    train_step = make_train_step(model, loss_fn, optimizer)

    losses = []
    val_losses = []
    learning_rates = []

    for epoch in range(n_epochs):        
        # TRAINING
        batch_losses = []
        for x_batch, y_batch in train_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            loss = train_step(x_batch, y_batch)
            batch_losses.append(loss)

        losses.append(np.mean(batch_losses))

        # VALIDATION
        if val_loader is not None:
            val_loss = validation(model, loss_fn, val_loader)
            val_losses.append(val_loss)

        print("Epoch {} complete...".format(epoch))

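        # SCHEDULER
        if isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            # ReduceLROnPlateau needs a metric to monitor: uses the validation loss
            # if available, falling back to the training loss otherwise
            metric = val_losses[-1] if val_loader is not None else losses[-1]
            scheduler.step(metric)
        else:
            scheduler.step()

        # Keeps track of the learning rate set for the next epoch
        learning_rates.append(optimizer.state_dict()['param_groups'][0]['lr'])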
        
    return losses, val_losses, learning_rates

Let's run the whole thing once again!

In [None]:
torch.manual_seed(42)

# builds tensors from numpy arrays BEFORE split
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# builds dataset containing ALL data points
dataset = TensorDataset(x_tensor, y_tensor)

# performs the split
train_dataset, val_dataset = random_split(dataset, [80, 20])

# builds a loader of each set
train_loader = DataLoader(dataset=train_dataset, batch_size=16)
val_loader = DataLoader(dataset=val_dataset, batch_size=20)

# defines learning rate
lr = 1e-1

# Create a MODEL, a LOSS FUNCTION and an OPTIMIZER (and SCHEDULER)
model = nn.Sequential(nn.Linear(1, 1)).to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=lr)

scheduler = ReduceLROnPlateau(optimizer, 'min')
#scheduler = StepLR(optimizer, step_size=30, gamma=0.5)
#scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)

n_epochs = 1000

losses, val_losses, l_rates = train_loop_with_scheduler(model, loss_fn, optimizer, scheduler, n_epochs, train_loader, val_loader)

In [None]:
print(model.state_dict())

In [None]:
plt.plot(l_rates)
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Learning Rate')

As expected, the learning rate is progressively reduced.
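
The mechanism behind that reduction can be isolated in a toy sketch. It feeds **made-up** validation losses to the scheduler, with `factor=0.1` and `patience=2` chosen only to keep the example short (these are not the settings used above). No training happens here, since the focus is on the scheduler alone.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Toy model and optimizer (hypothetical); nothing is actually trained here
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.1, patience=2)

# Made-up validation losses: improving at first, then stuck on a plateau
fake_val_losses = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]

lrs = []
for val_loss in fake_val_losses:
    # The scheduler only looks at the monitored quantity
    scheduler.step(val_loss)
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

While the loss keeps improving, the learning rate stays put. Once it stalls for more than `patience` epochs, the scheduler multiplies the learning rate by `factor`.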

In [None]:
plt.plot(losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

## Final Thoughts

I believe this tutorial has **most of the necessary steps** one needs to go through in order to **learn**, in a **structured** and **incremental** way, how to **develop Deep Learning models using PyTorch**.

Hopefully, after working through all the code in this post, you’ll be able to better appreciate and more easily work your way through PyTorch’s official [tutorials](https://pytorch.org/tutorials/).

If you have any thoughts, comments or questions, please leave a comment below or contact me on [LinkedIn](https://br.linkedin.com/in/dvgodoy) or [Twitter](https://twitter.com/dvgodoy).