<a href="https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guide

Check the guide carefully before starting.

## ILOs

After completing this lab you should be able to 

* use PyTorch to implement differentiable computation graphs (neural networks)
* use automatic differentiation to obtain partial derivatives (gradients) with respect to trainable parameters
* use these general purpose functions to parameterise a binary classifier

## General notes

* In this notebook you are expected to use $\LaTeX$. 
* Use python3.
* Use Torch
* To have GPU support run this notebook on Google Colab (you will find more instructions later).

We will use a set of standard libraries that are often used in machine learning projects. If you are running this notebook on Google Colab, all libraries should be pre-installed. If you are running this notebook locally, you will need to install some additional packages; ask your TA for help if you have problems setting up.


This notebook gives you a short introduction to PyTorch, a software package that allows for easy design of differentiable computation graphs, the key object needed to implement and deploy models powered by neural networks.

## Credits

The first part of this tutorial (including the XOR example) is based on the [tutorial used for the MSc AI course on Deep Learning](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial2/Introduction_to_PyTorch.html).



## Table of contents

* Neural Networks
* The Basics of PyTorch
    * Tensors
    * Dynamic Computation Graph and Backpropagation    
    * Learning by example: Continuous XOR

## How to use this notebook

Skim the entire notebook before you get started; this gives you an idea of what lies ahead.

We advise you to work through the entire section `The Basics of PyTorch` before the live session. If you are new to PyTorch, that section on its own is a 2-4 hour investment. Because of that, the other classes in week 4 will demand less preparation.

## Setting up

Here we set up the packages that you will need to install for this tutorial.

In [None]:
!pip install tqdm
!pip install seaborn
!pip install torch
!pip install scikit-learn  # the PyPI package is called scikit-learn; it is imported as sklearn

In [None]:
## Standard libraries
import os
import math
import numpy as np 
import time

## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline 
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf') # For export
from matplotlib.colors import to_rgba
import seaborn as sns
sns.set()

## Progress bar
from tqdm.auto import tqdm

# Neural Networks


A neural network is a very flexible real-valued function: it maps some input to some output by means of a composition of differentiable parametric transformations.

Being parametric means that these transformations are specified by a set of real-valued parameters, whose values we can adjust/optimise towards a certain goal (e.g., maximum likelihood given a statistical model and a dataset of observations). Being differentiable means that we can use gradient-based search for parameter estimation.

Remember the GLM for text analysis? Given a document $x \in \mathcal X$ and a feature function $\mathbf h: \mathcal X \to \mathbb R^D$, the GLM uses linear models and nonlinear activation functions to parameterise a conditional distribution over the possible values of a response random variable $Y$ taking on values in $\mathcal Y$. Consider, for example, a GLM for a binary response variable:

\begin{align}
Y|X=x &\sim \mathrm{Bernoulli}(g(x; \theta)) \\
s &= \mathbf w^\top \mathbf h(x) + b \\
g(x; \mathbf w, b) &= \mathrm{sigmoid}(s)\\
\theta &= \{\mathbf w, b\}\\
&\quad \mathbf w \in \mathbb R^D, b \in \mathbb R
\end{align}

The output $s$ of the linear transformation is called a *linear predictor* (it maps the feature vector $\mathbf h(x)$ to the dimensionality of the Bernoulli parameter), the $\mathrm{sigmoid}$ function after that is called an *activation function* (it maps the linear predictor to the correct parameter space for the Bernoulli distribution). 

As it turns out the GLM is a very shallow neural network (NN)! It is made of a composition of two functions (the linear transformation and the activation), which are differentiable with respect to the trainable parameters. In a GLM, the data point, represented by its feature vector $\mathbf h(x)$, and the parameters interact linearly. In a neural network more generally, we would allow that interaction to be non-linear. 
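To make the composition concrete, here is a minimal NumPy sketch of the binary GLM above. The feature vector `h`, the weights `w` and the bias `b` are made-up values chosen only for illustration; they do not come from any dataset.

```python
import numpy as np

def sigmoid(s):
    # Maps a real-valued linear predictor to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical feature vector h(x) with D=3 features, and hypothetical parameters
h = np.array([1.0, 0.0, 2.0])
w = np.array([0.5, -1.0, 0.25])
b = -1.0

s = w @ h + b        # linear predictor: w^T h(x) + b
phi = sigmoid(s)     # Bernoulli parameter g(x; w, b)
print("linear predictor:", s)
print("P(Y=1|X=x):", phi)
```

The two steps (linear transformation, then activation) are exactly the two functions that make up this very shallow network.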

We had mentioned that one of the limitations of GLMs is the need for a pre-specified feature function. Now, with NNs, we are going to *parameterise* the feature function as well!

Before we can do this, we need to introduce you to a new software package for deep learning: **PyTorch**. The running example will be a probabilistic binary classifier, but we will not use text as input for now.

<details>
<summary>Why another package when we already know some JAX?</summary>

 JAX is a good didactic tool for understanding the role of automatic differentiation and for introducing gradient-based optimisation. In the long run, however, we need a software package that offers more ready-to-go code, so that you can count on certain important functionality. PyTorch is one of the best options out there: it is highly regarded amongst academics and in industry, and it is also the choice in the UvA's MSc AI (in case you decide to join that programme later on).

---

</details>


# The Basics of PyTorch

We will start by reviewing the very basic concepts of PyTorch. As a prerequisite, we recommend being familiar with the `numpy` package, as most machine learning frameworks are based on very similar concepts. If you are not familiar with numpy yet, don't worry: here is a [tutorial](https://numpy.org/devdocs/user/quickstart.html) to go through. 

So, let's start with importing PyTorch. The package is called `torch`, based on its original framework [Torch](http://torch.ch/). As a first step, we can check its version:

In [None]:
import torch
print("Using torch", torch.__version__)

As in every machine learning framework, PyTorch provides stochastic functions, such as random number generators. However, it is good practice to set up your code so that it is reproducible with the exact same random numbers. This is why we set a seed below. 

In [None]:
torch.manual_seed(42) # Setting the seed
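As a quick check of what the seed does, the sketch below draws random numbers twice, resetting the seed in between. It only assumes the CPU random number generator; the variable names `first` and `second` are ours.

```python
import torch

torch.manual_seed(42)
first = torch.rand(3)

torch.manual_seed(42)   # resetting the seed restarts the random sequence
second = torch.rand(3)

print(first)
print(second)
print("Identical draws:", torch.equal(first, second))
```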

### Tensors

Tensors are the PyTorch equivalent of NumPy arrays, with additional support for GPU acceleration (more on that later).
The name "tensor" is a generalization of concepts you already know. For instance, a vector is a 1-D tensor, and a matrix a 2-D tensor. When working with neural networks, we will use tensors of various shapes and number of dimensions.

Most common functions you know from numpy can be used on tensors as well. Actually, since numpy arrays are so similar to tensors, we can convert most tensors to numpy arrays (and back) but we don't need it too often.

#### Initialization

Let's start by looking at different ways of creating a tensor. There are many possible options; the simplest one is to call `torch.Tensor`, passing the desired shape as input argument:

In [None]:
x = torch.Tensor(2, 3, 4)
print(x)

The function `torch.Tensor` allocates memory for the desired tensor, but does not initialize it: the tensor contains whatever values were already in that memory. To directly assign values to the tensor during initialization, there are many alternatives including:

* `torch.zeros`: Creates a tensor filled with zeros
* `torch.ones`: Creates a tensor filled with ones
* `torch.rand`: Creates a tensor with random values uniformly sampled between 0 and 1
* `torch.randn`: Creates a tensor with random values sampled from a normal distribution with mean 0 and variance 1
* `torch.arange`: Creates a tensor containing the values $N,N+1,N+2,...,M-1$ when called as `torch.arange(N, M)` (the end value $M$ is excluded)
* `torch.Tensor` (input list): Creates a tensor from the list elements you provide
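The initializers listed above can be tried side by side. This is a small sketch; the shapes and the range passed to `torch.arange` are arbitrary choices.

```python
import torch

zeros = torch.zeros(2, 3)          # all zeros, shape [2, 3]
ones = torch.ones(2, 3)            # all ones, shape [2, 3]
uniform = torch.rand(2, 3)         # uniform samples in [0, 1)
normal = torch.randn(2, 3)         # standard normal samples
steps = torch.arange(2, 6)         # integers 2, 3, 4, 5 (the end value 6 is excluded)

for name, t in [("zeros", zeros), ("ones", ones), ("rand", uniform),
                ("randn", normal), ("arange", steps)]:
    print(name, tuple(t.shape), t.dtype)
```

Note that `torch.arange` with integer arguments returns an integer tensor, while the other initializers return floating-point tensors by default.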

In [None]:
# Create a tensor from a (nested) list
x = torch.Tensor([[1, 2], [3, 4]])
print(x)

In [None]:
# Create a tensor with random values between 0 and 1 with the shape [2, 3, 4]
x = torch.rand(2, 3, 4)
print(x)

You can obtain the shape of a tensor in the same way as in numpy (`x.shape`), or using the `.size` method:

In [None]:
shape = x.shape
print("Shape:", x.shape)

size = x.size()
print("Size:", size)

dim1, dim2, dim3 = x.size()
print("Size:", dim1, dim2, dim3)

#### Tensor to Numpy, and Numpy to Tensor

Tensors can be converted to numpy arrays, and numpy arrays back to tensors. To transform a numpy array into a tensor, we can use the function `torch.from_numpy`:

In [None]:
np_arr = np.array([[1, 2], [3, 4]])
tensor = torch.from_numpy(np_arr)

print("Numpy array:", np_arr)
print("PyTorch tensor:", tensor)

To transform a PyTorch tensor back to a numpy array, we can use the function `.numpy()` on tensors:

In [None]:
tensor = torch.arange(4)
np_arr = tensor.numpy()

print("PyTorch tensor:", tensor)
print("Numpy array:", np_arr)

The conversion of tensors to numpy requires the tensor to be on the CPU, and not the GPU (more on GPU support in a later section). In case you have a tensor on GPU, you need to call `.cpu()` on the tensor beforehand. Hence, you get a line like `np_arr = tensor.cpu().numpy()`.
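One detail worth knowing is that `torch.from_numpy` does not copy the data: the tensor and the array share memory. The sketch below illustrates this, and also shows the `.detach()` call that is needed before converting a tensor that tracks gradients (gradients are introduced later in this notebook). The variable names are ours.

```python
import numpy as np
import torch

np_arr = np.zeros(3)
shared = torch.from_numpy(np_arr)   # shares memory with np_arr
copied = torch.tensor(np_arr)       # makes an independent copy

np_arr[0] = 1.0                     # modify the numpy array in place
print("shared:", shared)            # reflects the change
print("copied:", copied)            # still all zeros

# Tensors that track gradients must be detached before conversion
w = torch.ones(2, requires_grad=True)
w_np = w.detach().cpu().numpy()
print("from a tensor requiring grad:", w_np)
```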

#### Operations

Most operations that exist in numpy, also exist in PyTorch. A full list of operations can be found in the [PyTorch documentation](https://pytorch.org/docs/stable/tensors.html#), but we will review the most important ones here.

The simplest operation is to add two tensors:

In [None]:
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
y = x1 + x2

print("X1", x1)
print("X2", x2)
print("Y", y)

Calling `x1 + x2` creates a new tensor containing the sum of the two inputs. However, we can also use in-place operations, which are applied directly to the memory of a tensor. In that case we change the values of `x2` itself, and its values from before the operation can no longer be accessed. An example is shown below:

In [None]:
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
print("X1 (before)", x1)
print("X2 (before)", x2)

x2.add_(x1)
print("X1 (after)", x1)
print("X2 (after)", x2)

In-place operations are usually marked with an underscore suffix (e.g. "add_" instead of "add").

Another common operation aims at changing the shape of a tensor. A tensor of size (2,3) can be re-organized to any other shape with the same number of elements (e.g. a tensor of size (6), or (3,2), ...). In PyTorch, this operation is called `view`:

In [None]:
x = torch.arange(6)
print("X", x)

In [None]:
x = x.view(2, 3)
print("X", x)

In [None]:
x = x.permute(1, 0) # Swapping dimension 0 and 1
print("X", x)
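A caveat when combining `view` and `permute`: `view` never copies data, so it only works when the requested shape is compatible with the memory layout of the tensor. After a `permute` this is often not the case. The sketch below shows the failure and one way around it, `reshape`, which copies the data when it has to.

```python
import torch

x = torch.arange(6).view(2, 3)
xt = x.permute(1, 0)               # shape [3, 2], same memory, different strides
print("contiguous:", xt.is_contiguous())

try:
    xt.view(6)                     # view needs a compatible memory layout
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat = xt.reshape(6)               # reshape copies the data when it has to
print(flat)
```

Calling `xt.contiguous().view(6)` is an equivalent alternative to `reshape` here.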

Other commonly used operations include matrix multiplications, which are essential for neural networks. Quite often, we have an input vector $\mathbf{x}$, which is transformed using a learned weight matrix $\mathbf{W}$. There are multiple ways and functions to perform matrix multiplication, some of which we list below:

* `torch.matmul`: Performs the matrix product over two tensors, where the specific behavior depends on the dimensions. If both inputs are matrices (2-dimensional tensors), it performs the standard matrix product. For higher dimensional inputs, the function supports broadcasting (for details see the [documentation](https://pytorch.org/docs/stable/generated/torch.matmul.html?highlight=matmul#torch.matmul)). Can also be written as `a @ b`, similar to numpy. 
* `torch.mm`: Performs the matrix product over two matrices, but doesn't support broadcasting (see [documentation](https://pytorch.org/docs/stable/generated/torch.mm.html?highlight=torch%20mm#torch.mm))
* `torch.bmm`: Performs the matrix product with support for a batch dimension. If the first tensor $T$ is of shape ($b\times n\times m$), and the second tensor $R$ ($b\times m\times p$), the output $O$ is of shape ($b\times n\times p$), and has been calculated by performing $b$ matrix multiplications of the submatrices of $T$ and $R$: $O_i = T_i @ R_i$
* `torch.einsum`: Performs matrix multiplications and more (i.e. sums of products) using the Einstein summation convention. Explanation of the Einstein sum can be found in assignment 1.
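To see how the functions in this list relate, the following sketch computes the same product in several ways and checks that the results agree. The shapes are arbitrary small examples.

```python
import torch

A = torch.rand(2, 3)
B = torch.rand(3, 4)

out_matmul = torch.matmul(A, B)               # general matrix product
out_at = A @ B                                # operator form of matmul
out_mm = torch.mm(A, B)                       # strictly 2-D inputs
out_einsum = torch.einsum("ij,jk->ik", A, B)  # sum over the shared index j

print(torch.allclose(out_matmul, out_at))
print(torch.allclose(out_matmul, out_mm))
print(torch.allclose(out_matmul, out_einsum))

# Batched product: b=5 independent (2x3) @ (3x4) multiplications
T = torch.rand(5, 2, 3)
R = torch.rand(5, 3, 4)
O = torch.bmm(T, R)
print(O.shape)
print(torch.allclose(O[0], T[0] @ R[0]))
```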

Usually, we use `torch.matmul` or `torch.bmm`. We can try a matrix multiplication with `torch.matmul` below.

In [None]:
x = torch.arange(6)
x = x.view(2, 3)
print("X", x)

In [None]:
W = torch.arange(9).view(3, 3) # We can also stack multiple operations in a single line
print("W", W)

In [None]:
h = torch.matmul(x, W) # Verify the result by calculating it by hand too!
print("h", h)

#### Indexing

We often have the situation where we need to select a part of a tensor. Indexing works just like in numpy, so let's try it:

In [None]:
x = torch.arange(12).view(3, 4)
print("X", x)

In [None]:
print(x[:, 1])   # Second column

In [None]:
print(x[0])      # First row

In [None]:
print(x[:2, -1]) # First two rows, last column

In [None]:
print(x[1:3, :]) # Middle two rows
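Besides slices, tensors also support the numpy-style indexing with boolean masks and with index tensors. A short sketch, using the same `x` as above:

```python
import torch

x = torch.arange(12).view(3, 4)

mask = x % 2 == 0                  # boolean tensor with the same shape as x
print(x[mask])                     # 1-D tensor with the even entries

rows = torch.tensor([0, 2])        # select rows by an index tensor
print(x[rows])
```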

## Dynamic Computation Graph and Backpropagation

One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get **gradients/derivatives** of functions that we define. We will mainly use PyTorch for implementing neural networks, and they are just fancy functions. If we use weight matrices in our function that we want to learn, then those are called the **parameters** or simply the **weights**.

For a function of a single variable with a single scalar output, we talk about taking the **derivative**. A neural network, however, typically depends on **many** variables (its parameters); in that case we collect the partial derivatives of the output with respect to each of them in a vector and talk about the **gradient**. It's the more general term.

Given an input $\mathbf{x}$, we define our function by **manipulating** that input, usually by matrix-multiplications with weight matrices and additions with so-called bias vectors. As we manipulate our input, we are automatically creating a **computational graph**. This graph shows how to arrive at our output from our input. 
PyTorch is a **define-by-run** framework; this means that we can just do our manipulations, and PyTorch will keep track of that graph for us. Thus, we create a dynamic computation graph along the way.

So, to recap: the only thing we have to do is to describe how to compute the **output**, and then we can ask PyTorch to automatically get the **gradients**. 

> **Note:  Why do we want gradients?** Consider that we have defined a function, a neural net, that is supposed to compute a certain output $y$ for an input vector $\mathbf{x}$. We then define an **error measure** that tells us how wrong our network is; how bad it is in predicting output $y$ from input $\mathbf{x}$. Based on this error measure, we can use the gradients to **update** the weights $\mathbf{W}$ that were responsible for the output, so that the next time we present input $\mathbf{x}$ to our network, the output will be closer to what we want.

The first thing we have to do is to specify which tensors require gradients. By default, when we create a tensor, it does not require gradients.

In [None]:
x = torch.ones((3,))
print(x.requires_grad)

We can change this for an existing tensor using the function `requires_grad_()` (the underscore indicates that this is an in-place operation). Alternatively, when creating a tensor, you can pass the argument `requires_grad=True` to most initializers we have seen above.

In [None]:
x.requires_grad_(True)
print(x.requires_grad)
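The flag also propagates: any result computed from a tensor that requires gradients requires gradients itself, unless the computation happens inside `torch.no_grad()`. A minimal sketch, with variable names of our choosing:

```python
import torch

w = torch.ones(3, requires_grad=True)   # request gradients at creation time
x = torch.ones(3)                       # plain input, no gradients needed

y = (w * x).sum()
print(w.requires_grad, x.requires_grad, y.requires_grad)

# Inside torch.no_grad() no graph is recorded, so results do not require gradients
with torch.no_grad():
    y_eval = (w * x).sum()
print(y_eval.requires_grad)
```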

In order to get familiar with the concept of a computation graph, we will create one for the following function from $\mathbb R^2$ to $\mathbb R$: 

$$f(x_1, x_2; w_1, w_2) = \sqrt{(x_1 - w_1)^2 + (x_2 - w_2)^2}$$

where we have two real valued inputs $(x_1, x_2)$ and two real-valued parameters $(w_1, w_2)$.


You could imagine that we are given the inputs $(x_1=1.5, x_2=2.5)$ and intend to optimise our choice of parameters so as to either maximize or minimize the output $f(x_1, x_2; w_1, w_2)$. For this, we want to obtain the vector of partial derivatives of $f$ evaluated at the given input with respect to each one of the parameters: 

$$
\begin{bmatrix}
\frac{\partial}{\partial w_1}f(x_1, x_2; w_1, w_2)\\ 
\frac{\partial}{\partial w_2}f(x_1, x_2; w_1, w_2)
\end{bmatrix}
$$

Because writing this can get cumbersome rather quickly, we introduce vector notation. That is, we can define the two-dimensional input vector $\mathbf x = (x_1, x_2)^\top$ and the two-dimensional parameter vector $\mathbf w = (w_1, w_2)^\top$, and denote our function $f$ evaluated at a given $\mathbf x$ for a fixed choice of $\mathbf w$ by, for example, $z = f(\mathbf x; \mathbf w)$. The gradient vector can then be expressed as 

\begin{align}
\nabla_{\mathbf w} z = ( \partial z / \partial w_1, \partial z / \partial w_2 )^\top
\end{align}


**Notation guideline** Recall that in mathematics, vectors are generally *column vectors*, that is, when we say $\mathbf x = (x_1, x_2)^\top$ is a vector we mean that the left-hand side is a *column* vector, which we obtain by transposing the *row* vector $(x_1, x_2)$ in the right-hand side of the expression. Unfortunately, much of machine learning, deep learning, and NLP literature can be very inconsistent about this. We try to give you a cleaner view of the picture, but you need to practice some robustness to these inconsistencies, as they will come up when you read papers and even textbooks. 


Also note that when we write vectorised code, different libraries implement different conventions, so the best strategy is to be alert and check carefully what each software package does.

**Inputs versus parameters** From the point of view of a neural network toolkit, the difference between what we call inputs and what we call parameters is not very obvious. Without inputs and parameters, the neural network cannot compute an output, so in a sense both inputs and parameters are necessary inputs. For us, designers, the difference is clear: the parameters are fixed and, together with the operations we have, they specify the actual function; the inputs may vary, and each time we use the function we are interested in its output value for a certain $\mathbf x$. When we instantiate inputs and parameters in an NN toolkit, we need to let the toolkit know that we intend to treat parameters as trainable quantities, and we do that by telling it that a certain quantity *requires gradients*.  

If we confuse you when we say that *parameters are fixed* just think about it from the point of view of PyTorch. We may "change" parameters through the course of a training algorithm, but each time we evaluate the function, the parameters are known and fixed.

**Quiz** Instantiate the input vector $\mathbf x = (1.5, 2.5)^\top$ in a torch tensor.

<details>
    <summary><b>SOLUTION</b></summary>
    
```python
x = torch.tensor([1.5, 2.5], dtype=torch.float32)
x
```
    
    
</details>

---

**Quiz** Instantiate the parameter vector $\mathbf w = (-1, 1)^\top$ in a torch tensor. The choice $(-1, 1)$ is not informed by anything, we just pick it as an example. 

<details>
    <summary><b>SOLUTION</b></summary>
    
```python
w = torch.tensor([-1, 1], dtype=torch.float32, requires_grad=True)
w
```
    
    
</details>

---

**Quiz** Compute $z$, the output of $f$ evaluated at $\mathbf x$ with a given parameter vector $\mathbf w$. Make sure to write vectorised code (i.e., numpy-style code).

<details>
    <summary><b>SOLUTION</b></summary>

    
```python
z = torch.sqrt(((x - w)**2).sum())
```     

Or something more step-by-step such as
    
```python    
a = x - w
b = a ** 2
c = b.sum()
z = torch.sqrt(c)
z
```

   
    
    
</details>

---

Here we will build the computation graph step by step. You can combine multiple operations in a single line, but we will separate them here to get a better understanding of how each operation is added to the computation graph.

In [None]:
# inputs
x = torch.tensor([1.5, 2.5], dtype=torch.float32)
# parameters
w = torch.tensor([-1, 1], dtype=torch.float32, requires_grad=True)
# computation
a = x - w
b = a ** 2
c = b.sum()
z = torch.sqrt(c)
z

Using the statements above, we have created a computation graph that looks similar to the figure below:

<center style="width: 100%"><img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/pytorch_computation_graph.svg" width="200px"></center>

We calculate $\mathbf a$ based on the inputs $\mathbf x$ and the parameter vector $\mathbf w$, $\mathbf b$ is $\mathbf a$ squared elementwise, $c$ is the sum of the coordinates of $\mathbf b$, and $z$ is the square root of $c$. The visualization is an abstraction of the dependencies between inputs and outputs of the operations we have applied.
Each node of the computation graph has automatically defined a function for calculating the gradients with respect to its inputs, `grad_fn`. You can see this in the printed output tensor $z$ above. This is why the computation graph is usually visualized in the reverse direction (arrows point from the result to the inputs). 
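If you want to look at these functions yourself, the sketch below rebuilds the same graph and prints the `grad_fn` of every node. The exact names that get printed depend on your PyTorch version, so treat the output as illustrative.

```python
import torch

x = torch.tensor([1.5, 2.5])
w = torch.tensor([-1.0, 1.0], requires_grad=True)

a = x - w
b = a ** 2
c = b.sum()
z = torch.sqrt(c)

# Leaf tensors created by the user have no grad_fn
print("w:", w.is_leaf, w.grad_fn)
# Intermediate results remember the operation that produced them
for name, t in [("a", a), ("b", b), ("c", c), ("z", z)]:
    print(name, "->", t.grad_fn)

# next_functions links a node to the backward functions of its inputs
print(z.grad_fn.next_functions)
```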

<details>
<summary> <b> Warning </b> </summary> 
Do not confuse a computation graph with a directed graphical model. Both are built upon directed (often acyclic) graphs, but they are used for different things: a computation graph is a way to organise function compositions, whereas a directed graphical model is a way to state conditional independence assumptions.

---

</details>

We can perform backpropagation on the computation graph by calling the function `backward()` on the last output, which effectively calculates the gradients for each tensor that has the property `requires_grad=True`:

**Quiz**  Evaluate the gradient of $z$ with respect to the given parameters $\mathbf w$, by calling `.backward()` on the output node. Note that PyTorch will store the value of the gradient directly on the parameter vector itself, in the attribute `.grad`.

<details>
    <summary><b>SOLUTION</b></summary>
    
```python
z.backward()
w.grad
```
    
    
</details>

---

Each coordinate of `w.grad` is a coordinate of the gradient vector.

We can also verify these gradients by hand. We will calculate the gradients using the chain rule, in the same way as PyTorch did it:

$$\frac{\partial z}{\partial w_i} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial w_i}$$


**Quiz**  State the partial derivatives:

\begin{align}
\frac{\partial a_i}{\partial w_i} &=  \\
\frac{\partial b_i}{\partial a_i} &= \\
\frac{\partial c}{\partial b_i} &= \\
\frac{\partial z}{\partial c} &= 
\end{align}


<details>
    <summary><b>SOLUTION</b></summary>
    
\begin{align}
\frac{\partial a_i}{\partial w_i} &= -1 \\
\frac{\partial b_i}{\partial a_i} &= 2 a_i\\
\frac{\partial c}{\partial b_i} &= 1 \\
\frac{\partial z}{\partial c} &= \frac{1}{2}c^{-1/2}
\end{align}
    
</details>

---
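The hand-derived factors can be checked numerically against what PyTorch computes. The sketch below multiplies the four partial derivatives for our example values and compares the product with `w.grad`; the helper names (`dz_dc`, `manual_grad`, and so on) are ours.

```python
import torch

x = torch.tensor([1.5, 2.5])
w = torch.tensor([-1.0, 1.0], requires_grad=True)

a = x - w
b = a ** 2
c = b.sum()
z = torch.sqrt(c)
z.backward()

# Chain rule by hand, factor by factor (detach: we only need the values)
dz_dc = 0.5 * c.detach() ** -0.5     # dz/dc = (1/2) c^{-1/2}
dc_db = torch.ones(2)                # dc/db_i = 1
db_da = 2 * a.detach()               # db_i/da_i = 2 a_i
da_dw = -torch.ones(2)               # da_i/dw_i = -1
manual_grad = dz_dc * dc_db * db_da * da_dw

print("autograd:", w.grad)
print("by hand: ", manual_grad)
print("match:", torch.allclose(w.grad, manual_grad))
```

Simplifying the product gives $\partial z / \partial w_i = -a_i / z$, which is a handy sanity check for the printed numbers.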

Isn't it handy that PyTorch can do all of that for us? Not only is it less tedious that way, PyTorch is also debugged, tested, and knows the derivatives of *many* elementary functions. It also knows tricks to compute them efficiently and in a numerically stable way.
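To connect the gradient back to optimisation, here is a minimal sketch of a few gradient descent steps that move $\mathbf w$ so as to minimise $z$. The step size `learning_rate` and the number of steps are arbitrary values picked for illustration, not recommendations.

```python
import torch

x = torch.tensor([1.5, 2.5])
w = torch.tensor([-1.0, 1.0], requires_grad=True)
learning_rate = 0.5                  # hypothetical step size, chosen for illustration

for step in range(5):
    z = torch.sqrt(((x - w) ** 2).sum())
    z.backward()                     # stores dz/dw in w.grad
    with torch.no_grad():            # the update itself must not be tracked
        w -= learning_rate * w.grad
    w.grad.zero_()                   # gradients accumulate, so reset them
    print(step, round(z.item(), 4), w.detach().tolist())
```

Two details matter here: the update is wrapped in `torch.no_grad()` so that it does not become part of the computation graph, and `w.grad` is reset after each step because `backward()` adds to the stored gradient rather than overwriting it.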

### GPU support

A crucial feature of PyTorch is the support of GPUs, short for Graphics Processing Unit. A GPU can perform many thousands of small operations in parallel, making it well suited for performing large matrix operations in neural networks. When comparing GPUs to CPUs, we can list the following main differences (credit: [Kevin Krewell, 2009](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/)) 

<center style="width: 100%"><img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/comparison_CPU_GPU.png" width="700px"></center>

CPUs and GPUs both have different advantages and disadvantages, which is why many computers contain both components and use them for different tasks. In case you are not familiar with GPUs, you can read up more details in this [NVIDIA blog post](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/) or [here](https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html). 

GPUs can accelerate the training of your network up to a factor of $100$ which is essential for large neural networks. PyTorch implements a lot of functionality for supporting GPUs (mostly those of NVIDIA due to the libraries [CUDA](https://developer.nvidia.com/cuda-zone) and [cuDNN](https://developer.nvidia.com/cudnn)). First, let's check whether you have a GPU available:

In [None]:
gpu_avail = torch.cuda.is_available()
print("Is the GPU available? %s" % str(gpu_avail))

If you have a GPU on your computer but the command above returns False, make sure you have the correct CUDA-version installed. On Google Colab (*recommended*), make sure that you have selected a GPU in your runtime setup (in the menu, check under `Runtime -> Change runtime type`). 

By default, all tensors you create are stored on the CPU. We can push a tensor to the GPU by using the function `.to(...)`, or `.cuda()`. However, it is often a good practice to define a `device` object in your code which points to the GPU if you have one, and otherwise to the CPU. Then, you can write your code with respect to this device object, and it allows you to run the same code on both a CPU-only system, and one with a GPU. Let's try it below. We can specify the device as follows: 

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

On Colab you can have access to a GPU: go to the menu `Runtime`, choose `Change runtime type` and select a GPU. Note that there's a limit on how much time you can stay connected to a GPU. For most of this tutorial you will in fact not need a GPU. You can use it if you like though.

The most important aspect of PyTorch we will be using for now is its ability to compute derivatives automatically for us (not so much its ability to accelerate code using GPUs).

Now let's create a tensor and push it to the device:

In [None]:
x = torch.zeros(2, 3)
x = x.to(device)
x

In case you have a GPU, you should now see the attribute `device='cuda:0'` being printed next to your tensor. The zero next to cuda indicates that this is the zero-th GPU device on your computer. PyTorch also supports multi-GPU systems, but this you will only need once you have very big networks to train (if interested, see the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#distributed-basics)). We can also compare the runtime of a large matrix multiplication on the CPU with the same operation on the GPU:

In [None]:
x = torch.randn(5000, 5000)

## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print("CPU time: %6.5fs" % (end_time - start_time))

## GPU version (only meaningful if a GPU is available)
if torch.cuda.is_available():
    x = x.to(device)
    # The first operation on a CUDA device can be slow as it has to establish a CPU-GPU communication first. 
    # Hence, we run an arbitrary command first without timing it for a fair comparison.
    _ = torch.matmul(x*0.0, x)
    # CUDA operations run asynchronously, so we wait for pending work before reading the clock.
    torch.cuda.synchronize()
    start_time = time.time()
    _ = torch.matmul(x, x)
    torch.cuda.synchronize()
    end_time = time.time()
    print("GPU time: %6.5fs" % (end_time - start_time))

Depending on the size of the operation and the CPU/GPU in your system, the speedup of this operation can be >500x. As `matmul` operations are very common in neural networks, we can already see the great benefit of training an NN on a GPU. The time estimate can be relatively noisy here because we have only run the operation once. Feel free to extend this to multiple runs, but note that it then also takes longer to run.
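One possible way to reduce the noise is to average over several runs. The helper `time_matmul` below is a hypothetical function written for this sketch (it is not part of PyTorch), and it uses a smaller matrix than above to keep the run short.

```python
import time
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def time_matmul(x, repeats=5):
    """Average wall-clock time of x @ x over several runs."""
    _ = torch.matmul(x, x)                  # warm-up run, not timed
    if x.is_cuda:
        torch.cuda.synchronize()            # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(repeats):
        _ = torch.matmul(x, x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

x = torch.randn(1000, 1000)                 # smaller than above to keep the run short
print("CPU: %.5fs" % time_matmul(x))
if device.type == "cuda":
    print("GPU: %.5fs" % time_matmul(x.to(device)))
```

The calls to `torch.cuda.synchronize()` matter because GPU operations are launched asynchronously; without them the clock may be read before the multiplication has finished.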

When generating random numbers, the seed between CPU and GPU is not synchronized. Hence, we need to set the seed on the GPU separately to ensure reproducible code. Note that due to different GPU architectures, running the same code on different GPUs does not guarantee the same random numbers. Still, we don't want our code to give us a different output every time we run it on the exact same hardware. Hence, we also set the seed on the GPU:

In [None]:
# GPU operations have a separate seed we also want to set
if torch.cuda.is_available(): 
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)
    
# Additionally, some operations on a GPU are implemented stochastically for efficiency
# We want to ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Learning by example: Continuous XOR

If we want to build a neural network in PyTorch, we could specify all our parameters (weight matrices, bias vectors) using `Tensors` (with `requires_grad=True`), ask PyTorch to calculate the gradients and then adjust the parameters. But things can quickly get cumbersome if we have a lot of parameters. In PyTorch, there is a package called `torch.nn` that makes building neural networks more convenient. 

We will introduce the libraries and all additional parts you might need to train a neural network in PyTorch, using a simple classifier on a simple yet well-known problem: XOR. Given two binary inputs $x_1$ and $x_2$, the label to predict is $y=1$ if exactly one of $x_1$ and $x_2$ is $1$, and $y=0$ otherwise. The example became famous because a single neuron, i.e. a linear classifier, cannot learn this simple function.
Hence, we will learn how to build a small neural network that can learn this function. 
To make it a little bit more interesting, we move the XOR into continuous space and introduce some Gaussian noise on the binary inputs. Our desired separation of an XOR dataset could look as follows:

<center style="width: 100%"><img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/continuous_xor.svg" width="350px"></center>


Note that, unlike most real problems, the continuous XOR has a simple, albeit non-linear, *deterministic* solution. Yet, we are going to pretend that is not the case, just to see whether our statistical model can learn to behave as XOR does.

### The model


\begin{align}
Y|X_1=x_1, X_2 = x_2 \sim \mathrm{Bernoulli}(g(x_1, x_2; \theta))
\end{align}

where we map from inputs $(x_1, x_2)$ to a probability value $\phi=g(x_1, x_2; \theta)$ via:

\begin{align}
\mathbf x &= (x_1, x_2)^\top \\
\mathbf h &= \tanh(\mathbf W^{(1)} \mathbf x + \mathbf b^{(1)}) \\
s &= \mathbf W^{(2)} \mathbf h + \mathbf b^{(2)} \\
g(x_1, x_2; \theta) &= \mathrm{sigmoid}(s)
\end{align}

* $\mathbf W^{(1)} \in \mathbb R^{H\times 2},  \mathbf b^{(1)} \in \mathbb R^H$
* $\mathbf W^{(2)} \in \mathbb R^{1\times H},  \mathbf b^{(2)} \in \mathbb R$
* $\tanh$ applies elementwise
* trainable parameters $\theta = \{\mathbf W^{(1)},  \mathbf b^{(1)}, \mathbf W^{(2)},  \mathbf b^{(2)}\}$

Remember the probability mass function (pmf) of the $\mathrm{Bernoulli}(\phi)$ distribution: it assigns probability mass $f(y|\phi)=\phi^y(1-\phi)^{1-y}$ to an outcome $y \in \{0, 1\}$. For us, the parameter will be predicted for each input via $\phi = g(x_1, x_2; \theta)$.
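As a small numerical illustration of the pmf, the sketch below evaluates it by hand and via `torch.distributions.Bernoulli`, which works with log-probabilities. The value `phi = 0.8` is a made-up probability, and `bernoulli_pmf` is a helper defined only for this example.

```python
import torch

def bernoulli_pmf(y, phi):
    # phi^y * (1 - phi)^(1 - y) for y in {0, 1}
    return phi ** y * (1 - phi) ** (1 - y)

phi = torch.tensor(0.8)                 # hypothetical predicted probability
for outcome in [0.0, 1.0]:
    y = torch.tensor(outcome)
    by_hand = bernoulli_pmf(y, phi)
    # torch.distributions returns the log of the probability mass
    log_prob = torch.distributions.Bernoulli(probs=phi).log_prob(y)
    print(int(outcome), by_hand.item(), log_prob.exp().item())
```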

The design above is what we call an *architecture*. We pick an architecture based on intuition, convenience, and research practice. Architectures do vary in expressiveness (that is, the class of functions they encompass). This one is more expressive than a simple linear function due to the presence of what we call a non-linear hidden layer. It turns out the presence of this layer is the key to solving the XOR problem. It essentially learns a **feature representation** for us.
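Using only the tensor operations introduced so far, the architecture can be sketched as follows. This is an illustration under assumptions: the hidden size `H = 4`, the random initial values and the single input point are arbitrary, and the function name `g` simply mirrors the notation above.

```python
import torch

H = 4                                        # hypothetical hidden size
W1 = torch.randn(H, 2, requires_grad=True)   # W^(1) in R^{H x 2}
b1 = torch.zeros(H, requires_grad=True)      # b^(1) in R^H
W2 = torch.randn(1, H, requires_grad=True)   # W^(2) in R^{1 x H}
b2 = torch.zeros(1, requires_grad=True)      # b^(2) in R

def g(x):
    h = torch.tanh(W1 @ x + b1)              # hidden layer (learned features)
    s = W2 @ h + b2                          # linear predictor
    return torch.sigmoid(s)                  # Bernoulli parameter

x = torch.tensor([1.0, 0.0])                 # one input point (x_1, x_2)
phi = g(x)
print(phi)

# Negative log-likelihood of the label y=1, and its gradients
y = torch.tensor([1.0])
nll = -(y * torch.log(phi) + (1 - y) * torch.log(1 - phi)).sum()
nll.backward()
print(W1.grad.shape, b1.grad.shape, W2.grad.shape, b2.grad.shape)
```

Each parameter receives a gradient of the same shape as itself, which is what a gradient-based training loop needs.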

**Describing architectures** We can describe an architecture as shown above, carefully specifying the layers and the dimensionality of their parameters, or we can describe the architecture by a diagram using some building blocks. The former is the most complete way, but it's very verbose; sometimes the latter is sufficient. For example, you could use something like this

\begin{align}
\mathbf x &= (x_1, x_2)^\top & \\
\mathbf h &= \tanh(\mathrm{affine}_H(\mathbf x; \theta_{\text{hid}})) \\
s &= \mathrm{affine}_1(\mathbf h; \theta_{\text{out}})\\
g(x_1, x_2; \theta) &= \mathrm{sigmoid}(s)\\
\end{align}

where an 'affine' layer is a linear transformation (plus a bias) from an input dimensionality to an output dimensionality, the subscript indicates the dimensionality of the output of the layer, the parameters of the layer are indicated after `;`, and we can infer their shapes by knowing the shape of the input, the dimensionality of the output and the type of layer we have. 

For example, an $\mathrm{affine}_H(\mathbf x; \theta_{\text{hid}})$ layer is just a linear transformation of its inputs ($\mathbf x$ in this case) to an $H$-dimensional output; the transformation uses its own parameters ($\theta_{\text{hid}}$ in this case). Similarly, $\mathrm{affine}_1(\mathbf h; \theta_{\text{out}})$ transforms its input ($\mathbf h$) to a single scalar and uses parameters $\theta_{\text{out}}$ for that. By inspection we can deduce the parameters of the layers: 

* a linear transformation from $2$ dimensions (because $\mathbf x \in \mathbb R^2$) to $H$ dimensions takes a weight matrix of size $H \times 2$ and a bias vector of size $H$; these are the parameters in $\theta_{\text{hid}}$;
* a linear transformation from $H$ dimensions (because $\mathbf h \in \mathbb R^H$) to a single dimension takes a weight vector of size $H$ and a single bias; these are the parameters in $\theta_{\text{out}}$;
* in total, we have to estimate the union of all parameter sets: $\theta = \theta_{\text{hid}} \cup \theta_{\text{out}}$.

This looks a lot like the implementation in torch, as we shall see.

The package `torch.nn` defines a series of useful classes like linear network layers, activation functions, loss functions etc. A full list can be found [here](https://pytorch.org/docs/stable/nn.html). In case you need a certain network layer, check the documentation of the package before writing the layer yourself, as the package likely contains the code for it already. We import it below:

In [None]:
import torch.nn as nn

In addition to `torch.nn`, there is also `torch.nn.functional`. It contains functions that are used in network layers. This is in contrast to `torch.nn`, which defines them as `nn.Modules` (more on it below), and `torch.nn` actually uses a lot of functionalities from `torch.nn.functional`. Hence, the functional package is useful in many situations, and so we import it as well here.

In [None]:
import torch.nn.functional as F
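
The relation between the two packages can be sketched with a linear layer: the module `nn.Linear` owns its weight and bias, while the function `F.linear` expects them as arguments. The shapes below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 2)          # a batch of 5 two-dimensional inputs

layer = nn.Linear(2, 3)        # module: owns its weight and bias
out_module = layer(x)
# functional form: we pass the parameters explicitly
out_functional = F.linear(x, layer.weight, layer.bias)

print(torch.allclose(out_module, out_functional))
```

The two outputs coincide, since the module delegates the computation to the functional form.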

#### nn.Module

In PyTorch, a neural network is built up out of modules. Modules can contain other modules, and a neural network is considered to be a module itself as well. The basic template of a module is as follows:

In [None]:
class MyModule(nn.Module):
    
    def __init__(self):
        super().__init__() # You **always** need to start a Module with a call to super().__init__()
        # Here you can do some init for my module

        
    def forward(self, x):
        # Function for performing the calculation of the module.
        pass

The forward function is where the computation of the module takes place, and it is executed when you call the module (`module = MyModule(); module(x)`). In the init function, we usually create the parameters of the module, using `nn.Parameter`, or define other modules that are used in the forward function. The backward calculation is done automatically, but could be overridden as well if wanted.
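
As a minimal sketch of this template, consider a hypothetical module `Scale` (the name and behaviour are made up for illustration) that multiplies its input by a single learnable scalar created with `nn.Parameter`.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical module: multiplies its input by one learnable scalar."""

    def __init__(self, init_value=1.0):
        super().__init__()
        # nn.Parameter registers the tensor as a trainable parameter
        self.weight = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return self.weight * x

scale = Scale(init_value=2.0)
y = scale(torch.tensor([1.0, 2.0, 3.0]))   # calling the module runs forward()
y.sum().backward()                          # the backward pass is derived automatically
print(y, scale.weight.grad)
```

Here the gradient of `y.sum()` with respect to the scalar is the sum of the inputs, and we never had to write a backward function.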

#### Simple classifier
We can now make use of the pre-defined modules in the `torch.nn` package, and define our own small neural network. We will use a minimal network with an input layer, one hidden layer with tanh as activation function, and an output layer. In other words, our network should look something like this:

<center width="100%"><img src="https://raw.githubusercontent.com/probabll/ntmi-tutorials/main/img/small_neural_network.svg" width="300px"></center>

The input neurons are shown in blue, which represent the coordinates $x_1$ and $x_2$ of a data point. The hidden neurons including a tanh activation are shown in white, and the output neuron in red. Note that we are choosing to output $s$ (the logit) rather than the Bernoulli parameter ($\mathrm{sigmoid}(s)$), that is, we are stopping right before the sigmoid. 

We don't have to do it this way, but it will turn out useful as it's more numerically stable to manipulate logits than probabilities. We will still use the *sigmoid* function as it is necessary for the Bernoulli pmf, but we will use it at a later point, when computing the log-likelihood for the optimisation loss.
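
The stability claim can be illustrated with an extreme logit (the value $-200$ is an arbitrary choice for this sketch). Going through the probability first loses all information in single precision, whereas working from the logit does not.

```python
import torch
import torch.nn.functional as F

s = torch.tensor(-200.0)                 # an extreme logit, chosen for illustration

naive = torch.log(torch.sigmoid(s))      # sigmoid underflows to 0 in float32, so the log is -inf
stable = F.logsigmoid(s)                 # computed directly from the logit

print(naive, stable)
```

An infinite log probability would turn the loss and its gradients into unusable values, which is why we keep logits around for as long as possible.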

So here's our model, as we will implement it

\begin{align}
Y|X_1=x_1, X_2=x_2 &\sim \mathrm{Bernoulli}(g(x_1, x_2; \theta)) \\
\mathbf x &= (x_1, x_2)^\top & \\
\mathbf h &= \tanh(\mathrm{affine}_H(\mathbf x; \theta_{\text{hid}})) \\
s &= \mathrm{affine}_1(\mathbf h; \theta_{\text{out}})  \\
g(x_1, x_2;\theta) &= \mathrm{sigmoid}(s)
\end{align}




**Quiz** What is the log probability this model assigns to $(X_1=x_1, X_2=x_2, Y=1)$? What is the log probability it assigns to $(X_1=x_1, X_2=x_2, Y=0)$? What is the log probability this model assigns to $(X_1=x_1, X_2=x_2, Y=y)$? Express your answers as a function of $g(x_1, x_2;\theta)$. Then express them as a function of $s$.

<details>
    <summary><b>SOLUTION</b></summary>

\begin{align}
    \log P_{Y|X_1X_2}(1|x_1, x_2) &= \log g(x_1, x_2;\theta) \\
    &= \log \mathrm{sigmoid}(s) \\
    \log P_{Y|X_1X_2}(0|x_1, x_2) &= \log (1-g(x_1, x_2;\theta))\\
    &= \log (1 - \mathrm{sigmoid}(s)) \\
    &= \log \mathrm{sigmoid}(-s) \\
    \log P_{Y|X_1X_2}(y|x_1, x_2) &= y\log \mathrm{sigmoid}(s) + (1-y) \log \mathrm{sigmoid}(-s) \\
\end{align}
</details>

---


In PyTorch, we can define the NN in the model above as follows:

In [None]:
class LogitPredictor(nn.Module):
    
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        # Initialize the modules we need to build the network
        # this predicts our h
        self.linear_h = nn.Linear(num_inputs, num_hidden)
        self.act_fn = nn.Tanh()
        # this predicts our s
        self.linear_s = nn.Linear(num_hidden, num_outputs)        
        
    def forward(self, x):
        """
        Map a batch of inputs (x) to a batch of logits (s).
         
        We predict the logit (rather than probability=sigmoid(logit)) for numerical stability.
        To convert from logit to 
            * Pr(Y=1|X=x) you can use torch.sigmoid(logit)
            * Pr(Y=0|X=x) you can use torch.sigmoid(-logit)
            * log Pr(Y=1|X=x) you can use F.logsigmoid(logit)
            * log Pr(Y=0|X=x) you can use F.logsigmoid(-logit)
        These operations are more stable than manipulating probability values directly.
        """
        # It's useful to document the expected shape of your tensors
        
        # [batch_size, hidden size]
        h = self.act_fn(self.linear_h(x))
        # [batch size, 1]
        s = self.linear_s(h)
        return s

For the examples in this notebook, we will use a tiny neural network with two input neurons and four hidden neurons. As we perform binary classification, we will use a single output neuron. Note that we do not apply a sigmoid on the output yet. This is because other functions, especially the loss, are more efficient and precise to calculate on the original outputs instead of the sigmoid output. We will discuss the detailed reason later.

In [None]:
model = LogitPredictor(num_inputs=2, num_hidden=4, num_outputs=1)
# Printing a module shows all its submodules
print(model)

Printing the model lists all submodules it contains. The parameters of a module can be obtained by using its `parameters()` function, or `named_parameters()` to also get the name of each parameter object. For our small neural network, we have the following parameters:

In [None]:
for name, param in model.named_parameters():
    print("Parameter %s, shape %s" % (name, str(param.shape)))

Each linear layer has a weight matrix of the shape `[output, input]`, and a bias of the shape `[output]`. The tanh activation function does not have any parameters. Note that parameters are only registered for `nn.Module` objects that are direct object attributes, i.e. `self.a = ...`. If you define a list of modules, the parameters of those are not registered for the outer module and can cause some issues when you try to optimize your module. There are alternatives, like `nn.ModuleList`, `nn.ModuleDict` and `nn.Sequential`, that allow you to have different data structures of modules. We will use them in a few later tutorials and explain them there. 
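
The registration issue can be seen in a minimal sketch. The two classes below are hypothetical and exist only to contrast a plain Python list with `nn.ModuleList`; the helper `count_parameters` is likewise made up for this example.

```python
import torch.nn as nn

class WithList(nn.Module):
    def __init__(self):
        super().__init__()
        # plain Python list: the layers inside are NOT registered
        self.layers = [nn.Linear(2, 4), nn.Linear(4, 1)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: the layers inside are registered as submodules
        self.layers = nn.ModuleList([nn.Linear(2, 4), nn.Linear(4, 1)])

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

num_with_list = count_parameters(WithList())              # no parameters found
num_with_module_list = count_parameters(WithModuleList()) # 4*2 + 4 + 1*4 + 1 parameters
print(num_with_list, num_with_module_list)
```

An optimizer built from `WithList().parameters()` would have nothing to update, which is the kind of issue the paragraph above warns about.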

### The data

PyTorch also provides a few functionalities to load the training and test data efficiently, summarized in the package `torch.utils.data`.

In [None]:
import torch.utils.data as data

The data package defines two classes which are the standard interface for handling data in PyTorch: `data.Dataset`, and `data.DataLoader`. The dataset class provides a uniform interface to access the training/test data, while the data loader makes sure to efficiently load and stack the data points from the dataset into batches during training.

#### The dataset class

The dataset class summarizes the basic functionality of a dataset in a natural way. To define a dataset in PyTorch, we simply specify two functions: `__getitem__`, and `__len__`. The get-item function has to return the $i$-th data point in the dataset, while the len function returns the size of the dataset. For the XOR dataset, we can define the dataset class as follows:

In [None]:
def generate_continuous_xor(size, std):
    # Each data point in the XOR dataset has two variables, x and y, that can be either 0 or 1
    # The label is their XOR combination, i.e. 1 if only x or only y is 1 while the other is 0.
    # If x=y, the label is 0.
    x = torch.randint(low=0, high=2, size=(size, 2), dtype=torch.float32)
    y = (x.sum(dim=1) == 1).to(torch.long)
    # To make it slightly more challenging, we add a bit of gaussian noise to the data points.
    x += std * torch.randn(x.shape)

    return x, y

class XORDataset(data.Dataset):
    
    def __init__(self, size, std=0.1):
        """
        Inputs:
            size - Number of data points we want to generate
            std - Standard deviation of the noise (see generate_continuous_xor function)
        """
        super().__init__()
        self.size = size
        self.std = std
        self.x, self.y = generate_continuous_xor(size, std)
        
    def __len__(self):
        # Number of data points we have. Alternatively self.x.shape[0], or self.y.shape[0]
        return self.size
    
    def __getitem__(self, idx):
        # Return the idx-th data point of the dataset
        # If we have multiple things to return (data point and label), we can return them as tuple
        x = self.x[idx]
        y = self.y[idx]
        return x, y

Let's try to create such a dataset and inspect it:

In [None]:
dataset = XORDataset(size=200)
print("Size of dataset:", len(dataset))
print("Data point 0: x={} y={}".format(dataset[0][0], dataset[0][1]))
print("Data point 1: x={} y={}".format(dataset[1][0], dataset[1][1]))
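
When the data already lives in tensors, as it does here, `data.TensorDataset` offers the same `__len__`/`__getitem__` interface without writing a class. The sketch below uses four noise-free XOR points written out by hand; it is an alternative for illustration, not what the rest of this notebook uses.

```python
import torch
import torch.utils.data as data

# Four noise-free XOR points, written out by hand for illustration
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([0, 1, 1, 0])

tensor_dataset = data.TensorDataset(x, y)
print(len(tensor_dataset))      # number of data points
print(tensor_dataset[1])        # tuple (x[1], y[1])
```

Writing your own class, as done above with `XORDataset`, remains the more flexible option, for example when data points must be generated or loaded from disk on demand.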

To better relate to the dataset, we visualize the samples below. 

In [None]:
def visualize_samples(data, label):
    if isinstance(data, torch.Tensor):
        data = data.cpu().numpy()
    if isinstance(label, torch.Tensor):
        label = label.cpu().numpy()
    data_0 = data[label == 0]
    data_1 = data[label == 1]
    
    plt.figure(figsize=(4,4))
    plt.scatter(data_0[:,0], data_0[:,1], edgecolor="#333", label="Class 0")
    plt.scatter(data_1[:,0], data_1[:,1], edgecolor="#333", label="Class 1")
    plt.title("Dataset samples")
    plt.ylabel(r"$x_2$")
    plt.xlabel(r"$x_1$")
    plt.legend()

In [None]:
visualize_samples(dataset.x, dataset.y)
plt.show()

#### The data loader class

The class `torch.utils.data.DataLoader` represents a Python iterable over a dataset with support for automatic batching, multi-process data loading and many more features. The data loader communicates with the dataset using the function `__getitem__`, and stacks its outputs as tensors over the first dimension to form a batch.
In contrast to the dataset class, we usually don't have to define our own data loader class, but can just create an object of it with the dataset as input. Additionally, we can configure our data loader with the following input arguments (only a selection, see full list [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)):

* `batch_size`: Number of samples to stack per batch
* `shuffle`: If True, the data is returned in a random order. This is important during training for introducing stochasticity. 
* `num_workers`: Number of subprocesses to use for data loading. The default, 0, means that the data will be loaded in the main process which can slow down training for datasets where loading a data point takes a considerable amount of time (e.g. large images). More workers are recommended for those, but can cause issues on Windows computers. For tiny datasets as ours, 0 workers are usually faster.
* `pin_memory`: If True, the data loader will copy Tensors into pinned (page-locked) host memory before returning them, which speeds up their transfer to the GPU. This can save some time for large data points on GPUs. Usually a good practice to use for a training set, but not necessarily for validation and test, as pinned memory is a limited resource on the host.
* `drop_last`: If True, the last batch is dropped in case it is smaller than the specified batch size. This occurs when the dataset size is not a multiple of the batch size. Only potentially helpful during training to keep a consistent batch size.
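
The effect of `batch_size` and `drop_last` can be sketched on a toy dataset of 10 points (the dataset and the variable names below are made up for illustration).

```python
import torch
import torch.utils.data as data

toy_dataset = data.TensorDataset(torch.arange(10).float())   # 10 data points

keep_last = data.DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=False)
drop_last = data.DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=True)

sizes_keep = [batch[0].shape[0] for batch in keep_last]
sizes_drop = [batch[0].shape[0] for batch in drop_last]
print(sizes_keep)   # the final batch is smaller
print(sizes_drop)   # the incomplete batch is discarded
```

Since 10 is not a multiple of 4, the first loader yields a final batch of 2 points, while the second loader skips it.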

Let's create a simple data loader below:

In [None]:
data_loader = data.DataLoader(dataset, batch_size=8, shuffle=True)

In [None]:
# next(iter(...)) fetches the first batch of the data loader
# If shuffle is True, this will return a different batch every time we run this cell
# For iterating over the whole dataset, we can simply use "for batch in data_loader: ..."
data_inputs, data_labels = next(iter(data_loader))

# The shape of the outputs is [batch_size, d_1,...,d_N] where d_1,...,d_N are the 
# dimensions of the data point returned from the dataset class
print("Data inputs", data_inputs.shape, "\n", data_inputs)
print("Data labels", data_labels.shape, "\n", data_labels)

### Optimisation

After defining the model and the dataset, it is time to prepare the optimization of the model. During training, we will perform the following steps:

1. Get a batch from the data loader
2. Use the NN to predict logits for the instances in the batch
3. Calculate the loss based on the choice of statistical model and observations (target labels)
4. Backpropagation: calculate the gradients of the loss with respect to every parameter
5. Update the parameters of the model by taking a step against the gradients (the direction in which the loss decreases)

We have seen how we can do steps 1, 2 and 4 in PyTorch. Now, we will look at steps 3 and 5.

#### Loss

We are going to be performing maximum likelihood estimation of the parameters of the model, thus our *loss* (negative of utility) is the negative of the log-likelihood of the model given a dataset of observations, i.e.,  $- \mathcal L_{\mathcal D}(\theta)$.

The optimisation problem is then 

\begin{align}
\theta^{\text{MLE}} &= \arg\min_\theta ~ -\mathcal L_{\mathcal D}(\theta) \\
&= \arg\min_\theta ~ - \sum_{(x_1, x_2),y \sim \mathcal D} \log f(y|g(x_1, x_2; \theta))
\end{align}
where $f(y|\phi)$ is the probability mass (or probability density for continuous outcomes) of the response $y$ under our choice of statistical model, with the pmf/pdf parameter $\phi = g(x_1, x_2;\theta)$ predicted by the NN architecture. 

A local optimum of this problem can be obtained via SGD, which takes steps against the gradient of the loss $-\mathcal L_{\mathcal D}(\theta)$:

\begin{align}
\theta^{(t)} = \theta^{(t-1)} - \eta_t \nabla_{\theta^{(t-1)}} \left(-\mathcal L_{\mathcal D}(\theta^{(t-1)})\right)
\end{align}

with the exact gradient $\nabla_{\theta^{(t-1)}} \left(-\mathcal L_{\mathcal D}(\theta^{(t-1)})\right)$ replaced by an unbiased gradient estimate (using random mini batches).
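
The update rule can be sketched by hand on a toy problem. The loss $(\theta - 3)^2$ and the constant learning rate below are assumptions made for illustration only; they stand in for the negative log-likelihood and for $\eta_t$.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for step in range(100):
    loss = (theta - 3.0) ** 2          # toy loss with its minimum at theta = 3
    loss.backward()                    # fills theta.grad with d loss / d theta
    with torch.no_grad():              # the update itself is not part of the graph
        theta -= learning_rate * theta.grad
    theta.grad.zero_()                 # reset the gradient before the next step

print(theta.item())
```

Each step moves `theta` against the gradient of the loss, so it approaches the minimiser of the toy loss.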

#### Gradient estimate

Since our model is Bernoulli, the loss for a single observation $(x_1, x_2, y)$ is

\begin{align}
   - \log f(y|g(x_1, x_2;\theta)) &= - y\log  g(x_1, x_2;\theta) - (1-y)\log (1-g(x_1, x_2;\theta))\\
   &=- y\log \mathrm{sigmoid}(s) - (1-y)\log \mathrm{sigmoid}(-s) \\
\end{align}

For a batch of $N$ observations $\{(\mathbf x_n, y_n)\}_{n=1}^N$, the loss is thus the sample mean 
\begin{align}
   - \frac{1}{N} \sum_{n=1}^N \left[ y_n\log \mathrm{sigmoid}(s_n) + (1-y_n)\log \mathrm{sigmoid}(-s_n) \right]
\end{align}
where $s_n$ is the logit predicted for the $n$th data point. 

Remember that the gradient of the loss assessed at a stochastic batch is an unbiased estimate of the gradient given an entire dataset.
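
A small numerical sketch of this statement, under assumptions made for illustration: a toy linear logit $s = \mathbf w^\top \mathbf x$, random data, and the mean loss over the data. Averaging the gradients of all equally sized, disjoint mini-batches recovers the gradient of the mean loss on the full data, which is the sense in which a randomly chosen mini-batch is right on average.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(12, 2)                       # toy inputs
y = torch.randint(0, 2, (12,)).float()       # toy binary labels
w = torch.randn(2, requires_grad=True)       # parameters of a toy linear logit s = w^T x

def mean_loss_gradient(x_batch, y_batch):
    s = x_batch @ w
    loss = -(y_batch * F.logsigmoid(s) + (1 - y_batch) * F.logsigmoid(-s)).mean()
    (grad,) = torch.autograd.grad(loss, w)
    return grad

full_gradient = mean_loss_gradient(x, y)
# Average the gradients of three disjoint mini-batches of equal size
batch_gradients = [mean_loss_gradient(x[i:i + 4], y[i:i + 4]) for i in range(0, 12, 4)]
average_gradient = torch.stack(batch_gradients).mean(dim=0)

print(full_gradient, average_gradient)
```

Any single mini-batch gradient differs from the full one, but the differences cancel out on average.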

#### Implementation

We can calculate the loss for a batch by simply performing a few tensor operations as those are automatically added to the computation graph. 


**Quiz** Implement a computation graph for the loss function assessed given a batch of observed labels. 

Tips: you might have already heard of some cross entropy loss module before; if you did, ignore it for now, as this will turn out to be a rather didactic exercise. You may need to look into `torch.where` to implement this loss.



In [None]:
def loss_fn(logits, observations):
    r"""
    :param logits: tensor of dtype float with shape [batch_size]
        sigmoid(logits[n]) specifies the probability that the n-th observation should be labelled Y=1
    :param observations: tensor of dtype long with shape [batch_size]     
    :return: the negative of the log-likelihood of the model given observations
        -1/N \sum_{n=1}^{N}  log P(Y=y[n]|X=x[n])
        = -1/N \sum_{n=1}^{N}  log Bernoulli(Y=y[n]|sigmoid(logits[n]))
        = -1/N \sum_{n=1}^{N}  ([y[n]=1] log sigmoid(logits[n]) + [y[n]=0] log sigmoid(-logits[n]))
        
        where N is batch_size.
    """
    pass

In [None]:
# Reference implementation (explained in the solution below), defined here so that the remaining cells run
def loss_fn(logits, observations):
    # [batch_size]
    log_probs = torch.distributions.Bernoulli(logits=logits).log_prob(observations.float())
    # []
    return - log_probs.mean()

<details>
    <summary><b>SOLUTION</b></summary>


There are at least 3 ways to approach this, and we will discuss all three here.

**Manual implementation**

Here is how we implement the solution to the previous quiz in torch; the key is the use of `torch.where` (which implements "select one-thing if condition else another-thing").
    
```python
def loss_fn(logits, observations):
    # [batch_size] 
    # If observation is 1 return log sigmoid(logit), else return log (1 - sigmoid(logit))
    negative_log_prob = -torch.where(observations == 1, F.logsigmoid(logits),  F.logsigmoid(-logits))
    # []
    return negative_log_prob.mean()    
```    
    
It's always good to reuse stable and reliable code, so next we discuss two other solutions that will use more of what's already implemented in torch.
    

**Statistical style**

PyTorch has all elementary distributions in it, check `torch.distributions`, so we can use our predicted logit to parameterise a distribution of the appropriate family (Bernoulli in this case) and use the resulting distribution to assess the probability of the observed label. 

```python
def loss_fn(logits, observations):
    # [batch_size]
    log_probs = torch.distributions.Bernoulli(logits=logits).log_prob(observations.float())
    # []
    return - log_probs.mean()
```       

If you look carefully, this looks just like the theory: $\frac{-1}{N}\sum_{n=1}^N \log \mathrm{Bernoulli}(y_n|\mathrm{sigmoid}(\mathrm{NN}(\mathbf x_n; \theta)))$, the only difference is that we do not apply the sigmoid ourselves (as we used the argument `logits` of `torch.distributions.Bernoulli`, that class will make sure to apply the sigmoid when needed). 


**Cross-entropy style**

Classic DL literature proposed NNs as function approximators. Thus we would optimise $\theta$ so that $f(x; \theta)$ would approximate a real world response $y$.  In this view, NNs minimise a notion of error (or loss): $\ell(f(x; \theta), y)$. In a regression task, this could be $||y - f(x;\theta)||_2$.

Where the response variable is discrete this view leads to difficulties, as we cannot easily define a differentiable loss function in terms of a norm, for example. For binary and categorical data, traditional DL literature interprets the observation $y$ as a *discrete distribution* (rather than a *discrete outcome* sampled from a distribution). For example, a ternary outcome $3$ would be interpreted as $\mathrm{Categorical}(0, 0, 1)$. Under such a view, we can take the *cross entropy* of $y$ relative to $z=f(x; \theta)$, i.e., $\mathbb H(y, z) = - \sum_{k=1}^3 y_k \log z_k$ as a measure of misfit or loss.

Conceptually, this view requires mapping data from their natural form (categories) to some distribution (which can be non-trivial depending on the data type). It also requires memorising specific versions of *cross entropies* for different types of data (e.g., binary, categorical, ordinal).

For completeness, we discuss it here. The DL loss that corresponds to MLE for a Bernoulli model is the *Binary cross entropy* (BCE) loss. 

PyTorch provides a list of predefined loss functions which we can use (see [here](https://pytorch.org/docs/stable/nn.html#loss-functions) for a full list). For instance, for BCE, PyTorch has two modules: `nn.BCELoss()`, `nn.BCEWithLogitsLoss()`. While `nn.BCELoss` expects its inputs to be in the range $[0,1]$, i.e. the output of a sigmoid, `nn.BCEWithLogitsLoss` combines a sigmoid layer and the BCE loss in a single class. This version is numerically more stable than using a plain Sigmoid followed by a BCE loss because of the logarithms applied in the loss function. Hence, it is advised to use loss functions applied on "logits" where possible (remember not to apply a sigmoid on the output of the model in this case!). For our model defined above, we would use the module `nn.BCEWithLogitsLoss`. 

```python
loss_fn = nn.BCEWithLogitsLoss()
```

We advise against becoming dependent on cross entropy losses without understanding what they are, what they imply, and how they come about. It's okay to use the existing implementation, as it's stable, as long as you understand why you are doing it. But if you are going to rely on existing code anyway (which is a good thing to do), note that the *Statistical style* above is conceptually simpler than the cross entropy style: it corresponds precisely to the statistical objective being optimised (log likelihood of model given observations) and it only requires knowing what distribution we are modelling with (something you would know for sure). 
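
As a sanity check, the three styles can be compared on random inputs. The logits and labels below are random placeholders generated for illustration, not model outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6)                    # random logits, for illustration only
observations = torch.randint(0, 2, (6,))   # random binary labels

manual = -torch.where(observations == 1, F.logsigmoid(logits), F.logsigmoid(-logits)).mean()
statistical = -torch.distributions.Bernoulli(logits=logits).log_prob(observations.float()).mean()
cross_entropy = nn.BCEWithLogitsLoss()(logits, observations.float())

print(manual.item(), statistical.item(), cross_entropy.item())
```

The three values agree up to floating point error, since they are three ways of writing the same negative log-likelihood.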
    
</details>

---    

### Stochastic Gradient Descent

For updating the parameters, PyTorch provides the package `torch.optim`, which implements the most popular optimizers. We will discuss some specific optimizers and their differences later in the course, but will for now use the simplest of them: `torch.optim.SGD`. Stochastic Gradient Descent updates parameters by multiplying the gradients by a small constant, called the learning rate, and subtracting the result from the parameters (hence minimizing the loss). Therefore, we slowly move in the direction that minimizes the loss. A good default value of the learning rate for a small network such as ours is 0.1. 

In [None]:
# Input to the optimizer are the parameters of the model: model.parameters()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

The optimizer provides two useful functions: `optimizer.step()`, and `optimizer.zero_grad()`. The step function updates the parameters based on the gradients as explained above. The function `optimizer.zero_grad()` resets the gradients of all parameters. While this function seems less relevant at first, it is a crucial pre-step before performing backpropagation. If we called the `backward` function on the loss while the parameter gradients are non-zero from the previous batch, the new gradients would actually be added to the previous ones instead of overwriting them. This is done because a parameter might occur multiple times in a computation graph, and we need to sum the gradients in this case instead of replacing them. Hence, remember to call `optimizer.zero_grad()` before calculating the gradients of a batch.
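
The accumulation behaviour can be seen in a minimal sketch with a single scalar parameter (the parameter `w` and the function $3w$ are made up for illustration).

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3.0 * w).backward()
first = w.grad.clone()        # gradient of 3w with respect to w

(3.0 * w).backward()          # no zero_grad() in between: gradients are added
accumulated = w.grad.clone()

optimizer.zero_grad()         # clears the stored gradient
cleared = w.grad              # None or a zero tensor, depending on the PyTorch version

print(first, accumulated, cleared)
```

After the second call to `backward`, the stored gradient is twice the true one, which is exactly what a forgotten `optimizer.zero_grad()` would feed into `optimizer.step()`.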

### Training

Finally, we are ready to train our model. As a first step, we create a slightly larger dataset and specify a data loader with a larger batch size. 

In [None]:
train_dataset = XORDataset(size=1000)
train_data_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)

Now, we can write a small training function. Remember our five steps: load a batch, obtain the predictions, calculate the loss, backpropagate, and update. Additionally, we have to push all data and model parameters to the device of our choice (GPU if available). For the tiny neural network we have, communicating the data to the GPU actually takes much more time than we could save from running the operation on GPU. For large networks, the communication time is significantly smaller than the actual runtime making a GPU crucial in these cases. Still, to practice, we will push the data to GPU here. 

In [None]:
# Push model to device. Only has to be done once
model.to(device)

In addition, we set our model to training mode. This is done by calling `model.train()`. There exist certain modules that need to perform a different forward step during training than during testing (e.g. BatchNorm and Dropout), and we can switch between them using `model.train()` and `model.eval()`.
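
Our `LogitPredictor` contains no such module, so the switch has no visible effect on it. To see the difference, here is a sketch with a standalone `nn.Dropout` layer (the dropout rate and input are arbitrary choices for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()                 # training mode: entries are randomly zeroed, the rest rescaled by 1/(1-p)
out_train = dropout(x)

dropout.eval()                  # evaluation mode: dropout is the identity
out_eval = dropout(x)

zero_fraction = (out_train == 0).float().mean().item()
print(zero_fraction)                 # roughly half of the entries are zero
print(torch.equal(out_eval, x))
```

It is a good habit to call `model.train()` and `model.eval()` even when the model has no mode-dependent layers, so that the code keeps working if such layers are added later.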

In [None]:
def train_model(model, optimizer, data_loader, loss_fn, num_epochs=100):
    # Set model to training mode
    model.train() 
    
    # Training loop
    with tqdm(range(num_epochs)) as progressbar:
        for epoch in progressbar:
            for data_inputs, data_labels in data_loader:
                
                ## Step 1: Move input data to device (only strictly necessary if we use GPU)
                data_inputs, data_labels = data_inputs.to(device), data_labels.to(device)
                
                ## Step 2: Run the model on the input data
                logits = model(data_inputs)
                logits = logits.squeeze(dim=1) # Output is [Batch size, 1], but we want [Batch size]
                
                ## Step 3: Calculate the loss using predicted logits and available observations
                loss = loss_fn(logits, data_labels.float())            
                
                # Nice way of visualising the loss value during training
                progressbar.set_postfix({'loss': loss.item()})
                
                ## Step 4: Perform backpropagation
                # Before calculating the gradients, we need to ensure that they are all zero. 
                # The gradients would not be overwritten, but actually added to the existing ones.
                optimizer.zero_grad() 
                # Perform backpropagation
                loss.backward()
                
                ## Step 5: Update the parameters
                optimizer.step()

In [None]:
train_model(model, optimizer, train_data_loader, loss_fn)

#### Saving a model

After we finish training a model, we save the model to disk so that we can load the same weights at a later time. For this, we extract the so-called `state_dict` from the model, which contains all learnable parameters. For our simple model, the state dict contains the following entries:

In [None]:
state_dict = model.state_dict()
print(state_dict)

To save the state dictionary, we can use `torch.save`:

In [None]:
# torch.save(object, filename). For the filename, any extension can be used
torch.save(state_dict, "our_model.tar")

To load a model from a state dict, we use the function `torch.load` to load the state dict from the disk, and the module function `load_state_dict` to overwrite our parameters with the new values:

In [None]:
# Load state dict from the disk (make sure it is the same name as above)
state_dict = torch.load("our_model.tar")

# Create a new model and load the state
new_model = LogitPredictor(num_inputs=2, num_hidden=4, num_outputs=1)
new_model.load_state_dict(state_dict)

# Verify that the parameters are the same
print("Original model\n", model.state_dict())
print("\nLoaded model\n", new_model.state_dict())

A detailed tutorial on saving and loading models in PyTorch can be found [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
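
Rather than comparing printed state dicts by eye, the round trip can be checked programmatically. The sketch below uses a single `nn.Linear` layer as a stand-in for a trained model and an in-memory buffer instead of a file, both of which are choices made only for this illustration.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
original = nn.Linear(2, 1)             # stand-in for a trained model

buffer = io.BytesIO()                  # in-memory file, so nothing is written to disk
torch.save(original.state_dict(), buffer)
buffer.seek(0)

restored = nn.Linear(2, 1)             # freshly initialised, different weights
restored.load_state_dict(torch.load(buffer))

same = all(torch.equal(p, q) for p, q in zip(original.state_dict().values(),
                                             restored.state_dict().values()))
print(same)
```

The same check works with a filename in place of the buffer.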

### Evaluation

Once we have trained a model, it is time to evaluate it on a held-out test set. As our dataset consists of randomly generated data points, we need to first create a test set with a corresponding data loader.

In [None]:
test_dataset = XORDataset(size=500)
# drop_last -> Don't drop the last batch although it is smaller than 128
test_data_loader = data.DataLoader(test_dataset, batch_size=128, shuffle=False, drop_last=False) 

As a metric, we will use accuracy, which is calculated as follows:

$$acc = \frac{\#\text{correct predictions}}{\#\text{all predictions}} = \frac{TP+TN}{TP+TN+FP+FN}$$

where TP are the true positives, TN the true negatives, FP the false positives, and FN the false negatives. 
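
The four counts can be obtained with elementwise comparisons. The predictions and labels below are made up for illustration.

```python
import torch

# Toy predictions and labels, made up for illustration
pred_labels = torch.tensor([1, 0, 1, 1, 0, 0])
true_labels = torch.tensor([1, 0, 0, 1, 1, 0])

tp = ((pred_labels == 1) & (true_labels == 1)).sum().item()
tn = ((pred_labels == 0) & (true_labels == 0)).sum().item()
fp = ((pred_labels == 1) & (true_labels == 0)).sum().item()
fn = ((pred_labels == 0) & (true_labels == 1)).sum().item()

acc = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, acc)
```

For accuracy alone we do not need the four counts separately: counting how often `pred_labels == true_labels` gives the numerator directly.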

When evaluating the model, we don't need to keep track of the computation graph as we don't intend to calculate the gradients. This reduces the required memory and speeds up the model. In PyTorch, we can deactivate the computation graph using `with torch.no_grad(): ...`. Remember to additionally set the model to eval mode.
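
A minimal sketch of what `torch.no_grad()` changes, using a single scalar made up for illustration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

tracked = w * 3.0                 # recorded in the computation graph
with torch.no_grad():
    untracked = w * 3.0           # same value, but no graph is built

print(tracked.requires_grad, untracked.requires_grad)
```

Both results hold the same value, but only the first one remembers how it was computed and could be backpropagated through.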

**Quiz** Given an input $(x_1, x_2)$ and a trained model, how can we predict a label?

<details>
    <summary><b>SOLUTION</b></summary>

For example, by predicting the most probable class:
    
\begin{align}
    y^\star &= \arg\max_{y \in \{0, 1\}} ~ P_{Y|X_1X_2}(y|x_1, x_2, \theta) \\
    &=\begin{cases}
       1 & \text{if } \mathrm{sigmoid}(s) \ge 0.5\\
       0 & \text{otherwise}
    \end{cases}
\end{align}
where $s=\mathrm{NN}(x_1, x_2; \theta)$.
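
Because the sigmoid is increasing and maps $0$ to $0.5$, thresholding the probability at $0.5$ is the same as thresholding the logit at $0$. A small check, with example logits picked by hand:

```python
import torch

s = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])    # a few example logits
via_probability = torch.sigmoid(s) >= 0.5
via_logit = s >= 0.0

print(torch.equal(via_probability, via_logit))
```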
    
</details>

---    

In [None]:
def eval_model(model, data_loader):
    model.eval() # Set model to eval mode
    true_preds, num_preds = 0., 0.
    
    with torch.no_grad(): # Deactivate gradients for the following code
        for data_inputs, data_labels in data_loader:
            
            # Determine prediction of model on dev set
            data_inputs, data_labels = data_inputs.to(device), data_labels.to(device)
            # [batch_size, 1]
            logits = model(data_inputs)
            # [batch_size]
            logits = logits.squeeze(dim=1)
            # Now it's okay to use sigmoid 
            # as we won't be computing a loss for parameter estimation this will be stable enough
            # [batch_size]
            probs = torch.sigmoid(logits) 
            # We want to solve the decision problem 
            #  argmax_{y in {0,1}} P(Y=y|X=x)
            # or, in other words, predict 1 if the probability is at least 0.5, otherwise predict 0
            pred_labels = (probs >= 0.5).long() 
            
            # Keep records of predictions for the accuracy metric (true_preds=TP+TN, num_preds=TP+TN+FP+FN)
            true_preds += (pred_labels == data_labels).sum()
            num_preds += data_labels.shape[0]
            
    acc = true_preds / num_preds
    print("Accuracy of the model: %4.2f%%" % (100.0*acc))

In [None]:
eval_model(model, test_data_loader)

If we trained our model correctly, we should see a score close to 100% accuracy. However, this is only possible because of our simple task, and unfortunately, we usually don't get such high scores on test sets of more complex tasks.

#### Visualizing classification boundaries

To visualize what our model has learned, we can perform a prediction for every data point in a range of $[-0.5, 1.5]$, and visualize the predicted class as in the sample figure at the beginning of this section. This shows where the model has created decision boundaries, and which points would be classified as $0$, and which as $1$. We therefore get a background image made of blue (class 0) and orange (class 1). In the spots where the model is uncertain, we will see a blurry overlap. The specific code is less relevant than the output figure, which should hopefully show us a clear separation of classes:

In [None]:
@torch.no_grad() # Decorator, same effect as "with torch.no_grad(): ..." over the whole function.
def visualize_classification(model, data, label):
    if isinstance(data, torch.Tensor):
        data = data.cpu().numpy()
    if isinstance(label, torch.Tensor):
        label = label.cpu().numpy()
    data_0 = data[label == 0]
    data_1 = data[label == 1]
    
    plt.figure(figsize=(4,4))
    plt.scatter(data_0[:,0], data_0[:,1], edgecolor="#333", label="Class 0")
    plt.scatter(data_1[:,0], data_1[:,1], edgecolor="#333", label="Class 1")
    plt.title("Dataset samples")
    plt.ylabel(r"$x_2$")
    plt.xlabel(r"$x_1$")
    plt.legend()
    
    # Let's make use of a lot of operations we have learned above
    model.to(device)
    c0 = torch.Tensor(to_rgba("C0")).to(device)
    c1 = torch.Tensor(to_rgba("C1")).to(device)
    x1 = torch.arange(-0.5, 1.5, step=0.01, device=device)
    x2 = torch.arange(-0.5, 1.5, step=0.01, device=device)
    # indexing="xy": rows of the grid follow x2 and columns follow x1, matching the image layout
    xx1, xx2 = torch.meshgrid(x1, x2, indexing="xy")
    model_inputs = torch.stack([xx1, xx2], dim=-1)
    preds = model(model_inputs)
    preds = torch.sigmoid(preds)
    # preds is P(Y=1|x): blend towards c1 (class 1) where it is high and c0 (class 0) where it is low
    output_image = (1 - preds) * c0[None,None] + preds * c1[None,None] # Specifying "None" in a dimension creates a new one
    output_image = output_image.cpu().numpy() # Convert to numpy array. This only works for tensors on CPU, hence first push to CPU
    plt.imshow(output_image, origin='lower', extent=(-0.5, 1.5, -0.5, 1.5))
    plt.grid(False)

visualize_classification(model, dataset.x, dataset.y)
plt.show()

The decision boundaries might not look exactly as in the figure in the preamble of this section, which can be caused by running it on CPU or a different GPU architecture. Nevertheless, the result on the accuracy metric should be approximately the same. 