# PyTorch tutorial

This tutorial walks you through the basics of PyTorch - the deep learning library that we will use.

[PyTorch cheatsheet](https://pytorch.org/tutorials/beginner/ptcheat.html)

<img src='https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.oiQt_8Md2ucueqYIJ30b9gHaHa%26pid%3DApi&f=1' width=200>

## Scalars, vectors, matrices and tensors

### Scalars

$1.54$ - this is a __scalar__ (a single number). If it has a fractional part, it's called a _float_; if it doesn't, it's called an _integer_.

### Vectors

$[1.54, -1.38, 0.12, 5.56]$ - this is a __vector__ (a row of numbers). Each entry in a vector is a scalar. A vector can also be called an _array_ or a _list_. Vectors are always one-dimensional.

### Matrices

$[[1.54, -1.38, 0.12, 5.56],$

$[2.71, -2.23, 0.15, 4.56],$

$[1.55, -2.87, 0.18, 3.56]]$

This is a __matrix__ - a 2-dimensional grid of numbers. Each row (and each column) in a matrix is a vector and each entry is a scalar.

### Tensors

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSrj-eaRaN-2Qxsl-LSPzj7wgzvr4635kaR4XNnvXpvaekj06bkHDpfW5wl4fpJZ5fjMIA&usqp=CAU" width=300>

This is a 3d tensor: a stack of 2d matrices. Tensors can have any number of dimensions, so scalars, vectors and matrices are simply 0d, 1d and 2d tensors.
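As a minimal sketch of how these four objects look in PyTorch, the snippet below builds one of each and prints its number of dimensions and its shape. The variable names are illustrative and not part of the tutorial.

```python
import torch

scalar = torch.tensor(1.54)                         # 0 dimensions
vector = torch.tensor([1.54, -1.38, 0.12, 5.56])    # 1 dimension
matrix = torch.tensor([[1.54, -1.38, 0.12, 5.56],
                       [2.71, -2.23, 0.15, 4.56],
                       [1.55, -2.87, 0.18, 3.56]])  # 2 dimensions
tensor3d = torch.stack([matrix, matrix])            # two stacked matrices: 3 dimensions

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("tensor3d", tensor3d)]:
    print(name, t.dim(), tuple(t.shape))
```

`dim()` returns the number of dimensions, while `shape` (or `size()`) returns the length of each dimension.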


In [None]:
import torch

In [None]:
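t1 = torch.rand(2,3,5) # uniform distribution from 0 to 1
print(t1)
print(t1.size())
t1 = t1.unsqueeze(-1) # add a new dimension of size 1 at the end
print(t1.size())
t1 = t1.squeeze(-1) # remove the last dimension (only has an effect if its size is 1)
print(t1.size())
t1 = t1.transpose(0,2) # swap dimensions 0 and 2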
print(t1.size())
print(t1)

tensor([[[0.8740, 0.9985, 0.4113, 0.9648, 0.4899],
         [0.5985, 0.2744, 0.1495, 0.1254, 0.8483],
         [0.1827, 0.3457, 0.3793, 0.5476, 0.2959]],

        [[0.4161, 0.6293, 0.5717, 0.7178, 0.4259],
         [0.2044, 0.3851, 0.2665, 0.5535, 0.2831],
         [0.4750, 0.7619, 0.5655, 0.5537, 0.6160]]])
torch.Size([2, 3, 5])
torch.Size([2, 3, 5, 1])
torch.Size([2, 3, 5])
torch.Size([5, 3, 2])
tensor([[[0.8740, 0.4161],
         [0.5985, 0.2044],
         [0.1827, 0.4750]],

        [[0.9985, 0.6293],
         [0.2744, 0.3851],
         [0.3457, 0.7619]],

        [[0.4113, 0.5717],
         [0.1495, 0.2665],
         [0.3793, 0.5655]],

        [[0.9648, 0.7178],
         [0.1254, 0.5535],
         [0.5476, 0.5537]],

        [[0.4899, 0.4259],
         [0.8483, 0.2831],
         [0.2959, 0.6160]]])


## Reshaping tensors

Sometimes we need to change the shape of a tensor in order to perform a certain operation, while keeping its elements and their order. The reshaped tensor is called a _view_ of the original one.

In [None]:
t1 = torch.randint(0, 10, (3,2,4)) # (low, high, tuple with tensor sizes)
print(t1)
print(t1.view(6,-1)) # -1: calculate this dim automatically (here 24 / 6 = 4)
print(t1.view(-1)) # flatten into a 1d tensor with all 24 elements

tensor([[[4, 4, 7, 7],
         [6, 0, 0, 6]],

        [[8, 6, 3, 3],
         [6, 5, 0, 7]],

        [[2, 2, 6, 9],
         [8, 4, 8, 5]]])
tensor([[4, 4, 7, 7],
        [6, 0, 0, 6],
        [8, 6, 3, 3],
        [6, 5, 0, 7],
        [2, 2, 6, 9],
        [8, 4, 8, 5]])
tensor([4, 4, 7, 7, 6, 0, 0, 6, 8, 6, 3, 3, 6, 5, 0, 7, 2, 2, 6, 9, 8, 4, 8, 5])
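Two properties of `view` are easy to miss, so here is a small sketch of them: a view shares its data with the original tensor, and the total number of elements has to stay the same. The tensor values are made up for illustration.

```python
import torch

t = torch.arange(24).view(3, 2, 4)   # 24 elements arranged as (3, 2, 4)

flat = t.view(-1)                    # 1d view of the same 24 elements
print(t.view(6, -1).shape)           # -1 is inferred as 24 / 6 = 4
print(t.view(-1, 8).shape)           # -1 is inferred as 24 / 8 = 3

# A view shares its data with the original tensor
flat[0] = 100
print(t[0, 0, 0])                    # the original tensor sees the change

# The total number of elements must stay the same
try:
    t.view(5, -1)                    # 24 is not divisible by 5
except RuntimeError as err:
    print("invalid view:", err)
```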


## Matrix multiplication

In linear algebra, there is an operation called _matrix multiplication_.

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.g63nN35FrxSHUNeNFym0SgHaF7%26pid%3DApi&f=1" width=500>

<img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fi.stack.imgur.com%2FyxMKj.png&f=1&nofb=1" width=500>

We can model the process of going from one layer of a neural network to the next as matrix multiplication.

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Forgs.mines.edu%2Fdaa%2Fwp-content%2Fuploads%2Fsites%2F38%2F2019%2F08%2F1_Gh5PS4R_A5drl5ebd_gNrg%402x.jpg&f=1&nofb=1" width=500>


Matrix dimensions have to match in order to perform multiplication: the number of columns of the first matrix must equal the number of rows of the second!

In [None]:
m1 = torch.tensor([[3,4], [2,1]])
m2 = torch.tensor([[1,5], [3,7]])
res = m1 @ m2 # can also use torch.matmul(m1, m2)
print(res)

tensor([[15, 43],
        [ 5, 17]])


In [None]:
m1 = torch.rand(10,5)
m2 = torch.rand(5,1)
res = m1 @ m2
print(res.size())

torch.Size([10, 1])
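To see what `@` computes, here is a plain-Python sketch of the row-by-column rule. The helper `matmul` is hypothetical (written only for this illustration) and works on lists of rows rather than tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    n_rows, inner, n_cols = len(a), len(b), len(b[0])
    # The number of columns of `a` must equal the number of rows of `b`
    if any(len(row) != inner for row in a):
        raise ValueError("matrix dimensions do not match")
    # Entry (i, j) is the dot product of row i of `a` and column j of `b`
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(n_cols)]
            for i in range(n_rows)]

m1 = [[3, 4], [2, 1]]
m2 = [[1, 5], [3, 7]]
print(matmul(m1, m2))  # [[15, 43], [5, 17]], the same values as `m1 @ m2` above
```

For example, the top-left entry is $3 \cdot 1 + 4 \cdot 3 = 15$.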


## Batching

Using tensors allows us to perform operations on many samples at a time.

For example, a single colored image of size 128x128 can be represented as a 3d tensor of shape (128, 128, 3).

If we want to process 32 images in a single operation, we can stack them together in a tensor of size (32, 128, 128, 3).

In [None]:
img = torch.rand(128, 128, 3)
batch = torch.stack([torch.rand(128, 128, 3) for _ in range(32)])
print(batch.size())

torch.Size([32, 128, 128, 3])


## Operating on batches

Performing mathematical operations on batches works the same way as performing them on matrices. We have to make sure that the last dimension of the batch tensor matches the first dimension of the matrix we multiply it by though!

Good: $(512, 128, 128, 3) \times (3, 12)$

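Bad: $(128, 3, 128) \times (3, 12)$ <- will give an error, because the last dimension (128) does not match 3

Solution: use `Tensor.transpose(dim1, dim2)` to swap two dimensions. For example, `transpose(1, 2)` turns $(128, 3, 128)$ into $(128, 128, 3)$.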

In [None]:
print(batch.size())
w = torch.randn(3, 12)
print(w.size())
res = batch @ w
print(res.size())

torch.Size([32, 128, 128, 3])
torch.Size([3, 12])
torch.Size([32, 128, 128, 12])
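The "bad" case and its fix can be sketched as follows. The tensor names `x` and `x_fixed` are illustrative, and the shapes are the ones from the example above.

```python
import torch

x = torch.rand(128, 3, 128)   # the "bad" shape: last dimension is 128, not 3
w = torch.randn(3, 12)

try:
    x @ w
except RuntimeError as err:
    print("shape mismatch:", err)

x_fixed = x.transpose(1, 2)   # swap dims 1 and 2 -> (128, 128, 3)
res = x_fixed @ w
print(res.size())
```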


## Feature engineering

Feature engineering is the art of converting real-life objects (photos, text, videos, proteins) into _features_ - numbers that our model can work with.

For some objects this is easier (images are already tensors), for others it is harder.

### Tokenisation 

In order to convert text to numbers, we need to _tokenise_ it - split it into small units (_tokens_) and map each token to a number:

<img src="https://freecontent.manning.com/wp-content/uploads/Chollet_DLfT_01.png" width=500>
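A very simple word-level tokeniser can be sketched in a few lines of plain Python. This is only one possible scheme (splitting on whitespace and numbering tokens in order of first appearance), and the example sentence is made up.

```python
text = "the cat sat on the mat"

tokens = text.split()                      # split on whitespace
vocab = {}                                 # token -> integer id
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)          # next unused id

ids = [vocab[token] for token in tokens]
print(vocab)
print(ids)
```

Repeated words ("the") get the same id, so the model sees them as the same token.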

### n-grams

For protein sequences, we have a couple of options. We could use each individual amino acid as a unique token (then we will get an alphabet of size 21).

We could also use each pair of adjacent amino acids as a single token (how many tokens would we have then?). Such a token is called a bi-gram.
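Here is a rough sketch of bi-gram tokenisation in plain Python. Several things in it are assumptions: the 21-letter alphabet (the 20 standard amino acids plus `X` for "unknown"; the tutorial does not say which 21 symbols it means), the use of overlapping pairs, and the made-up sequence.

```python
from itertools import product

# Hypothetical 21-letter alphabet: 20 standard amino acids plus "X" for "unknown"
alphabet = "ACDEFGHIKLMNPQRSTVWYX"

# Every ordered pair of letters gets its own id
bigram_vocab = {a + b: i for i, (a, b) in enumerate(product(alphabet, repeat=2))}

def bigrams(sequence):
    """Split a sequence into overlapping pairs of neighbouring letters."""
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]

seq = "MKTAY"   # made-up sequence
tokens = bigrams(seq)
print(tokens)
print([bigram_vocab[t] for t in tokens])
print(len(bigram_vocab))   # size of the bi-gram vocabulary
```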

## Fully connected neural networks

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Neural_network.svg/2560px-Neural_network.svg.png" width=500>

__Input neurons__ are our input data - pixel values, tokens etc.

__Weights__ are our trainable parameters - we assume that if we find the right weights (and the right model architecture) we can make correct predictions.

__Hidden neurons__ are the states of our model - we assume that they can "decide" whether to push the data forward or not.

__Activation functions__ decide whether hidden neurons will allow the data to propagate further.

__Output layer__ is our prediction - for a classification problem it is the class probability, for a regression problem it's the predicted value.
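As a sketch of how these pieces map to PyTorch, the snippet below builds a tiny fully connected classifier. The layer sizes (4 inputs, 8 hidden neurons, 3 classes) are arbitrary choices for illustration.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),    # weights between input and hidden neurons
    torch.nn.ReLU(),          # activation function
    torch.nn.Linear(8, 3),    # weights between hidden and output neurons
    torch.nn.Softmax(dim=-1)  # turn outputs into class probabilities
)

x = torch.rand(16, 4)         # batch of 16 samples with 4 input features each
probs = model(x)
print(probs.size())
print(probs.sum(dim=-1))      # each row sums to (approximately) 1
```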

## Activation functions

Once we have calculated the input value for each neuron (the sum of the values of all neurons in the previous layer, each multiplied by its respective weight), we apply an activation function to that value.

This function "decides" whether the signal should be propagated further or not.

Examples of common activation functions are:

* Sigmoid - $\frac{1}{1 + \exp(-x)}$
* ReLU - $\max(0,x)$


<img src="https://pytorch.org/docs/stable/_images/Sigmoid.png" width=400>

<img src="https://pytorch.org/docs/stable/_images/ReLU.png" width=400>
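Both formulas can be written out by hand and compared with PyTorch's built-in `torch.sigmoid` and `torch.relu`. This is a minimal sketch with made-up input values.

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0])

sigmoid_manual = 1 / (1 + torch.exp(-x))              # 1 / (1 + exp(-x))
relu_manual = torch.maximum(torch.zeros_like(x), x)   # max(0, x)

print(sigmoid_manual)
print(relu_manual)

# Both match PyTorch's built-in versions
print(torch.allclose(sigmoid_manual, torch.sigmoid(x)))
print(torch.equal(relu_manual, torch.relu(x)))
```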


## Neural network as a math problem

<img src="https://i.stack.imgur.com/j2qa7.png" width=500>


This way we can formulate our output $\hat{Y}$ as a mathematical equation, where $X$ is the input, $W_i$ are the weight matrices and $f_i$ are the activation functions:

$\hat{Y} = f_2(f_1(XW_1)W_2)W_3$

$\hat{y} = f_2(f_1(x\times w_1) \times w_2) \times w_3$

And then the loss can be calculated from that:

$L_{\text{cross-entropy}} = l(\hat{Y}, Y)$
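The equation for $\hat{Y}$ and the loss can be written almost literally in PyTorch. In this sketch the sizes, the random weights and the choice of ReLU for $f_1$ and sigmoid for $f_2$ are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.rand(4, 5)                 # 4 samples, 5 features
Y = torch.tensor([0, 2, 1, 2])       # true class of each sample (3 classes)

W1 = torch.randn(5, 8)
W2 = torch.randn(8, 8)
W3 = torch.randn(8, 3)
f1 = torch.relu
f2 = torch.sigmoid

Y_hat = f2(f1(X @ W1) @ W2) @ W3     # same structure as the equation for Y_hat
loss = F.cross_entropy(Y_hat, Y)     # applies softmax to Y_hat internally
print(Y_hat.size())
print(loss)
```

The loss is a single number (a 0d tensor) that summarises how wrong the predictions for the whole batch are.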

The cool thing about all of this is that the loss is a well-defined mathematical function of the weights, which means we can take the _derivative_ of the whole thing with respect to the weights (using the chain rule).

If we can take a derivative, we can find the gradient. Then we can use the gradient to minimise the loss (again, w.r.t. weights of the model).

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse4.mm.bing.net%2Fth%3Fid%3DOIP.lYpF8xJ3TiDoq461I0AcOQHaEn%26pid%3DApi&f=1" width=400>

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse4.mm.bing.net%2Fth%3Fid%3DOIP.VvLkkH3MOBHJIRgUMSWCVgHaFS%26pid%3DApi&f=1" width=400>

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse4.mm.bing.net%2Fth%3Fid%3DOIP.AC-5oNFKLqNdl7EoxwFJNQHaEc%26pid%3DApi&f=1" width=400>
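PyTorch computes these derivatives automatically. The sketch below uses a toy loss with a single weight (not a real network) to show one gradient step; the starting value and the learning rate are arbitrary.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # a single trainable weight

loss = (w - 1) ** 2      # toy loss with its minimum at w = 1
loss.backward()          # compute d(loss)/dw via the chain rule
print(w.grad)            # analytically: 2 * (w - 1) = 4

lr = 0.1                 # step size (learning rate), chosen arbitrarily
with torch.no_grad():
    w -= lr * w.grad     # step against the gradient
print(w)

new_loss = (w - 1) ** 2
print(new_loss < loss)   # the loss went down
```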

## Using that, we can train neural networks by:

1. Acquiring the dataset
1. Featurising the data
1. Initialising the model with random weights
1. Defining the loss function
1. Making predictions on a batch of data points
1. Calculating the loss
1. Adjusting the weights according to the gradients
1. Repeating steps 5-7 until the loss stops improving
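The steps above can be sketched as a short training loop. To keep it small, this example swaps in a made-up regression dataset, a single linear layer and a mean squared error loss instead of cross-entropy. The batch size, learning rate and number of epochs are arbitrary.

```python
import torch

torch.manual_seed(0)

# 1-2. A made-up dataset that is already numeric: y depends linearly on X
X = torch.rand(64, 3)
y = X @ torch.tensor([[2.0], [-1.0], [0.5]])

model = torch.nn.Linear(3, 1)                       # 3. random initial weights
loss_fn = torch.nn.MSELoss()                        # 4. loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = loss_fn(model(X), y).item()

for epoch in range(50):                             # 8. repeat
    for start in range(0, len(X), 16):              # batches of 16 samples
        xb, yb = X[start:start + 16], y[start:start + 16]
        pred = model(xb)                            # 5. predictions on a batch
        loss = loss_fn(pred, yb)                    # 6. loss
        optimizer.zero_grad()                       # clear old gradients
        loss.backward()                             # compute new gradients
        optimizer.step()                            # 7. adjust the weights

with torch.no_grad():
    final_loss = loss_fn(model(X), y).item()
print(initial_loss, final_loss)
```

The optimizer (`torch.optim.SGD` here) takes care of the "adjust the weights according to the gradients" step.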

## GPU

The calculations required to train neural networks run much faster on a GPU - the part of the PC responsible for graphics.

In order to move tensors and models between CPU and GPU, we have to use special commands such as `.cuda()` and `.cpu()`.

In [None]:
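t1 = torch.rand(16,10)
if torch.cuda.is_available():
  t1 = t1.cuda() # move the tensor to GPU

layer = torch.nn.Linear(10, 20) # named `layer` to avoid clashing with the common `nn` alias for torch.nn
if torch.cuda.is_available():
  layer = layer.cuda() # move the layer's weights to GPU

res = layer(t1) # calling the module runs its forward pass
print(res.size())
print(res.device)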

torch.Size([16, 20])
cuda:0


### GPU errors

If the tensors and modules involved in an operation are on different devices, we get an error.


In [None]:
t1 = torch.rand(16, 10)
layer = torch.nn.Linear(10,20)
if torch.cuda.is_available():
  t1 = t1.cuda()
layer(t1) # This gives an error (when a GPU is available), since t1 is on GPU and layer is not

RuntimeError: ignored

#### Fixing

To fix such errors, ensure that all tensors and modules are on the same device.
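One common way to do this is to pick a device once and move everything to it with `.to(device)`. This is a minimal sketch; the names `device` and `layer` are illustrative.

```python
import torch

# Pick the GPU if there is one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t1 = torch.rand(16, 10).to(device)
layer = torch.nn.Linear(10, 20).to(device)   # move the weights to the same device

res = layer(t1)                              # works on both CPU-only and GPU machines
print(res.size())
print(res.device)
```

Because both the tensor and the layer are moved to the same `device`, the code runs unchanged whether or not a GPU is present.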

In [None]:
t1 = torch.rand(10,20)
t2 = torch.rand(30,40)
(t1 @ t2.view(20, -1)).size() # view t2 as (20, 60) so that it can be multiplied with t1 of shape (10, 20)

torch.Size([10, 60])