# PyTorch intro

## About

PyTorch is a Python library for implementing ML models. It was developed by Facebook researchers. It can execute code on CUDA devices or ordinary CPUs; [some support for AMD devices also exists](https://rocm.github.io/pytorch.html), and [there are some efforts to run PyTorch on Google's TPUs](https://github.com/pytorch/xla). Many developers prefer it for implementing deep learning models. Unlike other high-level libraries (like Keras), it allows for full customization of learning models. Custom optimization algorithms, loss functions, etc. can be easily added by a programmer alongside the existing ones.

It favours developing NNs as dynamic computation graphs, i.e. the structure of the neural network can be decided at runtime. A major plus is the auto-differentiation module, which automatically computes the gradients so that they can be used to update the weights. Thus, gradient-based optimization methods are easy to implement.
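
As a minimal sketch of what "decided at runtime" means, the hypothetical function `repeated_product` below runs an ordinary Python loop whose length is a runtime argument; autograd differentiates through whichever operations were actually executed. The function name and the values are illustrative assumptions.

```python
import torch

def repeated_product(x, n_steps):
    # The number of multiplications is decided at runtime by a plain Python loop.
    y = x
    for _ in range(n_steps):
        y = y * x
    return y

x = torch.tensor(2.0, requires_grad=True)
y = repeated_product(x, n_steps=3)  # computes x**4
y.backward()                        # d(x**4)/dx = 4 * x**3

print(y.item(), x.grad.item())
```

Calling the function with a different `n_steps` builds a different graph, with no separate compilation step.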

The official documentation can be found at the [PyTorch documentation site](https://pytorch.org/docs/stable/index.html). There are plenty of free resources, like the [PyTorch forums](https://discuss.pytorch.org) or the book [Deep Learning with PyTorch](https://pytorch.org/deep-learning-with-pytorch).

## Installation

The official installation details are at [https://pytorch.org](https://pytorch.org); we recommend following the steps on that page, as the packages (e.g. cudatoolkit) are continuously released under new version numbers. It is recommended to create a Python virtual environment, e.g. via `conda`:
```
conda create --name pytorch anaconda --yes
```
followed by `conda activate pytorch && conda update --all --yes`. 

If an NVIDIA GPU is available, PyTorch can be installed as:
```
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
```
If only the CPU is to be used, one can install a CPU-only distribution as:
```
conda install pytorch torchvision cpuonly -c pytorch
```
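
After installation, a quick sanity check is to import the library and ask whether it can see a CUDA device. This is a minimal sketch; the printed values depend on the machine and on the installed version.

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True only if a usable CUDA device was found
```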

## Basic notions

* a `Tensor` object is a multidimensional array, similar to a NumPy array
* class `Dataset`: the bridge between your data and n-dimensional `Tensor` objects
* class `DataLoader`: loads data from a `Dataset` object and can launch child processes to prepare data for training the model
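
The following sketch shows how the three notions fit together. The class `PairsDataset` and the toy data are hypothetical, introduced only for illustration; `Dataset` and `DataLoader` come from `torch.utils.data`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairsDataset(Dataset):
    """Hypothetical dataset wrapping two tensors of equal length."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

inputs = torch.arange(10, dtype=torch.float32)
targets = 2 * inputs
dataset = PairsDataset(inputs, targets)

# num_workers=0 loads data in the main process; a positive value starts child processes
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one final batch of 2 samples.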

## First steps

### Tensors on CPU

Tensors represent both data and trainable weights. 
![Tensor drawings](./images/tensors.png)
Image source: Ref. 2. 

Unlike NumPy arrays, PyTorch tensors can be loaded into GPUs. 

In [1]:
import torch
a = torch.ones(3)
a

tensor([1., 1., 1.])

In [2]:
points = torch.tensor([4.0, 1.0, 5.0, 3.0, 2.0, 1.0])
points

tensor([4., 1., 5., 3., 2., 1.])

In [3]:
points ** 2

tensor([16.,  1., 25.,  9.,  4.,  1.])

In [4]:
points2 = points.reshape(3, 2)
points2

tensor([[4., 1.],
        [5., 3.],
        [2., 1.]])

In [5]:
# points and points2 share the same memory
points2[1, 1] *= -1
points
# compare the data addresses; ids of temporary storage() wrappers are not reliable
assert points.data_ptr() == points2.data_ptr()

In [6]:
points2.T

tensor([[ 4.,  5.,  2.],
        [ 1., -3.,  1.]])

In [7]:
three_d_array = torch.ones(3, 4, 7)
three_d_array

tensor([[[1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.]],

        [[1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.]],

        [[1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1., 1.]]])

In [8]:
double_points = torch.ones(10, 2, dtype=torch.double)
double_points.dtype

torch.float64

In [9]:
# Convert from numpy to torch and the other way around
points = torch.ones(3, 4)
points_np = points.numpy()
type(points_np)

numpy.ndarray

In [10]:
points = torch.from_numpy(points_np)
type(points)

torch.Tensor
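
One detail worth keeping in mind: for CPU tensors, `numpy()` and `torch.from_numpy` do not copy the data, so the tensor and the array share memory. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import torch

t = torch.ones(3)
a = t.numpy()            # no copy: `a` is a view over the tensor's memory
t[0] = 5.0               # modify the tensor ...
print(a)                 # ... and the NumPy array reflects the change

b = np.zeros(2, dtype=np.float32)
u = torch.from_numpy(b)  # again no copy
b[1] = 7.0               # modify the array ...
print(u)                 # ... and the tensor reflects the change
```

If an independent copy is needed, calling `clone()` on the tensor (or `copy()` on the array) before converting avoids the aliasing.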

In [11]:
# tensor serialization
torch.save(points, 'tensor.pt')
!dir *.pt

 Volume in drive D is data
 Volume Serial Number is 503D-AB7C

 Directory of d:\work\cercetare\MLReadingGroup\2020\03.March\10

03/09/2020  11:13 PM               387 tensor.pt
               1 File(s)            387 bytes
               0 Dir(s)  1,222,932,566,016 bytes free


In [12]:
del points # now points is undefined
points = torch.load('tensor.pt')
points

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

In [13]:
# recovering the value from a tensor: use item() 
s = points.sum().item()
s

12.0

### Putting tensors on GPU

Note: the lines below only work if the GPU version of PyTorch is installed and a CUDA device is available.

By default, a tensor or a model is placed on the CPU:

In [14]:
points.device

device(type='cpu')

One can change the device as:

In [15]:
points_gpu = points.to(device='cuda')
# for multiple GPUs, choose which GPU to be used
# points_gpu = points.to(device='cuda:1')
print(points_gpu.device)

cuda:0


In [16]:
# similar:
points_gpu = points.cuda()
points_gpu.device

device(type='cuda', index=0)

Methods with the suffix `_` modify the objects in place:

In [17]:
points_gpu

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], device='cuda:0')

In [18]:
points_gpu.zero_()
points_gpu

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0')

In [19]:
# this is executed on GPU
points_gpu *= 2
points_gpu

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0')

In [20]:
# Putting the tensor back onto the CPU
points_cpu = points_gpu.to(device='cpu')
points_cpu.device

device(type='cpu')

In [21]:
# same as
points_cpu = points_gpu.cpu()
points_cpu.device

device(type='cpu')
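
Code that should run on machines with or without a GPU can pick the device once and reuse it. The sketch below is one common pattern, not the only one; the variable names are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

points = torch.ones(3, 4)
points_dev = points.to(device)   # returns the same tensor if it is already on `device`
doubled = 2 * points_dev         # runs on whichever device holds the tensor
print(doubled.device.type, doubled.sum().item())
```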

More details can be found in Ref 2, chapter 2. Chapter 3 deals with misc. data input: text, image, time series, tabular.

## Model training with PyTorch

A popular approach in ML is to design a loss function whose optimization leads to a proper estimate of the parameters of a predictive model. 

For a loss function $L = L(w_1, \dots, w_n, b)$ whose minimum value is sought, the gradient can be used to update the weights $w_i$ and bias $b$ as follows:
$$
\begin{aligned}
w_i &= w_i - \alpha \frac{\partial L}{\partial w_i}(w_1, \dots, w_n, b)
\\
b &= b - \alpha \frac{\partial L}{\partial b}(w_1, \dots, w_n, b)
\end{aligned}
$$
with $\alpha > 0$ as the learning rate. PyTorch implements computation of partial derivatives (i.e. of the gradient) automatically. 
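
To make the update rule concrete, here is a sketch in plain Python for a hypothetical one-parameter loss $L(w) = (w - 3)^2$, whose derivative is written by hand. The loss, the starting point and the learning rate are assumptions chosen for illustration.

```python
# Hypothetical one-parameter loss L(w) = (w - 3)^2, with its minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

def dloss_dw(w):
    # derivative computed by hand: dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate
w = 0.0       # initial guess
for step in range(50):
    w = w - alpha * dloss_dw(w)   # gradient descent update

print(w, loss(w))
```

Each step shrinks the distance to the minimum by a constant factor of 0.8, so `w` approaches 3.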

We will use linear regression as a model to build the relationship between input and output values. The input and output values are, respectively:

In [22]:
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]

We convert them into Tensors:

In [23]:
t_u = torch.tensor(t_u)
t_c = torch.tensor(t_c)

The linear model sought has the form:
$$
t_c = w \cdot t_u + b
$$

where $w$ (weight) and $b$ (bias) are to be determined based on data. We are looking for the values of $w$ and $b$ that minimize the loss function
$$
L(w, b) = \frac{1}{n} \sum\limits_{i=0}^{n-1} (w \cdot t_u[i] + b - t_c[i])^2
$$
where $n$ is the total number of points in the data set. As the loss function is strictly convex in this case (the input values are not all equal), the $w, b$ that minimize $L$ are unique.
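
For this small problem the minimizer can also be computed directly, which gives a reference value to compare against. The sketch below uses the standard closed-form solution for least squares with a single input (slope = covariance / variance); the names `u_centered` and `c_centered` are illustrative.

```python
import torch

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])

# Closed-form least squares for a single input: w = cov(t_u, t_c) / var(t_u)
u_centered = t_u - t_u.mean()
c_centered = t_c - t_c.mean()
w = (u_centered * c_centered).sum() / (u_centered ** 2).sum()
b = t_c.mean() - w * t_u.mean()

print(w.item(), b.item())
```

The result is roughly $w \approx 0.54$ and $b \approx -17.3$ for the unscaled inputs.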

Let $m(w, b, x)$ denote the output of the predictive model for trainable parameters $w$, $b$ and current input $x$. For a linear regression model, $m$ has the form:
$$
m(w, b, x) = w\cdot x + b
$$

The gradient of the loss function is:
$$
\nabla_{w, b} L = \begin{bmatrix}
\frac{\partial L}{\partial w}
\\
\frac{\partial L}{\partial b}
\end{bmatrix} = 
\begin{bmatrix}
\frac{\partial L}{\partial m(w, b, x)} \cdot \frac{\partial m(w, b, x)}{\partial w}
\\
\frac{\partial L}{\partial m(w, b, x)} \cdot \frac{\partial m(w, b, x)}{\partial b}
\end{bmatrix}
$$

For simple models, partial derivatives such as those above can be computed manually. However, we *do not reject libraries which compute the gradients by themselves* :)
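
As a sanity check, the chain rule can be applied by hand for the linear model and compared with what autograd returns. This is a sketch under the assumption of a mean squared error loss and the starting point $w = 1$, $b = 0$; the names `manual_dw` and `manual_db` are illustrative.

```python
import torch

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])

w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

t_p = w * t_u + b
loss = ((t_p - t_c) ** 2).mean()
loss.backward()   # autograd fills w.grad and b.grad

# Chain rule by hand: dL/dm = 2 * (m - t_c) / n, dm/dw = t_u, dm/db = 1
with torch.no_grad():
    dloss_dm = 2 * (t_p - t_c) / t_u.numel()
    manual_dw = (dloss_dm * t_u).sum()
    manual_db = dloss_dm.sum()

print(manual_dw.item(), w.grad.item())
print(manual_db.item(), b.grad.item())
```

Both routes give the same numbers up to floating-point rounding.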

In [24]:
def model(t_u, w, b):
    return w * t_u + b

In [25]:
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

To allow partial derivatives to be computed, the parameters $w$ and $b$ are put into a tensor that has gradient tracking enabled:

In [26]:
params = torch.tensor([1.0, 0.0], requires_grad=True)

Initially, as no partial derivatives with respect to `params` have been computed, the tensor's gradient is `None`:

In [27]:
params.grad is None

True

To compute the gradients, which will then be used to update the corresponding weights, one must call the `backward` method:

In [28]:
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()
params.grad is None

False

In [29]:
params.grad

tensor([4517.2969,   82.6000])

One has to carefully decide which tensors require gradient computation (e.g. weights, biases) and which do not (e.g. input tensors; they are constant, not updatable).
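
A small sketch of this distinction, with made-up values: the input `x` is created without gradient tracking, the weight `w` with it, and only `w` receives a gradient after `backward`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])            # input data: constant, no gradient needed
w = torch.tensor(0.5, requires_grad=True)    # trainable weight

loss = ((w * x) ** 2).sum()
loss.backward()

print(x.requires_grad, x.grad)   # the input has no gradient
print(w.requires_grad, w.grad)   # the weight does
```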

The forward and backward steps are depicted below (source: Ref 2).

![Forward and backward propagation](./images/gradients.png)

Note that `backward` accumulates gradients into `.grad` rather than overwriting them, so before calling it again one has to manually zero the gradients of the parameters via:

```python
if params.grad is not None:
    params.grad.zero_()
```

There are only a few situations (e.g. some recurrent neural network setups) in which manual zeroing is not needed.
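
The reason for zeroing is that `backward` adds to an existing `.grad` instead of replacing it. A minimal sketch with a single made-up parameter:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()           # gradient of 3 * w with respect to w

(3 * w).backward()              # without zeroing, the new gradient is added to the old one
second = w.grad.item()

w.grad.zero_()                  # reset the accumulated gradient in place
(3 * w).backward()
after_zeroing = w.grad.item()

print(first, second, after_zeroing)
```

The second value is twice the first, while the value computed after `zero_()` matches the first again.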

One final note is on input scaling: if the inputs are on a large scale, one can get NaN or infinity values during training. To avoid this, it is suggested to scale the values. Popular ranges to consider are $[0, 1]$ and $[-1, 1]$. For the input `t_u` it suffices to multiply the values by 0.1, obtaining the input vector `t_un`. 
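
The effect can be reproduced with a small experiment. The sketch below runs 100 steps of plain gradient descent on the data above, once with the raw inputs and once with the inputs multiplied by 0.1; the helper `final_loss`, the step count and the learning rate of 1e-2 are assumptions chosen for illustration.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def final_loss(inputs, n_steps=100, learning_rate=1e-2):
    # plain gradient descent on the mean squared error of w * inputs + b
    params = torch.tensor([1.0, 0.0], requires_grad=True)
    for _ in range(n_steps):
        if params.grad is not None:
            params.grad.zero_()
        w, b = params
        loss = ((w * inputs + b - t_c) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            params -= learning_rate * params.grad
    return loss.item()

loss_raw = final_loss(t_u)           # unscaled inputs: the updates overshoot and blow up
loss_scaled = final_loss(0.1 * t_u)  # scaled inputs: the loss stays finite
print(loss_raw, loss_scaled)
```

With the raw inputs the gradient with respect to `w` is so large that each step overshoots the minimum, and the loss ends up as `inf` or `nan`; a much smaller learning rate would be the alternative to scaling.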

The full code follows:

In [30]:
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

In [31]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        # gradients accumulate across backward() calls, so reset them first
        if params.grad is not None:
            params.grad.zero_()

        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()

        # update the parameters in place, without tracking the update in the graph
        with torch.no_grad():
            params -= learning_rate * params.grad

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))

    return params

In [32]:
best_params = training_loop(
    n_epochs=5000,
    learning_rate=1e-2,
    params=torch.tensor([1.0, 0.0], requires_grad=True),  # initial guess: w = 1, b = 0
    t_u=t_un,  # scaled inputs
    t_c=t_c)

Epoch 500, Loss 7.860116
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647


In [33]:
best_params

tensor([  5.3671, -17.3012], requires_grad=True)

In [34]:
t_hat = best_params[0] * t_un + best_params[1]
t_hat - t_c

tensor([ 1.3593, -1.2992, -1.0648, -1.3448,  1.9155,  0.9439, -2.1068, -1.6009,
         2.6755,  2.1160, -1.5903], grad_fn=<SubBackward0>)
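
The manual update can also be delegated to one of PyTorch's built-in optimizers. The sketch below is an alternative formulation, not the loop used above: it assumes `torch.optim.SGD` with the same learning rate and number of epochs, and should land on roughly the same parameters.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=1e-2)

for epoch in range(5000):
    optimizer.zero_grad()                    # replaces the manual params.grad.zero_()
    w, b = params
    loss = ((w * t_un + b - t_c) ** 2).mean()
    loss.backward()
    optimizer.step()                         # params <- params - lr * params.grad

print(params.detach(), loss.item())
```

Swapping in a different optimizer then only requires changing the line that constructs `optimizer`.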

## Bibliography
1. [PyTorch documentation site](https://pytorch.org/docs/stable/index.html)
2. [Free book Deep Learning with PyTorch](https://pytorch.org/deep-learning-with-pytorch)