# Notes on PyTorch

Sigmoid maps any real number into the range (0, 1), so use it as the final activation when you want the output of your neural network to be a probability.
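
As a small illustration of that range, here is a sketch that applies `torch.sigmoid` to a few made-up scores (the values in `logits` are arbitrary, chosen only for the example):

```python
import torch

# Arbitrary raw scores, chosen only for illustration
logits = torch.tensor([-2.0, 0.0, 2.0])

# Elementwise 1 / (1 + exp(-x)); every output lies strictly between 0 and 1
probs = torch.sigmoid(logits)
print(probs)  # the middle value is exactly 0.5, since sigmoid(0) = 0.5
```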

`torch.manual_seed(n)`

This seeds the random number generator so that our results are reproducible: running the code again with the same seed generates the same random values, and therefore the same predictions.

`torch.randn(row, col)`

The function above creates a tensor of the given shape filled with values drawn from the standard normal distribution (mean zero, standard deviation one).

In [12]:
import torch
torch.manual_seed(18)
torch.randn((1, 5))

tensor([[ 0.5941, -0.1271, -0.7287,  0.7212, -0.5660]])
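
To see what the seed buys us, a minimal check is to draw the same tensor twice, resetting the seed in between. The names `first` and `second` are just for this sketch:

```python
import torch

torch.manual_seed(18)
first = torch.randn((1, 5))

# Resetting the seed puts the generator back in the same state
torch.manual_seed(18)
second = torch.randn((1, 5))

print(torch.equal(first, second))  # True
```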

`torch.randn_like`
This creates a random tensor with the same shape as the input tensor, filled with values drawn from the standard normal distribution.
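
A short sketch, assuming a placeholder tensor named `features` whose contents don't matter (only its shape and dtype are used):

```python
import torch

# Placeholder input; only its shape and dtype are used
features = torch.zeros(3, 4)

noise = torch.randn_like(features)
print(noise.shape)  # torch.Size([3, 4])
print(noise.dtype)  # torch.float32
```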

`torch.mm(features, weights)` or `torch.matmul(features, weights)`

This performs a matrix multiplication. Remember that the number of columns of the first input must match the number of rows of the second.

Be careful with the shapes of the tensors being passed in.
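
The sketch below shows what a shape mismatch looks like and one way to resolve it. The tensors `features` and `weights` are random stand-ins, not real data:

```python
import torch

features = torch.randn((1, 5))  # 1 row, 5 columns
weights = torch.randn((1, 5))   # also 1 row, so the inner dimensions (5 and 1) differ

try:
    torch.mm(features, weights)
except RuntimeError as err:
    print("shape mismatch:", err)

# Changing weights to 5 rows and 1 column makes the inner dimensions agree
output = torch.mm(features, weights.view(5, 1))
print(output.shape)  # torch.Size([1, 1])
```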

`tensor.shape`, as in NumPy, tells us the shape of a tensor. If the shapes don't match, we can change them with `reshape`, `resize_` or `view`.

`torch.randn((1, 5)).view(5,1)` or `torch.randn((1, 5)).reshape(5,1)` or `torch.randn((1, 5)).resize_(5,1)`


**Note:** the trailing underscore (`_`) in a method name indicates an in-place operation.

In [16]:
torch.randn((1, 5)).resize_(5,1).shape

torch.Size([5, 1])

In [17]:
torch.randn((1, 5)).view(5, 1).shape

torch.Size([5, 1])
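
The trailing-underscore convention is not specific to `resize_`. As an illustration, the sketch below compares `add` with its in-place counterpart `add_`; the names `t`, `u` and `unchanged` are just for the example:

```python
import torch

t = torch.ones(3)

# Out-of-place: returns a new tensor and leaves t alone
u = t.add(1)
unchanged = torch.equal(t, torch.ones(3))
print(unchanged)  # True

# In-place: modifies t itself
t.add_(1)
print(torch.equal(t, u))  # True
```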

## Torchvision

Torchvision is a package that sits alongside PyTorch and provides useful utilities, such as datasets and models, for computer vision problems.

#### Transforms

It also has a module for common image transformations. These can be chained together using `transforms.Compose(...)`, where `...` is a list of transforms.


**Example**
```python
from torchvision import transforms

transform = transforms.Compose([transforms.Resize(28),
                                transforms.ToTensor(),
                                transforms.Normalize(mean=(0.5, 0.5, 0.5),
                                                     std=(0.5, 0.5, 0.5))])
```

In the above example, `Resize` resizes each image (given a single integer, it scales the shorter edge to that size), `ToTensor()` converts each image to a tensor, and `Normalize` normalizes a tensor image with the given mean and standard deviation. Given mean: `(M1, ..., Mn)` and std: `(S1,...,Sn)` for `n` channels, this transform will normalize each channel of the input `torch.*Tensor`, i.e. `input[channel] = (input[channel] - mean[channel]) / std[channel]`.
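
To make the normalization formula concrete, here is a sketch that applies it by hand with plain tensor operations rather than through torchvision. The `image` tensor is random stand-in data in the range that `ToTensor()` produces:

```python
import torch

# Stand-in for an image after ToTensor(): 3 channels, values in [0, 1)
image = torch.rand(3, 28, 28)

# One mean and one std per channel, shaped so they broadcast over height and width
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

# input[channel] = (input[channel] - mean[channel]) / std[channel]
normalized = (image - mean) / std

# With mean 0.5 and std 0.5, values in [0, 1] end up in [-1, 1]
print(normalized.min().item(), normalized.max().item())
```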

## Batch Size

The batch size is the number of samples returned per iteration: every time we get a set of images and labels out of the dataloader, we're getting `batch_size` images and their labels.
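
A minimal sketch of this, using random tensors wrapped in a `TensorDataset` in place of a real image dataset (the sizes 256 and 64 are arbitrary choices for the example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 256 single-channel 28x28 "images" with labels 0-9
images = torch.randn(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))

dataset = TensorDataset(images, labels)
trainloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Each iteration yields one batch of images and the matching labels
batch_images, batch_labels = next(iter(trainloader))
print(batch_images.shape)  # torch.Size([64, 1, 28, 28])
print(batch_labels.shape)  # torch.Size([64])
```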

## Backprop

Multilayer networks are trained through backpropagation, which is just an application of the chain rule from calculus.

<img src='intro-to-pytorch/assets/backprop_diagram.png' width=550px>

In the forward pass through our network, our data and operations go from bottom to top here. We pass the input $x$ through a linear transformation $L_1$ with weights $W_1$ and biases $b_1$. The output then goes through the sigmoid operation $S$ and another linear transformation $L_2$. Finally we calculate the loss $\ell$. We use the loss as a measure of how bad the network's predictions are. The goal then is to adjust the weights and biases to minimize the loss.

To train the weights with gradient descent, we propagate the gradient of the loss backwards through the network. Each operation has some gradient between the inputs and outputs. As we send the gradients backwards, **we multiply the incoming gradient with the gradient for the operation.** Mathematically this is really just calculating the gradient of the loss with respect to the weights using the chain rule. 

$$
\large \frac{\partial \ell}{\partial W_1} = \frac{\partial L_1}{\partial W_1} \frac{\partial S}{\partial L_1} \frac{\partial L_2}{\partial S} \frac{\partial \ell}{\partial L_2}
$$


We update our weights using this gradient with some learning rate $\alpha$. 

$$
\large W^\prime_1 = W_1 - \alpha \frac{\partial \ell}{\partial W_1}
$$
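
As a toy illustration of the update rule, the sketch below takes one gradient descent step on a made-up loss, the sum of squared weights, whose gradient is worked out by hand. The values of `w` and `alpha` are arbitrary:

```python
import torch

w = torch.tensor([1.0, -2.0])
alpha = 0.1

# Toy loss: sum of squares. Its gradient with respect to w is 2 * w.
loss_before = (w ** 2).sum()
grad = 2 * w

# W' = W - alpha * dloss/dW
w_new = w - alpha * grad
loss_after = (w_new ** 2).sum()

print(w_new)  # approximately [0.8, -1.6]
print(loss_before.item(), loss_after.item())  # the loss gets smaller
```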


In [None]:
# Build a feed-forward network
# define the loss
# get our data
# flatten the images
# do a forward pass
# get our logits


## Autograd 

Now that we know how to calculate a loss, we can use autograd, a module PyTorch provides for automatically calculating the gradients of tensors. We will use it to calculate the gradients of the loss with respect to all of our parameters.

Autograd works by keeping track of operations performed on tensors, then going backwards through those operations, calculating gradients along the way. To make sure torch keeps track of operations on a tensor and calculates the gradients, you need to set `requires_grad = True` on a tensor. 

We can also turn off gradients for a block of code using `torch.no_grad()`

Example:

```python
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
...
>>> y.requires_grad
False

```

In [22]:
x = torch.randn(2, 2, requires_grad=True)
print(x)

y = x ** 2
print(y)

print(y.grad_fn)

tensor([[ 1.2412,  0.0113],
        [-0.0857,  1.9370]], requires_grad=True)
tensor([[1.5406e+00, 1.2746e-04],
        [7.3512e-03, 3.7521e+00]], grad_fn=<PowBackward0>)
<PowBackward0 object at 0x10fab3ac8>


We can see the operation that created `y`: a power operation, `PowBackward0`.

The autograd module keeps track of these operations and knows how to calculate the gradient for each one. In this way, it's able to calculate the gradients for a chain of operations with respect to any one tensor. 

Let's reduce the y tensor to a scalar value, the mean.

In [23]:
z = y.mean()
print(z)

# check the gradient for x; it is empty (None) until backward() has been called
print(x.grad)

tensor(1.3251, grad_fn=<MeanBackward1>)
None


To calculate the gradients, we need to run the `.backward()` method on a tensor, `z` for example. This will calculate the gradient of `z` with respect to `x`.

$$
\frac{\partial z}{\partial x} = \frac{\partial}{\partial x}\left[\frac{1}{n}\sum_i^n x_i^2\right] = \frac{2x}{n} = \frac{x}{2} \quad \text{for } n = 4
$$

In [25]:
z.backward()
print(x.grad)
print(x/2)

tensor([[ 0.6206,  0.0056],
        [-0.0429,  0.9685]])
tensor([[ 0.6206,  0.0056],
        [-0.0429,  0.9685]], grad_fn=<DivBackward0>)
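
The gradient of the mean of squares is $2x/n$, which for the four-element tensor here equals $x/2$. As a self-contained check of that general form, the sketch below compares autograd's result with the hand-derived expression on a fresh random tensor:

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
z = (x ** 2).mean()
z.backward()

# Hand-derived gradient of the mean of squares: 2 * x / n
n = x.numel()
expected = 2 * x.detach() / n

print(torch.allclose(x.grad, expected))  # True
```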


These gradient calculations are particularly useful for neural networks. For training we need the gradients of the cost with respect to the weights. With PyTorch, we run data forward through the network to calculate the loss, then go backwards to calculate the gradients of the loss with respect to the weights. Once we have the gradients we can make a gradient descent step.

## Combining Loss and Autograd

When we are creating any network with torch, all of the parameters are initialized with `requires_grad=True`. This means that when we calculate the loss and call `loss.backward()`, the gradients for the parameters are calculated. These gradients are used to update the weights with gradient descent. Below is an example of calculating the gradients using a backwards pass.

```python
from torch import nn

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(), 
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)

log_ps = model(images)
loss = criterion(log_ps, labels)
```

```python
print('Before backward pass: \n', model[0].weight.grad)

loss.backward()

print('After backward pass: \n', model[0].weight.grad)
```
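
The snippets above rely on a `trainloader` defined elsewhere. For a version that runs on its own, here is a sketch with a smaller, arbitrary network and a random batch standing in for real data. It shows that `.grad` is `None` before the backward pass and holds a tensor the same shape as the weights afterwards:

```python
import torch
from torch import nn

# Small stand-in network; sizes are arbitrary
model = nn.Sequential(nn.Linear(784, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()

# Random batch in place of next(iter(trainloader))
images = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

log_ps = model(images)
loss = criterion(log_ps, labels)

grad_before = model[0].weight.grad
print('Before backward pass:', grad_before)  # None

loss.backward()

grad_after = model[0].weight.grad
print('After backward pass:', grad_after.shape)  # torch.Size([64, 784])
```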

## Training the network

After all of this, we need an optimizer to update the weights with the gradients. We get these from PyTorch's `optim` package. For example, stochastic gradient descent is available as `optim.SGD` and Adam as `optim.Adam`. Creating an optimizer is as simple as:

In [None]:
from torch import optim

optimizer = optim.Adam(model.parameters(), lr=0.003)

The general process for a single learning step, before looping through all the data, is:

- Make a forward pass through the network
- Use the network output (logits) to calculate the loss
- Perform a backward pass through the network with `loss.backward()` to calculate the gradients. 
- Take a step with the optimizer to update the weights.
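
The steps listed above can be sketched as a short loop. This is a minimal example under stated assumptions: the network sizes are arbitrary, a single random batch is reused in place of a real dataloader, and the gradients are cleared at the start of every step because PyTorch accumulates them:

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(784, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

# One random batch, reused each step, standing in for a dataloader
images = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

losses = []
for step in range(5):
    optimizer.zero_grad()             # clear gradients left from the previous step
    log_ps = model(images)            # forward pass
    loss = criterion(log_ps, labels)  # calculate the loss
    loss.backward()                   # backward pass: compute gradients
    optimizer.step()                  # update the weights
    losses.append(loss.item())

print(losses)
```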

We need to be careful: gradients are accumulated, so for each learning step we need to zero them first. If we don't zero the gradients on each training pass, we retain the gradients from previous training batches. This can be done with `optimizer.zero_grad()`.
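
The accumulation is easy to see on a single tensor. In this sketch, `w` is a made-up parameter and the "loss" is simply `3 * w` summed, so each backward pass contributes a gradient of 3 per element:

```python
import torch

w = torch.ones(2, requires_grad=True)

(w * 3).sum().backward()
first_grad = w.grad.clone()
print(first_grad)   # tensor([3., 3.])

# A second backward pass adds to the existing gradient instead of replacing it
(w * 3).sum().backward()
second_grad = w.grad.clone()
print(second_grad)  # tensor([6., 6.])

# Zeroing resets the accumulated value
w.grad.zero_()
print(w.grad)       # tensor([0., 0.])
```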


## Inference and Validation

The goal of validation is to measure the model's performance on data that isn't part of the training set. How performance is measured, however, is up to the developer to define.

With the probabilities, we can get the most likely class using the `ps.topk` method. This returns the $k$ highest values. Since we just want the most likely class, we can use `ps.topk(1)`. This returns a tuple of the top-$k$ values and the top-$k$ indices. If the highest value is the fifth element, we'll get back 4 as the index.
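
Here is a small sketch with hand-written probabilities for two samples and five classes. It also computes accuracy as one possible performance measure; the values in `ps` and `labels` are made up for the example:

```python
import torch

# Made-up class probabilities: 2 samples, 5 classes
ps = torch.tensor([[0.1, 0.1, 0.1, 0.1, 0.6],
                   [0.7, 0.1, 0.1, 0.05, 0.05]])
labels = torch.tensor([4, 2])

# Top-1 value and index along the class dimension
top_p, top_class = ps.topk(1, dim=1)
print(top_class)  # indices 4 and 0: the fifth and first elements

# Compare predictions with labels; reshape labels to match top_class
equals = top_class == labels.view(*top_class.shape)
accuracy = equals.float().mean()
print(accuracy.item())  # 0.5, since only the first sample is correct
```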




### Dropout

During training we want to use dropout to prevent overfitting, but during inference we want to use the entire network. So, we need to turn off dropout during validation, testing, and whenever we're using the network to make predictions. To do this, you use `model.eval()`. This sets the model to evaluation mode where the dropout probability is 0. You can turn dropout back on by setting the model to train mode with `model.train()`. In general, the pattern for the validation loop will look like this, where you turn off gradients, set the model to evaluation mode, calculate the validation loss and metric, then set the model back to train mode.

```python
# turn off gradients
with torch.no_grad():
    
    # set model to evaluation mode
    model.eval()
    
    # validation pass here
    for images, labels in testloader:
        ...

# set model back to train mode
model.train()
```
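
To see the effect of the two modes in isolation, the sketch below runs a lone `nn.Dropout` layer on a tensor of ones. The probability `p=0.5` is an arbitrary choice for the example:

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

# Train mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)
print(train_out)

# Eval mode: dropout passes the input through unchanged
dropout.eval()
eval_out = dropout(x)
print(torch.equal(eval_out, x))  # True
```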

### Saving and Loading Models

It's impractical to train a network every time you need to use it. Instead, we can save trained networks then load them later to train more or use them for predictions. 

The parameters for PyTorch networks are stored in a model's `state_dict`. We can see the state dict contains the weight and bias matrices for each of our layers. 

The simplest thing to do is simply save the state dict with `torch.save`. For example, we can **save** it to a file `'checkpoint.pth'`.

```python
>>> torch.save(model.state_dict(), 'checkpoint.pth')
```

Then we can load the state dict with `torch.load`. 

```python
>>> state_dict = torch.load('checkpoint.pth')
>>> print(state_dict.keys())
```

To load the state dict into the network, call:
```python
>>> model.load_state_dict(state_dict)
```
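
Put together, a save and load round trip might look like the sketch below. It uses a single `nn.Linear` layer as a stand-in model and writes to a temporary directory so that nothing is left on disk; the names `model` and `other` are just for the example:

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)
other = nn.Linear(4, 2)  # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'checkpoint.pth')
    torch.save(model.state_dict(), path)
    state_dict = torch.load(path)

print(state_dict.keys())  # the layer's 'weight' and 'bias'

# Copy the saved parameters into the second model
other.load_state_dict(state_dict)
print(torch.equal(model.weight, other.weight))  # True
```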

Seems pretty straightforward, but as usual it's a bit more complicated. Loading the state dict works only if the model architecture is exactly the same as the checkpoint architecture. If I create a model with a different architecture, this fails.

```python
>>> model = fc_model.Network(784, 10, [400, 200, 100])

# This will throw an error because the tensor sizes are wrong!
>>> model.load_state_dict(state_dict)
```
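
The same failure can be reproduced without `fc_model`. In this sketch two `nn.Linear` layers with different input sizes stand in for the mismatched architectures, and the `failed` flag exists only to record the outcome:

```python
import torch
from torch import nn

small = nn.Linear(4, 2)
big = nn.Linear(8, 2)  # different input size, so the weight shapes differ

failed = False
try:
    big.load_state_dict(small.state_dict())
except RuntimeError as err:
    failed = True
    print("could not load state dict:", err)

print(failed)  # True
```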

This means we need to rebuild the model exactly as it was when trained. Information about the model architecture needs to be saved in the checkpoint, along with the state dict. To do this, you build a dictionary with all the information you need to completely rebuild the model.

```python
>>> checkpoint = {'input_size': 784,
...               'output_size': 10,
...               'hidden_layers': [each.out_features for each in model.hidden_layers],
...               'state_dict': model.state_dict()}

>>> torch.save(checkpoint, 'checkpoint.pth')
```

Now the checkpoint has all the necessary information to rebuild the trained model. You can easily make that a function if you want. Similarly, we can write a function to load checkpoints.

```python
>>> def load_checkpoint(filepath):
...     checkpoint = torch.load(filepath)
...     model = fc_model.Network(checkpoint['input_size'],
...                              checkpoint['output_size'],
...                              checkpoint['hidden_layers'])
...     model.load_state_dict(checkpoint['state_dict'])
...     return model
...
>>> model = load_checkpoint('checkpoint.pth')
>>> print(model)
```
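
The snippets above depend on the `fc_model` module, which is not shown in these notes. For a version that runs on its own, the sketch below swaps in a hypothetical `build_model` helper that builds a plain fully connected network, plus a hypothetical `save_checkpoint` function mirroring `load_checkpoint`. Both are assumptions made for illustration, not part of PyTorch or `fc_model`:

```python
import os
import tempfile

import torch
from torch import nn


def build_model(input_size, output_size, hidden_layers):
    """Hypothetical stand-in for fc_model.Network: Linear + ReLU stacks."""
    sizes = [input_size] + list(hidden_layers)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.ReLU()]
    layers.append(nn.Linear(sizes[-1], output_size))
    return nn.Sequential(*layers)


def save_checkpoint(model, filepath, input_size, output_size, hidden_layers):
    """Store the architecture description alongside the state dict."""
    checkpoint = {'input_size': input_size,
                  'output_size': output_size,
                  'hidden_layers': list(hidden_layers),
                  'state_dict': model.state_dict()}
    torch.save(checkpoint, filepath)


def load_checkpoint(filepath):
    """Rebuild the model from the saved architecture, then load its weights."""
    checkpoint = torch.load(filepath)
    model = build_model(checkpoint['input_size'],
                        checkpoint['output_size'],
                        checkpoint['hidden_layers'])
    model.load_state_dict(checkpoint['state_dict'])
    return model


model = build_model(784, 10, [128, 64])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'checkpoint.pth')
    save_checkpoint(model, path, 784, 10, [128, 64])
    restored = load_checkpoint(path)

# The restored model should give the same outputs as the original
x = torch.randn(2, 784)
print(torch.allclose(model(x), restored(x)))  # True
```

Keeping the architecture description as plain Python values (integers and a list) means the checkpoint file only needs the helper function, not the original model object, to be rebuilt.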