# Learning to use PyTorch

In [1]:
import torch
import numpy as np

## Tensor initialization
Tensors can be initialized in various ways. Take a look at the following examples:

### Directly from data

Tensors can be created directly from data. The data type is automatically inferred.

In [4]:
data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)
x_data

tensor([[1, 2],
        [3, 4]])
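
As a quick check of the inference rule, the sketch below builds tensors from integer and floating-point data and prints their `dtype`. The variable names are made up for illustration, and the float result assumes the default dtype has not been changed with `torch.set_default_dtype`.

```python
import torch

# Integer data is inferred as a 64-bit integer tensor.
int_tensor = torch.tensor([[1, 2], [3, 4]])
print(int_tensor.dtype)    # torch.int64

# Floating-point data is inferred as the default float type (float32 by default).
float_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(float_tensor.dtype)  # torch.float32

# Passing dtype explicitly overrides the inference.
half_tensor = torch.tensor([[1, 2], [3, 4]], dtype=torch.float16)
print(half_tensor.dtype)   # torch.float16
```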

### From a NumPy array

Tensors can be created from NumPy arrays (and vice versa).

In [7]:
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
x_np

tensor([[1, 2],
        [3, 4]])
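
The "vice versa" direction can be sketched with `Tensor.numpy()`. This is a minimal example with illustrative names (`t_cpu`, `n_view`); it assumes a CPU tensor, since the resulting array shares the tensor's memory.

```python
import numpy as np
import torch

t_cpu = torch.ones(3)
n_view = t_cpu.numpy()   # NumPy array backed by the same memory as the CPU tensor

t_cpu.add_(1)            # an in-place change on the tensor ...
print(n_view)            # ... is visible through the NumPy array: [2. 2. 2.]
```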

### From another tensor:

The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden. 

In [9]:
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f'ones tensor: \n {x_ones} \n')

x_rand = torch.rand_like(x_data, dtype=torch.float)
print(f'random tensor: \n {x_rand} \n')

ones tensor: 
 tensor([[1, 1],
        [1, 1]]) 

random tensor: 
 tensor([[0.8915, 0.5247],
        [0.7910, 0.8178]]) 



### With random or constant values:

`shape` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor. 

In [11]:
shape = (2, 3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f'random tensor: \n {rand_tensor} \n')
print(f'ones tensor: \n {ones_tensor} \n')
print(f'zeros tensor: \n {zeros_tensor} \n')


random tensor: 
 tensor([[0.6257, 0.1930, 0.2026],
        [0.3547, 0.0330, 0.8057]]) 

ones tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

zeros tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]]) 



## Tensor attributes

Tensor attributes describe their shape, datatype, and the device on which they're stored.

In [12]:
tensor = torch.rand(3, 4)

print(f'shape of tensor: {tensor.shape}')
print(f'datatype of tensor: {tensor.dtype}')
print(f'device tensor is stored on: {tensor.device}')

shape of tensor: torch.Size([3, 4])
datatype of tensor: torch.float32
device tensor is stored on: cpu


## Tensor operations

Over 100 tensor operations, including transposing, indexing, slicing, mathematical operations, linear algebra, random sampling, and more are comprehensively described [here](https://pytorch.org/docs/stable/torch.html).

Each of them can be run on the GPU (at typically higher speeds than the CPU). 

In [13]:
if torch.cuda.is_available():
    tensor = tensor.to('cuda')
    print(f"device tensor is stored on: {tensor.device}")
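
A common way to write code that runs with or without a GPU is to pick the device once and reuse it. The sketch below is one possible pattern, not the only one; the names `device`, `x_dev`, and `y_dev` are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

x_dev = torch.rand(3, 4, device=device)  # allocate directly on the chosen device
y_dev = torch.ones(3, 4).to(device)      # or move an existing tensor there

result = x_dev + y_dev                   # both operands live on the same device
print(result.device)
```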

### Joining tensors

You can use `torch.cat` to concatenate a sequence of tensors along a given dimension. See also [torch.stack](https://pytorch.org/docs/stable/generated/torch.stack.html), another tensor joining op that is subtly different from `torch.cat`.

In [14]:
t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)

tensor([[0.1001, 0.1994, 0.4170, 0.4572, 0.1001, 0.1994, 0.4170, 0.4572, 0.1001,
         0.1994, 0.4170, 0.4572],
        [0.5947, 0.0360, 0.5177, 0.8934, 0.5947, 0.0360, 0.5177, 0.8934, 0.5947,
         0.0360, 0.5177, 0.8934],
        [0.6096, 0.3497, 0.1454, 0.8861, 0.6096, 0.3497, 0.1454, 0.8861, 0.6096,
         0.3497, 0.1454, 0.8861]])


In [15]:
t1.shape

torch.Size([3, 12])
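
To make the difference between the two joining ops concrete, here is a small sketch with a hypothetical `(2, 4)` tensor: `torch.cat` extends an existing dimension, while `torch.stack` inserts a new one.

```python
import torch

t_small = torch.zeros(2, 4)

# cat joins along an existing dimension: three (2, 4) tensors -> (6, 4)
cat_rows = torch.cat([t_small, t_small, t_small], dim=0)
print(cat_rows.shape)

# stack creates a new dimension: three (2, 4) tensors -> (3, 2, 4)
stacked = torch.stack([t_small, t_small, t_small], dim=0)
print(stacked.shape)
```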

### Multiplying tensors

In [16]:
# This computes the element-wise product
print(f'tensor.mul(tensor) \n {tensor.mul(tensor)} \n')
# alternative syntax
print(f'tensor * tensor \n {tensor * tensor}')

tensor.mul(tensor) 
 tensor([[0.0100, 0.0398, 0.1739, 0.2090],
        [0.3537, 0.0013, 0.2681, 0.7981],
        [0.3716, 0.1223, 0.0211, 0.7851]]) 

tensor * tensor 
 tensor([[0.0100, 0.0398, 0.1739, 0.2090],
        [0.3537, 0.0013, 0.2681, 0.7981],
        [0.3716, 0.1223, 0.0211, 0.7851]])


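The following computes the matrix product of two tensors. Because `tensor` has shape (3, 4), it is multiplied by its transpose `tensor.T` of shape (4, 3), which gives a (3, 3) result.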

In [17]:
print(f'tensor.matmul(tensor.T) \n {tensor.matmul(tensor.T)} \n')
# Alternative syntax:
print(f'tensor @ tensor.T \n {tensor @ tensor.T}')

tensor.matmul(tensor.T) 
 tensor([[0.4327, 0.6911, 0.5965],
        [0.6911, 1.4212, 1.2420],
        [0.5965, 1.2420, 1.3001]]) 

tensor @ tensor.T 
 tensor([[0.4327, 0.6911, 0.5965],
        [0.6911, 1.4212, 1.2420],
        [0.5965, 1.2420, 1.3001]])


### In-place operations
Operations that have a `_` suffix are in-place. For example, `x.copy_(y)` and `x.t_()` both change `x`.

In [18]:
print(tensor, '\n')
tensor.add_(5)
print(tensor)

tensor([[0.1001, 0.1994, 0.4170, 0.4572],
        [0.5947, 0.0360, 0.5177, 0.8934],
        [0.6096, 0.3497, 0.1454, 0.8861]]) 

tensor([[5.1001, 5.1994, 5.4170, 5.4572],
        [5.5947, 5.0360, 5.5177, 5.8934],
        [5.6096, 5.3497, 5.1454, 5.8861]])


## NumPy array to Tensor

In [19]:
n = np.ones(5)
t = torch.from_numpy(n)

Changes in the NumPy array are reflected in the tensor.

In [20]:
np.add(n, 1, out=n)
print(f't: {t}')
print(f'n: {n}') 

t: tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
n: [2. 2. 2. 2. 2.]
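
The sharing behaviour is specific to `torch.from_numpy`. As a contrast, the sketch below (with illustrative names) also builds a tensor with `torch.tensor`, which copies the data, and then modifies the source array.

```python
import numpy as np
import torch

n_src = np.ones(5)
t_shared = torch.from_numpy(n_src)  # shares memory with n_src
t_copy = torch.tensor(n_src)        # copies the data

np.add(n_src, 1, out=n_src)         # modify the NumPy array in place

print(t_shared)  # follows the array: all values are 2
print(t_copy)    # unchanged: all values are still 1
```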


## A gentle introduction to `torch.autograd`

`torch.autograd` is PyTorch's automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train. 

### Background

Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by __parameters__ (consisting of weights and biases), which in PyTorch are stored in tensors.

Training a NN happens in two steps:

**Forward Propagation:** In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess. 

**Backward Propagation:** In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients) and optimizing the parameters using gradient descent. 
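
The two steps can be sketched on a toy "network" with a single weight. Everything here (the values, the squared-error loss, and the learning rate `lr`) is an assumption chosen to keep the arithmetic easy to follow.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)          # the only parameter
x_in, y_true = torch.tensor(2.0), torch.tensor(6.0)
lr = 0.1                                           # hypothetical learning rate

# Forward propagation: the model's guess and its error.
y_pred = w * x_in                                  # 1.0 * 2.0 = 2.0
error = (y_pred - y_true) ** 2                     # (2 - 6)^2 = 16

# Backward propagation: d(error)/dw = 2 * (y_pred - y_true) * x_in = -16
error.backward()

# Gradient descent: move the parameter against the gradient.
with torch.no_grad():
    w -= lr * w.grad                               # 1.0 - 0.1 * (-16) = 2.6

print(w.grad, w)
```

After this single step the prediction `w * x_in` is 5.2, which is closer to the target of 6 than the initial guess of 2.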

### Usage in PyTorch

Let's take a look at a single training step. For this example, we load a pretrained ResNet18 model from torchvision. We create a random data tensor to represent a single image with 3 channels and a height and width of 64, along with a corresponding `labels` tensor initialized to random values. The labels for this pretrained model have shape (1, 1000).

In [21]:
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /Users/jarl/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100.0%


In [22]:
prediction = model(data) # forward pass

We use the model's prediction and the corresponding label to calculate the error (loss). The next step is to backpropagate this error through the network. Backpropagation is kicked off when we call `.backward()` on the error tensor. Autograd then calculates and stores the gradients for each model parameter in the parameter's `.grad` attribute.

In [23]:
loss = (prediction - labels).sum()
loss.backward()  # backward pass

In [26]:
print(prediction.shape)
print(labels.shape)
print(loss)

torch.Size([1, 1000])
torch.Size([1, 1000])
tensor(-487.3716, grad_fn=<SumBackward0>)


Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. We register all the parameters of the model in the optimizer.

In [28]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call `.step()` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in `.grad`.

In [29]:
optim.step()  # gradient descent
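
To see what `.step()` does without downloading a pretrained model, here is a minimal sketch on a small `nn.Linear` layer used as a stand-in (the names `tiny_model` and `sgd` are illustrative). On the very first step the momentum buffer equals the gradient, so the update reduces to `param - lr * grad`.

```python
import torch

torch.manual_seed(0)
tiny_model = torch.nn.Linear(4, 2)   # stand-in for the ResNet18 model
sgd = torch.optim.SGD(tiny_model.parameters(), lr=1e-2, momentum=0.9)

sgd.zero_grad()                      # start from clean gradient buffers
toy_loss = tiny_model(torch.rand(1, 4)).sum()
toy_loss.backward()                  # fills .grad for weight and bias

before = tiny_model.weight.detach().clone()
grad = tiny_model.weight.grad.clone()
sgd.step()                           # applies the update in place
after = tiny_model.weight.detach()

# First step with momentum: new weight = old weight - lr * grad
print(torch.allclose(after, before - 1e-2 * grad))
```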

### Differentiation in Autograd

Let's take a look at how `autograd` collects gradients. We create two tensors `a` and `b` with `requires_grad=True`. This signals to `autograd` that every operation on them should be tracked. 

In [35]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor `Q`  from `a` and `b`.

$ Q = 3a^3 - b^2 $

In [36]:
Q = 3*a**3 - b**2

Let's assume `a` and `b` to be parameters of an NN, and `Q` to be the error. In NN training we want gradients of the error w.r.t. parameters, i.e.:

$ \frac{\partial Q}{\partial a} = 9a^2 $, $ \frac{\partial Q}{\partial b} = -2b $

When we call `.backward()` on `Q`, autograd calculates these gradients and stores them in the respective tensors' `.grad` attribute.

We need to explicitly pass a `gradient` argument in `Q.backward()` because `Q` is a vector. `gradient` is a tensor of the same shape as `Q`, and it represents the gradient of `Q` with respect to itself, i.e.

$ \frac{dQ}{dQ} = 1 $

Equivalently, we can aggregate `Q` into a scalar and call backward implicitly, like `Q.sum().backward()`.

In [37]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in `a.grad` and `b.grad`.

In [38]:
# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
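
The scalar route mentioned above can be sketched as follows. Fresh tensors (`a2`, `b2`, `Q2`, names chosen for this example) are used so the gradients do not accumulate on top of the ones already stored in `a.grad` and `b.grad`.

```python
import torch

a2 = torch.tensor([2., 3.], requires_grad=True)
b2 = torch.tensor([6., 4.], requires_grad=True)
Q2 = 3 * a2**3 - b2**2

# Summing to a scalar lets backward() run without an explicit gradient argument.
Q2.sum().backward()

print(a2.grad)  # 9 * a^2 -> tensor([36., 81.])
print(b2.grad)  # -2 * b  -> tensor([-12., -8.])
```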


In [50]:
x = torch.ones(1, requires_grad=True)
y = x + 2
z = y * y * 2

z.backward()
print(x.grad)

tensor([12.])
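
The value printed by this last cell can be checked by hand with the chain rule. With $ y = x + 2 $ and $ z = 2y^2 $:

$ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 4y \cdot 1 = 4(x + 2) $

At $ x = 1 $ this gives $ 4 \cdot 3 = 12 $, which matches `x.grad`.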


## Neural networks

Neural networks can be constructed using the `torch.nn` package.

`nn` depends on `autograd` to define models and differentiate them. An `nn.Module` contains layers, and a method `forward(input)` that returns the `output`.

A typical training procedure for a neural network is as follows:

* Define the neural network that has some learnable parameters (or weights)
* Iterate over a dataset of inputs
* Process input through the network
* Compute the loss (how far the output is from being correct)
* Propagate gradients back into the network's parameters
* Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`


### Define the network

Let's define the network: 

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self) -> None:
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5x5 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify with a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1) # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
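
The `16 * 5 * 5` input size of `fc1` follows from how each layer shrinks a 32x32 image. The sketch below recreates only the two convolution layers (with freshly initialized weights, so it is independent of `net`) and traces the shapes; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

x_trace = torch.randn(1, 1, 32, 32)

# 5x5 convolution without padding: 32 -> 28, then 2x2 max pooling: 28 -> 14
x_trace = F.max_pool2d(F.relu(conv1(x_trace)), 2)
shape_after_block1 = tuple(x_trace.shape)   # (1, 6, 14, 14)

# 5x5 convolution: 14 -> 10, then 2x2 max pooling: 10 -> 5
x_trace = F.max_pool2d(F.relu(conv2(x_trace)), 2)
shape_after_block2 = tuple(x_trace.shape)   # (1, 16, 5, 5)

# Flattening everything except the batch dimension gives 16 * 5 * 5 = 400 features.
flat = torch.flatten(x_trace, 1)
print(shape_after_block1, shape_after_block2, tuple(flat.shape))
```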


You just have to define the `forward` function, and the `backward` function (where gradients are computed) is automatically defined for you using `autograd`. You can use any of the Tensor operations in the `forward` function. 

The learnable parameters of a model are returned by `net.parameters()`.

In [20]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])
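
The count of 10 comes from five layers with one weight and one bias tensor each. As a sketch, the same five layers are collected below in an `nn.ModuleDict` (a stand-in used here so the example is self-contained, not how `Net` is defined) and listed with `named_parameters()`.

```python
import torch.nn as nn

# Same layer shapes as in Net, grouped only to inspect their parameters.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer contributes a weight tensor and a bias tensor.
for name, p in layers.named_parameters():
    print(name, tuple(p.shape))

param_tensors = len(list(layers.parameters()))
total_values = sum(p.numel() for p in layers.parameters())
print(param_tensors, total_values)  # 10 tensors, 61706 learnable values
```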


Let's try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.

In [21]:
input = torch.randn(1, 1, 32, 32)  # batch, channels, height, width
out = net(input)
print(out)

tensor([[-0.0496, -0.0892, -0.0588,  0.0245, -0.0155, -0.0540,  0.0322,  0.0067,
          0.0110, -0.1615]], grad_fn=<AddmmBackward0>)


Zero the gradient buffers of all parameters and backprop with random gradients:

In [22]:
net.zero_grad()
out.backward(torch.randn(1, 10))

> ### Note
>
> `torch.nn` only supports mini-batches. The entire `torch.nn` package only supports inputs that are a mini-batch of samples and not a single sample.
>
> For example, `nn.Conv2d` will take in a 4D Tensor of `nSamples x nChannels x Height x Width`.
>
> If you have a single sample, just use `input.unsqueeze(0)` to add a fake batch dimension.
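
As a minimal sketch of the note above, a single hypothetical 1-channel 32x32 image gains a batch dimension of size 1 via `unsqueeze(0)`:

```python
import torch

single_image = torch.randn(1, 32, 32)   # channels x height x width, no batch dimension
batch = single_image.unsqueeze(0)       # add a batch dimension at position 0

print(single_image.shape)  # torch.Size([1, 32, 32])
print(batch.shape)         # torch.Size([1, 1, 32, 32])
```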

**Recap**

* `torch.Tensor` - A _multi-dimensional array_ with support for autograd operations like `backward()`. Also holds the _gradient_ wrt the tensor. 
* `nn.Module` - Neural network module. _Convenient way of encapsulating parameters_, with helpers for moving them to GPU, exporting, loading, etc.
* `nn.Parameter` - A kind of Tensor, that is _automatically registered_ as a parameter when assigned as an attribute to a `Module`.
* `autograd.Function` - Implements _forward and backward definitions of an autograd operation_. Every `Tensor` operation creates at least a single `Function` node that connects to functions that created a `Tensor` and _encodes its history_.

### Loss function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target. 

There are several different [loss functions](https://pytorch.org/docs/nn.html#loss-functions) under the nn package. A simple loss is: `nn.MSELoss` which computes the mean-squared error between the output and the target. 

For example:

In [23]:
output = net(input)
target = torch.randn(10)
target = target.view(1, -1)
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(1.5909, grad_fn=<MseLossBackward0>)
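
With its default settings, `nn.MSELoss` averages the squared differences over all elements. The sketch below compares it with a hand-written version on random stand-in tensors (`out_demo` and `target_demo` are illustrative, not the network's actual output).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
out_demo = torch.randn(1, 10)
target_demo = torch.randn(1, 10)

builtin = nn.MSELoss()(out_demo, target_demo)

# Mean of the element-wise squared differences.
manual = ((out_demo - target_demo) ** 2).mean()

print(torch.allclose(builtin, manual))
```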


Now, if you follow `loss` in the backward direction, using its `.grad_fn` attribute, you will see a graph of computations that looks like this:

```
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
```

So, when we call `loss.backward()`, the whole graph is differentiated wrt the neural net parameters, and all Tensors in the graph that have `requires_grad=True` will have their `.grad` Tensor accumulated with the gradient. 

For illustration, let us follow a few steps backward:

In [24]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of the Linear node: an AccumulateGrad leaf, not the ReLU

<MseLossBackward0 object at 0x11acfcdc0>
<AddmmBackward0 object at 0x11acff6a0>
<AccumulateGrad object at 0x11acfcdc0>


### Backprop

To backpropagate the error, all we have to do is call `loss.backward()`. You need to clear the existing gradients first, though; otherwise the new gradients are added to the ones already stored.

Now we shall call `loss.backward()`, and have a look at conv1's bias gradients before and after the backward.

In [25]:
net.zero_grad()

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)
print('\n')

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])


conv1.bias.grad after backward
tensor([ 0.0131,  0.0234, -0.0464, -0.0088, -0.0068, -0.0043])
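
The accumulation behaviour is easy to see on a tiny example. In this sketch (with an illustrative tensor `w_acc`), calling `backward()` twice without clearing doubles the stored gradient.

```python
import torch

w_acc = torch.tensor([1.0, 2.0], requires_grad=True)

(w_acc ** 2).sum().backward()
first = w_acc.grad.clone()      # d/dw sum(w^2) = 2w -> [2., 4.]

(w_acc ** 2).sum().backward()   # second pass adds to the existing .grad
second = w_acc.grad.clone()     # [4., 8.]

w_acc.grad.zero_()              # clear the buffer in place before the next pass
print(first, second, w_acc.grad)
```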


### Update the weights

The simplest update rule used in practice is the Stochastic Gradient Descent (SGD):

`weight = weight - learning_rate * gradient`

We can implement this using simple Python code:

In [26]:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
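
An alternative way to write the same manual update, without going through `.data`, is to wrap it in `torch.no_grad()`. The sketch below uses a small stand-in layer (`demo_layer`, a hypothetical name) so it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_layer = nn.Linear(3, 1)    # stand-in for net

# Produce some gradients to update with.
demo_layer(torch.randn(4, 3)).pow(2).mean().backward()

learning_rate = 0.01
old_weight = demo_layer.weight.detach().clone()
weight_grad = demo_layer.weight.grad.clone()

# no_grad() keeps the update itself out of the autograd graph.
with torch.no_grad():
    for p in demo_layer.parameters():
        p -= learning_rate * p.grad

print(torch.allclose(demo_layer.weight, old_weight - learning_rate * weight_grad))
```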

However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package: `torch.optim` that implements all these methods. Using it is very simple:

In [27]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()  # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()  # does the update
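
Repeating those four calls gives a complete training loop. The following is a minimal sketch on a toy linear model with random data (all names and sizes are assumptions made for this example), recording the loss at each step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy_net = nn.Linear(3, 1)                 # toy model standing in for net
toy_inputs = torch.randn(16, 3)           # 16 random samples
toy_targets = torch.randn(16, 1)
toy_criterion = nn.MSELoss()
toy_optimizer = optim.SGD(toy_net.parameters(), lr=0.01)

losses = []
for step in range(20):
    toy_optimizer.zero_grad()             # clear gradients from the previous step
    toy_loss = toy_criterion(toy_net(toy_inputs), toy_targets)
    toy_loss.backward()                   # compute fresh gradients
    toy_optimizer.step()                  # update the parameters
    losses.append(toy_loss.item())

print(losses[0], losses[-1])              # the loss shrinks over the 20 steps
```

Swapping in another update rule only changes the line that constructs the optimizer, for example `optim.Adam(toy_net.parameters(), lr=0.01)`.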