# 🔥 Intro to PyTorch 🔥

Table of Contents:

0. Introduction
1. Torch Tensors
    1. Basics
    2. Using the GPU
    3. Math with Torch
2. Autograd
3. Neural Networks with Torch


## 0. Introduction

In this tutorial, we'll walk you through the basics of using `torch`.

Note: we'll use PyTorch and Torch interchangeably.

In many ways, you can think of it as a souped-up `numpy` with a few key differences:
1. You can use `torch` with a GPU for super fast matrix calculations.
2. Torch provides _automatic differentiation_.
3. Torch provides some structure for building neural networks.

This notebook is adapted from [this tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), which you can consult as a more in-depth guide.

In [None]:
# first, let's import torch and some other libraries
import torch
import numpy as np

## 1. Torch Tensors: `np.array`'s cool older sibling

Machine learning requires a lot of linear algebra. So all of these libraries (`torch`, `numpy`, `tensorflow`, etc.) are, at their most basic level, tools to help you (and the computer) do linear algebra operations quickly.

It makes sense, then, that the basic data structure is the `torch.Tensor`, which is essentially a GPU-enabled `np.array`.

<!-- Tensors are always associated with a memory location. This has always been true, of course, but generally Python abstracts that all away for us. It's worth being explicit now, because you will have different memory locations when moving tensors between the GPU and CPU. -->

### Basics

In [None]:
# uninitialized 5x3 matrix
# values are undefined when matrix is uninitialized (note that this is not the same as matrix of zeros)

torch.empty(5, 3)

tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 1.1704e-41],
        [0.0000e+00, 2.2369e+08, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [       nan,        nan, 1.2783e+33]])

In [None]:
# randomly initialized (uniform on [0,1))

torch.rand(5, 3)

tensor([[0.1788, 0.2140, 0.3977],
        [0.3545, 0.8557, 0.3537],
        [0.1531, 0.4902, 0.5251],
        [0.9243, 0.0865, 0.2306],
        [0.5489, 0.1476, 0.3132]])

In [None]:
# finally, we can initialize straight from data, just like numpy

x = torch.tensor([[5.5, 3], [4, 1]])
x

tensor([[5.5000, 3.0000],
        [4.0000, 1.0000]])

Just like numpy arrays, tensors are multidimensional and homogeneously typed.

In [None]:
# torch.Size is a tuple
x.size()

torch.Size([2, 2])

In [None]:
x.dtype

torch.float32
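To make "homogeneously typed" concrete, here is a small sketch of how `torch.tensor` picks one dtype for the whole tensor and how to cast explicitly. The variable names are illustrative, and the sketch assumes the default float dtype has not been changed from `float32`.

```python
import torch

# all-integer data is stored as 64-bit integers
ints = torch.tensor([[1, 2], [3, 4]])

# a single float promotes every element to the default float dtype
mixed = torch.tensor([[5.5, 3], [4, 1]])

# .to() returns a new tensor with the requested dtype
doubles = mixed.to(torch.float64)

print(ints.dtype, mixed.dtype, doubles.dtype)
```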

In fact, if you're on the CPU, you can very easily convert between tensors and NumPy arrays. The two will share the same underlying memory, so changing one changes the other.

In [None]:
# wow!
a = x.numpy()
a

array([[5.5, 3. ],
       [4. , 1. ]], dtype=float32)

In [None]:
# add in place (+= 1)
x.add_(1)

tensor([[6.5000, 4.0000],
        [5.0000, 2.0000]])

In [None]:
a

array([[6.5, 4. ],
       [5. , 2. ]], dtype=float32)
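The sharing also works in the other direction, and it can be avoided when you want an independent copy. A minimal sketch, with hypothetical variable names, contrasting `torch.from_numpy` (shares memory) with `torch.tensor` (copies):

```python
import numpy as np
import torch

arr = np.ones(3, dtype=np.float32)

shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # torch.tensor always copies the data

arr += 1                         # in-place change on the NumPy side

print(shared)  # reflects the change made through arr
print(copied)  # unaffected by the change
```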

### Using the GPU

In [None]:
# We will use ``torch.device`` objects to move tensors in and out of GPU
# The if statement checks if you have a GPU available
# This only works for NVIDIA graphics cards
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!
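A common way to write code that runs with or without a GPU is to pick the device once and pass it around. The sketch below is one such pattern, not the only one; the tensor values are just the ones used earlier in this section.

```python
import torch

# pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.tensor([[5.5, 3.0], [4.0, 1.0]], device=device)
y = torch.ones_like(x)           # ones_like inherits x's device and dtype
z = x + y                        # both operands must live on the same device

# move back to the CPU before converting to NumPy
z_np = z.cpu().numpy()
print(z.device, z_np)
```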

### Math with Torch

Like we alluded to earlier, we can do math with tensors.

There are many, many ways to do the same thing in Torch. To see (almost) all of them, see [here](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#operations). I'm just going to list the basics, but I recommend checking them all out so you know what to expect when reading other people's code.

As just one example, here's how we can do addition, but you can check the [docs](https://pytorch.org/docs/stable/torch.html#math-operations) for all the math operations available to you.

In [None]:
y = torch.rand(2, 2)

In [None]:
# it really is that easy
x + y

tensor([[7.3992, 4.7970],
        [5.3645, 2.8396]])

In [None]:
torch.add(x, y)

tensor([[7.3992, 4.7970],
        [5.3645, 2.8396]])
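Beyond addition, a few other operations come up constantly. The following sketch uses small hand-picked matrices (not the random `y` above) so the results are easy to check by hand.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[10., 20.], [30., 40.]])

elementwise = a * b        # element-by-element product
matmul = a @ b             # matrix product, same as torch.matmul(a, b)
total = a.sum()            # reduce to a 0-dimensional tensor

# methods ending in an underscore modify the tensor in place
a.add_(b)

print(elementwise, matmul, total, a, sep="\n")
```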

## 2. Autograd: calc III in a python module

Every tensor created with `requires_grad=True` tracks the operations performed on it, so Torch can automatically compute the gradient of a result with respect to that tensor!

Consider this familiar looking set of equations:

$$
z = \bar{\theta} \cdot \bar{x} + b
$$

$$
L = \max({0, 1 - yz})
$$

where $y$ is the target output and $z$ is the output of the linear classifier.

We want to minimize loss with respect to $\theta$.

In [None]:
# let's define some values for our vectors.
# we use requires_grad to tell Torch to track operations on these tensors
# we initialize b to a Torch scalar, and also track operations on it

theta = torch.tensor([1., 2.], requires_grad=True)
x = torch.tensor([3., 1.], requires_grad=True)
b = torch.tensor(3., requires_grad=True)

z = torch.dot(theta, x) + b

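Now let's introduce our loss function, hinge loss.

In [None]:
y = -1  # this is our target value
# hinge loss is max(0, 1 - y * z); here 1 - y * z = 9 > 0, so the max is a no-op and we leave it out
loss = 1 - y * z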

In [None]:
loss

tensor(9., grad_fn=<RsubBackward1>)

Now, we propagate backwards from the loss to get $\nabla L$, the partial derivatives of the loss with respect to each of the variables. This stores the result in the `.grad` attribute of every tensor with `requires_grad=True` that went into the calculation of the loss.

In [None]:
loss.backward()

Finally, to find $\nabla_{\bar{\theta}} L$, we access the `.grad` attribute of our `theta` tensor.

In [None]:
theta.grad

tensor([3., 1.])
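We can sanity-check this result by hand. Since $L = 1 - y(\bar{\theta} \cdot \bar{x} + b)$ in the active region of the hinge, the gradients are $\nabla_{\bar{\theta}} L = -y\bar{x}$, $\nabla_{\bar{x}} L = -y\bar{\theta}$ and $\partial L / \partial b = -y$. The sketch below repeats the computation and prints each autograd result next to its analytic counterpart.

```python
import torch

theta = torch.tensor([1., 2.], requires_grad=True)
x = torch.tensor([3., 1.], requires_grad=True)
b = torch.tensor(3., requires_grad=True)
y = -1

z = torch.dot(theta, x) + b
loss = 1 - y * z
loss.backward()

# analytic gradients of L = 1 - y * (theta . x + b)
print(theta.grad, -y * x.detach())      # dL/dtheta = -y * x
print(x.grad, -y * theta.detach())      # dL/dx     = -y * theta
print(b.grad)                           # dL/db     = -y
```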

Incredible!

<img src="https://media.giphy.com/media/PUBxelwT57jsQ/source.gif" />

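Important note: gradients _accumulate_, so you probably want to zero out all of the gradients before each new backward pass. For example, if we ran the computations again (starting from `z = torch.dot(...`) and called `loss.backward()` a second time, `theta.grad` would hold the sum of both gradients, `tensor([6., 2.])`.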
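A minimal sketch of the accumulation behaviour, using the same values as above. The helper `hinge_term` is a hypothetical name introduced only for this illustration.

```python
import torch

theta = torch.tensor([1., 2.], requires_grad=True)
x = torch.tensor([3., 1.])
b = torch.tensor(3.)
y = -1

def hinge_term(theta):
    # same computation as before: 1 - y * (theta . x + b)
    return 1 - y * (torch.dot(theta, x) + b)

hinge_term(theta).backward()
first = theta.grad.clone()       # gradient after one backward pass

hinge_term(theta).backward()     # second pass adds onto the stored gradient
second = theta.grad.clone()

theta.grad.zero_()               # reset before the next backward pass
print(first, second, theta.grad)
```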

You can read more about autograd [here](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py) (the tutorial) and a lot more [here](https://pytorch.org/docs/stable/autograd.html) (the documentation) and a lot, lot more [here](https://openreview.net/pdf/25b8eee6c373d48b84e5e9c6e10e7cbbbce4ac73.pdf) (the paper).

In [None]:
theta.grad.zero_()  # reset the accumulated gradient in place

tensor([0., 0.])

## 3. Neural Networks with Torch

Now, let's write a neural network together in PyTorch. It's easy and fun!

We'll start with some concepts:

Torch provides some nice helper functions and code to help us train our neural nets. This ranges from code to load data (`torch.utils.data`) to a variety of activation functions and hidden layers (`torch.nn`). This also means for our code to play nicely with the rest of Torch, we want to follow the same structure as the rest of the Torch codebase.

Enter `torch.nn.Module` (we will abbreviate to `nn.Module`). You can think of an `nn.Module` as a single "unit" of your architecture. It's anything with layers, an input, and a `forward()` function which computes the output. For simple models, it might be your entire model. For more complex models, it might be a logical, reusable component (e.g. a large language model that you then fine-tune).

`torch.nn` uses autograd to define and differentiate models.
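To see the link between `torch.nn` and autograd, here is a small sketch with a single `nn.Linear` layer. The layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)          # weight has shape [2, 3], bias has shape [2]

# parameters are ordinary tensors with requires_grad=True
for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

out = layer(torch.ones(1, 3))    # calling the module runs forward()
out.sum().backward()             # autograd fills in .grad for every parameter

print(layer.weight.grad)         # each row equals the input, i.e. all ones
print(layer.bias.grad)           # all ones
```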

### General Outline

In most cases, these are the steps you take to train a neural network:

0. Define the neural network architecture with some learnable parameters (or weights)
1. Iterate over a dataset of inputs
2. Process input through the network
3. Compute the loss (how far is the output from being correct)
4. Propagate gradients back into the network’s parameters
5. Update the weights of the network

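We will work with an example dataset: classifying 10 characters of Kuzushiji, a cursive Japanese writing style. You can think of this as MNIST, but cooler. Note that the download cell below, as written, loads the standard `torchvision.datasets.MNIST` digits; `torchvision.datasets.KMNIST` takes the same arguments if you want the Kuzushiji characters shown in the image.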

![Image of KMNIST dataset](https://github.com/rois-codh/kmnist/raw/master/images/kmnist_examples.png)

In [None]:
# first, we download and load the dataset
# if you're running for the first time, it might take a minute to download
import torchvision
root = "./mnist/"
dataset = torchvision.datasets.MNIST(root, train=True, transform=None, target_transform=None, download=True)

In [None]:
dataset

Dataset MNIST
    Number of datapoints: 60000
    Root location: ./mnist/
    Split: Train

Let's see what a single example from the dataset looks like: a 28x28 grayscale image. We will flatten it into a single 784-element vector for our toy model.

In [None]:
X, y = dataset[0]
np.array(X).shape

(28, 28)

Great, let's start making our classifier!

### Define the neural network

In [None]:
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# A module is just something with:

class CoolClassifier(nn.Module):
    def __init__(self):
        super(CoolClassifier, self).__init__()
        # 1. layers

    # 2. a function that takes an input
    def forward(self, x):
        # 3. and that returns an output
        return x

In [None]:
class CoolClassifier(nn.Module):
    def __init__(self):
        super(CoolClassifier, self).__init__()
        self.hidden1 = nn.Linear(784, 200)
        self.hidden2 = nn.Linear(200, 10)

    def forward(self, x):
        x = F.relu(self.hidden1(x))
        x = self.hidden2(x)
        return x

In [None]:
net = CoolClassifier()
net

CoolClassifier(
  (hidden1): Linear(in_features=784, out_features=200, bias=True)
  (hidden2): Linear(in_features=200, out_features=10, bias=True)
)

In [None]:
# shape of the first layer?
list(net.parameters())[0].shape

torch.Size([200, 784])
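It can help to list every parameter and count them. The sketch below redefines the same two-layer classifier so that it runs on its own, then sums the sizes of all parameter tensors.

```python
import torch.nn as nn
import torch.nn.functional as F

class CoolClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(784, 200)
        self.hidden2 = nn.Linear(200, 10)

    def forward(self, x):
        return self.hidden2(F.relu(self.hidden1(x)))

net = CoolClassifier()

# named_parameters() yields (name, tensor) pairs in registration order
for name, p in net.named_parameters():
    print(f"{name:15s} {tuple(p.shape)}")

# 784*200 + 200 + 200*10 + 10 = 159010
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```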

Cool. We have our model (with totally useless weights, of course). Let's forward propagate an input through it (remembering to zero out the gradients first).

### Process an input

In [None]:
X, y = dataset[0]
X = np.array(X).flatten()  # flatten 28x28 into a length-784 vector
X = torch.tensor(X, dtype=torch.float32)  # convert int nparray to float tensor
X = X.unsqueeze(0)  # fake a minibatch of size 1

y = torch.tensor(y).unsqueeze(0)

net.zero_grad()
output = net(X)
output

tensor([[  4.6051,  -0.5493, -19.5514,   6.2472, -13.4110,  -3.7954, -18.7314,
         -23.9434, -13.7989,   5.8312]], grad_fn=<AddmmBackward>)

In [None]:
X.shape

torch.Size([1, 784])

As an aside, we learned before that it makes the most sense to use minibatch gradient descent, and this is what Torch was designed for. That means models expect inputs of shape `[batch_size, input_shape]`, which is why we wrapped our single example in a batch of size 1.
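A short sketch of the batch dimension in practice. It uses random stand-in data and a single `nn.Linear` layer (both assumptions for illustration) to show that the first dimension of the output always follows the batch size.

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 10)

single = torch.rand(784)             # one flattened image, shape [784]
batch_of_one = single.unsqueeze(0)   # shape [1, 784]
batch = torch.rand(16, 784)          # a minibatch of 16 images

print(layer(batch_of_one).shape)     # one row of 10 scores
print(layer(batch).shape)            # 16 rows of 10 scores
```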

Okay, great, now we just need to teach our model. To compute the loss, we'll use Cross Entropy Loss, since we're doing a classification task.

### Calculate loss

In [None]:
criterion = nn.CrossEntropyLoss()
loss = criterion(output, y)
loss

tensor(10.6601, grad_fn=<NllLossBackward>)
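`nn.CrossEntropyLoss` takes raw scores (logits) and combines a log-softmax with the negative log-likelihood of the target class. The sketch below recomputes the loss by hand from the logits printed earlier. The target class 5 is an assumption: it is the label that reproduces the loss value shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# logits copied from the printed network output; 5 is the assumed target class
logits = torch.tensor([[4.6051, -0.5493, -19.5514, 6.2472, -13.4110,
                        -3.7954, -18.7314, -23.9434, -13.7989, 5.8312]])
target = torch.tensor([5])

builtin = nn.CrossEntropyLoss()(logits, target)

# same quantity by hand: negative log-probability of the target class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[0, target[0]]

print(builtin.item(), manual.item())
```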

Now we can differentiate the loss w.r.t. the network's parameters (backpropagate!)

### Propagate gradients back

In [None]:
loss.backward()

In [None]:
net.hidden2.bias.grad

tensor([ 1.0439e-01,  6.0272e-04,  3.3698e-12,  5.3924e-01,  1.5644e-09,
        -9.9998e-01,  7.6511e-12,  4.1703e-14,  1.0614e-09,  3.5574e-01])
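The pattern in this gradient (entries that sum to roughly zero, with one entry close to -1) is what cross-entropy produces: the gradient with respect to the logits is `softmax(logits) - one_hot(target)`, and the bias of the final layer receives exactly that. A sketch on a tiny stand-in layer, with made-up sizes and random input, checks this relationship.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(4, 3)                  # stand-in for the final layer
x = torch.rand(1, 4)
target = torch.tensor([2])

logits = layer(x)
loss = F.cross_entropy(logits, target)
loss.backward()

# d(loss)/d(logits) = softmax(logits) - one_hot(target); the bias adds
# directly onto the logits, so it receives exactly this gradient
expected = F.softmax(logits.detach(), dim=1) - F.one_hot(target, num_classes=3)
print(layer.bias.grad)
print(expected.squeeze(0))
```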

### Update weights

The simplest way to do this is with stochastic gradient descent (SGD).

In [None]:
learning_rate = 0.01
# the update itself should not be tracked by autograd
with torch.no_grad():
    for f in net.parameters():
        # for each parameter, subtract the gradient * learning rate
        f -= f.grad * learning_rate

However, scientists have spent lots of time inventing very sophisticated ways of updating the weights beyond just SGD. Torch includes many of these more advanced update rules, called optimizers, in the `torch.optim` module.

As an example, here's how to use the built-in version of SGD, but we might prefer something like Adam instead.

In [None]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(X)
loss = criterion(output, y)
loss.backward()
optimizer.step()    # Does the update
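To connect the two approaches, the sketch below applies one manual update and one `optim.SGD` step to two copies of the same small model and checks that the weights agree. The model size and random data are assumptions made for illustration, and the comparison holds for plain SGD without momentum or weight decay.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)         # identical starting weights
X = torch.rand(8, 4)
y = torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()
lr = 0.01

# manual update: p <- p - lr * grad
criterion(model_a(X), y).backward()
with torch.no_grad():                    # do not track the update itself
    for p in model_a.parameters():
        p -= lr * p.grad

# the same step through torch.optim
optimizer = optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_b(X), y).backward()
optimizer.step()

same = all(torch.allclose(pa, pb) for pa, pb in
           zip(model_a.parameters(), model_b.parameters()))
print(same)
```

Swapping in another optimizer such as `optim.Adam(model_b.parameters(), lr=lr)` changes only the line that creates the optimizer; the rest of the loop stays the same.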

### Putting it all together

All of these parts go together to create our training procedure, which should look something like this:

In [None]:
train, val = torch.utils.data.random_split(dataset, [50_000, 10_000])
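`random_split` shuffles the indices and hands back two non-overlapping subsets. A sketch on a hypothetical stand-in dataset shows this, and also shows how passing a seeded `torch.Generator` makes the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# hypothetical stand-in dataset: 100 random "images" with integer labels
toy = TensorDataset(torch.rand(100, 784), torch.randint(0, 10, (100,)))

# a seeded generator makes the split reproducible across runs
g = torch.Generator().manual_seed(0)
toy_train, toy_val = random_split(toy, [80, 20], generator=g)

print(len(toy_train), len(toy_val))

# the two subsets never share an index
overlap = set(toy_train.indices) & set(toy_val.indices)
print(len(overlap))
```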

In [None]:
# See also: https://pytorch.org/tutorials/beginner/nn_tutorial.html#add-validation
# This might take a few minutes to run.

# From before,
# 1. Define the model, then instantiate the model
# net = CoolClassifier()

# The DataLoaders help us create and iterate over minibatches
# We use the collate_fn to do our preprocessing for each batch,
# i.e. convert each image into a length-784 float vector and stack them
def collate_fn(batch):
    # stack into one NumPy array first; building a tensor from a list of arrays is slow
    X = torch.tensor(np.stack([np.array(x).flatten() for x, _ in batch]), dtype=torch.float32)
    y = torch.tensor([y for _, y in batch])
    return X, y

batch_size = 16
train_dl = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_dl = torch.utils.data.DataLoader(val, batch_size=batch_size * 2, shuffle=False, collate_fn=collate_fn)

# Create our optimizer
optimizer = optim.Adam(net.parameters(), lr=0.0001)
optimizer.zero_grad()

epochs = 5
for epoch in range(epochs):
    # set training mode
    # this affects things like dropout or batch normalization
    net.train()
    for i, (X, y) in enumerate(train_dl):
        output = net(X)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 1000 == 0:
            print(f"Training batch loss {loss}")
    # set in evaluation mode
    net.eval()
    num_correct = 0
    num_total = 0
    with torch.no_grad():  # no gradients are needed for validation
        for X, y in val_dl:
            output = net(X)
            prediction = F.softmax(output, dim=1)  # softmax over the class dimension
            num_correct += torch.sum(torch.argmax(prediction, dim=1) == y)
            num_total += len(X)
    print(f"Val accuracy of {num_correct.true_divide(num_total)}")

Training batch loss 21429.099609375
Training batch loss 370.7990417480469
Training batch loss 10.451495170593262
Training batch loss 4.543331146240234
Val accuracy of 0.8579000234603882
Training batch loss 0.9304068684577942
Training batch loss 20.528350830078125
Training batch loss 1.0830923318862915
Training batch loss 2.068761110305786
Val accuracy of 0.9032999873161316
Training batch loss 3.31925892829895
Training batch loss 6.0038886070251465
Training batch loss 0.0
Training batch loss 0.000250410899752751
Val accuracy of 0.9013000130653381
Training batch loss 4.18876952608116e-05
Training batch loss 0.0003012671077158302
Training batch loss 30.823930740356445
Training batch loss 13.435836791992188
Val accuracy of 0.9045000076293945
Training batch loss 5.446247100830078
Training batch loss 0.0
Training batch loss 15.253301620483398
Training batch loss 1.4676566123962402
Val accuracy of 0.9291999936103821
