# Building a Neural Network with PyTorch

**Nicolas Nytko**  
nnytko2@illinois.edu  
https://github.com/nicknytko/cse-pytorch-workshop

Adapted from materials by Matthew West (mwest@illinois.edu)

_"Hands-On with CSE" tutorial series_

April 8, 2021

**Description:** PyTorch makes it easy to train and run machine learning models. Models are written as ordinary Python code, which keeps it both simple and powerful. We will cover the core automatic differentiation capabilities of PyTorch, training deep neural networks, managing training and test data, and saving and loading models, and we will show a few examples of neural network implementations. We assume a good knowledge of Python and NumPy, and basic knowledge of machine learning with neural nets.

# List of resources

- PyTorch tutorials: https://pytorch.org/tutorials/
- PyTorch manual: https://pytorch.org/docs/stable/index.html
- PyTorch paper: https://openreview.net/forum?id=BJJsrmfCZ
- Calculus on computational graphs: http://colah.github.io/posts/2015-08-Backprop/
- Einstein summation in PyTorch: https://rockt.github.io/2018/04/30/einsum

# PyTorch citation

```
@inproceedings{paszke2017automatic,
  title={Automatic differentiation in PyTorch},
  author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam},
  booktitle={NIPS-W},
  year={2017},
  url={https://openreview.net/forum?id=BJJsrmfCZ},
}
```

# What is PyTorch?

- Like NumPy with automatic differentiation via dynamic computation graphs
- Transparent ability to compute on GPUs and in parallel
- Library of neural net functions and constructors
- Library of gradient-based optimizers
- Various other useful functions (e.g., data management)

# Let's get started!

In [None]:
import torch
import numpy as np
import random, datetime
import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch is similar to NumPy

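PyTorch works in immediate (eager) mode: each operation runs as soon as it is called. This differs from the deferred graph execution that was the default in TensorFlow 1.x.

PyTorch is strict about datatypes and defaults to single precision (`float32`) for floating-point values, whereas NumPy defaults to double precision (`float64`).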

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])

In [None]:
x

In [None]:
x**2

In [None]:
x.dtype

#### Specify datatypes

Pass `dtype=torch.float64` when creating a tensor, or convert an existing tensor with `.double()`

In [None]:
y = x.double()
y

In [None]:
y.dtype

In [None]:
y.float()

In [None]:
y.int()

In [None]:
x = torch.tensor([1, 2, 3])

In [None]:
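torch.log(x)  # integer input: older PyTorch versions raise an error here, newer ones promote the result to float32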

In [None]:
x.dtype

In [None]:
x = torch.tensor([1, 2, 3], dtype=torch.float64)

In [None]:
torch.log(x)
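
The dtype rules above can be collected in one place. This is a small sketch, not part of the original notebook; the variable names `a` through `e` are chosen only for illustration.

```python
import numpy as np
import torch

a = torch.tensor([1.0, 2.0, 3.0])   # Python floats -> torch.float32 (PyTorch default)
b = np.array([1.0, 2.0, 3.0])       # Python floats -> np.float64 (NumPy default)
c = torch.from_numpy(b)             # keeps NumPy's dtype, so this is torch.float64
d = torch.tensor([1, 2, 3])         # Python ints -> torch.int64
e = a.to(torch.float64)             # explicit conversion, equivalent to a.double()

print(a.dtype, b.dtype, c.dtype, d.dtype, e.dtype)
print(torch.get_default_dtype())    # the dtype used for new floating-point tensors
```

A tensor that arrives from NumPy is therefore double precision unless it is converted, which is a common source of dtype-mismatch errors when it meets `float32` model parameters.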

#### Annoying differences from NumPy

`np.sum(x, axis=1)` versus `torch.sum(x, dim=1)`

In [None]:
T = torch.tensor([[1., 2., 3.], [4., 5., 6.]])  # float32, same values as before
Tary = np.array([[1,2,3], [4,5,6]])

In [None]:
np.sum(Tary, axis=1)

In [None]:
torch.sum(T, dim=1)
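
As a quick illustration of what `dim` selects, the sketch below sums the same 2 x 3 tensor along each dimension. The names `row_sums`, `col_sums`, and `kept` are introduced here for illustration only.

```python
import torch

T = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # shape (2, 3)

row_sums = torch.sum(T, dim=1)                # sum across columns -> shape (2,)
col_sums = torch.sum(T, dim=0)                # sum across rows    -> shape (3,)
kept = torch.sum(T, dim=1, keepdim=True)      # shape (2, 1); NumPy spells this keepdims=True

print(row_sums)   # tensor([ 6., 15.])
print(col_sums)   # tensor([5., 7., 9.])
print(kept.shape)
```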

#### Convert to/from NumPy arrays

`.numpy()` and `torch.from_numpy()`

In [None]:
x = np.array([1, 2, 3])
y = torch.from_numpy(x)
y

In [None]:
y.numpy()

Memory is shared! Modifying the NumPy array also changes the tensor.

In [None]:
x[0] = 7
y
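
When sharing is not wanted, one option is to make an explicit copy. The sketch below contrasts `torch.from_numpy` (shared memory) with `torch.tensor` (which copies its input); the names `shared`, `copied`, and `back` are illustrative.

```python
import numpy as np
import torch

x = np.array([1, 2, 3])
shared = torch.from_numpy(x)   # shares memory with x
copied = torch.tensor(x)       # independent copy of the data

x[0] = 7                       # visible through `shared`, not through `copied`
print(shared, copied)

back = shared.numpy()          # the returned array shares memory too
back[1] = 9                    # writes through to both `shared` and x
print(x)
```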

# `torch.autograd`: Computing derivatives

PyTorch builds the computation graph dynamically as operations execute, unlike the static graphs that TensorFlow 1.x builds before running.

Using the computation graph, the chain rule (backpropagation) computes derivatives.

After `backward()`, derivatives are available in the `.grad` attribute of leaf nodes that were created with `requires_grad=True`.

In [None]:
x = torch.tensor(5.0)

In [None]:
y = torch.tensor(3.0, requires_grad=True)

In [None]:
z = x * y**2
z

In [None]:
z.backward()

In [None]:
print(f'x.grad = {x.grad}')

In [None]:
y.grad

$z = x y^2$

$\frac{\partial z}{\partial y} = 2 x y$

In [None]:
2*x*y
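
To see which tensors are leaves of the graph, the `is_leaf` and `grad_fn` attributes can be inspected. The following is a minimal sketch that repeats the example above in a self-contained form.

```python
import torch

x = torch.tensor(5.0)                      # leaf, but not tracked by autograd
y = torch.tensor(3.0, requires_grad=True)  # leaf, tracked
z = x * y**2                               # result of an operation, so not a leaf

print(x.is_leaf, y.is_leaf, z.is_leaf)
print(z.grad_fn)                           # the graph node that produced z

z.backward()
print(x.grad)                              # None: x does not require grad
print(y.grad)                              # 2*x*y evaluated at x=5, y=3
```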

#### Control what we differentiate with respect to

`requires_grad=True`

`with torch.no_grad():`

`.detach()`

In [None]:
x = torch.tensor(2.0, requires_grad=True)
y = x*x
print(f'y.requires_grad = {y.requires_grad}')
z = x*y
z.backward()
print(f'dz/dx = {x.grad}')

In [None]:
x = torch.tensor(2.0, requires_grad=True)
y = x*x
y = y.detach() # can't say y.requires_grad = False
print(f'y.requires_grad = {y.requires_grad}')
z = x*y
z.backward()
print(f'dz/dx = {x.grad}')

In [None]:
x = torch.tensor(2.0, requires_grad=True)
with torch.no_grad():
    y = x*x
print(f'y.requires_grad = {y.requires_grad}')
z = x*y
z.backward()
print(f'dz/dx = {x.grad}')
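
The three cells above differ only in how `y` is computed. The sketch below wraps them in a hypothetical helper, `dz_dx`, to compare the results side by side: with `y` tracked, $z = x^3$ and the derivative is $3x^2$; with `y` detached or computed under `no_grad`, `y` is treated as a constant and the derivative is just its value, $x^2$.

```python
import torch

def dz_dx(mode):
    """Return dz/dx at x=2 for z = x*y with y = x*x computed in the given mode."""
    x = torch.tensor(2.0, requires_grad=True)
    if mode == 'tracked':
        y = x * x                  # y stays in the graph
    elif mode == 'detach':
        y = (x * x).detach()       # y is cut out of the graph
    else:                          # 'no_grad'
        with torch.no_grad():
            y = x * x              # y is computed without recording a graph
    z = x * y
    z.backward()
    return x.grad.item()

print(dz_dx('tracked'))   # 3*x**2 = 12.0
print(dz_dx('detach'))    # x**2   = 4.0
print(dz_dx('no_grad'))   # x**2   = 4.0
```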

#### Computation graphs are not trees

Reusing a parameter in multiple places means the graph is no longer a tree. It is a directed acyclic graph (DAG), and the gradient contributions from every path are summed.

In [None]:
x = torch.tensor(2.0, requires_grad=True)
y = 3*x
z = x**2
w = y + z + x
w.backward()
x.grad

$\frac{\partial w}{\partial x} = \frac{\partial}{\partial x}(3x + x^2 + x) = 3 + 2x + 1$

In [None]:
3 + 2*x + 1

#### The computation graph is destroyed by `backward()`

To retain it for further differentiation, use `backward(retain_graph=True)`.

A common use case is multiple outputs that share a subgraph.

Omit `retain_graph=True` on the last call so that the graph is freed and its memory is released.

In [None]:
x = torch.tensor(3.0, requires_grad=True)
y = x**2
z1 = 3*y
z2 = 4*y

In [None]:
z1.backward() # use z1.backward(retain_graph=True) so that z2.backward() below can reuse the shared subgraph
x.grad

In [None]:
z2.backward()
x.grad
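
A self-contained sketch of the retained-graph version follows. It also resets `x.grad` between the two calls, because gradients accumulate across `backward()` calls; the names `g1` and `g2` are illustrative.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2          # shared subgraph
z1 = 3*y
z2 = 4*y

z1.backward(retain_graph=True)   # keep the shared subgraph for the next call
g1 = x.grad.clone()              # dz1/dx = 6x = 18

x.grad.zero_()                   # without this, the next gradient is added to g1
z2.backward()                    # last call: the graph is freed here
g2 = x.grad.clone()              # dz2/dx = 8x = 24

print(g1, g2)
```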

#### Derivatives of scalars with respect to tensors

In [None]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x**2).sum()
y.backward()
x.grad

#### Don't do in-place modifications to tensors

But it's fine to do `x = 4 * x`, which creates a new tensor and rebinds the name `x` rather than modifying the original in place.

In [None]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x

In [None]:
x[1] = x[2] + 1 # in-place write to a leaf tensor that requires grad: raises a RuntimeError
#x = 4*x
x

In [None]:
y = (x**2).sum()
y.backward()
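
One possible workaround, sketched here under the assumption that the goal is to overwrite a single entry, is to clone the leaf tensor first and modify the clone. The clone is not a leaf, so autograd can track the write. The name `w` is illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

w = x.clone()        # non-leaf copy that is still connected to x in the graph
w[1] = x[2] + 1      # in-place write on the clone is allowed: w = [1, 4, 3]

y = (w**2).sum()
y.backward()

# dy/dx0 = 2*w0 = 2; dy/dx1 = 0 (x1 was overwritten); dy/dx2 = 2*w2 + 2*w1 = 14
print(x.grad)
```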

# `torch.optim`: All the common gradient-based optimizers

In [None]:
def f(x):
    return x**2

In [None]:
xvec = np.linspace(-2, 2, 100)
fvec = f(xvec)
plt.plot(xvec, fvec, 'o-', markersize=3)

In [None]:
x = torch.tensor([2.0], requires_grad=True)

In [None]:
opt = torch.optim.SGD([x], lr=0.1)

In [None]:
x_history = [x.detach().numpy().copy()]
for i in range(30):
    print('##########')
    print(f'i = {i}')
    print(f'initial x = {x}')
    opt.zero_grad()
    z = f(x)
    print(f'f(x) = {z}')
    z.backward()
    print(f'x.grad = {x.grad}')
    opt.step()
    print(f'updated x = {x}')
    x_history.append(x.detach().numpy().copy())

In [None]:
xvec = np.linspace(-2, 2, 100)
fvec = f(xvec)
plt.plot(xvec, fvec)
plt.plot(x_history, f(np.array(x_history)), 'o-')
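
For plain SGD (no momentum or weight decay), each `opt.step()` amounts to `x <- x - lr * x.grad`. The sketch below runs the optimizer and a hand-written update side by side to illustrate this; `x_opt` and `x_manual` are names introduced here. For $f(x) = x^2$ with `lr=0.1`, every step multiplies `x` by 0.8.

```python
import torch

def f(x):
    return x**2

lr = 0.1
x_opt = torch.tensor([2.0], requires_grad=True)
x_manual = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([x_opt], lr=lr)

for i in range(30):
    # Optimizer version
    opt.zero_grad()
    f(x_opt).backward()
    opt.step()

    # Hand-written version of the same update
    f(x_manual).backward()
    with torch.no_grad():              # the update itself must not be tracked
        x_manual -= lr * x_manual.grad
    x_manual.grad.zero_()              # same role as opt.zero_grad()

print(x_opt, x_manual)
```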

# `torch.nn`: Easy neural-network construction

Convention: the first index is the data-item (batch) index, so N images, each of shape 128 x 128, are stored in a tensor of shape N x 128 x 128.
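
A short sketch of this convention: a `torch.nn.Linear(1, 8)` layer expects input of shape N x 1 and returns N x 8, so a 1-D vector of N samples has to be reshaped first. The tensors `batch`, `images`, and `flat` are illustrative.

```python
import torch

x = torch.linspace(0, 1, 100)      # shape (100,): 100 scalar samples
batch = x.reshape(-1, 1)           # shape (100, 1): one feature per data item

layer = torch.nn.Linear(1, 8)      # maps 1 input feature to 8 output features
out = layer(batch)                 # shape (100, 8): the batch index is preserved

images = torch.zeros(10, 128, 128) # 10 images of shape 128 x 128
flat = images.reshape(10, -1)      # shape (10, 16384): one row per image

print(batch.shape, out.shape, flat.shape)
```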

In [None]:
x = torch.linspace(0, 2*np.pi, 100)
y = torch.sin(x)
plt.plot(x.numpy(), y.numpy(), 'r.');

In [None]:
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        N = 8
        self.fc1 = torch.nn.Linear(1, N)
        self.fc2 = torch.nn.Linear(N, N)
        self.fc3 = torch.nn.Linear(N, 1)

    def forward(self, x):
        x = torch.nn.functional.relu(self.fc1(x))
        x = torch.nn.functional.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [None]:
model = MyModel()

In [None]:
yp = model(x.reshape(100,1))

In [None]:
plt.plot(x.numpy(), y.numpy(), 'r.');
plt.plot(x.numpy(), yp.detach().numpy());

In [None]:
model.fc3.weight

In [None]:
model.fc3.bias

In [None]:
model.fc3.bias.data

In [None]:
for p in model.parameters():
    print(p.shape)
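
The parameter shapes can also be used to count the trainable parameters. The sketch below uses a `torch.nn.Sequential` stand-in with the same layer sizes as `MyModel` so that it runs on its own; it is an assumption for illustration, not the model defined above.

```python
import torch

# Stand-in with the same layer sizes as MyModel (1 -> 8 -> 8 -> 1)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

# Each Linear layer contributes a weight of shape (out, in) and a bias of shape (out,)
for name, p in net.named_parameters():
    print(name, tuple(p.shape))

n_params = sum(p.numel() for p in net.parameters())
print(n_params)   # 16 + 72 + 9 = 97
```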

In [None]:
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_history = []

In [None]:
for i in range(200):
    opt.zero_grad()
    yp = model(x.reshape(100,1))
    loss = torch.nn.MSELoss()(yp, y.reshape(100,1))
    loss_history.append(loss.item())
    loss.backward()
    opt.step()

In [None]:
plt.plot(loss_history);

In [None]:
plt.plot(x.numpy(), y.numpy(), 'r.');
plt.plot(x.numpy(), yp.detach().numpy());

# Saving and restoring models

Save and load the parameters (the `state_dict`), not the full model object.

In [None]:
torch.save(model.state_dict(), 'model_file.pkl')

In [None]:
model = MyModel()
model.load_state_dict(torch.load('model_file.pkl'))

In [None]:
model.state_dict()
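
A round trip can be checked by comparing the saved and restored parameters. This sketch uses a single `torch.nn.Linear` layer as a stand-in model and a temporary directory so that it leaves no files behind; both are assumptions made for illustration.

```python
import os
import tempfile
import torch

model = torch.nn.Linear(1, 8)          # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model_file.pkl')
    torch.save(model.state_dict(), path)

    restored = torch.nn.Linear(1, 8)   # must have the same architecture
    restored.load_state_dict(torch.load(path))

# Every tensor in the restored state_dict should match the original exactly
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Because only the parameters are stored, the code that defines the model class has to be available when the file is loaded.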