### Intro to NN using PyTorch

#### Content

- Environment
- A simple linear network
- Why is this useful ? ` The Universal Approximator`
- Demo


In [10]:
import torch
import torch.nn as nn

import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt


#### The PyTorch Environment

For this part we will use the material from: [PyTorch Examples](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html), proposed by: Justin Johnson.

This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

At its core, PyTorch provides two main features:

- An n-dimensional Tensor, similar to numpy but can run on GPUs

- Automatic differentiation for building and training neural networks

We will use a problem of fitting $y = sin(x)$  with a third order polynomial as our running example. 
The network will have four parameters, and will be trained with gradient descent to fit random data by 
minimizing the Euclidean distance between the network output and the true output.

#### Numpy Tensors

In [11]:
# Create random input and output data
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize weights
a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    # y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = np.square(y_pred - y).sum() # f (a, b, c, d)
    """
    ( y_pred - y) ** 2 = f (y, y_pred)
    
    df/dy_pred = 2 * (y_pred - y)
    
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    df/da = (2 * (a + b * x + c * x ** 2 + d * x ** 3)) / da


    """

    if t % 100 == 99:
        print(f"Step: {t}, Loss: {loss}")

    # Backprop to compute gradients of a, b, c, d with respect to loss

    grad_y_pred = 2.0 * (y_pred - y) # df/dy_pred 

    grad_a = grad_y_pred.sum() # a = [1, 1,2 ,3 ].sum() => sum over line elements

    grad_b = (grad_y_pred * x).sum()

    grad_c = (grad_y_pred * x ** 2).sum()

    grad_d = (grad_y_pred * x ** 3).sum()
    

    # Update weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')

Step: 99, Loss: 1231.1623404557583
Step: 199, Loss: 821.1158296400363
Step: 299, Loss: 548.796583140447
Step: 399, Loss: 367.893176757456
Step: 499, Loss: 247.6818215137023
Step: 599, Loss: 167.7752742736492
Step: 699, Loss: 114.6421303538635
Step: 799, Loss: 79.29912545188282
Step: 899, Loss: 55.7808676231282
Step: 999, Loss: 40.12490280043427
Step: 1099, Loss: 29.698434774697485
Step: 1199, Loss: 22.75159409829005
Step: 1299, Loss: 18.12095821424207
Step: 1399, Loss: 15.03274064037925
Step: 1499, Loss: 12.972109529246726
Step: 1599, Loss: 11.596392233200259
Step: 1699, Loss: 10.677411266707436
Step: 1799, Loss: 10.063162503008883
Step: 1899, Loss: 9.652339396585077
Step: 1999, Loss: 9.377391249518
Result: y = 0.011666870001931336 + 0.8363794764850442 x + -0.002012729212887965 x^2 + -0.09043416195169078 x^3


#### Pytorch Tensors

In [12]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

99 3048.1318359375
199 2039.31005859375
299 1366.162841796875
399 916.7518920898438
499 616.5413818359375
599 415.87774658203125
699 281.667724609375
799 191.84512329101562
899 131.6881103515625
999 91.37032318115234
1099 64.32882690429688
1199 46.177547454833984
1299 33.984134674072266
1399 25.786048889160156
1499 20.2694034576416
1599 16.553836822509766
1699 14.048992156982422
1799 12.358757019042969
1899 11.217127799987793
1999 10.44522476196289
Result: y = 0.026495885103940964 + 0.8259782195091248 x + -0.00457098288461566 x^2 + -0.08895467221736908 x^3


#### Autograd
*PyTorch: Tensors and autograd*

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If $x$ is a Tensor that has `x.requires_grad=True` then `x.grad` is another Tensor holding the gradient of `x` with respect to some scalar value.

Here we use PyTorch Tensors and autograd to implement our fitting sine wave with third order polynomial example; now we no longer need to manually implement the backward pass through the network:

*PyTorch: Defining new autograd functions*

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Tensors containing input data.

In this example we define our model as:
​
$y = a + bP_3(c + dx)$ instead of: $y = a + bx  + cx^2 + dx^2$ where $P_3 = \frac{1}{2} * (5x^3 - 3x)$ is the Legendre Polynomial of degree 3. We write our own custom autograd function for computing forward and backward of $P_3$, and use it to implement our model: `nn.model`.

`nn.model` 
Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.

In [13]:
# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# For this example, the output y is a linear function of (x, x^2, x^3), so
# we can consider it as a linear layer neural network. Let's prepare the
# tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# In the above code, x.unsqueeze(-1) has shape (2000, 1), and p has shape
# (3,), for this case, broadcasting semantics will apply to obtain a tensor
# of shape (2000, 3) 

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flatens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1) # y_pred
)

"""
n
     
n      n => value

n

"""

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(2000):

    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(xx)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    # Tensor {grandients = 0}
    
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

99 469.720703125
199 321.5616455078125
299 221.26107788085938
399 153.2879180908203
499 107.17369079589844
599 75.85462188720703
699 54.560394287109375
799 40.066009521484375
899 30.18899917602539
999 23.45064926147461
1099 18.848346710205078
1199 15.701372146606445
1299 13.54703140258789
1399 12.070518493652344
1499 11.057430267333984
1599 10.36150074005127
1699 9.882917404174805
1799 9.55342960357666
1899 9.326318740844727
1999 9.169631958007812
Result: y = 0.016424741595983505 + 0.8464678525924683 x + -0.002833540318533778 x^2 + -0.09186913818120956 x^3


#### Optim

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters with `torch.no_grad()`. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like `AdaGrad`, `RMSProp`, `Adam`, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

In [8]:
# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# For this example, the output y is a linear function of (x, x^2, x^3), so
# we can consider it as a linear layer neural network. Let's prepare the
# tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# In the above code, x.unsqueeze(-1) has shape (2000, 1), and p has shape
# (3,), for this case, broadcasting semantics will apply to obtain a tensor
# of shape (2000, 3) 

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flatens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)

# Optimizer
learning_rate = 1e-6
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')


for t in range(2000):

    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(xx)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Automatic update
    optimizer.step()

# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

99 5158.91748046875
199 5150.873046875
299 5142.8369140625
399 5134.80810546875
499 5126.78662109375
599 5118.77294921875
699 5110.767578125
799 5102.77001953125
899 5094.7802734375
999 5086.7978515625
1099 5078.82373046875
1199 5070.8564453125
1299 5062.8974609375
1399 5054.9462890625
1499 5047.00244140625
1599 5039.06640625
1699 5031.13818359375
1799 5023.2177734375
1899 5015.3037109375
1999 5007.3984375
Result: y = 0.1793818324804306 + 0.17446064949035645 x + -0.03135479614138603 x^2 + 0.13086235523223877 x^3


### A custom *cool* model

In [9]:
# Let's get some data for this model
import torchvision.datasets as datasets
from torchvision import transforms

t = transforms.Compose([
    transforms.ToTensor()
])

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=t)
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=t)

ModuleNotFoundError: No module named 'torchvision'

In [132]:
print("Train: {} / Test: {}".format(len(mnist_trainset), len(mnist_testset)))

Train: 60000 / Test: 10000


In [133]:
print(mnist_testset[0][0].shape)

torch.Size([1, 28, 28])


In [134]:
28 * 28

784

In [157]:
class CoolModel(nn.Module):
    def __init__(self, dim_input: int = 28 * 28, dim_output: int = 10, dim_hidden: int = 128) -> None:
        super().__init__()
        self.dim_input = dim_input
        self.dim_hidden = dim_hidden
        self.dim_output = dim_output

        self.net = nn.Sequential(
            nn.Linear(self.dim_input, self.dim_input // 2),
            nn.ReLU(),
            nn.Linear(self.dim_input // 2, self.dim_hidden),
            nn.ReLU(),
            nn.Linear(self.dim_hidden, self.dim_output)
        )

        print(self.net)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.net(x)
        return out

In [150]:
cool_net = CoolModel()

Sequential(
  (0): Linear(in_features=784, out_features=392, bias=True)
  (1): Softmax(dim=None)
  (2): Linear(in_features=392, out_features=128, bias=True)
  (3): Softmax(dim=None)
  (4): Linear(in_features=128, out_features=128, bias=True)
  (5): Softmax(dim=None)
  (6): Linear(in_features=128, out_features=128, bias=True)
  (7): Softmax(dim=None)
  (8): Linear(in_features=128, out_features=10, bias=True)
)


In [151]:
# Test the model
image = mnist_testset[0][0]
image = image.reshape(-1, 28 * 28)
image.shape

torch.Size([1, 784])

In [153]:
cool_net(image).shape

torch.Size([1, 10])

In [154]:
mnist_train_dl = torch.utils.data.DataLoader(dataset = mnist_trainset,
                                             batch_size = 32,
                                             shuffle = True)

mnist_test_dl = torch.utils.data.DataLoader(dataset = mnist_testset,
                                      batch_size = 32, 
                                      shuffle = False)

### Let's train the model

In [155]:
optimizer = torch.optim.Adam(params=cool_net.parameters(), lr=1e-6)
loss_fn = nn.CrossEntropyLoss()
num_epochs = 10

In [156]:
for epoch in range(num_epochs):
  for i, (images, labels) in enumerate(mnist_train_dl):
    images = images.view(-1, 28*28)
    labels = labels.float()

    optimizer.zero_grad()
    outputs = cool_net(images)
    outputs, _ = torch.max(outputs, dim=1)

    loss = loss_fn(outputs, labels)
    loss.backward()
    optimizer.step()
    
    
  print('Epoch [%d/%d] Loss: %.4f'%(epoch+1, num_epochs, loss.item()))

Epoch [1/10] Loss: 512.9289
Epoch [2/10] Loss: 471.3401
Epoch [3/10] Loss: 415.8883
Epoch [4/10] Loss: 485.2030
Epoch [5/10] Loss: 592.6409
Epoch [6/10] Loss: 585.7094
Epoch [7/10] Loss: 422.8198
Epoch [8/10] Loss: 634.2297
Epoch [9/10] Loss: 471.3401
Epoch [10/10] Loss: 499.0660


In [None]:
correct = 0
total = 0
for I, (images, labels) in enumerate(mnist_test_dl):
  images = images.view(-1,28*28)
  
  output = cool_net(images)
  _, predicted = torch.max(output, dim=1)
  correct += (predicted == labels).sum()
  total += labels.size(0)

print('Accuracy of the model: %.3f %%' %((100*correct)/(total+1)))

Accuracy of the model: 9.629 %


In [158]:
import torch
import torchvision

In [194]:
n_epochs = 3
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 640

random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

<torch._C.Generator at 0x205104e1d10>

In [195]:
train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/files/', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/files/', train=False, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=batch_size_test, shuffle=True)

In [196]:
examples = enumerate(test_loader)
batch_idx, (example_data, example_targets) = next(examples)

In [197]:
example_data.shape

torch.Size([1000, 1, 28, 28])

In [198]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [199]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

In [200]:
network = Net()
optimizer = optim.SGD(network.parameters(), lr=learning_rate,
                      momentum=momentum)

In [201]:
train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

In [202]:
def train(epoch):
  network.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = network(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % log_interval == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, batch_idx * len(data), len(train_loader.dataset),
        100. * batch_idx / len(train_loader), loss.item()))
      train_losses.append(loss.item())
      train_counter.append(
        (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
      # torch.save(network.state_dict(), '/results/model.pth')
      # torch.save(optimizer.state_dict(), '/results/optimizer.pth')

In [203]:
def test():
  network.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      output = network(data)
      test_loss += F.nll_loss(output, target, size_average=False).item()
      pred = output.data.max(1, keepdim=True)[1]
      correct += pred.eq(target.data.view_as(pred)).sum()
  test_loss /= len(test_loader.dataset)
  test_losses.append(test_loss)
  print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))

In [204]:
test()
for epoch in range(1, n_epochs + 1):
  train(epoch)
  test()


Test set: Avg. loss: 2.3096, Accuracy: 924/10000 (9%)


Test set: Avg. loss: 0.1874, Accuracy: 9439/10000 (94%)


Test set: Avg. loss: 0.1156, Accuracy: 9630/10000 (96%)


Test set: Avg. loss: 0.0984, Accuracy: 9687/10000 (97%)

