<a href="https://colab.research.google.com/github/kristiyandd/ai-combinator/blob/master/pytorch_primer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Neural Networks

This primer assumes familiarity with the concept of neural networks and is geared toward those who want to get started with PyTorch.

# Getting PyTorch Going

Let's get started by importing PyTorch and some of the more important sub-modules.

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib
import matplotlib.pyplot as plt
import numpy
import random

# Generate a data set

Before we start, we need a dataset. We could grab a dataset off the internet but we might not understand the nuances of the data very well. Instead, we will generate some data and then train on it.

Let's pretend we are building a self-driving car. The car has proximity sensors and can act on that sensor information: it can accelerate or brake, and it can turn the steering wheel left or right. Based on values from the proximity sensors, we want to infer how much energy to apply to the accelerator or brakes and how much to turn the steering wheel.

Let's further assume that we have also put sensors on human drivers' cars and we have been recording the data from the proximity sensors along with how much the human drivers press the accelerator or brakes and turn the wheel. Learning from such recordings is called a *supervised learning* task: for every set of recorded sensor data, we know what a human has really done in response. Now all we need to do is train a neural network to produce the known-correct response when given a set of sensor data. If our model is good enough, we will trust it to make decisions on the car in situations that it has never seen before.

What we are going to do here is synthetically generate the data set. Let's assume the car has 4 sensors:
* proximity to car in front (0...1)
* proximity to car in back (0...1)
* proximity to car to the left (0...1)
* proximity to car on the right (0...1)

We will randomly generate this data.

We also need a supervision signal: what the car should do under different circumstances. I have provided some simple rules for how the car should behave. This is a bit artificial: why do we need a neural network to drive the car if I know what the rules are and can write the code for those rules? In reality we wouldn't have these rules, but for learning PyTorch we need a ground truth of non-random supervision. So we will pretend that we acquired behavioral data for the car.

There are 2 controls for the car:
* accelerate: how much should we accelerate (-1...1)? Negative values mean braking.
* turn: how much should we turn the steering wheel (-1...1)? Negative means left, positive means right.


In [0]:
def make_data(num_data):
  data = []
  for n in range(num_data):

    # Generate a random state
    # the distance to the nearest car in front/back/left/right is normalized from 0.0 (closest) to 1.0 (farthest)
    carInFrontDist = random.random()
    carInBackDist = random.random()
    carLeftDist = random.random()
    carRightDist = random.random()

    # Response to the state. Both controls are continuous values in [-1, 1]:
    # accel < 0 brakes, accel > 0 accelerates; turn < 0 steers left, turn > 0 steers right
    accel = 1.0
    turn = 0.0

    # Should I accelerate or brake?
    if carInFrontDist < 0.50:
      # Car is close, brake
      # Unless there is another car close behind
      if carInBackDist > 0.50:
        # Okay to brake
        accel = -carInFrontDist/0.50
      else:
        # Not okay to brake, but at least stop accelerating
        accel = 0
    else:
      # Car in front is not close, continue to accelerate
      accel = (carInFrontDist - 0.50)/0.50

    # Should I turn left or right? (can't do both)
    if carLeftDist < 0.5 or carRightDist < 0.5:
      turn = (1.0 - (carLeftDist)) - (1.0 - carRightDist)

    # Store the data
    x = (carInFrontDist, carInBackDist, carLeftDist, carRightDist)
    y = (accel, turn)
    data.append((x, y))
  return data

In [0]:
train_data = make_data(10000)

Let's take a look at the data. There is a lot, so let's just look at one data element.



In [0]:
train_data[0]

The data is a Python list. Each element is a tuple. The first element in the tuple is the input data (*x*), which contains the four sensor inputs. The second element in the tuple is the output data (*y*), the correct behavior of the car in response to the sensor inputs.
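
To make the structure concrete, here is a minimal sketch of unpacking one data element. The values are hand-written for illustration (they are not actual output of ```make_data```), chosen to roughly follow the rules above: the car in front is far away, so we accelerate, and the car on the right is close, so we steer left.

```python
# A hand-written stand-in for one element of train_data (illustrative values only).
sample = ((0.8, 0.3, 0.9, 0.2), (0.6, -0.7))

# Split the element into its input part (x) and its output part (y).
x, y = sample

# x holds the four sensor readings, y holds the two controls.
front, back, left, right = x
accel, turn = y

print('sensors :', front, back, left, right)
print('controls:', accel, turn)
```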

# Data Preparation

We need to do two things. First, we need to convert our data into tensors, which are the array-like data structures that PyTorch uses. Fortunately, it is easy to create tensors from lists and tuples.

Second, we need to group chunks of data into batches. A batch is a chunk of data that can be run through the neural network in parallel. This greatly speeds up the training of the neural network and has some other advantages as well.

The following function will get a chunk of data and convert it into tensors. We return a tensor for the input (*x*) and a tensor for the output (*y*). You will see that we are splitting out the inputs from the outputs. 

In [0]:
def get_batch(data, batch_size, index):
  start_index = index * batch_size
  end_index = start_index + batch_size
  batch = data[start_index:end_index]
  x = torch.tensor([e[0] for e in batch])
  y = torch.tensor([e[1] for e in batch])
  return x, y

Let's take a look at a batch:

In [0]:
x, y = get_batch(train_data, 8, 0)

print('x:')
print(x)
print('shape of x:', x.size())

print('y:')
print(y)
print('shape of y:', y.size())

The first row of the *x* tensor should look like the first part of the training data tuple at index 0 above. Each row in the tensor is the input portion of a different line of data. It has been merged into what looks like a multi-dimensional array.

The array has a particular *shape*. In particular it is an 8 x 4 array. The first dimension is for batching (there are 8 data points in the batch). The second dimension reveals that each data point has four values making it up.

The first row of the *y* tensor matches the second part of the training data tuple at index 0 above. The *y* tensor is an 8 x 2 array because there are two controls for the car.
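
The conversion can also be tried in isolation. The sketch below uses two hand-written data points (illustrative values, not generated data) in the same format as the training data and builds the *x* and *y* tensors the same way ```get_batch``` does.

```python
import torch

# Two hand-written data points in the same (x, y) format as the training data.
batch = [
    ((0.8, 0.3, 0.9, 0.2), (0.6, -0.7)),
    ((0.1, 0.9, 0.6, 0.7), (-0.2, 0.0)),
]

# Stack the input parts into one tensor and the output parts into another.
x = torch.tensor([e[0] for e in batch])
y = torch.tensor([e[1] for e in batch])

# The first dimension is the batch size, the second is the values per data point.
print(x.shape, x.dtype)
print(y.shape, y.dtype)
```

With a batch of two, *x* is 2 x 4 and *y* is 2 x 2. Python floats become 32-bit floating point tensors by default.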

# The Neural Network

Next we need to define the neural network. In PyTorch we do this by creating a new class that subclasses ```nn.Module```. In this class we will define the different layers and indicate how the layers go together so that inputs flow through the layers to create outputs.

In [0]:
class CarNet(nn.Module):

    def __init__(self):
        # Call the parent class constructor
        super(CarNet, self).__init__()
        # These are the layers
        self.linear1 = nn.Linear(4, 16)
        self.activation1 = nn.Tanh()
        self.linear2 = nn.Linear(16, 8)
        self.activation2 = nn.Tanh()
        self.linear3 = nn.Linear(8, 2)
        self.activation3 = nn.Tanh()

    def forward(self, x):
        h1 = self.activation1(self.linear1(x))
        h2 = self.activation2(self.linear2(h1))
        y_hat = self.activation3(self.linear3(h2))
        return y_hat

Let's walk through this. We are going to create two hidden layers of neurons and an output layer, each followed by a tanh activation. There are 4 inputs (4 sensor values) and 2 output values (accelerate and turn steering wheel). We use tanh activations instead of the traditional sigmoid because tanh can produce values between -1 and 1, which matches the range of our controls.

In the constructor, I have set up the first layer as ```nn.Linear(4, 16)```. This creates a layer of 16 hidden neurons and fully connects it to 4 inputs. Actually, the layer doesn't know what it will be connected to, only that there will be 4 input values. ```nn.Linear()``` is a way of growing or shrinking the width of a neural network.

But wait, where are the weights? They are hidden inside the ```nn.Linear()``` object that we just instantiated.
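
As a quick check, the weights and biases can be reached through the ```weight``` and ```bias``` attributes of the layer. This is a small standalone sketch using a fresh layer of the same size as ```linear1```.

```python
import torch.nn as nn

# A fresh layer with the same sizes as linear1: 4 inputs, 16 outputs.
layer = nn.Linear(4, 16)

# The weight matrix has one row per output and one column per input.
print(layer.weight.shape)

# There is one bias value per output.
print(layer.bias.shape)
```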

A linear layer doesn't have an activation function; it just grows or shrinks the width of our network. So we need to create another layer that has an activation function. ```activation1 = nn.Tanh()``` is the tanh activation function. It can take an arbitrary number of inputs and produces an identical number of outputs.

For the next layer, we shrink the layer of 16 hidden nodes to a layer of 8 hidden nodes using another ```nn.Linear()``` and another ```nn.Tanh()```. 

Finally, we need to squeeze those eight activations into two outputs. We use a third ```nn.Linear()``` to reduce from 8 to 2, followed by a final ```nn.Tanh()``` that keeps both outputs between -1 and 1, the same range as the accelerate and turn controls.

Although all of our layers are instantiated in the constructor, we haven't yet specified exactly how the outputs of one layer flow into another layer. We do that in the ```forward()``` function.

The ```forward()``` function is the forward pass through the neural network. We are going to pass the inputs (*x*) into the forward function, and this function tells us what to do with those inputs. Specifically, we are going to send the inputs through ```linear1```. Recall that ```linear1``` is anticipating 4 inputs and will produce 16 outputs. The input (*x*) is an 8 x 4 tensor, meaning a batch of 8 inputs containing 4 values each.

We send those outputs through ```activation1``` and store them in ```h1``` for convenience.

Let's take a look at just this part of the forward function. The following simulates what is happening in the first layers of the neural network:



In [0]:
linear1 = nn.Linear(4, 16)
activation1 = nn.Tanh()

In [0]:
x, _ = get_batch(train_data, 8, 0)

h1 = activation1(linear1(x))
print(h1)
print("shape of h1:", h1.size())

You are looking at the output activations after running a batch of 8 data points (each consisting of 4 sensor values) through a linear layer and a tanh activation layer.

The variable ```h1``` is a tensor of size 8 x 16. Our four inputs have become 16 values between -1 and 1. We still have a batch of 8 data points. These numbers don't mean anything yet. What is happening is that the inputs are multiplied by weights and bias values are added. This is what happens inside the ```nn.Linear()``` module. The network hasn't been trained, so the input values are multiplied by random weights in the linear expansion. Then those values are run through tanh.
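
To see that there is nothing more to it, the sketch below repeats the computation by hand and compares it to the module's output. It uses a random stand-in batch rather than the training data so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random weights and inputs repeatable

linear = nn.Linear(4, 16)
x = torch.rand(8, 4)  # stand-in batch: 8 data points, 4 values each

# What the modules compute
h_module = torch.tanh(linear(x))

# The same thing by hand: multiply by the (transposed) weights, add the bias, apply tanh
h_manual = torch.tanh(x @ linear.weight.T + linear.bias)

print(h_module.shape)
print(torch.allclose(h_module, h_manual, atol=1e-6))
```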

Let's now make a ```CarNet``` object:

In [0]:
net = CarNet()
if torch.cuda.is_available():
  net = net.to('cuda')
print(net)

The ```.to('cuda')``` call moves the network and all its weights to the GPU's memory if the GPU is available.
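
A common alternative to repeating the ```if torch.cuda.is_available()``` check is to pick a device once and reuse it. This is only a sketch of that pattern; the ```device``` variable name is my own choice and is not used elsewhere in this primer.

```python
import torch
import torch.nn as nn

# Pick the GPU if there is one, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Modules and tensors are moved with the same .to() call.
layer = nn.Linear(4, 16).to(device)
x = torch.rand(8, 4).to(device)

# The output lives on the same device as the inputs and weights.
out = layer(x)
print(out.device)
```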

We can look at all the weights hidden deep inside the PyTorch objects:

In [0]:
for i, param in enumerate(net.parameters()):
  print('Parameter', i)
  print(param, '\n')

The first tensor has a 16 x 4 shape because it stores the weights inside ```linear1```, which is fully connecting the 4 input values to 16 hidden nodes. A fully-connected layer going from 4 inputs to 16 hidden nodes would have 64 weights, and that is what we see.

The second tensor has 16 elements (PyTorch stores it as a one-dimensional tensor of shape 16, not 16 x 1). These are the bias weights inside ```linear1``` that are added after the 4 inputs are multiplied by the weights in the prior tensor.

The third tensor is 8 x 16 because these are the weights connecting the layer of 16 hidden nodes to the layer of 8 hidden nodes. The fourth tensor is the 8 bias weights.

The fifth and sixth tensors fully connect the eight hidden nodes to the two output values and apply bias weights.

The activation functions don't have weights. They just apply their non-linear function to whatever inputs are passed in, regardless of size or shape.
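
The parameter shapes can also be tallied programmatically. The sketch below uses ```nn.Sequential``` as a stand-in with the same layer sizes as ```CarNet``` (an assumption made so that the example runs on its own) and counts the values in each parameter tensor.

```python
import torch.nn as nn

# Stand-in with the same layer sizes as CarNet.
model = nn.Sequential(
    nn.Linear(4, 16), nn.Tanh(),
    nn.Linear(16, 8), nn.Tanh(),
    nn.Linear(8, 2), nn.Tanh(),
)

# Print the name, shape, and number of values of every parameter tensor.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())

# 4*16 + 16 + 16*8 + 8 + 8*2 + 2 = 234
total = sum(p.numel() for p in model.parameters())
print('total parameters:', total)
```

Only the three linear layers contribute parameters, which is why there are six tensors in total.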

# The Optimizer

Now we have a neural network. Running the network's forward pass will construct the computation graph. Then we will want to backpropagate a loss signal and adjust the weights. There are a lot of different optimization algorithms. We have to pick one and instantiate it.

We will use an optimization algorithm called ```Adam``` and tell it about the parameters of our neural network. Since the ```CarNet``` class is just a wrapper, we must pass in the parameters themselves, that is, the weights.

In [0]:
optimizer = optim.Adam(net.parameters())

To optimize, we also need to be able to calculate the network's loss. In this case we will use mean squared error (MSE).

In [0]:
loss_fn = nn.MSELoss()
if torch.cuda.is_available():
  loss_fn = loss_fn.to('cuda')

# Forward Pass

To run the neural net, you call the network object (```net```, an instance of ```CarNet```) as if it were a function and pass in the inputs. This automatically calls the ```forward()``` function.

Let's get a batch of data and move it to the GPU. Calling the neural network's forward function produces the predicted output, which we will save in ```y_hat``` to denote that it is a prediction and may not be accurate. (Indeed it won't be accurate because we haven't done any training yet.)

In [0]:
# Get a batch
x, y = get_batch(train_data, 8, 0)

# Move data to the GPU
if torch.cuda.is_available():
  x = x.to('cuda')
  y = y.to('cuda')
  
# Call the forward pass
y_hat = net(x)

Let's take a look at what the output looks like:

In [0]:
print(y_hat)
print('Shape of y_hat:', y_hat.size())

The output tensor is the same shape as the *y* tensor. This is important because next we are going to compare *y_hat* to *y* to see how much we missed the correct answer by.

# Compute Loss

Now we can compute the network's loss by passing *y_hat* and *y* into our loss function.

In [0]:
loss = loss_fn(y_hat, y)
print(loss)
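
To see what the loss function is doing, the sketch below computes mean squared error by hand on two small hand-written tensors (illustrative values, not network output) and compares it with ```nn.MSELoss```.

```python
import torch
import torch.nn as nn

# Hand-written predictions and targets: 2 data points, 2 controls each.
y_hat = torch.tensor([[0.5, 0.0], [-0.5, 1.0]])
y = torch.tensor([[1.0, 0.0], [0.0, 0.0]])

# Built-in loss
loss_builtin = nn.MSELoss()(y_hat, y)

# By hand: square every difference, then average over all elements.
# Squared differences are 0.25, 0, 0.25, 1, so the mean is 1.5 / 4 = 0.375.
loss_manual = ((y_hat - y) ** 2).mean()

print(loss_builtin.item(), loss_manual.item())
```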

# Backward Pass

Now that we have the loss computed, we can backpropagate the loss through our network and the optimizer will adjust the weights for us.

This is kind of weird, but to do this you call ```loss.backward()```. 

In [0]:
optimizer.zero_grad()
loss.backward()
optimizer.step()

What is going on here? When we called ```forward()```, each tensor remembered how it was created. For example, inside ```forward()```, *y_hat* remembered that it was created by ```activation3```, which was a tanh. The input to ```activation3``` remembered that it was created by ```linear3```, which was a ```Linear``` layer. Similarly, the *loss* variable remembered that it was created by the ```MSELoss``` function. You can see this remembering when you print tensors (the ```grad_fn``` part of each tensor).

When you call ```backward()``` on the *loss* variable, you are telling it to pass its value back up this chain (this chain is called a computation graph and is technically a directed acyclic graph). As the loss moves up the computation graph, it will encounter weights, such as those stored inside the ```Linear``` layers. Each weight has a gradient attached to it, and ```backward()``` fills in those gradients. This all happens because PyTorch's automatic differentiation knows how to compute the gradient of every operation in the graph, so you never have to worry about it. The weights themselves are changed afterward, by ```optimizer.step()```.
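
The same mechanism can be watched on a graph small enough to check by hand. This sketch uses a single made-up weight instead of a network; the numbers are arbitrary.

```python
import torch

# One trainable "weight" and one fixed input/target pair.
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
y = torch.tensor(1.0)

# Forward pass: every result remembers the operation that produced it.
y_hat = w * x
loss = (y_hat - y) ** 2
print(loss.grad_fn)  # the operation that created loss

# Backward pass: fills in w.grad.
loss.backward()

# By hand: d(loss)/dw = 2 * (w*x - y) * x = 2 * 5 * 3 = 30
print(w.grad)
```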

Let's look at the network's weights again. This time the weights are different---the optimizer adjusted them for us. We can also see the gradients that were computed (```.grad.data```) for each weight. 

In [0]:
for i, param in enumerate(net.parameters()):
  print('Parameter', i)
  print(param)
  print('gradients:')
  print(param.grad.data, '\n')

# Training Loop

Putting this all together gives the following training loop. 

There are a few details worth spelling out.
* Before calling ```backward()```, we should zero out all the gradients using ```optimizer.zero_grad()```. Otherwise PyTorch adds the new gradients to the ones left over from the previous batch.
* After calling ```backward()```, we should call ```optimizer.step()```, which uses the freshly computed gradients to update the weights.
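
Both points can be checked on a toy problem before running the full loop. The sketch below uses a single made-up weight and plain ```optim.SGD``` instead of ```Adam```, only because the SGD update is easy to verify by hand. It is not part of the training loop.

```python
import torch
import torch.optim as optim

w = torch.tensor(1.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

# Two backward passes without zeroing: the gradients add up (2 + 2).
(w * 2).backward()
(w * 2).backward()
grad_accumulated = w.grad.item()

# Zeroing first gives the gradient of just the latest pass.
optimizer.zero_grad()
(w * 2).backward()
grad_fresh = w.grad.item()

# step() applies the update: w <- w - lr * grad = 1.0 - 0.1 * 2.0
optimizer.step()

print(grad_accumulated, grad_fresh, w.item())
```

```Adam``` uses a more elaborate update formula than SGD, but ```zero_grad()``` and ```step()``` play the same roles.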

Run the training loop below and you should see the loss generally going down from epoch to epoch.

In [0]:
num_epochs = 100 # Number of epochs to run
batch_size = 8   # Batch size
num_batches = len(train_data) // batch_size # How many batches per epoch?
loss_history = [] # Keep a record of losses over time for plotting

# Make new network and optimizer
net = CarNet()
if torch.cuda.is_available():
  net = net.to('cuda')
optimizer = optim.Adam(net.parameters())


# Iterate a number of times
for epoch in range(num_epochs):
  epoch_loss = 0  # Keep track of how much loss we've accrued during this epoch
  # An epoch is a complete run through all the batches
  for i in range(num_batches):
    # Zero out gradients
    optimizer.zero_grad()
    # Get a batch
    x, y = get_batch(train_data, batch_size, i)
    if torch.cuda.is_available():
      x = x.to('cuda')
      y = y.to('cuda')
    # Forward pass
    y_hat = net(x)
    # Compute loss
    loss = loss_fn(y_hat, y)
    # Backward pass
    loss.backward()
    # Update the weights using the gradients
    optimizer.step()
    # Keep some stats
    epoch_loss = epoch_loss + loss.item()
  # Done with an epoch, print stats
  print('epoch', epoch, 'epoch loss', epoch_loss/num_batches)
  loss_history.append(epoch_loss/num_batches)   

The exact loss values aren't so important. The key is that the numbers go down. If we plot the loss per epoch, we should see this more clearly.

In [0]:
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()


plt.figure(1)
plt.clf()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.plot(numpy.array(loss_history))

# Evaluation

How do we know the neural network is trained sufficiently? The loss is much smaller than when we started. That is a good sign. But what we really need to do is test it on some data that the neural network has never seen before, to make sure it isn't just memorizing the training data.

In [0]:
test_data = make_data(100)

We also need a testing procedure. We could measure how much difference there is between the network's predicted outputs and the true outputs. But maybe that isn't that meaningful. Does it matter if the neural network turns the steering wheel a little bit more than in the original data? The real test would be to drive the car. But we know this isn't a real car and we don't have a simulator anyway. 

So I'm going to make up a testing procedure. I am going to say that if the difference between the predicted acceleration and the true acceleration is greater than 0.1 (10% of the maximum control magnitude of 1) then the car will crash. I'll do the same for the steering.

The code below will compute the number of crashes:

In [0]:
total_crashes = 0 # How many crashes?

# Prepare the network for evaluation. This turns off stuff inside the 
#    neural net modules that might result in randomness.
net.eval()

# Iterate through all the test data
for i in range(len(test_data)):
  # This time the batch is just a single datapoint
  x, y = get_batch(test_data, 1, i)
  if torch.cuda.is_available():
    x = x.to('cuda')
    y = y.to('cuda')
  # Forward pass
  y_hat = net(x)
  # Compute the difference
  diff = (y - y_hat).abs()
  # Create a mask, 1 if difference is greater than 0.1
  mask = diff > 0.1
  # Each data point can crash up to two times (once per control)
  crashes = mask.int().sum()
  total_crashes = total_crashes + crashes.item()

print(total_crashes)



You'll notice in the above code that tensors come with a lot of built-in mathematical functions. You can subtract tensors, call functions like ```.abs()``` and ```.sum()```. You can even apply boolean operators to tensor elements.
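
Here is the crash computation in isolation, on one hand-written data point (illustrative values, not network output), so that each intermediate tensor can be inspected.

```python
import torch

# One data point: true controls and made-up predicted controls.
y = torch.tensor([[0.60, -0.70]])
y_hat = torch.tensor([[0.55, -0.40]])

# Element-wise absolute difference: about 0.05 for accel, about 0.30 for turn.
diff = (y - y_hat).abs()

# Comparison produces a boolean tensor: only the turn error exceeds 0.1.
mask = diff > 0.1

# Convert booleans to integers and add them up to count the crashes.
crashes = mask.int().sum().item()

print(diff)
print(mask)
print(crashes)
```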

How did the car do? Did it crash too many times? You can go back and increase the number of training epochs. You should be able to bring the number of crashes down.