# Training a Neural Network in PyTorch

### Introduction

In this lesson, it's time to dive in and start training a neural network in PyTorch.  We'll see the full cycle, from downloading and batching our data, to writing a neural network class, to predicting with our network.

Let's get started.

### Loading our Data

The first step is to load our MNIST dataset.  This time, we'll use PyTorch to load our data.  The MNIST dataset is available through the torchvision library.  To use it, we'll need to import the `datasets` and `transforms` modules.

In [5]:
import torch
from torchvision import transforms, datasets

We download the MNIST training data with the following code.

In [6]:
train = datasets.MNIST("", train = True, download = True, 
                       transform = transforms.Compose([transforms.ToTensor()]))

We specify `train = True` for the training set and `download = True` to download the data.  The first argument, `""`, is the root directory where the data is stored; an empty string means the current working directory.  Then, to format the data as tensors, we use the `transform` argument, passing through:
```python
transforms.Compose([transforms.ToTensor()])
```

For the test data, we repeat the code, but this time set `train = False`.

In [7]:
test = datasets.MNIST("", train = False, download = True, 
                       transform = transforms.Compose([transforms.ToTensor()]))

### Exploring the Data

If we take a look at the `train` and `test` variables, we'll see that they consist of both our features and target values.

In [15]:
train.data.shape

torch.Size([60000, 28, 28])

In [16]:
train.targets.shape

torch.Size([60000])

Now let's reshape our data so that each observation is of length `28*28`.

In [45]:
train_vectors = train.data.reshape(60000, 784)

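So now the pixels of each image are represented in a single vector of data.  Note that `train.data` holds the raw pixel values (integers from 0 to 255); the `ToTensor` transform, which scales them to the range 0 to 1, is only applied when items are read through the dataset itself or through a DataLoader.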

In [61]:
train_vectors.shape

torch.Size([60000, 784])

And from there, we can organize our data into batches, so that we do not need to perform gradient descent on our entire dataset at once but instead can move through 50 observations at a time.

In [62]:
60000/50

1200.0

In [63]:
train_batched = train_vectors.reshape(1200, 50, 784)

In [64]:
train_batched.shape

torch.Size([1200, 50, 784])

So now we have 1200 batches, each of 50 observations, where each observation is represented by a vector of 784 pixels.
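
To see what this reshape does to the order of the observations, here is a minimal sketch on a small made-up tensor. The sizes (12 observations of 4 features, batches of 3) are assumptions chosen for readability, not MNIST values. It suggests that each batch is simply the next consecutive slice of rows.

```python
import torch

# Hypothetical toy data: 12 observations, 4 features each
toy_data = torch.arange(12 * 4).reshape(12, 4)

# Split into 4 batches of 3 observations, mirroring reshape(1200, 50, 784)
toy_batched = toy_data.reshape(4, 3, 4)

print(toy_batched.shape)  # torch.Size([4, 3, 4])

# Batch i holds rows 3*i through 3*i + 2 of the original data, in order
second_batch_matches = torch.equal(toy_batched[1], toy_data[3:6])
print(second_batch_matches)  # True
```

Because the rows are never shuffled, every epoch sees the same batches in the same order, and the number of observations has to divide evenly by the batch size for the reshape to work.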

We should also batch our target data.

In [60]:
train_targets_batched = train.targets.reshape(1200, 50)

### Defining our Neural Network

Now it's time to define our neural network.  Remember, this generally consists of two methods: the `__init__` method and the `forward` method.  We define the class and inherit from the `nn.Module` class.  Ok, let's try it.

In [55]:
from torch import nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Linear(28*28, 64)
        self.W2 = nn.Linear(64, 64)
        self.W3 = nn.Linear(64, 64)
        self.W4 = nn.Linear(64, 10)
        
    def forward(self, X):
        A1 = torch.sigmoid(self.W1(X))
        A2 = torch.sigmoid(self.W2(A1))
        A3 = torch.sigmoid(self.W3(A2))
        Z4 = self.W4(A3)
        return F.log_softmax(Z4, dim = 1)

Then, we'll initialize our neural network so we can pass some data through it.

In [56]:
net = Net()

Now let's see if we can pass a batch of data through our network.

In [65]:
first_batch = train_batched[0]

In [66]:
first_batch.shape

torch.Size([50, 784])

In [68]:
batch_predictions = net(first_batch.float())

In [69]:
batch_predictions.shape

torch.Size([50, 10])

Our code is looking good.  We see that for each observation we make 10 different predictions.  Let's take a look at one of them.

In [70]:
batch_predictions[0]

tensor([-2.4098, -2.6822, -1.9079, -2.5482, -2.2421, -2.4496, -2.6603, -2.7427,
        -2.2518, -1.6980], grad_fn=<SelectBackward>)

We may be a little surprised by the negative values here, but this is just because our network returns the `log softmax` instead of softmax.  The log softmax is simply the logarithm of the softmax, so each output is the log of a probability and is therefore never greater than zero.  If you're more comfortable looking at the original softmax outputs, that's fine.  We can get back to the softmax by exponentiating.

In [71]:
torch.exp(batch_predictions[0])

tensor([0.0898, 0.0684, 0.1484, 0.0782, 0.1062, 0.0863, 0.0699, 0.0644, 0.1052,
        0.1831], grad_fn=<ExpBackward>)

In [72]:
torch.exp(batch_predictions[0]).sum()

tensor(1., grad_fn=<SumBackward0>)

> Some of the benefits of log softmax are covered in the resources below.  The main benefit is that it tends to penalize confident wrong predictions more heavily than softmax does.
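
One way to see why working with logs can help is a small sketch in plain Python. The `log_softmax` and `softmax` helpers and the logits below are hypothetical, written only for illustration; they are not PyTorch's implementation. For a class the network considers extremely unlikely, the softmax probability underflows to zero, while the log softmax still carries a large negative value that a loss can penalize.

```python
import math

def log_softmax(logits):
    # Subtract the max for numerical stability, then subtract the log-sum-exp
    m = max(logits)
    shifted = [z - m for z in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

def softmax(logits):
    # Softmax is the exponential of the log softmax
    return [math.exp(v) for v in log_softmax(logits)]

# Hypothetical logits where the second class is extremely unlikely
logits = [0.0, -1000.0]
probs = softmax(logits)
log_probs = log_softmax(logits)

print(probs)      # [1.0, 0.0]
print(log_probs)  # [0.0, -1000.0]

# Exponentiating log softmax outputs gives probabilities that sum to 1
total = sum(math.exp(v) for v in log_softmax([1.0, 2.0, 3.0]))
```

The probability of the second class is indistinguishable from zero, but its log probability of -1000 keeps the information about just how unlikely it was.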



### Training our Network

Now let's move towards training our network.   

The main new item is that to update the parameters of our neural network, we need to use an optimizer, which updates our parameters through its `step` method.

In [73]:
import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.0005)
x_loss = nn.CrossEntropyLoss()

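> We pass our optimizer the parameters it should update, as well as the learning rate to use for the updates.  For the loss, note that `nn.CrossEntropyLoss` applies log softmax internally.  Our network already returns log softmax, so `nn.NLLLoss` would be the more conventional pairing, but applying log softmax a second time leaves the values unchanged, so the loss is the same either way.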

Next up is the training process.  

1. We go through the training data eight times.  
2. On each batch of data, we make predictions by passing our data through the neural network, and then calculate the loss. 
3. We call `loss.backward()` to have PyTorch use backpropagation to calculate the gradients for the parameters of our linear layers.
4. Then we use `optimizer.step()` to update the parameters.  
5. At the top of each iteration, before any of the above, we call `net.zero_grad()` to remove any previously calculated gradients on our linear layers.
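
The reason for the last step is that PyTorch adds newly computed gradients to whatever is already stored on a parameter. A minimal sketch with a hypothetical two-element parameter `w` (not part of our network) shows the effect:

```python
import torch

# Hypothetical parameter, used only to illustrate gradient accumulation
w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(w^2) is 2 * w
(w ** 2).sum().backward()
first_grad = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
accumulated_grad = w.grad.clone()

# Zero the gradient, then run backward again to get a fresh value
w.grad.zero_()
(w ** 2).sum().backward()
fresh_grad = w.grad.clone()

print(first_grad)        # tensor([2., 4.])
print(accumulated_grad)  # tensor([4., 8.])
print(fresh_grad)        # tensor([2., 4.])
```

Without the zeroing step, each batch's update would be based on the sum of the gradients from every batch seen so far rather than on the current batch alone.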

In [75]:
train_batched.shape, train_targets_batched.shape
# 1200 batches, 50 per batch, 784 features 

(torch.Size([1200, 50, 784]), torch.Size([1200, 50]))

In [86]:
x_loss(batch_predictions, train_targets_batched[0]) 

tensor(2.3563, grad_fn=<NllLossBackward>)

In [None]:
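for epoch in range(8):
    for X_batch, y_batch in zip(train_batched.float(), train_targets_batched):
        net.zero_grad()                           # clear gradients from the previous batch
        # X_batch is already flattened to shape (50, 784), so no reshape is needed
        prediction_batch = net(X_batch)           # forward pass
        loss = x_loss(prediction_batch, y_batch)  # compare predictions to targets
        loss.backward()                           # backpropagate to compute gradients
        optimizer.step()                          # update the parameters
    print(loss)                                   # loss on the last batch of the epoch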

### Evaluating the Neural Network

After training our neural network, the next step is to evaluate it.  Let's start by using our neural network to make predictions on the test set.

In [89]:
first_test_obs = test.data.view(-1, 784).float()

In [90]:
first_test_obs.shape

torch.Size([10000, 784])

In [91]:
predictions_test = net(first_test_obs)

Ok, let's take a look at some of these predictions.

In [92]:
torch.set_printoptions(sci_mode = False)
# we turn off scientific notation
torch.exp(predictions_test[0])

tensor([    0.0000,     0.0011,     0.0010,     0.0013,     0.0001,     0.0000,
            0.0000,     0.9943,     0.0001,     0.0023], grad_fn=<ExpBackward>)

And we can identify the most likely digit for each observation with the argmax function.

In [93]:
torch.argmax(predictions_test, axis = 1)[:20]

tensor([7, 2, 1, 0, 4, 1, 4, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

Comparing these with the actual targets, it looks like almost all of our first 20 predictions are correct.

In [94]:
test.targets[:20]

tensor([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

We can see that this does a good job of taking in data and predicting the targets.
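
To put a number on this for just the 20 observations printed above, accuracy is the fraction of positions where the prediction matches the target. The sketch below copies those 20 values into plain Python lists; the variable names are our own.

```python
# First 20 predictions and targets, copied from the outputs above
predicted = [7, 2, 1, 0, 4, 1, 4, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4]
actual    = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4]

# Positions where the prediction differs from the target
misses = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]

# Accuracy is the share of matching positions
sample_accuracy = (len(actual) - len(misses)) / len(actual)

print(misses)           # [8]
print(sample_accuracy)  # 0.95
```

The only miss in this sample is the ninth observation, where the network predicted a 4 for a 5. The same measure over the full test set is what `accuracy_score` computes.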

In [95]:
from sklearn.metrics import accuracy_score

accuracy_score(test.targets, torch.argmax(predictions_test, axis = 1))

0.9119

### Wrapping Up

Before finishing up, we should point out that we can also use a DataLoader to take our data and batch it.  For example, we can start by loading our data again.

In [98]:
from torchvision import transforms, datasets

train = datasets.MNIST("", train = True, download = True, 
                       transform = transforms.Compose([transforms.ToTensor()]))

And then use the DataLoader to organize our data into batches of 50.

In [99]:
trainset = torch.utils.data.DataLoader(train, batch_size = 50, shuffle = True)
testset = torch.utils.data.DataLoader(test, batch_size = 50, shuffle = True)

Now let's take a look at the shape of each batch.  

In [100]:
for x_batch, y_batch in trainset:
    print(x_batch.shape)
    print(y_batch.shape)
    break

torch.Size([50, 1, 28, 28])
torch.Size([50])


So we can see that in each batch there are 50 observations, each still shaped as a 1 x 28 x 28 image.  Inside the training loop, we'll reshape this data so that each observation is a row of 784 features.
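
As a minimal sketch of that reshape, the example below builds a DataLoader over random tensors shaped like MNIST images and flattens one batch with `view`. The `fake_images` and `fake_labels` data are made up so the example runs without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 200 random "images" with digit labels
fake_images = torch.rand(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))

loader = DataLoader(TensorDataset(fake_images, fake_labels),
                    batch_size=50, shuffle=True)

# Take a single batch and flatten each image into a row of 784 features
x_batch, y_batch = next(iter(loader))
x_flat = x_batch.view(-1, 28 * 28)

print(len(loader))    # 4
print(x_batch.shape)  # torch.Size([50, 1, 28, 28])
print(x_flat.shape)   # torch.Size([50, 784])
```

With `shuffle = True`, the batches are drawn in a new random order each epoch, which the fixed `reshape` batches cannot do.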

In [106]:
for epoch in range(8):
    for X_batch, y_batch in trainset:
        net.zero_grad()  
        X_reshaped = X_batch.view(-1,28*28)
        prediction_batch = net(X_reshaped)
        loss = x_loss(prediction_batch, y_batch) 
        loss.backward()  
        optimizer.step()
    print(loss)

tensor(0.2257, grad_fn=<NllLossBackward>)
tensor(0.2327, grad_fn=<NllLossBackward>)
tensor(0.2048, grad_fn=<NllLossBackward>)
tensor(0.1304, grad_fn=<NllLossBackward>)
tensor(0.0523, grad_fn=<NllLossBackward>)
tensor(0.1471, grad_fn=<NllLossBackward>)
tensor(0.0372, grad_fn=<NllLossBackward>)
tensor(0.0449, grad_fn=<NllLossBackward>)


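Ok, that's better.  The loss printed at the end of each epoch trends downward, from about 0.23 after the first epoch to about 0.04 after the last.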

### Summary

In this lesson, we went through the full cycle, from loading our data to training a neural network in PyTorch.  We saw two different ways to organize our data into batches: using the `reshape` method and using the `torch.utils.data.DataLoader` mechanism.  

After placing our data in the proper shape, we then defined our neural network and passed a single batch through it to make predictions.  Then we moved to training our network.  This involved initializing an optimizer.

```python
import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.0005)
```

We then looped through our training data.  The key code was the following:

```python
for epoch in range(8):
    for X_batch, y_batch in trainset:
        net.zero_grad()  
        X_reshaped = X_batch.view(-1,28*28)
        prediction_batch = net(X_reshaped)
        loss = x_loss(prediction_batch, y_batch) 
        loss.backward()  
        optimizer.step()
    print(loss)
```

### Resources

[Log Softmax Pytorch message board](https://discuss.pytorch.org/t/logsoftmax-vs-softmax/21386/2)

[Log Probability Wikipedia](https://en.wikipedia.org/wiki/Log_probability)