# Training a Neural Network in PyTorch

### Introduction

In this lesson, it's time to dive in and start training a neural network in PyTorch.  We'll see the full cycle, from downloading and batching our data, to writing a neural network class, to predicting with our network.

Let's get started.

### Loading our Data

The first step is to load our MNIST dataset.  This time, we'll use PyTorch to load our data.  The MNIST dataset is available through the torchvision library.  To use it, we'll need to import the `datasets` and `transforms` modules.

In [106]:
import torch
from torchvision import transforms, datasets

We download the MNIST training data with the following code.

In [107]:
train = datasets.MNIST("", train = True, download = True, 
                       transform = transforms.Compose([transforms.ToTensor()]))

We specify `train = True` for the training set and `download = True` to download the data.  The first argument is the root directory where the data is stored; passing `""` uses the current working directory.  Then, to format the data as tensors, we use the `transform` argument, passing through:
```python
transforms.Compose([transforms.ToTensor()])
```

For the test data, we repeat the code, but this time set `train = False`.

In [108]:
test = datasets.MNIST("", train = False, download = True, 
                       transform = transforms.Compose([transforms.ToTensor()]))

### Exploring the Data

If we take a look at the `train` and `test` variables, we'll see that they consist of both our features and target values.

In [109]:
train.data.shape

torch.Size([60000, 28, 28])

In [110]:
train.targets.shape

torch.Size([60000])

We can select an individual value like so.

In [111]:
train.targets[0]

tensor(5)

### Batching our Data

Generally, we will not pass all 60000 observations through the network at once, but rather will calculate the gradient on groups of observations, called batches.

To group our observations into batches, PyTorch provides us with the `DataLoader`.

In [112]:
trainset = torch.utils.data.DataLoader(train, batch_size = 50, shuffle = True)
testset = torch.utils.data.DataLoader(test, batch_size = 50, shuffle = True)
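
As a small illustration of how a `DataLoader` splits data, the sketch below batches a synthetic dataset rather than MNIST.  The names `toy_dataset` and `toy_loader`, and the choice of 120 observations, are assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a dataset: 120 observations with 784 features each
features = torch.randn(120, 784)
targets = torch.randint(0, 10, (120,))
toy_dataset = TensorDataset(features, targets)

toy_loader = DataLoader(toy_dataset, batch_size=50, shuffle=True)

# Collect the number of observations in each batch
batch_sizes = [len(y_batch) for x_batch, y_batch in toy_loader]
print(batch_sizes)  # the last batch holds whatever is left over
```

Because 120 is not a multiple of 50, the final batch is smaller than the rest.  With MNIST's 60000 training observations and a batch size of 50, every batch is full.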

Now if we take a look at the data, we'll see that it's placed into groups of fifty observations.  Let's take a look at the shape of each batch.

In [113]:
for x_batch, y_batch in trainset:
    print(x_batch.shape)
    print(y_batch.shape)
    break

torch.Size([50, 1, 28, 28])
torch.Size([50])


And really, we'll eventually reshape this data so that each observation is a row of 784 features.

In [114]:
for x_batch, y_batch in trainset:
    print(x_batch.view(-1,28*28).shape)
    print(y_batch.shape)
    break

torch.Size([50, 784])
torch.Size([50])


Ok, that's better.
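
To see what the `view(-1, 28*28)` call is doing, here is a short sketch on a synthetic batch (the tensor `fake_batch` is made up for illustration).  It also compares the result with `torch.flatten`, which is one alternative way to write the same reshape.

```python
import torch

# Synthetic batch shaped like an MNIST batch: 50 images, 1 channel, 28x28 pixels
fake_batch = torch.arange(50 * 28 * 28, dtype=torch.float32).view(50, 1, 28, 28)

flat = fake_batch.view(-1, 28 * 28)                 # -1 lets PyTorch infer the batch dimension
flat_alt = torch.flatten(fake_batch, start_dim=1)   # equivalent for this layout

print(flat.shape)
print(torch.equal(flat, flat_alt))
# The first row holds the pixels of the first image, read row by row
print(torch.equal(flat[0], fake_batch[0, 0].reshape(-1)))
```

The `-1` is handy because the final batch of a dataset may be smaller than the others, and the reshape still works without hard-coding the batch size.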

### Defining our Neural Network

Now it's time to define our neural network.  Remember, this generally consists of two methods: the `__init__` method and the `forward` method.  We define the class and inherit from the `nn.Module` class.  Ok, let's try it.

In [64]:
from torch import nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Linear(28*28, 64)
        self.W2 = nn.Linear(64, 64)
        self.W3 = nn.Linear(64, 64)
        self.W4 = nn.Linear(64, 10)
        
    def forward(self, X):
        A1 = torch.sigmoid(self.W1(X))
        A2 = torch.sigmoid(self.W2(A1))
        A3 = torch.sigmoid(self.W3(A2))
        Z4 = self.W4(A3)
        return F.log_softmax(Z4, dim = 1)

Then, we'll initialize our neural network so we can pass some data through it.

In [65]:
net = Net()
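
One way to check what the layers in `__init__` set up is to list the learnable parameters and count them.  The sketch below uses `nn.Sequential` with the same layer sizes as the `Net` class above, purely to keep the example short and self-contained; it is a stand-in, not the lesson's class.

```python
import torch
from torch import nn

# Same layer sizes as the Net class, written with nn.Sequential for brevity
layers = nn.Sequential(
    nn.Linear(28 * 28, 64), nn.Sigmoid(),
    nn.Linear(64, 64), nn.Sigmoid(),
    nn.Linear(64, 64), nn.Sigmoid(),
    nn.Linear(64, 10), nn.LogSoftmax(dim=1),
)

# Each linear layer contributes a weight matrix and a bias vector
for name, param in layers.named_parameters():
    print(name, tuple(param.shape))

total = sum(p.numel() for p in layers.parameters())
print(total)  # 784*64 + 64 + 2*(64*64 + 64) + 64*10 + 10 = 59210
```

These are the values that the optimizer will later update during training.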

Now let's see if we can pass some data through our network.

In [66]:
x_batch.shape

torch.Size([50, 1, 28, 28])

Remember, we should reshape our data so that each observation is of length `28*28`.

In [67]:
x_batch.view(-1, 28*28).shape

torch.Size([50, 784])

In [68]:
batch_predictions = net(x_batch.view(-1, 28*28))

In [69]:
batch_predictions.shape

torch.Size([50, 10])

Our code is looking good.  We see that for each observation we get 10 outputs, one for each digit class.  Let's take a look at one of them.

In [75]:
batch_predictions[0]

tensor([-2.0290, -2.3152, -2.4808, -2.4230, -2.4528, -2.2802, -2.1875, -2.1077,
        -2.4909, -2.3796], grad_fn=<SelectBackward>)

We may be a little surprised by the output here, but this is just because we are using `log softmax` instead of softmax.  The log softmax is literally just the logarithm of the softmax.  So if you're more comfortable looking at the original softmax outputs, that's fine.  We can get back to the softmax by exponentiating.

In [78]:
torch.exp(batch_predictions[0])

tensor([0.1315, 0.0987, 0.0837, 0.0887, 0.0861, 0.1023, 0.1122, 0.1215, 0.0828,
        0.0926], grad_fn=<ExpBackward>)

In [79]:
torch.exp(batch_predictions[0]).sum()

tensor(1.0000, grad_fn=<SumBackward0>)

> Some of the benefits of log softmax are covered in the resources below.  The main benefits are numerical stability and that, paired with a negative log-likelihood loss, it penalizes confident wrong predictions heavily.
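
The relationship between softmax and log softmax can be checked on a few made-up scores.  The sketch below is only an illustration: the `scores` and `extreme` tensors are invented, and the second half shows one situation where computing the logarithm of the softmax by hand runs into trouble.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.1]])

log_probs = F.log_softmax(scores, dim=1)
probs = F.softmax(scores, dim=1)

print(torch.allclose(log_probs, torch.log(probs)))  # same values, up to rounding
print(torch.exp(log_probs).sum(dim=1))              # probabilities sum to 1

# With extreme scores, taking log(softmax) by hand underflows to -inf,
# while log_softmax stays finite.
extreme = torch.tensor([[1000.0, 0.0]])
naive = torch.log(F.softmax(extreme, dim=1))
stable = F.log_softmax(extreme, dim=1)
print(naive)
print(stable)
```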



### Training our Network

Now let's move towards training our network.  A lot of the code should be pretty understandable from what we learned about gradients in previous lessons.

The main new item is that, to update the parameters of our neural network through a `step` function, we need to initialize an `optimizer`.

In [81]:
import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.0005)
x_loss = nn.CrossEntropyLoss()

> We pass our optimizer the parameters it should update, as well as the learning rate with which to perform gradient descent.
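
One detail worth checking: our network already returns log softmax outputs, while `nn.CrossEntropyLoss` applies log softmax itself before computing the negative log-likelihood.  The sketch below, on made-up scores and labels, suggests that this pairing still gives the expected loss, because applying log softmax to values that are already log-probabilities leaves them unchanged.  A more conventional pairing for a network that ends in log softmax would be `nn.NLLLoss`.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # made-up raw scores for 4 observations
labels = torch.tensor([3, 0, 9, 1])    # made-up target classes

log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, labels)
nll_on_log_probs = nn.NLLLoss()(log_probs, labels)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, labels)

# All three agree: log softmax of log-probabilities leaves them unchanged
print(ce_on_logits, nll_on_log_probs, ce_on_log_probs)
```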

Next up is the training process.  We go through the training data eight times.  On each batch of data, we make predictions by passing our data through the neural network, and then calculate the loss.  We call `loss.backward()` to use backpropagation to calculate the gradients for our linear layers.

Then we use `optimizer.step()` to update the parameters.  At the top of our training loop, we call `net.zero_grad()` to clear any previously calculated gradients on our linear layers.  After each epoch, we print the loss from that epoch's final batch.

In [105]:
for epoch in range(8):
    for X_batch, y_batch in trainset:
        net.zero_grad()  
        X_reshaped = X_batch.view(-1,28*28)
        prediction_batch = net(X_reshaped)
        loss = x_loss(prediction_batch, y_batch) 
        loss.backward()  
        optimizer.step()
    print(loss)

tensor(0.0568, grad_fn=<NllLossBackward>)
tensor(0.1170, grad_fn=<NllLossBackward>)
tensor(0.0610, grad_fn=<NllLossBackward>)
tensor(0.0823, grad_fn=<NllLossBackward>)
tensor(0.0437, grad_fn=<NllLossBackward>)
tensor(0.0288, grad_fn=<NllLossBackward>)
tensor(0.0223, grad_fn=<NllLossBackward>)
tensor(0.0044, grad_fn=<NllLossBackward>)
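
To see why the gradients are cleared at the start of every iteration, here is a tiny sketch with a single made-up parameter `w`.  It is not part of the training code above; it only illustrates that PyTorch adds new gradients to whatever is already stored.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without zeroing: the gradients add up
(3 * w).backward()
(3 * w).backward()
accumulated = w.grad.item()

# Reset the gradient, then run one more backward pass
w.grad.zero_()
(3 * w).backward()
fresh = w.grad.item()

print(accumulated, fresh)  # 6.0 3.0
```

Without the reset, each batch's update would be based on a mix of gradients from earlier batches.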


### Evaluating the Neural Network

After training our neural network, the next step is to evaluate the neural network.  Let's start by using our neural network to make predictions on the test set.

In [91]:
predictions_test = net(testset.dataset.data.view(-1, 784).float())

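Ok, let's take a look at some of these predictions.  Note that `testset.dataset.data` holds the raw pixel values (0 to 255), not the values scaled to between 0 and 1 by `ToTensor` that we trained on, so these predictions come from unscaled inputs.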

In [100]:
torch.set_printoptions(sci_mode = False)

In [101]:
torch.exp(predictions_test[0])

tensor([    0.0000,     0.0006,     0.0001,     0.0056,     0.0000,     0.0000,
            0.0000,     0.9919,     0.0000,     0.0017], grad_fn=<ExpBackward>)

And we can identify the most likely digit for each observation with the `argmax` function.

In [103]:
torch.argmax(predictions_test, axis = 1)[:20]

tensor([7, 2, 1, 0, 4, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 3, 9, 7, 3, 4])

Comparing these against the first 20 targets below, we can see that almost all of our predictions are correct (18 of 20).

In [104]:
testset.dataset.targets[:20]

tensor([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

We can see that this does a good job of taking in data and predicting the targets.  Let's measure the accuracy on the full test set.

In [44]:
from sklearn.metrics import accuracy_score

accuracy_score(testset.dataset.targets, torch.argmax(predictions_test, axis = 1))

0.9629
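
An alternative way to evaluate is to iterate over the `DataLoader`, so that each batch goes through the same `ToTensor` scaling used during training, and to wrap the loop in `torch.no_grad()` since no gradients are needed.  The sketch below shows the pattern on synthetic data with an untrained placeholder model; `fake_loader` and `model` are assumptions for illustration, and in the lesson they would be `testset` and the trained `net`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins: 200 fake "images" already scaled to [0, 1], with random labels
fake_images = torch.rand(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=50)

# Untrained placeholder model; in the lesson this would be the trained `net`
model = nn.Sequential(nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))

correct, total = 0, 0
model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    for x_batch, y_batch in fake_loader:
        log_probs = model(x_batch.view(-1, 28 * 28))
        predicted = torch.argmax(log_probs, dim=1)
        correct += (predicted == y_batch).sum().item()
        total += len(y_batch)

accuracy = correct / total
print(accuracy)  # expect roughly chance level: the model is untrained and the labels are random
```

The accuracy here is computed with plain tensor operations, so the pattern does not depend on scikit-learn.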

### Resources

[Log Softmax Pytorch message board](https://discuss.pytorch.org/t/logsoftmax-vs-softmax/21386/2)

[Log Probability Wikipedia](https://en.wikipedia.org/wiki/Log_probability)