<a href="https://colab.research.google.com/github/Maupin1991/ML_pytorch_tutorial/blob/master/4_TrainNeuralNetworkWithPytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a neural network

We now know all the pieces needed to build a neural network and feed it some input. We're ready to work at Google!

...

...

Not really...



We forgot to train the network! It has to learn a mapping between the inputs and the desired outputs.
* First, we define a criterion that tells the network right from wrong.
* Then, we show the training set to the network and update its weights to improve its performance.

In [0]:
import torch
from torch import nn
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt
from torch import optim


np.random.seed(99)
torch.manual_seed(10);

## Model

In [0]:
# Model class
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)
        
    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)
        
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.log_softmax(self.fc4(x), dim=1)
        
        return x

## Dataset

We are going to use the MNIST dataset. As we know, it contains 60k training images and 10k testing images. That is a lot of images, and we would have to wait a long time before seeing our training results. We can use the `Subset` class of torch to reduce the dataset size. Of course, the network will then have fewer examples to learn from, but as we will see, we can reach good accuracy even with this reduced set.

The `Subset` class takes as input the dataset to sample from and the indices to use. We can easily generate (and **shuffle**) the indices with numpy.

In [0]:
# Define a transform that converts the images to tensors (pixel values scaled to [0, 1])
transform = transforms.Compose([transforms.ToTensor()])
n_train = 5000
n_test = 1000
# validation set is 20% of the training set
valid_size = 0.2

epochs = 100

# Download and load the data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, 
                          train=True, transform=transform)
testset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, 
                          train=False, transform=transform)

# Splitting train/validation and testing set

# training set
train_idxs = np.arange(len(trainset))
np.random.shuffle(train_idxs)
train_idxs = train_idxs[:n_train].tolist()

# subsample validation set from training set
n_valid = int(np.floor(n_train * valid_size))
valid_idxs = train_idxs[:n_valid]
train_idxs = train_idxs[n_valid:]

# testing set
test_idxs = np.arange(len(testset))
np.random.shuffle(test_idxs)
test_idxs = test_idxs[:n_test].tolist()

# extract only the selected indices
train_subset = torch.utils.data.Subset(trainset, train_idxs)
valid_subset = torch.utils.data.Subset(trainset, valid_idxs)
test_subset = torch.utils.data.Subset(testset, test_idxs)

# data loader (finally)
trainloader = torch.utils.data.DataLoader(train_subset, batch_size=32)
validloader = torch.utils.data.DataLoader(valid_subset, batch_size=32)
testloader = torch.utils.data.DataLoader(test_subset, batch_size=32)
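
The index-splitting idea used above can be tried on a small scale without downloading anything. The following is only a sketch: the random `toy_dataset` is a made-up stand-in for MNIST, and every `toy_*` name is hypothetical, chosen so that none of the notebook's variables is overwritten.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for MNIST (an assumption): 100 fake 28x28 "images", labels 0..9
toy_dataset = TensorDataset(torch.randn(100, 1, 28, 28),
                            torch.arange(100) % 10)

# Shuffle all indices, keep the first 50, then carve out 20% for validation
rng = np.random.default_rng(0)
toy_idxs = rng.permutation(len(toy_dataset))[:50].tolist()
n_toy_valid = int(0.2 * len(toy_idxs))
toy_valid_idxs = toy_idxs[:n_toy_valid]
toy_train_idxs = toy_idxs[n_toy_valid:]

toy_train = Subset(toy_dataset, toy_train_idxs)
toy_valid = Subset(toy_dataset, toy_valid_idxs)

# The two subsets share no index, so validation samples are never trained on
assert set(toy_train_idxs).isdisjoint(toy_valid_idxs)
print(len(toy_train), len(toy_valid))  # 40 10

# A DataLoader over a Subset yields batches as usual
toy_images, toy_labels = next(iter(DataLoader(toy_train, batch_size=8)))
print(toy_images.shape, toy_labels.shape)
```

Because the validation indices are sliced from the same shuffled list as the training indices, the two sets cannot overlap.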


## Loss function


The nn module defines specific loss classes that we can use to measure the error of our network.
Read the documentation carefully. The cross-entropy loss module (`nn.CrossEntropyLoss`) combines a log-softmax layer and a negative log-likelihood loss (useful for classification problems with C classes).

This means that, with that loss, we pass the raw scores of the last layer and need no softmax layer at the end! The other option is to separate the steps: include the log-softmax layer in the network and use the `NLLLoss` module of the package nn. Our `Network` already ends with `log_softmax`, so this is the option we use here.

In [0]:
criterion = nn.NLLLoss()
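
To check that the two options described above give the same number, here is a small sketch on made-up scores. The names `toy_logits` and `toy_targets` are hypothetical, and a local random generator is used so the notebook's global seed is left untouched.

```python
import torch
from torch import nn

# Made-up raw scores for 4 samples and 10 classes, plus made-up labels
gen = torch.Generator().manual_seed(0)
toy_logits = torch.randn(4, 10, generator=gen)
toy_targets = torch.tensor([3, 0, 9, 1])

# Option 1: feed the raw scores directly to CrossEntropyLoss
ce = nn.CrossEntropyLoss()(toy_logits, toy_targets)

# Option 2: apply log-softmax ourselves, then use NLLLoss
nll = nn.NLLLoss()(torch.log_softmax(toy_logits, dim=1), toy_targets)

print(ce.item(), nll.item())
assert torch.allclose(ce, nll)
```

Passing log-probabilities to `CrossEntropyLoss` would apply log-softmax twice, so each network should be paired with the matching loss.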

## Backpropagation

We need to update the weights of the network in the direction that gives us a smaller error. To do this, we use **gradient descent**, propagating the gradient of the loss backward through all layers of the network.

Once the error has been propagated backwards, we update each weight by subtracting the gradient of the loss w.r.t. that weight, multiplied by the **learning rate**.

We will use **autograd**, the PyTorch module that automatically calculates the gradients of tensors. This module keeps track of all the operations performed on tensors created with `requires_grad=True`, so that calling `backward()` can apply the chain rule through them.

In [0]:
x = torch.rand(1, 1, requires_grad=True)
print("tensor x:",x)
print("x.requires_grad=",x.requires_grad)
print("gradient of x:", x.grad)
squared = x**2
squared.backward()

# derivative of x**2 is 2*x
# notice the grad_fn attribute of the tensor
print("manually computed gradient:", x*2)
print("autograd:", x.grad)

## Loss + Autograd

With PyTorch, we run data forward through the network to calculate the loss, then go backwards to calculate the gradients of the loss with respect to the weights. Once we have the gradients, we can make a gradient descent step.

In [0]:
net = Network()
images, labels = next(iter(trainloader))
output = net(images)
loss = criterion(output, labels)

print('Before backward pass: \n', net.fc1.weight.grad)

loss.backward()

print('After backward pass: \n', net.fc1.weight.grad)
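
Before handing the update over to an optimizer, it can help to see one gradient descent step written by hand. This is a minimal sketch under stated assumptions: `toy_net`, `toy_inputs`, `toy_targets` and `learning_rate` are hypothetical stand-ins (a tiny linear regression), not the notebook's network or data.

```python
import torch
from torch import nn

# Tiny stand-in model and batch (assumptions, not the MNIST network)
gen = torch.Generator().manual_seed(0)
toy_net = nn.Linear(3, 1)
toy_inputs = torch.randn(8, 3, generator=gen)
toy_targets = torch.randn(8, 1, generator=gen)
learning_rate = 0.01

loss_before = nn.functional.mse_loss(toy_net(toy_inputs), toy_targets)
loss_before.backward()            # fills .grad for every parameter

# Gradient descent by hand: w <- w - learning_rate * dL/dw
with torch.no_grad():             # the update itself must not be tracked
    for param in toy_net.parameters():
        param -= learning_rate * param.grad
        param.grad.zero_()        # reset, or the next backward() adds on top

loss_after = nn.functional.mse_loss(toy_net(toy_inputs), toy_targets)
print(loss_before.item(), loss_after.item())
```

With a small enough learning rate, the loss on the same batch should be lower after the step.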

## Optimizers

Since training networks is what PyTorch users do all the time, it would be better not to reinvent the wheel every time. In torch there is a module called **optim**, which contains implementations of several optimizers.

* **optim.SGD** is the most basic one: vanilla Stochastic Gradient Descent. It also accepts an optional `momentum` argument, which we've seen is useful for escaping poor local minima
* **optim.Adam** adapts the step size of each parameter using running averages of the gradient and of its square (controlled by `betas`)

REMEMBER: PyTorch accumulates gradients in each parameter's `.grad` at every `backward()` call, so we have to "clean up" before the next step. We can use **optimizer.zero_grad()**.
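
The accumulation is easy to see on a single tensor. In this sketch, `toy_w` and `toy_opt` are hypothetical names. Depending on the PyTorch version, `zero_grad()` either sets the gradient to `None` or fills it with zeros, so the example does not rely on either behaviour.

```python
import torch

toy_w = torch.tensor([1.0, 2.0], requires_grad=True)

(toy_w ** 2).sum().backward()
first = toy_w.grad.clone()
print(first)             # tensor([2., 4.])

# A second backward() without clearing adds to the stored gradient
(toy_w ** 2).sum().backward()
second = toy_w.grad.clone()
print(second)            # tensor([4., 8.])

# zero_grad() resets the gradients of the parameters the optimizer manages
toy_opt = torch.optim.SGD([toy_w], lr=0.1)
toy_opt.zero_grad()

(toy_w ** 2).sum().backward()
print(toy_w.grad)        # tensor([2., 4.])
```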

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:  ", device)

In [0]:
net = Network()

In [0]:
# Optimizers require the parameters to optimize and a learning rate
optimizer = optim.SGD(net.parameters(), lr=0.03)

## We're ready to train our network!

In [0]:
net.to(device)
train_losses, valid_losses = [], []

# train model and evaluate with validation set
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        
        log_ps = net(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
    else:
        valid_loss = 0
        accuracy = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            net.eval()
            for images, labels in validloader:
                images, labels = images.to(device), labels.to(device)
                log_ps = net(images)
                # .item() keeps a plain Python float (also safe to plot when running on GPU)
                valid_loss += criterion(log_ps, labels).item()
                
                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy += torch.mean(equals.float()).item()
        
        net.train()
        
        train_losses.append(running_loss/len(trainloader))
        valid_losses.append(valid_loss/len(validloader))

        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.3f}.. ".format(train_losses[-1]),
              "Validation Loss: {:.3f}.. ".format(valid_losses[-1]),
              "Validation Accuracy: {:.3f}".format(accuracy/len(validloader)))

In [0]:
# evaluate on testing set
with torch.no_grad():
    net.eval()
    test_loss = 0
    accuracy = 0

    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        output = net(images)
        test_loss += criterion(output, labels)

        top_p, top_class = output.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor))

    print("Test Accuracy: {:.3f}".format(accuracy/len(testloader)))
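
The cell above averages the per-batch accuracies, which gives a smaller final batch the same weight as a full one (with 1000 test images and a batch size of 32, the last batch holds only 8 images). A possible alternative is to count correct predictions over the whole set. The `evaluate` helper below is a hypothetical sketch, and `AlwaysZero` is a made-up model used only to make the example self-contained.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device="cpu"):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

class AlwaysZero(nn.Module):
    """Made-up model that predicts class 0 for every input."""
    def forward(self, x):
        scores = torch.zeros(x.shape[0], 10)
        scores[:, 0] = 1.0
        return scores

# 5 fake images; 3 of them really are class 0
toy_loader = DataLoader(
    TensorDataset(torch.zeros(5, 1, 28, 28), torch.tensor([0, 0, 1, 0, 2])),
    batch_size=2)

toy_accuracy = evaluate(AlwaysZero(), toy_loader)
print(toy_accuracy)  # 0.6
```

Here the batches have sizes 2, 2 and 1, so averaging the batch accuracies (1.0, 0.5, 0.0) would give 0.5 instead of the exact 0.6. In the notebook, the same helper could be called as `evaluate(net, testloader, device)`.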

We have seen that everything works fine. But what happens if we try to add some noise to the images?

In [0]:
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Lambda(lambda x : x + torch.randn_like(x)/2)
                              ])

testset_noise = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, 
                          train=False, transform=transform)

# testing set
test_idxs = np.arange(len(testset_noise))
np.random.shuffle(test_idxs)
test_idxs = test_idxs[:n_test].tolist()

test_subset = torch.utils.data.Subset(testset_noise, test_idxs)
testloader_noisy = torch.utils.data.DataLoader(test_subset, batch_size=32)


In [0]:
# evaluate on noisy set
with torch.no_grad():
    net.eval()
    test_loss = 0
    accuracy = 0

    for images, labels in testloader_noisy:
        images, labels = images.to(device), labels.to(device)
        output = net(images)
        test_loss += criterion(output, labels)

        top_p, top_class = output.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor))

    print("Test Noise Accuracy: {:.3f}".format(accuracy/len(testloader_noisy)))

We can see that accuracy has dropped. Why did this happen?

In [0]:
plt.plot(train_losses, label='Training loss')
plt.plot(valid_losses, label='Validation loss')
plt.legend(frameon=False);

We see that the training loss and the validation loss are diverging. This is a clear sign of overfitting: the model is not able to generalize when new **unseen** data are presented as input.

We can try to reduce this gap by applying some regularization technique.
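
One such technique, used further down through the `weight_decay` argument of `optim.SGD`, is weight decay: before each update, a small multiple of the weight itself is added to its gradient, which pulls the weights towards zero. The sketch below checks this on a made-up two-element tensor. `toy_w`, `toy_sgd`, `toy_lr` and `toy_wd` are hypothetical names, and the values simply mirror the ones used later in the notebook.

```python
import torch
from torch import optim

toy_lr, toy_wd = 0.03, 0.01
toy_w = torch.tensor([1.0, -2.0], requires_grad=True)
toy_sgd = optim.SGD([toy_w], lr=toy_lr, weight_decay=toy_wd)

toy_loss = (toy_w ** 2).sum()     # gradient of this loss is 2 * toy_w
toy_loss.backward()

# Update we expect by hand: w - lr * (grad + weight_decay * w)
expected = toy_w.detach() - toy_lr * (toy_w.grad + toy_wd * toy_w.detach())

toy_sgd.step()
print(toy_w.detach(), expected)
assert torch.allclose(toy_w.detach(), expected)
```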

## Overfitting

In [0]:
class DropoutNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)

        # Dropout module with 0.1 drop probability
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(x.shape[0], -1)

        # Now with dropout
        x = self.dropout(torch.relu(self.fc1(x)))
        x = self.dropout(torch.relu(self.fc2(x)))
        x = self.dropout(torch.relu(self.fc3(x)))

        # output so no dropout here
        x = torch.log_softmax(self.fc4(x), dim=1)

        return x
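
Dropout behaves differently in training and in evaluation mode, which is why the loops in this notebook call `net.eval()` before validation and `net.train()` afterwards. A small sketch follows; the drop probability of 0.5 is chosen only to make the effect visible (the network above uses 0.1), and `toy_drop` and `toy_ones` are hypothetical names.

```python
import torch
from torch import nn

toy_drop = nn.Dropout(p=0.5)
toy_ones = torch.ones(1, 8)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2
toy_drop.train()
out_train = toy_drop(toy_ones)
print(out_train)

# Evaluation mode: dropout is a no-op
toy_drop.eval()
out_eval = toy_drop(toy_ones)
print(out_eval)
```

The rescaling during training keeps the expected value of each activation unchanged, so no correction is needed at evaluation time.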


In [0]:
# !!! Use dropout !!!
net = DropoutNetwork()

# !!! Add weight decay to the optimizer !!!
optimizer = optim.SGD(net.parameters(), lr=0.03, weight_decay=0.01)

net.to(device)
train_losses, valid_losses = [], []

# train model and evaluate with validation set
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        
        log_ps = net(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
    else:
        valid_loss = 0
        accuracy = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            net.eval()
            for images, labels in validloader:
                images, labels = images.to(device), labels.to(device)
                log_ps = net(images)
                # .item() keeps a plain Python float (also safe to plot when running on GPU)
                valid_loss += criterion(log_ps, labels).item()
                
                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy += torch.mean(equals.float()).item()
        
        net.train()
        
        train_losses.append(running_loss/len(trainloader))
        valid_losses.append(valid_loss/len(validloader))

        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.3f}.. ".format(train_losses[-1]),
              "Validation Loss: {:.3f}.. ".format(valid_losses[-1]),
              "Validation Accuracy: {:.3f}".format(accuracy/len(validloader)))

In [0]:
# evaluate on testing set
with torch.no_grad():
    net.eval()
    test_loss = 0
    accuracy = 0

    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        output = net(images)
        test_loss += criterion(output, labels)

        top_p, top_class = output.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor))

    print("Test Accuracy: {:.3f}".format(accuracy/len(testloader)))

In [0]:
# evaluate on noisy set
with torch.no_grad():
    net.eval()
    test_loss = 0
    accuracy = 0

    for images, labels in testloader_noisy:
        images, labels = images.to(device), labels.to(device)
        output = net(images)
        test_loss += criterion(output, labels)

        top_p, top_class = output.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor))

    print("Test Noise Accuracy: {:.3f}".format(accuracy/len(testloader_noisy)))

In [0]:
plt.plot(train_losses, label='Training loss')
plt.plot(valid_losses, label='Validation loss')
plt.legend(frameon=False);

In [0]:
nb_classes = 10

confusion_matrix = torch.zeros(nb_classes, nb_classes)
with torch.no_grad():
    for i, (inputs, classes) in enumerate(testloader):
        inputs = inputs.to(device)
        classes = classes.to(device)
        outputs = net(inputs)
        _, preds = torch.max(outputs, 1)
        for t, p in zip(classes.view(-1), preds.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1


In [0]:

from pandas import DataFrame

# rows are true classes, columns are predicted classes
x = confusion_matrix.numpy()
print(DataFrame(x))
plt.imshow(x);
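
A confusion matrix built this way (rows are true classes, columns are predictions) also gives the accuracy of each class. The sketch below uses a made-up 3-class matrix, `toy_cm`, rather than the notebook's results.

```python
import torch

# Made-up confusion matrix: rows are true classes, columns are predictions
toy_cm = torch.tensor([[8., 1., 1.],
                       [0., 9., 1.],
                       [2., 0., 8.]])

# Correct predictions sit on the diagonal; each row sums to the class size
per_class_acc = toy_cm.diag() / toy_cm.sum(dim=1)
overall_acc = toy_cm.diag().sum() / toy_cm.sum()

print(per_class_acc)   # tensor([0.8000, 0.9000, 0.8000])
print(overall_acc)     # tensor(0.8333)
```

Applied to the notebook's matrix, the same two lines would be `confusion_matrix.diag() / confusion_matrix.sum(dim=1)` and `confusion_matrix.diag().sum() / confusion_matrix.sum()`.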

## Use the network

In [0]:
# obtain one batch of test images
dataiter = iter(testloader)
images, labels = next(dataiter)
classes = range(10)

# move model inputs to cuda, if GPU available
images_cuda = images.to(device)

# get sample outputs
output = net(images_cuda)
# convert output log-probabilities to predicted class
_, preds_tensor = torch.max(output, 1)
preds = np.squeeze(preds_tensor.cpu().numpy())
# plot the images in the batch, along with predicted and true labels
fig = plt.figure(figsize=(25, 4))
for idx in range(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])
    plt.imshow(images[idx][0, :, :], interpolation='nearest', cmap='gray')
    plt.axis("off")
    ax.set_title("{} ({})".format(classes[preds[idx]], classes[labels[idx]]),
                 color=("green" if preds[idx]==labels[idx].item() else "red"))