In [None]:
import torch
import torch.nn as nn
import torchvision
from torchvision import datasets, transforms
import torch.nn.functional as F
from torch.autograd import Variable
import torch.optim as optim

In this tutorial we show how to construct a model in PyTorch, train it, and subsequently use it to make predictions.

The tutorial uses *MNIST*, the standard dataset for handwritten digit classification.

# PyTorch cheat sheet

https://pytorch.org/tutorials/beginner/ptcheat.html

# Model construction

## Construction of models using predefined layers and functions

PyTorch is a machine learning package whose models are constructed as subclasses of the torch.nn.Module class. A model subclass requires the following components in order to be usable with the library:


1.   **\_\_init\_\_**, which defines all layers, and hence the trainable parameters, that are used in the network
2.   **forward**, a function that describes the computation the network performs during a forward pass



In [None]:
class Net(nn.Module):
    #This defines the structure of the NN.
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()  #Dropout
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        #Convolutional Layer/Pooling Layer/Activation
        x = F.relu(F.max_pool2d(self.conv1(x), 2)) 
        #Convolutional Layer/Dropout/Pooling Layer/Activation
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        #Fully Connected Layer/Activation
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        #Fully Connected Layer (output layer)
        x = self.fc2(x)
        #log_softmax returns log-probabilities, as expected by F.nll_loss
        return F.log_softmax(x, dim=1)
net = Net()
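
The value 320 used in `nn.Linear(320, 50)` and `x.view(-1, 320)` follows from the shapes of the intermediate feature maps. The sketch below traces a dummy 28×28 input through stand-alone copies of the two convolutions. The names `conv1`, `conv2`, `dummy`, `h1` and `h2` are local to this example and only assume the layer sizes shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-alone copies of the two convolutional layers used in Net
conv1 = nn.Conv2d(1, 10, kernel_size=5)
conv2 = nn.Conv2d(10, 20, kernel_size=5)

# One fake grayscale MNIST-sized image: (batch, channels, height, width)
dummy = torch.zeros(1, 1, 28, 28)

h1 = F.max_pool2d(conv1(dummy), 2)  # 28 -> 24 (5x5 conv) -> 12 (2x2 pool)
h2 = F.max_pool2d(conv2(h1), 2)     # 12 -> 8 (5x5 conv) -> 4 (2x2 pool)

print(h1.shape)  # torch.Size([1, 10, 12, 12])
print(h2.shape)  # torch.Size([1, 20, 4, 4])

# 20 channels * 4 * 4 spatial positions = 320 features per image
flat_features = h2.view(1, -1).shape[1]
print(flat_features)  # 320
```

If the kernel size or the input resolution changes, the flattened size changes with it, so the first fully connected layer has to be adjusted accordingly.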

## Writing custom layers and activation functions

Custom layers need to be written as subclasses of nn.Module, similarly to how a network is defined. Like a network definition, a custom layer needs an init and a forward function.

The main difference, however, is that in the init function we define the learnable parameters themselves and their initialisation method.

The code cell below shows an implementation equivalent to the prewritten linear mapping layer **nn.Linear** that ships with PyTorch.

In [None]:
class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Tensor, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        # Check the registered parameter: the `bias` argument is a bool
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # LinearFunction is the custom autograd Function defined in the next code cell.
        return LinearFunction.apply(input, self.weight, self.bias)

    def extra_repr(self):
        # (Optional)Set the extra information about this module. You can test
        # it by printing an object of this class.
        return 'in_features={}, out_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )
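
To illustrate what registration means in practice, the sketch below defines a small hypothetical layer, `Scale` (not part of PyTorch or of this tutorial), and lists which of its attributes are reported as parameters and buffers.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical layer: multiplies each input feature by a learnable factor."""

    def __init__(self, num_features):
        super().__init__()
        # Wrapped in nn.Parameter: registered and returned by .parameters()
        self.scale = nn.Parameter(torch.ones(num_features))
        # Plain tensor attribute: NOT registered, so an optimizer never sees it
        self.plain = torch.ones(num_features)
        # Buffer: part of the module state, but not trainable
        self.register_buffer('calls', torch.zeros(1))

    def forward(self, x):
        return x * self.scale

layer = Scale(3)
param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
print(param_names)   # ['scale']
print(buffer_names)  # ['calls']

out = layer(torch.tensor([[1.0, 2.0, 3.0]]))
print(out.requires_grad)  # True
```

Only `scale` would be handed to an optimizer through `layer.parameters()`; the plain attribute is ignored entirely.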

Writing custom functions to be used in the network requires that you write subclasses of torch.autograd.Function. Such a class needs to define


1.   **forward**, which describes the forward propagation of the input 
2.   **backward**, which describes how the gradients are to be computed with respect to all the inputs received for this function.

The code cell below shows the implementation of the linear function as a subclass of torch.autograd.Function.

In [None]:
from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # ctx is a context object that can be used to 
    # stash information for backward computation,
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            # unsqueeze(0) expands the bias from shape (dim,) to (1, dim);
            # expand_as(output) then broadcasts it to the shape of output
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias
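
A hand-written backward pass is easy to get wrong, so it is worth comparing it against numerical gradients. The sketch below does this with `torch.autograd.gradcheck` on a compact function, `MyLinearFunction`, which is a simplified stand-in written for this example (the bias is mandatory here) rather than the class from the cell above.

```python
import torch
from torch.autograd import Function, gradcheck

class MyLinearFunction(Function):
    """Simplified stand-in: y = x W^T + b with a hand-written backward."""

    @staticmethod
    def forward(ctx, input, weight, bias):
        ctx.save_for_backward(input, weight)
        return input.mm(weight.t()) + bias

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_output.mm(weight)      # shape of input
        grad_weight = grad_output.t().mm(input)  # shape of weight
        grad_bias = grad_output.sum(0)           # shape of bias
        return grad_input, grad_weight, grad_bias

torch.manual_seed(0)
# gradcheck compares against finite differences, so use double precision
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)

ok = gradcheck(MyLinearFunction.apply, (x, w, b), eps=1e-6, atol=1e-4)
print(ok)  # True if analytical and numerical gradients agree
```

`gradcheck` raises an error describing the mismatch when the analytical gradients are wrong, which makes it a convenient unit test for custom functions.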

For a more in-depth discussion on extending PyTorch, see https://pytorch.org/docs/stable/notes/extending.html?highlight=linearfunction

# Model optimization


From Kaggle: 
"MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike."

[Read more.](https://www.kaggle.com/c/digit-recognizer)


<a title="By Josef Steppan [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:MnistExamples.png"><img width="512" alt="MnistExamples" src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png"/></a>

# Loading data and preprocessing data

### DataLoaders

PyTorch provides the DataLoader class, which forms the data pipeline that is used during the optimization of the network

```
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)
```
This returns an iterable over batches. With num_workers > 0 the class loads data in parallel worker processes, so that multiple CPU cores can simultaneously prepare data for the GPU.

The dataset argument is an instance of a subclass of the data.Dataset class, which can be written for custom datasets (see additional resources)
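
As a minimal sketch of such a custom dataset, the class below (a hypothetical `RandomDigits`, filled with random tensors instead of real images) implements the two methods a map-style dataset needs, `__len__` and `__getitem__`, and is then wrapped in a DataLoader.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDigits(Dataset):
    """Hypothetical dataset of random 1x28x28 'images' with random labels 0-9."""

    def __init__(self, n_samples):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(n_samples, 1, 28, 28, generator=generator)
        self.labels = torch.randint(0, 10, (n_samples,), generator=generator)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (input, target) pair
        return self.images[idx], self.labels[idx]

loader = DataLoader(RandomDigits(100), batch_size=32, shuffle=True)

batch_shapes = [(data.shape, target.shape) for data, target in loader]
for data_shape, target_shape in batch_shapes:
    print(data_shape, target_shape)
```

With 100 samples and a batch size of 32 the loader yields four batches, the last one holding the remaining 4 samples; passing `drop_last=True` would discard that incomplete batch.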


In [None]:
train_batch = 64
test_batch = 64

kwargs = {}

train_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=True, download=True,
                                                          transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.1307,), (0.3081,))])),
                                          batch_size=train_batch, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(datasets.MNIST('../data', train=False, 
                                                         transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.1307,), (0.3081,))])),
                                          batch_size=test_batch, shuffle=True, **kwargs)
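
The two numbers passed to `transforms.Normalize` are the commonly quoted mean (0.1307) and standard deviation (0.3081) of the MNIST training pixels after `ToTensor()` has scaled them to the range [0, 1]. The sketch below reproduces the same arithmetic with plain tensors; the three sample pixel values are chosen purely for illustration.

```python
import torch

mean, std = 0.1307, 0.3081

# Pixel intensities after ToTensor(): black, the dataset mean, and white
pixels = torch.tensor([0.0, 0.1307, 1.0])

# Normalize applies (x - mean) / std to every pixel of the channel
normalized = (pixels - mean) / std
print(normalized)  # roughly -0.424, 0.0 and 2.821
```

After this step the inputs are approximately zero-centred with unit variance, which tends to make optimization better behaved.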

**Additional resources**

https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

https://towardsdatascience.com/building-efficient-custom-datasets-in-pytorch-2563b946fd9f

## Model training

### Optimization and inference functions

In [None]:
def train(epoch, log_interval, cuda=False):
    # Set the internal state of the model to training mode:
    # some layers (e.g. dropout) behave differently during training than during evaluation
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if cuda:
            # Move data and target to another device (GPU);
            # all tensors and the model need to be on the same device!
            data, target = data.cuda(), target.cuda()
        # Set the gradients to zero before starting backpropagation,
        # because PyTorch accumulates the gradients on subsequent backward passes
        optimizer.zero_grad()
        output = model(data)
        # Negative log-likelihood loss, suited to classification with C classes;
        # it expects the log-probabilities returned by the model
        loss = F.nll_loss(output, target)
        # Compute dloss/dp for every trainable parameter p
        loss.backward()
        # Perform a single optimization step on the parameters
        optimizer.step()
        # Print out the loss periodically
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
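
The need to reset gradients before each backward pass can be seen on a single tensor. In the sketch below, `w` is a stand-in for a model parameter, and `w.grad.zero_()` plays the role that `optimizer.zero_grad()` plays for all parameters of a model.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3 * w))/dw = 3 for each element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added on top
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Reset the gradient, then run the backward pass again
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first)        # tensor([3., 3.])
print(accumulated)  # tensor([6., 6.])
print(after_reset)  # tensor([3., 3.])
```

Without the reset, each update would use the sum of the gradients of all previous batches rather than the gradient of the current one.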

In [None]:
def test(cuda=False):
    # Set the model to evaluation mode (disables dropout)
    model.eval()
    test_loss = 0
    correct = 0
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for data, target in test_loader:
            if cuda:
                data, target = data.cuda(), target.cuda()
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # get the index of the max log-probability
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
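
The calls to `model.train()` and `model.eval()` matter because layers such as dropout behave differently in the two modes. The sketch below shows this on a stand-alone `nn.Dropout` layer; the input of ones and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # elements are zeroed at random, survivors scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)   # dropout is switched off: the input passes through unchanged

zeros_in_train = int((train_out == 0).sum())
print(zeros_in_train)            # roughly half of the 1000 elements
print(torch.equal(eval_out, x))  # True
```

Forgetting `model.eval()` before testing would therefore evaluate a randomly thinned network and give noisy, pessimistic results.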


### GPU allocation of the model

After defining the training and test functions we can start to train the model. However, if we wish to train on an available GPU, we first need to copy our model to it.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

Should a CUDA-compatible device be available, we can move our model to that device with either of the following


*   the .cuda() method
*   the .to(*device*) method
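
Device-agnostic code usually relies on `Module.to` and `Tensor.to`, which also work when no GPU is present. The sketch below uses a small stand-in layer and a random batch (both invented for this example) to show that a module and its inputs end up on the same device.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Module.to moves the parameters in place and returns the module itself
layer = nn.Linear(4, 2).to(device)

# Tensor.to returns a tensor on the target device; keep the returned value
batch = torch.randn(8, 4).to(device)

out = layer(batch)
print(out.device)
```

Note the asymmetry: calling `.to(device)` on a module is enough on its own, whereas for a tensor the result must be assigned, as in `batch = batch.to(device)`.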



In [None]:
model = Net()
# .to(device) moves the model to the GPU if one is available and otherwise keeps it on the CPU
model.to(device)

And we can check whether it has in fact been allocated to the GPU by checking the model parameters

In [None]:
next(model.parameters()).is_cuda

### Model training

In [None]:
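optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
# Feed the data to the same device the model lives on
use_cuda = next(model.parameters()).is_cuda
for epoch in range(1, 100):  # epochs 1 to 99
    train(epoch, 100, cuda=use_cuda)
    test(cuda=use_cuda)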

# Post optimization

## Saving models and loading models

When saving a model for inference, it is only necessary to save the trained model’s learned parameters. Saving the model’s state_dict with the torch.save() function will give you the most flexibility for restoring the model later, which is why it is the recommended method for saving models.

A common PyTorch convention is to save models using either a .pt or .pth file extension.


*   torch.save: Saves a serialized object to disk. This function uses Python’s pickle utility for serialization. Models, tensors, and dictionaries of all kinds of objects can be saved using this function.
*   torch.load: Uses pickle’s unpickling facilities to deserialize pickled object files to memory. Its map_location argument also lets you choose the device onto which the data is loaded.
*   torch.nn.Module.load_state_dict: Loads a model’s parameter dictionary using a deserialized state_dict.
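
A state_dict is an ordered dictionary that maps each parameter (and buffer) name to its tensor. The sketch below prints the entries for a small stand-in network, `tiny`, which is defined here only for illustration.

```python
import torch.nn as nn

# Small stand-in network; the ReLU has no parameters and adds no entries
tiny = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

shapes = {name: tuple(tensor.shape) for name, tensor in tiny.state_dict().items()}
for name, shape in shapes.items():
    print(name, shape)
# 0.weight (3, 4)
# 0.bias (3,)
# 2.weight (2, 3)
# 2.bias (2,)
```

Because only names and tensors are stored, the class definition of the model must be available again when the state_dict is loaded.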


In [None]:
#Save the model's state dictionary to the specified PATH
PATH = 'mnist_net.pt'  # example file name, change as needed
torch.save(model.state_dict(), PATH)

#Reload the model by creating a new instance of the model class and loading the saved state dictionary into it
model = Net()
model.load_state_dict(torch.load(PATH))
#Switch to evaluation mode before running inference
model.eval()
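
One way to convince yourself that a save and load round trip preserved the weights is to compare the outputs of the original and the restored model on the same input. The sketch below does so with a small stand-in network built by a hypothetical helper, `make_model`, and a temporary file, so it does not depend on the trained MNIST model.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for the model class
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

torch.manual_seed(0)
trained = make_model()
x = torch.randn(5, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pt")
    torch.save(trained.state_dict(), path)

    # A fresh instance starts with different random weights ...
    restored = make_model()
    # ... until the saved parameters are copied into it
    restored.load_state_dict(torch.load(path, map_location="cpu"))

trained.eval()
restored.eval()
with torch.no_grad():
    same = torch.equal(trained(x), restored(x))
print(same)  # True
```

The `map_location="cpu"` argument makes the example loadable on machines without a GPU, even if the parameters were saved from one.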