
# Introduction

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. 

In this tutorial, your goal is to correctly identify digits from the Kaggle MNIST dataset of tens of thousands of handwritten images. We will walk through the development of several standard deep learning pipelines using PyTorch that are capable of correctly identifying digits from the MNIST dataset. We will then see how to customize standard deep learning pipelines to improve model performance. After successfully training a custom model, we will see how to submit the model's predictions to Kaggle for scoring.

This tutorial assumes some basic knowledge of neural networks. If you’re not already familiar with neural networks, then you can learn the basic concepts behind neural networks (and PyTorch!) at [course.fast.ai](https://course.fast.ai/).

* Tutorial materials are derived from [_What is torch.nn really?_](https://pytorch.org/tutorials/beginner/nn_tutorial.html) by Jeremy Howard, Rachel Thomas, and Francisco Ingham.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn import decomposition, ensemble, manifold, metrics, model_selection, pipeline, preprocessing
import torch

In [None]:
%matplotlib inline

# Setting up an account with Kaggle (Optional, but recommended!)

### 1. Register for an account with Kaggle

In order to download Kaggle competition data you will first need to create a [Kaggle](https://www.kaggle.com/) account.

### 2. Create an API key

Once you have registered for a Kaggle account you will need to create some [API credentials](https://github.com/Kaggle/kaggle-api#api-credentials) in order to be able to use the `kaggle` CLI to download data.



# Getting the MNIST data
If you are using Binder to run this notebook, then the data has already been downloaded for you! If you are using Google Colab to run this notebook, then you will need to download the data before proceeding.

## Downloading the data from Kaggle
If you have a Kaggle account and API key, then you can provide your Kaggle username and API key in the cell below and execute the code to download the Kaggle [Digit Recognizer: Learn computer vision with the famous MNIST data](https://www.kaggle.com/c/digit-recognizer) competition data. **Before attempting to download the competition data you will need to log in to your Kaggle account and accept the rules for this competition.**

In [None]:
%%bash
export KAGGLE_USERNAME="YOUR_USERNAME"
export KAGGLE_KEY="YOUR_API_KEY"
kaggle competitions download -c digit-recognizer -p ../data/raw/mnist/

## Downloading the data from GitHub
If you are running this notebook using Google Colab but do not want to bother with setting up a Kaggle account and API key, then you can download the data from our GitHub repository by running the code in the following cells.

In [None]:
import os
import requests


TRAIN_URL = "https://raw.githubusercontent.com/kaust-vislab/pytorch-tutorials/master/data/raw/mnist/train.csv"
TEST_URL = "https://raw.githubusercontent.com/kaust-vislab/pytorch-tutorials/master/data/raw/mnist/test.csv"
SAMPLE_SUBMISSION_URL = "https://raw.githubusercontent.com/kaust-vislab/pytorch-tutorials/master/data/raw/mnist/sample_submission.csv"


def fetch_mnist_data():
    if not os.path.isdir("../data/raw/mnist/"):
        os.makedirs("../data/raw/mnist/")
    
    with open("../data/raw/mnist/train.csv", 'wb') as f:
        response = requests.get(TRAIN_URL)
        f.write(response.content)
        
    with open("../data/raw/mnist/test.csv", 'wb') as f:
        response = requests.get(TEST_URL)
        f.write(response.content)
    
    with open("../data/raw/mnist/sample_submission.csv", 'wb') as f:
        response = requests.get(SAMPLE_SUBMISSION_URL)
        f.write(response.content)

In [None]:
fetch_mnist_data()

# Load the MNIST data

In [None]:
!head ../data/raw/mnist/train.csv

In [None]:
mnist_arr = np.loadtxt("../data/raw/mnist/train.csv", delimiter=',', skiprows=1, dtype=np.int64)

In [None]:
# raw features are between 0 and 255
mnist_arr.min(), mnist_arr.max()

## Rescale the raw data

Data for individual pixels is stored as integers between 0 and 255. Neural network models work best when numerical features are scaled. To rescale the raw features we can use tools from the [Scikit-Learn preprocessing module](https://scikit-learn.org/stable/modules/preprocessing.html).
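
As a rough illustration of what min-max scaling does, the sketch below rescales a tiny made-up array by hand and compares the result with Scikit-Learn's `MinMaxScaler`. The toy array and the names `manual` and `scaled` are assumptions introduced only for this example.

```python
import numpy as np
from sklearn import preprocessing

# toy "pixel" data: 3 samples, 2 features (illustrative values only)
X = np.array([[0, 10], [128, 20], [255, 30]], dtype=np.float64)

# manual min-max scaling, computed separately for each column (feature)
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# the same transformation using scikit-learn
scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(np.allclose(manual, scaled))
```

Note that the manual formula divides by zero when a column is constant, which happens for MNIST pixels that are always blank. `MinMaxScaler` guards against this case, which is one reason to prefer it over a hand-rolled version.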

In [None]:
training_target, training_features = mnist_arr[:, 0], mnist_arr[:, 1:]

In [None]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled_training_features = min_max_scaler.fit_transform(training_features)

## Check out a training sample

In [None]:
_, ax = plt.subplots(1,1)
_ = ax.imshow(scaled_training_features[0].reshape((28, 28)), cmap="gray")

## Visualizing training samples using PCA

[Principal Components Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) can be used as a visualization tool to see if there are any obvious patterns in the training samples.

In [None]:
_prng = np.random.RandomState(42)
pca = decomposition.PCA(n_components=2, random_state=_prng)
transformed_training_features = pca.fit_transform(scaled_training_features)

In [None]:
fig, ax = plt.subplots(1, 1)
_ = ax.scatter(transformed_training_features[:,0], transformed_training_features[:,1], c=training_target, alpha=0.05)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_title("PCA", fontsize=20)

## Visualizing training samples using t-SNE

[t-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE) is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

It is highly recommended to use another dimensionality reduction method (e.g. PCA) to reduce the number of dimensions to a reasonable amount if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.

In [None]:
# initial PCA to determine the smallest n_components that captures at least 95% of sample variance
_prng = np.random.RandomState(42)
pca = decomposition.PCA(random_state=_prng)
pca.fit(scaled_training_features)
n_components = int(np.sum(pca.explained_variance_ratio_.cumsum() < 0.95)) + 1
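
The selection rule for the number of components can be checked on a small example. The sketch below uses hypothetical explained-variance ratios (not values from MNIST) to show that counting the components whose cumulative ratio is still below the threshold gives one fewer than the number of components needed to reach it.

```python
import numpy as np

# hypothetical explained-variance ratios for a 5-component PCA (they sum to 1)
ratios = np.array([0.60, 0.25, 0.08, 0.04, 0.03])
cumulative = ratios.cumsum()  # approximately [0.60, 0.85, 0.93, 0.97, 1.00]

# number of leading components whose cumulative variance is still below 95%
below = int(np.sum(cumulative < 0.95))

# smallest number of components whose cumulative variance reaches 95%
reaching = int(np.argmax(cumulative >= 0.95)) + 1

print(below, reaching)  # 3 4
```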

As suggested above, we will use an initial PCA step in our embedding pipeline to reduce the overall number of features, which will speed up convergence of the t-SNE algorithm.

In [None]:
_prng = np.random.RandomState(42)

embedding_pipeline = pipeline.make_pipeline(
    decomposition.PCA(n_components, random_state=_prng),
    manifold.TSNE(n_components=2, random_state=_prng)
)

In [None]:
transformed_training_features = embedding_pipeline.fit_transform(scaled_training_features)

In [None]:
fig, ax = plt.subplots(1, 1)
_ = ax.scatter(transformed_training_features[:,0], transformed_training_features[:,1], c=training_target, alpha=0.05)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_title("t-SNE", fontsize=20)

# Classical ML Benchmark Model

To provide a point of comparison for our neural network models, let's use PCA to perform dimensionality reduction and then train a Random Forest Classifier.

In [None]:
# initial PCA to determine the smallest n_components that captures at least 95% of sample variance
_prng = np.random.RandomState(42)
pca = decomposition.PCA(random_state=_prng)
pca.fit(scaled_training_features)
pca.set_params(n_components=int(np.sum(pca.explained_variance_ratio_.cumsum() < 0.95)) + 1)

# second PCA fit to compute the transformed features
transformed_training_features = pca.fit_transform(scaled_training_features)

In [None]:
# evaluate a reasonable classifier using CV
clf = ensemble.RandomForestClassifier(n_estimators=100, random_state=_prng)
accuracy_scores = model_selection.cross_val_score(clf,
                                                  transformed_training_features,
                                                  training_target,
                                                  scoring="accuracy",
                                                  cv=5)

In [None]:
accuracy_scores.mean()

### Make predictions

In [None]:
# retrain the classifier using the entire dataset
clf.fit(transformed_training_features, training_target)

# load the testing features (note we use transform method and NOT fit_transform!)
_testing_features = np.loadtxt("../data/raw/mnist/test.csv", delimiter=',', skiprows=1, dtype=np.int64)
_scaled_testing_features = min_max_scaler.transform(_testing_features)
_transformed_testing_features = pca.transform(_scaled_testing_features)

# make predictions 
predictions = clf.predict(_transformed_testing_features)

### Reformat predictions

In [None]:
# submission format for kaggle
!head ../data/raw/mnist/sample_submission.csv

In [None]:
import os
import time

import pandas as pd

if not os.path.isdir("../data/kaggle-submissions/mnist/"):
    os.makedirs("../data/kaggle-submissions/mnist/")

timestamp = time.strftime("%Y%m%d-%H%M%S")
number_predictions, = predictions.shape
df = pd.DataFrame({"ImageId": range(1, number_predictions + 1), "Label": predictions})
df.to_csv(f"../data/kaggle-submissions/mnist/submission-{timestamp}.csv", index=False)

### Submit to Kaggle!

Once you have successfully submitted your predictions, you can check the [Digit-Recognizer competition](https://www.kaggle.com/c/digit-recognizer) website and see how well your best model compares to your peers.

In [None]:
%%bash
export KAGGLE_USERNAME="YOUR_USERNAME"
export KAGGLE_KEY="YOUR_API_KEY"
kaggle competitions submit digit-recognizer \
  -f $(ls ../data/kaggle-submissions/mnist/submission-*.csv | tail -n 1) \
  -m "Untuned PCA-RandomForestClassifier benchmark pipeline"

# Neural network from scratch

## Split the MNIST data into training and validation sets

Since Kaggle has already split the MNIST data set into training and testing data sets, we only need to split our training data set into training and validation data. We will use the validation data to make sure that we are not over-fitting our models.

In [None]:
prng = np.random.RandomState(42)
training_arr, validation_arr = model_selection.train_test_split(mnist_arr, test_size=0.20, random_state=prng)

In [None]:
training_arr.shape

In [None]:
# first column is the label (i.e., target), remaining columns are pixel values (i.e., features)
training_target, training_features = training_arr[:, 0], training_arr[:, 1:]

In [None]:
validation_arr.shape

In [None]:
validation_target, validation_features = validation_arr[:, 0], validation_arr[:, 1:]


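Next let's create a simple model using nothing but [PyTorch tensor operations](https://pytorch.org/docs/stable/tensors.html). We first rescale the features of each split, and then, because PyTorch uses `torch.tensor` rather than `numpy.ndarray`, we convert the data to tensors.

In [None]:
# rescale both splits using min/max statistics computed from the training split only
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled_training_features = min_max_scaler.fit_transform(training_features)
scaled_validation_features = min_max_scaler.transform(validation_features)

training_target = torch.tensor(training_target)
scaled_training_features = torch.tensor(scaled_training_features, dtype=torch.float32)

validation_target = torch.tensor(validation_target)
scaled_validation_features = torch.tensor(scaled_validation_features, dtype=torch.float32)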

In [None]:
scaled_training_features

In [None]:
training_target

PyTorch provides methods to create random or zero-filled tensors, which we will use to create our weights and bias for a simple linear model. These are just regular tensors, with one very special addition: we tell PyTorch that they require a gradient. This causes PyTorch to record all of the operations done on the tensor, so that it can calculate the gradient during back-propagation automatically!

For the weights, we set `requires_grad` after the initialization, since we don’t want that step included in the gradient. (Note that a trailing `_` in PyTorch signifies that the operation is performed _in-place_.)
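
To make the role of `requires_grad` concrete, here is a minimal sketch on a three-element tensor. The names `w`, `v`, and `loss` are assumptions used only for this illustration and are unrelated to the model parameters defined in this tutorial.

```python
import torch

# a leaf tensor whose operations PyTorch should record
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# any computation built from w is tracked, so the result carries a grad_fn
loss = (w ** 2).sum()
loss.backward()  # fills w.grad with d(loss)/dw = 2 * w

print(w.grad)  # tensor([2., 4., 6.])

# the scaling happens before tracking is switched on, so it is not recorded;
# requires_grad_() (trailing underscore) then flips the flag in place
v = torch.randn(3) / 3 ** 0.5
v.requires_grad_()

print(v.requires_grad, v.is_leaf)  # True True
```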

In [None]:
number_samples, number_features = scaled_training_features.shape

# using Xavier initialization (divide weights by sqrt(number_features))
weights = torch.randn(number_features, 10) / number_features**0.5
weights.requires_grad_() # trailing underscore indicates in-place operation
bias = torch.zeros(10, requires_grad=True)

Thanks to PyTorch’s ability to calculate gradients automatically, we can use any standard Python function (or callable object) in a model! So we will start by writing a function called `linear_transformation` to perform matrix multiplication and broadcasted addition. We will also need an activation function, so we’ll write a function called `log_softmax_activation` and use it.

**N.B.** Although PyTorch provides lots of pre-written loss functions, activation functions, and so forth, you can easily write your own using plain Python. PyTorch will even create fast GPU or vectorized CPU code for your function automatically.

In [None]:
def linear_transformation(X):
    return X @ weights + bias

def log_softmax_activation(X):
    return X - X.exp().sum(-1).log().unsqueeze(-1)
    
def logistic_regression(X):
    Z = linear_transformation(X)
    return log_softmax_activation(Z)

In the above, `@` is the matrix multiplication operator. We will call our function on one batch of data (in this case, 64 images). Note that our predictions won’t be any better than random at this stage, since we start with random weights.

In [None]:
batch_size = 64
output = logistic_regression(scaled_training_features[:batch_size])

In [None]:
output[1]

As you see, the `output` tensor contains not only the tensor values, but also a gradient function, `grad_fn`. We’ll use this later to do back propagation to update the model parameters.

Let’s implement `negative_log_likelihood` to use as the loss function. Again, we can just use standard Python code.

In [None]:
def negative_log_likelihood(output, target):
    m, _ = output.shape
    return -output[range(m), target].mean()
    

In [None]:
negative_log_likelihood(output, training_target[:batch_size])

Let’s also implement a function to calculate the `accuracy` of our model: for each prediction, if the index with the largest value matches the target value, then the prediction was correct.

In [None]:
def accuracy(output, target):
    predictions = torch.argmax(output, dim=1)
    return (predictions == target).float().mean()

For comparison purposes we can compute the accuracy of our model with randomly initialized parameters.

In [None]:
accuracy(output, training_target[:batch_size])

We can now run a training loop. For each iteration, we will:

* select a mini-batch of data (of size `batch_size`)
* use the model to make predictions
* calculate the loss
* call `loss.backward()` to compute the gradients of the loss with respect to the model parameters

We now use these gradients to update the weights and bias (i.e., model parameters). We do this within the `torch.no_grad()` context manager, because we do not want these actions to be recorded for our next calculation of the gradient. You can read more about how PyTorch’s Autograd records operations [here](https://pytorch.org/docs/stable/notes/autograd.html).

We then set the gradients to zero, so that we are ready for the next loop. Otherwise, our gradients would keep a running tally of everything that had happened (i.e., `loss.backward()` adds the gradients to whatever is already stored, rather than replacing them).
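
The accumulation behaviour described above can be seen on a single scalar parameter. This is a minimal sketch; the tensor `w` and the names `first`, `second`, and `after_reset` are assumptions used only for this illustration.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()        # 3.0

(3 * w).backward()           # a second call adds to the stored gradient
second = w.grad.item()       # 6.0

w.grad.zero_()               # reset, as done at the end of each training step
(3 * w).backward()
after_reset = w.grad.item()  # 3.0

# the parameter update itself is wrapped in no_grad so it is not recorded
with torch.no_grad():
    w -= 0.5 * w.grad

print(first, second, after_reset, w.item())  # 3.0 6.0 3.0 0.5
```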

In [None]:
model_fn = logistic_regression
loss_fn = negative_log_likelihood

number_epochs = 2
number_batches = (number_samples - 1) // batch_size + 1

learning_rate = 0.5
for epoch in range(number_epochs):
    for batch in range(number_batches):
        
        # forward pass
        start = batch * batch_size
        X = scaled_training_features[start:(start + batch_size)]
        y = training_target[start:(start + batch_size)]
        loss = loss_fn(model_fn(X), y)
        
        # back propagation
        loss.backward()
        with torch.no_grad():
            weights -= learning_rate * weights.grad
            bias -= learning_rate * bias.grad
            weights.grad.zero_()
            bias.grad.zero_()
            

That’s it: we’ve created and trained a minimal neural network (in this case, a logistic regression, since we have no hidden layers) entirely from scratch! Let’s check the loss and accuracy and compare those to what we got earlier. We expect that the loss will have decreased and accuracy to have increased, and they have.

In [None]:
loss, accuracy(model_fn(X), y)

# Refactor using `torch.nn.functional`

We will now refactor our code using [torch.nn](https://pytorch.org/docs/stable/nn.html) modules to make it more concise and flexible. The first and easiest step is to make our code shorter by replacing our hand-written activation and loss functions with those from [torch.nn.functional](https://pytorch.org/docs/stable/nn.html#torch-nn-functional).

Since we are using negative log likelihood loss and log softmax activation in this tutorial, we can use [torch.nn.functional.cross_entropy](https://pytorch.org/docs/stable/nn.html#cross-entropy) which combines the two.

In [None]:
import torch.nn.functional as F

In [None]:
F.cross_entropy(linear_transformation(X), y), accuracy(linear_transformation(X), y)
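
As a sanity check on the claim that cross entropy combines log softmax with negative log likelihood, the sketch below compares three ways of computing the loss on random inputs. The tensors `logits` and `targets` are made-up data, not MNIST samples, and the hand-written version simply mirrors the formulas used earlier in this tutorial.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)           # fake scores for 8 samples and 10 classes
targets = torch.randint(0, 10, (8,))  # fake integer class labels

# two-step version: log softmax followed by negative log likelihood
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# one-step version
one_step = F.cross_entropy(logits, targets)

# hand-written version using the same formulas as the functions defined earlier
log_probs = logits - logits.exp().sum(-1).log().unsqueeze(-1)
by_hand = -log_probs[range(8), targets].mean()

print(torch.allclose(two_step, one_step), torch.allclose(by_hand, one_step, atol=1e-6))
```

This is also why the model passed to `F.cross_entropy` should return raw scores rather than log probabilities.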

# Refactor using `torch.nn.Module`

Next up, we’ll use [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#module) and [torch.nn.Parameter](https://pytorch.org/docs/stable/nn.html#parameters), for a clearer and more concise training loop. In this case, we want to create a class that holds our weights, bias, and method for the forward step. `torch.nn.Module` has a number of attributes and methods (such as `parameters()` and `zero_grad()`) which we will be using.

In [None]:
from torch import nn


class MNISTLogisticRegression(nn.Module):
    
    def __init__(self):
        super().__init__()
        self._weights = nn.Parameter(torch.randn(784, 10) / 784**0.5)
        self._bias = nn.Parameter(torch.zeros(10))
        
    def forward(self, X):
        return X @ self._weights + self._bias
    


Since we’re now using an object instead of just using a function, we first have to instantiate our model.

In [None]:
model_fn = MNISTLogisticRegression()

Now we can calculate the loss in the same way as before. Note that `torch.nn.Module` objects are used as if they are functions (i.e., they are callable), but behind the scenes PyTorch will call the `forward` method.

In [None]:
F.cross_entropy(model_fn(X), y)

Previously in our training loop we had to update the values for each parameter by name and manually zero out the grads for each parameter separately.  With our refactoring we can take advantage of `model_fn.parameters()` and `model_fn.zero_grad()` (which are both defined by PyTorch for `torch.nn.Module` base class!) to make those steps more concise and less prone to the error of forgetting some of our parameters, particularly if we had a more complicated model.
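
The following sketch shows how `parameters()` and `zero_grad()` behave on a tiny module. `TinyLinear` is a hypothetical class written only for this illustration. Depending on the PyTorch version, `zero_grad()` either fills the gradients with zeros or resets them to `None`, so the check at the end accepts both.

```python
import torch
from torch import nn


class TinyLinear(nn.Module):  # hypothetical module, for illustration only

    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(4, 2))
        self.bias = nn.Parameter(torch.zeros(2))

    def forward(self, X):
        return X @ self.weights + self.bias


model = TinyLinear()

# every nn.Parameter attribute is registered automatically, in definition order
names = [name for name, _ in model.named_parameters()]
print(names)  # ['weights', 'bias']

# calling the module invokes forward
X = torch.randn(3, 4)
same_output = torch.equal(model(X), model.forward(X))

# a backward pass fills in a gradient for each registered parameter
model(X).sum().backward()
has_grads = all(p.grad is not None for p in model.parameters())

# a single call clears all of the gradients
model.zero_grad()
cleared = all(p.grad is None or bool((p.grad == 0).all()) for p in model.parameters())
print(same_output, has_grads, cleared)  # True True True
```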

In order to facilitate re-use and continued refactoring, we can encapsulate the logic of our deep learning pipeline in the following functions. 

In [None]:
def partial_fit(model_fn, loss_fn, X_batch, y_batch):
    # forward pass
    loss = loss_fn(model_fn(X_batch), y_batch)

    # back propagation
    loss.backward()
    with torch.no_grad():
        for parameter in model_fn.parameters():
            parameter -= learning_rate * parameter.grad
        model_fn.zero_grad()


def fit(model_fn, loss_fn, X, y, number_epochs=2, batch_size=64):
    number_samples, _ = X.shape 
    number_batches = (number_samples - 1) // batch_size + 1
    for epoch in range(number_epochs):
        for batch in range(number_batches):
            start = batch * batch_size
            X_batch = X[start:(start + batch_size)]
            y_batch = y[start:(start + batch_size)]
            partial_fit(model_fn, loss_fn, X_batch, y_batch)

In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy

In [None]:
fit(model_fn, loss_fn, scaled_training_features, training_target)

In [None]:
loss_fn(model_fn(X), y)

# Refactoring using `torch.nn.Linear`

Instead of defining and initializing `self._weights` and `self._bias`, and calculating `X @ self._weights + self._bias`, we will use the PyTorch class [torch.nn.Linear](https://pytorch.org/docs/stable/nn.html#linear) to define a linear layer which does all that for us. PyTorch has many types of predefined layers that can greatly simplify our code, and since the library code is highly optimized, using PyTorch's predefined layers often makes our code faster too.

In [None]:
from torch import nn


class MNISTLogisticRegression(nn.Module):
    
    def __init__(self):
        super().__init__()
        self._linear_layer = nn.Linear(784, 10)
        
    def forward(self, X):
        return self._linear_layer(X)
    


In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy

In [None]:
fit(model_fn, loss_fn, scaled_training_features, training_target)

In [None]:
loss_fn(model_fn(X), y)

# Refactoring using `torch.optim`

PyTorch also has a package with various optimization algorithms, [torch.optim](https://pytorch.org/docs/stable/optim.html). We can use the `step` method of our optimizer to update all of the parameters at once, instead of manually updating each parameter.
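
For plain stochastic gradient descent (no momentum or weight decay), an optimizer step should match the manual update used earlier. The sketch below checks this on a two-element parameter. The tensors `w_manual` and `w_optim` and the toy loss are assumptions used only for this illustration.

```python
import torch
from torch import optim

w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_()
sgd = optim.SGD([w_optim], lr=0.5)

# manual update: w <- w - lr * grad, followed by resetting the gradient
(w_manual ** 2).sum().backward()
with torch.no_grad():
    w_manual -= 0.5 * w_manual.grad
w_manual.grad.zero_()

# the same update performed by the optimizer
(w_optim ** 2).sum().backward()
sgd.step()
sgd.zero_grad()

print(torch.allclose(w_manual, w_optim))  # True
```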

In [None]:
from torch import optim

In [None]:
def partial_fit(model_fn, loss_fn, X_batch, y_batch, opt):
    # forward pass
    loss = loss_fn(model_fn(X_batch), y_batch)

    # back propagation
    loss.backward()
    opt.step()
    opt.zero_grad() # don't forget to reset the gradient after each batch!

        
def fit(model_fn, loss_fn, X, y, opt, number_epochs=2, batch_size=64):
    number_samples, _ = X.shape 
    number_batches = (number_samples - 1) // batch_size + 1
    for epoch in range(number_epochs):
        for batch in range(number_batches):
            start = batch * batch_size
            X_batch = X[start:(start + batch_size)]
            y_batch = y[start:(start + batch_size)]
            partial_fit(model_fn, loss_fn, X_batch, y_batch, opt)

In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy
opt = optim.SGD(model_fn.parameters(), lr=0.5)

In [None]:
fit(model_fn, loss_fn, scaled_training_features, training_target, opt)

In [None]:
loss_fn(model_fn(X), y)

# Refactor using `torch.utils.data.TensorDataset`

The [torch.utils.data](https://pytorch.org/docs/stable/data.html#module-torch.utils.data) module contains a number of useful classes that we can use to further simplify our code. PyTorch has an abstract `Dataset` class. A Dataset can be anything that has a `__len__` function (called by Python’s standard `len` function) and a `__getitem__` function as a way of indexing into it.

PyTorch’s `TensorDataset` is a `Dataset` wrapping tensors. By defining a length and way of indexing, this also gives us a way to iterate, index, and slice along the first dimension of a tensor. This will make it easier to access both the independent and dependent variables in the same line as we train.
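
To illustrate the two ideas above, the sketch below defines a minimal custom `Dataset` and then slices a `TensorDataset`. `SquaresDataset` and the toy tensors are hypothetical and exist only for this example.

```python
import torch
from torch.utils import data


class SquaresDataset(data.Dataset):  # hypothetical dataset, for illustration only

    def __init__(self, n):
        self._x = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self._x)

    def __getitem__(self, idx):
        return self._x[idx], self._x[idx] ** 2


squares = SquaresDataset(5)
x, y = squares[3]
print(len(squares), x.item(), y.item())  # 5 3.0 9.0

# TensorDataset indexes every wrapped tensor along the first dimension
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
targets = torch.arange(6)
tensor_data_set = data.TensorDataset(features, targets)

# features and targets for samples 2 and 3 come back together
X_batch, y_batch = tensor_data_set[2:4]
print(X_batch.shape, y_batch)  # torch.Size([2, 2]) tensor([2, 3])
```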


In [None]:
from torch.utils import data

In [None]:
def fit(model_fn, loss_fn, data_set, number_samples, opt, number_epochs=2, batch_size=64):
    number_batches = (number_samples - 1) // batch_size + 1
    for epoch in range(number_epochs):
        for batch in range(number_batches):
            start = batch * batch_size
            X_batch, y_batch = data_set[start:(start + batch_size)]
            partial_fit(model_fn, loss_fn, X_batch, y_batch, opt)

In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy
training_data_set = data.TensorDataset(scaled_training_features, training_target)
opt = optim.SGD(model_fn.parameters(), lr=0.5)

In [None]:
# note the annoying dependence on number of samples!
fit(model_fn, loss_fn, training_data_set, number_samples, opt)

In [None]:
loss_fn(model_fn(X), y)

# Refactor using `torch.utils.data.DataLoader`

PyTorch’s `DataLoader` is responsible for managing batches. You can create a `DataLoader` from any `Dataset`. `DataLoader` makes it easier to iterate over batches. Rather than having to use `data_set[start:(start + batch_size)]`, the `DataLoader` gives us each mini-batch automatically.
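
A small sketch of this batching behaviour, using made-up tensors: ten samples with a batch size of four yield two full batches followed by a final batch of two. The name `loader` is an assumption used only here.

```python
import torch
from torch.utils import data

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
targets = torch.arange(10)
loader = data.DataLoader(data.TensorDataset(features, targets), batch_size=4, shuffle=False)

# iterating over the loader yields (features, targets) pairs, one per batch
batch_sizes = [len(X_batch) for X_batch, y_batch in loader]

print(len(loader), batch_sizes)  # 3 [4, 4, 2]
```

The smaller final batch is the reason a per-batch loss needs to be weighted by batch size when it is averaged over a whole data set.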

In [None]:
def fit(model_fn, loss_fn, data_loader, opt, number_epochs=2):
    for epoch in range(number_epochs):
        for X_batch, y_batch in data_loader:
            partial_fit(model_fn, loss_fn, X_batch, y_batch, opt)

In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy
training_data_loader = data.DataLoader(training_data_set, batch_size=batch_size, shuffle=True)
opt = optim.SGD(model_fn.parameters(), lr=0.5)

In [None]:
# now we no longer have the annoying dependency on number of samples!
fit(model_fn, loss_fn, training_data_loader, opt)

In [None]:
loss_fn(model_fn(X), y)

Thanks to PyTorch’s `torch.nn.Module`, `torch.nn.Parameter`, `Dataset`, and `DataLoader`, our training loop is now dramatically smaller and easier to understand. Let’s now try to add the basic features necessary to create effective models in practice.

# Adding Validation

In the first part of this tutorial, we were just trying to get a reasonable training loop set up for use on our training data. In reality, you always should also have a validation set, in order to identify if you are overfitting.

Shuffling the training data is important to prevent correlation between batches and overfitting. On the other hand, the validation loss will be identical whether we shuffle the validation set or not. Since shuffling takes extra time, it makes no sense to shuffle the validation data.

We’ll use a batch size for the validation set that is twice as large as that for the training set. This is because the validation set does not need backpropagation and thus takes less memory (it doesn’t need to store the gradients). We take advantage of this to use a larger batch size and compute the loss more quickly.

In [None]:
def fit(model_fn, loss_fn, training_data_loader, opt, validation_data_loader=None, number_epochs=2):
    
    for epoch in range(number_epochs):
        model_fn.train()
        for X_batch, y_batch in training_data_loader:
            partial_fit(model_fn, loss_fn, X_batch, y_batch, opt)
        
        # compute validation loss after each training epoch
        if validation_data_loader is not None:
            model_fn.eval()
            with torch.no_grad():
                # .item() converts each loss to a Python float (needed when the tensors live on a GPU)
                batch_losses, batch_sizes = zip(*[(loss_fn(model_fn(X), y).item(), len(X)) for X, y in validation_data_loader])
                validation_loss = np.sum(np.multiply(batch_losses, batch_sizes)) / np.sum(batch_sizes)
            print(f"Training epoch: {epoch}, Validation loss: {validation_loss}")
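
The validation loss above is an average of per-batch losses weighted by batch size. The sketch below, which uses random made-up data, suggests why the weighting matters: the weighted average reproduces the loss computed over all samples at once, whereas a plain mean of batch losses generally does not when the last batch is smaller.

```python
import numpy as np
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(10, 3)           # fake scores for 10 samples and 3 classes
targets = torch.randint(0, 3, (10,))  # fake integer class labels

# loss computed over all 10 samples at once
full_loss = F.cross_entropy(logits, targets).item()

# losses computed on batches of size 4, 4 and 2
batch_losses, batch_sizes = [], []
for start in range(0, 10, 4):
    batch_logits = logits[start:start + 4]
    batch_targets = targets[start:start + 4]
    batch_losses.append(F.cross_entropy(batch_logits, batch_targets).item())
    batch_sizes.append(len(batch_logits))

weighted = np.sum(np.multiply(batch_losses, batch_sizes)) / np.sum(batch_sizes)
unweighted = np.mean(batch_losses)

print(full_loss, weighted, unweighted)
```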


In [None]:
model_fn = MNISTLogisticRegression()
loss_fn = F.cross_entropy
training_data_loader = data.DataLoader(training_data_set, batch_size=batch_size, shuffle=True)
opt = optim.SGD(model_fn.parameters(), lr=0.5)

_validation_data_set = data.TensorDataset(scaled_validation_features, validation_target)
validation_data_loader = data.DataLoader(_validation_data_set, batch_size=2*batch_size)

In [None]:
fit(model_fn, loss_fn, training_data_loader, opt, validation_data_loader)

# Switching to CNN

We are now going to build our neural network with three convolutional layers. Because none of the functions in the previous section assume anything about the model form, we’ll be able to use them to train a CNN without any modification!

We will use PyTorch’s predefined [torch.nn.Conv2d](https://pytorch.org/docs/stable/nn.html#conv2d) class as our convolutional layer. We define a CNN with 3 convolutional layers. Each convolution is followed by a [torch.nn.functional.relu](https://pytorch.org/docs/stable/nn.html#id26) non-linear activation function. At the end, we perform average pooling using [torch.nn.functional.avg_pool2d](https://pytorch.org/docs/stable/nn.html#avg-pool2d).
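
To see why a pooling window of size 4 fits this architecture, it helps to track the spatial size of the image through the three convolutions. The helper `conv_output_size` below is a hypothetical function written for this sketch. It applies the output-size formula for a convolution without dilation, using the kernel size, stride, and padding chosen in this tutorial.

```python
import torch
from torch import nn


def conv_output_size(n, kernel_size=3, stride=2, padding=1):
    """Spatial output size of a convolution without dilation."""
    return (n + 2 * padding - kernel_size) // stride + 1


# follow a 28 x 28 image through three identical stride-2 convolutions
sizes = [28]
for _ in range(3):
    sizes.append(conv_output_size(sizes[-1]))
print(sizes)  # [28, 14, 7, 4]

# cross-check the first step against an actual layer
conv = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
out = conv(torch.zeros(1, 1, 28, 28))
print(out.shape)  # torch.Size([1, 16, 14, 14])
```

The final 4 x 4 feature map is reduced to a single value per channel by average pooling with a window of 4, leaving one score for each of the 10 classes.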

In [None]:
class MNISTCNN(nn.Module):
    
    def __init__(self):
        super().__init__()
        self._conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
        self._conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self._conv3 = nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1)
        
    def forward(self, X):
        X = X.view(-1, 1, 28, 28) # implicit knowledge of MNIST data shape!
        X = F.relu(self._conv1(X))
        X = F.relu(self._conv2(X))
        X = F.relu(self._conv3(X))
        X = F.avg_pool2d(X, 4)
        return X.view(-1, X.size(1))
    

In [None]:
model_fn = MNISTCNN()
opt = optim.SGD(model_fn.parameters(), lr=0.1, momentum=0.9)

In [None]:
# note that we can re-use the loss function as well as training and validation data loaders
fit(model_fn, loss_fn, training_data_loader, opt, validation_data_loader)

# Refactor using `torch.nn.Sequential`

PyTorch has another handy class we can use to simplify our code: [torch.nn.Sequential](https://pytorch.org/docs/stable/nn.html#sequential). A `Sequential` object runs each of the modules contained within it, in a sequential manner. This is a simpler way of writing our neural network.

To take advantage of this, we need to be able to easily define a custom layer from a given function. For instance, PyTorch doesn’t have a view layer, and we need to create one for our network. `LambdaLayer` will create a layer that we can then use when defining a network with `Sequential`.

In [None]:
class LambdaLayer(nn.Module):
    
    def __init__(self, f):
        super().__init__()
        self._f = f
        
    def forward(self, X):
        return self._f(X)
    


In [None]:
model_fn = nn.Sequential(
    LambdaLayer(lambda X: X.view(-1, 1, 28, 28)),
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(4),
    LambdaLayer(lambda X: X.view(X.size(0), -1))
)

opt = optim.SGD(model_fn.parameters(), lr=0.1, momentum=0.9)

In [None]:
fit(model_fn,
    loss_fn,
    training_data_loader,
    opt,
    validation_data_loader)

# Generalize our pipeline by wrapping our DataLoader

Our CNN is fairly concise, but it only works with MNIST, because:

1. It assumes the input is a 28*28 long vector
2. It assumes that the final CNN grid size is 4*4 (since that’s the average pooling kernel size we used)

Let’s get rid of these two assumptions, so our model works with any 2D single-channel image. First, we can remove the initial Lambda layer by moving the data preprocessing into a generator:

In [None]:
class WrappedDataLoader:
    
    def __init__(self, data_loader, f):
        self._data_loader = data_loader
        self._f = f
        
    def __len__(self):
        return len(self._data_loader)
    
    def __iter__(self):
        for batch in iter(self._data_loader):
            yield self._f(*batch)


Next, we can replace `nn.AvgPool2d` with [torch.nn.AdaptiveAvgPool2d](https://pytorch.org/docs/stable/nn.html#adaptiveavgpool2d), which allows us to define the size of the output tensor we want, rather than the input tensor we have. As a result, our model will work with any size input.
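
A short sketch of this behaviour: whatever the spatial size of the incoming feature map, `nn.AdaptiveAvgPool2d(1)` returns one value per channel, equal to the mean over the spatial dimensions. The feature maps here are random made-up tensors, and `nn.Flatten` is used as a built-in stand-in for the `LambdaLayer` that this tutorial uses to flatten the result.

```python
import torch
from torch import nn

pool = nn.AdaptiveAvgPool2d(1)  # requested output size is 1 x 1
flatten = nn.Flatten()          # flattens every dimension except the batch dimension

shapes = []
matches_mean = []
for size in (4, 7, 13):
    feature_map = torch.randn(2, 10, size, size)  # batch of 2, 10 channels
    pooled = flatten(pool(feature_map))
    shapes.append(tuple(pooled.shape))
    # pooling to 1 x 1 is the same as averaging over height and width
    matches_mean.append(torch.allclose(pooled, feature_map.mean(dim=(2, 3)), atol=1e-6))

print(shapes)  # [(2, 10), (2, 10), (2, 10)]
```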

In [None]:
model_fn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    LambdaLayer(lambda X: X.view(X.size(0), -1))
)

opt = optim.SGD(model_fn.parameters(), lr=0.1, momentum=0.9)

_preprocess = lambda X, y: (X.view(-1, 1, 28, 28), y)
training_data_loader = WrappedDataLoader(training_data_loader, _preprocess)
validation_data_loader = WrappedDataLoader(validation_data_loader, _preprocess)

In [None]:
fit(model_fn,
    loss_fn,
    training_data_loader,
    opt,
    validation_data_loader)

# Using GPU(s) to accelerate training

GPUs can significantly speed up the training of deep neural networks. If you are running this notebook in Google Colab, you can take advantage of free GPUs to accelerate training of your models! To enable GPU acceleration, select `Runtime` -> `Change Runtime Type` from the toolbar and then select `GPU` from the hardware accelerator dropdown menu. **Changing the runtime type restarts the Python kernel, so you will need to re-run the relevant cells in the notebook!**

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
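
The sketch below illustrates, on a toy `nn.Linear` layer rather than the CNN from this tutorial, that both the model and its inputs have to live on the same device. The names `layer` and `batch` are illustrative; the code falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

layer = nn.Linear(4, 2)
layer.to(device)                     # modules are moved in place
batch = torch.rand(3, 4).to(device)  # tensors are copied, so keep the result

output = layer(batch)
print(output.device.type == device.type)
```

Note the asymmetry: calling `.to(device)` on a module moves its parameters in place, whereas calling it on a tensor returns a new tensor that must be assigned to a name.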

In [None]:
# define your deep learning model and make it available to the GPU
model_fn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    LambdaLayer(lambda X: X.view(X.size(0), -1))
)
model_fn.to(device)

# define a loss
loss_fn = F.cross_entropy

opt = optim.SGD(model_fn.parameters(), lr=0.1, momentum=0.9)

_preprocess = lambda X, y: (X.view(-1, 1, 28, 28).to(device), y.to(device))
training_data_loader = WrappedDataLoader(training_data_loader, _preprocess)
validation_data_loader = WrappedDataLoader(validation_data_loader, _preprocess)

In [None]:
fit(model_fn,
    loss_fn,
    training_data_loader,
    opt,
    validation_data_loader,
    number_epochs=5)

# Create your own model!

Using the above code as a template, try to create your own deep learning model to classify the MNIST data. Here are a few ideas to try.

1. Add more convolutional layers.
2. Add more channels (filters) to each convolutional layer.
3. Try different activation layers.
4. Try using a different optimizer.
5. Try tuning the hyper-parameters of your chosen optimizer.
6. Train the model for more epochs (but don't overfit!).
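
As one possible starting point (a sketch, not a recommended architecture), the example below combines ideas 2, 3 and 4: wider convolutional layers, a different activation and a different optimizer. The channel count of 32, the use of `nn.LeakyReLU`, the `optim.Adam` optimizer and its learning rate are all arbitrary assumptions, and `nn.Flatten` stands in for the `LambdaLayer` used above so that the snippet runs on its own.

```python
import torch
from torch import nn, optim

custom_model_fn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
    nn.LeakyReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
    nn.LeakyReLU(),
    nn.Conv2d(32, 10, kernel_size=3, stride=2, padding=1),
    nn.LeakyReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # (batch, 10, 1, 1) -> (batch, 10)
)

custom_opt = optim.Adam(custom_model_fn.parameters(), lr=1e-3)

# check the output shape on a random batch before training on real data
dummy_batch = torch.rand(8, 1, 28, 28)
logits = custom_model_fn(dummy_batch)
print(tuple(logits.shape))
```

Checking the output shape on a dummy batch is a cheap way to catch architecture mistakes before starting a full training run.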

In [None]:
model_fn = ???
model_fn.to(device)

opt = ???

In [None]:
fit(model_fn,
    loss_fn,
    training_data_loader,
    opt,
    validation_data_loader,
    number_epochs=5)

# Submitting to Kaggle

If you have created a Kaggle account, then you can submit your model's predictions to Kaggle and see how you stack up against your peers.

## Re-train the model using the entire training set

In [None]:
# re-scale the training features
_training_target, _training_features = mnist_arr[:, 0], mnist_arr[:, 1:]
_scaled_training_features = min_max_scaler.fit_transform(_training_features)

# create the tensors
_scaled_training_features_tensor = torch.tensor(_scaled_training_features, dtype=torch.float32)
_training_target_tensor = torch.tensor(_training_target)

# create the data loader
_training_data = data.TensorDataset(_scaled_training_features_tensor, _training_target_tensor)
_training_data_loader = data.DataLoader(_training_data, batch_size=batch_size, shuffle=True)

# wrap the data loader to reshape the data as needed
reshape = lambda X, y: (X.view(-1, 1, 28, 28).to(device), y.to(device))
wrapped_training_data_loader = WrappedDataLoader(_training_data_loader, reshape)


In [None]:
fit(model_fn,
    loss_fn,
    wrapped_training_data_loader,
    opt,
    number_epochs=5)

## Use trained model to make predictions using the test data

In [None]:
# load the raw test features (test.csv has no label column)
_testing_features = np.loadtxt("../data/raw/mnist/test.csv", delimiter=',', skiprows=1, dtype=np.int64)
# note we use transform method and NOT fit_transform!
_scaled_testing_features = min_max_scaler.transform(_testing_features)
scaled_testing_features_tensor = torch.tensor(_scaled_testing_features, dtype=torch.float32)
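
To illustrate why `transform` is used here instead of `fit_transform`, the toy example below assumes that `min_max_scaler` is a scikit-learn `MinMaxScaler`, as its name suggests. The scaler learns its minimum and maximum from the training data only and then reuses them for the test data; the arrays are made-up values.

```python
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaled_train = scaler.fit_transform(train)  # learns min=0 and max=10 from train
scaled_test = scaler.transform(test)        # reuses the training statistics

print(scaled_train.ravel(), scaled_test.ravel())
```

Test values are scaled relative to the training range, so they may fall outside [0, 1]. Calling `fit_transform` on the test data instead would silently rescale it with different statistics than the ones the model was trained on.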

In [None]:
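model_fn.eval()  # switch the model to inference mode
with torch.no_grad():  # gradients are not needed when making predictions
    output = model_fn(scaled_testing_features_tensor.view(-1, 1, 28, 28).to(device))
predictions = torch.argmax(output, dim=1)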

In [None]:
predictions

## Reformat predictions

In [None]:
# submission format for kaggle
!head ../data/raw/mnist/sample_submission.csv

In [None]:
import os
import time

import pandas as pd

# create the output directory if it does not already exist
os.makedirs("../data/kaggle-submissions/mnist/", exist_ok=True)

timestamp = time.strftime("%Y%m%d-%H%M%S")
number_predictions, = predictions.shape
# move predictions to the CPU and convert to a NumPy array before handing them to pandas
df = pd.DataFrame({"ImageId": range(1, number_predictions + 1), "Label": predictions.cpu().numpy()})
df.to_csv(f"../data/kaggle-submissions/mnist/submission-{timestamp}.csv", index=False)
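
The resulting file layout can be checked on a toy example. The sketch below uses three made-up predictions and writes the CSV to an in-memory buffer instead of a file; the names `toy_predictions`, `submission` and `buffer` are illustrative.

```python
import io

import numpy as np
import pandas as pd

toy_predictions = np.array([2, 0, 9])  # made-up labels, not real model output
submission = pd.DataFrame({
    "ImageId": range(1, len(toy_predictions) + 1),  # image ids start at 1
    "Label": toy_predictions,
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines)
```

The header row and the 1-based `ImageId` column should match the layout shown in `sample_submission.csv` above.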

## Submit to Kaggle!

Once you have successfully submitted your predictions, you can check the [Digit-Recognizer competition](https://www.kaggle.com/c/digit-recognizer) website to see how your best model compares to those of your peers.

In [None]:
%%bash
export KAGGLE_USERNAME="YOUR_USERNAME"
export KAGGLE_KEY="YOUR_API_KEY"
kaggle competitions submit digit-recognizer \
  -f $(ls ../data/kaggle-submissions/mnist/submission-*.csv | tail -n 1) \
  -m "My first ever Kaggle submission!"