# Machine learning pipeline and MNIST

## Contents

1. Loading data
2. Preprocessing
3. Building different models
4. Training models
5. Model selection
6. Model evaluation

**Objectives:**

1. Put into practice what you have learned with the tutorials
1. Get familiar with the MNIST dataset
1. Build a classic machine learning pipeline that includes the following steps:
    - Loading data
    - Preprocessing data
    - Building different models
    - Training models
    - Selecting the best model
    - Evaluating the best model
    
Of course, in reality there are more intermediate steps, and the pipeline looks more like an iterative process than a straight line. Still, after completing this notebook you should have a good understanding of the machine learning pipeline and how to implement it in PyTorch. :)

**Andrew's videos related to this notebook**

If you are struggling with some concepts when completing this notebook, you can (re-)watch the following videos: 

- About train/validation/test datasets:  
    - [Train/Dev/Test Sets (C2W1L01)](https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2)
    - [Train/Dev/Test Set Distributions (C3W1L05)](https://www.youtube.com/watch?v=M3qpIzy4MQk&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b&index=6)
    - [Size of Dev and Test Sets (C3W1L06)](https://www.youtube.com/watch?v=_Fe5kKmFieg&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b&index=7)
- About the machine learning pipeline and model performance:
    - [Bias/Variance (C2W1L02)](https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=3)
    - [Basic Recipe for Machine Learning (C2W1L03)](https://www.youtube.com/watch?v=C1N_PDHuJ6Q&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4)
    - [Avoidable Bias (C3W1L09)](https://www.youtube.com/watch?v=CZf3oo0fuh0&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b&index=10)
    - [Improving Model Performance (C3W1L12)](https://www.youtube.com/watch?v=zg26t-BH7ao&list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b&index=13)
    


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
import torch.optim as optim
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.data import random_split
from collections import Counter

torch.manual_seed(123)

## 1. Loading data

In this notebook, we will use the MNIST dataset, a large dataset of black and white images of handwritten digits.

**TODO**

Write a ``load_MNIST`` function that:  
1. Loads the MNIST dataset (see [torchvision.datasets.MNIST](https://pytorch.org/vision/stable/datasets.html#mnist))
2. Splits the dataset into 3: training, validation and test datasets
3. Returns the 3 datasets

**Hints**

You can adapt the ``load_cifar`` function written at the beginning of the 2nd and 3rd tutorials. All steps of this function are detailed in the 1st tutorial, "1.1 Loading the CIFAR dataset in Pytorch".
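The splitting step can be sketched as follows. This is only a minimal sketch under stated assumptions, not the expected solution: `toy_dataset` is a synthetic stand-in (random 28x28 tensors) so that the cell runs without downloading anything, and the helper name `split_train_val`, the 80/20 ratio and the seed are hypothetical choices. With MNIST, you would pass the dataset loaded with `train=True` instead, and load the test dataset separately with `train=False`.

```python
import torch
from torch.utils.data import TensorDataset, random_split


def split_train_val(dataset, val_fraction=0.2, seed=123):
    """Split a dataset into training and validation subsets (hypothetical helper)."""
    n_val = int(len(dataset) * val_fraction)
    n_train = len(dataset) - n_val
    # A dedicated generator makes the split reproducible
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=generator)


# Synthetic stand-in for the MNIST training data (assumption: 100 random images)
toy_images = torch.rand(100, 1, 28, 28)
toy_labels = torch.randint(0, 10, (100,))
toy_dataset = TensorDataset(toy_images, toy_labels)

train_set, val_set = split_train_val(toy_dataset)
print(len(train_set), len(val_set))
```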

In [None]:
#TODO

#### Plot one instance of each class   

**TODO**

Plot one instance of each class.

**Hints**

- You can adapt the corresponding code from the 1st tutorial, "1.3 Plot images".
- You can use ``cmap='gray'`` when calling [ax.imshow](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.imshow.html?highlight=imshow#matplotlib.axes.Axes.imshow) in order to get black and white images.

In [None]:
#TODO

#### Count the number of elements for each class   

**TODO**

1. Count the number of elements for each class in each of the 3 datasets. 
2. Does the dataset seem balanced?

**Hints**

You can adapt the corresponding code from the 1st tutorial, "1.4 Count how many samples there are for each class"
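One possible way to count labels is with `collections.Counter`, which is already imported at the top of this notebook. The sketch below uses a tiny synthetic dataset, and `count_classes` is a hypothetical helper name; the same loop should work on any dataset whose items are `(image, label)` pairs.

```python
from collections import Counter

import torch
from torch.utils.data import TensorDataset


def count_classes(dataset):
    """Return a Counter mapping each label to its number of samples."""
    # int() handles both plain Python ints and 0-dimensional tensors
    return Counter(int(label) for _, label in dataset)


# Synthetic stand-in with a deliberately unbalanced label distribution
toy_labels = torch.tensor([0, 1, 1, 2, 2, 2])
toy_dataset = TensorDataset(torch.zeros(6, 1, 28, 28), toy_labels)

counts = count_classes(toy_dataset)
for label in sorted(counts):
    print(label, counts[label])
```

A dataset is roughly balanced when these counts are close to each other across classes.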

In [None]:
#TODO

## 2. Preprocessing

The preprocessing step typically comes *after* loading the data, but as we saw in the tutorials, the preprocessor can be passed as a parameter of the loading function in PyTorch.

#### TODO

1. Compute the mean and the standard deviation of the training dataset. Note that, unlike the CIFAR dataset, the MNIST dataset contains only black and white images, so there is only one channel. The mean and the standard deviation should therefore each be a scalar and not a tensor of 3 elements.
1. Re-load your 3 datasets, now including a preprocessor that:
    1. Crops the images from 28x28 to 24x24 (see [transforms.CenterCrop](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.CenterCrop))
    2. Converts images to tensors (see [transforms.ToTensor](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.ToTensor))
    3. Normalizes the dataset using the computed mean and standard deviation (see [transforms.Normalize](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.Normalize))
    
**Hints**
- You can adapt the corresponding code from the 1st tutorial, "2. Transforms"
- Use [transforms.Compose](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.Compose) to define your preprocessor as a combination of your transforms

In [None]:
#TODO

In [None]:
#TODO

## 3. Building different models

**TODO**

Define 2 (or 3) neural networks by writing classes inheriting from the ``nn.Module`` class.

**Hints**

- See the 3rd tutorial, "2.2 Using the functional API"
- 2-3 layers are enough, otherwise the training will take too long if you don't have a gpu.
- Use only linear layers ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)) and, if needed, non-trainable operations such as [F.max_pool2d](https://pytorch.org/docs/stable/generated/torch.nn.functional.max_pool2d.html?highlight=max_pool#torch.nn.functional.max_pool2d) or [torch.flatten](https://pytorch.org/docs/stable/generated/torch.flatten.html#torch.flatten). We will study the other types of layers later this semester.
- Remember that we don't need a softmax function in the output layer if we use [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentro#torch.nn.CrossEntropyLoss) as the loss function
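As an illustration of these hints, here is a minimal sketch of one such module. The class name `TwoLayerNet`, the hidden size of 64 and the ReLU activation are assumptions made for this example, not requirements of the exercise. The input size 24 * 24 assumes images cropped to 24x24 as in section 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerNet(nn.Module):
    """Hypothetical example: two linear layers with a ReLU in between."""

    def __init__(self, n_hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(24 * 24, n_hidden)
        self.fc2 = nn.Linear(n_hidden, 10)  # 10 digit classes

    def forward(self, x):
        # Flatten everything except the batch dimension: (N, 1, 24, 24) -> (N, 576)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        # Return raw logits: no softmax, since nn.CrossEntropyLoss applies it internally
        return self.fc2(x)


model = TwoLayerNet()
logits = model(torch.rand(4, 1, 24, 24))
print(logits.shape)
```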


In [None]:
#TODO

## 4. Training models

**TODO**  
Write a function ``train`` that 
- Trains the model for ``n`` epochs (complete passes through the training dataset)
- Computes and stores the training loss and the validation loss for each epoch
- Returns the list of training and validation losses

**Hints**
- You can find how to train and compute the training loss in the ``train`` function in the tutorials. However, you need to modify this function in order to return the validation loss as well.
- Use [model.train()](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train) and [model.eval()](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval) correctly (see ``train`` and ``compute_accuracy`` in the tutorials)
- Use "with [torch.no_grad()](https://pytorch.org/docs/stable/generated/torch.no_grad.html#torch.no_grad):" correctly (see ``compute_accuracy`` in the tutorials)
- Don't backpropagate the loss when computing validation loss (see ``compute_accuracy`` in the tutorials)
- Don't update weights when computing validation loss (see ``compute_accuracy`` in the tutorials)
- Here, we want the validation loss, not the accuracy on the validation dataset.
- Both ``train`` and ``compute_accuracy`` are written at the beginning of the 3rd tutorial, with detailed explanation in the 2nd tutorial.
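The hints above can be put together along the following lines. This is a sketch under assumptions, not the tutorials' ``train`` function: the argument order, the use of the mean batch loss as the epoch loss, and the synthetic data and tiny `nn.Sequential` model at the bottom are all choices made for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train(n_epochs, optimizer, model, loss_fn, train_loader, val_loader):
    """Train for n_epochs and return per-epoch training and validation losses."""
    losses_train, losses_val = [], []
    for epoch in range(n_epochs):
        # Training phase: gradients are computed and weights are updated
        model.train()
        running = 0.0
        for imgs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(imgs), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        losses_train.append(running / len(train_loader))

        # Validation phase: no backpropagation and no weight update
        model.eval()
        running = 0.0
        with torch.no_grad():
            for imgs, labels in val_loader:
                running += loss_fn(model(imgs), labels).item()
        losses_val.append(running / len(val_loader))
    return losses_train, losses_val


# Synthetic data and a tiny model, only to check that the function runs
toy = TensorDataset(torch.rand(32, 1, 24, 24), torch.randint(0, 10, (32,)))
loader = DataLoader(toy, batch_size=8)
model = nn.Sequential(nn.Flatten(), nn.Linear(24 * 24, 10))
optimizer = optim.SGD(model.parameters(), lr=0.01)
losses_train, losses_val = train(2, optimizer, model, nn.CrossEntropyLoss(), loader, loader)
print(losses_train, losses_val)
```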

In [None]:
#TODO

**TODO**  

1. Define a train loader and a validation loader. 
2. Train 3 different models, i.e.:
    1. Instantiate a model using your custom PyTorch modules  
    2. Choose an optimizer (and its parameters, e.g. the learning rate)  
    3. Choose a loss function (e.g. cross-entropy)
    4. Train your model and store its training and validation loss

**Hints**

- Again, you can find some hints in the tutorials
- 20 epochs or even a bit less are enough if your computer is a bit slow
- If you defined only 2 modules in section 3, you can also play with the learning rate 
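For illustration, the loaders, optimizer and loss function could be set up as in the sketch below. The datasets are synthetic stand-ins, and the batch size of 32, the SGD optimizer and the learning rate of 0.01 are assumptions rather than recommended values.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for the preprocessed training and validation datasets
train_set = TensorDataset(torch.rand(64, 1, 24, 24), torch.randint(0, 10, (64,)))
val_set = TensorDataset(torch.rand(16, 1, 24, 24), torch.randint(0, 10, (16,)))

# Shuffle only the training data; the validation order does not matter
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

# Tiny placeholder model; replace it with one of your own modules
model = nn.Sequential(nn.Flatten(), nn.Linear(24 * 24, 10))
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Sanity check on one batch before launching a full training run
imgs, labels = next(iter(train_loader))
print(imgs.shape, labels.shape, loss_fn(model(imgs), labels).item())
```

Each model needs its own optimizer instance, because the optimizer is bound to the parameters passed at construction.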

In [None]:
#TODO

In [None]:
#TODO

#### Plotting the evolution of the training loss and the validation loss during the training

**TODO**

For each of your 3 models, plot the training loss and the validation loss. You can adapt the code in the cell below to plot your curves.

In [None]:
xvalues = np.linspace(0, 20, 1000)
yvalues01 = np.cos(xvalues)
yvalues02 = np.sin(xvalues)

fig, ax = plt.subplots()

# Plot cosine and specify its legend label
ax.plot(xvalues, yvalues01, label='cosine')
# Plot sine and specify its legend label
ax.plot(xvalues, yvalues02, label='sine')
ax.set_title('Cosine and Sine')
ax.set_xlabel('x label')
ax.set_ylabel('y label')
# Show legend
ax.legend()

In [None]:
#TODO

## 5. Model selection

1. Write a function ``compute_accuracy`` that computes the accuracy of a given model on a given dataset. You can find all you need in the tutorials.
2. Select your best model, that is to say:
    1. For each model, compute the accuracy on the validation dataset
    2. Choose the model with the highest accuracy
3. Print the training and the validation accuracies of your selected model
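The selection procedure above might look like the following sketch. It is not the tutorials' ``compute_accuracy``: the signature taking a loader is an assumption, and `AlwaysZero` is a hypothetical dummy model that always predicts class 0, used only so that the result can be checked by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def compute_accuracy(model, loader):
    """Return the fraction of correctly classified samples in loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for imgs, labels in loader:
            # The predicted class is the index of the largest logit
            preds = model(imgs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.shape[0]
    return correct / total


class AlwaysZero(nn.Module):
    """Hypothetical dummy model that always predicts class 0."""

    def forward(self, x):
        logits = torch.zeros(x.shape[0], 10)
        logits[:, 0] = 1.0
        return logits


# 3 of the 4 labels are 0, so the dummy model is right 3 times out of 4
labels = torch.tensor([0, 0, 0, 1])
loader = DataLoader(TensorDataset(torch.rand(4, 1, 24, 24), labels), batch_size=2)
acc = compute_accuracy(AlwaysZero(), loader)

# Model selection: keep the name with the highest validation accuracy
val_accuracies = {"always_zero": acc}
best_name = max(val_accuracies, key=val_accuracies.get)
print(best_name, acc)
```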

In [None]:
#TODO

## 6. Model evaluation

**TODO**

1. Evaluate your selected model, that is to say, compute and print the accuracy of your selected model on the test dataset

In [None]:
#TODO