### Submission Instructions

Just fill in the markdown and code cells below with your arguments and functions, and run the Python lines given. Make sure the notebook works fine by executing `Kernel/Restart & Run All`.
  
Once the notebook is ready,
1. Create a folder named `afi_last_name1_last_name2` with the team's last names.

2. Put in that folder:

* a file `mp_afi_last_name1_last_name2.ipynb` with the cells below completed. Make sure it works by executing Kernel/Restart & Run All.
* a file `mp_afi_last_name1_last_name2.html` with an html rendering of the previous .ipynb file (just apply File / Download as HTML after a correct run of Kernel/Restart & Run All).
* a file `mp_afi_last_name1_last_name2.pdf` with a pdf print of the html file **without any code**.

3. Compress the folder into an `afi_last_name1_last_name2.7z` 7z (or zip) file.

**Very important!!!**

Make sure you follow the file naming conventions above; the miniproject won't be graded until that is so.

## Recommendations in notebook writing

Notebooks are a great tool for data and model exploration, but the trial-and-error process tends to leave a lot of Python garbage in them.

Once these tasks are done and one arrives at final ideas and insights on the problem under study, the notebook should be **thoroughly cleaned** and should **concentrate on the insights and conclusions** without, of course, throwing away the good work done.

Below there are a few guidelines about this.

* Put the useful bits of your code as functions in a **Python module** (plus a script, if needed) that is imported at the beginning of the notebook.
* Of course that module should be **properly documented** and **formatted** (try to learn about PEP 8 if you are going to write a lot of Python).
* Leave in the notebook **as little code as possible**, ideally one- or two-line cells calling a function, plotting results or so on.
* **Avoid boilerplate code**. If needed, put it in a module.
* Put on the notebook some way to **hide/display the code** (as shown below).
* The displayed information **should be just that, informative**. So forget about large tables, long output cells, dataframe or array displays and so on.
* Emphasize **insights and conclusions**, using as much markdown as needed to clarify and explain them.
* Make sure that **cells are numbered consecutively starting at 1.**
* And, of course, make sure that **there are no errors left**. To avoid these last pitfalls, run `Kernel/Restart & Run All`.

And notice that whoever reads your notebook is likely to toggle off your code and consider just the markdown cells. Because of this, once you feel that your notebook is finished,
* let it rest for one day, 
* then open it up, toggle off the code 
* and read it to check **whether it makes sense to you**.

If this is not the case, **the notebook is NOT finished!!!**

Following these rules you are much more likely to get good grades at school (and possibly also larger bonuses at work).

**IMPORTANT: before turning in your work, please REMOVE FROM IT THE PREVIOUS TWO CELLS**

In [1]:
from IPython.display import HTML

HTML('''
<script>code_show=true; 

function code_toggle() {
    if (code_show){
    $('div.input').hide();
    } else {
    $('div.input').show();
    }
    code_show = !code_show
} 

$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to show or hide your raw code."></form>
''')

# The MNIST Problem

The MNIST database (Modified National Institute of Standards and Technology database[1]) is a large database of handwritten digits that is commonly used for training and testing advanced machine learning algorithms. General references are:

**MNIST database**. Wikipedia. https://en.wikipedia.org/wiki/MNIST_database.

**THE MNIST DATABASE of handwritten digits**. Yann LeCun (Courant Institute, NYU), Corinna Cortes (Google Labs, New York), Christopher J.C. Burges (Microsoft Research, Redmond). http://yann.lecun.com/exdb/mnist/

**Classification datasets results**. Rodrigo Benenson. https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

The MNIST database contains 60,000 training images and 10,000 testing images. In our dataset the images will be 32 x 32 greyscale digit rasters.
In order to manage our computations in reasonable time, we are going to build our models working only with the test subset, which you can also further randomly split into train-validation and test subsets. If possible, you can also use the original train subset as a test one.

Do so taking into account your computing environment, while still trying to get models that are as good as possible.

### Student contributions

* Student `last_name_1` has ...
* Student `last_name_2` has ...

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [3]:
import os
import sys
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joblib
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC  # needed for the SVC Classifier section below

from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold, GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

from sklearn.pipeline import Pipeline

## Loading Data

There are several ways of getting the MNIST dataset. A simple one is to import it from the Keras library:

`from keras.datasets import mnist`  
`from matplotlib import pyplot`
 
`(train_X, train_y), (test_X, test_y) = mnist.load_data()`

We are going to use another version in which each pattern is a $32 \times 32 \times 1$ tensor, because the original $28 \times 28$ images have been zero-padded. Thus, you may have to reshape each pattern to either a matrix or a vector, depending on the task you want to perform.
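As a minimal sketch of that reshaping, the snippet below uses a synthetic stand-in array with the same trailing shape as the patterns described above. The names `images`, `as_matrices` and `as_vectors` are illustrative and not part of the provided bunch.

```python
import numpy as np

# Synthetic stand-in for a batch of patterns with shape (n, 32, 32, 1).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 32, 32, 1), dtype=np.uint8)

# Matrix form: drop the trailing channel axis -> (n, 32, 32), e.g. for image plots.
as_matrices = images.reshape(-1, 32, 32)

# Vector form: flatten each pattern -> (n, 1024), e.g. for scikit-learn estimators.
as_vectors = images.reshape(len(images), -1)

print(as_matrices.shape, as_vectors.shape)
```

Both views keep the pixel order of the original tensor, so position `p` of a flattened pattern corresponds to row `p // 32` and column `p % 32` of the matrix form.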

In [4]:
f_bnch = "E:\\data_bunch\\mnist\\mnist_32.bnch.joblib"
mnist = joblib.load(f_bnch)
print(mnist.keys())

print("data_shape: {0}".format(mnist['data'].shape))
print("data_test_shape: {0}".format(mnist['data_test'].shape))

dict_keys(['DESCR', 'data', 'target', 'data_test', 'target_test'])
data_shape: (60000, 32, 32, 1)
data_test_shape: (10000, 32, 32, 1)


## Data Exploration, Visualization and Correlations

Descriptive statistics, boxplots and histograms.

### Some examples

Plot 10 randomly chosen digit images as 5 x 2 subplots.

In [5]:
#code the plotting here

### Descriptive analysis

Build a DataFrame to make the exploratory analysis easier.

In [6]:
#define the DataFrame here

Describe the basic statistics of the pixels on the positions in the range `[494 : 502]` of the reshaped patterns.
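To see which pixels this range refers to, the sketch below maps the flattened positions back to image coordinates and summarizes them. It assumes row-major flattening of the $32 \times 32$ rasters and uses random stand-in data, so `X`, `df_pixels` and the `px_` column names are hypothetical.

```python
import numpy as np
import pandas as pd

side = 32                        # images are 32 x 32
positions = np.arange(494, 502)  # the slice [494 : 502] covers 8 positions

# With row-major flattening, position p maps to (p // side, p % side).
rows, cols = np.divmod(positions, side)
print(list(zip(rows.tolist(), cols.tolist())))

# Random stand-in for the flattened patterns, shape (n, 1024).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, side * side))

# One column per selected pixel, then the usual summary statistics.
df_pixels = pd.DataFrame(X[:, 494:502], columns=[f"px_{p}" for p in positions])
summary = df_pixels.describe()
print(summary)
```

Under that assumption the eight positions all lie in row 15, columns 14 to 21, that is, in the central area of the image.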

In [7]:
#perform the description here

### Boxplots

Compute and display the boxplots of pixels in the range `[494 : 502]`.

In [8]:
#code the boxplots here

### Histograms and scatterplots

Plot pairplots and histograms over the previous pixel range using `sns.pairplot`.  
To do so, first select two target digits that you expect to look quite different (e.g., 6 and 7) and apply `pairplot` only to the patterns from those two targets.

In [9]:
#select two target digits and apply sns.pairplot on the indicated pixel range

### Correlations

Use the previous digit selection and pixel range but drop the `target` column.

Display the correlations directly as a heatmap.

In [10]:
#display the correlations of the pixel range as a heatmap

### Data Analysis Conclusions

Write down here a summary of your conclusions after the basic data analysis

# Classifiers

We are going to build an MLP classifier **over the test dataset**.  
But before working with any classifier, we first split the test dataset into a train-validation subset and a test subset.  
For this, use the `StratifiedShuffleSplit` class from scikit-learn. Set the `test_size` parameter to either `0.5` or `0.75`.
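A minimal sketch of such a split is given below. It uses synthetic, balanced data rather than the MNIST bunch, and the variable names and the choices `test_size=0.5` and `random_state=0` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: 1000 patterns, 10 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = np.repeat(np.arange(10), 100)

# A single stratified split; test_size=0.5 is one of the two suggested values.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_val_idx, test_idx = next(splitter.split(X, y))

X_train_val, y_train_val = X[train_val_idx], y[train_val_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Class counts in each subset, to verify that the class proportions are preserved.
print(np.bincount(y_train_val), np.bincount(y_test))
```

Fixing `random_state` makes the split reproducible across `Restart & Run All` executions.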

### Splitting the test dataset

In [11]:
#split the test dataset here specifying clearly your choices

## MLP Classifier

### CV Hyperparametrization

Define an appropriate `MLPClassifier` and discuss in some detail your choices of the `MLPClassifier` parameters to ensure a proper convergence.

Then, perform CV to select proper `alpha` and `hidden_layer_sizes` hyperparameters.
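One possible way to organize this search is sketched below. It is not a reference solution: it runs on scikit-learn's small $8 \times 8$ digits set as a stand-in for the $32 \times 32$ data, and the step names, grid values, `max_iter` and fold count are assumptions chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8 x 8 digits from scikit-learn, NOT the 32 x 32 MNIST bunch.
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

# Scaling inside the pipeline is refit on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=300, random_state=0)),
])

# Illustrative grid only; real ranges should be wider and checked afterwards.
param_grid = {
    "mlp__alpha": [1e-4, 1e-2, 1.0],
    "mlp__hidden_layer_sizes": [(20,), (20, 20)],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

# CV results as a DataFrame, sorted so that the best combinations come first.
results = pd.DataFrame(search.cv_results_)
cols = ["param_mlp__alpha", "param_mlp__hidden_layer_sizes",
        "mean_test_score", "std_test_score"]
print(results.sort_values("rank_test_score")[cols].head(5))
```

Keeping the scaler inside the `Pipeline` means the validation folds never influence the scaling statistics.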

In [16]:
#define an appropriate MLP classifier and perform CV to select proper alpha and hidden_layer_sizes

### Search Results 

We first examine the test scores of the 5 best hyperparameters.

In [17]:
#transform the CV results into a DataFrame and display the 5 best results

We analyze the CV results to check whether the CV ranges used are correct.
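A common check is whether the best value of a hyperparameter sits at the edge of the grid, which suggests the range should be extended in that direction. The helper below is a hypothetical sketch of that check. The scores in `toy` are made up for illustration, and in practice the DataFrame would come from the `cv_results_` of a fitted `GridSearchCV`.

```python
import pandas as pd


def best_on_grid_edge(results, param_col, score_col="mean_test_score"):
    """Return True if the best-scoring value of param_col is the smallest or largest one tried."""
    # Best score reached for each distinct value of the hyperparameter.
    per_value = results.groupby(param_col)[score_col].max()
    best_value = per_value.idxmax()
    return bool(best_value in (per_value.index.min(), per_value.index.max()))


# Made-up scores for illustration only, NOT actual results.
toy = pd.DataFrame({
    "param_mlp__alpha": [1e-4, 1e-2, 1.0, 100.0],
    "mean_test_score": [0.90, 0.93, 0.95, 0.60],
})

# The best alpha here is 1.0, an interior value, so the grid brackets the optimum.
print(best_on_grid_edge(toy, "param_mlp__alpha"))
```

If the function returns `True`, the search should be repeated with a grid that extends beyond the current best value.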

In [18]:
#plot the test scores that correspond to each alpha; do this only for the best MLP architecture found

### Test MLPC Performance

We check the test accuracy and confusion matrix.
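As a reminder of how these quantities relate, here is a small sketch on hand-made labels (three classes, six patterns, for illustration only). In scikit-learn's `confusion_matrix`, rows correspond to true classes and columns to predicted classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Tiny hand-made labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Per-class recall is the diagonal divided by the row sums.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(acc)
print(cm)
print(recall_per_class)
```

The off-diagonal entries show which pairs of digits are confused with each other, which is usually more informative than the accuracy alone.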

In [19]:
#compute the test predictions, their accuracy and confusion matrix and discuss your results

### Conclusions on the MLP classifier

Give here a discussion as complete as possible of your results 

## SVC Classifier

### CV Hyperparametrization

Define an appropriate `SVC` model and discuss in some detail your choices of the `SVC` parameters to ensure a proper convergence.

Then, perform CV to select proper `C` and `gamma` hyperparameters.
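The search over `C` and `gamma` can be sketched as below. This is not a reference solution: it again uses scikit-learn's $8 \times 8$ digits as a stand-in, and the RBF kernel, grid values and fold count are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 8 x 8 digits from scikit-learn, NOT the 32 x 32 MNIST bunch.
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Logarithmic grids are the usual starting point for C and gamma.
param_grid = {
    "svc__C": [0.1, 1.0, 10.0, 100.0],
    "svc__gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

# Arrange the mean CV scores as a C x gamma table.
results = pd.DataFrame(search.cv_results_)
results["C"] = results["param_svc__C"].astype(float)
results["gamma"] = results["param_svc__gamma"].astype(float)
score_table = results.pivot_table(index="C", columns="gamma", values="mean_test_score")
print(score_table.round(3))
```

A table of this shape can be passed directly to a heatmap to inspect how the scores vary with both hyperparameters.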

In [16]:
#define an appropriate SVC classifier and perform CV to select proper  `C` and `gamma` hyperparameters.

### Search Results 

We first examine the test scores of the 5 best hyperparameters.

In [17]:
#transform the CV results into a DataFrame and display the 5 best results

We analyze the CV results to check whether the CV ranges used are correct.

In [18]:
#analyze graphically the test scores that correspond to each C and gamma hyperparameter

### Test SVC Performance

We check the test accuracy and confusion matrix.

In [19]:
#compute the test predictions, their accuracy and confusion matrix and discuss your results

### Conclusions on the SVC classifier

Give here a discussion as complete as possible of your results 

## Comparing the MLP and SVC Classifiers

Discuss the respective performances considering the choices you have made for each model and possible improvements on your approach.  
Consider both classification results and training times.