# Datacamp: Introduction to NumPy

    Alexandre Gramfort : alexandre.gramfort@inria.fr

The main goal of this notebook is to get familiar with Python and NumPy by manipulating a famous dataset in machine learning.

The data are embedded in the `scikit-learn` library.

This dataset is known as `digits` and contains images of hand-written digits with their associated labels.

# I - Manipulations and visualization of the `digits` dataset

## Imports and initialization

In [None]:
%matplotlib inline                      

import numpy as np                      # load a package for numerical computing
import matplotlib.pyplot as plt         # load a package for plotting

## Dataset description

We will load the `digits` dataset available in `scikit-learn` (the import name is `sklearn`). This dataset contains images of hand-written digits.

In [None]:
# Load the dataset from scikit-learn
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

In [None]:
X

In [None]:
X.shape

In [None]:
X.ndim

In [None]:
y.shape

In [None]:
y.size

In [None]:
X.size, 1797*64

In [None]:
X[0, :].size

In [None]:
y

In [None]:
y.max(), y.min()

In [None]:
np.unique(y)

In [None]:
print(f"Number of pixels (features):      {X.shape[1]}")
print(f"Number of images (samples):       {X.shape[0]}")
print(f"Number of classes:                {len(np.unique(y))}")

In [None]:
# Choose any image (sample)
idx_to_test = 15

print("Show a row of the array (i.e., an image):")
print(X[idx_to_test, :])
print("Show the associated class (i.e., the label):")
print(y[idx_to_test])

<div class="alert alert-success">
    <b>EXERCISE:</b>
     <ul>
      <li>What is the data type of X? y?</li>
      <li>Change `idx_to_test`. Without showing y[idx_to_test], can you recognize the digit in this sample?</li>
    </ul>
</div>

## Data visualization:

The digitized images have a size of 8 x 8, for a total of 64 pixels. Each image is stored as a row vector, which needs to be reshaped to be visualized as an image. You can use the function `numpy.reshape` to transform the 1D array into a 2D array of 8 x 8 values.
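
As a small illustration of the reshaping described above, here is a sketch on toy arrays rather than on the real data. The names `flat`, `batch`, and `images` are hypothetical and only serve the example.

```python
import numpy as np

# Toy stand-in for one flattened image: 64 values in a 1D array.
flat = np.arange(64)

# Reshape into 8 rows of 8 values. NumPy fills the result row by row
# (C order), so the first 8 values become the first row, and so on.
image = np.reshape(flat, (8, 8))

# Toy stand-in for a whole dataset: 3 flattened images stacked as rows.
batch = np.arange(3 * 64).reshape(3, 64)

# Passing -1 lets NumPy infer that dimension (here, the number of images).
images = batch.reshape(-1, 8, 8)

print(image.shape, images.shape)
```

The same call should apply to a row of `X`, since each row holds 64 pixel values.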

In [None]:
# We use `imshow` to visualize the image
plt.imshow(np.reshape(X[0, :], (8, 8)));

In [None]:
# Use a grayscale colormap for a better visualization
plt.imshow(np.reshape(X[idx_to_test, :], (8, 8)),
           cmap='gray', aspect='equal', interpolation='nearest')
plt.colorbar()
plt.title(f"The associated class with the image at index {idx_to_test} is {y[idx_to_test]}")

<div class="alert alert-success">
    <b>EXERCISE:</b>
     <ul>
      <li>Show an image keeping only every other row and every other column.</li>
      <li>Show the previous image after removing the pixels on the border.</li>
      <li>Show the pixel distribution using `plt.hist`.</li>
    </ul>
</div>
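
The slicing patterns useful for this exercise can be tried first on a toy array. The sketch below uses a hypothetical 6 x 6 array named `toy`, not the digits data, and uses `np.histogram` as a non-graphical counterpart to `plt.hist`.

```python
import numpy as np

# Toy 6 x 6 "image" whose values are simply 0..35.
toy = np.arange(36).reshape(6, 6)

# A step of 2 keeps every other row and every other column.
every_other = toy[::2, ::2]

# 1:-1 drops the first and last index along each axis (the border).
inner = toy[1:-1, 1:-1]

# A histogram works on a flat list of values, so flatten the image first.
counts, bin_edges = np.histogram(toy.ravel(), bins=6)

print(every_other.shape, inner.shape, counts)
```

With `plt.hist`, the flattened array would be passed in the same way as it is to `np.histogram` here.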


## Basic statistics:

To better understand the dataset, we will look at a few statistics.
We start by computing the mean and the variance of each class, i.e., for each digit. The per-class mean can be displayed as an 8 x 8 image that is an average representation of each digit from zero to nine. The same holds for the variance, which shows the regions that vary the most among the members of a given class.

* What do the mean and variance images represent?

In [None]:
classes_list = np.unique(y).astype(int)
print("Classes present: ", classes_list)

<div class="alert alert-success">
    <b>EXERCISE:</b>
     <ul>
      <li>Compute a mean representative of the digit 0 (the image whose pixel (i, j) holds the mean value of pixel (i, j) over all the 0s).</li>
      <li>With a `for` loop, compute the mean representative of each digit.</li>
      <li>Do the same, replacing the mean with the standard deviation.</li>
      <li>Display all the images on a grid using the function `plt.subplots`.</li>
    </ul>
</div>
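
One possible way to compute per-class statistics is to select the rows of each class with a boolean mask. The sketch below runs on hypothetical toy data (`X_toy`, `y_toy`) with 4 features per sample instead of 64.

```python
import numpy as np

# Toy data: 6 samples with 4 features each, and labels in {0, 1, 2}.
X_toy = np.array([[0., 0., 0., 0.],
                  [2., 2., 2., 2.],
                  [1., 3., 1., 3.],
                  [3., 5., 3., 5.],
                  [4., 4., 4., 4.],
                  [8., 8., 8., 8.]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

classes = np.unique(y_toy)
means = np.zeros((len(classes), X_toy.shape[1]))
stds = np.zeros_like(means)

for i, c in enumerate(classes):
    mask = y_toy == c                     # True for the samples of class c
    means[i] = X_toy[mask].mean(axis=0)   # average over samples, per feature
    stds[i] = X_toy[mask].std(axis=0)     # spread over samples, per feature

print(means)
print(stds)
```

On the digits data, each row of such a `means` array would hold 64 values and could be reshaped to 8 x 8 before being displayed.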


# II - Nearest centroids classification

The aim of this exercise is to implement your own classifier based on an intuitive idea.
For a new image, we will predict the class for which the mean digit is the closest (in the feature space).
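
To make the idea concrete, here is a minimal sketch on toy 2D points. It assumes the Euclidean distance as the notion of "closest", which the text above does not specify, and the names `centroids` and `new_points` are hypothetical.

```python
import numpy as np

# Two toy class centroids in a 2D feature space (class 0 and class 1).
centroids = np.array([[0., 0.],
                      [10., 10.]])

# Three new points to classify.
new_points = np.array([[1., 2.],
                       [9., 7.],
                       [4., 4.]])

# Broadcasting: (3, 1, 2) - (1, 2, 2) gives all point/centroid differences.
diff = new_points[:, np.newaxis, :] - centroids[np.newaxis, :, :]

# Euclidean distance (an assumption) from each point to each centroid.
distances = np.sqrt((diff ** 2).sum(axis=2))   # shape (3, 2)

# The predicted class is the index of the nearest centroid.
predicted = distances.argmin(axis=1)
print(predicted)
```

Here the row index of `centroids` doubles as the class label, which matches the digits case where the classes are 0 to 9.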

<div class="alert alert-success">
    <b>EXERCISE:</b>
     <ul>
      <li>Split the dataset into 2 parts. We will denote them with the variables X_train, y_train, X_test, and y_test for the training data and labels and the testing data and labels, respectively.</li>
      <li>For each class, compute the mean digit image on the training set. We will denote the resulting variable `centroids_train`.</li>
      <li>For each sample in the testing set, find the nearest centroid. Compute the percentage of correct predictions to evaluate the performance of your classifier.</li>
    </ul>
</div>
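
The splitting and scoring steps can be sketched with NumPy alone. The example below is one possible approach on hypothetical toy data: the 70/30 split ratio, the random seed, and the hand-made predictions `y_pred` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 samples with 3 features, and alternating labels 0/1.
X_toy = rng.normal(size=(10, 3))
y_toy = np.arange(10) % 2

# Shuffle the sample indices, then cut them into two disjoint groups.
indices = rng.permutation(len(X_toy))
n_train = 7  # assumed split size, roughly 70% for training
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, y_train = X_toy[train_idx], y_toy[train_idx]
X_test, y_test = X_toy[test_idx], y_toy[test_idx]

# Hypothetical predictions: copy the true labels and flip the first one,
# only to show how the percentage of correct predictions is computed.
y_pred = y_test.copy()
y_pred[0] = 1 - y_pred[0]

accuracy = np.mean(y_pred == y_test) * 100
print(f"Correct predictions: {accuracy:.1f}%")
```

Shuffling before splitting avoids a split that depends on the storage order of the samples. In the exercise, `y_pred` would come from the nearest centroid of each test sample.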