# Data representation in machine learning

We will use the numpy, pandas, and matplotlib libraries extensively throughout the lectures.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## A Simple Example: the Iris Dataset

Classifiers are algorithms that automatically derive statistical rules from a set of data in order to make a decision. In this first session, we focus on the data structure that is usually used to feed them. We start with a toy dataset called `iris`, available in `scikit-learn`.

The data consist of measurements of iris flowers belonging to three different species, illustrated below:

<table style="width:100%">
  <tr>
    <th>Species</th>
    <th>Image</th>
  </tr>
  <tr>
    <td>Iris Setosa</td>
    <td><img src="./figures/iris_setosa.jpg" width="80%"></td>
  </tr>
  <tr>
    <td>Iris Versicolor</td>
    <td><img src="./figures/iris_versicolor.jpg" width="80%"></td>
  </tr>
  <tr>
    <td>Iris Virginica</td>
    <td><img src="./figures/iris_virginica.jpg" width="80%"></td>
  </tr>
</table>

Botanists might be interested in differentiating the iris species automatically. Let's check what data they collected about these flowers.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

`iris` is a dictionary-like object (a `scikit-learn` `Bunch`) containing all the information about the dataset. We can review what was loaded by listing its keys.

In [None]:
iris.keys()

`iris.data` contains the measurements made by the botanists, while `iris.target` contains the species of each flower, encoded as an integer.

In [None]:
iris.data[:5]

In [None]:
iris.target[:5]
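
The integers in `iris.target` are easier to read once they are translated back into species names. Here is a minimal sketch of that translation using only numpy. The `target_names` array is typed by hand and is assumed to mirror `iris.target_names`, and the `target` values are hypothetical labels rather than the ones stored in the dataset.

```python
import numpy as np

# Assumed label-to-name mapping, mirroring what `iris.target_names` holds.
target_names = np.array(["setosa", "versicolor", "virginica"])

# A few hypothetical integer labels, using the same encoding as `iris.target`.
target = np.array([0, 0, 1, 2, 1])

# Indexing the array of names with the labels turns each integer into a name.
species = target_names[target]
print(species)
```

With the real dataset, the same indexing would read `iris.target_names[iris.target]`.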

Checking the shape of the data array helps us understand the data representation.

In [None]:
iris.data.shape

`iris.data` is a 2D array containing 150 rows and 4 columns. Each row is a measured flower (i.e., a sample) while each column is a flower characteristic (i.e., a feature).
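
The `(n_samples, n_features)` convention can be illustrated on a tiny array. The sketch below uses three hand-typed rows as stand-ins for real measurements. The name `X` is only a common convention for the data array, not something defined earlier in this notebook.

```python
import numpy as np

# Toy data: 3 samples (rows), each described by 4 features (columns).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])

n_samples, n_features = X.shape
print(n_samples, n_features)

# Selecting a row gives all the features of one sample.
first_sample = X[0]
# Selecting a column gives one feature across all samples.
first_feature = X[:, 0]
print(first_sample.shape, first_feature.shape)
```

Indexing along the first axis walks through samples, and indexing along the second axis walks through features. `iris.data` follows the same layout with 150 rows.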

In [None]:
iris.feature_names

The variable `iris.feature_names` tells us which characteristics the botanists measured: the length and width of the sepal and of the petal of each of the 150 iris flowers. The image below shows the difference between a sepal and a petal:

<img src="figures/petal_sepal.jpg" alt="Sepal" style="width: 50%;"/>

(Image: "Petal-sepal". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg#/media/File:Petal-sepal.jpg)

We can use a pandas dataframe to organise this information in a single data structure.

In [None]:
df_X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_X.head()
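
A dataframe can also hold the species next to the measurements. The sketch below is one possible way to do so, shown on three hand-typed rows so that it runs on its own. The variables `X`, `target`, and `target_names` are illustrative stand-ins for `iris.data`, `iris.target`, and `iris.target_names`.

```python
import numpy as np
import pandas as pd

# Column names assumed to match `iris.feature_names`.
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# Illustrative stand-ins for the data, the integer labels and their names.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])
target = np.array([0, 1, 2])
target_names = ["setosa", "versicolor", "virginica"]

df = pd.DataFrame(data=X, columns=feature_names)
# Store the labels as a categorical column built from the integer codes.
df["species"] = pd.Categorical.from_codes(target, categories=target_names)
print(df)
```

Keeping the species as a categorical column preserves the integer encoding while displaying readable names.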

We can take a quick look at the pairwise relationships between the different features.

In [None]:
pd.plotting.scatter_matrix(df_X, figsize=(8, 8), diagonal='kde',
                           c=iris.target);

## Exercise

Later, we will work with images. Color images are represented as 3D arrays while grayscale images are represented as 2D arrays. Let's open an image in Python.

In [None]:
import os
from imageio import imread

In [None]:
filename = os.path.join('figures', 'iris_setosa.jpg')
img = imread(filename)

In [None]:
print(img.shape)

In [None]:
plt.imshow(img)
plt.axis('off');

We can also select a single channel (e.g., the red channel).

In [None]:
img_red = img[..., 0]
plt.imshow(img_red, cmap=plt.cm.gray)
plt.axis('off');

In [None]:
img_red.shape
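
The effect of selecting a channel on the array shape can be checked without an image file. The sketch below builds a small synthetic array in place of a real picture. The 4 x 6 size and the pixel values are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic "color image": height 4, width 6, 3 channels (red, green, blue).
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)
print(img.shape)

# `...` stands for "all the leading axes", so this keeps every row and
# column but only the first channel. It is equivalent to img[:, :, 0].
img_red = img[..., 0]
print(img_red.shape)
print(np.array_equal(img_red, img[:, :, 0]))
```

Dropping the channel axis turns the 3D array into a 2D array, which is why a single channel can be displayed as a grayscale image.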

### Questions:

Let's imagine the following classification problem. In the iris dataset, the measurements are no longer the length and width of the sepals and petals. Instead, the botanists took a picture of each sample.

* How would you organize the array of data to follow the structure (n_samples, n_features) as previously done?
* What constraint does this impose on the size of the images?

`scikit-learn` provides a toy dataset named `digits` representing hand-written digits. The idea is to detect which digit is written on each image.

* Load the dataset by importing the function `load_digits` from `sklearn.datasets`.

In [None]:
# %load solutions/00_01.py

* List the keys of the dictionary returned by the function `load_digits`.

In [None]:
# %load solutions/00_02.py

* What is the difference between `data` and `images`? Hint: check the size of those arrays.

In [None]:
# %load solutions/00_03.py

In [None]:
# %load solutions/00_04.py

* Plot the first array in `images` using `matplotlib`

In [None]:
# %load solutions/00_05.py

* Using the array in `data`, reshape its first row so that the same sample can be plotted as an image. Hint: use the function `reshape` to get the desired array shape.

In [None]:
# %load solutions/00_06.py

Before moving on to the next notebook, we will run a machine learning experiment. We will train a linear classifier on a chunk of the dataset and test our model on the left-out data.

* Import the function `train_test_split` from `sklearn.model_selection`. Check the documentation and split the data into a training set and a testing set.

In [None]:
# %load solutions/00_07.py

* Import the following function and classes:
    * `make_pipeline` from `sklearn.pipeline`
    * `MinMaxScaler` from `sklearn.preprocessing`
    * `SGDClassifier` from `sklearn.linear_model`

In [None]:
# %load solutions/00_08.py

* Create a `scikit-learn` pipeline by chaining a scaler and a classifier. Check the documentation of `make_pipeline` if you have any doubts. For the classifier, use the logistic loss instead of the default one (`loss='log_loss'` in recent `scikit-learn` releases; older releases call it `'log'`).

In [None]:
# %load solutions/00_09.py

* Train the classifier only on the training data.

In [None]:
# %load solutions/00_10.py

* We can now evaluate our classifier. Import `recall_score` from `sklearn.metrics` and use it to check the predictions of our classifier on the testing data. Do not average the results (i.e., report one score per class with `average=None`).

In [None]:
# %load solutions/00_11.py