In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Data science formalism

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

In [None]:
iris.keys()

In supervised machine learning, each sample in the data comes with a corresponding label. Let's look at the data we have.

In [None]:
X_df = pd.DataFrame(iris.data, columns=iris.feature_names)

In [None]:
X_df.tail()

In [None]:
y = pd.Series(iris.target, name='target')
y = y.apply(lambda x: iris.target_names[x])

In [None]:
y.tail()

In [None]:
sns.pairplot(data=pd.concat([X_df, y], axis=1), hue='target')

I created a pandas dataframe and a pandas series from the original data. Let's check what type the original data were.

In [None]:
type(iris.data)

In [None]:
type(iris.target)

So these variables are NumPy arrays. NumPy is a package for working efficiently with numeric data in Python, and it is the main package scikit-learn uses to handle data. Let's look at an example: we will train a classifier that predicts the majority class among the nearest neighbors of a sample.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Create the classifier.

In [None]:
classifier = KNeighborsClassifier()

Train the classifier.

In [None]:
classifier.fit(iris.data, iris.target)

Let's say that we got a new iris flower, measured its sepals and petals, and organised the measurements in the same way as before.

In [None]:
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

Our classifier will be able to tell us which class this flower belongs to.

In [None]:
classifier.predict(new_flower)
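
To make the idea of a "majority of nearest neighbors" concrete, here is a minimal NumPy sketch of the voting rule on a tiny made-up dataset. The helper name `predict_majority_vote`, the toy points, and the choice of Euclidean distance with `k=3` are assumptions for illustration only; this is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import numpy as np


def predict_majority_vote(X_train, y_train, x_new, k=3):
    """Hypothetical helper: label x_new by a majority vote of its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and keep the most frequent one
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# Toy data (made up): two features, two classes
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = predict_majority_vote(X_toy, y_toy, np.array([0.15, 0.15]))
pred_high = predict_majority_vote(X_toy, y_toy, np.array([0.95, 1.0]))
print(pred_low, pred_high)
```

The first point sits among the class `0` samples and the second among the class `1` samples, so the votes are unanimous in both cases.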

Be aware that this classifier could have been trained directly on the dataframe and the series: the features are all numeric, and scikit-learn converts them into NumPy arrays internally. Since the labels in `y` are now strings, the predictions are class names instead of integers.

In [None]:
classifier.fit(X_df, y)

In [None]:
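# Reuse the training column names; recent scikit-learn versions warn when they are missing
classifier.predict(pd.DataFrame(new_flower, columns=X_df.columns))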

## 2. Difference between NumPy array and a pandas dataframe

We can quickly see the difference between a NumPy array and a pandas dataframe by printing them.

In [None]:
iris.data

In [None]:
X_df.head()

So the dataframe has an index and column names. However, both hold the same data.

In [None]:
X_df.shape

In [None]:
iris.data.shape

Another major difference concerns the data types:

In [None]:
X_df.dtypes

In [None]:
iris.data.dtype

A dataframe has a data type for each column, while a NumPy array has a single data type for all its values. Note that we can always get a NumPy array from a dataframe.

In [None]:
X_df.values
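
The per-column versus single data type distinction matters most when columns hold different kinds of data. The sketch below uses a small made-up dataframe (the column names and values are hypothetical) to show that converting mixed columns to a NumPy array falls back to a generic `object` data type, whereas a purely numeric selection stays a float array. It uses `to_numpy()`, which returns the same kind of array as `.values`.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one numeric and one string column
mixed_df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Each column keeps its own data type in the dataframe
print(mixed_df.dtypes)

# The array can only have one data type, so it falls back to Python objects
mixed_array = mixed_df.to_numpy()
print(mixed_array.dtype)

# Selecting only the numeric column keeps an efficient float data type
numeric_array = mixed_df[["petal_length"]].to_numpy()
print(numeric_array.dtype, numeric_array.shape)
```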

We can select values from an array in the same way as position-based selection (`iloc`) in pandas.

In [None]:
iris.data[0, :]

In [None]:
iris.data[:, 0]

In [None]:
iris.target[0]
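
As a small side-by-side check, the following sketch compares NumPy indexing with `iloc` on a made-up array (the array values and column names `a` and `b` are assumptions for illustration). Both select by position, so the extracted values match.

```python
import numpy as np
import pandas as pd

# Made-up 3x2 array and a dataframe built from it
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
df = pd.DataFrame(arr, columns=["a", "b"])

# First row: all columns of row 0
first_row_np = arr[0, :]
first_row_pd = df.iloc[0, :].to_numpy()

# First column: all rows of column 0
first_col_np = arr[:, 0]
first_col_pd = df.iloc[:, 0].to_numpy()

print(np.array_equal(first_row_np, first_row_pd))
print(np.array_equal(first_col_np, first_col_pd))
```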

## 2. Quick numerical analysis

We already saw that we can use pandas to do quick numerical analysis.

In [None]:
X_df.mean()

However, you don't always want to convert your data to a dataframe just to compute simple statistics. Let's compute the `mean` as we would in pandas. What does this mean value represent?

In [None]:
iris.data.mean()

In [None]:
X_df.mean().mean()

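By default, NumPy computes the mean over all values of the array, which here matches the mean of the column means. We can use the `axis` keyword to specify along which axis a statistic is computed.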

In [None]:
iris.data.mean(axis=0)

In [None]:
iris.data.mean(axis=1)
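
The effect of `axis` is easier to follow on an array small enough to check by hand. The sketch below uses a made-up array with 2 samples and 3 features; the variable names are chosen for illustration.

```python
import numpy as np

# Made-up array: 2 samples (rows) and 3 features (columns)
toy = np.array([[1.0, 2.0, 3.0],
                [5.0, 6.0, 7.0]])

overall_mean = toy.mean()          # one number over all 6 values
feature_means = toy.mean(axis=0)   # collapse the rows: one mean per column
sample_means = toy.mean(axis=1)    # collapse the columns: one mean per row

print(overall_mean)   # 4.0
print(feature_means)  # [3. 4. 5.]
print(sample_means)   # [2. 6.]
```

With samples stored as rows, `axis=0` gives one statistic per feature and `axis=1` gives one statistic per sample.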

## 3. Particular example of image classification

A very common use case in classification is classifying images. I want to present this case because it is not obvious how to represent the data.

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

In [None]:
digits.keys()

So we have some image data. We can first check the shape of this numpy array.

In [None]:
digits.images.shape

And we can plot the first sample of this array which is an image.

In [None]:
plt.imshow(digits.images[0])

However, we saw that scikit-learn requires a 2D data array. Let's check how the data are stored.

In [None]:
digits.data.shape

In [None]:
digits.data[0]

In [None]:
digits.target.shape

In [None]:
n_images, height, width = digits.images.shape
print(n_images, height, width)

So we can easily go from a 3D array to a 2D array and vice-versa.

In [None]:
plt.imshow(digits.data[0].reshape((height, width)))
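
The round trip between the 3D and 2D layouts can be sketched on a made-up stack of tiny "images" (the sizes below are arbitrary and chosen for illustration). Passing `-1` to `reshape` asks NumPy to infer that dimension from the total number of values.

```python
import numpy as np

# Made-up stack of 3 "images", each 2 pixels high and 4 pixels wide
images = np.arange(24).reshape(3, 2, 4)
n_images, height, width = images.shape

# Flatten each image into one row of height * width features
data = images.reshape(n_images, -1)
print(data.shape)  # (3, 8)

# Restore the original 3D layout
restored = data.reshape(n_images, height, width)
print(np.array_equal(images, restored))  # True
```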

Now we can train a classifier and see how it behaves.

In [None]:
classifier.fit(digits.data, digits.target)

In [None]:
new_example = digits.data[0]

In [None]:
classifier.predict(new_example)

We get a nasty error because we provided a 1D vector to the classifier. Scikit-learn requires a 2D array of shape `(n_samples, n_features)` so that it can tell samples from features.

In [None]:
new_example.shape

In [None]:
new_example.reshape(-1, 1).shape

In [None]:
new_example[:, np.newaxis].shape

In [None]:
new_example.reshape(1, -1).shape

In [None]:
new_example[np.newaxis, :].shape
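
The four expressions above come in two pairs. A short sketch on a made-up three-element vector shows which pair describes one sample with several features and which describes several samples with one feature; the variable names are chosen for illustration.

```python
import numpy as np

vector = np.array([10, 20, 30])  # made-up 1D vector, shape (3,)

# One sample with 3 features: 1 row, 3 columns
as_one_sample = vector.reshape(1, -1)
# Three samples with 1 feature each: 3 rows, 1 column
as_one_feature = vector.reshape(-1, 1)

print(as_one_sample.shape)
print(as_one_feature.shape)

# np.newaxis produces the same arrays as the corresponding reshape calls
print(np.array_equal(as_one_sample, vector[np.newaxis, :]))
print(np.array_equal(as_one_feature, vector[:, np.newaxis]))
```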

Now that we know how to specify that our vector is a single sample, we can try to predict again.

In [None]:
classifier.predict(new_example[np.newaxis, :])

What if I have two single samples to classify and want to join them to make the predictions?

In [None]:
another_example = digits.data[1]

In [None]:
datasets = np.vstack([new_example, another_example])

In [None]:
datasets.shape

In [None]:
datasets

In [None]:
classifier.predict(datasets)
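
To see why `np.vstack` is the right tool here, the following sketch contrasts it with `np.concatenate` on two made-up 1D vectors. Stacking vertically keeps one row per sample, whereas concatenating 1D vectors merges them into a single longer vector that would be rejected as 1D input.

```python
import numpy as np

# Two made-up samples with 3 features each
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# One row per sample: a 2D array of shape (n_samples, n_features)
stacked = np.vstack([a, b])
# A single 1D vector of 6 values: the sample boundary is lost
joined = np.concatenate([a, b])

print(stacked.shape)  # (2, 3)
print(joined.shape)   # (6,)
```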