# Supervised classification with `scikit-learn`
----
In this notebook, we will train and compare different supervised classification algorithms implemented in the Python library `scikit-learn`. Supervised classification means learning from labelled training data in order to predict the class of unlabelled data.

## Load training data
----
We start by importing `numpy` and one of the many demo datasets that come with `scikit-learn`:

In [1]:
import numpy as np
from sklearn.datasets import load_iris

Next, we load the dataset into a variable called `iris`:

In [2]:
iris = load_iris()

 This dataset was published in 1936 and is a widely used example dataset for machine learning:
 
 https://en.wikipedia.org/wiki/Iris_flower_data_set
 
 ![Image of Iris features](https://www.pngkey.com/png/detail/82-826789_iris-iris-sepal-and-petal.png)


## Investigate data
----
Four features were measured:

In [3]:
print(iris['feature_names'])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The measured values are our input data. We store them in the array `X`, with one row per sample and one column per feature:

In [4]:
X = iris.data

### TASK 1:
----
How many samples were measured?

In [5]:
X.shape

(150, 4)

----

We store the actual categories (ground truth) in the array 'y'. There are three categories:

In [6]:
y = iris.target
print(np.unique(y))
print(iris['target_names'])

[0 1 2]
['setosa' 'versicolor' 'virginica']
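
The integer labels in `y` index directly into `target_names`. The following is a minimal sketch of how labels could be translated into species names and counted per class; the variable names `species`, `labels` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
y = iris.target

# target_names is a NumPy array, so indexing it with the integer labels
# translates every label into its species name in one step.
species = iris.target_names[y]

# Count how many samples belong to each class.
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print(label, iris.target_names[label], count)
```

Each of the three species is represented by 50 samples, so the dataset is balanced.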


### TASK 2:
----
Display the four measured values and the ground truth for the first 10 samples.

In [7]:
print(y[:10])

[0 0 0 0 0 0 0 0 0 0]
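
The cell above prints only the labels. One possible way to show the four measurements next to each label is sketched below; it reloads the data so that it runs on its own, and the print layout is just one choice among many.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Slice the first 10 rows of the features and the matching labels.
first_X, first_y = X[:10], y[:10]

print(iris.feature_names, "-> label")
for features, label in zip(first_X, first_y):
    # Show the four measurements, the integer label and the species name.
    print(features, "->", label, f"({iris.target_names[label]})")
```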


## Splitting data into training and test set
----
A very important practice in machine learning is to hold back ("hide") some of the labelled data from the algorithm, so that its performance can be evaluated on samples it has not seen during training:

In [8]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.5, random_state=0)

print(train_X.shape)
print(test_X.shape)

(75, 4)
(75, 4)
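
A random split does not guarantee that every class is equally represented in both halves. As a hedged illustration, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below compares the label counts of the test set with and without it; `plain_counts` and `strat_counts` are names chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# train_test_split returns [train_X, test_X, train_y, test_y];
# index 3 therefore holds the test labels.
plain = train_test_split(X, y, train_size=0.5, random_state=0)
stratified = train_test_split(X, y, train_size=0.5, random_state=0, stratify=y)

# np.bincount counts how often each label 0, 1, 2 occurs.
plain_counts = np.bincount(plain[3], minlength=3)
strat_counts = np.bincount(stratified[3], minlength=3)

print("plain split, test labels:     ", plain_counts)
print("stratified split, test labels:", strat_counts)
```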


## Training a classifier in `scikit-learn`
----
The different classifiers in `scikit-learn` share a common interface, which makes it easier to compare them. We start with a very simple example, the k-nearest neighbors classifier, which assigns a sample to the class that is most common among its closest training samples.

First, the classifier class needs to be imported:


In [9]:
from sklearn.neighbors import KNeighborsClassifier

As a first step, we have to create an instance of the classifier using its constructor. Here we set `n_neighbors=2`; every parameter we do not pass keeps its default value (the default for `n_neighbors` is 5):

In [10]:
knn = KNeighborsClassifier(n_neighbors=2)
knn

Next, we use the `fit()` method of the classifier to train it, passing the training samples and their labels. The test data is not used in this step:

In [11]:
knn.fit(train_X, train_y)

## Applying a trained classifier to test data
----
After the training has finished, we can call the `predict()` method on the previously unseen test data:

In [12]:
knn.predict(test_X)

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0,
       0, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1,
       1, 2, 0, 0, 2, 1, 0, 0, 1])

## Measuring the performance of a classifier
----
How well did the prediction actually work? Let's look at the real ground truth labels:

In [13]:
print(test_y)

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1 1 1 2 0 2 0 0 1 2 2 2 2 1 2 1 1 2 2 2 2 1 2 1 0 2 1 1 1 1 2 0 0 2 1 0 0
 1]
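
Comparing the two arrays by eye is error-prone. A minimal sketch of how the mismatches could be located programmatically follows; it repeats the split and the training from the cells above so that it runs on its own, and `predicted` and `wrong` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.5, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=2).fit(train_X, train_y)

predicted = knn.predict(test_X)

# Indices of the test samples where prediction and ground truth disagree.
wrong = np.flatnonzero(predicted != test_y)

print("misclassified:", len(wrong), "of", len(test_y))
for i in wrong:
    print(f"sample {i}: predicted {predicted[i]}, actual {test_y[i]}")
```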


Can you spot the differences? To quantify the prediction accuracy, we can use the `score()` method of the classifier, which returns the fraction of test samples that were classified correctly:

In [14]:
print("Accuracy = {:.4f}".format(knn.score(test_X, test_y)))

Accuracy = 0.8933
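
For a classifier, the accuracy is simply the share of correct predictions. The sketch below recomputes it by hand and, as an optional extra that is not part of the original notebook, uses `confusion_matrix` from `sklearn.metrics` to show which classes get confused with each other. It repeats the earlier setup so that it is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.5, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=2).fit(train_X, train_y)
predicted = knn.predict(test_X)

# Accuracy by hand: mean of a boolean array = fraction of True values.
manual_accuracy = np.mean(predicted == test_y)
print("Accuracy = {:.4f}".format(manual_accuracy))

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(test_y, predicted)
print(cm)
```

The diagonal of the matrix holds the correctly classified samples; every off-diagonal entry counts one kind of mistake.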


### TASK 3:
----
Try some other classifiers from scikit-learn (https://scikit-learn.org/stable/supervised_learning.html) and compare their performance with the KNN algorithm.

Modify the parameters of the algorithms and try to improve their accuracy. If you change the parameters, you have to retrain the classifier using `fit()`.
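
As one possible starting point for the parameter experiments, the sketch below creates and retrains a KNN classifier for several values of `n_neighbors`. The list of values is an arbitrary choice for illustration. Note that picking the best parameter on the test set makes the reported score optimistic, so treat the numbers as exploratory.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.5, random_state=0
)

scores = {}
for k in (1, 2, 3, 5, 10, 15):
    # A changed parameter requires a new fit() before scoring.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_X, train_y)
    scores[k] = knn.score(test_X, test_y)
    print(f"n_neighbors={k:2d}  accuracy={scores[k]:.4f}")
```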

In [15]:
# Examples:
#
# from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
# from sklearn.svm import SVC





In [16]:
rf = RandomForestClassifier()

In [17]:
rf.fit(train_X, train_y)

In [18]:
print("Accuracy = {:.4f}".format(rf.score(test_X, test_y)))

Accuracy = 0.9333
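
Because all classifiers share the `fit()` / `score()` interface, the comparison can be written as a loop. This is a sketch under assumptions: the selection of classifiers and the dictionary names are illustrative. A random forest is built with randomness, so its score can vary between runs unless `random_state` is fixed, which this example does.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.5, random_state=0
)

classifiers = {
    "KNN (k=2)": KNeighborsClassifier(n_neighbors=2),
    "SVC": SVC(),
    # random_state makes the forest reproducible between runs.
    "Random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(train_X, train_y)                    # same call for every model
    results[name] = clf.score(test_X, test_y)    # accuracy on the test set
    print(f"{name:15s} accuracy = {results[name]:.4f}")
```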
