# Workshop: Clustering and Classification with Python

The objective of this workshop is to introduce you to the basics of clustering and classification with Python. We will use the [scikit-learn](http://scikit-learn.org/stable/) library, which is a very popular library for machine learning in Python. We will also use the [pandas](http://pandas.pydata.org/) library for data manipulation and the [matplotlib](http://matplotlib.org/) library for plotting (seen in the previous workshop).


For this workshop, we will use the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is a very famous data set in machine learning. It contains 150 samples of 3 different species of iris flowers (50 samples for each species). For each sample, 4 features are measured: the length and the width of the sepals and petals, in centimeters. The goal is to classify the samples into the 3 different species based on these 4 features.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#### Exercise 1: Discovering the scikit-learn library

First of all, we will use the sklearn library to load the Iris data set, and use pandas to study it and plot it.
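
As a starting point, here is a minimal sketch of one way to load the data and inspect it. It assumes the copy of the data set bundled with scikit-learn (`load_iris`); the `df` and `species` names are only illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data set shipped with scikit-learn (no download needed)
iris = load_iris()

# Convert it to a pandas DataFrame: one column per feature
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the species name of each sample as a categorical column
# (the column name "species" is an illustrative choice)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Display the first 5 rows of the DataFrame
print(df.head())
```

`df.describe()` and `df["species"].value_counts()` are also handy for a quick overview of the value ranges and the class balance.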

In [2]:
# Load the data 

# Convert it to a pandas dataframe

# Display the first 5 rows of the dataframe

#### Exercise 2: Split the data set into a training set and a test set

To evaluate the performance of a classifier, we need to test it on data that it has never seen before. Therefore, we need to split the data set into a training set and a test set. We will use the training set to train the classifier, and the test set to evaluate its performance.
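
One possible way to do the split is scikit-learn's `train_test_split`. The sketch below is an assumption about how you might set it up: `random_state=42` is an arbitrary seed that makes the split reproducible, and `stratify=y` keeps the three species equally represented in both sets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the 4 features, y the species index (0, 1 or 2)
X, y = load_iris(return_X_y=True)

# Keep 70% of the samples for training and 30% for testing.
# random_state is an arbitrary seed; stratify=y preserves class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
```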

In [3]:
# Separate the data into a test and training set (70% training and 30% test)


#### Exercise 3: Visualize the data set

Before training a classifier, it is always a good idea to visualize the data set. There are many ways to do so; here we will use a scatter plot in which each data point is a dot whose x and y coordinates are the values of its first and second features. We will use the pandas library to do this.
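
A minimal sketch of such a scatter plot, using the pandas plotting interface, could look like this. The `target` column name, the colormap and the title are illustrative choices. In a notebook the figure is rendered automatically; in a script you would add `plt.show()` from matplotlib.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Store the class index so it can be used to color the points
# (the column name "target" is an illustrative choice)
df["target"] = iris.target

# Scatter plot of the first feature against the second feature
ax = df.plot.scatter(
    x=iris.feature_names[0],
    y=iris.feature_names[1],
    c="target",
    colormap="viridis",
)
ax.set_title("Iris: first feature vs. second feature")
```

Changing the two feature indices lets you look at other pairs of features, which helps with the questions below.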

In [4]:
# Visualize the data


Some questions need to be answered before studying the data set:

- How many features are there?
- What type of classification problem is this? (binary, multi-class, multi-label, ...)
- How many classes are there?
- Which features are the most discriminative? (i.e. which features best distinguish the different classes?)
- Are the classes linearly separable? (i.e. can we draw a straight line to separate the classes?)

#### Exercise 4: Identify the most discriminative features

We saw in the previous exercise that the features are not equally discriminative: some distinguish the classes better than others. The least informative features can therefore often be removed with little effect on the performance of the classifier. Choosing which features to keep is called feature selection.

Several methods exist for selecting the most discriminative features. Pick one and create a feature ranking.

Use sklearn to do this.
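
Here is one possible sketch, assuming a univariate ANOVA F-test (`f_classif`) as the scoring method. This is only one of several options; tree-based feature importances or mutual information would be equally valid choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()

# Establish which features are the most important:
# a higher F-score means the feature separates the classes better
f_scores, p_values = f_classif(iris.data, iris.target)

# Rank the features in order of importance (highest score first)
ranking = np.argsort(f_scores)[::-1]

# Print the feature ranking
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. {iris.feature_names[idx]} (F = {f_scores[idx]:.1f})")
```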

In [None]:
# Establish which features are the most important

# Rank the features in order of importance

# Print the feature ranking

#### Exercise 5: Train a supervised classifier

Now that we know which features are the most discriminative, we can train a classifier. Because every sample in this data set comes with a species label, predicting the species is a supervised classification problem.

Before turning to clustering algorithms, which do not use the labels, we will start with a supervised classifier as a point of comparison.

You can use any supervised classifier you want, but I recommend a K-Nearest Neighbors classifier. You can find the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier).
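
A minimal sketch of training a K-Nearest Neighbors classifier is shown below. The choice of `n_neighbors=5` (the scikit-learn default) and the seed used for the split are assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Create the classifier; 5 neighbors is an arbitrary starting value
knn = KNeighborsClassifier(n_neighbors=5)

# Train it on the training set only
knn.fit(X_train, y_train)

# Accuracy on the training set (fraction of correctly classified samples)
train_accuracy = knn.score(X_train, y_train)
print(f"Training accuracy: {train_accuracy:.3f}")
```

The training accuracy is usually optimistic, which is why the next exercise measures performance on the test set.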

In [5]:
# Implement a KNN classifier, train it on the training set and print the accuracy score on the training set



#### Exercise 6: Evaluate the performance of the classifier

Now that we have trained a classifier, we need to evaluate its performance. We will use the test set to do this. 

1. Predict the class of each sample in the test set.
2. Compare the predicted class with the true class of each sample in the test set. (use sklearn to do this)
3. Compute and plot the confusion matrix. (use sklearn to do this)

> A confusion matrix is a table that is often used to describe the performance of a classifier on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
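
The three steps above can be sketched as follows. This assumes the same split and KNN settings as a plausible solution to the previous exercises; `ConfusionMatrixDisplay` is used here for the plot, but a heatmap drawn by hand would work too.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# 1. Predict the class of each sample in the test set
y_pred = knn.predict(X_test)

# 2. Compare the predictions with the true classes
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")

# 3. Compute the confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Plot it with the species names on the axes
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
display.plot()
```

Off-diagonal cells of the matrix show which species get confused with each other.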

In [6]:
# Print the accuracy score on the test set


# Print the confusion matrix


#### Exercise 7: Implement an unsupervised clustering algorithm

Now that we have trained a supervised classifier, we will train an unsupervised clustering algorithm. I suggest the K-Means algorithm. You can find the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans).
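
A minimal K-Means sketch could look like the following. It assumes we ask for 3 clusters because we know there are 3 species; `n_init=10` and `random_state=42` are arbitrary settings that make the result reproducible. Note that the labels `y_train` are never passed to the algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3 clusters, 10 random initializations, fixed seed (all illustrative choices)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit on the features only: clustering does not use the species labels
train_clusters = kmeans.fit_predict(X_train)

print("Cluster centers:")
print(kmeans.cluster_centers_)
```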

#### Exercise 8: Evaluate the performance of the unsupervised clustering algorithm

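Now that we have trained an unsupervised clustering algorithm, we need to evaluate its performance. As before, we will use the test set to do this. Keep in mind that cluster indices are arbitrary and do not necessarily match the class indices, so the predicted clusters cannot be compared with the true labels directly through an accuracy score.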
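
One possible way to evaluate the clustering, sketched below, is the adjusted Rand index, which compares two groupings without caring how the groups are numbered. This metric is an assumption on my part; other choices (homogeneity, or mapping each cluster to its majority class) are equally reasonable. A cross-tabulation shows which species end up in which cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

# Assign each test sample to the nearest cluster center
test_clusters = kmeans.predict(X_test)

# Adjusted Rand index: 1.0 means identical groupings, about 0.0 means random
ari = adjusted_rand_score(y_test, test_clusters)
print(f"Adjusted Rand index on the test set: {ari:.3f}")

# Contingency table: rows are true classes, columns are cluster indices
table = pd.crosstab(
    pd.Series(y_test, name="true class"),
    pd.Series(test_clusters, name="cluster"),
)
print(table)
```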

#### Exercise 9: Plot each of the estimated clusters and the true clusters

Now that we have trained and tested both algorithms, we will plot the results of each one, observe the differences, and try to explain the errors each algorithm makes.
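
A side-by-side comparison can be sketched as follows. It assumes the same split and model settings as above and plots the two petal features (indices 2 and 3), which is an illustrative choice. Colors in the K-Means panel need not match the species colors, because cluster indices are arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Predictions of both algorithms on the test set
knn_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)
kmeans_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train).predict(X_test)

# One panel per labeling: ground truth, KNN, K-Means
panels = [
    ("True species", y_test),
    ("KNN predictions", knn_pred),
    ("K-Means clusters", kmeans_pred),
]
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)
for ax, (title, labels) in zip(axes, panels):
    ax.scatter(X_test[:, 2], X_test[:, 3], c=labels, cmap="viridis")
    ax.set_title(title)
    ax.set_xlabel(iris.feature_names[2])
axes[0].set_ylabel(iris.feature_names[3])
```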


Which algorithm do you think is the best for this dataset? Why?

# Challenge

Now that you have trained and tested both algorithms, try to improve their performance. For the supervised classifier, you can change its parameters or use a different classifier. For the unsupervised clustering algorithm, you can change its parameters or use a different algorithm.
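
For the supervised classifier, one possible way to explore parameters systematically is a cross-validated grid search. The sketch below is only an example: the parameter grid and the 5-fold setting are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate values to try (an arbitrary, illustrative grid)
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation on the training set only
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of the best model: {search.score(X_test, y_test):.3f}")
```

Tuning on the training set and keeping the test set for the final check avoids overestimating the performance.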

You can also improve the process by applying dimensionality reduction before training the classifier. Try it.
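
As a rough sketch of this idea, the pipeline below standardizes the features, projects them onto 2 principal components with PCA, and then trains a KNN classifier. PCA, the number of components and the scaling step are all assumptions; other reduction techniques could be used instead.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale, reduce from 4 to 2 dimensions, then classify
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)

pca_accuracy = model.score(X_test, y_test)
print(f"Test accuracy with 2 principal components: {pca_accuracy:.3f}")

# Share of the total variance kept by each component
print(model.named_steps["pca"].explained_variance_ratio_)
```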

# Conclusion

In this workshop, we have seen how to use the scikit-learn library to train a supervised classifier and an unsupervised clustering algorithm. Clustering is an important part of machine learning and is used in many applications, such as image segmentation, document clustering, and market segmentation.

# To go further

Multiple other clustering algorithms exist. You can try to use them and compare their performance. You can find a list of clustering algorithms [here](http://scikit-learn.org/stable/modules/clustering.html#clustering).
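
As an illustration, the sketch below compares K-Means with agglomerative (Ward) clustering on the full data set, using the adjusted Rand index against the true species. Both the choice of the second algorithm and the metric are assumptions made for the example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

# Two clustering algorithms, both asked for 3 clusters
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3),
}

# Cluster the data with each model and compare with the true species
scores = {
    name: adjusted_rand_score(y, model.fit_predict(X))
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: adjusted Rand index = {score:.3f}")
```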