# Classification

The goal of this interactive demo is to show you how a machine learning model can perform classification. Keep in mind, however, that the code in this notebook was simplified for the demo and should not be used as a plug-and-play example for real machine learning projects.

In this notebook we will explore three different types of classification models:

- Decision trees
- K-nearest neighbors
- Support vector machines

## The dataset

For the purpose of this demo we will use the [Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/) and only look at two relevant features. Using only two features allows us to plot the dataset in a 2-dimensional figure and color-code the background according to the class the model predicts. Keep in mind that this is only possible for such low-dimensional datasets and is not something that is routinely done in machine learning projects.

Let's go ahead and load and prepare the dataset:

In [None]:
import pandas as pd

# Load penguins dataset
df = pd.read_csv('penguins.csv')

# Drop missing values
df.dropna(inplace=True)

# Show 5 random rows in the dataset
df.sample(5)

Let's extract the size of the dataset and report how many penguins were recorded for each class.

In [None]:
# Report size of the dataset
print(f"The dataset contains {df.shape[0]} entries, with {df.shape[1]} columns.")

In [None]:
# Report class distribution
df.species.value_counts(normalize=True) * 100

For our classification demo, we focus only on 'bill length' and 'bill depth'. So let's reduce the dataset to what we need: a feature matrix $X$ and a target vector $y$.

In [None]:
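# Extract feature matrix X and give the columns readable names
X = df[['bill_length_mm', 'bill_depth_mm']].rename(columns={
    'bill_length_mm': "Bill length [mm]", 'bill_depth_mm': "Bill depth [mm]"})

# Extract target vector y and encode each species as an integer
class_map = {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}
y = df['species'].map(class_map)

# Class names ordered by integer code, so plot labels match the encoding
# regardless of the order in which species appear in the file
species = pd.Series(class_map).sort_values().index.to_numpy()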

Before we train a classifier on our dataset, let's first take a look at it. To facilitate this, we prepared a small plotting routine for you.

In [None]:
from utils import plot_data

plot_data(X, y)

Great, everything is ready for the classification!

## Classification using Decision Trees

Let's start with a rule-based classifier - the decision tree, which classifies samples through a chain of **if-then-else** rules. As this is a simple demo, let's focus on one single hyper-parameter - the **maximum depth of the tree**.

In [None]:
# Specify maximum depth of the tree
max_depth = 1

# Train decision tree classifier
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=max_depth)
%time clf.fit(X, y)

print("\nFinished training.")

In [None]:
# Plot decision tree
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
tree.plot_tree(clf, feature_names=X.columns, filled=True,
               proportion=True, class_names=species, impurity=False);

To better understand how this decision tree partitions the 2-dimensional plane of the dataset, let's plot the dataset together with the decision boundaries.

In [None]:
from utils import plot_decision_boundaries

plot_decision_boundaries(
    X, y, clf, species, title=f"Decision Tree with depth of {max_depth}")

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>max_depth</code> parameter above to a value between 1 and 100, and rerun the code cells above. Based on the outputs, what do you think is the best value for <code>max_depth</code>?
</div>
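
Judging `max_depth` from the plots alone can be misleading, because the tree is fitted and displayed on the same data, so a deeper tree tends to look better simply by memorizing the training points. One common way to compare values is to hold out part of the data and look at the accuracy on both parts. The sketch below illustrates this on synthetic two-feature data from `make_blobs`, an assumption made to keep the example self-contained; the cluster settings, the split ratio, and the list of depths are arbitrary choices and not part of this notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the penguin data: 3 overlapping classes, 2 features
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=0)

# Hold out 30% of the samples to estimate performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Fit one tree per depth and record accuracy on both parts of the data
depth_scores = []
for depth in [1, 2, 3, 5, 10, 20]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    depth_scores.append((depth, train_acc, test_acc))
    print(f"max_depth={depth:>2}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

If the accuracy on the held-out part stops improving, or drops, while the training accuracy keeps rising, the additional depth is most likely fitting noise rather than structure.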

## Classification using K-nearest neighbors

Next, let's take a look at a k-nearest neighbors model. Once more, to keep things simple, let's only look at one single hyper-parameter - the **number of neighbors** to consider for the model.

In [None]:
# Specify number of neighbors to consider
n_neighbors = 1

# Train k-nearest neighbors classifier
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=-1)
%time clf.fit(X, y)

print("\nFinished training.")

Once the model is trained, we can again plot the decision boundaries for our model.

In [None]:
from utils import plot_decision_boundaries

plot_decision_boundaries(X, y, clf, species, title=f"KNN with {n_neighbors} neighbors")

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>n_neighbors</code> parameter above to a value between 1 and 300 (the higher the number, the longer the computation), and rerun the two code cells. Based on the outputs, what do you think is the best value for <code>n_neighbors</code>?
</div>
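
To see what `n_neighbors` controls, it helps to spell out the prediction rule: for a new point, find the k closest training points and take a majority vote over their classes. The following is a minimal pure-Python sketch of that rule, not scikit-learn's implementation; the function `knn_predict` and the toy measurements are hypothetical and only serve as illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote (ties go to the label seen first among the nearest points)
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy (bill length, bill depth) values in mm; made up for illustration
points = [(39.0, 18.5), (38.5, 17.9), (40.1, 18.9),
          (46.5, 14.5), (47.2, 15.1), (45.8, 14.2)]
labels = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

# One atypical "Adelie" point placed in the middle of the "Gentoo" group
points.append((44.0, 15.5))
labels.append("Adelie")

query = (44.5, 15.2)
print(knn_predict(points, labels, query, k=1))  # → Adelie
print(knn_predict(points, labels, query, k=5))  # → Gentoo
```

With `k=1` the single atypical point decides the prediction, while with `k=5` it is outvoted. Small values therefore follow individual points closely and give jagged boundaries, whereas larger values average over wider neighborhoods and give smoother ones.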

## Classification using Support Vector Machines

Last but certainly not least, let's take a look at a support vector machine (SVM) model. This time, let's look at two hyper-parameters - the **regularization parameter `C`** and the **RBF kernel parameter `gamma`**.

In [None]:
# Specify regularization and RBF kernel parameters
C = 1
gamma = 0.1

# Train support vector machine classifier
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=C, gamma=gamma)
%time clf.fit(X, y)

print("\nFinished training.")

Once the model is trained, we can again plot the decision boundaries for our model.

In [None]:
from utils import plot_decision_boundaries

plot_decision_boundaries(
    X, y, clf, species, scaled=True, title=f"SVM with C={C} and gamma={gamma}")

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>C</code> and <code>gamma</code> parameters above to values between 0.0001 and 1000 (we recommend using powers of 10, i.e. 0.01, 0.1, 1, 10,...), and rerun the two code cells. Based on the outputs, what do you think are the best values for <code>C</code> and <code>gamma</code>?
</div>
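
For some intuition about `gamma`: the RBF kernel measures the similarity between two points as $\exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below evaluates this formula by hand for a nearby and a distant point; `rbf_kernel_value` is a hypothetical helper written for this illustration, and the sample points are arbitrary.

```python
import math

def rbf_kernel_value(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

origin = (0.0, 0.0)
near = (0.5, 0.0)   # squared distance to origin: 0.25
far = (3.0, 0.0)    # squared distance to origin: 9.0

# Compare how quickly the similarity decays for different gamma values
for gamma in [0.01, 0.1, 1, 10]:
    k_near = rbf_kernel_value(origin, near, gamma)
    k_far = rbf_kernel_value(origin, far, gamma)
    print(f"gamma={gamma:<5}: similarity near={k_near:.3f}, far={k_far:.3f}")
```

A larger `gamma` makes the similarity decay faster with distance, so each training point influences only its immediate surroundings and the decision boundary can become very wiggly; a smaller `gamma` gives smoother boundaries. `C` acts in the opposite direction to regularization strength: larger values penalize misclassified training points more heavily, which means weaker regularization.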