<img src=../images/gdd-logo.png width=300px align=right>

# Classification

In this notebook, we shall classify penguin species based on body measurements using the Scikit-Learn API.

We shall first introduce the dataset and the Scikit-Learn library. Afterwards we will cover the following aspects:

- [Loading in the data](#loading-in-the-data)    
    - [<mark>Exploring the dataset</mark>](#exploring-the-dataset) 
    - [Visualising the dataset](#visualising-the-dataset)  
- [Preparing the data for sklearn](#preparing)
    - [Splitting the dataset](#train-test-split)
- [Model creation & evaluation](#model)
    - [Training and evaluating a Scikit-Learn model](#steps)
    - [Alternative metrics](#metrics)
    - [Prediction and Inference](#inference)
    - [Visualising the model](#vis)
    - [<mark>Choosing a different model</mark>](#choosing-models)  

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/logo.png)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## About the data
The data was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. Their goal was to provide a great dataset for data exploration, visualisation and - in this case - a demonstration of the Scikit-Learn API.

The data set contains measurements for different species of penguins living at the Palmer station:

|Field|Description|
|:---|:---|
|species|The species of the penguin: Adelie, Chinstrap or Gentoo|
|island|The island on which the penguin was spotted|
|bill_length_mm|The length of the penguin's bill in mm|
|bill_depth_mm|The depth of the penguin's bill in mm|
|flipper_length_mm|The length of the penguin's flipper in mm|
|body_mass_g|The weight of the penguin in grams|
|sex|The sex of the penguin: Female or Male|

<img src="../images/02_Classification_Penguins/culmen_depth.png" width="600">

## Scikit-Learn
Scikit-Learn is *the* library for machine learning in Python. You could consider it the Swiss Army knife of machine learning. A wide variety of machine learning models are implemented by the community and core developers, all sharing a consistent API. Once you master this API, it's easy to apply many different machine learning algorithms, and you have a handy tool to help you out with preprocessing, model evaluation and model selection.

#### Why Scikit-Learn?
- Many available machine learning models
- Models are implemented by an expert team and checked by a large community
- Covers most machine-learning tasks
- Commitment to documentation, consistency and usability
- Designed to work with other key Python libraries (NumPy, Pandas etc)

<a id = 'loading-in-the-data'></a>
## 1. Loading in the data

There are many places your data can originate from. Maybe you want to load it from an Excel file stored locally on your system, or maybe you have a .csv file stored online somewhere. Scikit-Learn also ships with various standard datasets for practice, which can be loaded as soon as Scikit-Learn is installed on your system.

Our dataset will be loaded in as a Pandas dataframe and can be used as such. Pandas is a powerful library for data wrangling.

In [None]:
penguins = pd.read_csv('../data/penguins.csv')
penguins.head(10)

<a id = 'exploring-the-dataset'></a>
## <mark> Exercise: Exploring the dataset </mark>

Below are some typical things you may want to check as part of your initial investigation of the dataset.

1. How many rows and columns are present in the data?

2. Which data types are used by each column?

3. Are there any missing values?

4. How many species are there?

5. How many penguins are there for each species?
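
As a hint for the checks listed above, here is a minimal sketch of the pandas calls that could answer them. It runs on a tiny made-up table (the name `toy` and all its values are hypothetical, not taken from the real dataset), so you would swap in `penguins` when working through the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the penguins table (made-up values).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0, 50.0],
    "body_mass_g": [3750.0, 3800.0, 4500.0, np.nan, 5700.0],
})

n_rows, n_cols = toy.shape                      # 1. rows and columns
column_types = toy.dtypes                       # 2. data type of each column
missing_per_column = toy.isna().sum()           # 3. missing values per column
n_species = toy["species"].nunique()            # 4. number of distinct species
species_counts = toy["species"].value_counts()  # 5. penguins per species

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_per_column)
print(f"{n_species} species")
print(species_counts)
```

`toy.info()` gives a compact overview that covers the first three questions in a single call.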

<a id = 'visualising-the-dataset'></a>
## Visualising the dataset 

To understand the dataset better it can be useful to create some visualisations.

Below is a histogram of the penguins' flipper lengths:

In [None]:
sns.histplot(data=penguins, x='flipper_length_mm')

We can use visualisations to examine how different the data is for the different species.

For example, here is a histogram of flipper lengths *for the different species*. Would you be able to separate the species based on this measurement alone?

In [None]:
sns.histplot(data=penguins, x='flipper_length_mm', hue='species')

Let's examine the relationship between two variables.

Below is a scatter plot of flipper length vs. body mass:

In [None]:
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g')

It may be easier to distinguish different species when we look at more than one variable.

Here is a scatter plot of flipper length vs. body mass *for the different species*. Would you be able to separate the species based on the relationship between these measurements?

In [None]:
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')

Seaborn also allows us to see this information for each numeric feature:

In [None]:
sns.pairplot(data=penguins, hue='species')

<a id = 'preparing'></a>
## 2. Preparing the data for Scikit-Learn

The first thing we might notice here is that some entries have no value: the cell simply says `NaN`. This means the information is missing.

In [None]:
(
    penguins
    .loc[penguins.isnull().any(axis=1)]
)

Unfortunately, that also means the information cannot be used as is to create a machine learning model with Scikit-Learn. We must find a way to deal with the missing values. 

There are multiple strategies for dealing with missing data. For example, you could replace a missing value with the mean of its column. E.g. if the body mass is missing for a particular penguin, you could replace the NaN with the mean recorded body mass of all penguins.

Scikit-Learn even provides us with a great interface to apply such transformations. For the moment, however, we simply choose to discard all the incomplete data points with pandas `.dropna()` functionality. 

In [None]:
penguins_cleaned = penguins.dropna()
penguins_cleaned.head()
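
For comparison with `.dropna()`, the mean-replacement strategy described above could be sketched with Scikit-Learn's `SimpleImputer`. This is only an illustration on a small made-up table (`toy` and its values are hypothetical); the rest of this notebook keeps working with `penguins_cleaned`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical measurements with one gap per column (made-up values).
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, np.nan, 210.0],
    "body_mass_g": [3500.0, np.nan, 4000.0, 5000.0],
})

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(imputed)
```

When a train/test split is involved, the imputer would normally be fitted on the training set only, so that no information from the test set leaks into the preprocessing.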

Secondly, we notice that the dataset contains more information than the four penguin measurements _bill length, bill depth, flipper length_ and _body mass_.

Although we could incorporate this extra information (sex of the penguin and the island where the penguin was spotted), this requires some extra preprocessing outside of the scope of this notebook. We choose to focus on our four discussed features first.

We then use our knowledge of Pandas to create our feature matrix $X$ and target vector $y$.

In [None]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X = penguins_cleaned.loc[:, feature_columns]
y = penguins_cleaned.loc[:, 'species']

print(f'The shape of feature matrix X is: {X.shape}')
print(f'The shape of target vector y is: {y.shape}')

The feature matrix columns are also known as the predictor (or independent) variables.

The target vector is also known as the dependent variable.

A feature matrix $X$ consists of $n$ samples with $m$ features - in this case $n=333$ and $m=4$.

In [None]:
X.head()

Each row in the feature matrix $X$ corresponds to a value in the target vector $y$.

In [None]:
y.head()

In [None]:
y.unique()

Our model will then attempt to learn a relationship that can map a row in $X$ to the corresponding value in $y$.

<a id = 'train-test-split'></a>
### Splitting the dataset
An important goal of machine learning is to create a model that does not only do well on the data that it has already seen, but will also perform well under new circumstances on data that it has not seen before. We call this _generalization_. 

Imagine this: Penguin A is a gentoo (bill length of 33, bill depth of 16, flipper length of 180 and body mass of 3500 grams).

<img src="../images/02_Classification_Penguins/gentoo.jpg" width="300">

Penguin A was presented during the training of our model; that means, penguin A was one of the examples that the algorithm used to create an understanding of what a gentoo looks like and how you can distinguish it from a chinstrap or adélie. 

If we want to know how well our model does, asking the model to classify our penguin A does not give us a lot of information. 

Even if the model is correct, we cannot tell whether it has truly learned the relationship between the features and the targets (i.e. flipper length of >X is always species Y), or whether it has simply memorized the original data and recognises penguin A from the training phase.

That's why we want to separate our dataset into two parts:
* The _training_ set: this is the data (features and targets) that will guide the learning process. 
* The _test_ set: this is the data (features and targets) that we will use to _evaluate_ how well our model has learned. 

<img src="../images/02_Classification_Penguins/train-test.png" width="600">

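Scikit-Learn's `train_test_split` function allows us to split the data into a training set and a test set. By default, the test set size is 25% of the data and the data is shuffled before splitting. Further below we pass `test_size = 0.3`, which holds out 30% of the data instead.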

In [None]:
from sklearn.model_selection import train_test_split
help(train_test_split)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

print(f'The size of our feature matrix for the train set is: {X_train.shape}')
print(f'The size of our target vector for the train set is: {y_train.shape}')

print(f'\nThe size of our feature matrix for the test set is: {X_test.shape}')
print(f'The size of our target vector for the test set is: {y_test.shape}')

Let's see if our data is in fact shuffled: 

In [None]:
y_test.values
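
Because the shuffle is random, every run of `train_test_split` produces a different split. A minimal sketch of how the split could be made reproducible with `random_state`, and how `stratify` keeps the class proportions equal in both parts, is shown below. The arrays `X_toy` and `y_toy` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, two balanced classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["a"] * 10 + ["b"] * 10)

# The same random_state gives the same shuffle on every call.
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(np.array_equal(X_test_1, X_test_2))          # identical test sets
print(np.unique(y_test_1, return_counts=True))     # class counts in the test set
```

Stratifying is mostly useful when some classes are rare, since a purely random split could otherwise leave very few of them in the test set.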

<a id = 'model'></a>
## 3. Model creation and evaluation

Now we're ready to create our machine learning model! 

Scikit-Learn has a rich collection of algorithms readily available. Depending on the case you are working on, Scikit-Learn most likely has a model that will suit your purposes. 

<a id = 'steps'></a>
## Training a Scikit-Learn model

Below are the steps for training a model using the Scikit-Learn API:
1. Choosing a model class and importing that model.
2. Choosing the model hyperparameters by instantiating this class with the desired values.
3. Fitting the model to the preprocessed training data by calling the `fit()` method of the model instance.
4. Evaluating the model's performance using the available metrics.
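
Before walking through these steps on the penguins, here they are in one place as a compact sketch. It assumes the iris dataset that ships with Scikit-Learn as a stand-in, and the names `clf`, `X_tr`, `X_te`, `y_tr` and `y_te` are chosen for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: choose a model class and import it.
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the iris dataset bundled with Scikit-Learn.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0
)

# Step 2: instantiate the class with the chosen hyperparameters.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)

# Step 3: fit the model to the training data.
clf.fit(X_tr, y_tr)

# Step 4: evaluate the model on data it has not seen.
iris_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy on the held-out data: {iris_accuracy:.2f}")
```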

In [None]:
# Step 1: import the chosen algorithm 
from sklearn.tree import DecisionTreeClassifier

In [None]:
help(DecisionTreeClassifier)

<img src="../images/02_Classification_Penguins/tree.png" width="600">

In [None]:
# Step 2: instantiate the model with the chosen hyperparameters
model = DecisionTreeClassifier(max_depth=2)

In [None]:
# Step 3: train the model with the training data
model.fit(X_train, y_train)

We have now trained a model that can be used to make predictions on new data. Remember our test set? That's new, unseen data to the model that we can now create predictions on. 

In [None]:
y_pred = model.predict(X_test)
y_pred[0:10]

We can compare these predictions against our original data to see how well our model does. 

In [None]:
y_test[0:10].values

Fortunately, we don't have to do that comparison ourselves. Scikit-Learn has made many implementations of possible metrics readily available, such as accuracy. 

$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

Pretty good! 

Alternatively, you can use the `.score()` method. For classifiers such as a decision tree, this returns the mean accuracy on the given data:

In [None]:
model.score(X_test, y_test)
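
To connect the accuracy formula to the function call, the following sketch computes accuracy by hand and compares it with `accuracy_score`. The two label arrays are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for five penguins.
y_true_toy = np.array(["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"])
y_pred_toy = np.array(["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie"])

# Count the positions where prediction and truth agree, divide by the total.
n_correct = (y_true_toy == y_pred_toy).sum()
manual_accuracy = n_correct / len(y_true_toy)

print(manual_accuracy)                         # 3 correct out of 5
print(accuracy_score(y_true_toy, y_pred_toy))  # same value from Scikit-Learn
```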

<a id = 'metrics'></a>
## Alternative metrics

But accuracy is not the only metric you could be interested in. Alternatives are, for example, _precision_ and _recall_. 

<!-- * _Precision_ is the proportion of positive identifications that was actually correct. 
* _Recall_ is the proportion of actual positives that was identified correctly.
* _F1 score_ is a function of precision and recall, that you use when you seek a balance between precision and recall.  -->

#### Precision & Recall

Predictions about a class fall into four categories:
* True Positive: the item is correctly predicted to belong to that class
* True Negative: the item is correctly predicted NOT to belong to that class
* False Positive: the item is incorrectly predicted to belong to that class
* False Negative: the item is incorrectly predicted NOT to belong to that class


<img src="../images/02_Classification_Penguins/TPFN.png" width="350">

In a classification task, the **precision** for a class is the number of true positives (i.e. the number of items correctly labelled as belonging to the positive class) divided by the total number of elements labelled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labelled as belonging to the class).

<img src="../images/02_Classification_Penguins/precision.png" width="200">

**Recall** in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labelled as belonging to the positive class but should have been).

<img src="../images/02_Classification_Penguins/recall.png" width="200">

The differences between these metrics can be explained with this example:
Let's say you create a model that should classify email messages as spam or not spam. _Precision_ measures the percentage of emails flagged as spam that really are spam, while _recall_ measures the percentage of actual spam emails that were flagged as spam.

In some cases, precision is more important. For YouTube's recommendation system for example: you won't be able to show _ALL_ relevant videos, but it is important that the ones you do show _are_ relevant. 

However, in a medical context, _recall_ is often more important. After all, if we mistakenly tell a person with cancer that they're healthy, that can have more severe consequences than the other way around.
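
The spam example can be worked through numerically. The sketch below uses made-up labels (1 = spam, 0 = not spam), counts the true positives, false positives and false negatives by hand, and checks the result against Scikit-Learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels for ten emails: 1 = spam, 0 = not spam.
y_true_spam = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_spam = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred_spam == 1) & (y_true_spam == 1))  # spam flagged as spam
fp = np.sum((y_pred_spam == 1) & (y_true_spam == 0))  # normal mail flagged as spam
fn = np.sum((y_pred_spam == 0) & (y_true_spam == 1))  # spam that slipped through

precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of actual spam that was flagged

print(f"precision: {precision:.3f}, recall: {recall:.3f}")
print(precision_score(y_true_spam, y_pred_spam), recall_score(y_true_spam, y_pred_spam))
```

Here the filter flags three emails and two of them are spam, so precision is 2/3. It finds two of the four spam emails, so recall is 1/2.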

In [None]:
# Adelie precision comparison
(
    pd.DataFrame({'y_pred': y_pred, 'y_test': np.array(y_test)})
    .loc[lambda df: df['y_pred']=='Adelie']
)

In [None]:
# Adelie recall comparison
(
    pd.DataFrame({'y_test': np.array(y_test), 'y_pred': y_pred})
    .loc[lambda df: df['y_test']=='Adelie']
)

If the number of classes is not too large, we can also produce a confusion matrix to interpret how good the predictions were.

The raw **confusion matrix** can be quickly acquired as shown below: 

In [None]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

In [None]:
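# confusion_matrix sorts its rows and columns by label, so label the DataFrame with model.classes_ (also sorted)
confusion_df = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=model.classes_), columns=model.classes_, index=model.classes_)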
confusion_df

In [None]:
cm = sns.heatmap(confusion_df, annot=True)
cm.set(xlabel='predicted label', ylabel='true label', title='Confusion matrix');
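
One detail worth checking when labelling a confusion matrix: unless `labels` is passed, `confusion_matrix` orders its rows (true labels) and columns (predicted labels) by sorted label value, which need not match the order in which labels first appear in the data. A small sketch with made-up labels, passing the order explicitly so the DataFrame labels are guaranteed to line up:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels; note that "Gentoo" appears first but sorts last.
y_true_toy = ["Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo"]
y_pred_toy = ["Gentoo", "Adelie", "Adelie", "Gentoo", "Adelie", "Chinstrap"]

# Fix the label order once and use it for both the matrix and the DataFrame.
labels = sorted(set(y_true_toy) | set(y_pred_toy))
matrix = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
toy_confusion_df = pd.DataFrame(matrix, index=labels, columns=labels)

print(toy_confusion_df)
```

Reading along a row shows where the penguins of one true species ended up. In this toy case the single Chinstrap was predicted as Adelie.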

In `sklearn` the classification report can give us a breakdown of the precision and recall for each species of penguin:

In [None]:
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)

**F1 score** is a combination of both precision and recall:

${\displaystyle F_{1}={\frac {2}{\mathrm {recall} ^{-1}+\mathrm {precision} ^{-1}}}=2\cdot {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}={\frac {\mathrm {tp} }{\mathrm {tp} +{\frac {1}{2}}(\mathrm {fp} +\mathrm {fn} )}}}$

Precision, recall and F1 are also all available with Scikit-Learn.
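
As a sketch of what that could look like, the functions `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` accept `average=None` to return one value per class. The labels below are made up, and the last lines check the F1 formula against the library's result.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for eight penguins.
y_true_toy = ["Adelie", "Adelie", "Adelie", "Chinstrap",
              "Chinstrap", "Gentoo", "Gentoo", "Gentoo"]
y_pred_toy = ["Adelie", "Adelie", "Chinstrap", "Chinstrap",
              "Gentoo", "Gentoo", "Gentoo", "Gentoo"]

# average=None returns one score per class, in sorted label order.
per_class_precision = precision_score(y_true_toy, y_pred_toy, average=None)
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)
per_class_f1 = f1_score(y_true_toy, y_pred_toy, average=None)

# F1 as the harmonic mean of precision and recall, computed by hand.
manual_f1 = (2 * per_class_precision * per_class_recall
             / (per_class_precision + per_class_recall))

print(per_class_precision)
print(per_class_recall)
print(per_class_f1)
print(np.allclose(per_class_f1, manual_f1))
```

For multi-class problems these functions need an `average` argument; `"macro"` and `"weighted"` collapse the per-class values into a single number.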

<a id='inference'></a>
## Prediction and Inference

Sometimes our goal may be inference, rather than prediction.

**Prediction**: Generalizing the relationship to future observations that the model has not yet seen.

**Inference**: Finding which predictors are more associated with the response.


In [None]:
model.feature_importances_

In [None]:
inference_df = pd.DataFrame(columns  = X.columns, data = [model.feature_importances_])
inference_df
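
For inference it is often handy to see the importances ranked. A minimal sketch, again assuming the bundled iris dataset as a stand-in and a hypothetical model named `tree`, puts them in a sorted pandas Series:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data, loaded as a DataFrame so the feature names are available.
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_iris, y_iris)

# Pair each importance with its feature name and rank from high to low.
importances = (
    pd.Series(tree.feature_importances_, index=X_iris.columns)
    .sort_values(ascending=False)
)
print(importances)
```

The importances of a fitted tree sum to 1, and features that the tree never splits on get an importance of 0.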

<a id = 'vis'></a>
## Model Visualisation

One advantage of decision trees over some of the other available models is that they are relatively easy to interpret. By visualising the tree-like structure of the decision tree, we can understand why the model classifies samples the way it does.

In [None]:
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(14,10))

# class_names must follow the order of model.classes_ (sorted labels)
plot_tree(model, 
          ax=ax, 
          feature_names = feature_columns, 
          class_names = list(model.classes_));

In [None]:
model.predict_proba(X_train)
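
To help read the output of `predict_proba`: each row holds one sample, each column holds one class in the order of `classes_`, and every row sums to 1. The sketch below illustrates this on the bundled iris dataset, with a hypothetical model named `tree`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_iris, y_iris)

# One row per sample, one column per class (ordered as in tree.classes_).
proba = tree.predict_proba(X_iris[:5])
print(tree.classes_)
print(proba)

# Taking the most probable class per row reproduces the output of predict().
predicted_from_proba = tree.classes_[np.argmax(proba, axis=1)]
print(predicted_from_proba)
```

For a decision tree, these probabilities are the class fractions of the training samples in the leaf that a sample falls into, which is why a shallow tree yields only a few distinct rows.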

<a id = 'choosing-models'></a>
## <mark>Choosing a different model </mark>

What happens when we're interested in a model other than a decision tree? 

That's actually really easy. You simply replace the chosen model with another and the rest of the pipeline can stay the same.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Uncomment the model that you want to try
model = DecisionTreeClassifier()
# model = RandomForestClassifier()
# model = KNeighborsClassifier()
# model = SVC()

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(f'Model accuracy: {model.score(X_test, y_test)}')
print(report)
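
Because every classifier exposes the same `fit`, `predict` and `score` methods, several models can also be compared in a loop instead of being uncommented one at a time. A minimal sketch, assuming the bundled iris dataset as a stand-in and a hypothetical dictionary named `candidates`:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0
)

# Default hyperparameters; random_state is set where the model uses randomness.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
    "support vector machine": SVC(),
}

# The same two calls work for every model thanks to the shared API.
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = candidate.score(X_te, y_te)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A single train/test split gives only a rough comparison. Distance-based models such as k-nearest neighbours and support vector machines are also sensitive to the scale of the features, so their scores may change after scaling.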

# Summary

Scikit-Learn is an excellent, resourceful tool for machine learning in Python. We've seen how we can split a dataset with `train_test_split` into a train and test set, create and train a model, use the trained model to create predictions, and how to use the tools from `sklearn.metrics` to evaluate how good the model is. 
![](../images/02_Classification_Penguins/palmer-penguins.png) 