In [1]:
# Optional: change Jupyter Notebook theme to GDD theme
from IPython.display import HTML
HTML(url='https://gdd.li/jupyter-theme')

![footer_logo](https://marysia.nl/assets/GDD/css/logo.png)

# Unpacking the "Black Box"


In this notebook, we take our first steps towards understanding our models, using an example dataset on penguin species classification. We will explore two different models -- one easily interpretable, and one slightly more difficult to understand -- and investigate how we can *flip the prediction*: that is, change the prediction outcome of our model by altering the feature values.

### Outline
1. [The data](#loading-in-the-data)
1. [Create the model](#dectree)
1. [Exercise 1: Flip the Prediction](#ex1) 
1. [Create SVM model: Flip the Prediction](#svm)
1. [Exercise 2](#ex2) 


![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/logo.png)


<a id = 'loading-in-the-data'></a>

# 1. The Data
The data was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. Their goal was to provide a great dataset for data exploration, visualisation and - in this case - a demonstration of the scikit-learn API.

The data set contains measurements for different species of penguins living at the Palmer station:

|Field|Description|
|:---|:---|
|species|The species of the penguin: Adelie, Chinstrap or Gentoo|
|island|The island on which the penguin was spotted|
|bill_length_mm|The length of the penguin's bill in mm|
|bill_depth_mm|The depth of the penguin's bill in mm|
|flipper_length_mm|The length of the penguin's flipper in mm|
|body_mass_g|The weight of the penguin in grams|
|sex|The sex of the penguin: Female or Male|

<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png" width="600">

### Explore the Data

First of all, we load in the data using the Pandas data wrangling library. We make one small alteration to the data -- we drop all datapoints for the Adélie penguin species. This choice was made to better illustrate the various interpretability techniques we will discuss throughout this workshop, but you are free to change the code in order to keep the Adélie penguins.
![footer_logo](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png) 

We then explore the dataset. 

In [None]:
import pandas as pd 

penguins = (
    pd.read_csv('data/penguins.csv')
    .loc[lambda d: d['species'] != 'Adelie']
    .dropna()
)
penguins.head()

In [None]:
penguins.shape

In [None]:
penguins['species'].value_counts()

In [None]:
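# numeric_only=True skips text columns such as island and sex,
# which recent pandas versions refuse to average.
penguins.groupby('species').mean(numeric_only=True)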

In [None]:
import seaborn as sns

sns.scatterplot(data=penguins, 
                x='flipper_length_mm',
                y='body_mass_g',
                hue='species');

Based on this visualisation of only the flipper length and the body mass, the problem of separating Chinstrap from Gentoo penguins does not seem very challenging. Can you think of a general rule to separate Gentoos from Chinstraps based on this image?

<a id = 'dectree'></a>

# 2. Create the Model

Nevertheless, we will create a model! In order to do that, we must select the columns ("features") we would like to use for the model and the column ("target") that we would like to predict. To simplify our problem, we will focus on flipper length and body mass as features.

In [None]:
# Define features & target
feature_columns = ['flipper_length_mm', 'body_mass_g']
target = 'species'

# Set X and y
X = penguins.loc[:, feature_columns]
y = penguins.loc[:, target]

print(f'The shape of feature matrix X is: {X.shape}')
print(f'The shape of target vector y is: {y.shape}')

An important goal of machine learning is to create a model that not only does well on the data it has already seen, but also performs well under new circumstances, on data that it has not seen before. We call this _generalization_.

That's why we want to separate our dataset into two parts:
* The _training_ set: this is the data (features and targets) that will guide the learning process. 
* The _test_ set: this is the data (features and targets) that we will use to _evaluate_ how well our model has learned. 

<img src="images/train-test.png" width="600">

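Scikit-learn's `train_test_split` function allows us to split the data into a training set and a test set. By default, the data is shuffled and 25% of it is held out as the test set; in the cell below we set `test_size=0.3` to hold out 30% instead, and fix `random_state` so the split is reproducible.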

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

print(f'The size of our feature matrix for the train set is: {X_train.shape}')
print(f'The size of our target vector for the train set is: {y_train.shape}')

print(f'\nThe size of our feature matrix for the test set is: {X_test.shape}')
print(f'The size of our target vector for the test set is: {y_test.shape}')
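
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a made-up toy dataset. The `toy_X` and `toy_y` names and all values in them are assumptions for illustration only; they are not the penguin data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the penguin features (40 rows, values made up for illustration).
toy_X = pd.DataFrame({
    'flipper_length_mm': np.arange(180, 220),
    'body_mass_g': np.arange(3000, 7000, 100),
})
toy_y = pd.Series(['Chinstrap'] * 20 + ['Gentoo'] * 20, name='species')

# Default split: a quarter of the rows are held out for testing.
_, toy_test_default, _, _ = train_test_split(toy_X, toy_y, random_state=42)

# Explicit split: 30% of the rows are held out for testing.
_, toy_test_30, _, _ = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

# Same random_state, same shuffle: the split is reproducible.
_, toy_test_30_again, _, _ = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

print(f'Default test set size: {len(toy_test_default)} of {len(toy_X)} rows')
print(f'test_size=0.3 test set size: {len(toy_test_30)} of {len(toy_X)} rows')
print(f'Reproducible split: {toy_test_30.index.equals(toy_test_30_again.index)}')
```

Fixing `random_state` is what lets everyone in the workshop see the same misclassified penguins later on.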

Now we're ready to create our machine learning model! 

Scikit-learn has a rich collection of algorithms readily available. Depending on the case you are working on, scikit-learn most likely has a model that will suit your purposes. 

We will choose a _Decision Tree_ -- a simple algorithm known for its native interpretability. 

In [None]:
# Import the chosen algorithm.
from sklearn.tree import DecisionTreeClassifier

# Instantiate the model with the chosen hyperparameters.
model = DecisionTreeClassifier(max_depth=2)

# Train the model with the *train* set. 
model.fit(X_train, y_train)

# Evaluate the accuracy with the *test* set. 
model.score(X_test, y_test)

A >98\% accuracy! Not bad! 

Let's see if we can understand what the model has learned in order to make predictions. A big advantage of decision trees is that they are just that - trees, which makes them easy to visualise. Luckily, sklearn has some neat functionality built-in to visualise the tree that was created.

In [None]:
import matplotlib.pyplot as plt 
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(25,20))   # Set the size of the output
plot_tree(model,   # pass the model -- this is the tree.
          feature_names=list(X_train.columns),   # give the feature names (as a plain list)
          class_names=list(model.classes_),   # give the class names (as a plain list)
          filled=True);   # different colors for different classes
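
If the plot is hard to read, the same tree can also be printed as plain text with scikit-learn's `export_text`. The sketch below uses a small made-up dataset (`toy_X`, `toy_y` and `toy_tree` are hypothetical names, and the values are not real penguin measurements), so the thresholds it prints will differ from those of the model above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (not the real penguin measurements).
toy_X = pd.DataFrame({
    'flipper_length_mm': [185, 190, 195, 200, 210, 215, 220, 230],
    'body_mass_g': [3400, 3600, 3700, 3900, 4800, 5000, 5400, 5800],
})
toy_y = ['Chinstrap'] * 4 + ['Gentoo'] * 4

# Same hyperparameter as in the notebook; random_state only fixes tie-breaking.
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# Each line of the output is one split ("feature <= threshold") or one leaf ("class: ...").
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

Applied to the notebook's own tree, this would be `export_text(model, feature_names=list(X_train.columns))`.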

Our model seems pretty good, with >98\% accuracy. However, let's investigate what mistakes were made on the test set -- that is, the examples from the test set where the *prediction* did not match the intended *target*.

In [None]:
(
    X_test
    .assign(target = y_test)
    .assign(prediction = model.predict(X_test))
    .assign(correct = lambda d: d['target'] == d['prediction'])
    .loc[lambda d: ~d['correct']]
)

<a id = 'ex1'></a>
# 3. Exercise
<div class="exercise" markdown="1">

### Exercise 
#### Flip the Prediction


The Chinstrap with a flipper length of 210 mm and a body mass of 4100 grams was mistaken by the model for a Gentoo penguin. But why? Change the values for flipper_length_mm and/or body_mass_g in the code cell below to change the prediction (from Gentoo to Chinstrap). Use the decision tree as a guide! 

Write down what change in values caused a change in prediction.

</div>

In [None]:
flipper_length_mm = 210.0   # change this value
body_mass_g = 4100.0   # or this value! .. or both. 

example_datapoint = pd.DataFrame.from_dict({
    'flipper_length_mm': [flipper_length_mm], 
    'body_mass_g': [body_mass_g]
})

model.predict(example_datapoint)[0]

<a id = 'svm'></a>
# 4. Create SVM model

Decision trees are naturally easy to interpret, because they can be visualised. However, decision trees are not the only machine learning models out there. Let us try out a different kind of model on the same dataset and problem, and investigate what changes...

In this case, we pick a *support vector classifier*.

In [None]:
from sklearn.svm import SVC

model_svm = SVC()
model_svm.fit(X_train, y_train)
model_svm.score(X_test, y_test)

Our model performs slightly worse than the decision tree, though still pretty well overall. Let us revisit the earlier example, which the decision tree got wrong.

In [None]:
flipper_length_mm = 210.0
body_mass_g = 4100.0

example_datapoint = pd.DataFrame.from_dict({
    'flipper_length_mm': [flipper_length_mm], 
    'body_mass_g': [body_mass_g]
})

tree_pred = model.predict(example_datapoint)
svm_pred = model_svm.predict(example_datapoint)

print(f'Prediction decision tree: {tree_pred[0]}')
print(f'Prediction SVM: {svm_pred[0]}')

The SVM correctly predicts that the penguin with a 210 mm flipper length and a 4100 g body mass is a Chinstrap! However, this SVM model isn't perfect -- it makes a few other mistakes:

In [None]:
(
    X_test
    .assign(target = y_test)
    .assign(prediction = model_svm.predict(X_test))
    .assign(correct = lambda d: d['target'] == d['prediction'])
    .loc[lambda d: ~d['correct']]
)
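
Unlike the decision tree, an SVM has no tree to draw. One thing we can inspect is its `decision_function`, which for a two-class problem returns a signed score per datapoint: positive scores go to the second entry of `classes_`, negative scores to the first, and scores near zero lie close to the decision boundary. The sketch below is an illustration on randomly generated toy data (`toy_X`, `toy_y` and `toy_svm` are assumed names, and the cluster centres are made up), so its numbers say nothing about the real penguins.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical toy data: two made-up clusters standing in for the two species.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'flipper_length_mm': np.concatenate([rng.normal(195, 5, 30), rng.normal(217, 5, 30)]),
    'body_mass_g': np.concatenate([rng.normal(3700, 300, 30), rng.normal(5100, 400, 30)]),
})
toy_y = np.array(['Chinstrap'] * 30 + ['Gentoo'] * 30)

toy_svm = SVC().fit(toy_X, toy_y)

point = pd.DataFrame({'flipper_length_mm': [210.0], 'body_mass_g': [4100.0]})

# Signed score: > 0 means classes_[1], < 0 means classes_[0].
score = toy_svm.decision_function(point)[0]
label = toy_svm.predict(point)[0]

print(f'Classes: {list(toy_svm.classes_)}')
print(f'Decision score: {score:.3f} -> predicted {label}')
```

On the notebook's model, `model_svm.decision_function(example_datapoint)` gives the same kind of score, which shows how close a prediction is to flipping.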

<a id = 'ex2'></a>
# 5. Exercise 

<div class="exercise" markdown="1">

### Exercise 
#### Flip the Prediction

Let's do the same thing as we did with the previous model: change the prediction. Take the Chinstrap with a flipper length of 195 millimetres and a body mass of 4400 grams. The SVM model incorrectly classifies this as a Gentoo. Change the values in the cell below to flip the prediction to a Chinstrap!

Write down what change in values caused a change in prediction.

</div>

In [None]:
flipper_length_mm = 195
body_mass_g = 4400

example_datapoint = pd.DataFrame.from_dict({
    'flipper_length_mm': [flipper_length_mm], 
    'body_mass_g': [body_mass_g]
})

svm_pred = model_svm.predict(example_datapoint)

print(f'Prediction SVM: {svm_pred[0]}')
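
Trying values by hand works, but the search can also be automated. The sketch below is one possible approach, not part of the original exercise: it varies a single feature over a grid and reports the value closest to the starting point at which the prediction changes. The helper `nearest_flip`, the toy data and the grid bounds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical toy data: two made-up clusters standing in for the two species.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'flipper_length_mm': np.concatenate([rng.normal(195, 5, 30), rng.normal(217, 5, 30)]),
    'body_mass_g': np.concatenate([rng.normal(3700, 300, 30), rng.normal(5100, 400, 30)]),
})
toy_y = np.array(['Chinstrap'] * 30 + ['Gentoo'] * 30)
toy_svm = SVC().fit(toy_X, toy_y)


def nearest_flip(model, start, feature, candidates):
    """Return the candidate value of `feature` closest to the starting value
    for which the prediction differs from the starting prediction, or None."""
    base = model.predict(pd.DataFrame([start]))[0]
    # One row per candidate, identical to `start` except for `feature`.
    trials = pd.DataFrame([start] * len(candidates))
    trials[feature] = candidates
    flipped = candidates[model.predict(trials) != base]
    if len(flipped) == 0:
        return None
    return flipped[np.argmin(np.abs(flipped - start[feature]))]


start = {'flipper_length_mm': 195.0, 'body_mass_g': 4400.0}
candidates = np.arange(2500.0, 6500.0, 50.0)   # assumed search range, in grams
flip_at = nearest_flip(toy_svm, start, 'body_mass_g', candidates)

print(f'Nearest body mass that flips the toy prediction: {flip_at}')
```

The same helper works for any fitted classifier with a `predict` method, so it could be pointed at `model` or `model_svm` with the notebook's own starting values.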

----------------------------- 