<img src='images/gdd-logo.png' width='300px' align='right' style="padding: 15px">

# Palmer Penguins: a machine learning example

In this notebook we are going to cover:

- [Data Science and Machine Learning](#ds)
- [Meet the Penguins](#about) 
- [Scikit-learn](#sl) 
- [Loading the data](#load) 
- [Preparing the data for sklearn](#prep) 
- [Model creation & evaluation](#mce) 
- [Model visualisation](#mv) 
- [Choosing a different model](#ms) 

-----

<a id='ds'></a>

## Data Science and Machine Learning

<font color='darkblue'>*A computer program is said to learn from experience **E** with respect to some set of tasks **T** and performance measure **P** if its performance at tasks in **T**, as measured by **P**, improves with experience **E**.*</font>

Tom M. Mitchell (prominent Machine Learning researcher)

- **T** = determine the species of a penguin
- **E** = previously collected data from penguins *(flipper length, body mass, bill length and bill depth)*
- **P** = a performance metric (e.g. accuracy) 


<a id='about'></a>

## Meet the Palmer Penguins!


The data was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. The dataset is well suited to data exploration, visualisation and, in this case, a demonstration of the scikit-learn API.

![](https://raw.githubusercontent.com/STATS250SBI/palmerpenguins/master/man/figures/lter_penguins.png)

<img src='images/bill-dimensions.png' width='300px' align='right' style="padding: 15px">

### Classification

We're going to use the following features to predict the species of a penguin:

- Bill length (mm) 
- Bill depth (mm)
- Flipper length (mm) 
- Body mass (grams) 



<a id='sl'></a>

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png' width='300px' align='right' style="padding: 15px">



## Scikit-Learn
Scikit-learn is *the* library for machine learning in Python, often considered the Swiss Army knife of machine learning.


#### Why scikit-learn?
- Many available machine learning models
- Models are implemented by an expert team and checked by a large community
- Consistent API for a wide variety of algorithms
- Covers most machine-learning tasks
- Commitment to documentation, consistency and usability
- Designed to work with other key Python libraries (NumPy, Pandas etc)

<a id='load'></a>

## Loading our data

Your data can originate from many places. Maybe you want to load it from an Excel file stored locally on your system, or maybe you have a .csv file stored online somewhere. Scikit-learn also comes with various standard datasets for practice, which can be loaded as long as scikit-learn is installed on your system.

However, the dataset we will be using today (the Palmer penguins dataset) does not come from scikit-learn, but from a visualisation package called `seaborn`. A dataset loaded from seaborn will be a Pandas dataframe and can be used as such. Pandas is a powerful library for data wrangling.
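For comparison, here is a minimal sketch of loading one of the standard datasets that ship with scikit-learn. It uses the iris dataset, which is not part of this notebook's analysis, and the variable names `iris` and `iris_df` are chosen only for illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the feature columns plus a 'target' column
iris_df = iris.frame

print(iris_df.shape)
print(iris_df.head())
```

The result is a regular pandas dataframe, so it can be explored in the same way as the penguins data below.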

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
penguins = sns.load_dataset('penguins')
penguins.head(10)

In [None]:
penguins['island'].unique()

<a id='prep'></a>

## Preparing the data for scikit-learn

Scikit-learn estimators assume that all values in an array are numerical and meaningful.

**Missing values:**
Some entries have no value, which scikit-learn estimators generally cannot handle, so we must find a way to deal with the missing values.

**Unwanted fields:**
Some features (the sex of the penguin and the island where it was spotted) are not numerical. Since we are not interested in these, we will drop them.
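Both clean-up steps can be tried out on a small made-up dataframe first. The sketch below is only an illustration: the `toy` frame and its values are invented and are not the real penguins data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration, not the real penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Dream'],
    'bill_length_mm': [39.1, np.nan, 46.5, 38.2],
    'body_mass_g': [3750.0, 5000.0, 3500.0, np.nan],
})

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()

# Drop a non-numerical column that we do not want to use as a feature
toy_clean = toy_clean.drop(columns=['island'])

print(toy_clean)
```

Only the two complete rows survive, and the `island` column is gone. Neither call modifies the dataframe in place, which is why the result is assigned back to a variable.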

### <mark>Activity</mark>

Drop the missing values and the unwanted columns. Overwrite the penguins dataset.

***Hint:***

|Syntax|
|:---|
|`df.dropna()`|
|`df.drop(columns, axis)`|

***Answers***

In [None]:
%load answers/prep-sklean.py

### Creating Feature Matrix and Target Vector

We use our knowledge of Pandas to create:

|**Feature matrix X**|**Target vector y**|
|:---:|:---:|
|Describes the data. |Describes the output. |
|The attributes to base predictions on. |The label that you want to predict. |
|Shape N x M (number of samples x number of features)|Shape N (one output for every data point)|
|Example: for each penguin, we have bill length, bill depth, body mass and flipper length. The shape of the feature matrix would be N (number of penguins) x 4.|Example: if we have recorded the features of 200 penguins, our y would contain the corresponding species of those 200 penguins (3 different options). |

In [None]:
X = penguins.drop('species', axis=1)
y = penguins['species']

print(f'The shape of feature matrix X is: {X.shape}')
print(f'The shape of target vector y is: {y.shape}')

In [None]:
X.head()

In [None]:
y.head()

### Splitting the dataset
An important goal of machine learning is to create a model that does well not only on the data it has already seen, but also on new data that it has not seen before. We call this _generalization_.

Imagine this: Penguin A is a gentoo (bill length of 33 mm, bill depth of 16 mm, flipper length of 180 mm and body mass of 3500 grams). Penguin A was presented during the training of our model; that is, penguin A was one of the examples that the algorithm used to learn what a gentoo looks like and how to distinguish it from a chinstrap or adélie.

If we want to know how well our model does, asking it to classify penguin A does not tell us much. Even if the model is correct, we cannot tell whether it has truly learned the relationship between the features and the targets (e.g. a flipper length above X always means species Y), or whether it has simply memorized the training data and recognises penguin A from the training phase.

That's why we want to separate our dataset into two parts:
* The _training_ set: this is the data (features and targets) that will guide the learning process. 
* The _test_ set: this is the data (features and targets) that we will use to _evaluate_ how well our model has learned. 

Scikit-learn's `train_test_split` function splits the data into a training set and a test set. By default, the test set contains 25% of the data and the data is shuffled before splitting.
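The defaults described above can be checked on a small made-up array. This is a sketch only: `X_demo`, `y_demo` and the other names are invented for illustration, and `random_state=0` is added just to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features; the labels 0..19 let us follow the row order
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Defaults: test_size=0.25 and shuffle=True
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(y_tr), len(y_te))

# With shuffle=False the test set is simply the last 25% of the rows
_, _, _, y_te_ordered = train_test_split(X_demo, y_demo, shuffle=False)
print(y_te_ordered)
```

With 20 samples, the default split gives 15 training rows and 5 test rows. Turning shuffling off keeps the original order, which is rarely what you want if the data is sorted by class.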

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

print(f'The size of our feature matrix for the train set is: {X_train.shape}')
print(f'The size of our target vector for the train set is: {y_train.shape}')

print(f'\nThe size of our feature matrix for the test set is: {X_test.shape}')
print(f'The size of our target vector for the test set is: {y_test.shape}')

Let's see if our data is in fact shuffled: 

In [None]:
y_test.values

<a id='mce'></a>

## Model creation and evaluation

Now we're ready to create our machine learning model! 

Scikit-learn has a rich collection of algorithms readily available. Depending on the case you are working on, scikit-learn most likely has a model that will suit your purposes. 

#### Scikit-Learn API usage steps when training a model
1. Choosing a model class and importing that model
2. Choosing the model hyperparameters by instantiating this class with the desired values
3. Fitting the model to the preprocessed training data by calling the `fit()` method of the model instance
4. Evaluating the model's performance using the available metrics

In [None]:
# Step 1: import the chosen algorithm 
from sklearn.tree import DecisionTreeClassifier

In [None]:
help(DecisionTreeClassifier)

In [None]:
# Step 2: instantiate the model with the chosen hyperparameters
model = DecisionTreeClassifier()

In [None]:
# Step 3: train the model with the training data
model.fit(X_train, y_train)

We have now trained a model that can be used to make predictions on new data. Remember our test set? That's new, unseen data to the model that we can now create predictions on. 

In [None]:
y_pred = model.predict(X_test)
y_pred[0:10]

We can compare these predictions against our original data to see how well our model does. 

In [None]:
y_test[0:10].values

Fortunately, we don't have to do that comparison ourselves. Scikit-learn has made many implementations of possible metrics readily available, such as accuracy. 

$\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}$
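As a minimal sketch, the formula can be worked out by hand on a few made-up labels and compared with scikit-learn's `accuracy_score`. The two label arrays below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, invented for illustration
y_true_toy = np.array(['Adelie', 'Gentoo', 'Chinstrap', 'Adelie', 'Gentoo'])
y_pred_toy = np.array(['Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo'])

# Count the positions where the prediction matches the true label
correct = (y_true_toy == y_pred_toy).sum()
total = len(y_true_toy)

manual_accuracy = correct / total
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, sklearn_accuracy)
```

Four of the five predictions match, so both values come out at 0.8.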

In [None]:
from sklearn.metrics import accuracy_score
# scikit-learn's convention is true labels first, then predictions
accuracy_score(y_test, y_pred)

Pretty good! 

But accuracy is not the only metric you could be interested in. Alternatives are, for example, _precision_ and _recall_. 

* _Precision_ is the proportion of positive identifications that were actually correct.
* _Recall_ is the proportion of actual positives that were identified correctly.
* _F1 score_ is a function of precision and recall that you use when you seek a balance between the two.
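To make these definitions concrete, here is a small sketch that computes all three by hand on made-up binary labels and compares them with scikit-learn's functions. The labels are invented, and F1 is computed as the harmonic mean of precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels for illustration: 1 = positive, 0 = negative
y_true_bin = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_bin = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_bin, y_pred_bin))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true_bin, y_pred_bin))
print(recall, recall_score(y_true_bin, y_pred_bin))
print(f1, f1_score(y_true_bin, y_pred_bin))
```

Here two of the three positive predictions are correct (precision of 2/3), while only two of the four actual positives are found (recall of 1/2).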

In some cases, precision is more important. For YouTube's recommendation system for example: you won't be able to show _ALL_ relevant videos, but it is important that the ones you do show _are_ relevant. 

However, in a medical context, _recall_ is often more important. After all, if we mistakenly tell a person with cancer that they're healthy, that can have more severe consequences than the other way around.

Precision, recall and F1 are all available with scikit-learn.

In [None]:
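from sklearn.metrics import classification_report

# Argument order matters here: true labels first, then predictions.
# Swapping them would swap precision and recall in the report.
report = classification_report(y_test, y_pred)
print(report)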

<a id='mv'></a>

## Model Visualisation

One of the advantages of decision trees over some of the other available models is that they are relatively easy to interpret. By visualising the tree-like structure of the decision tree, we can understand why the model classifies samples the way it does.

In [None]:
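from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(14, 10))

# class_names must follow the order of model.classes_;
# plain lists avoid type errors in some scikit-learn versions
plot_tree(model,
          ax=ax,
          feature_names=list(X.columns),
          class_names=list(model.classes_));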
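If a plot is hard to read, the same structure can also be printed as text. The sketch below uses `export_text` from `sklearn.tree`, a different function from the `plot_tree` call above, on scikit-learn's built-in iris dataset so that it runs on its own. The names `iris_data` and `tree_demo`, and the choice of `max_depth=2`, are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()

# A shallow tree keeps the printed rules short
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(iris_data.data, iris_data.target)

# One line per split or leaf, indented by depth
rules = export_text(tree_demo, feature_names=list(iris_data.feature_names))
print(rules)
```

Each indented line is a threshold on one feature, and each leaf ends in the predicted class.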

<a id='ms'></a>

## Choosing a different model 

What happens when we're interested in a model other than the decision tree? 

That's actually really easy. You simply replace the chosen model with another and the rest of the pipeline can stay the same.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Uncomment the model that you want to try
# model = DecisionTreeClassifier()
# model = RandomForestClassifier()
# model = KNeighborsClassifier()

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(f'Model accuracy: {model.score(X_test, y_test)}')
print(report)
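Because every estimator shares the same `fit`, `predict` and `score` methods, several models can also be compared in one loop. The sketch below assumes scikit-learn's built-in iris dataset rather than the penguins data so that it runs on its own. The `candidates` dictionary and the other variable names are made up for illustration, and `random_state=0` is only there to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0
)

# Hypothetical selection of models to compare
candidates = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'k-nearest neighbours': KNeighborsClassifier(),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr_i, y_tr_i)                     # same call for every model
    scores[name] = candidate.score(X_te_i, y_te_i)    # accuracy on the test set
    print(f'{name}: {scores[name]:.3f}')
```

Nothing in the loop body depends on which model is being trained, which is exactly what the consistent API buys us.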


# Summary

Scikit-learn is an excellent, resourceful tool for machine learning in Python. 

We've seen how we can split a dataset with `train_test_split` into a train and test set, create and train a model, use the trained model to create predictions, and how to use the tools from `sklearn.metrics` to evaluate how good the model is. 

**Want to learn more? Join us on a public course:**
- [Python for Data Analysts](https://godatadriven.com/training/python-for-data-analysts-training/)
- [Certified Python for Data Science](https://godatadriven.com/training/data-science-python-foundation-training/)
- [And more!](https://godatadriven.com/what-we-do/train/#upcoming)

Interested in our other courses? Download our [Training Guide](https://godatadriven.com/topic/training-brochure/)

---
<img src='images/download.png' width='80px' align='left'>

**If you would like to <mark>save this notebook</mark> there is a Download button at the top of the page. This will download the `.ipynb`**

If you are not planning to get Anaconda but you want to save the work you've done, go to `File -> Download as` and choose `.html`.

<img src='images/visit-repo.png' width='60px' align='left'>

Alternatively, you can click Visit repo at the top to navigate to the GitHub repo, where you can download everything as a `.zip` file.