In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Set figure size to (12,6)
plt.rcParams['figure.figsize'] = (12,6)

- Define Business Goal
- Get Data
- Train-Test-split
- Exploratory Data Analysis
- Feature Engineering
- Fit a model
- Model optimization - Hyperparameter Optimization / Model selection - Cross Validation
- Test Data - Provides an estimate of the model performance for data the model was not trained on
- Deploy the model (or not)

# Cross Validation

## 1) Define Business Goal

Build a model that can accurately classify the species of a penguin given its Flipper Length (mm) and Sex.

**Note: the accuracy threshold below was chosen arbitrarily by Stefan.**<br>
The model will be helpful if it is able to predict 70% of the observations correctly. (Accuracy: 0.7)


## 2) Get the data

In [2]:
df = pd.read_csv('penguins.csv', sep=',').dropna()
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


## 3) Split the Data

Why are we doing this again?

- We want to make sure the model generalizes well to unseen data
- We want to prevent the model from overfitting on some random patterns in the training data
- Therefore we separate part of the data (the test data) and keep it locked up until we are done with our modelling process
- Calculating the evaluation metrics on the test data gives us an **estimate of how well the model does on unseen data, i.e. how well it generalizes**

Based on the outcome of our model on the test data we decide whether we will go forward and actually use the model in practice (deploy the model).

![overfitting](under_vs_overfitting.png)

In [4]:
# Assign X and y
X = df[['flipper_length_mm', 'sex']]
y = df['species']

In [5]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Inspect the shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((249, 2), (84, 2), (249,), (84,))
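
The shapes above are consistent with the default behaviour of `train_test_split`, which holds out 25% of the rows as test data when neither `test_size` nor `train_size` is given. The following is a minimal sketch on made-up data: the names `toy_X` and `toy_y` and the random values are illustrative assumptions, not the penguin data. It shows the default split and the optional `stratify` argument, which keeps the class proportions similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 333 rows (random values, NOT the penguin data)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({'flipper_length_mm': rng.normal(200, 14, size=333)})
toy_y = pd.Series(rng.choice(['Adelie', 'Chinstrap', 'Gentoo'], size=333),
                  name='species')

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=42)
print(X_tr.shape, X_te.shape)

# Stratified split: class proportions are kept similar in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)
train_props = y_tr_s.value_counts(normalize=True)
test_props = y_te_s.value_counts(normalize=True)
print(train_props.round(2))
print(test_props.round(2))
```

Stratifying is worth considering when some classes are rare, because a purely random split can leave very few examples of a class in the test set.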

## 4) Exploratory Data Analysis

Remember to do the exploratory data analysis on the training data only.
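
One possible way to explore the training data only is to join the training features and labels into a single DataFrame and summarise from there. This is a sketch on a tiny made-up training set: `toy_X_train`, `toy_y_train` and their values are hypothetical stand-ins for `X_train` and `y_train`.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and y_train (made-up values)
toy_X_train = pd.DataFrame({
    'flipper_length_mm': [181.0, 186.0, 195.0, 210.0, 215.0, 220.0],
    'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
})
toy_y_train = pd.Series(
    ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    name='species',
)

# Combine features and target; the test data is never touched here
train_df = toy_X_train.assign(species=toy_y_train)

# Flipper length per species
summary = train_df.groupby('species')['flipper_length_mm'].agg(
    ['count', 'mean', 'min', 'max']
)
print(summary)

# How the sexes are distributed across species
sex_counts = pd.crosstab(train_df['species'], train_df['sex'])
print(sex_counts)
```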

## 5) Feature Engineering

A `Pipeline` takes a list of `(name, transformer)` tuples.

A `ColumnTransformer` takes a list of `(name, transformer, columns)` tuples.
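
To make the two tuple shapes concrete, here is a minimal sketch that nests a `ColumnTransformer` inside a `Pipeline`. The toy DataFrame and the step names `'preprocess'`, `'model'` and `'one_hot'` are assumptions made for illustration; this notebook itself applies the transformer and the model as separate steps.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (made-up values, not the penguin data)
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, 186.0, 195.0, 210.0, 215.0, 220.0],
    'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
})
toy_X = toy[['flipper_length_mm', 'sex']]
toy_y = toy['species']

# ColumnTransformer: (name, transformer, columns)
preprocessor = ColumnTransformer([
    ('pass', 'passthrough', ['flipper_length_mm']),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

# Pipeline: (name, transformer or final estimator)
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression(max_iter=1000)),
])

# fit() fits the preprocessing and the model in one call
pipe.fit(toy_X, toy_y)
preds = pipe.predict(toy_X)
print(preds)
```

Bundling both steps means the same fitted preprocessing is applied automatically at prediction time, which reduces the risk of accidentally fitting the transformer on test data.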

In [13]:
column_transformer = ColumnTransformer([
    # Each entry is (name, transformer, columns); the transformer is an
    # estimator such as OneHotEncoder or the string 'passthrough'
    ('pass', 'passthrough', ['flipper_length_mm']),  # 'passthrough' keeps the column as it is
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'), ['sex'])  # one-hot encoding, not label encoding
])

In [15]:
X_train_fe = pd.DataFrame(column_transformer.fit_transform(X_train), 
                          columns=['flipper_length_mm', 'female', 'male'])
X_train_fe.head()

Unnamed: 0,flipper_length_mm,female,male
0,187.0,0.0,1.0
1,230.0,0.0,1.0
2,190.0,0.0,1.0
3,199.0,1.0,0.0
4,208.0,1.0,0.0


Why should we use ``drop='first'``?
- It removes redundant information, which can speed up the algorithm.
- Without ``drop='first'``, the resulting columns are linearly dependent: ``female`` + ``male`` is always 1, which duplicates the intercept. This is a problem if we want to solve for the model parameters analytically (in closed form).
- Linearly dependent columns make feature coefficients hard to interpret. If you really want to understand the effect of a feature on y, you should drop one column.

Why should we use ``handle_unknown='ignore'``?
- It helps when the test data contains a category for some feature that did not exist in the training data. With the default setting, this results in an error. With ``handle_unknown='ignore'``, the encoder handles it by setting the value for all known categories to 0.
- If the model is fit algorithmically (iteratively) rather than analytically, the cost of including a redundant column is not very high, so keeping both columns, as the encoder above does, is acceptable.
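
Both options can be tried on a tiny example. This is a sketch on made-up data: the DataFrames `train` and `new` and the category `'unknown'` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'unknown' appears only in the new data
train = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})
new = pd.DataFrame({'sex': ['female', 'unknown']})

# drop='first' removes the first category (alphabetically 'female'),
# leaving a single indicator column for 'male'
enc_drop = OneHotEncoder(drop='first')
dropped = enc_drop.fit_transform(train).toarray()
print(dropped.ravel())  # [1. 0. 0. 1.]

# handle_unknown='ignore' encodes an unseen category as all zeros
enc_ignore = OneHotEncoder(handle_unknown='ignore')
enc_ignore.fit(train)
encoded_new = enc_ignore.transform(new).toarray()
print(encoded_new)  # [[1. 0.], [0. 0.]]

# The default (handle_unknown='error') raises on an unseen category
enc_strict = OneHotEncoder()
enc_strict.fit(train)
try:
    enc_strict.transform(new)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```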

## 6) Train the model(s)

In [16]:
m = LogisticRegression(random_state=44, max_iter=1000)

In [17]:
m.fit(X_train_fe, y_train)

LogisticRegression(max_iter=1000, random_state=44)

In [18]:
print(f'The training accuracy of the model is {round(m.score(X_train_fe, y_train), 2)}')

The training accuracy of the model is 0.82


## 7) Cross Validation

Does your model overfit?

In [19]:
from sklearn.model_selection import cross_val_score, validation_curve

In [29]:
X_train_fe.head()

Unnamed: 0,flipper_length_mm,female,male
0,187.0,0.0,1.0
1,230.0,0.0,1.0
2,190.0,0.0,1.0
3,199.0,1.0,0.0
4,208.0,1.0,0.0


In [30]:
y_train.head()

324    Chinstrap
267       Gentoo
36        Adelie
311    Chinstrap
192       Gentoo
Name: species, dtype: object

In [20]:
cross_val_accuracy = cross_val_score(
    estimator=m,
    X=X_train_fe,
    y=y_train,
    cv=10,
    scoring='accuracy',
)

In [21]:
print(f'The training accuracy of the model is {round(m.score(X_train_fe, y_train), 2)}')

The training accuracy of the model is 0.82


In [22]:
print(f'The average cross validation accuracy of the model is {round(cross_val_accuracy.mean(), 2)}')

The average cross validation accuracy of the model is 0.8


In [25]:
cross_val_accuracy

array([0.88 , 0.76 , 0.88 , 0.72 , 0.76 , 0.68 , 0.8  , 0.84 , 0.84 ,
       0.875])

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
$$
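
The formula above is written in binary terms (TP, TN, FP, FN). For a problem with three species, accuracy is simply the share of correct predictions. A small worked sketch follows; the label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels (4 of 5 predictions are correct)
y_true = np.array(['Adelie', 'Adelie', 'Gentoo', 'Chinstrap', 'Gentoo'])
y_pred = np.array(['Adelie', 'Chinstrap', 'Gentoo', 'Chinstrap', 'Gentoo'])

# Accuracy by hand: fraction of positions where prediction equals truth
manual_accuracy = (y_true == y_pred).mean()

# The same value via scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)  # 0.8 0.8
```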

Training score and cross validation score are very close to each other. This suggests that the model is not overfitting.

It would be an indication of overfitting if the training score were much higher than the cross validation score.
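
One way to make this comparison per fold is `cross_validate` with `return_train_score=True`, which reports the training and the validation score for every split. This is a sketch under assumptions: the data comes from `make_classification` rather than the penguin data, and the names `X_toy`, `y_toy` and `gap` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical three-class data set (synthetic, NOT the penguin data)
X_toy, y_toy = make_classification(
    n_samples=250, n_features=3, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    cv=10,
    scoring='accuracy',
    return_train_score=True,  # also score each fold's training part
)

train_mean = cv_results['train_score'].mean()
val_mean = cv_results['test_score'].mean()  # 'test_score' = validation folds
gap = train_mean - val_mean

print(f'mean train accuracy:      {train_mean:.2f}')
print(f'mean validation accuracy: {val_mean:.2f}')
print(f'gap:                      {gap:.2f}')
```

A large positive gap would point towards overfitting; a gap close to zero, as in the scores reported above, would not.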

## 8) Calculate Test Score

In [26]:
X_test_fe = column_transformer.transform(X_test)
# On the test data we only call .transform(), never .fit() or .fit_transform()

In [27]:
print(f'The test accuracy of the model is {round(m.score(X_test_fe, y_test), 2)}')

The test accuracy of the model is 0.8


## 9) Deploy the model