# Machine Learning Workflow

**In this session, we will walk through solving a machine learning problem from end to end.**

### 1. Define Business Goal

**We want to predict the species of a penguin from its body measurements.**

![](penguin_heads.png)

We will take this goal as given, although in practice, figuring out *what* the task is takes a lot of work.

In [None]:
import pickle

import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

### 2. Get Data
Getting clean data is also a huge effort. Again, this is not the topic of this session.

In [None]:
df = pd.read_csv('../data/penguins_simple.csv', sep=';')
df.head(3)

### 3. Split Data into Training, Validation and Test sets

* training: what we use to explore the data and train the model
* validation: what we use to iteratively evaluate and optimize the model
* test: what we use **once** to estimate the error rate of the model

In [None]:
train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.2, random_state=43)
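
To make the two-stage split concrete, here is a minimal sketch on a hypothetical 100-row table (the `toy` frame and its columns are made up for illustration; the counts for the penguin data will differ). Holding out 20% twice leaves roughly 64% for training, 16% for validation and 20% for testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the penguin table: 100 rows, two columns
toy = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

# Same two-stage split as above: first hold out 20% for testing ...
toy_train_val, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
# ... then hold out 20% of the remaining rows for validation
toy_train, toy_val = train_test_split(toy_train_val, test_size=0.2, random_state=43)

sizes = {"train": len(toy_train), "val": len(toy_val), "test": len(toy_test)}
print(sizes)

# The three parts do not share any rows
all_indices = list(toy_train.index) + list(toy_val.index) + list(toy_test.index)
print(len(set(all_indices)) == len(toy))
```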

#### Exercise 1: Check the number of items in train, val and test

### 4. Explore Data

Use **only** the training data for exploration to prevent *data leakage*.

In [None]:
# distribution of target classes
train['Species'].value_counts().plot.barh()

#### Exercise 2: Interpret the bar plot. Can we achieve an accuracy better than 33%?

In [None]:
# correlation of numerical features
sns.pairplot(train, hue='Species')

#### Exercise 3: Interpret the pair plot. 
* Can we hope to predict anything?
* Is there a species that is easier to predict?

### 4. Define X and y

In [None]:
# X is a matrix of input features
COLUMNS = ['Culmen Length (mm)', 'Culmen Depth (mm)',
           'Flipper Length (mm)', 'Body Mass (g)', 'Sex'
          ]
Xtrain = train[COLUMNS]
Xval = val[COLUMNS]

# y is a categorical variable --> Classification
ytrain = train['Species']
yval = val['Species']

In [None]:
Xtrain.shape, ytrain.shape

In [None]:
Xval.shape, yval.shape

### 5. Feature Engineering

In [None]:
# convert the MALE/FEMALE column to 0/1
ohc = ColumnTransformer([
    # scikit-learn >= 1.2 uses `sparse_output`; older versions call this parameter `sparse`
    ('one-hot', OneHotEncoder(drop='first', handle_unknown='error', sparse_output=False), ['Sex']),
    ('do nothing', 'passthrough', COLUMNS[:-1])
])
ohc.fit(Xtrain)
Xtrans = ohc.transform(Xtrain)
Xtrans.shape
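
The effect of the transformer is easier to see on a tiny example. The sketch below uses a hypothetical three-row frame with only two of the columns; `sparse_threshold=0` is set on the `ColumnTransformer` here so that the output is dense regardless of the encoder's default, which is a choice made for this illustration and not part of the cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; the real table has more columns
toy = pd.DataFrame({
    "Body Mass (g)": [3750.0, 3800.0, 5000.0],
    "Sex": ["MALE", "FEMALE", "MALE"],
})

toy_ct = ColumnTransformer(
    [
        # drop='first' removes the first category (FEMALE), leaving one 0/1 column
        ("one-hot", OneHotEncoder(drop="first"), ["Sex"]),
        # numeric columns are copied through unchanged
        ("do nothing", "passthrough", ["Body Mass (g)"]),
    ],
    sparse_threshold=0,  # always produce dense output
)
toy_trans = toy_ct.fit_transform(toy)
print(toy_trans)
```

Note that the output columns follow the order of the transformer list, not the order of the input frame: the encoded `Sex` column comes first, followed by the passthrough columns.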

#### Exercise 4: Inspect the data type of `Xtrans`

### 6. Train a Model

In [None]:
m = DecisionTreeClassifier(max_depth=2)   # here we set hyperparameters
m.fit(Xtrans, ytrain)

### 7. Evaluate the Model

In [None]:
ypred = m.predict(Xtrans)
acc_train = accuracy_score(ytrain, ypred)
f"training accuracy {acc_train:4.2f}"
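
Accuracy is simply the fraction of predictions that match the true labels. A small worked example with made-up labels (the class names `A`, `B`, `C` are placeholders) shows that `accuracy_score` agrees with counting by hand:

```python
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_true = ["A", "B", "A", "C"]
y_hat = ["A", "B", "C", "C"]

# accuracy = number of matching positions / number of samples
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
acc = accuracy_score(y_true, y_hat)
print(f"manual {manual:4.2f}, sklearn {acc:4.2f}")
```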

#### Exercise 5: The code below does not work. What did we miss?

In [None]:
ypred_val = m.predict(Xval)
acc_val = accuracy_score(yval, ypred_val)
f"validation accuracy {acc_val:4.2f}"

Now iterate: adjust the features or hyperparameters and re-evaluate on the validation set until you are happy with the outcome, or stop.
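
One possible way to organize this loop is sketched below. It compares several values of `max_depth` by training and validation accuracy. The data comes from `make_classification` and is purely synthetic (an assumption made so that the example is self-contained); the list of depths is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 3 classes (not the penguin table)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (
        accuracy_score(y_tr, tree.predict(X_tr)),  # training accuracy
        accuracy_score(y_va, tree.predict(X_va)),  # validation accuracy
    )

for depth, (acc_tr, acc_va) in scores.items():
    print(f"max_depth={depth!s:>4}  train {acc_tr:4.2f}  val {acc_va:4.2f}")
```

A large gap between training and validation accuracy is a hint that the model is overfitting; the test set stays untouched during this loop.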

### 8. Estimate model error

Do this *once*.

#### Exercise 6: Complete the code

In [None]:
Xtest = test[COLUMNS]
ytest = test['Species']

Xtest_trans = ...
ypred_test = ...
acc_test = ...
f"test accuracy {acc_test:4.2f}"

### 9. Deploy the model
Here we just save the model to a file so that it can be used elsewhere.

In [None]:
# the with-block closes the file even if pickling fails
with open('../models/penguin_tree.pkl', 'wb') as f:
    pickle.dump(m, f)
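
Pickling only `m` stores the tree but not the fitted `ohc` transformer, so whoever loads the file still has to reproduce the feature engineering step. One common option, sketched here under assumptions, is to bundle both steps in a scikit-learn `Pipeline` and pickle that instead. The toy data, the step names `encode` and `tree`, and the labels are hypothetical; the round trip happens in memory, so no file is written.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the training set
toy_X = pd.DataFrame({
    "Body Mass (g)": [3700.0, 3800.0, 5000.0, 5200.0],
    "Sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})
toy_y = ["small", "small", "large", "large"]

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [
            ("one-hot", OneHotEncoder(drop="first"), ["Sex"]),
            ("do nothing", "passthrough", ["Body Mass (g)"]),
        ],
        sparse_threshold=0,  # always produce dense output
    )),
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
])
pipe.fit(toy_X, toy_y)

# Round trip through pickle in memory
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline accepts raw, untransformed rows
new_rows = pd.DataFrame({"Body Mass (g)": [3600.0, 5100.0], "Sex": ["FEMALE", "MALE"]})
preds = restored.predict(new_rows)
print(list(preds))
```

Keep in mind that pickle files should only be loaded from trusted sources, and that they are generally tied to the scikit-learn version that produced them.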

In [None]:
ls -l ../models/penguin*