# Machine Learning Workflow Overview
Let’s take a short tour of the basic steps of machine learning. We will see that there is no magic behind it, only a set of methods and best practices. In addition, the notebook shows that switching between different machine learning models is easy when sticking to some conventions. The overall workflow is:
* [Data Import and Preparation](#Data-Import-and-Preparation)
* [Data Exploration](#Data-Exploration)
* [Feature Selection and Engineering](#Feature-Selection-and-Engineering)
* [Model Definition](#Model-Definition)
* [Training](#Training)
* [Validation and Performance](#Validation-and-Performance)

The overall workflow has to be understood as an iterative process. The [**scikit-learn**](http://scikit-learn.org/stable/) package, or `sklearn` for short, provides the relevant models and tools.
### Import of packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
plt.style.use('ggplot')

## Data Import and Preparation

The data preparation steps can take up most of the time of the full workflow. Especially with real-world data, information is often missing, sanity checks have to be performed, datasets from different sources have to be joined, and much more.

For our example, we use a famous dataset that comes with the `sklearn` package. It is the [**Iris flower dataset**](https://en.wikipedia.org/wiki/Iris_flower_data_set), which describes three different species of Iris flowers (Iris setosa, Iris versicolor, Iris virginica). For more information on the dataset, see also [here](http://archive.ics.uci.edu/ml/datasets/Iris).

There are other open datasets for getting familiar with machine learning. Have a look at the [documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) of `sklearn` for other examples, such as handwritten digits for classification or house prices for regression.

In [None]:
# Import datasets
from sklearn import datasets

# Load Iris dataset
iris = datasets.load_iris()
meta = iris.DESCR

In [None]:
# Getting information of the dataset
for key in iris: 
    print(key)

In [None]:
# Construction of DataFrame
df = pd.DataFrame(iris.data)

In [None]:
df.head()

In [None]:
# Set the column name
df.columns = iris['feature_names']
df.head()

This dataset is a good sandbox for a classification task. Our **target classes**, the iris species, are already included in the data set together with some variables (or **features**) describing the flowers.

In [None]:
# Adding target column
df['target'] = iris.target
df.head()

#### _Remark: Real world data_

With other datasets, obtaining information such as the column names can be the first task in data preparation and take some time. In addition, one of the main parts is to aggregate data from different sources into one data structure (here: `pandas.DataFrame`) to which the machine learning model will be applied. In general, ML algorithms need numerical data as input, so strings have to be encoded (see Feature Engineering). Units also have to be checked, and time series have to be converted to a proper datetime format (see `pandas.to_datetime`). Another major part is to perform sanity checks on the data, to check for missing values and, where appropriate, to compensate for outliers. As mentioned before, all in all this takes more than half of the actual time.
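
The preparation steps named above can be sketched on a tiny, made-up table. All column names and values below are hypothetical and only serve as an illustration; real data will usually need more careful treatment than simply dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements; column names and values are made up
raw = pd.DataFrame({
    'measured_at': ['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-04'],
    'length_mm': [51.0, 49.0, np.nan, 47.0],  # wrong unit and one missing value
    'species': ['setosa', 'setosa', 'virginica', 'versicolor'],
})

# Parse the date strings into proper timestamps
raw['measured_at'] = pd.to_datetime(raw['measured_at'])

# Convert the unit from mm to cm so that it matches the other sources
raw['length_cm'] = raw['length_mm'] / 10

# Sanity check: count the missing values per column
n_missing = raw.isna().sum()
print(n_missing)

# One simple option: drop incomplete samples
clean = raw.dropna().reset_index(drop=True)
print(clean.shape)
```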

## Data Exploration

As the data import and preparation is already finished, we can take a first look at the data itself. Our task is to get a first _feeling_ for the data: to get used to all features and to figure out whether they have enough power to separate the target classes. If we do not find any differences and all attributes behave the same way for each target, then machine learning will not give us any promising result either!

In total we have 150 samples, four features and the target information.

In [None]:
df.shape

Features are the length and width of the sepal and petal, respectively. The unit is cm.

In [None]:
df.columns

There is only numerical data.

In [None]:
df.dtypes

There is no missing data.

In [None]:
df.isna().any()

There are three different species. Each species has fifty samples.

In [None]:
df.groupby('target')['target'].count()

We are now looking for differences between the species by calculating some summary statistics for each of them:

In [None]:
df[df['target']==0].describe().loc[['mean','std']].round(2)

In [None]:
df[df['target']==1].describe().loc[['mean','std']].round(2)

In [None]:
df[df['target']==2].describe().loc[['mean','std']].round(2)

We can already see that the first species (target = 0) has a much lower mean petal length and width than the other two. Because of this difference, petal length and petal width should be valuable features for a classifier. Let's check this by plotting the distribution of one feature (here: petal width) for the different species.

In [None]:
# 0: 'sepal length (cm)'
# 1: 'sepal width (cm)'
# 2: 'petal length (cm)'
# 3: 'petal width (cm)'

variable = iris.feature_names[3]

for i in [0,1,2]:
    df[df['target']==i][variable].plot(
        kind='hist',
        bins=np.linspace(df[variable].min(),df[variable].max(),15),
        figsize=(8,5),   
        alpha=0.5,       # transparency
        label=f'{i} {iris.target_names[i]}',
        legend=True
    );

In [None]:
# 0: 'sepal length (cm)'
# 1: 'sepal width (cm)'
# 2: 'petal length (cm)'
# 3: 'petal width (cm)'

variable_A = iris.feature_names[0]
variable_B = iris.feature_names[1]

df.plot(
    kind='scatter',
    x=variable_A,
    y=variable_B,
    c=df['target'],
    figsize=(8,5), # (width, height)
    cmap='Set1',   # colormap
    s=50,          # dot size
    colorbar=False
);

## Feature Selection and Engineering

So far, we got a broad overview of our data and could detect some promising features for a classification task. For the actual training of an ML model we need to select **features** (**Feature Selection**) as input to classify our **target**. In our example we use all four features, but we could also select only a part of them. With real-world data it often makes sense to take only a selection, because computing power can be limiting or because features without meaning do not improve the overall performance of the classifier.

> **Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.** &mdash; Andrew Ng

Besides selection, creating additional features (**feature engineering**) can be another crucial step. In this case we are fine with the four features we have, but with real-world data we almost always have to perform feature engineering to develop the full potential of ML. Some examples are:
- encoding features (e.g. categories to numerical features),
- applying transformations to features (e.g. log scale),
- generating new features (e.g. simple statistics),
- rounding, binning, sampling, ...
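
The kinds of feature engineering listed above can be sketched with plain `pandas` and `numpy`. The table below is a hypothetical toy example: the `color` column and the derived "petal area" feature are assumptions for illustration and are not part of the Iris dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the 'color' column is made up for illustration
toy = pd.DataFrame({
    'color': ['blue', 'purple', 'blue', 'white'],
    'petal length (cm)': [1.4, 4.7, 1.3, 6.0],
    'petal width (cm)': [0.2, 1.4, 0.2, 2.5],
})

# Encoding: one indicator column per category
encoded = pd.get_dummies(toy['color'], prefix='color')

# Transformation: log scale
toy['log petal length'] = np.log(toy['petal length (cm)'])

# New feature: a rough "petal area" as the product of length and width
toy['petal area (cm^2)'] = toy['petal length (cm)'] * toy['petal width (cm)']

# Binning: three equal-width bins over the observed range
toy['petal length bin'] = pd.cut(
    toy['petal length (cm)'], bins=3, labels=['short', 'medium', 'long']
)

# Combine everything into one numerical/categorical feature table
features = pd.concat([toy.drop(columns='color'), encoded], axis=1)
print(features)
```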

In [None]:
df.columns

In [None]:
# Feature Selection:
training_features = [
    'sepal length (cm)', 
    'sepal width (cm)', 
    'petal length (cm)',
    'petal width (cm)'
]

## Model Definition
Let's start with the actual ML. Keep in mind that we already spent at least half of our time on the steps before! In `sklearn` we can import several different ML models, from simple [decision trees](http://scikit-learn.org/stable/modules/tree.html) to more advanced models such as a [random forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or a [multi layer perceptron (MLP)](http://scikit-learn.org/stable/modules/neural_networks_supervised.html). If we stick to some conventions, it is quite easy to use the same workflow to switch between models and compare the results.

In [None]:
# Import of ML models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
# Defining the model with key parameters
model = RandomForestClassifier(n_estimators=25, max_depth=2)
#model = MLPClassifier(hidden_layer_sizes=(20,), activation='relu')
#model = GradientBoostingClassifier(n_estimators=20, max_depth=2)
#model = MLPClassifier()

#### Tuning the Model
To get the most out of our model, we have to tune the model's **hyper-parameters**. Hyper-parameters are parameters of the learning algorithm, rather than parameters of the trained model. Here we mostly use default parameters, which are often fine for a first try. However, we should always set some key parameters explicitly.
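
One common way to tune hyper-parameters in `sklearn` is a grid search with cross-validation. The following is only a minimal sketch: the grid values are arbitrary assumptions, and for brevity the search runs on the full Iris dataset, whereas in a real workflow it would be restricted to the training part.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Candidate values are arbitrary choices for illustration
param_grid = {
    'n_estimators': [10, 25],
    'max_depth': [1, 2, 3],
}

# Try every combination and score each one with 5-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(f'Best mean accuracy: {search.best_score_:.3f}')
```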

## Training

We use all selected features as **input (`X`)**. The exploration step already showed that there are no missing values, so no samples have to be dropped. In addition, we define the **target (`y`)** that we want to predict.

In [None]:
X = df[training_features]
y = df['target']

#### Preparation for Validation

For subsequent performance checks and validation, we split our dataset into two parts - a **training and a test set** ([`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).
The rationale behind this is, simply put, that a model should not be evaluated on the same data that it was trained with: A model that memorizes all labels it has seen would achieve a perfect score - but it would clearly be useless when faced with new data to classify. We say that it would fail to **generalize**.
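
A small sketch can make the memorization argument concrete. It assumes purely random data (features and labels are unrelated) and an unrestricted decision tree as the "memorizing" model; both are illustrative choices and not part of the Iris workflow.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random features and random labels: there is nothing to learn here
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 4))
y_noise = rng.integers(0, 3, size=200)

# An unrestricted tree can memorize every training label
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_noise[:100], y_noise[:100])

train_score = tree.score(X_noise[:100], y_noise[:100])
test_score = tree.score(X_noise[100:], y_noise[100:])

print(f'train accuracy: {train_score:.2f}')
print(f'test accuracy:  {test_score:.2f}')
```

The training accuracy is perfect because the tree has memorized the labels, while the accuracy on unseen samples stays close to guessing among three classes.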

Consequently, we use only the `train` part for the training and put aside the `test` part. In the example we split the dataset into two halves (`test_size=0.5`). Keep in mind that the train and test datasets should contain the same kind of information. For example, all three species should be represented in roughly equal shares in both datasets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [None]:
for i in (0,1,2):
    test = y_test[y_test==i].count()
    train = y_train[y_train==i].count()
    print(f'Species {i}: test {test} and train {train}')
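
With a purely random split, the species counts in the two parts can differ somewhat. If equal shares are required, `train_test_split` accepts a `stratify` argument. The sketch below uses its own variable names and a fixed `random_state`, both chosen here only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# stratify keeps the class proportions identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    test_size=0.5,
    stratify=iris.target,
    random_state=42,
)

# Number of samples per species in each part
counts_train = np.bincount(y_tr)
counts_test = np.bincount(y_te)
print(f'train: {counts_train}, test: {counts_test}')
```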

#### Machine Learning: Actual Training

After these preparations, the actual training is as simple as calling the `fit` method of the algorithm:

In [None]:
model.fit(X_train, y_train)
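
Because every `sklearn` classifier exposes the same `fit` and `score` methods, several models can be trained and compared in one loop. This is a minimal sketch under assumed settings (the candidate models, their parameters and the `random_state` values are arbitrary), not a recommendation for a particular model.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.5, stratify=iris.target, random_state=0
)

# Candidate models; parameters are arbitrary choices for illustration
candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=2, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=25, max_depth=2, random_state=0),
    'gradient boosting': GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0),
}

# Same two calls for every model: fit on train, score on test
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = candidate.score(X_te, y_te)
    print(f'{name}: {scores[name]:.3f}')
```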

## Validation and Performance

The actual machine learning training is done. Let's have a look at our results and compare how well our model performs on the training and the test dataset. If we see the same performance on both sets, we can take this as a strong indicator for a valid model. If the model performs much better on the training dataset, something is wrong (-> **overtraining**, also known as overfitting)! With `model.predict_proba(dataset)` we get, for each sample, the predicted probability of belonging to each iris species.

In [None]:
y_proba_test = model.predict_proba(X_test)
y_proba_train = model.predict_proba(X_train)

In [None]:
# We get three probabilities per sample
y_proba_test.shape

In [None]:
# Sum for each sample is one
y_proba_test.sum(axis=1)

In [None]:
# Check results for one species i
i=1
y_proba_test_i = y_proba_test[:,i]
y_proba_train_i = y_proba_train[:,i]

# Probability for one species i
y_proba_test_i.shape

In [None]:
# How many do really belong to species i?
y_proba_test_i[(y_test == i)].shape

In [None]:
# Respective probabilities
y_proba_test_i[(y_test == i)]

In [None]:
# Do samples of other species get a high
# probability for species i? (-> False Positives)
y_proba_test_i[(y_test != i)].round(2)

In [None]:
# How many probabilities are greater than 50 percent?
(y_proba_test_i > 0.50).sum()

In [None]:
# How many samples really belong to species i?
y_proba_test_i[(y_test == i).values].shape

In [None]:
# Comparison of all species:
for i in range(3):
    y_proba_test_i = y_proba_test[:,i]
    species_i = y_proba_test_i[(y_test == i)].mean()
    not_species_i = y_proba_test_i[(y_test != i)].mean()
    print(f'Species {i}: Mean probability for true {species_i:.2f} and false {not_species_i:.2f} predictions.')

In [None]:
# This returns the 'decision' of the classifier
y_pred_test = model.predict(X_test)
y_pred_test[0:10]

In [None]:
print((y_pred_test==0).sum())
print((y_pred_test==1).sum())
print((y_pred_test==2).sum())

### Hypothesis test

So far, we did everything by hand. There is an easy way to check the results by visualization. Each chart shows, for all samples, the distribution of the predicted probability of belonging to one species. The color indicates the true membership. A good classifier will show a clear separation.

In [None]:
for i in [0,1,2]:
    y_proba_test_i = y_proba_test[:,i]
    
    plt.figure(figsize=(10, 4))
    
    # One histogram per true species j
    for j in [0,1,2]:
        plt.hist(y_proba_test_i[(y_test == j).values], bins=np.linspace(0,1,20), alpha=0.5, density=False, label=iris.target_names[j])

    plt.legend()
    plt.title(f'Hypothesis: Sample belongs to species {i}')
    plt.xlabel('Probability')
    plt.ylabel('Frequency')
    
    #plt.yscale('log', nonpositive='clip')
    plt.show()

### Confusion Matrix

The [**Confusion Matrix**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) (Table of Confusion) shows for each class how many samples are classified correctly (principal diagonal) and how many are classified incorrectly. In addition, it shows to which wrong class the samples are assigned. In our case we get a 3x3 matrix. The sum of a row gives all true members of a species, and the sum of a column gives all samples predicted to belong to that class. A _perfect_ classifier would only have entries on the principal diagonal.

In [None]:
from sklearn.metrics import confusion_matrix

y_pred_test = model.predict(X_test)
truth = y_test

cm = confusion_matrix(truth, y_pred_test)
print(cm)

In [None]:
# Sum of true members (row)
(y_test.values==2).sum()

In [None]:
# Sum of predicted members (column)
(y_pred_test==2).sum()

The Confusion Matrix can be condensed to a binary classification for each class. The result is a 2x2 matrix. The first row sums to all true members (**Positives**, P), consisting of **True Positives** (TP) and **False Negatives** (FN). The second row sums to all non-members (**Negatives**, N), consisting of **False Positives** (FP) and **True Negatives** (TN).

Actual class | classified as Positives | classified as Negatives
-|-|-
**Positives (P)** | True Positives (TP) | False Negatives (FN)
**Negatives (N)** | False Positives (FP)  | True Negatives (TN) 
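
The layout of the table can be reproduced with `confusion_matrix` on a small hand-made example. The labels below are invented for illustration. They follow the encoding used in this notebook, where `0` marks the Positives and `1` the Negatives, and `labels=[0, 1]` makes the row and column order explicit.

```python
from sklearn.metrics import confusion_matrix

# Hand-made example; 0 = positive, 1 = negative
truth = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(truth, pred, labels=[0, 1])

# First row: Positives (TP, FN); second row: Negatives (FP, TN)
(tp, fn), (fp, tn) = cm

print(cm)
print(f'TP={tp} FN={fn} FP={fp} TN={tn}')
```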

In [None]:
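# Function to condense to a binary classification
# (0: sample is assigned to species i, 1: sample is assigned to another species)
def make_bina_class(model, X_sample, i, threshold=0.0, check_max=True):
    proba = model.predict_proba(X_sample)
    if check_max:
        # Positive only if species i is the predicted class
        # and its probability reaches the threshold
        y_pred_test = model.predict(X_sample)
        bina_class = [0 if (pred == i) and (proba[pos][i] >= threshold) else 1 for pos, pred  in enumerate(y_pred_test)]
    else:
        # Positive whenever the probability for species i reaches the threshold
        bina_class = [0 if (pred >= threshold) else 1 for pred in proba[:,i]]
    return bina_class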

# This is done in the List Comprehension:
   #for pos, pred  in enumerate(y_pred_test):
   #    if (pred == i) & (proba[pos][i] >= threshold):
   #        bina_class.append(0)
   #    else:
   #        bina_class.append(1)

In [None]:
y_pred_test = model.predict(X_test)
truth = y_test
cm = confusion_matrix(truth, y_pred_test)
print(f'Confusion Matrix 3x3')
print(cm)

for i in range(3):
    pred = make_bina_class(model, X_test, i)
    truth_i = [0 if j == i else 1 for j in y_test]
    cm= confusion_matrix(truth_i,pred)
    print(f'\n Confusion Matrix for Species {i}')
    print(cm)

### ROC Curve
The Receiver Operating Characteristic (ROC) curve is a slightly more sophisticated way to validate a model. A ROC curve shows the true positive rate as a function of the false positive rate. Given a certain hypothesis and an acceptable false positive rate, we can read off which fraction of the samples that truly fit the hypothesis we can select. In addition, we show the results for the train and test datasets in comparison to detect deviations.

In [None]:
from sklearn.metrics import roc_curve

In [None]:
for i in [0,1,2]:
    y_proba_test_i = y_proba_test[:,i]
    y_proba_train_i = y_proba_train[:,i]
    
    plt.figure(figsize=(5, 5))
    plt.plot(*roc_curve(y_test == i, y_proba_test_i)[:2], label='test')
    plt.plot(*roc_curve(y_train == i, y_proba_train_i)[:2], label='train')
    plt.plot([0, 1],[0, 1], color='black', linestyle=':')
    plt.title(f'ROC curve species {i}')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate') 
    plt.legend(loc='best')
    plt.show();    
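
Each point on a ROC curve corresponds to one probability threshold. The sketch below computes the true positive rate and the false positive rate by hand for a single threshold and checks that the point lies on the curve returned by `roc_curve`. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hand-made example; 1 = sample truly fits the hypothesis
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

# Select every sample whose score reaches the threshold
threshold = 0.5
selected = y_score >= threshold

# Fraction of true members selected and fraction of non-members selected
tpr = (selected & (y_true == 1)).sum() / (y_true == 1).sum()
fpr = (selected & (y_true == 0)).sum() / (y_true == 0).sum()
print(f'threshold {threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}')

# roc_curve repeats this calculation for every distinct score
fpr_curve, tpr_curve, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
on_curve = any(
    np.isclose(f, fpr) and np.isclose(t, tpr)
    for f, t in zip(fpr_curve, tpr_curve)
)
print(f'point lies on the ROC curve: {on_curve}')
```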

### AUC and Accuracy

There are several other performance indicators to validate the trained model. For example, the area under the ROC curve (**A**rea **U**nder **C**urve, **AUC**) or the mean **Accuracy** $\bigl(\frac{TP + TN}{P + N}\bigr)$ can be taken into account. The accuracy gives the overall fraction of correctly classified samples, regardless of whether they belong to the Positives or the Negatives.
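
For more than two classes, the accuracy can be read off the confusion matrix as the sum of the principal diagonal divided by the total number of samples. The following sketch checks this against `accuracy_score` on hand-made labels, which are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made example with three classes
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]

cm = confusion_matrix(truth, pred)

# Correctly classified samples sit on the principal diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(f'accuracy from confusion matrix: {accuracy_from_cm:.2f}')
print(f'accuracy_score:                 {accuracy_score(truth, pred):.2f}')
```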

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

In [None]:
y_pred_test = model.predict(X_test)

# How many are misclassified
print(f'misclassified: {(y_pred_test != y_test).sum()}')

In [None]:
data=[]
for i in (0,1,2):
    y_proba_test_i = y_proba_test[:,i]
    data.append(roc_auc_score(y_test.values == i, y_proba_test_i))

pd.DataFrame(data,columns=['AUC'])



In [None]:
print(f'Mean Accuracy: {model.score(X_test, y_test):.3f}')
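
A single train/test split gives only one accuracy estimate, which depends on how the samples happened to be divided. `cross_val_score` repeats the split several times and returns one score per fold. This is a minimal sketch, assuming a fresh classifier with the same key parameters as above and an arbitrary choice of five folds.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=25, max_depth=2, random_state=0)

# Five folds: every sample is used for testing exactly once
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores.round(3))
print(f'Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```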

### Feature Importance
Several machine learning models return a score for the importance of each feature within the classifier. This can be used to perform more training steps to improve the model, to reduce computing time, or to feed information back into the initial data acquisition. If we detect that one feature is very important for the classifier, it may be a good idea to improve the quality of this feature or to engineer equivalent features. In addition, this step can highlight features which were not expected to be important and can lead to a rethinking of strategies.

In [None]:
# Only works for models that expose feature_importances_,
# e.g. GradientBoostingClassifier or RandomForestClassifier
if hasattr(model, 'feature_importances_'):
    plt.figure(figsize=(5, 5))
    plt.barh(range(len(X.columns)), model.feature_importances_)
    plt.yticks(range(len(X.columns)), X.columns)
    plt.show()
else:
    print("Works only for classifiers like GradientBoostingClassifier or RandomForestClassifier")
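
For models without a built-in importance score, a model-independent alternative is the permutation importance from `sklearn.inspection`: one feature at a time is shuffled and the resulting drop in the score is measured. This is a different technique from `feature_importances_`, shown here only as a sketch. The k-nearest-neighbors classifier is an arbitrary stand-in for any model that lacks the attribute.

```python
from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.5, stratify=iris.target, random_state=0
)

# A model without a feature_importances_ attribute (arbitrary choice)
knn = KNeighborsClassifier().fit(X_tr, y_tr)

# Shuffle one feature at a time and measure the drop in test accuracy
result = permutation_importance(knn, X_te, y_te, n_repeats=10, random_state=0)

for name, mean in zip(iris.feature_names, result.importances_mean):
    print(f'{name}: {mean:.3f}')
```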

As we already saw in the exploration step, the petal width and length have the highest impact on the classification.

#### *Remark: Iterative process!*
In this case it is quite easy to get a valid model. However, training and validation have to be treated as an iterative process in machine learning.

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_