# Model training

This notebook fits several machine learning algorithms to the iris dataset and compares the results. Because the task is a simple classification problem and the classes are balanced, accuracy is a reasonable primary metric for model performance.
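
The balance claim can be checked directly. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset (`datasets.load_iris`) as a stand-in for the `iris.csv` file that this notebook loads; the variable names are illustrative only.

```python
from collections import Counter

from sklearn import datasets

# Assumption: the bundled iris data stands in for the notebook's iris.csv.
iris = datasets.load_iris()

# Count how many samples each class contributes.
class_counts = Counter(str(iris.target_names[label]) for label in iris.target)
print(class_counts)
```

If every class appears equally often, a trivial majority-class predictor cannot inflate accuracy, which is why accuracy is informative here.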

In [107]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

This is the Isolation Forest outlier filter. Because it removes samples, it has to transform the target along with the features, so it cannot comply with scikit-learn's Transformer API and therefore cannot be included in the pipeline. Outlier removal is instead performed separately as a preprocessing step.

In [108]:
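class IsolationForestOutlierRemover:
    """Drop the samples that an Isolation Forest flags as outliers."""

    def __init__(self, contamination):
        # Expected proportion of outliers in the data.
        self.contamination = contamination

    def transform(self, X, y):
        iforest = IsolationForest(
            n_estimators=100, contamination=self.contamination, random_state=0
        )
        # fit_predict returns 1 for inliers and -1 for outliers.
        pred = iforest.fit_predict(X)
        # Keep only the inlier rows, in both the features and the target.
        return X.iloc[pred == 1], y.iloc[pred == 1]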
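
To illustrate the filtering step in isolation, here is a small sketch on hypothetical toy data (the arrays, column names, and labels below are made up for the example and are not part of the notebook). It shows that `fit_predict` labels every row as 1 or -1 and that the same boolean mask keeps features and target aligned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical toy data: 95 points near the origin plus 5 far-away points.
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(8, 1, size=(5, 2))]),
    columns=["a", "b"],
)
y = pd.Series(np.arange(len(X)) % 3, name="label")

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
pred = iforest.fit_predict(X)  # 1 = inlier, -1 = outlier

# Apply one mask to both objects so rows stay paired.
inlier_mask = pred == 1
X_clean, y_clean = X.iloc[inlier_mask], y.iloc[inlier_mask]
print(len(X), "->", len(X_clean))
```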

In [109]:
def load_iris_features_and_target():
    features = [
        "sepal length (cm)",
        "sepal width (cm)",
        "petal length (cm)",
        "petal width (cm)",
    ]
    target = ["species"]
    iris_df = pd.read_csv("iris.csv")
    return iris_df[features], iris_df[target]

In [110]:
features_df, target_df = load_iris_features_and_target()

X_train, X_test, y_train, y_test = train_test_split(
    features_df, target_df, train_size=0.8, random_state=0, stratify=target_df
)

outlier_remover = IsolationForestOutlierRemover(0.05)

X_train, y_train = outlier_remover.transform(X_train, y_train)



## SVM

In [111]:
param_grid = [
    {"C": [0.1, 1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.1, 0.001, 0.0001], "kernel": ["rbf"]},
]

svm_classifier = SVC(random_state=0)

svm_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.99)),
        ("svc", GridSearchCV(svm_classifier, param_grid, cv=5, n_jobs=-1)),
    ]
)
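
In the pipeline above the grid search is the final step, so the scaler and PCA are fitted once on the full training data before cross-validation starts. A common alternative, sketched here as an assumption and not what this notebook does, wraps the whole pipeline in `GridSearchCV` so that preprocessing is refitted inside every fold. The sketch uses scikit-learn's bundled iris data and a reduced grid to stay short.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled iris data instead of the notebook's iris.csv.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.99)),
        ("svc", SVC(random_state=0)),
    ]
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = [
    {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear"]},
    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.001], "svc__kernel": ["rbf"]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

With this arrangement the validation folds never influence the scaling or the PCA projection, at the cost of refitting those steps for every candidate and fold.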

In [112]:
svm_pipe.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
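
The output fragment above is the tail of scikit-learn's `DataConversionWarning`, which is raised when the target is passed as a single-column 2-D object (here, a one-column DataFrame) instead of a 1-D array. The following sketch, which assumes the bundled iris data and a hypothetical helper `conversion_warnings`, shows that passing the column as a Series avoids the warning.

```python
import warnings

from sklearn import datasets
from sklearn.exceptions import DataConversionWarning
from sklearn.svm import SVC

# Assumption: bundled iris data instead of the notebook's iris.csv.
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
y_column = y.to_frame(name="species")  # shape (150, 1), a one-column DataFrame


def conversion_warnings(target):
    """Fit an SVC and return the DataConversionWarnings raised during fit."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        SVC(random_state=0).fit(X, target)
    return [w for w in caught if issubclass(w.category, DataConversionWarning)]


n_column = len(conversion_warnings(y_column))  # 2-D target
n_flat = len(conversion_warnings(y_column["species"]))  # 1-D target
print(n_column, n_flat)
```

The fitted model is the same in both cases; the warning only signals that scikit-learn flattened the target itself.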


In [113]:
y_pred = svm_pipe.predict(X_test)
accuracy_score(y_test, y_pred)

1.0

## Gradient Boosting

In [114]:
param_grid = {
    "n_estimators": [25, 50, 100, 150, 200, 300, 500],
    "learning_rate": [0.5, 0.2, 0.1, 0.01],
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 5, 10],
}

gb_classifier = GradientBoostingClassifier(random_state=0)

gb_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.99)),
        ("gb", GridSearchCV(gb_classifier, param_grid, cv=5, n_jobs=-1)),
    ]
)

In [115]:
gb_pipe.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [116]:
y_pred = gb_pipe.predict(X_test)

accuracy_score(y_pred=y_pred, y_true=y_test)

1.0

## Random Forest

In [117]:
param_grid = {
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 5, 10],
}

rf_classifier = RandomForestClassifier(random_state=0)

rf_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.99)),
        ("rf", GridSearchCV(rf_classifier, param_grid, cv=5, n_jobs=-1)),
    ]
)

In [118]:
rf_pipe.fit(X_train, y_train)

  self.best_estimator_.fit(X, y, **fit_params)


In [119]:
y_pred = rf_pipe.predict(X_test)

accuracy_score(y_pred=y_pred, y_true=y_test)

0.9666666666666667

## MLP

In [120]:
param_grid = {
    "hidden_layer_sizes": [(5,), (10,), (20,)],
    "alpha": [1e-05, 1e-03, 1e-02, 1e-01, 0],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": [1e-05, 1e-03, 1e-02, 1e-01],
}

mlp_classifier = MLPClassifier(solver="sgd", max_iter=10000000000, random_state=0)

mlp_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.99)),
        # The step keeps the name "gb" from the earlier pipelines; it holds the MLP grid search.
        ("gb", GridSearchCV(mlp_classifier, param_grid, cv=5, n_jobs=-1)),
    ]
)

In [121]:
mlp_pipe.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [122]:
y_pred = mlp_pipe.predict(X_test)

accuracy_score(y_pred=y_pred, y_true=y_test)

1.0

Three of the four trained models (__SVM, Gradient Boosting, MLP__) reach 100% accuracy on the test set. With a perfect score there is no need to look at other metrics such as precision or recall, because they are necessarily perfect as well. To decide which model to deploy, the inference runtime of each model on the test dataset is measured and the fastest model is chosen.
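
The relationship between the metrics can be verified on a small example. This sketch uses hypothetical labels (not the notebook's test set) in which every prediction matches the ground truth, and computes macro-averaged precision and recall alongside accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: predictions that match the ground truth exactly.
y_true = ["setosa", "versicolor", "virginica", "setosa", "virginica", "versicolor"]
y_pred = list(y_true)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
print(accuracy, precision, recall)
```

With no misclassified samples there are no false positives or false negatives for any class, so all three metrics equal 1.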

## Runtime measurements for inference

We measure the inference runtime of each model with __timeit__.

In [128]:
%timeit -r 15 svm_pipe.predict(X_test)

878 µs ± 8.1 µs per loop (mean ± std. dev. of 15 runs, 1,000 loops each)


In [129]:
%timeit -r 15 gb_pipe.predict(X_test)

1.3 ms ± 10.3 µs per loop (mean ± std. dev. of 15 runs, 1,000 loops each)


In [130]:
%timeit -r 15 rf_pipe.predict(X_test)

2.98 ms ± 43.3 µs per loop (mean ± std. dev. of 15 runs, 100 loops each)


In [131]:
%timeit -r 15 mlp_pipe.predict(X_test)

857 µs ± 4.69 µs per loop (mean ± std. dev. of 15 runs, 1,000 loops each)


__MLP__ has the fastest mean runtime (857 µs, narrowly ahead of the SVM at 878 µs) and will therefore be used as the final model for deployment.
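
`%timeit` is an IPython magic and is only available in a notebook or IPython session. A comparable measurement in plain Python could look like the sketch below, which uses the standard-library `timeit.repeat`. The stand-in pipeline, the loop count, and the variable names are assumptions for illustration; any fitted model with a `predict` method could be substituted.

```python
import statistics
import timeit

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: a small stand-in pipeline fitted on the bundled iris data.
X, y = datasets.load_iris(return_X_y=True)
model = Pipeline([("scaler", StandardScaler()), ("svc", SVC(random_state=0))])
model.fit(X, y)

n_loops = 100  # predict() calls per run (%timeit picks this automatically)
n_runs = 15  # corresponds to the -r 15 option

# Each entry is the total time of one run of n_loops calls, in seconds.
run_totals = timeit.repeat(lambda: model.predict(X), repeat=n_runs, number=n_loops)
per_loop = [total / n_loops for total in run_totals]

mean_us = statistics.mean(per_loop) * 1e6
std_us = statistics.stdev(per_loop) * 1e6
print(f"{mean_us:.0f} µs ± {std_us:.1f} µs per loop")
```

Absolute numbers depend on the machine and its current load, so only timings taken in the same session are directly comparable.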

In [137]:
mlp_pipe['gb'].best_params_

{'alpha': 1e-05,
 'hidden_layer_sizes': (5,),
 'learning_rate': 'constant',
 'learning_rate_init': 0.1}