## Automated Machine Learning
We will be working with the [Heart Failure Prediction dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

### Libraries
- [HyperOpt](https://hyperopt.github.io/hyperopt/) ([Tutorial](https://towardsdatascience.com/optimise-your-hyperparameter-tuning-with-hyperopt-861573239eb5))


### Instructions
1. Choose a dataset. Build and train a baseline for comparison: try a set of candidate machine learning algorithms (13 algorithms) with their default hyperparameters and keep the one with the highest performance as the baseline.

2. Based on the problem at hand, study the potential pipeline structure, the
algorithms or feature transformers at each step, and the hyperparameter ranges. Use
HyperOpt with this search space to beat the baseline.

3. Monitor the performance of the pipeline constructed in the previous step across different time budgets (number of iterations) and report the smallest time budget at which it outperforms the baseline.

4. Determine whether the difference in performance between the constructed pipeline and the baseline is statistically significant.
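
Step 4 can be approached by comparing the two models on the same cross-validation folds and testing the paired per-fold differences. The following is a minimal sketch under stated assumptions, not the method used in this notebook: it implements an exact paired sign-flip permutation test with the standard library only, the function name `paired_permutation_test` is hypothetical, and the per-fold scores are placeholders rather than results from this dataset.

```python
from itertools import product
from statistics import mean


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired per-fold scores.

    Returns the fraction of sign assignments whose absolute mean difference
    is at least as large as the observed one (the p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired fold by fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    # Under the null hypothesis each paired difference is equally likely to
    # carry either sign, so enumerate all 2^n assignments (cheap for ~10 folds).
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(mean(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder per-fold accuracies (hypothetical, NOT results from this notebook).
tuned_scores = [0.88, 0.87, 0.90, 0.86, 0.89, 0.88, 0.87, 0.91, 0.86, 0.89]
baseline_scores = [0.87, 0.86, 0.88, 0.86, 0.87, 0.88, 0.85, 0.89, 0.87, 0.87]

p_value = paired_permutation_test(tuned_scores, baseline_scores)
print(f"p-value: {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

A paired t-test or Wilcoxon signed-rank test (for example `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon`) on the same per-fold scores would be a common alternative; the permutation test is used here only because it makes no distributional assumption and needs no extra dependency.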

In [1]:
import numpy as np
import scipy as scp
import pandas as pd
import seaborn as sn

In [2]:
df = pd.read_csv('./heart_failure.csv')
df.sample(3)

     Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
722   60    F            ASY        150          258          0         LVH    157               N      2.6      Flat             1
63    46    M            ASY        120          277          0      Normal    125               Y      1.0      Flat             1
25    36    M            NAP        130          209          0      Normal    178               N      0.0        Up             0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [None]:
from util import preprocess, find_baseline

df = preprocess(df)
scores = find_baseline(df)

In [5]:
for name, score in sorted(scores.items(), key=lambda t: t[1], reverse=True):
    print(f'{name:20}: {score:.3f}')

Random Forest       : 0.871
AdaBoost            : 0.863
Naive Bayes         : 0.855
Linear SVM          : 0.850
Logistic Regression : 0.849
QDA                 : 0.845
Decision Tree       : 0.837
Neural Network      : 0.826
Gaussian Process    : 0.753
KNN                 : 0.708
RBF SVM             : 0.553
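
With Random Forest's 0.871 as the baseline, step 3 of the instructions amounts to finding the first iteration at which the best score seen so far exceeds it. A minimal sketch, assuming the per-trial losses are available as a list (for example from `Trials.losses()` when a `hyperopt.Trials` object is passed to `fmin`) and assuming the objective returns `loss = 1 - score`; the function name `smallest_winning_budget` and the loss values are hypothetical placeholders.

```python
from itertools import accumulate


def smallest_winning_budget(losses, baseline_score):
    """Return the 1-based number of iterations after which the best-so-far
    score first exceeds the baseline, or None if it never does."""
    best_so_far = accumulate(losses, min)  # running minimum of the loss
    for n_iter, loss in enumerate(best_so_far, start=1):
        if 1.0 - loss > baseline_score:  # assumes loss = 1 - score
            return n_iter
    return None


# Placeholder losses (hypothetical, NOT taken from an actual run).
example_losses = [0.20, 0.15, 0.16, 0.128, 0.14, 0.125]
print(smallest_winning_budget(example_losses, baseline_score=0.871))  # -> 4
```

Because the search is stochastic, the smallest winning budget will vary between runs; repeating the search with several seeds and reporting the spread would give a more reliable figure.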


In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hyperopt import fmin, hp, tpe, STATUS_OK
from util import create_objective, preprocess

df = pd.read_csv('./heart_failure.csv')
df = preprocess(df)

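# Search space for the Random Forest hyperparameters.
search_space = {
    # hp.randint samples integers from [low, high).
    'n_estimators': hp.randint('n_estimators', 50, 150),
    # For hp.choice, fmin reports the index of the chosen option, not its value.
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'max_depth': hp.randint('max_depth', 10, 200),
    # Float values are interpreted by scikit-learn as fractions of the samples.
    'min_samples_split': hp.uniform('min_samples_split', 0, 0.1),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.1),
    'min_weight_fraction_leaf': hp.uniform('min_weight_fraction_leaf', 0, 0.1),
    'max_features': hp.choice('max_features', ['sqrt', 'log2']),
}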

classifier = RandomForestClassifier

best_params = fmin(
    fn=create_objective(classifier, df),
    space=search_space,
    algo=tpe.suggest,
    max_evals=200
)

best_params

100%|██████████| 200/200 [01:33<00:00,  2.13trial/s, best loss: 0.12096103587550489]


{'criterion': 1,
 'max_depth': 174,
 'max_features': 1,
 'min_samples_leaf': 0.0001302013692076486,
 'min_samples_split': 0.018254882328141313,
 'min_weight_fraction_leaf': 1.8990128508197986e-05,
 'n_estimators': 136}
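
Note that `criterion` and `max_features` come back as `1` rather than as strings: for `hp.choice` parameters, `fmin` returns the index of the selected option. A minimal sketch of mapping these indices back to values is shown below; the helper `decode_choices` is hypothetical and written with plain Python so it runs on its own, while `hyperopt.space_eval(search_space, best_params)` should perform the same decoding directly on the search space.

```python
def decode_choices(best, choice_options):
    """Replace hp.choice indices in `best` with the options they refer to."""
    decoded = dict(best)
    for name, options in choice_options.items():
        decoded[name] = options[int(decoded[name])]
    return decoded


# Option lists copied from the search space defined above.
choice_options = {
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
}

# The result reported by fmin above.
best_params = {
    'criterion': 1,
    'max_depth': 174,
    'max_features': 1,
    'min_samples_leaf': 0.0001302013692076486,
    'min_samples_split': 0.018254882328141313,
    'min_weight_fraction_leaf': 1.8990128508197986e-05,
    'n_estimators': 136,
}

decoded = decode_choices(best_params, choice_options)
print(decoded['criterion'], decoded['max_features'])  # -> entropy log2
```

The decoded dictionary can then be unpacked into the estimator, for instance `RandomForestClassifier(**decoded)`, to refit the tuned model.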