## Automated Machine Learning
We will be working with the [Heart Failure Prediction dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

### Libraries
- [HyperOpt](https://hyperopt.github.io/hyperopt/) ([Tutorial](https://towardsdatascience.com/optimise-your-hyperparameter-tuning-with-hyperopt-861573239eb5))


### Instructions
1. Choose a dataset. Build and train a baseline for comparison: try a set of candidate machine learning algorithms (13 algorithms) with their default hyperparameters, and keep the one with the highest performance as the baseline.

2. Based on the problem at hand, study the potential pipeline structure, the
algorithms or feature transformers at each step, and the hyperparameter ranges. Use
HyperOpt with the resulting search space to beat the baseline.

3. Monitor the performance of the pipeline constructed in the previous step across different time budgets (number of iterations) and report the smallest time budget at which it outperforms the baseline.

4. Determine whether the difference in performance between the constructed pipeline and the baseline is statistically significant.
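
One possible way to approach step 4 is an exact McNemar test on the paired test-set predictions of the two models. The sketch below is only an illustration under stated assumptions: the function name `mcnemar_exact` and the toy label lists are hypothetical, and a paired test over cross-validation folds would be an equally reasonable choice.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions (illustrative sketch)."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The models never disagree, so there is no evidence of a difference
        return b, c, 1.0
    # Under H0 each discordant pair favours either model with probability 0.5
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy labels and predictions (hypothetical, not taken from the heart failure data)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
baseline_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
tuned_pred = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]

b, c, p_value = mcnemar_exact(y_true, baseline_pred, tuned_pred)
print(f'b={b}, c={c}, p-value={p_value:.3f}')
```

A p-value below the chosen significance level (commonly 0.05) would suggest that the two models differ in accuracy on the test set; with few discordant pairs the test has little power.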

In [1]:
import numpy as np
import scipy as scp
import pandas as pd
import seaborn as sn

In [2]:
df = pd.read_csv('./heart_failure.csv')
df.sample(3)

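|     | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10  | 37 | F | NAP | 130 | 211 | 0 | Normal | 142 | N | 0.0 | Up | 0 |
| 880 | 52 | M | NAP | 172 | 199 | 1 | Normal | 162 | N | 0.5 | Up | 0 |
| 135 | 49 | M | NAP | 115 | 265 | 0 | Normal | 175 | N | 0.0 | Flat | 1 |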


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [None]:
from sklearn.model_selection import train_test_split
from util import find_baseline, preprocess

df = preprocess(df)
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42)
scores = find_baseline(train_df, test_df)

In [5]:
for name, score in sorted(scores.items(), key=lambda t: t[1], reverse=True):
    print(f'{name:20}: {score:.3f}')

Random Forest       : 0.888
AdaBoost            : 0.871
Naive Bayes         : 0.871
Linear SVM          : 0.855
QDA                 : 0.855
Logistic Regression : 0.855
Gaussian Process    : 0.851
Neural Network      : 0.845
Decision Tree       : 0.799
KNN                 : 0.703
RBF SVM             : 0.594
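
The helper `find_baseline` lives in `util` and is not shown here. As a rough idea of what such a baseline search could look like, the sketch below fits a handful of scikit-learn classifiers with default hyperparameters and scores them on a held-out split. The function name `baseline_scores`, the selection of five models, and the synthetic data are assumptions made for illustration, not the notebook's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def baseline_scores(X_train, y_train, X_test, y_test):
    """Fit each candidate with default hyperparameters; return test accuracies."""
    # Hypothetical candidate list; random_state only fixes the seed
    candidates = {
        'Random Forest': RandomForestClassifier(random_state=42),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Logistic Regression': LogisticRegression(),
        'Naive Bayes': GaussianNB(),
        'KNN': KNeighborsClassifier(),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the preprocessed data (assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

demo_scores = baseline_scores(X_train, y_train, X_test, y_test)
best_name = max(demo_scores, key=demo_scores.get)
print(f'Best baseline: {best_name} ({demo_scores[best_name]:.3f})')
```

Selecting the baseline on the same test split that is later used for the final comparison tends to give an optimistic estimate; a separate validation split or cross-validation on the training data would avoid that.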


In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hyperopt import fmin, hp, tpe, STATUS_OK, space_eval
from util import create_objective, preprocess

df = pd.read_csv('./heart_failure.csv')
df = preprocess(df)

# Search space for RandomForestClassifier hyperparameters
search_space = {
    'n_estimators': hp.randint('n_estimators', 50, 150),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'max_depth': hp.randint('max_depth', 10, 200),
    # Floats are interpreted by scikit-learn as fractions of the training samples
    'min_samples_split': hp.uniform('min_samples_split', 0, 0.1),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.1),
    'min_weight_fraction_leaf': hp.uniform('min_weight_fraction_leaf', 0, 0.1),
    'max_features': hp.choice('max_features', ['sqrt', 'log2']),
}

classifier = RandomForestClassifier

optimized_params = fmin(
    fn=create_objective(classifier, train_df),
    space=search_space,
    algo=tpe.suggest,
    max_evals=250
)

space_eval(search_space, optimized_params)

100%|██████████| 250/250 [02:27<00:00,  1.69trial/s, best loss: 0.1495934959349594] 


{'criterion': 'gini',
 'max_depth': 99,
 'max_features': 'sqrt',
 'min_samples_leaf': 0.061884564599754485,
 'min_samples_split': 0.009717845648543742,
 'min_weight_fraction_leaf': 0.02165754820804295,
 'n_estimators': 144}
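
To find the smallest time budget that beats the baseline (step 3), one option is to track the running best loss over the trials. With HyperOpt, the per-trial losses could be collected by passing a `Trials()` object to `fmin` through its `trials` argument and reading `trials.losses()` afterwards. The sketch below works on a plain list of losses; the function name `smallest_winning_budget`, the toy losses, and the assumption that the loss equals `1 - accuracy` are illustrative, since `create_objective` is not shown here.

```python
def smallest_winning_budget(losses, baseline_score):
    """Return the first number of evaluations whose best score beats the baseline.

    Assumes each loss is 1 - accuracy; returns None if the baseline is never beaten.
    """
    best_loss = float('inf')
    for n_evals, loss in enumerate(losses, start=1):
        # Running best loss after n_evals evaluations
        best_loss = min(best_loss, loss)
        if 1.0 - best_loss > baseline_score:
            return n_evals
    return None


# Toy loss history (hypothetical values, not from the run above)
toy_losses = [0.20, 0.18, 0.19, 0.15, 0.16, 0.12]

print(smallest_winning_budget(toy_losses, baseline_score=0.83))
print(smallest_winning_budget(toy_losses, baseline_score=0.90))
```

For the comparison to be meaningful, the baseline score should be measured the same way as the loss (for example, both by cross-validation on the training set).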

In [8]:
from sklearn.metrics import accuracy_score

params = space_eval(search_space, optimized_params)
model = classifier(**params, random_state=42)

X_train = train_df.drop('HeartDisease', axis=1).values
X_test = test_df.drop('HeartDisease', axis=1).values

y_train = train_df['HeartDisease'].values
y_test = test_df['HeartDisease'].values

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

score = accuracy_score(y_test, y_pred)

print(f'Accuracy score: {score:.3f}')

Accuracy score: 0.851
