## Automated Machine Learning
We will work with the [Heart Failure Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

### Libraries
- [HyperOpt](https://hyperopt.github.io/hyperopt/)

### Instructions
1. Choose a dataset. Build and train a baseline for comparison. Try a set of candidate machine learning algorithms (13 algorithms) with their default hyperparameters and keep the one with the highest performance as the baseline.

2. Based on the problem at hand, study the potential pipeline structure: the algorithms or feature transformers at each step and the hyperparameter ranges. Use HyperOpt with this search space to beat the baseline.

3. Monitor the performance of the pipeline constructed in the previous step across different time budgets (number of iterations) and report the smallest budget at which it outperforms the baseline.

4. Determine whether the difference in performance between the constructed pipeline and the baseline is statistically significant.

In [1]:
import numpy as np
import scipy as scp
import pandas as pd
import seaborn as sn

In [2]:
df = pd.read_csv('./heart_failure.csv')
df.sample(3)

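| Unnamed: 0 | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 530 | 50 | M | ASY | 133 | 218 | 0 | Normal | 128 | Y | 1.1 | Flat | 1 |
| 350 | 53 | M | ASY | 120 | 0 | 1 | Normal | 120 | N | 0.0 | Flat | 1 |
| 846 | 39 | M | ASY | 118 | 219 | 0 | Normal | 140 | N | 1.2 | Flat | 1 |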


In [None]:
from util import preprocess, find_baseline

df = preprocess(df)
scores = find_baseline(df)
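
The `util` module is not shown here, so the internals of `find_baseline` are unknown. As a rough idea of what such a helper might do, the sketch below scores a few scikit-learn classifiers by cross-validated accuracy. Everything in it is an assumption: the function name `baseline_scores_sketch`, the choice of five classifiers, the accuracy metric, the 5-fold split, and the synthetic data. The classifiers use default settings apart from `random_state` and `max_iter`, which are set only for reproducibility and convergence.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def baseline_scores_sketch(X, y, cv=5):
    """Hypothetical stand-in: mean CV accuracy for a few default models."""
    candidates = {
        "Random Forest": RandomForestClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "KNN": KNeighborsClassifier(),
    }
    return {
        name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
        for name, clf in candidates.items()
    }

# Assumption: synthetic data in place of the preprocessed DataFrame.
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     random_state=0)
demo_scores = baseline_scores_sketch(X_demo, y_demo)

# The baseline is the best-scoring candidate.
best_name = max(demo_scores, key=demo_scores.get)
```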

In [4]:
for name, score in sorted(scores.items(), key=lambda t: t[1], reverse=True):
    print(f'{name:20}: {score:.3f}')

Random Forest       : 0.869
AdaBoost            : 0.863
Naive Bayes         : 0.855
Linear SVM          : 0.850
Logistic Regression : 0.849
QDA                 : 0.845
Decision Tree       : 0.837
Neural Network      : 0.798
Gaussian Process    : 0.753
KNN                 : 0.708
RBF SVM             : 0.553
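
For step 4, one possible approach is a paired t-test on per-fold scores, with both models evaluated on identical cross-validation splits. The sketch below is only an illustration: it uses synthetic data, assumes accuracy as the metric and a 0.05 significance level, and uses a logistic regression with an arbitrary `C` as a placeholder for the tuned pipeline. Fold scores from cross-validation are not fully independent, so the resulting p-value should be read as approximate.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic stand-in for the preprocessed heart-failure data.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# A fixed random_state gives both models exactly the same folds,
# which is what makes the comparison paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

baseline = RandomForestClassifier(random_state=0)
# Placeholder for the pipeline found by the hyperparameter search.
candidate = LogisticRegression(C=0.5, max_iter=1000)

baseline_folds = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
candidate_folds = cross_val_score(candidate, X, y, cv=cv, scoring="accuracy")

# Paired t-test on the per-fold differences.
t_stat, p_value = ttest_rel(candidate_folds, baseline_folds)

alpha = 0.05  # assumed significance level
significant = p_value < alpha
```

If the normality assumption behind the t-test seems doubtful for so few folds, `scipy.stats.wilcoxon` is a non-parametric alternative that can be applied to the same paired scores.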
