# Heart Disease Classification Hyperparameter Study

## Hyperparameters to play with

1. **n_estimators** 

Sets the number of decision trees in the forest.

**`[100, 120, 300, 500, 800, 1200]`**

2. **max_depth**  

Sets the maximum depth of each tree.

If not set (`None`), there is no cap: each tree keeps expanding until all leaves are pure or contain fewer than `min_samples_split` samples.

Limiting the depth acts as a form of pruning and helps prevent over-fitting on noisy data.

**`[5, 8, 15, 25, 30, None]`**

3. **max_features**

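Sets the number of features to consider when looking for the best split at a node.

In older scikit-learn releases the default is “auto”, which for a classifier means that the square root of the number of features is considered at every split (the same as “sqrt”). “auto” was deprecated in scikit-learn 1.1 and removed in 1.3, where “sqrt” is the default.

“None” means that all features are considered for each split.

With any setting other than “None”, each split in each decision tree considers only a random subset of the features.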

**`[log2, sqrt, auto, None]`**

4. **min_samples_split**

The minimum number of samples an internal node must contain before it can be split. Integer values must be at least 2.

**`[2,5,10,15,100]`**

5. **min_samples_leaf**

The minimum number of samples required at a leaf (decision) node.

Default is 1. This means a split at any depth is only allowed if it leaves at least 1 training sample in each of the left and right branches.

**`[1,2,5,10]`**
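
To make the `max_features` options above concrete, the sketch below computes how many features each setting would consider per split. It assumes 13 input features (the feature count of the Kaggle heart disease dataset, which is not stated above) and mirrors the truncate-to-integer rule scikit-learn applies for "sqrt" and "log2". `features_per_split` is a hypothetical helper written for this illustration, not a library function.

```python
import math


def features_per_split(max_features, n_features):
    """Hypothetical helper: number of features examined at each split."""
    if max_features is None:
        return n_features  # all features are candidates
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported max_features: {max_features!r}")


n_features = 13  # assumption: feature count of the heart disease dataset
for option in ("sqrt", "log2", None):
    print(option, features_per_split(option, n_features))
```

With 13 features, "sqrt" and "log2" happen to give the same count, so under this assumption the two settings would behave alike on this dataset.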

## Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
# set up seed
np.random.seed(42)

# load the dataset
heartdisease_df = pd.read_csv("heart_disease_dataset.csv")

# separate X and y values
X = heartdisease_df.drop(columns="target")
y = heartdisease_df["target"]

# import sklearn's train test split 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

clf = RandomForestClassifier()

# fit the model on the training set and score it on the test set
clf.fit(X_train,y_train)
clf.score(X_test, y_test)

0.8524590163934426
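
With `test_size=0.2`, scikit-learn rounds the test-set size up and gives the remaining rows to the training set. A rough sketch of that arithmetic follows, assuming the dataset has 303 rows (the size of the Kaggle heart disease file; the row count is not shown above). `split_sizes` is a hypothetical helper name.

```python
import math


def split_sizes(n_samples, test_size):
    """Hypothetical helper: turn a fractional test_size into row counts."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # the rest is used for training
    return n_train, n_test


n_train, n_test = split_sizes(303, 0.2)  # 303 rows is an assumption
print(n_train, n_test)
```

A test set this small means each individual prediction moves the accuracy by more than one percentage point.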

## Define Evaluation Metrics

In [2]:
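def evaluate_preds(y_true, y_preds):
    """
    Compare true labels with predicted labels for a binary classifier,
    print the metrics and return them (rounded to 2 decimals) as a dict.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    # note: ":2f" sets a minimum width of 2 and keeps the default six decimals;
    # ":.2f" would print two decimals like the other metrics
    print(f"Accuracy: {accuracy * 100:2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")

    return metric_dict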
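
For reference, the four metrics returned by `evaluate_preds` can be reproduced by hand from the confusion counts. The sketch below uses a made-up set of ten labels (not taken from the heart disease data) and plain Python, so the numbers are purely illustrative.

```python
# Toy labels invented for illustration only (1 = disease, 0 = no disease)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_preds = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_preds))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # share of predicted positives that are right
recall = tp / (tp + fn)              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```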

## Using RandomizedSearchCV to find the best Hyperparameters

In [3]:
from sklearn.model_selection import RandomizedSearchCV

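# note: "auto" is only accepted by scikit-learn versions before 1.3;
# for a classifier it behaves the same as "sqrt"
param_grid = {"n_estimators":[100,120,300,500,800,1000],
              "max_depth":[5,8,15,25,30,None],
              "max_features":["auto","sqrt"],
              "min_samples_split":[2,5,10,15],
              "min_samples_leaf":[1,2,5,10]}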

# n_jobs=1: number of CPU cores used to build the trees in parallel (-1 uses all cores)
clf = RandomForestClassifier(n_jobs=1)

# setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=param_grid,
                            n_iter=10,
                            cv=5,
                            verbose=2)

# fit randomized search to the training data
rs_clf.fit(X_train, y_train);

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15, total=   1.0s
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s


[CV]  n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15, total=   0.9s
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15 
[CV]  n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15, total=   1.0s
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15 
[CV]  n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15, total=   0.9s
[CV] n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15 
[CV]  n_estimators=500, min_samples_split=10, min_samples_leaf=1, max_features=auto, max_depth=15, total=   0.9s
[CV] n_estimators=1000, min_samples_split=2, min_samples_leaf=10, max_features=auto, max_depth=25 
[CV]  n_estimators=1000, min_samples_split=2, min_samples_leaf=10, max_features=auto, max_depth=25, total=   1.7s
[CV] n_estimators=1000, min_samples_split

[CV]  n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5, total=   0.5s
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5 
[CV]  n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5, total=   0.5s
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5 
[CV]  n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5, total=   0.5s
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5 
[CV]  n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5, total=   0.5s
[CV] n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5 
[CV]  n_estimators=300, min_samples_split=2, min_samples_leaf=5, max_features=auto, max_depth=5, total=   0.5s
[CV] n_estimators=800, min_samples_split=15, min_samples_leaf

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   50.1s finished
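
To put `n_iter=10` in perspective, this sketch counts how many combinations the grid above contains and how many model fits the randomized search performs. It only uses the standard library, and the variable names are illustrative.

```python
import math

param_grid = {"n_estimators": [100, 120, 300, 500, 800, 1000],
              "max_depth": [5, 8, 15, 25, 30, None],
              "max_features": ["auto", "sqrt"],
              "min_samples_split": [2, 5, 10, 15],
              "min_samples_leaf": [1, 2, 5, 10]}

# Size of the full grid: product of the number of values per hyperparameter
n_combinations = math.prod(len(values) for values in param_grid.values())
n_iter, cv = 10, 5

random_search_fits = n_iter * cv        # candidates sampled x folds
grid_search_fits = n_combinations * cv  # what an exhaustive search would need
sampled_fraction = n_iter / n_combinations

print(n_combinations, random_search_fits, grid_search_fits)
print(f"{sampled_fraction:.2%} of the grid is sampled")
```

The 50 fits match the "totalling 50 fits" line in the log, and they cover under 1% of the grid, which is why a different seed can return quite different best parameters.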


In [4]:
rs_clf.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 10,
 'max_features': 'auto',
 'max_depth': 25}

In [5]:
rs_y_preds = rs_clf.predict(X_test)
rs_metrics = evaluate_preds(y_test, rs_y_preds)

Accuracy: 86.885246%
Precision: 0.85
Recall: 0.91
F1 Score: 0.88
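
The tuned accuracy is easier to interpret as a count of correct predictions. Assuming the test set holds 61 rows (20% of a 303-row dataset, rounded up; neither number is shown above), both reported scores correspond to whole numbers of correct predictions, as this quick check suggests:

```python
n_test = 61  # assumption: size of the held-out test set

baseline_score = 0.8524590163934426  # clf.score(X_test, y_test) before tuning
tuned_accuracy = 0.86885246          # accuracy printed after RandomizedSearchCV

# Convert each accuracy back into a number of correctly classified rows
baseline_correct = round(baseline_score * n_test)
tuned_correct = round(tuned_accuracy * n_test)

print(baseline_correct, tuned_correct, tuned_correct - baseline_correct)
```

Under that assumption the tuned model gets one more test sample right than the baseline, so the improvement is best read with caution on a test set this small.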


## Putting it all together with Pipeline

In [6]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# generate seed
np.random.seed(42)

# import dataset
heartdisease_df = pd.read_csv("heart_disease_dataset.csv")

# create a modeling pipeline
model = Pipeline(steps=[("model", RandomForestClassifier())])

# separate the features (X) from the target (y)
X = heartdisease_df.drop(columns="target")
y = heartdisease_df["target"]

# generate train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

# fit on the training set and score on the test set
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8524590163934426

In [7]:
# pipeline parameters are addressed as "<step name>__<parameter>"
pipe_grid = {"model__n_estimators":[100,120,300,500,800,1000],
             "model__max_depth":[5,8,15,25,30,None],
             "model__max_features":["auto","sqrt"],
             "model__min_samples_split":[2,5,10,15],
             "model__min_samples_leaf":[1,2,5,10],
            }
# randomized search over the pipeline; n_iter is left at its default of 10
gs_model = RandomizedSearchCV(model, pipe_grid, cv=5, verbose=2)
gs_model.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15, total=   0.8s
[CV] model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s


[CV]  model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15, total=   0.8s
[CV] model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15 
[CV]  model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15, total=   0.8s
[CV] model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15 
[CV]  model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15, total=   0.8s
[CV] model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15 
[CV]  model__n_estimators=500, model__min_samples_split=10, model__min_samples_leaf=1, model__max_features=auto, model__max_depth=15, total=   0.8s
[CV] mode

[CV]  model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25, total=   0.2s
[CV] model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25 
[CV]  model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25, total=   0.2s
[CV] model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25 
[CV]  model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25, total=   0.2s
[CV] model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25 
[CV]  model__n_estimators=100, model__min_samples_split=5, model__min_samples_leaf=5, model__max_features=sqrt, model__max_depth=25, total=   0.2s
[CV] model__n_es

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   51.9s finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('model',
                                              RandomForestClassifier(bootstrap=True,
                                                                     ccp_alpha=0.0,
                                                                     class_weight=None,
                                                                     criterion='gini',
                                                                     max_depth=None,
                                                                     max_features='auto',
                                                                     max_leaf_nodes=None,
                                                                     max_samples=None,
                                                                     min_impurity_decrease=0.0,
                                                            

In [8]:
gs_model.score(X_test, y_test)

0.8688524590163934
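
The `model__` prefix in `pipe_grid` follows scikit-learn's `<step name>__<parameter>` convention for addressing the parameters of a pipeline step. A small sketch of deriving such a grid from a plain one is shown below; `prefix_grid` is a hypothetical helper, not part of scikit-learn, and the grid is shortened for brevity.

```python
def prefix_grid(param_grid, step_name):
    """Hypothetical helper: prefix every key with '<step_name>__'."""
    return {f"{step_name}__{name}": values
            for name, values in param_grid.items()}


# Shortened version of the grid used for the plain classifier
param_grid = {"n_estimators": [100, 120, 300, 500, 800, 1000],
              "min_samples_leaf": [1, 2, 5, 10]}

# "model" must match the step name given to Pipeline(steps=[("model", ...)])
pipe_grid = prefix_grid(param_grid, "model")
print(sorted(pipe_grid))
```

Keeping one plain grid and prefixing it this way would avoid maintaining two copies of the same values, at the cost of one extra helper.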

## Saving and Loading the Model

In [9]:
from joblib import dump, load

# save model to file
dump(gs_model, filename="heartdisease_rgs_classification.joblib")

['heartdisease_rgs_classification.joblib']

In [10]:
# load the saved model from file
loaded_job_model = load(filename="heartdisease_rgs_classification.joblib")

In [11]:
# make predictions with the loaded model and evaluate them
joblib_y_preds = loaded_job_model.predict(X_test)
evaluate_preds(y_test, joblib_y_preds)

Accuracy: 86.885246%
Precision: 0.85
Recall: 0.91
F1 Score: 0.88


{'accuracy': 0.87, 'precision': 0.85, 'recall': 0.91, 'f1': 0.88}
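
A quick way to confirm that a save/load round trip preserves a model is to compare predictions before and after. The sketch below does this on a small forest fitted to synthetic data (generated with `make_classification`, unrelated to the heart disease dataset) and writes to a temporary directory; the file name is made up for the example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the heart disease dataset
X, y = make_classification(n_samples=200, n_features=13, random_state=42)

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.joblib")  # hypothetical file name
    dump(clf, path)        # serialize the fitted model
    restored = load(path)  # read it back into a new object

# The restored model should predict exactly what the original predicts
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```

Models saved this way are generally meant to be loaded with the same scikit-learn version that produced them, so it may be worth recording the version next to the file.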

### Sources and References:

https://towardsdatascience.com/hyper-parameter-tuning-and-model-selection-like-a-movie-star-a884b8ee8d68

https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/

https://www.kaggle.com/ronitf/heart-disease-uci