## Heart Disease Classification with a Random Forest 

In this assignment, we return to the problem of predicting the presence of heart disease. In the first homework, we used a decision tree classifier. Here we will see whether a random forest gives better classification performance. We will use the same [heart disease](http://archive.ics.uci.edu/ml/datasets/heart+Disease) dataset.

A random forest classifier builds many decision trees, each trained on a different subset of the examples in the dataset. The subsets are sampled with replacement, so they can overlap and the same example can appear more than once within a subset. For the final prediction, each decision tree votes and the majority decision becomes the prediction.
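The sampling and voting steps described above can be sketched with the standard library alone. This is a minimal illustration, not the scikit-learn implementation: the dataset size, the number of trees, and the three votes are made-up values, and `majority_vote` is a hypothetical helper.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

n_examples = 10  # hypothetical dataset size
n_trees = 3      # hypothetical number of trees

# Each tree trains on a bootstrap sample: n_examples indices drawn with replacement,
# so an index can repeat within a sample and samples can overlap.
bootstrap_samples = [
    [random.randrange(n_examples) for _ in range(n_examples)]
    for _ in range(n_trees)
]

def majority_vote(votes):
    """Return the most common label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from the three trees for a single test example.
prediction = majority_vote([1, 0, 1])
print(bootstrap_samples[0], prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch shows the idea rather than the library's exact rule.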

The `RandomForestClassifier` has many hyperparameters that can be tuned to improve performance. A list is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn-ensemble-randomforestclassifier). We will use grid search with cross-validation to try different parameter settings. In a grid search, a separate model is fit for each configuration of hyperparameters. For example, assume model *M* has 2 tunable hyperparameters *a* and *b*. If each hyperparameter takes on 2 possible values, the grid search will evaluate 2 × 2 = 4 models.
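The two-hyperparameter example can be enumerated directly. The sketch below assumes made-up candidate values for *a* and *b* and uses `itertools.product` to list every configuration a grid search would visit.

```python
from itertools import product

# Hypothetical candidate values for the two hyperparameters a and b.
grid = {"a": [1, 2], "b": [10, 20]}

# Every configuration is one element of the Cartesian product of the value lists.
configurations = [dict(zip(grid, values)) for values in product(*grid.values())]

for config in configurations:
    print(config)
```

Adding a third value for either hyperparameter would raise the count from 4 to 6, which is why grids grow quickly as more values are added.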

In [None]:
# necessary imports
import Orange
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
import numpy as np

The `heart_disease` dataset is provided in the Orange library. We load the dataset and extract the features and labels here.

In [None]:
heart_disease = Orange.data.Table('heart_disease')
X = heart_disease.X  # get features
y = heart_disease.Y  # get labels

Some examples in this dataset have missing values for some of the features. We will replace each missing entry with the mean of that feature.

In [None]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # initialize the imputer
X = imp.fit_transform(X)  # replace missing values with the mean of each feature
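To make the effect of mean imputation concrete, here is a NumPy-only sketch of the same computation on a toy matrix. The values are made up, and the code illustrates the idea rather than reproducing how `SimpleImputer` is implemented.

```python
import numpy as np

# Toy feature matrix with two missing entries (values are made up).
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# Per-column mean computed over the observed (non-NaN) entries only.
col_means = np.nanmean(X_toy, axis=0)

# Keep observed values; fill each gap with the mean of its own column.
X_filled = np.where(np.isnan(X_toy), col_means, X_toy)
print(X_filled)
```

Each missing entry is filled using only the other values in its own column, so one feature never influences the value imputed for another.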

We will split our dataset into a 67\% training set and a 33\% test set. We use stratified sampling to preserve the class distribution in both partitions.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
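One way to check that stratification preserves the class distribution is to compare class fractions before and after the split. The sketch below uses made-up imbalanced labels (`y_demo`, `X_demo` are hypothetical names) rather than the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70 negatives and 30 positives (made-up data).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.33, random_state=42
)

# With 0/1 labels, the mean is the fraction of positive examples.
print("positive fraction overall:", y_demo.mean())
print("positive fraction in train:", y_tr.mean())
print("positive fraction in test:", y_te.mean())
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above.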

We initialize and fit our random forest classifier. `max_depth` specifies the maximum depth of each tree. `random_state` is set for reproducibility. Then, we measure the accuracy of the classifier on the test set.

In [None]:
rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

In [None]:
rf.score(X_test, y_test)

`sklearn` provides the `GridSearchCV` class, which implements grid search. We specify the hyperparameters to search over in a dictionary whose keys are the names of the hyperparameters of the estimator we are using (an instance of `RandomForestClassifier` in our case) and whose values are the candidate settings to try. The performance of each configuration is evaluated using cross-validation. Setting `refit=True` makes `GridSearchCV` retrain the best configuration found on the whole training set, so that `clf` can be used directly for prediction. Lastly, we report the test accuracy of the best model found.

Note: `clf.fit` method takes about 8-10 minutes to finish.

In [None]:
search_grid = {'n_estimators': range(10,200,10), 'max_depth': range(1, 20)}
clf = GridSearchCV(rf, search_grid, refit=True, verbose=True)

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)
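After fitting, a `GridSearchCV` object also exposes which configuration won. The sketch below shows this on a deliberately small grid and synthetic data from `make_classification`, so it runs in seconds. The data, the grid values, and `cv=3` are assumptions chosen for speed, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs quickly; not the heart disease data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # configuration with the highest mean CV accuracy
print(demo_search.best_score_)   # mean CV accuracy of that configuration

# Because refit=True is the default, the best configuration is retrained on all of X_demo.
best_model = demo_search.best_estimator_
```

The same attributes are available on `clf` once `clf.fit` has finished.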

### Exercises

1. The parameters that can be searched over are listed in the [`RandomForestClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn-ensemble-randomforestclassifier). Look up the definitions of the parameters below. Specify your own `search_grid`, perform a grid search, and report the accuracy.

Note: The set of all possible configurations is the Cartesian product of the possible values you specify in the `search_grid` dictionary. If you search over 5 hyperparameters and each takes 3 possible values, `GridSearchCV` will evaluate 3<sup>5</sup> = 243 configurations, each of which is fit once per cross-validation fold.

- n_estimators
- criterion
- max_depth
- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf
- max_features
- max_leaf_nodes
- min_impurity_decrease
- min_impurity_split (deprecated; removed in scikit-learn 1.0, where `min_impurity_decrease` should be used instead)
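Before launching a long search, it can help to count how many configurations and fits a grid implies. The sketch below uses a hypothetical grid of 5 hyperparameters with 3 values each; the candidate values are illustrative only, and the fold count assumes the default `cv` of 5 used by recent scikit-learn versions.

```python
from math import prod

# Hypothetical grid: 5 hyperparameters with 3 candidate values each.
example_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 4, 8],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

# Size of the Cartesian product: multiply the number of values per hyperparameter.
n_configurations = prod(len(values) for values in example_grid.values())

n_folds = 5  # assumption: default cv in recent scikit-learn versions
n_fits = n_configurations * n_folds + 1  # +1 for the final refit when refit=True

print(n_configurations, n_fits)
```

Because the count is a product, dropping one value from a single hyperparameter here removes 81 configurations, so trimming the grid is often the quickest way to shorten the search.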