## Random Forest Classifier

Let us repeat this exercise with Random Forests.


In this practical we will repeat the analysis from the previous practical. Instead of a DecisionTree classifier we will use a RandomForest classifier, an ensemble approach. We will use the same data file.  
The data originates from the following publication:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1292-2

```
'clinical_biomarkers.csv'
``` 

clinical_biomarkers_raw.csv: this file gives you the full list of raw data.

In [8]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display
from sklearn.model_selection import LeaveOneOut, GridSearchCV, KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# own mini-library
import session_helpers

%matplotlib inline

### Some required data pre-processing

Here, you can load a small helper file that lets you plot the learnt tree using a programme called graphviz.  
This library assumes that graphviz is installed locally (which is true for the Jupyter Lab environment on bearportal).


In [7]:
## Load and pre-process the biomarker data

df = pd.read_csv("clinical_biomarkers.csv")
df = df.set_index(['Sample'])

In [10]:
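df_ex = df.copy()
# Keep only the response labels listed below; any other value maps to NaN
# and is removed by the notna() filter further down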
df_ex['Response'] = df_ex['Response'].map(
    {
        'C.':'C.',
        'C. R.':'C. R.',
        'Low':'Low',
        'Int. I.':'Int. I.',
        'Int. II.':'Int. II.',
        'Int. II. R.':'Int. II. R.',
        'High':'High',
        'High R.':'High R.',
    })

df_ex = df_ex[df_ex['Response'].notna()]

# For consistency
# target column
y = df_ex['Response']
# drop the 'Response' column from the dataframe and store the remaining feature columns in X
X = df_ex.drop(['Response'],axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=3)

print(X_train)

        Hb (g/dL)  RBC (mil/cmm)   PCV (%)  RET ABS(mil/cmm)   MCV(fL)  \
Sample                                                                   
43       0.418995       0.172987  0.571843          1.335644  0.477972   
39      -1.215545      -1.542079 -1.763370         -0.079740 -0.148539   
27       0.418995      -0.490910 -0.171179         -0.315637  0.425763   
11      -0.071367      -0.241948 -0.861128         -1.023329 -0.775051   
44      -0.561729       0.034675 -0.277325         -0.787432 -0.409586   
73       0.418995       0.919871  0.784135          0.156157 -0.252958   
69      -1.378999      -1.016494 -0.754982         -0.315637  0.425763   
32      -1.215545      -1.791041 -0.648836          0.392055  1.678786   
45       2.053534       1.307144  1.898669         -0.551535  0.530182   
57      -2.196269      -2.316625 -2.081808          2.986925  0.582391   
65      -1.378999       0.117662 -0.914201         -0.079740 -1.297144   
55      -0.398275      -0.739871 -0.11

### Load the RandomForest Classifier

In [12]:
from sklearn.ensemble import RandomForestClassifier
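
To make the "ensemble" idea concrete, the following minimal sketch fits a small forest and looks inside it. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the biomarker file), and the names `X_demo`, `y_demo` and `forest` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the clinical biomarker data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# A forest of 5 shallow trees; random_state makes the run repeatable
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted member of the ensemble is an ordinary decision tree
print(len(forest.estimators_))
print(type(forest.estimators_[0]).__name__)

# The forest's class probabilities are the average over its trees
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```

Each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split, which is why the individual trees differ from one another.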


### Grid search

In [None]:
parameters = {
    'n_estimators': [2,3,5], 
    'max_depth':[1,2,3,4],
    'min_samples_leaf':[2,5,10]
}

random_f_model = RandomForestClassifier() 
rf_grid_search = GridSearchCV(random_f_model, parameters, cv=5,scoring='balanced_accuracy') # balanced accuracy == mean of the per-class recall
grid_search = rf_grid_search.fit(X_train, y_train)
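
Once a grid search has been fitted, it can help to look at how every parameter combination scored, not only the winner. The sketch below shows one way to do this on synthetic stand-in data (an assumption); `demo_search` and `demo_parameters` are hypothetical names that mirror, but are not, the objects in the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the clinical biomarker data)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=3, random_state=0
)

# A reduced grid: 3 x 3 = 9 parameter combinations
demo_parameters = {"n_estimators": [2, 3, 5], "max_depth": [1, 2, 3]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_parameters,
    cv=5,
    scoring="balanced_accuracy",
)
demo_search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated score
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))

# One row per parameter combination, sorted from best to worst
results = pd.DataFrame(demo_search.cv_results_)
cols = ["param_n_estimators", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

With forests this small the scores can vary noticeably between runs, so fixing `random_state` on the classifier, as done here, makes the comparison repeatable.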



### Best model

In [None]:
best_random_f_model = rf_grid_search.best_estimator_ # best model according to grid search 

best_random_f_model.get_params()

### Evaluate the best model on the test data



In [None]:
import matplotlib.pyplot as plt # plotting and visualisation
import seaborn as sns # nicer (easier) visualisation
%matplotlib inline
from sklearn.metrics import RocCurveDisplay,accuracy_score,roc_curve

## Evaluate on test data
y_test_predicted = best_random_f_model.predict(X_test)


print('Confusion Matrix of best model on test')
print(confusion_matrix(y_test,y_test_predicted))
print("Random Forest accuracy on test data: ", accuracy_score(y_test, y_test_predicted))
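
A raw confusion matrix is hard to read when there are several classes, because the rows and columns carry no labels. The sketch below, on synthetic three-class data (an assumption), wraps the matrix in a labelled DataFrame and also reports balanced accuracy, the metric used in the grid search. All variable names here are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data (assumption: not the biomarker data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=3
)

model = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are the true classes, columns the predicted classes
labels = model.classes_
cm = pd.DataFrame(
    confusion_matrix(y_te, y_pred, labels=labels),
    index=[f"true {c}" for c in labels],
    columns=[f"pred {c}" for c in labels],
)
print(cm)
print("Balanced accuracy:", balanced_accuracy_score(y_te, y_pred))
```

Passing `labels=` explicitly fixes the row and column order, so the DataFrame labels are guaranteed to line up with the matrix entries.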



#### Your Task : Train a Decision Tree Classifier and apply it on the test data set. 

- Task 1 : Train a decision tree classifier on the OpenML Lipid data set.  
  Data Link : https://openml.org/search?type=data&status=active&id=1480

- Task 2 : Split the data into train and test sets [30% for the test set]

- Task 3 : Use five-fold cross-validation and, within the cross-validation loop:
    - Task 3.1 : Train a Decision Tree Classifier
    - Task 3.2 : Train a Random Forest Classifier

- Task 4 : Train a Decision Tree and a Random Forest using all the training data. Apply both trained classifiers to the test data.

#### Question  1: Which classifier performs best based on F1 score?
#### Question  2: Which classifier performs best based on accuracy?
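
Accuracy and F1 can rank classifiers differently, particularly when the classes are imbalanced. As a rough illustration on synthetic data (an assumption, not the task data set), the sketch below scores both classifier types with the two metrics inside one stratified five-fold cross-validation. The hyperparameters and the choice of macro-averaged F1 are illustrative, not prescribed by the task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic stand-in data: roughly 70% / 30% class split
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=0
    ),
}

# Mean accuracy and macro-averaged F1 over the five folds, per model
summary = {}
for name, model in models.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1_macro"]
    )
    summary[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1_macro": scores["test_f1_macro"].mean(),
    }
    print(name, summary[name])
```

Reusing the same `cv` object for both models means they are compared on identical folds.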

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import LeaveOneOut, GridSearchCV, KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier



## Task 1 : Load the open ML lipid data set. 
# lipid_data = ...

# ... complete rest of the section ... ##

In [None]:
from sklearn.metrics import accuracy_score,f1_score

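# Task 2: hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

## dt_model = ...
## rf_model = ...

# Task 3: cross-validate on the training data only; the fold variables get
# their own names so the held-out test set is not overwritten
kf = StratifiedKFold(n_splits=5, random_state=15, shuffle=True)
for count_k, (train_index, val_index) in enumerate(kf.split(X_train, y_train)):
    X_fold_train = X_train.iloc[train_index]
    X_fold_val   = X_train.iloc[val_index]
    y_fold_train = y_train.iloc[train_index]
    y_fold_val   = y_train.iloc[val_index]
    ## .... Complete this section ... ##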