## Decision Trees and Random Forests

### Data Loading

In [None]:
!git clone https://github.com/icomse/8th_workshop_MachineLearning.git
import os
os.chdir('8th_workshop_MachineLearning/data')

In [None]:
# Now we use the pandas library to create a DataFrame.
import numpy as np
import pandas as pd

Read in the prepared data

In [None]:
df_featurized=pd.read_csv('../data/featurized_mixture.csv')

In [None]:
df_featurized.head()

In [None]:
df_featurized.describe()

Shuffle the rows, as there is some correlation between neighboring entries in the input file.

In [None]:
df_featurized=df_featurized.sample(frac=1.0)

### Splitting

In [None]:
# Split into train and test sets according to the 'Status' column of the original dataset
drop_cols = ['EXP. Data', 'HBD', 'HBD_smiles', 'Status', 'HBA', 'HBA_smiles']
train_mask = df_featurized['Status'] == 'Training'
test_mask = df_featurized['Status'] == 'Test'
y_train = df_featurized.loc[train_mask, 'EXP. Data']
x_train = df_featurized[train_mask].drop(columns=drop_cols)
y_test = df_featurized.loc[test_mask, 'EXP. Data']
x_test = df_featurized[test_mask].drop(columns=drop_cols)
print('train size: ', x_train.shape[0])
print('test size: ', x_test.shape[0])

### Decision Trees

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error


In [None]:
dt_model = tree.DecisionTreeRegressor(max_depth=3)
dt_model.fit(x_train,y_train)

In [None]:
y_pred_train = dt_model.predict(x_train)
y_pred_test = dt_model.predict(x_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Train MSE=", mean_squared_error(y_train, y_pred_train))
print("Test MSE=", mean_squared_error(y_test, y_pred_test))

In [None]:
print(y_pred_test)

Let's look at the tree!

In [None]:
plt.figure(figsize=(8, 6), dpi=300)
tree.plot_tree(dt_model,feature_names=x_train.columns)
plt.show()

**Hacking**: Look up some of the decision tree options, and see if you can do better! Remember, "better" means a lower TEST MSE, not a lower training MSE.
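
One way to approach this exercise is to sweep a single option and watch both errors. The sketch below does this for `max_depth` on a small synthetic dataset (an assumption for illustration, not the workshop data; `X_toy`, `y_toy`, and `depth_results` are hypothetical names), so it runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, not the workshop dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

depth_results = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    depth_results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training error keeps falling as the tree gets deeper, while the test error typically stops improving at some depth; that gap is the thing to watch when tuning.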

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_model = RandomForestRegressor(max_depth=5,
                                 n_estimators=200,
                                 min_samples_split=3,
                                 min_samples_leaf=1)
rf_model.fit(x_train,y_train)

In [None]:
y_pred_train = rf_model.predict(x_train)
y_pred_test = rf_model.predict(x_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Train MSE=", mean_squared_error(y_train, y_pred_train))
print("Test MSE=", mean_squared_error(y_test, y_pred_test))
print("Train MAE=", mean_absolute_error(y_train, y_pred_train))
print("Test MAE=", mean_absolute_error(y_test, y_pred_test))

Are we going to try to visualize all 200 of these trees? No!

**Hacking**: Look up some of the additional random forest options, and see if you can do better!
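
As a starting point, here is a minimal sketch comparing two `max_features` settings using the out-of-bag score (`oob_score=True`), which scikit-learn computes from the samples each tree did not see during bootstrapping. The data are synthetic and the names `X_toy`, `y_toy`, and `oob_scores` are hypothetical, so the numbers say nothing about the workshop dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical, not the workshop dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 5))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=300)

oob_scores = {}
for max_features in [1.0, "sqrt"]:
    forest = RandomForestRegressor(n_estimators=200,
                                   max_features=max_features,
                                   oob_score=True,   # score each sample with trees that never saw it
                                   random_state=0)
    forest.fit(X_toy, y_toy)
    oob_scores[max_features] = forest.oob_score_     # R^2 for regressors
    print(f"max_features={max_features}: OOB R^2 = {forest.oob_score_:.3f}")
```

Because the out-of-bag estimate does not touch the test set, it can be used to compare settings while keeping the test set for the final check.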

### Hyperparameter search

Let's try to automate the process above. `scikit-learn` has code to do this! It's called `GridSearchCV`, and it performs cross-validation on every combination in a grid of parameter values.

In [None]:
import sklearn

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hyperparameter tuning and cross-validation
# The grid below has 2*3*3*3*3 = 162 combinations; with cv=5 that is 810 model fits
param_grid = {
    'n_estimators': [50,100],  # Number of trees
    'max_depth': [5, 10, 20],      # Maximum depth of each tree
    'max_features': [1.0, 'sqrt', 'log2'],    # Number of features to consider at each split
    'min_samples_split': [2,3,4],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1,2,3]     # Minimum number of samples required at each leaf node
}
rf_model = RandomForestRegressor()  # model creation
grid_search = GridSearchCV(rf_model,
                           param_grid=param_grid,
                           cv=5,
                           verbose=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True) # will go through all possible combinations in the param grid
grid_search.fit(x_train,y_train) # fitting to train set
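
Once a search like the one above has been fitted, its `cv_results_` attribute holds the cross-validated score of every combination, which shows how close the runners-up were to the winner. The sketch below illustrates this on a deliberately tiny grid and synthetic data (both are assumptions for illustration; `small_grid`, `search`, and `cv_table` are hypothetical names).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the workshop dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 4))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=200)

# A small grid so the example runs quickly: 2 * 2 = 4 combinations
small_grid = {"max_depth": [3, 10], "min_samples_leaf": [1, 3]}
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      param_grid=small_grid,
                      cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

# One row per parameter combination, best (rank 1) first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(cv_table.to_string(index=False))
```

With `scoring='neg_mean_squared_error'` the scores are negative MSE values, so values closer to zero are better. If the top few rows differ by less than their `std_test_score`, the choice between them is largely down to chance.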

In [None]:
print("Best hyperparameters: ", grid_search.best_params_)

The result can vary from run to run, because both the shuffling and the forest construction are random! Last time I ran this I got:
```
Best hyperparameters:  {'max_depth': 10, 'max_features': 1.0, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 100}
```

In [None]:
# Now, let's see the performance on our held-out test set
rf_model = RandomForestRegressor(**grid_search.best_params_).fit(x_train, y_train)
y_pred_train = rf_model.predict(x_train)
y_pred_test = rf_model.predict(x_test)
# Metrics take (y_true, y_pred); the order matters for r2_score
print('training score: ', round(r2_score(y_train, y_pred_train), 3))
print('test score: ', round(r2_score(y_test, y_pred_test), 3))

# parity plot
plt.scatter(y_test, y_pred_test)
plt.plot(y_test, y_test)
plt.text(65, 36, s='r2 score: {}'.format(round(r2_score(y_test, y_pred_test), 3)))
plt.text(65, 34, s='MAE: {}'.format(round(mean_absolute_error(y_test, y_pred_test), 3)))
plt.text(65, 32, s='MSE: {}'.format(round(mean_squared_error(y_test, y_pred_test), 3)))
plt.xlabel('Actual Value')
_ = plt.ylabel('Predicted Value')

In [None]:
# Analyzing the most important features
df_ft_imp_rf = pd.DataFrame({'feature': x_train.columns,'importance': rf_model.feature_importances_}).sort_values('importance',ascending=True)
df_ft_imp_rf.tail(10).plot.barh('feature','importance')
plt.show()
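
The `feature_importances_` attribute used above is impurity-based and computed from the training data. As a cross-check, scikit-learn also offers `permutation_importance`, which measures how much a score drops when one feature is shuffled and can be evaluated on held-out data. The sketch below uses synthetic data in which only the first feature carries signal (an assumption for illustration; `X_toy`, `y_toy`, `forest`, and `perm` are hypothetical names).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only feature 0 carries signal (hypothetical example)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(400, 4))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the drop in R^2
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: impurity={forest.feature_importances_[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

When the two rankings agree, as they should here, that adds some confidence in the bar chart; when they disagree, the features involved deserve a closer look.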

### Some parting words
Some things to keep in mind about random forest methods:
*   It's a fairly robust model: averaging many trees reduces variance, so adding more trees usually doesn't lead to overfitting
*   As it's a tree-based model, scaling of the data is not required
*   It's not good at extrapolation; see here: https://www.kaggle.com/code/carlmcbrideellis/extrapolation-do-not-stray-out-of-the-forest
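
The extrapolation point in the list above can be seen in a few lines. The sketch below fits a forest to a noisy straight line on the range 0 to 10 and then asks for predictions beyond that range; the data and the names `x_toy`, `y_toy`, `forest`, and `pred` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Noisy linear trend y = 3x, observed only for x between 0 and 10 (synthetic data)
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=200)
y_toy = 3.0 * x_toy + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_toy.reshape(-1, 1), y_toy)

# Query one point inside the training range and three at or beyond its upper edge
x_new = np.array([5.0, 10.0, 15.0, 20.0]).reshape(-1, 1)
pred = forest.predict(x_new)
for x, p in zip(x_new.ravel(), pred):
    print(f"x={x:5.1f}  predicted={p:6.2f}  true trend={3.0 * x:6.2f}")
```

Each tree predicts an average of training targets, so the forest cannot return a value larger than the largest target it has seen. Past the edge of the training range the prediction goes flat instead of following the trend.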





