# Random Forest
### aka, a lot of random trees

![forest](img/forest.jpeg)

## Outcomes

- differentiate between decision trees and random forest 
- explain what makes random forest so hella cool
- explore the fine-tuning options in `sklearn` for random forest
- build a random forest in `sklearn`


### Scenario: 
We've made a decision tree, but we are concerned it might not generalize well. What to do?


### Could use k-fold cross validation

![dectree](img/decisiontree.png)

### But with same data, might get same results
![same](img/sameresult.png)

### It's like crowd sourcing. 
Could ask a lot of **_similar_** people
![min](img/minions.gif)

Or could ask a more _**diverse**_ group of people
![waldo](img/waldo.gif)

### Want to create a more diverse set of trees

![forest](img/randomforest.png)

### How do you diversify?

You create $m$ trees, each trained on a random sample of your data.<br>
Then at each node, $p$ features are randomly chosen to be considered when splitting.

![mind](img/mindblown.gif)

### Specifics:

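 $m$, the number of trees, defaults to 100 in `sklearn` unless otherwise specified.<br>
 $p$, the number of features considered at each split, defaults to the square root of the total number of features for classification.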
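
To make the two sources of randomness concrete, here is a minimal NumPy sketch. The toy sizes and the variable names (`row_idx`, `feat_idx`) are assumptions made up for illustration; this only mimics the sampling step and is not how `sklearn` implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 10, 8      # toy sizes, chosen only for illustration
m = 3                              # number of trees
p = int(np.sqrt(n_features))       # features considered at each split

for tree_idx in range(m):
    # Bootstrap: draw n_samples row indices WITH replacement
    row_idx = rng.integers(0, n_samples, size=n_samples)
    # At one node: draw p distinct feature indices WITHOUT replacement
    feat_idx = rng.choice(n_features, size=p, replace=False)
    print(f"tree {tree_idx}: rows={np.sort(row_idx)}, features at one node={np.sort(feat_idx)}")
```

Each tree sees some rows more than once and misses others entirely, and each split only gets to look at a handful of features.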

### Bagging

This technique is called _bagging_ because the samples are **_bootstrapped_** (drawn with replacement) and then the results of each tree are **_aggregated_**.
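
Bagging can be sketched by hand in a few lines. The data below is synthetic (`make_classification`), standing in for a real dataset, and the aggregation is a simple majority vote; as far as the `sklearn` docs describe it, `RandomForestClassifier` averages the trees' predicted class probabilities instead, so treat this as an illustration of the idea rather than a copy of the library's behavior.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25                       # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Aggregate: collect every tree's 0/1 prediction and take the majority
all_votes = np.array([t.predict(X_demo) for t in trees])   # shape (n_trees, n_samples)
bagged_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print("bagged training accuracy:", (bagged_pred == y_demo).mean())
```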

![bag](img/bag.jpeg)

### Built in cross-validation

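Because each tree is built on a bootstrap **sample**, some rows are left out of that tree's training data. The algorithm can use these left-out rows to compute an **Out of Bag** (OOB) error, which serves as a built-in estimate of generalization error. In `sklearn`, this is computed only when `oob_score=True`.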
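
A minimal sketch of reading the OOB estimate from `sklearn`, again on synthetic stand-in data (an assumption). The fitted attribute `oob_score_` holds the OOB accuracy, so the OOB error is one minus that value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# oob_score=True requires bootstrap=True (the default)
oob_forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
oob_forest.fit(X_demo, y_demo)

print("OOB accuracy:", oob_forest.oob_score_)
print("OOB error:   ", 1 - oob_forest.oob_score_)
```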

In [None]:
!pip install pydotplus

In [None]:
# libraries for decision trees

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree 
# from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
import pandas as pd 
import numpy as np
%matplotlib inline

In [None]:
# New ones for random forest

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Seed NumPy's global random number generator for reproducibility
np.random.seed(0)

## Scenario: Pima Indians diabetes dataset

<img src="img/0_IunJJNPI_F6U8ii9.jpeg" style="height:200px">


<br>

> This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

>The Pima Indians of the Gila River Indian Community have participated in longitudinal studies of the etiology of diabetes since 1965 (20).
- [Genetic Studies of the Etiology of Type 2 Diabetes in Pima Indians](https://diabetes.diabetesjournals.org/content/53/5/1181)

In [None]:
diabetes = pd.read_csv('diabetes.csv')

In [None]:
diabetes.head()

In [None]:
diabetes.describe().T

## Do we need to clean the data?

In [None]:
X = diabetes.drop(columns=['Outcome'])
Y = diabetes['Outcome']

In [None]:
diabetes.Outcome.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state= 10)  

In [None]:
classifier = DecisionTreeClassifier(random_state=10)  
classifier.fit(X_train, y_train)  

In [None]:
y_pred = classifier.predict(X_test)  

In [None]:
acc = accuracy_score(y_test,y_pred) * 100
print("Accuracy is :{0}".format(acc))

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("\nAUC is :{0}".format(round(roc_auc,2)))

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
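
The cell above passes hard class predictions to `roc_curve`, which yields a ROC curve with a single threshold. AUC is usually computed from scores instead, for example `predict_proba`. Here is a hedged sketch of the difference on synthetic stand-in data; the names `demo_tree`, `auc_labels`, and `auc_scores` are made up for illustration. A depth limit is used because a fully grown tree tends to output probabilities of exactly 0 or 1, in which case the two numbers coincide.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# A shallow tree has impure leaves, so predict_proba returns graded scores
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=10).fit(Xtr, ytr)

auc_labels = roc_auc_score(yte, demo_tree.predict(Xte))              # hard 0/1 labels
auc_scores = roc_auc_score(yte, demo_tree.predict_proba(Xte)[:, 1])  # P(class 1)

print("AUC from labels:       ", round(auc_labels, 3))
print("AUC from probabilities:", round(auc_scores, 3))
```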

In [None]:
# Train a DT classifier
classifier2 = DecisionTreeClassifier(random_state=10, criterion='entropy')  
classifier2.fit(X_train, y_train)  
# Make predictions for test data
y_pred = classifier2.predict(X_test) 
# Calculate Accuracy 
acc = accuracy_score(y_test,y_pred) * 100
print("Accuracy is :{0}".format(acc))
# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("\nAUC is :{0}".format(round(roc_auc,2)))
# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
print(pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

In [None]:
classifier2.feature_importances_
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

plot_feature_importances(classifier2)

In [None]:
pred = classifier2.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

## Random forest in code

`n_estimators` = $m$<br>
`max_features` = $p$

In [None]:
forest = RandomForestClassifier(n_estimators=100, max_depth= 5)
forest.fit(X_train, y_train)

#### Get accuracy of training data

In [None]:

forest.score(X_train, y_train)

#### Get accuracy of test data

In [None]:
forest.score(X_test, y_test)

In [None]:
plot_feature_importances(forest)

### Let us try to fine tune this model a bit

In [None]:
forest_2 = RandomForestClassifier(n_estimators = 10, max_features= 2, max_depth= 2)
forest_2.fit(X_train, y_train)

In [None]:
forest_2.score(X_train, y_train)

In [None]:
forest_2.score(X_test, y_test)

### Hyper-parameters for random forests

`n_estimators` : the number of trees in the forest<br>
`criterion`: `"gini"` or `"entropy"` <br>
`max_features`: the number of random features to be considered when looking for the best split <br>
`max_depth`:  the maximum number of levels of a tree<br>
`bootstrap`: whether or not bootstrap samples are used to build trees <br>
`oob_score`: whether or not to use out-of-bag samples to estimate the generalization accuracy<br>
`n_jobs`: how many cores you want to use when training your trees<br>


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [30, 100, 300],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [2, 4, 6]
}

In [None]:
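gs = GridSearchCV(forest, param_grid, cv=5)
# Fit first: the search must run before score or best_params_ are available
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
gs.best_params_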

### Benefits
**Strong performance**: Random Forest often performs well out of the box when compared with other classification algorithms. Because it is an ensemble, averaging many decorrelated trees reduces variance, which makes the model more resistant to noise in the data than a single decision tree.

**Interpretability**: Conveniently, each tree in the Random Forest is a Glass-Box Model (meaning that the model is interpretable, allowing us to see how it arrived at a certain decision), so the overall Random Forest can be inspected as well, although tracing hundreds of trees at once is harder than reading a single tree. You'll demonstrate this yourself in the upcoming lab, by inspecting feature importances for both individual trees and the entire Random Forest itself.
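
As a small preview, here is a sketch of comparing one tree's feature importances with the forest's, on synthetic stand-in data (an assumption). The fitted trees live in `estimators_`; in the `sklearn` versions I am aware of, the forest's `feature_importances_` is the average of the per-tree importances, which the last line checks rather than takes for granted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
demo_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances from a single tree versus the whole forest
first_tree_importances = demo_forest.estimators_[0].feature_importances_
forest_importances = demo_forest.feature_importances_

# Average the per-tree importances by hand for comparison
mean_of_trees = np.mean([t.feature_importances_ for t in demo_forest.estimators_], axis=0)

print("first tree:", np.round(first_tree_importances, 3))
print("forest:    ", np.round(forest_importances, 3))
print("forest matches mean of trees:", np.allclose(forest_importances, mean_of_trees))
```

A single tree's importances can be lopsided; the forest's are smoother because they pool many trees.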

### Drawbacks
**Computational Complexity**: Like any ensemble method, training multiple models means paying the computational cost of training each model. On large datasets, the runtime can be quite slow compared to other algorithms.

**Memory Usage**: Another side effect of the ensembled nature of this algorithm is that having multiple models means storing each in memory. Random Forests tend to have a larger memory footprint than other models. Whereas a parametric model like Logistic Regression just needs to store its coefficients, a Random Forest has to remember every aspect of every tree! It's not uncommon to see larger Random Forests that were trained on large datasets have memory footprints in the tens, or even hundreds, of MB. For data scientists working on modern computers, this isn't typically a problem. However, there are special cases where the memory footprint can make this an untenable choice: for instance, an app on a smartphone that uses machine learning may not be able to afford to spend that much disk space on a Random Forest model!
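
One rough way to see the difference is to compare the pickled size of each fitted model. This is a sketch on synthetic stand-in data (an assumption), and pickle size is only a proxy for the in-memory footprint, but the gap is usually large enough to make the point.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
demo_logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Serialized size in bytes as a rough proxy for memory footprint
forest_bytes = len(pickle.dumps(demo_forest))
logreg_bytes = len(pickle.dumps(demo_logreg))

# Every node of every tree has to be stored
total_nodes = sum(t.tree_.node_count for t in demo_forest.estimators_)

print(f"random forest:       {forest_bytes / 1024:.0f} KiB across {total_nodes} nodes")
print(f"logistic regression: {logreg_bytes / 1024:.1f} KiB")
```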

### Questions to consider

How do Random Forests handle the bias-variance tradeoff? <br>
What would be another way of using ensembling methods to tackle the bias-variance tradeoff?

Additional Resources<br>
https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf<br>
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm


Another flatiron slidedeck [here](https://docs.google.com/presentation/d/1bUwvdvg4bDRVzE3YaLSQZcsx-7t2ZFnaEGxjQHjxAoc/edit?usp=sharing)