# <font color='MAROON'>Day 3</font>
# <font color='BROWN'>Learning Algorithms</font>
### <font color='FIREBRICK'>Part 2</font>


## Random Forests
Fortunately, with libraries such as Scikit-Learn, it’s now easy to implement hundreds of machine learning algorithms in Python. It’s so easy that we often don’t need any underlying knowledge of how the model works in order to use it. While knowing all the details is not necessary, it’s still helpful to have an idea of how a machine learning model works under the hood. This lets us diagnose the model when it’s underperforming or explain how it makes decisions, which is crucial if we want to convince others to trust our models.

We'll start with a single decision tree and a simple problem, and then work our way to a random forest and a real-world problem.

Once we understand how a single decision tree thinks, we can transfer this knowledge to an entire forest of trees.

In [None]:
import numpy as np
import pandas as pd

# Set random seed to ensure reproducible runs
RSEED = 50

### Start Simple: Basic Problem
To begin, we'll use a very simple problem with only two features and two classes. This is a binary classification problem.

First, we create the features X and the labels y. There are only two features, which will allow us to visualize the data and which makes this a very easy problem.

In [None]:
X = np.array([[2, 2], 
              [2, 1],
              [2, 3], 
              [1, 2], 
              [1, 1],
              [3, 3]])

y = np.array([0, 1, 1, 1, 0, 1])

#### Data Visualization
To get a sense of the data, we can graph the data points.

In previous days you learned how to do this. You can also sketch the data points on a piece of paper, writing each point's label next to it.

In [None]:
# Your code


Our data has only two features (predictor variables), x1 and x2, and 6 data points (samples) split between 2 labels.

Even though there are only two features, this is a linearly inseparable problem, which means that we can’t draw a single straight line through the data to classify the points. We can however draw a series of straight lines that divide the data points into boxes, which we’ll call nodes. In fact, this is what a decision tree does during training. Effectively, a decision tree is a non-linear model built by constructing many linear boundaries.
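One quick way to convince yourself that no single straight line can separate the classes is sketched below. It reuses the same `X` and `y` as above and checks that the class-0 point (2, 2) lies exactly halfway between the class-1 points (2, 1) and (2, 3). Any straight line that puts both of those class-1 points on one side also puts their midpoint on that side, so (2, 2) cannot be split off from them by one line. The variable names `midpoint` and `is_between` are only illustrative.

```python
import numpy as np

X = np.array([[2, 2],
              [2, 1],
              [2, 3],
              [1, 2],
              [1, 1],
              [3, 3]])
y = np.array([0, 1, 1, 1, 0, 1])

# Midpoint of the two class-1 points (2, 1) and (2, 3)
midpoint = (X[1] + X[2]) / 2

# The midpoint coincides with the class-0 point (2, 2)
is_between = np.array_equal(midpoint, X[0])

print(f'Midpoint {midpoint} equals point {X[0]}: {is_between}')
print(f'Labels: point={y[0]}, neighbours={y[1]}, {y[2]}')
```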

### Single Decision Tree
Here we quickly build and train a single decision tree on the data using Scikit-Learn. The tree will learn how to separate the points, building a flowchart of questions based on the feature values and the labels. At each stage, the decision tree makes splits by maximizing the reduction in Gini impurity.

We'll use the default hyperparameters for the decision tree which means it can grow as deep as necessary in order to completely separate the classes. This will lead to overfitting because the model memorizes the training data, and in practice, we usually want to limit the depth of the tree so it can generalize to testing data.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Make a decision tree and train
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(X, y)

In [None]:
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')

Our decision tree formed 9 nodes and reached a maximum depth of 3. It achieves 100% accuracy on the training data because we did not limit the depth, so it can keep splitting until every training point is classified correctly.
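The training accuracy can be checked directly with the classifier's `score` method. The snippet below is a small self-contained sketch that rebuilds the same data and tree; `train_accuracy` is just an illustrative variable name.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[2, 2], [2, 1], [2, 3], [1, 2], [1, 1], [3, 3]])
y = np.array([0, 1, 1, 1, 0, 1])

# Unrestricted tree: it keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=50)
tree.fit(X, y)

# Fraction of training points the tree classifies correctly
train_accuracy = tree.score(X, y)
print(f'Training accuracy: {train_accuracy}')
```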

The created decision tree is as follows:

<img src = "images/dt.png">

A decision tree is an intuitive model: it makes decisions much as we might when faced with a problem by constructing a flowchart of questions. For each of the nodes (except the leaf nodes), the five rows represent:

1. Question asked about the data based on a feature: this determines the way we traverse down the tree for a new datapoint.
2. gini: the Gini Impurity of the node. The average (weighted by samples) gini impurity decreases with each level of the tree.
3. samples: number of training observations in the node
4. value: [number of samples in the first class, number of samples in the second class]
5. class: the class predicted for all the points in the node if the tree ended at this depth (defaults to 0 for a tie).

The leaf nodes (the terminal nodes at each branch) do not have a question because they are where the tree makes a prediction. All of the samples in a leaf node are assigned the same class.
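If the image is not available, the same flowchart can be inspected as plain text. The sketch below uses Scikit-Learn's `export_text` helper. The feature names `x1` and `x2` are labels chosen here to match the text, not something stored in the model, and the exact thresholds printed depend on the fitted tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[2, 2], [2, 1], [2, 3], [1, 2], [1, 1], [3, 3]])
y = np.array([0, 1, 1, 1, 0, 1])

tree = DecisionTreeClassifier(random_state=50).fit(X, y)

# Each indented line is a question on a feature; lines ending in "class: ..." are leaves
tree_text = export_text(tree, feature_names=['x1', 'x2'])
print(tree_text)
```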

### Gini Impurity
The Gini Impurity of a node is the probability that a randomly selected sample from the node would be incorrectly classified if it were labeled at random according to the distribution of classes in the node. At the root node, there is a 44.4% chance that a randomly selected point would be incorrectly classified. The Gini Impurity is how the decision tree makes splits: it splits the samples on the feature value that reduces the Gini Impurity by the largest amount. If we do the math, the average (weighted by number of samples) Gini Impurity decreases as we move down the tree.
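To make the arithmetic concrete, here is a minimal sketch of the calculation using the standard formula, one minus the sum of squared class proportions. The helper names `gini_impurity` and `weighted_gini` are made up for this example and are not part of Scikit-Learn. The candidate split shown (x2 <= 2.5) is worked out by hand from the six data points and is only one of the splits the tree could consider.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def weighted_gini(children_counts):
    """Average Gini impurity of child nodes, weighted by their sample counts."""
    sizes = np.array([sum(c) for c in children_counts], dtype=float)
    impurities = np.array([gini_impurity(c) for c in children_counts])
    return float(np.sum(sizes / sizes.sum() * impurities))

# Root node: 2 samples of class 0 and 4 samples of class 1
root_gini = gini_impurity([2, 4])
print(f'Root Gini impurity: {root_gini:.3f}')      # 0.444

# Candidate split x2 <= 2.5: left child holds [2, 2], right child holds [0, 2]
split_gini = weighted_gini([[2, 2], [0, 2]])
print(f'Weighted Gini after split: {split_gini:.3f}')  # 0.333
```

The drop from 0.444 to 0.333 is the reduction in impurity that this particular split would achieve.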

Eventually, the average Gini Impurity goes to 0.0 as we correctly classify each point. However, correctly classifying every single training point is usually a warning sign rather than a good result: it often means the model has memorized the training data and will not generalize well to the testing data! This model correctly classifies every single training point because we did not limit the maximum depth, and during training we give the model the answers as well as the features.

### Limit Maximum Depth
In practice, we usually want to limit the maximum depth of the decision tree (even in a random forest) so the tree can generalize better to testing data. Although this will lead to reduced accuracy on the training data, it can improve performance on the testing data.

In [None]:
# Limit maximum depth and train
short_tree = DecisionTreeClassifier(max_depth = 2, random_state=RSEED)
short_tree.fit(X, y)

print(f'Model Accuracy: {short_tree.score(X, y)}')

<img src="images/dt2.png">

Our model no longer gets perfect accuracy on the training data. However, it may well do better on the testing data since we have limited the maximum depth to prevent overfitting. This is an example of the bias-variance tradeoff in machine learning. A model with high variance has learned the training data very well but often cannot generalize to new points in the test set. On the other hand, a model with high bias has not learned the training data very well because it does not have enough complexity. This model will also not perform well on new points.
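The toy problem has no test set, so the tradeoff cannot be observed on it directly. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (an assumption for illustration only, not the dataset used later in this notebook) and compares training and testing accuracy for several values of `max_depth`. The names `X_demo`, `y_demo`, and `results` are hypothetical, and the exact numbers depend on the generated data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, flip_y=0.1,
                                     random_state=50)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=50)

results = {}
for depth in [1, 2, 4, 8, None]:  # None means no depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=50)
    clf.fit(X_tr, y_tr)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f'max_depth={str(depth):>4}  '
          f'train={results[depth][0]:.3f}  test={results[depth][1]:.3f}')
```

Very shallow trees tend to score poorly on both sets (high bias), while the unrestricted tree fits the training set perfectly but typically shows a gap between training and testing accuracy (high variance).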

Limiting the depth of a single decision tree is one way we can try to reduce the variance of the model, at the cost of some added bias. Another option is to use an entire forest of trees, training each one on a random subsample of the training data. The final model then takes an average of all the individual decision trees to arrive at a classification. This is the idea behind the random forest.

Hopefully this simple example has given you an idea of how a Decision Tree makes classifications. It looks at the features and the labels, and tries to construct a flowchart of questions that end in the correct classification for each label. If we don't limit the depth of the tree, it can correctly classify every single point in the training data. This will lead to overfitting though and an inability to do well on testing data. We didn't have any testing data in this example, but in the next problem, using a real-world dataset, we do and we'll see how overfitting can be an issue!

In previous sessions you learned how to create a decision tree.

Load the data, split it into a training set (``X_train, y_train``) and a test set (``X_test, y_test``), fit a decision tree on the training set, and find its accuracy on the test set.

In [None]:
# Your code

## Random Forest
Now we can move on to a more powerful model, the random forest. This takes the idea of a single decision tree and creates an ensemble model out of hundreds or thousands of trees to reduce the variance. Each tree is trained on a random sample of the observations (drawn with replacement), and for each split of a node, only a random subset of the features is considered. When making predictions, the random forest averages the predictions of the individual decision trees for each data point in order to arrive at a final classification.
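To show what happens under the hood, the sketch below builds a tiny "forest" by hand: each tree is fit on a bootstrap sample, considers a random subset of features at each split (`max_features='sqrt'`), and the predicted class probabilities are averaged across trees. This is a simplified illustration on synthetic data, not Scikit-Learn's actual implementation. The names `X_demo`, `y_demo`, `n_trees`, and `forest_pred` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(50)

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=50)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each split only considers a random subset of sqrt(n_features) features
    t = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    t.fit(X_demo[idx], y_demo[idx])
    trees.append(t)

# Average the class probabilities predicted by the individual trees
avg_proba = np.mean([t.predict_proba(X_demo) for t in trees], axis=0)

# Final prediction: the class with the highest average probability
forest_pred = avg_proba.argmax(axis=1)
print(f'Agreement with labels on the training data: '
      f'{np.mean(forest_pred == y_demo):.3f}')
```

In practice `RandomForestClassifier` does all of this for us, including parallel training, so the manual loop is only useful for building intuition.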

Creating and training a random forest is extremely easy in Scikit-Learn.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100, 
                               random_state=RSEED, 
                               max_features = 'sqrt',
                               n_jobs=-1, verbose = 1, max_depth = 5)

# Fit on training data (np.ravel flattens y_train to 1-D whether it is a DataFrame, Series, or array)
model.fit(X_train, np.ravel(y_train))

We can see how many nodes there are for each tree on average and the maximum depth of each tree. There were 100 trees in the forest.

In [None]:
n_nodes = []
max_depths = []

for ind_tree in model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

Predict the labels for test set and report the accuracy.

In [None]:
# Your code

### Research
Find out more about ensemble models. Name some of them and explain how they work.

*Recall*: a random forest is one type of ensemble model.

In [None]:
# Your answer (3-4 lines):

References:
- An Implementation and Explanation of the Random Forest in Python [link](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76).