# 1: Introduction
One of the most powerful methods for reducing decision tree overfitting is the random forest algorithm. In this mission, we'll learn how to construct and apply random forests.

In [1]:
import pandas as pd
import numpy as np

names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'high_income']
income = pd.read_csv('adult.data',names=names)

In [2]:
target_col = ['workclass','education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']

for target in target_col:
    # Convert the text column to integer category codes
    col = pd.Categorical(income[target])
    income[target] = col.codes

columns = ["age", "workclass", "education_num", "marital_status",
           "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

In [3]:
import math

np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))

train_max_row = int(math.floor(income.shape[0]*.8))

train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]

# 2: Ensemble models
A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how it works.

We'll create two decision trees with slightly different parameters, and check their accuracy separately. Later on, we'll combine their predictions and compare the accuracy.

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1,min_samples_leaf=75)
clf.fit(train[columns],train['high_income'])

clf2 = DecisionTreeClassifier(random_state=1,max_depth=6)
clf2.fit(train[columns],train['high_income'])

predictions_clf1 = clf.predict(test[columns])
predictions_clf2 = clf2.predict(test[columns])

auc_clf1 = roc_auc_score(predictions_clf1,test['high_income'])
auc_clf2 = roc_auc_score(predictions_clf2,test['high_income'])

print(auc_clf1, auc_clf2)

0.784853009097 0.771031199892


# 3: Combining our predictions
When we have multiple classifiers making predictions, we can treat each set of predictions as a column in a matrix.

Ultimately, we don't want this matrix, though -- we want one prediction per row in the test data. To do this, we'll need to create rules to turn each row of our matrix of predictions into a single number.

There are many ways to get from the output of multiple models to a final vector of predictions. One method is majority voting, where each classifier gets a "vote", and the most commonly voted value for each row wins. 

This only works if there are more than 2 classifiers.

Since we have only two classifiers here, we'll have to use a different method to combine predictions.
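
Majority voting, as described above, can be sketched in a few lines. The three prediction columns below are made-up values for illustration, and `majority_vote` is a hypothetical helper name, not something defined elsewhere in this mission.

```python
from collections import Counter

# Hypothetical 0/1 predictions from three classifiers for five rows.
# Each inner list is one classifier's column of predictions.
predictions_by_model = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
]

def majority_vote(prediction_columns):
    """Return the most common prediction for each row."""
    votes = []
    for row in zip(*prediction_columns):
        # most_common(1) returns [(value, count)] for the top value.
        votes.append(Counter(row).most_common(1)[0][0])
    return votes

final_predictions = majority_vote(predictions_by_model)
print(final_predictions)
```

With an odd number of binary classifiers, a row can never end in a tie, which is why three voters are used in this sketch.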

We'll take the mean of all the items in a row. Right now, we're using the predict method, which returns either 0 or 1.

We can instead use the predict_proba method, which will predict a probability from 0 to 1 that a given class is the right one for a row. Since 0 and 1 are our two classes, we'll get a matrix with one row for each row of the data we pass in and 2 columns, one per class.

If we just take the second column, we get the probability that the classifier assigns to class 1 for that row. If there's a .9 probability that the correct classification is 1, we can use the .9 as the value the classifier is predicting. This will give us a continuous output in a single vector instead of just 0 or 1.
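
As a small illustration of the idea, here is a sketch using plain Python lists. The probability values are invented for the example and only mimic the two-column shape that predict_proba returns.

```python
# Hypothetical predict_proba output for three rows: each row is
# [P(class 0), P(class 1)], so the two entries sum to 1.
proba_clf1 = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]
proba_clf2 = [[0.3, 0.7], [0.6, 0.4], [0.7, 0.3]]

# Keep only the second column: the probability that the class is 1.
p1 = [row[1] for row in proba_clf1]
p2 = [row[1] for row in proba_clf2]

# Average the two classifiers row by row, then round to a 0/1 label.
toy_averages = [(a + b) / 2 for a, b in zip(p1, p2)]
toy_labels = [int(round(avg)) for avg in toy_averages]
print(toy_labels)
```

In the third row the two classifiers disagree (.6 versus .3), and the average sides with the more confident one, which is something a plain vote between two models could not do.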



In [5]:
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]

averages = (predictions + predictions2) / 2.
averages_round = np.round(averages)

print(roc_auc_score(averages_round,test['high_income']))

0.789959895266


# 4: Why ensembling works
As we can see from the previous screen, the combined predictions of the two trees have a higher AUC than either tree on its own.

Both models are approaching the problem slightly differently, and building a different tree because we used different parameters for each. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.

The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will be. Ensembling a decision tree and a logistic regression model, which use very different approaches to arrive at their answers, will result in stronger predictions than ensembling two decision trees with similar parameters.

On the other hand, if the models you ensemble are very similar in how they make predictions, you'll get a negligible boost from ensembling.
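
A rough way to see why diversity matters is to compare two idealized extremes. The sketch below assumes three classifiers combined by majority vote (a different setup from the two averaged trees above), each correct 70% of the time. Real models fall somewhere between "identical" and "fully independent", so treat the numbers as bounds, not as predictions.

```python
# Assumption: three classifiers, each correct with probability p,
# combined by majority vote.
p = 0.7

# Identical models always agree, so the vote is right exactly
# when the single shared model is right.
identical_accuracy = p

# Fully independent models: the vote is right when all three are
# right, or when exactly two of the three are right.
independent_accuracy = p**3 + 3 * p**2 * (1 - p)

print(round(identical_accuracy, 3), round(independent_accuracy, 3))
```

Under these assumptions the independent ensemble reaches about .784 accuracy, while the ensemble of identical models stays at .7.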

Ensembling models with very different accuracies will not generally improve your accuracy. Ensembling a model with a .75 AUC and a model with a .85 AUC on a test set will usually result in an AUC somewhere in between the two original values.

# 5: Bagging
In order to make ensembling effective, we have to introduce variation into each individual decision tree model.

There are two main ways to introduce variation in a random forest -- bagging and random feature subsets. We'll dive into bagging first.

In a random forest, each tree isn't trained using the whole dataset. Instead, it's trained on a random sample of the data, or a "bag". This sampling is performed with replacement. When we sample with replacement, after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times.
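
To make the difference concrete, here is a minimal sketch using only the standard library. The ten row ids and the bag size are made up for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

row_ids = list(range(10))   # stand-in for the index of a 10-row dataset
bag_size = 6                # 60% of the rows, as an example proportion

# random.choices samples with replacement: a row can be drawn again.
bag = random.choices(row_ids, k=bag_size)

# random.sample samples without replacement: every row is distinct.
no_replacement = random.sample(row_ids, k=bag_size)

print(sorted(bag))
print(sorted(no_replacement))
```

Only the first list can contain repeated row ids, which is the behavior bagging relies on.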

In [9]:
# We'll build 10 trees
tree_count = 10

#Each bag will have 60% of the number of original rows
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement.
    # We set a random state to ensure we'll be able to replicate our results.
    # We set it to i instead of a fixed value so we don't get the same sample every loop.
    # That would make all of our trees the same.
    bag = train.sample(frac=bag_proportion,replace=True,random_state=i)
    
    clf = DecisionTreeClassifier(random_state=1,min_samples_leaf=75)
    clf.fit(bag[columns],bag['high_income'])
    
    #Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])
predictions_average = np.mean(predictions,axis=0)
predictions_average_round = np.round(predictions_average)

print(roc_auc_score(predictions_average_round,test['high_income']))

0.785415640465


# 6: Selecting random features
With the bagging example from the previous screen, we gained some accuracy over a single decision tree. Now, we'll go back to the decision tree algorithm we created 2 missions ago to explain random feature subsets. 

We first pick a maximum number of features that we want to evaluate each time we split the tree. This has to be less than the total number of columns in the data.

Every time we split, we pick a random sample of features from the data. We then compute the information gain for each feature in our random sample, and pick the one with the highest information gain to split on.

We're repeating the same process to select the optimal split for a node, but we'll only evaluate a constrained set of features, selected randomly. This introduces variation into the trees, and makes for more powerful ensembles.
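
The selection step on its own can be sketched as follows. The information gain values are hypothetical placeholders, and `pick_split_feature` is an illustrative name. In a real tree the gains would be recomputed from the rows that reach each node.

```python
import random

random.seed(1)

# Hypothetical information gains for each feature at one node.
gains = {"age": 0.08, "employment": 0.02, "marital_status": 0.05}

max_features = 2  # must be smaller than the number of features

def pick_split_feature(gains, max_features):
    # Draw a random subset of distinct features for this split.
    candidates = random.sample(sorted(gains), max_features)
    # Among the candidates only, choose the highest information gain.
    best = max(candidates, key=lambda name: gains[name])
    return candidates, best

candidates, best = pick_split_feature(gains, max_features)
print(candidates, best)
```

This sketch draws distinct features. By contrast, `np.random.choice(columns, 2)` with its default settings samples with replacement, so it can draw the same column twice.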

In [34]:
import math

def calc_entropy(column):
    # Compute the counts of each unique value in the column
    counts = np.bincount(column)
    probabilities = counts/float(len(column))
    
    entropy = 0
    
    for prob in probabilities:
        if prob > 0:
            entropy += prob*math.log(prob,2)
            
    return -entropy

def calc_information_gain(data,split_name,target_name):
    #Calculate original entropy.
    original_entropy = calc_entropy(data[target_name])
    
    #Find the median of the column we're splitting
    column = data[split_name]
    median = column.median()
    
    left_split = data[column <= median]
    right_split = data[column > median]
    
    #Loop through the splits, and calculate the subset entropy
    to_subtract = 0
    for subset in [left_split,right_split]:
        prob = (subset.shape[0] / float(data.shape[0]))
        to_subtract += prob * calc_entropy(subset[target_name])
    
    #return information gain
    return original_entropy - to_subtract
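
As a worked check of the information gain calculation, the sketch below repeats the same steps in plain Python for a split on age. The six (age, high_income) pairs are the same values as in the small DataFrame defined in the next cell, and the `entropy` helper here is a standalone stand-in for `calc_entropy`.

```python
import math
import statistics
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# (age, high_income) pairs
rows = [(20, 0), (60, 0), (40, 0), (25, 1), (35, 1), (55, 1)]

median_age = statistics.median(age for age, _ in rows)

# Split the labels on the median, as calc_information_gain does.
left = [label for age, label in rows if age <= median_age]
right = [label for age, label in rows if age > median_age]

original = entropy([label for _, label in rows])
weighted = sum(len(part) / len(rows) * entropy(part)
               for part in (left, right))
gain = original - weighted

print(median_age, round(gain, 4))
```

The starting entropy is 1 bit (three rows in each class), and each side of the split still holds a 2-to-1 mix of labels, so the gain from splitting on age at the root is small, roughly 0.08 bits.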

In [39]:
data = pd.DataFrame([
    [0,4,20,0],
    [0,4,60,2],
    [0,5,40,1],
    [1,4,25,1],
    [1,5,35,2],
    [1,5,55,1]
    ])

data.columns = ["high_income", "employment", "age", "marital_status"]

# Set a random seed to make results reproducible.
np.random.seed(1)

tree = {}
nodes = []

# The function to find the column to split on.
def find_best_column(data,target_name,columns):
    information_gains = []
    
    cols = np.random.choice(columns,2)
    
    for col in cols:
        information_gain = calc_information_gain(data,col,target_name)
        information_gains.append(information_gain)
        
    highest_gain_index = information_gains.index(max(information_gains))
    
    highest_gain = cols[highest_gain_index]
    
    return highest_gain

# The function to construct an id3 decision tree.
def id3(data, target, columns, tree):
    unique_targets = pd.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return
    
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    tree["column"] = best_column
    tree["median"] = column_median
    
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]
    
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

id3(data,'high_income',['employment','age','marital_status'],tree)
print(tree)

{'column': 'age', 'right': {'column': 'age', 'right': {'number': 11, 'label': 0}, 'median': 55.0, 'number': 7, 'left': {'column': 'age', 'right': {'number': 10, 'label': 1}, 'median': 47.5, 'number': 8, 'left': {'number': 9, 'label': 0}}}, 'median': 37.5, 'number': 1, 'left': {'column': 'employment', 'right': {'number': 6, 'label': 1}, 'median': 4.0, 'number': 2, 'left': {'column': 'age', 'right': {'number': 5, 'label': 1}, 'median': 22.5, 'number': 3, 'left': {'number': 4, 'label': 0}}}}


# 7: Random subsets in scikit-learn
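We can also repeat our random subset selection process in scikit-learn. We just set the splitter parameter on DecisionTreeClassifier to "random", and the max_features parameter to "auto" (recent scikit-learn releases spell this value "sqrt"). The max_features setting is what restricts each split to a random subset of features, while splitter="random" additionally makes the candidate split thresholds random. If we have N columns, this will pick a subset of features of size $\sqrt{N}$, compute the Gini impurity (a split criterion similar to information gain) for each, and split the node on the best column in the subset.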
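
For the ten feature columns used in this mission, the subset size works out as below. This is a small arithmetic sketch that assumes the square root is truncated to a whole number, with a minimum of one feature.

```python
import math

feature_columns = ["age", "workclass", "education_num", "marital_status",
                   "occupation", "relationship", "race", "sex",
                   "hours_per_week", "native_country"]

n_features = len(feature_columns)

# Assumption: the square root is truncated to an integer, with a
# floor of one feature.
subset_size = max(1, int(math.sqrt(n_features)))
print(n_features, subset_size)
```

So with 10 columns, each split would consider 3 randomly chosen features under this assumption.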

In [40]:
tree_count = 10

bag_proportion = .7

predictions = []
for i in range(tree_count):
    bag = train.sample(frac=bag_proportion,replace=True,random_state=i)
    # Fit a decision tree model to the 'bag'
    # 'sqrt' is the current spelling of the older max_features='auto' setting for classifiers
    clf = DecisionTreeClassifier(random_state=1,min_samples_leaf=75,splitter='random',max_features='sqrt')
    clf.fit(bag[columns],bag['high_income'])
    
    predictions.append(clf.predict_proba(test[columns])[:,1])
    
combined = np.sum(predictions,axis=0) / float(tree_count)
rounded = np.round(combined)

print(roc_auc_score(rounded,test['high_income']))

0.789767997764


# 8: Putting it all together
When we instantiate a RandomForestClassifier, we pass in an n_estimators parameter that indicates how many trees to build. There's never any harm in building more trees, but each tree will take time to build, so more trees will take longer.

RandomForestClassifier has a similar interface to DecisionTreeClassifier, and we can use the fit and predict methods to train and make predictions.

In [46]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10,random_state=1,min_samples_leaf=75)
clf.fit(train[columns],train['high_income'])

print(roc_auc_score(clf.predict(test[columns]),test['high_income']))

0.791634978035


# 9: Parameter tweaking
Similarly to decision trees, we can tweak a few parameters with random forests:

- min_samples_leaf
- min_samples_split
- max_depth
- max_leaf_nodes

These parameters apply to the individual trees in the model, and change how they are constructed. There are also parameters specific to the random forest that alter how it's constructed as a whole:

- n_estimators
- bootstrap -- defaults to True. Bootstrap aggregation is another name for bagging, and this indicates whether to turn it on.

By tweaking parameters, we can increase the accuracy of the forest. The easiest tweak is to increase the number of estimators we use. This has diminishing returns -- going from 10 trees to 100 will make a bigger difference than going from 100 to 500, which will make a bigger difference than going from 500 to 1000. The accuracy gain grows roughly logarithmically with the number of trees, so adding trees beyond a certain point (often around 200) won't help much at all.
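
One way to see where the diminishing returns come from is the standard formula for the variance of an average of n equally correlated predictions, $\rho\sigma^2 + (1-\rho)\sigma^2/n$. The sketch below plugs in assumed values for the single-tree variance and the correlation between trees. These numbers are illustrative only and are not measured from the income data.

```python
# Assumed values for illustration only; they are not measured from
# the income data.
sigma_squared = 1.0   # variance of a single tree's prediction
rho = 0.3             # pairwise correlation between trees

def ensemble_variance(n_trees):
    # Variance of the mean of n equally correlated predictions.
    return rho * sigma_squared + (1 - rho) * sigma_squared / n_trees

for n_trees in [10, 100, 500, 1000]:
    print(n_trees, round(ensemble_variance(n_trees), 4))
```

Only the second term shrinks as trees are added, so each extra batch of trees removes less variance than the one before, and the total can never drop below the correlated part `rho * sigma_squared`.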

In [55]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=150,random_state=1,min_samples_leaf=75)

clf.fit(train[columns],train['high_income'])

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions,test['high_income']))

0.793788646293


# 10: Reducing overfitting
One of the major advantages of random forests over single decision trees is that they overfit less. Although the individual decision trees in a random forest vary widely, the average of their predictions is less sensitive to the input data than a single tree is. This is because while one tree can construct an incorrect and overfit model, the average of 100 or more trees is more likely to home in on the signal and ignore the noise. The signal will be the same across all the trees, whereas each tree will fit the noise differently. This means that the average will tend to cancel out the noise and keep the signal.
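
The averaging argument can be illustrated with a toy simulation. It assumes each "tree" outputs the same signal plus its own independent random noise, which is a simplification: the errors of real trees are partly correlated. The signal and noise levels are made-up values.

```python
import random
import statistics

random.seed(1)

signal = 0.7       # hypothetical true probability for one row
noise_sd = 0.2     # hypothetical per-tree noise level
n_trees = 100
n_trials = 200

def tree_prediction():
    # Each simulated tree sees the same signal plus its own noise.
    return signal + random.gauss(0, noise_sd)

# Predictions from a single tree, repeated over many trials.
single = [tree_prediction() for _ in range(n_trials)]

# Predictions from the average of n_trees trees, over the same
# number of trials.
averaged = [statistics.mean([tree_prediction() for _ in range(n_trees)])
            for _ in range(n_trials)]

spread_single = statistics.pstdev(single)
spread_averaged = statistics.pstdev(averaged)
print(spread_single > spread_averaged)
```

The averaged predictions cluster much more tightly around the signal than the single-tree predictions do.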

In the code cell, you'll see that we've fit a single decision tree to the training set, and made predictions for both the training set and testing set. The AUC for the training set predictions is .789, and the AUC for the testing set is .784. This indicates a mild degree of overfitting.

In [57]:
# Single decision tree: compare AUC on the training and test sets
clf = DecisionTreeClassifier(random_state=1,min_samples_leaf=75)
clf.fit(train[columns],train['high_income'])

predictions = clf.predict(train[columns])
print(roc_auc_score(predictions,train['high_income']))

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions,test['high_income']))

# Random forest: the same comparison
clf = RandomForestClassifier(n_estimators=150,random_state=1,min_samples_leaf=75)
clf.fit(train[columns],train['high_income'])

predictions = clf.predict(train[columns])
print(roc_auc_score(predictions,train['high_income']))

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions,test['high_income']))

0.789854859593
0.784853009097
0.794137608735
0.793788646293
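
A quick way to compare the two models is to look at the gap between training and test AUC for each. The sketch below just does that arithmetic on the four values printed above. The variable names are illustrative.

```python
# AUC values copied from the printed output above.
tree_train_auc, tree_test_auc = 0.789854859593, 0.784853009097
forest_train_auc, forest_test_auc = 0.794137608735, 0.793788646293

# A larger train-minus-test gap suggests more overfitting.
tree_gap = tree_train_auc - tree_test_auc
forest_gap = forest_train_auc - forest_test_auc

print(round(tree_gap, 4), round(forest_gap, 4))
```

The single tree's gap is about .005, while the forest's is about .0003, which is consistent with the forest overfitting less on this split.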


# 11: When to use random forests
The random forest algorithm is incredibly powerful, but isn't applicable to all tasks. The main strengths of a random forest are:
- Very accurate predictions -- Random forests achieve near state of the art performance on many machine learning tasks. Along with neural networks and gradient boosted trees, they are typically one of the top performing algorithms.
- Resistance to overfitting -- due to how they're constructed, random forests are fairly resistant to overfitting. Parameters like max_depth still have to be set and tweaked, though.

The main weaknesses are:
- Hard to interpret -- because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.
- Longer creation time -- making two trees takes twice as long as making one, making three takes three times as long, and so on. Luckily, we can exploit multicore processors to parallelize tree construction. scikit-learn allows us to do this through the n_jobs parameter on RandomForestClassifier. We'll get more into parallelization later.

Given these tradeoffs, it makes sense to use random forests in situations where accuracy is of the utmost importance, and being able to interpret or explain the decisions the model is making isn't key. In cases where time is of the essence, or interpretability is important, a single decision tree may be a better choice.

