# 1: Introduction
One of the most effective methods for reducing decision tree overfitting is the random forest algorithm. In this mission, we'll learn how to construct and apply random forests.

In [1]:
import pandas as pd
import numpy as np

names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'high_income']
income = pd.read_csv('adult.data',names=names)

In [2]:
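# Columns that hold text categories and need to be converted to integer codes
target_col = ['workclass','education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']

for target in target_col:
    # Encode each distinct category as an integer code
    col = pd.Categorical(income[target])
    income[target] = col.codes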

columns = ["age", "workclass", "education_num", "marital_status",
           "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

In [3]:
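import math

# Set a random seed so the shuffle is the same every time we run the code
np.random.seed(1)

# Shuffle the rows of the dataset
income = income.reindex(np.random.permutation(income.index))

# Use the first 80% of the shuffled rows for training and the rest for testing
train_max_row = int(math.floor(income.shape[0] * .8))

train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]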

# 2: Ensemble models
A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how it works.

We'll create two decision trees with slightly different parameters, and check their accuracy separately. Later on, we'll combine their predictions and compare the accuracy.

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1,min_samples_leaf=75)
clf.fit(train[columns],train['high_income'])

clf2 = DecisionTreeClassifier(random_state=1,max_depth=6)
clf2.fit(train[columns],train['high_income'])

predictions_clf1 = clf.predict(test[columns])
predictions_clf2 = clf2.predict(test[columns])

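# Note: scikit-learn documents the signature as roc_auc_score(y_true, y_score).
# Here the predictions are passed first, and the values printed below reflect that order.
auc_clf1 = roc_auc_score(predictions_clf1,test['high_income'])
auc_clf2 = roc_auc_score(predictions_clf2,test['high_income'])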

print(auc_clf1, auc_clf2)

0.784853009097 0.771031199892


# 3: Combining our predictions
When we have multiple classifiers making predictions, we can treat each set of predictions as a column in a matrix.

Ultimately, we don't want this matrix, though -- we want one prediction per row in the test data. To do this, we'll need rules that turn each row of our matrix of predictions into a single number.

There are many ways to get from the output of multiple models to a final vector of predictions. One method is majority voting, where each classifier gets a "vote", and the most commonly voted value for each row wins. 
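
As a minimal sketch of majority voting, assume we had three classifiers (we only train two in this mission) and had stacked their 0/1 predictions as the columns of a matrix. The `preds` values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per observation, one column per classifier
preds = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
])

# Count the votes for class 1 in each row
votes_for_one = preds.sum(axis=1)

# Class 1 wins a row when more than half of the classifiers voted for it
majority = (votes_for_one > preds.shape[1] / 2).astype(int)

print(majority)  # [1 0 1 0 0]
```

With an even number of classifiers, a row can split its votes evenly, and this rule would silently send those ties to class 0.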

This only works if there are more than 2 classifiers. Since we only have two, we'll have to use a different method to combine predictions.

We'll take the mean of all the items in a row. Right now, we're using the predict method, which returns either 0 or 1, so the mean of two predictions would be an uninformative 0.5 whenever the trees disagree.

We can instead use the predict_proba method, which will predict a probability from 0 to 1 that a given class is the right one for a row. Since 0 and 1 are our two classes, we'll get a matrix with 2 columns and as many rows as the dataframe we pass in (here, test).

If we just take the second column, we get the probability the classifier assigns to class 1 for each row. If there's a .9 probability that the correct classification is 1, we can use the .9 as the value the classifier is predicting. This will give us a continuous output in a single vector instead of just 0 or 1.
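
To make the shape of the predict_proba output concrete, here's a small sketch on a made-up one-feature dataset. The `X`, `y`, and `tree` names are hypothetical and are not part of the income data. The column order follows the classifier's `classes_` attribute, so with classes 0 and 1 the second column holds the probability of class 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a single feature and a 0/1 label
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# A very shallow tree, so its leaves contain a mix of both classes
tree = DecisionTreeClassifier(random_state=1, max_depth=1)
tree.fit(X, y)

proba = tree.predict_proba(X)

print(proba.shape)    # one row per observation, one column per class
print(tree.classes_)  # the class that each column corresponds to

# Keep only the probability of class 1 for each row
prob_of_one = proba[:, 1]
```

Each row of `proba` sums to 1, so in the two-class case the second column carries all of the information.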



In [11]:
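# Probability of class 1 from each tree
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]

# Average the two probability vectors, then round to get a 0/1 prediction
averages = (predictions + predictions2) / 2.
averages_round = np.round(averages)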

# The predictions are passed first here; scikit-learn's documented order is (y_true, y_score)
print(roc_auc_score(averages_round,test['high_income']))

0.789959895266
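
One detail of the rounding step is worth knowing: np.round rounds values that fall exactly on .5 to the nearest even number, so an average of exactly 0.5 becomes 0 rather than 1. The sketch below uses made-up probabilities (`p_a` and `p_b` are hypothetical, not output from our trees) to show the behavior.

```python
import numpy as np

# Hypothetical class-1 probabilities from two models for four rows
p_a = np.array([0.9, 0.2, 0.6, 0.5])
p_b = np.array([0.7, 0.4, 0.3, 0.5])

# Average the probabilities row by row
avg = (p_a + p_b) / 2.

# Round to the nearest class; the last row averages exactly 0.5 and rounds to 0
rounded = np.round(avg)

print(rounded)  # [1. 0. 0. 0.]
```

If we wanted ties to go to class 1 instead, an explicit threshold such as `(avg >= 0.5).astype(int)` would be one way to do it.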


# 4: Why ensembling works
As we can see from the previous section, the combined predictions of the two trees score a higher AUC (about 0.790) than either tree alone (about 0.785 and 0.771).

Both models are approaching the problem slightly differently, and building a different tree because we used different parameters for each. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.

The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will tend to be. Ensembling a decision tree and a logistic regression model, which use very different approaches to arrive at their answers, will usually result in stronger predictions than ensembling two decision trees with similar parameters.

On the other hand, if the models you ensemble are very similar in how they make predictions, you'll get a negligible boost from ensembling.
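
A toy simulation can illustrate why diversity matters. This is only a sketch under simplifying assumptions: it doesn't use the income data, it treats each "model" as the true value plus random error, and it measures mean squared error rather than AUC. All of the names below are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 10000

# The quantity we are trying to predict
truth = rng.normal(size=n)

# Two "models" whose errors are independent of each other
model_a = truth + rng.normal(size=n)
model_b = truth + rng.normal(size=n)

def mse(pred):
    # Mean squared error against the true values
    return np.mean((pred - truth) ** 2)

mse_a = mse(model_a)

# Averaging two models with independent errors lets some of the errors cancel out
mse_diverse = mse((model_a + model_b) / 2.)

# Averaging a model with an identical copy of itself changes nothing
mse_same = mse((model_a + model_a) / 2.)

print(mse_a, mse_diverse, mse_same)
```

In this setup the average of the two independent models has a lower error than either model alone, while the average of two identical models has exactly the same error as the original.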

Ensembling models with very different accuracies will not generally improve your accuracy. Ensembling a model with a .75 AUC and a model with a .85 AUC on a test set will usually result in an AUC somewhere in between the two original values.