We used a modified version of ID3, which is a bit simpler than the most common tree building algorithms, C4.5, and CART. However, the basics are all the same, and so we can apply the principles we learned about how decision trees work to any tree construction algorithm.



# 2: Using decision trees with scikit-learn
We can use the scikit-learn package to fit a decision tree. The interface is very similar to other algorithms we've fit in the past.

We use the DecisionTreeClassifier class for classification problems, and DecisionTreeRegressor for regression problems. Both of these classes are in the sklearn.tree package.

In this case, we're predicting a binary outcome, so we'll use a classifier.

The first step is to train the classifier on the data. We'll use the fit method on a classifier to do this.

In [23]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'high_income']
income = pd.read_csv('adult.data',names=names)

In [45]:
target_col = ['workclass','education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']

for target in target_col:
    col = pd.Categorical.from_array(income[target])
    income[target] = col.codes

columns = ["age", "workclass", "education_num", "marital_status",
           "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
# Instantiate the classifier
# Set random_state to 1 to keep results consistent
clf = DecisionTreeClassifier(random_state=1)

clf.fit(income[columns],income['high_income'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')

# 3: Splitting the data into train and test sets
Now that we've fit a model, we can make predictions. We'll want to split our data into training and testing sets first.

We can avoid overfitting by always making predictions and evaluating error on data that our algorithm hasn't been trained with. This will show us when we're overfitting by giving us a realistic error on data that the algorithm hasn't seen before.

In [50]:
import math

np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))

train_max_row = int(math.floor(income.shape[0]*.8))

train = income.iloc[:train_max_row]
test= income.iloc[train_max_row:]

# 4: Evaluating error
 AUC ranges from 0 to 1, and is ideal for binary classification. The higher the AUC, the more accurate our predictions. 

In [51]:
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
error = roc_auc_score(predictions,test['high_income'])

print error

0.701110628846


# 5: Compute error on the training set
The AUC for the predictions on the testing set is about .7. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting. 

It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.

In [52]:
predictions = clf.predict(train[columns])
print(roc_auc_score(predictions,train['high_income']))

0.97449683987


# 6: Decision tree overfitting
 There's no hard and fast rule on when overfitting is happening, but our model is predicting the training set much better than it's predicting the test set. Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect it and fix it.

We've built our tree in such a way that it can perfectly predict the training set -- but, the way the tree has been constructed doesn't make sense when we step back.

These rules are very specific to the training set.

All we've done is "pruned" the tree, and removed some of the lower leaves. We've made some of the higher up nodes into leaves instead.

This actually has lower accuracy on our training set, but it will generalize better to new examples, because it matches reality better.

Trees overfit when they have too much depth, and make overly complex rules that match the training data, but aren't able to generalize well to new data.

This may seem to be a strange principle at first, but the more depth a tree has, typically the worse it performs on new data.

# 7: Building a shallower tree
There are three main ways to combat overfitting:

- "Prune" the tree after building to remove unneeded leaves.
-  Use ensembling to blend the predictions of many trees.
- Restrict the depth of the tree while you're building it.

We can restrict how deep the tree is built with a few parameters when we initialize the DecisionTreeClassifier class:

- max_depth -- this globally restricts how deep the tree can go.
- min_samples_split -- The minimum number of rows needed in a node before it can be split. For example, if this is set to 2, then nodes with 2 rows won't be split, and will become leaves instead.
- min_samples_leaf -- the minimum number of rows that a leaf must have.
- min_weight_fraction_leaf -- the fraction of input rows that are required to be at a leaf.
- max_leaf_nodes -- the maximum number of total leaves. This will cap the count of leaf nodes as the tree is being built.



In [57]:
clf = DecisionTreeClassifier(random_state=1,min_samples_split=5)
clf.fit(train[columns],train['high_income'])

predictions_train = clf.predict(train[columns])
train_auc = roc_auc_score(predictions_train,train['high_income'])

predictions_test = clf.predict(test[columns])
test_auc = roc_auc_score(predictions_test,test['high_income'])

print train_auc,test_auc

0.933086058571 0.719213185068
