We used a modified version of ID3, which is a bit simpler than the most common tree-building algorithms, C4.5 and CART. However, the basics are the same, so we can apply the principles we learned about how decision trees work to any tree construction algorithm.



# 2: Using decision trees with scikit-learn
We can use the scikit-learn package to fit a decision tree. The interface is very similar to other algorithms we've fit in the past.

We use the DecisionTreeClassifier class for classification problems, and DecisionTreeRegressor for regression problems. Both classes live in the sklearn.tree module.

In this case, we're predicting a binary outcome, so we'll use a classifier.

The first step is to train the classifier on the data. We'll use the fit method on a classifier to do this.
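
As a minimal sketch of that interface, the example below fits both classes on a tiny made-up dataset. The arrays X, y_class, and y_reg are assumptions for illustration only and are not part of the income data.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data, assumed for illustration: one feature, four rows
X = [[0], [1], [2], [3]]
y_class = [0, 0, 1, 1]          # binary labels for the classifier
y_reg = [1.0, 1.0, 3.0, 3.0]    # continuous targets for the regressor

# Both classes share the same fit/predict interface
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, y_class)
class_predictions = clf.predict([[0.2], [2.8]])

reg = DecisionTreeRegressor(random_state=1)
reg.fit(X, y_reg)
reg_predictions = reg.predict([[0.2], [2.8]])

print(class_predictions, reg_predictions)
```

The classifier returns class labels, while the regressor returns the mean target value of the leaf each row falls into.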

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'high_income']
income = pd.read_csv('adult.data',names=names)

In [2]:
target_col = ['workclass','education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']

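# Convert each categorical column to integer codes.
# pd.Categorical.from_array is no longer available in current pandas,
# so we call the pd.Categorical constructor directly.
for target in target_col:
    col = pd.Categorical(income[target])
    income[target] = col.codes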

columns = ["age", "workclass", "education_num", "marital_status",
           "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
# Instantiate the classifier
# Set random_state to 1 to keep results consistent
clf = DecisionTreeClassifier(random_state=1)

clf.fit(income[columns],income['high_income'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')
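
To see what the conversion to integer codes does, here is a small sketch on a hypothetical column. The values below are made up for illustration; the real columns come from adult.data.

```python
import pandas as pd

# Hypothetical column of string categories, assumed for illustration
workclass = pd.Series(["Private", "State-gov", "Private", "Self-emp"])

col = pd.Categorical(workclass)
categories = list(col.categories)   # unique values, sorted alphabetically
codes = list(col.codes)             # position of each row's value in categories

print(categories)
print(codes)
```

The codes are only labels. The tree will still treat them as ordered numbers when it picks split points, which is a simplification worth keeping in mind.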

# 3: Splitting the data into train and test sets
Now that we've fit a model, we can make predictions. We'll want to split our data into training and testing sets first.

We can detect overfitting by always making predictions and evaluating error on data that our algorithm hasn't been trained with. This gives us a realistic error on data that the algorithm hasn't seen before, which shows us when we're overfitting.

In [3]:
import math

np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))

train_max_row = int(math.floor(income.shape[0]*.8))

train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]
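
The same shuffle-and-slice split can also be done with scikit-learn's train_test_split helper. This is a sketch on a hypothetical 10-row frame (the toy DataFrame is an assumption, not the income data), and it won't produce the same rows as the manual split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the income data
toy = pd.DataFrame({"age": np.arange(20, 30), "high_income": [0, 1] * 5})

# Rows are shuffled by default; random_state makes the split repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

print(toy_train.shape, toy_test.shape)
```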

# 4: Evaluating error
We'll measure error with AUC, the area under the ROC curve. AUC ranges from 0 to 1, and is well suited to binary classification. The higher the AUC, the more accurate our predictions.
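
As a small worked example, AUC can be read as the fraction of (positive, negative) pairs in which the positive row gets the higher score. The labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, assumed for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# There are four (positive, negative) pairs. The positive row has the
# higher score in three of them; only 0.35 vs 0.4 is ranked wrongly.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Note that the true labels are passed first and the scores second.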

In [4]:
from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
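# Note: scikit-learn documents the signature as roc_auc_score(y_true, y_score).
# This call passes the predictions first, and the output below reflects that order.
error = roc_auc_score(predictions,test['high_income'])

print(error)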

0.70319628094


# 5: Compute error on the training set
The AUC for the predictions on the testing set is about .7. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting. 

It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.

In [5]:
predictions = clf.predict(train[columns])
print(roc_auc_score(predictions,train['high_income']))

0.973887981629


# 6: Decision tree overfitting
There's no hard and fast rule on when overfitting is happening, but our model is predicting the training set much better than it's predicting the test set. Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect it and fix it.

An unrestricted tree keeps splitting until it can predict the training set almost perfectly, but the rules it ends up with don't make sense when we step back.

These rules are very specific to the training set.

One fix is to "prune" the tree: remove some of the lower leaves, and make some of the higher up nodes into leaves instead.

A pruned tree has lower accuracy on our training set, but it will generalize better to new examples, because it matches reality better.

Trees overfit when they have too much depth, and make overly complex rules that match the training data, but aren't able to generalize well to new data.

This may seem strange at first, but past a certain point, adding more depth to a tree typically makes it perform worse on new data.
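
One way to see the effect of depth is to fit trees of increasing depth and compare training and test AUC. The sketch below uses synthetic data from make_classification rather than the income data, so the dataset and the depths tried are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a random class to about 20% of rows,
# which gives the tree noise it can overfit to
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

results = {}
for depth in [2, 4, 8, None]:   # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    clf.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, clf.predict(X_train))
    test_auc = roc_auc_score(y_test, clf.predict(X_test))
    results[depth] = (train_auc, test_auc)
    print(depth, round(train_auc, 3), round(test_auc, 3))
```

Training AUC can only go up as depth grows. The gap between the two columns is the thing to watch.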

# 7: Building a shallower tree
There are three main ways to combat overfitting:

- "Prune" the tree after building to remove unneeded leaves.
- Use ensembling to blend the predictions of many trees.
- Restrict the depth of the tree while you're building it.

We can restrict how deep the tree is built with a few parameters when we initialize the DecisionTreeClassifier class:

- max_depth -- this globally restricts how deep the tree can go.
- min_samples_split -- the minimum number of rows a node must contain before it can be split. For example, if this is set to 2, then nodes with only 1 row won't be split, and will become leaves instead.
- min_samples_leaf -- the minimum number of rows that a leaf must have.
- min_weight_fraction_leaf -- the minimum fraction of the total weight of the input rows that must be at a leaf. When rows aren't weighted, this is just the minimum fraction of input rows.
- max_leaf_nodes -- the maximum number of total leaves. This will cap the count of leaf nodes as the tree is being built.
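
To check what these parameters do to the fitted tree, we can inspect its depth, leaf count, and leaf sizes. This is a sketch on synthetic data. The leaf_sizes helper is hypothetical and written for this example, and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the income data
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.1,
                           random_state=1)

def leaf_sizes(clf):
    """Number of training rows that ended up in each leaf (hypothetical helper)."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1   # leaves have no children
    return tree.n_node_samples[is_leaf]

unrestricted = DecisionTreeClassifier(random_state=1).fit(X, y)
shallow = DecisionTreeClassifier(random_state=1, max_depth=3).fit(X, y)
big_leaves = DecisionTreeClassifier(random_state=1, min_samples_leaf=20).fit(X, y)
few_leaves = DecisionTreeClassifier(random_state=1, max_leaf_nodes=8).fit(X, y)

print("unrestricted:", unrestricted.get_depth(), unrestricted.get_n_leaves())
print("max_depth=3:", shallow.get_depth())
print("min_samples_leaf=20:", leaf_sizes(big_leaves).min())
print("max_leaf_nodes=8:", few_leaves.get_n_leaves())
```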



In [6]:
clf = DecisionTreeClassifier(random_state=1,min_samples_split=5)
clf.fit(train[columns],train['high_income'])

predictions_train = clf.predict(train[columns])
train_auc = roc_auc_score(predictions_train,train['high_income'])

predictions_test = clf.predict(test[columns])
test_auc = roc_auc_score(predictions_test,test['high_income'])

print(train_auc, test_auc)

0.934501028744 0.713198095219


# 8: More parameter tweaking
By setting min_samples_split to 5, we managed to boost test AUC to .713 from .703. Training set AUC decreased from .973 to .934, showing that the model we built was less overfit to the training set than before.

In [9]:
clf = DecisionTreeClassifier(random_state=1,max_depth=4,min_samples_split=25)
clf.fit(train[columns],train['high_income'])
predictions = clf.predict(test[columns])

test_auc = roc_auc_score(predictions,test['high_income'])

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train_predictions,train['high_income'])

print(test_auc, train_auc)

0.796980156519 0.790555914131


# 9: Tweaking the depth
We just improved AUC again. The test set AUC is .796, and the training set AUC is .790. We aren't overfitting anymore, as both AUCs are about the same.

In [10]:
clf = DecisionTreeClassifier(random_state=1,min_samples_split=100,max_depth=2)
clf.fit(train[columns],train['high_income'])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(predictions,test['high_income'])

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train_predictions,train['high_income'])

print(test_auc, train_auc)

0.774502262504 0.775022084636


# 10: Underfitting
Our AUC went down in the past screen relative to the screen before. This is because we're now underfitting. Underfitting is what happens when our model is too simple to actually explain the relations between the variables.

And here's the "right fit" tree. This tree explains the data properly, without overfitting:

[Figure: the "right fit" tree]

Let's trim this tree even more to show what happens when the model isn't complex enough to explain the data:

[Figure: the trimmed tree]

This model is too simple to model reality, in which younger people make less, middle-aged people make more, and elderly people make less.
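
The sketch below reproduces that situation with made-up data, assuming for illustration that only people aged 35 to 60 have high income. A tree with a single split can't capture a pattern that goes low, high, low, while a tree with two levels can.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up pattern, assumed for illustration: only ages 35 to 60 are high income
ages = np.arange(18, 81).reshape(-1, 1)
high_income = ((ages.ravel() >= 35) & (ages.ravel() <= 60)).astype(int)

# One split on age: too simple for a low / high / low pattern
stump = DecisionTreeClassifier(max_depth=1, random_state=1)
stump.fit(ages, high_income)

# Two levels of splits: enough to isolate the middle age band
two_level = DecisionTreeClassifier(max_depth=2, random_state=1)
two_level.fit(ages, high_income)

stump_accuracy = stump.score(ages, high_income)
two_level_accuracy = two_level.score(ages, high_income)
print(stump_accuracy, two_level_accuracy)
```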

# 11: The bias-variance tradeoff
By artificially restricting the depth of our tree, we prevent it from creating a complex enough model to correctly categorize some of the rows. If we don't perform the artificial restrictions, the tree becomes too complex, and fits quirks in the data that only exist in the training set, but don't generalize to new data.

High bias can cause underfitting -- if a model is consistently failing to predict the correct value, it may be that it is too simple to actually model the data.

High variance can cause overfitting -- if a model is very susceptible to small changes in the input data, and changes its predictions massively, then it is likely fitting itself to quirks in the training data, and not making a generalizable model.

In general, decision trees suffer from high variance. The whole structure of a decision tree can change if you make a minor alteration to its training data. By restricting the depth of the tree, we increase the bias and decrease the variance. If we restrict the depth too much, we increase bias to the point where it will underfit.
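
A rough way to observe this variance is to train several trees on slightly different resamples of the same data and count how often they disagree on new rows. This is a sketch on synthetic data. The disagreement function is a hypothetical helper written for this example, not a scikit-learn API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_fit, y_fit = X[:800], y[:800]
X_new = X[800:]

def disagreement(max_depth, n_trees=10, seed=1):
    """Average fraction of new rows on which two trees, trained on
    different bootstrap resamples, predict different classes."""
    rng = np.random.RandomState(seed)
    predictions = []
    for _ in range(n_trees):
        # Sample training rows with replacement to perturb the data slightly
        rows = rng.randint(0, len(X_fit), size=len(X_fit))
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=1)
        clf.fit(X_fit[rows], y_fit[rows])
        predictions.append(clf.predict(X_new))
    predictions = np.array(predictions)
    rates = [np.mean(predictions[i] != predictions[j])
             for i in range(n_trees) for j in range(i + 1, n_trees)]
    return float(np.mean(rates))

deep_rate = disagreement(max_depth=None)   # unrestricted trees
shallow_rate = disagreement(max_depth=2)   # heavily restricted trees
print(deep_rate, shallow_rate)
```

We'd expect the unrestricted trees to disagree with each other more often than the shallow ones, though the exact rates depend on the data and the seed.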

Generally, you'll need to use your intuition and manually tweak parameters to get the "right" fit.

# 12: Exploring decision tree variance
We can induce variance and see what happens with a decision tree. To add noise, we'll just add a column of random values.

A model with high variance (like a decision tree) will pick up on this noise, and overfit to it. This is because models with high variance are very sensitive to small changes in input data.

In [14]:
np.random.seed(1)

income['noise'] = np.random.randint(4,size=income.shape[0])

columns = ["noise", "age", "workclass", "education_num", "marital_status",
           "occupation", "relationship", "race", "sex", "hours_per_week",
           "native_country"]

train_max_row = int(math.floor(income.shape[0]*.8))
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns],train['high_income'])

predictions_train = clf.predict(train[columns])
train_auc = roc_auc_score(predictions_train,train['high_income'])

predictions_test = clf.predict(test[columns])
test_auc = roc_auc_score(predictions_test,test['high_income'])

print(train_auc, test_auc)

0.989559993763 0.697818001302


# 13: Pruning
As you can see above, the random noise column causes significant overfitting. Our test set AUC decreases to .697, and our training set AUC increases to .989.

Another technique for combating overfitting is called pruning. Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex, and can make a simpler model with higher accuracy on the testing set.
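
As a rough illustration, scikit-learn versions 0.22 and later offer one specific form of pruning, minimal cost-complexity pruning, through the ccp_alpha parameter. The sketch below uses synthetic data and an arbitrary ccp_alpha value. Both are assumptions, and this may not be the pruning method covered later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Full tree: grown until the leaves are pure
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Pruned tree: ccp_alpha > 0 removes subtrees whose impurity reduction
# doesn't justify their extra leaves (0.005 is an arbitrary choice)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.005)
pruned.fit(X_train, y_train)

full_leaves = full.get_n_leaves()
pruned_leaves = pruned.get_n_leaves()
print(full_leaves, pruned_leaves)
print(full.score(X_test, y_test), pruned.score(X_test, y_test))
```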

Pruning is less commonly used than parameter optimization (like we just did), and ensembling. That's not to say that it isn't an important technique, and we'll cover it in more depth down the line.

# 14: When to use decision trees
Let's go over the main advantages and disadvantages of decision trees. The main advantages of decision trees are:

- Easy to interpret
- Relatively fast to fit and make predictions
- Able to handle multiple types of data
- Can pick up nonlinearities in data, and are usually fairly accurate

The main disadvantage is a tendency to overfit.

In tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing, decision trees are a good choice. 

The most powerful way to reduce decision tree overfitting is to create ensembles of trees. A popular algorithm to do this is called random forest. We'll cover random forests in the next mission. In cases where prediction accuracy is the most important consideration, random forests usually perform better.
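
As a brief preview, the sketch below fits a single tree and a random forest on the same synthetic data and computes test AUC for each. The dataset and the n_estimators value are assumptions for illustration, and results will differ on the income data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# A single unrestricted tree
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# An ensemble of 50 trees, each fit on a different resample of the rows
forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X_train, y_train)

tree_auc = roc_auc_score(y_test, tree.predict(X_test))
forest_auc = roc_auc_score(y_test, forest.predict(X_test))
print(tree_auc, forest_auc)
```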