### Decision Trees
In the past two Decision Tree Notebook, we learned about how decision trees are constructed. We used a modified version of <b>ID3</b>, which is a bit simpler than the most common tree building algorithms, [C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm), and [CART](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).
However, the basics are all the same, and so we can apply the principles we learned about how decision trees work to any tree construction algorithm.

1. <b>Using the Decision Tree with scikit_Learn</b>
2. <b> Splitting the Data into Train and Test Sets </b>
3. <b> Evaluating the error</b>
4. <b> Computuing the error on the training set </b>
5. <b> Decision Tree Overfitting </b>
6. <b>Building A Shallower Tree</b>
7. <b>More Parameter Tweaking</b>
8. <b> Exploring Decision Tree Variance</b>
9. <b> Pruning</b>
10. <b>When To Use Decision Trees</b>

In [1]:
import json
import matplotlib
import warnings
import pandas as pd
import numpy as np
import math
import pickle
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


warnings.simplefilter("ignore")
root = r"/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/styles/bmh_matplotlibrc.json"
s = json.load(open(root))
matplotlib.rcParams.update(s)
% matplotlib inline

In [2]:
_headers =['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
         'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']

income = pickle.load(open("/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/data/income.pickle", "rb"))

### Using Decision Trees With Scikit-Learn
We can use scikit-Learn package to fit decision tree.
We use the <i><b>DecisionTreeClassifier</b></i> class for classification problems and <i><b>DecisionTreeRegressor</b></i> for regression problems. Both of these classes are in the <i><b>sklearn.tree</b></i> package.

In [3]:
features = _headers

# Instantiate the classifier
# set random_state to 1, to keep the result consistent
clf = DecisionTreeClassifier(random_state = 1)

clf.fit(income[features], income["high_income"])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')

In [4]:
class DecisionTreeClass:
    def __init__(self, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None,
                min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1,
                splitter ='best'):
        
        """
        Convenience function:
            initialize a DecisionTreeClassifier
        """
        self.max_depth = max_depth
        #self.min_sample_leaf = min_sample_leaf
        self.min_samples_split = min_samples_split
        self.random_state = random_state
    
    def learn(self, X, y):
        clf = DecisionTreeClassifier(max_depth = self.max_depth, min_samples_split=self.min_samples_split,
                                     random_state = self.random_state)
        clf.fit(X, y)
        return clf
        
    def predict(self, clf, new_X):
        predictions = clf.predict(new_X)
        return predictions
    
    def compute_score(self, predictions, labels):
        auc_score = roc_auc_score(labels, predictions)
        return auc_score

### Splitting the Data into Train and Test Sets 
Now that we've fit a model, we can make predictions. We'll want to split our data into training and testing sets first. If we don't, we'll be making predictions on the same data that we train our algorithm with. This leads to overfitting, and will make our error appear lower than it is.<br>

<b>Overfitting</b> is the first example, where you memorize the details of the training set, but are unable to generalize to new examples that you're asked to make predictions on

In [5]:
shuffled_index = np.random.permutation(income.index)
shuffled_income = income.iloc[shuffled_index]

split_line = math.floor(income.shape[0] * .8)
train = shuffled_income[:split_line]
test = shuffled_income[split_line:]

### Evaluating the error
While there are many methods to evaluate error with classification, we'll use [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve), which we covered extensively earlier in the machine learning material.

AUC ranges from 0 to 1, and is ideal for binary classification. The higher the AUC, the more accurate our predictions.

In [6]:
clf = DecisionTreeClassifier(random_state = 1)
clf.fit(train[features], train["high_income"])

predictions = clf.predict(test[features])

auc_score = roc_auc_score(test["high_income"], predictions)
print(auc_score)

0.741645009914


### Computuing the error on the training set 
The AUC for the predictions on the testing set is about .694. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting.

It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is <b>significantly higher</b> than the AUC between test set predictions and actual values, it's a sign that the model may be <i><b>overfitting</b></i>.

In [7]:
train_predictions = clf.predict(train[features])
auc_score_train = roc_auc_score(train["high_income"], train_predictions)
auc_score_train

1.0

### Decision Tree Overfitting
Our AUC on the training set was perfect. The AUC on the test set was 0.743. There's no hard and fast rule on when overfitting is happening, but our model is predicting the training set much better than it's predicting the test set. Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect it and fix it.

Trees overfit when they have too much depth, and make overly complex rules that match the training data, but aren't able to generalize well to new data.

This may seem to be a strange principle at first, but the more depth a tree has, typically the worse it performs on new data

### Building A Shallower Tree
There are three main ways to combat overfitting :

1. <b>Prune</b> the tree after building to remove unneeded leaves
2. Use <b>ensembling</b> to blend the predictions of many trees.
3. Restrict the <b>depth</b> of the tree while you're building it.

We'll explore all of these, but we'll look at the third method first.

By controlling how deep the tree can go while we build it, we keep the rules more general than they would be otherwise. This prevents the tree from overfitting

<i>
We can restrict how deep the tree is built with a few parameters when we initialize the $DecisionTreeClassifier$ class:

<b>max_depth</b> -- this globally restricts how deep the tree can go.

<b>min_samples_split</b> -- The minimum number of rows needed in a node before it can be split. For example, if this is set to 2, then nodes with 2 rows won't be split, and will become leaves instead.

<b>min_samples_leaf</b> -- the minimum number of rows that a leaf must have.

<b>min_weight_fraction_leaf</b> -- the fraction of input rows that are required to be at a leaf.

<b>max_leaf_nodes</b> -- the maximum number of total leaves. This will cap the count of leaf nodes as the tree is being built.

As you can see, some of these parameters don't make sense together. Having max_depth and max_leaf_nodes together isn't allowed

In [8]:
min_samples_split = 13

# Initialize the model
model = DecisionTreeClass(min_samples_split = 13)

# Learn, predict and compute score
clf = model.learn(train[features], train['high_income'])
test_predictions = model.predict(clf, test[features])
auc_score_test = model.compute_score(test['high_income'], test_predictions)

train_predictions = model.predict(clf, train[features])
auc_score_train = model.compute_score(train['high_income'], train_predictions)

print("train score {0} \n test_score {1}".format(auc_score_train, auc_score_test))

train score 0.9178262040397265 
 test_score 0.7594088171011247


### More Parameter Tweaking
It seem we reduce a bit overfitting of our previous model.
Now let's play with other parameters

In [9]:
min_samples_split = 13
max_depth = 7
random_state = 1

# Initialize the model
model = DecisionTreeClass(min_samples_split = min_samples_split, random_state = random_state, max_depth = max_depth)

# Learn, predict and compute score
clf = model.learn(train[features], train['high_income'])
test_predictions = model.predict(clf, test[features])
auc_score_test = model.compute_score(test['high_income'], test_predictions)

train_predictions = model.predict(clf, train[features])
auc_score_train = model.compute_score(train['high_income'], train_predictions)

print("train score {0} \n test_score {1}".format(auc_score_train, auc_score_test))

train score 0.8329212397511541 
 test_score 0.8196028773699573


We aren't overfitting anymore since both AUC valeus are about the same. Let's tweak the parameters more aggressively, and see what happens!

In [10]:
min_samples_split = 100
max_depth = 2
random_state = 1

# Initialize the model
model = DecisionTreeClass(min_samples_split = min_samples_split, random_state = random_state, max_depth = max_depth)

# Learn, predict and compute score
clf = model.learn(train[features], train['high_income'])
test_predictions = model.predict(clf, test[features])
auc_score_test = model.compute_score(test['high_income'], test_predictions)

train_predictions = model.predict(clf, train[features])
auc_score_train = model.compute_score(train['high_income'], train_predictions)

print("train score {0} \n test_score {1}".format(auc_score_train, auc_score_test))

train score 0.7953180123943951 
 test_score 0.7842609522908529


Our accuracy went down.

This is because we're now underfitting. Underfitting is what happens when our model is too simple to actually explain the relations between the variables.

### The Bias-Variance Tradeoff
By artificially restricting the depth of our tree, we prevent it from creating a complex enough model to correctly categorize some of the rows. If we don't perform the artificial restrictions, the tree becomes too complex, and fits quirks in the data that only exist in the training set, but don't generalize to new data.

This is known as the bias-variance tradeoff. If we take a random sample of training data and create many models, if the predictions of the models for the same row are far apart from each other, we have high variance. If we take a random sample of training data, and create many models, and the predictions of the models for the same row are close together, but far from the actual value, then we have high bias.

High bias can cause underfitting -- if a model is consistently failing to predict the correct value, it may be that it is too simple to actually model the data.

High variance can cause overfitting -- if a model is very susceptible to small changes in the input data, and changes its predictions massively, then it is likely fitting itself to quirks in the training data, and not making a generalizable model.

It's called the bias-variance tradeoff because decreasing one will usually increase the other. This is a limitation of all machine learning algorithms.

In general, decision trees suffer from high variance. The whole structure of a decision tree can change if you make a minor alteration to its training data. By restricting the depth of the tree, we increase the bias and decrease the variance. If we restrict the depth too much, we increase bias to the point where it will underfit.

Generally, you'll need to use your intuition and manually tweak parameters to get the "right" fit.

### Exploring Decision Tree Variance

We can induce variance and see what happens with a decision tree. To add noise to the data, we'll just add a column of random values. A model with high variance (like a decision tree) will pick up on this noise, and overfit to it. This is because models with high variance are very sensitive to small changes in input data.

In [11]:
np.random.seed(1)
income['noise'] = np.random.randint(4, size = income.shape[0])
_headers =['noise', 'age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
         'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']

split_line = math.floor(income.shape[0] * .8)
shuffled_index = np.random.permutation(income.index)
shuffled_income = income.loc[shuffled_index]
train = shuffled_income[:split_line]
test = shuffled_income[split_line:]

features = _headers[:-1]
# Initialize the model
model = DecisionTreeClass(random_state = 1)

# Learn, predict and compute score
clf = model.learn(train[features], train['high_income'])
test_predictions = model.predict(clf, test[features])
auc_score_test = model.compute_score(test['high_income'], test_predictions)

train_predictions = model.predict(clf, train[features])
auc_score_train = model.compute_score(train['high_income'], train_predictions)

print("train score {0} \n test_score {1}".format(auc_score_train, auc_score_test))


train score 0.9999747129924645 
 test_score 0.7437527265252553


### Pruning
As you can see above, the random noise column causes significant overfitting. Our test set accuracy decreases to .743, and our training set accuracy increases to .999

As you can see above, the random noise column causes significant overfitting. Our test set accuracy decreases to .691, and our training set accuracy increases to .975.

One way to prevent overfitting that we tried before was to prevent the tree from growing beyond a certain depth. Another technique is called $pruning$. Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex, and can make a simpler model with higher accuracy on the testing set.

Pruning is less commonly used than <b>parameter optimization</b> (like we just did), and <b>ensembling</b>. That's not to say that it isn't an important technique, and we'll cover it in more depth down the line.


### When To Use Decision Trees

Let's go over the main advantages and disadvantages of decision trees. The main advantages of decision trees are:

1. Easy to interpret
2. Relatively fast to fit and make predictions
3. Able to handle multiple types of data
4. Can pick up nonlinearities in data, and are usually fairly accurate

The main disadvantage is a tendency to overfit.

<b>In tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing, decision trees are a good choice.</b>

The most powerful way to reduce decision tree overfitting is to create ensembles of trees. A popular algorithm to do this is called random forest.
In cases where prediction accuracy is the most important consideration, random forests usually perform better.