# Decision Trees 

- Can be applied to both classification and regression problems
- Drawn as an upside-down tree: the root is at the top and the leaves are at the bottom
- Made up of a root node, internal nodes (branches), and terminal nodes (leaves)


# Regression Tree
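- Built by recursive binary splitting: at each step, choose the split that minimizes the residual sum of squares (RSS), i.e. the sum of squared deviations from the mean within each of the two resulting partitions.
- A top-down, greedy approach: each split is the best available at that step, with no look-ahead to later splits.
- A fully grown tree can overfit, so it is usually pruned, with the amount of pruning chosen by cross-validation.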
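
A single step of the splitting rule can be sketched in plain Python. This is only an illustration on one feature with made-up toy data; `rss` and `best_split` are hypothetical helper names, not scikit-learn functions. A real tree would apply the same search over every feature and then recurse into each partition.

```python
def rss(values):
    """Sum of squared deviations from the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the binary split on x that
    minimizes RSS(left) + RSS(right)."""
    best_threshold, best_rss = None, float("inf")
    xs = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Toy data (hypothetical): the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total = best_split(x, y)
print(threshold, round(total, 2))  # prints: 3.5 0.16
```

The search is greedy in the sense described above: it only scores the split in front of it and never checks whether a worse split now would allow a better one later.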

![alt text](regressiontreeinr.png "Title")

# Classification Tree
- Similar to a regression tree, but the response is qualitative rather than quantitative.
- Instead of RSS, splits can be scored with the classification error rate, cross-entropy or, most commonly, the Gini index.
- The Gini index measures node purity: it is small when most of the observations in a node belong to the same class. A score is computed at each split.
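
The Gini index of a node is 1 minus the sum of the squared class proportions in that node. A small sketch, with hypothetical helper names `gini` and `split_gini`, shows how a pure node scores 0 and how the two children of a split are combined by weighting each with its share of the observations:

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum_k p_k^2. 0.0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Score of a split: child Gini values weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


pure = gini(["a", "a", "a", "a"])     # every observation in one class
mixed = gini(["a", "a", "b", "b"])    # evenly mixed two-class node
after_split = split_gini(["a", "a", "a"], ["a", "b", "b", "b"])
print(pure, mixed, round(after_split, 3))  # prints: 0.0 0.5 0.214
```

A tree-growing procedure would compute `split_gini` for each candidate split and keep the one with the lowest score.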

In [81]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [82]:
iris = load_iris()
X= iris.data
y= iris.target

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)


In [84]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [85]:
y_pred = dtree.predict(X_test)

In [86]:
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

[[12  0  0]
 [ 0 14  5]
 [ 0  1 13]]


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       0.93      0.74      0.82        19
          2       0.72      0.93      0.81        14

avg / total       0.89      0.87      0.87        45
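
To connect the two outputs above, the per-class precision and recall can be recomputed by hand from the confusion matrix. This sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes); `precision`, `recall`, and `accuracy` are hypothetical helper names.

```python
# Confusion matrix printed above for the single decision tree.
cm = [
    [12, 0, 0],
    [0, 14, 5],
    [0, 1, 13],
]


def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k


def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


def accuracy(cm):
    """Diagonal total divided by the total number of observations."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


for k in range(3):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
print(round(accuracy(cm), 2))
# prints:
# 0 1.0 1.0
# 1 0.93 0.74
# 2 0.72 0.93
# 0.87
```

The five class-1 flowers predicted as class 2 are what pull class-1 recall down to 0.74 and class-2 precision down to 0.72.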



# A reason for decision trees

- Can capture non-linear decision boundaries

![alt text](classdemo_03.png "Title")

# Advantages of DT
- Easy to explain and interpret. 
- Resemble the yes/no, left/right choices familiar from human decision-making

# Disadvantages

- Tend to have lower predictive accuracy than many other models
- Suffer from high variance: if we split the training data into two halves at random and fit a decision tree to each half, the two trees could be quite different.

# How can decision trees be improved?
- Bootstrap aggregating(bagging), Boosting and Random Forests

# Bagging
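- Takes repeated bootstrap samples (drawn with replacement) from the single training data set, fits a tree to each, and combines their predictions to decrease variance.
- The number of estimators is not a critical parameter: adding more trees does not lead to overfitting, so use enough for the error to settle down.
- Bagging improves prediction accuracy at the expense of interpretability, since there is no longer a single tree to inspect.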
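
The mechanics can be sketched without scikit-learn. The version below is a simplification, not what `BaggingClassifier` does internally: it assumes a one-feature, one-split "stump" as the base learner instead of a full tree, uses made-up toy data, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(x, y):
    """Fit a one-split classifier on a single feature: choose the threshold
    with the fewest training errors, predicting the majority class per side."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        left_cls = Counter(left).most_common(1)[0][0]
        right_cls = Counter(right).most_common(1)[0][0]
        errors = (sum(yi != left_cls for yi in left)
                  + sum(yi != right_cls for yi in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_cls, right_cls)
    if best is None:  # all x identical: fall back to the majority class
        cls = Counter(y).most_common(1)[0][0]
        return (float("inf"), cls, cls)
    return best[1:]


def predict_stump(stump, xi):
    t, left_cls, right_cls = stump
    return left_cls if xi < t else right_cls


def bagged_stumps(x, y, n_estimators=25, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # n rows, with replacement
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, xi):
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(s, xi) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): class 0 for small x, class 1 for large x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = bagged_stumps(x, y)
predictions = [predict_bagged(ensemble, xi) for xi in (1.5, 7.5)]
print(predictions)
```

Each stump sees a slightly different sample and so picks a slightly different threshold; the vote averages those differences out, which is where the variance reduction comes from. It is also why the ensemble is harder to interpret than any one of its members.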

In [87]:
from sklearn.ensemble import BaggingClassifier

In [88]:
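bag = BaggingClassifier(DecisionTreeClassifier())
# Note: fitting on all of X, y means X_test was seen during training, so the
# scores below are optimistic; use X_train, y_train for a held-out estimate.
bag.fit(X,y)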

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [89]:
y_pred = bag.predict(X_test)

In [90]:
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

[[12  0  0]
 [ 0 19  0]
 [ 0  0 14]]


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       1.00      1.00      1.00        19
          2       1.00      1.00      1.00        14

avg / total       1.00      1.00      1.00        45



# Boosting
- Instead of growing each tree independently and combining them as in bagging, boosting grows its trees sequentially.
- Each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
- New trees are fit to the areas where the current model performs poorly. As a result, boosting can overfit if too many trees are used.
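
One way to make "a modified version of the original data set" concrete is boosting for regression, where each new tree is fit to the residuals of the current model. The sketch below assumes that variant, with one-split regression stumps and a shrinkage parameter; it is not the AdaBoost algorithm used in the cells that follow, and the function names and toy data are hypothetical.

```python
def fit_regression_stump(x, r):
    """One-split regression tree: pick the threshold minimizing the RSS of r,
    predicting the mean of r on each side."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((v - lm) ** 2 for v in left)
               + sum((v - rm) ** 2 for v in right))
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    return best[1:]


def boost(x, y, n_trees=200, learning_rate=0.1):
    """Grow stumps sequentially, each one fit to the current residuals."""
    residuals = list(y)  # the initial model predicts 0, so residuals equal y
    stumps = []
    for _ in range(n_trees):
        t, lm, rm = fit_regression_stump(x, residuals)
        # Store the shrunken contribution of this stump.
        stumps.append((t, learning_rate * lm, learning_rate * rm))
        # Remove what the new stump explains from the residuals.
        residuals = [ri - learning_rate * (lm if xi < t else rm)
                     for xi, ri in zip(x, residuals)]
    return stumps


def predict(stumps, xi):
    """The boosted prediction is the sum of all stump contributions."""
    return sum(left if xi < t else right for t, left, right in stumps)


# Toy data (hypothetical).
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
model = boost(x, y)
initial_rss = sum(yi ** 2 for yi in y)
train_rss = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y))
print(round(initial_rss, 2), round(train_rss, 4))
```

Because every stump chases whatever error is left, the training error keeps shrinking as `n_trees` grows. That is the mechanism behind the overfitting risk noted above, and the reason the number of trees and the learning rate are usually tuned on held-out data.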


In [91]:
from sklearn.ensemble import AdaBoostClassifier

In [92]:
ada = AdaBoostClassifier()
# As with bagging above, this fits on all of X, y, so X_test is not held out.
ada.fit(X,y)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [93]:
y_pred = ada.predict(X_test)


In [94]:
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

[[12  0  0]
 [ 0 17  2]
 [ 0  1 13]]


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       0.94      0.89      0.92        19
          2       0.87      0.93      0.90        14

avg / total       0.94      0.93      0.93        45



# Random Forest(tm)
- Similar to bagging: we build a number of decision trees on bootstrapped training samples.
- Unlike bagging, each split considers only a random subset of the features rather than all of them.
- This aims to level the playing field: one very strong predictor no longer ends up at the root node of nearly every tree, so a number of other moderately strong predictors get used as well and the trees are less correlated.
- Increasing the number of trees does not cause a random forest to overfit, so in practice it is far less prone to overfitting than a single tree.
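
The feature-sampling step is the only part that differs from bagging, so it is the part sketched here. The subset size `m = floor(sqrt(n_features))` is a common choice for classification and is an assumption of this sketch, as are the helper name `candidate_features` and the number of simulated splits.

```python
import math
import random


def candidate_features(n_features, rng, m=None):
    """Pick the feature indices a single split is allowed to consider.

    m defaults to floor(sqrt(n_features)), assumed here as a common choice.
    """
    if m is None:
        m = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)
n_features = 4    # iris has four features
n_splits = 1000   # hypothetical number of splits across a forest

subsets = [candidate_features(n_features, rng) for _ in range(n_splits)]

# Suppose feature 0 is the single strongest predictor. With bagging it is
# eligible at every split; here it is eligible at only about m / n_features.
share = sum(0 in s for s in subsets) / n_splits
print(subsets[:3])
print(round(share, 2))
```

With four features and two candidates per split, the strongest feature is unavailable at roughly half of the splits, which forces the other features into the trees.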

In [95]:
from sklearn.ensemble import RandomForestClassifier

In [96]:
rfc = RandomForestClassifier()
# As with bagging above, this fits on all of X, y, so X_test is not held out.
rfc.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [97]:
y_pred = rfc.predict(X_test)

In [98]:
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

[[12  0  0]
 [ 0 18  1]
 [ 0  0 14]]


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       1.00      0.95      0.97        19
          2       0.93      1.00      0.97        14

avg / total       0.98      0.98      0.98        45

