# Introducing Decision Trees
Decision Trees, unlike Logistic Regression, are an example of a nonparametric machine learning algorithm. A Decision Tree isn’t defined by a fixed list of parameters, as we’ll see in the upcoming lessons. Many people love decision trees because they are very easy to interpret: a tree is basically a flow chart of questions that you answer about a datapoint until you get to a prediction.
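
To make the flow-chart picture concrete, here is a hand-written sketch of what a very small tree could look like as code. The function name, the questions, and the thresholds are all invented for illustration; a fitted tree learns its own splits from the training data.

```python
def toy_tree_predict(pclass, male, age):
    """Hand-written stand-in for a small decision tree.

    The splits are hypothetical, chosen only to show the flow-chart shape.
    Returns 1 (survived) or 0 (did not survive).
    """
    if male:
        # Second question for male passengers: is the passenger a young child?
        if age < 10:
            return 1
        return 0
    # Second question for female passengers: which class did they travel in?
    if pclass < 3:
        return 1
    return 0


print(toy_tree_predict(pclass=3, male=True, age=22))
print(toy_tree_predict(pclass=1, male=False, age=30))
```

Each call walks one path from the root question down to a leaf, which is what makes the prediction easy to explain.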

In [1]:
# necessary import
from sklearn.tree import DecisionTreeClassifier as DT

import numpy as np
import pandas as pd


In [2]:
# pandas dataframe
df = pd.read_csv('../titanic.csv')
# create new column
df['Male'] = df['Sex'] == 'male'
# convert to array
features = ['Pclass','Male','Age','Siblings/Spouses','Parents/Children','Fare']
X = df[features].values
y = df['Survived'].values

In [3]:
from sklearn.model_selection import train_test_split as split
X_train, X_test, y_train, y_test = split(X, y, random_state=22)

In [4]:
# creating a decision tree model
model = DT()
model.fit(X_train,y_train)
print(model)

DecisionTreeClassifier()


In [5]:
print(model.predict([[3, True, 22, 1, 0, 7.25]]))

[0]
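
Note the double brackets in the call above: `predict` expects a 2-D input with one row per datapoint. The sketch below shows the same call alongside `predict_proba`, which reports the class proportions in the leaf a datapoint lands in. It uses a tiny made-up dataset (the names `X_toy`, `y_toy`, and `toy_model` are placeholders), not the Titanic file.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: columns are [Pclass, Male, Age]; not the Titanic file.
X_toy = np.array([[1, 0, 30], [1, 1, 40], [3, 1, 22],
                  [3, 0, 25], [2, 1, 8], [3, 1, 35]])
y_toy = np.array([1, 0, 0, 1, 1, 0])

toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# One row per passenger, one column per feature.
passenger = [[3, 1, 22]]
print(toy_model.predict(passenger))        # predicted class label
print(toy_model.predict_proba(passenger))  # class proportions in the leaf reached
print(toy_model.classes_)                  # column order of predict_proba
```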


In [6]:
from sklearn.model_selection import KFold
from sklearn.metrics import (accuracy_score,
    precision_score,recall_score,f1_score,
    precision_recall_fscore_support,confusion_matrix)


kf = KFold(n_splits=5, shuffle=True)
for criterion in ['gini', 'entropy']:
    print("Decision Tree - {}".format(criterion))
    accuracy = []
    precision = []
    recall = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        dt = DT(criterion=criterion)
        dt.fit(X_train, y_train)
        y_pred = dt.predict(X_test)
        accuracy.append(accuracy_score(y_test, y_pred))
        precision.append(precision_score(y_test, y_pred))
        recall.append(recall_score(y_test, y_pred))
    print("accuracy:", np.mean(accuracy))
    print("precision:", np.mean(precision))
    print("recall:", np.mean(recall), '\n')
    print()

Decision Tree - gini
accuracy: 0.7643940836666031
precision: 0.6927375779554301
recall: 0.7061688760011047 


Decision Tree - entropy
accuracy: 0.758795150130134
precision: 0.6907424299434834
recall: 0.6775865633482734 
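
The two criteria compared above are both measures of how mixed the classes are at a node. As a rough sketch of the standard formulas, Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of each proportion times its base-2 logarithm. The helper names and node contents below are made up for illustration.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k)) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


# Made-up node contents, not taken from the Titanic data.
mixed_node = [0, 0, 1, 1]   # half of each class: as impure as a binary node gets
pure_node = [1, 1, 1, 1]    # a single class: perfectly pure

print(gini(mixed_node), entropy(mixed_node))
print(gini(pure_node), entropy(pure_node))
```

The evenly mixed node scores 0.5 on Gini and 1 bit on entropy, while the pure node scores zero on both. Since both measures reward the same kind of split, it is not surprising that the two runs above give similar scores.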




# Visualizing Decision Tree
We want to create a PNG image of our tree. We'll use scikit-learn's export_graphviz function.

In [7]:
from sklearn.tree import export_graphviz as export

In [8]:
# export_graphviz returns the tree as a string in Graphviz DOT format
dot_file = export(model,feature_names=features)

In [9]:
# import the graphviz Python package (the Graphviz executables must also be installed)
import graphviz as visual
graph = visual.Source(dot_file)
file = 'dt_titanic'
graph.render(file,format='png',cleanup=True)

'dt_titanic.png'
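
If Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same structure. The sketch below runs on synthetic stand-in data with placeholder feature names (`f0` to `f3`), so the rules it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_features = ['f0', 'f1', 'f2', 'f3']

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the tree's rules as an indented plain-text string.
rules = export_text(small_tree, feature_names=demo_features)
print(rules)
```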

# Pruning Decision Trees
Decision Trees are incredibly prone to overfitting. Since the tree can keep adding nodes that split on features, the model can really dig deep into the specifics of the training set. We have a few options for limiting tree growth. Here are some commonly used pre-pruning techniques.
- Max depth: Only grow the tree up to a certain depth, or height of the tree
- Leaf size: Don’t make a split if it would leave fewer samples than a threshold in either of the resulting nodes
- Number of leaf nodes: Limit the total number of leaf nodes allowed in the tree


In [10]:
# pruning our tree
dt = DT(max_depth=4,min_samples_leaf=10,max_leaf_nodes=20)
dt.fit(X_train,y_train)
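
One way to check that limits like these took effect is to compare the size of a pruned tree against an unrestricted one using `get_depth()` and `get_n_leaves()`. The sketch below does this on synthetic stand-in data with some label noise added (an assumption made so the unrestricted tree has something to overfit); the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, with label noise so the full tree grows large.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, flip_y=0.2, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
pruned_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, max_leaf_nodes=20, random_state=0
).fit(X_demo, y_demo)

# Report how large each tree ended up.
for name, tree in [('full', full_tree), ('pre-pruned', pruned_tree)]:
    print(name, 'depth:', tree.get_depth(), 'leaves:', tree.get_n_leaves())
```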

In [11]:
def draw(model,features,name):
    dot_file = export(model,feature_names=features)
    graph = visual.Source(dot_file)
    graph.render(name,format='png',cleanup=True)
    return 0

In [12]:
draw(dt,features,'dt_titanic_prepruned')

0

# Grid Search
We’re not going to be able to intuit the best values for the pre-pruning parameters, but scikit-learn has a built-in grid search class that will find them for us.

In [13]:
from sklearn.model_selection import GridSearchCV as grd

GridSearchCV has four parameters that we’ll use:
1. The model (in this case a DecisionTreeClassifier)
2. Param grid: a dictionary of the parameter names and all the values to try for each
3. What metric to use (default is accuracy)
4. How many folds for k-fold cross validation

In [14]:
param_grid = {
    'max_leaf_nodes':[15,20,25,30],
    'max_depth':[5,8,10,12],
    'min_samples_leaf':[5,10,15]
    
}
# create a grid search object
gs = grd(dt,param_grid,scoring='f1',cv=5)

In [15]:
gs.fit(X,y)
print("Best params:", gs.best_params_)

Best params: {'max_depth': 12, 'max_leaf_nodes': 15, 'min_samples_leaf': 5}


In [16]:
print("Best score:",gs.best_score_)

Best score: 0.7679461659113965
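
Beyond `best_params_` and `best_score_`, a fitted search also exposes `cv_results_` (scores for every combination tried) and `best_estimator_` (a model refit with the winning parameters when `refit=True`, the default). The sketch below shows both on synthetic stand-in data with a deliberately small grid; `demo_grid` and `search` are placeholder names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a small grid to keep the run short.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_grid = {'max_depth': [2, 4], 'min_samples_leaf': [5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, scoring='f1', cv=5
)
search.fit(X_demo, y_demo)

# One row per parameter combination, sorted so the best-ranked comes first.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# With the default refit=True, best_estimator_ is already refit on all the data.
print(search.best_estimator_.predict(X_demo[:3]))
```

Looking at the full table helps judge whether the best combination is clearly ahead or only marginally better than its neighbours.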


In [17]:
from sklearn.linear_model import LogisticRegression as LgR

In [22]:
def compare_logistic_regression_decision_tree():
    kf = KFold(n_splits=5, shuffle=True)
    dt_accuracy_scores = []
    dt_precision_scores = []
    dt_recall_scores = []
    lr_accuracy_scores = []
    lr_precision_scores = []
    lr_recall_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        dt = DT()
        dt.fit(X_train, y_train)
        dt_accuracy_scores.append(dt.score(X_test, y_test))
        dt_y_pred = dt.predict(X_test)
        dt_precision_scores.append(precision_score(y_test, dt_y_pred))
        dt_recall_scores.append(recall_score(y_test, dt_y_pred))
        lr = LgR()
        lr.fit(X_train, y_train)
        lr_accuracy_scores.append(lr.score(X_test, y_test))
        lr_y_pred = lr.predict(X_test)
        lr_precision_scores.append(precision_score(y_test, lr_y_pred))
        lr_recall_scores.append(recall_score(y_test, lr_y_pred))
    print("Decision Tree")
    print("\tAccuracy:", np.mean(dt_accuracy_scores))
    print("\tPrecision:", np.mean(dt_precision_scores))
    print("\tRecall:", np.mean(dt_recall_scores))
    print("Logistic Regression")
    print("\tAccuracy:", np.mean(lr_accuracy_scores))
    print("\tPrecision:", np.mean(lr_precision_scores))
    print("\tRecall:", np.mean(lr_recall_scores))

In [26]:
compare_logistic_regression_decision_tree()

Decision Tree
	Accuracy: 0.7812988002285279
	Precision: 0.7149516065626427
	Recall: 0.7275785783920206
Logistic Regression
	Accuracy: 0.8015743033073065
	Precision: 0.7649462243249993
	Recall: 0.7027226721625388
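
Because `KFold(shuffle=True)` is created without a `random_state`, the folds change on every run, so these numbers will vary slightly each time. As a possible alternative, the sketch below uses scikit-learn's `cross_validate` with a seeded `KFold` so both models are scored on identical folds. It runs on synthetic stand-in data, and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in the real X and y to rerun the comparison.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Fixing random_state makes both models see exactly the same folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
metrics = ['accuracy', 'precision', 'recall']
candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

summary = {}
for name, estimator in candidates.items():
    # cross_validate returns one array of per-fold scores for each metric.
    scores = cross_validate(estimator, X_demo, y_demo, cv=folds, scoring=metrics)
    summary[name] = {m: scores['test_' + m].mean() for m in metrics}
    print(name, summary[name])
```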
