# Application: Decision Trees Part 2

## Decision Trees Part 2

In this notebook we apply decision trees to identify the author of an email (Chris or Sara).

In [2]:
import os
import sys
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

# Need these so that Jupyter can find the datasets and the various utilities 
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [3]:
from utilities.email_preprocess import preprocess


In [6]:
# preprocess the data
features_train, features_test, labels_train, labels_test = preprocess(words_file ="data/machine_learning/word_data.pkl",
                                                                     authors_file="data/machine_learning/email_authors.pkl")

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [8]:
clf_dt = tree.DecisionTreeClassifier()
clf_dt.fit(features_train, labels_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Question: What is the accuracy of the DT?

In [9]:
# predict 
pred = clf_dt.predict(features_test)

In [10]:
# compute accuracy (accuracy_score expects the true labels first, then the predictions)
accuracy = accuracy_score(labels_test, pred)
print("DT accuracy is ", accuracy)

DT accuracy is  0.98976109215
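
As a reminder of what this number means, accuracy is the fraction of test points whose predicted label matches the true label. The sketch below uses a small made-up label array (not the email data) to show that ```accuracy_score``` agrees with the hand computation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration (not the email data).
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# accuracy_score takes the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)

# The same quantity by hand: the fraction of positions that match.
acc_manual = np.mean(y_true == y_pred)

print("accuracy_score:", acc)
print("by hand:       ", acc_manual)
```

Three of the four toy predictions are correct, so both values are 0.75.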


### Question: What does the ```min_samples_split``` parameter do?

Calling the classifier's ```fit``` method returns the fitted estimator, and the notebook displays the parameters it is using. One of them is ```min_samples_split```, the minimum number of samples a node must contain before it can be split further. For example, if a node holds 10 samples and ```min_samples_split``` is 11, that node is not subdivided and becomes a leaf.
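
The effect of ```min_samples_split``` can be checked by inspecting a fitted tree. The following is a minimal sketch on synthetic data from ```make_classification``` (an assumption made so the example runs on its own; it is not the email data). It fits one tree with the default value and one with ```min_samples_split=40```, then looks up the smallest node that was actually split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, chosen only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Default: any node with at least 2 samples may be split.
default_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Stricter: a node needs at least 40 samples before it may be split.
coarse_tree = DecisionTreeClassifier(min_samples_split=40, random_state=0).fit(X, y)

print("nodes with min_samples_split=2: ", default_tree.tree_.node_count)
print("nodes with min_samples_split=40:", coarse_tree.tree_.node_count)

# Internal (split) nodes are those that have a left child.
t = coarse_tree.tree_
is_internal = t.children_left != -1
smallest_split_node = t.n_node_samples[is_internal].min()
print("smallest node that was split:", smallest_split_node)
```

Every node that was split in the second tree holds at least 40 samples; nodes below that size are left as leaves.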

### Feature selection

We already know that we can improve the performance of a classifier by tuning its parameters. Another way to improve performance is feature selection.

### Question: How many features are in the training data?

The data is organized into a numpy array with one row per data point and one column per feature, so the number of features is the length of any row:

In [11]:
print("Number of features is: ",len(features_train[0]))

3785

We can change the number of features by changing the ```percentile``` parameter of ```preprocess```:

In [4]:
features_train, features_test, labels_train, labels_test = preprocess(words_file ="data/machine_learning/word_data.pkl",
                                                                     authors_file="data/machine_learning/email_authors.pkl",
                                                                     percentile=1)

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [5]:
print("Number of features is: ",len(features_train[0]))

379
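
The source of the ```preprocess``` helper is not shown in this notebook, so how it uses ```percentile``` is an assumption here. A common way to implement such a parameter is scikit-learn's ```SelectPercentile```, which keeps the given percentage of highest-scoring features. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data with 100 features (not the email data).
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)

# Keep the top 10% of features, ranked by the ANOVA F-score.
# Assumption: preprocess does something similar; its code is not shown.
selector = SelectPercentile(f_classif, percentile=10)
X_selected = selector.fit_transform(X, y)

print("features before selection:", X.shape[1])
print("features after selection: ", X_selected.shape[1])
```

With 100 input features and ```percentile=10```, 10 features remain. The same selector fitted on the training data would then be applied to the test data with ```selector.transform```.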

### Question: Would a large number of features give you a more or less complex decision tree all other things being equal?

In general, a larger number of features gives a more complex decision tree, because the tree has more candidate splits with which to fit the training data closely.
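
To check this on a particular dataset, one option is to measure the size of the fitted tree after each call to ```fit``` and compare the numbers. The sketch below shows only the measurement; ```describe_tree``` is a hypothetical helper introduced for illustration, and the data is synthetic so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def describe_tree(clf):
    """Return (node count, leaf count, depth) of a fitted decision tree."""
    return clf.tree_.node_count, clf.get_n_leaves(), clf.get_depth()


# Synthetic stand-in data (not the email data).
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
nodes, leaves, depth = describe_tree(clf)

print("nodes:", nodes, "leaves:", leaves, "depth:", depth)
```

Calling the same helper on ```clf_dt``` after fitting with different ```percentile``` values would show how the tree size changes on the email data; the outcome depends on the data, so no particular result is assumed here.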

### Question: What is the accuracy of the decision tree when we use only 1% of the available features? 

In [6]:
clf_dt = tree.DecisionTreeClassifier()
clf_dt.fit(features_train, labels_train)
# predict 
pred = clf_dt.predict(features_test)
# compute accuracy (true labels first, then the predictions)
accuracy = accuracy_score(labels_test, pred)
print("DT accuracy is ", accuracy)

DT accuracy is  0.988054607509
