# Decision Tree Mini Project

In this project, we will classify emails again, this time using a decision tree. The starter code is in ```decision_tree/dt_author_id.py```.

### Part 1: Get the Decision Tree Running
Get the decision tree up and running as a classifier, setting ```min_samples_split=40```.  It will probably take a while to train.  What’s the accuracy?

### Part 2: Speed It Up
You found in the SVM mini-project that parameter tuning can significantly speed up the training time of a machine learning algorithm. As a general rule, the parameters tune the complexity of the algorithm, and more complex algorithms generally run more slowly.
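
As a rough illustration of how a parameter controls complexity, the sketch below fits two decision trees that differ only in ```min_samples_split``` and compares their node counts. The synthetic dataset from ```make_classification``` and the variable names are assumptions made for this example; they are not part of the project's starter code.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the email dataset).
# flip_y adds label noise, so an unconstrained tree grows large to fit it.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)

node_counts = {}
for min_split in (2, 40):
    # A node is only split further if it holds at least min_split samples.
    clf = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    clf.fit(X, y)
    node_counts[min_split] = clf.tree_.node_count
    print("min_samples_split =", min_split, "-> nodes in tree:", node_counts[min_split])
```

A larger ```min_samples_split``` stops splitting earlier, so the tree should come out smaller and simpler.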

Another way to control the complexity of an algorithm is via the number of features that you use in training/testing.  The more features the algorithm has available, the more potential there is for a complex fit.  We will explore this in detail in the “Feature Selection” lesson, but you’ll get a sneak preview now.
+ Find the number of features in your data.  The data is organized into a numpy array where the number of rows is the number of data points and the number of columns is the number of features; so to extract this number, use a line of code like ```len(features_train[0])```
+ Go into ```tools/email_preprocess.py```, and find the line of code that looks like this:     ```selector = SelectPercentile(f_classif, percentile=10)```  Change ```percentile``` from 10 to 1
+ What’s the number of features now?
+ What do you think SelectPercentile is doing?  Would a large value for percentile lead to a more complex or less complex decision tree, all other things being equal?
+ Note the difference in training time depending on the number of features
+ What’s the accuracy when percentile = 1?
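
To get a feel for the steps above before touching the email data, here is a minimal sketch that applies ```SelectPercentile``` to a synthetic dataset and counts the surviving features. The data from ```make_classification``` and the names ```X```, ```y```, and ```kept``` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data (an assumption, not the email dataset): 200 features.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=42)

# For a 2-D numpy array, the number of columns is the number of features.
print("features before selection:", len(X[0]))  # same value as X.shape[1]

kept = {}
for pct in (10, 1):
    # Score each feature with the ANOVA F-test and keep the top pct percent.
    selector = SelectPercentile(f_classif, percentile=pct)
    X_reduced = selector.fit_transform(X, y)
    kept[pct] = X_reduced.shape[1]
    print("percentile =", pct, "-> features kept:", kept[pct])
```

The lower the percentile, the fewer features remain, which leaves the tree less room for a complex fit.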



In [1]:
import pickle
#import cPickle # http://bit.ly/2ibKHa3
import _pickle as cPickle
import numpy

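# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same train_test_split, so alias it to keep the calls below unchanged
from sklearn import model_selection as cross_validation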
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

def preprocess(words_file = "../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """
    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()
    ### test_size is the fraction of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)
    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()
    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails :" , len(labels_train)-sum(labels_train))
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test



In [2]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails : 7884


In [3]:
#+--------------------------------------------------------------------+
#| 'features_train' are the features for the training                 |
#| 'features_test' are the testing datasets                           |
#| "labels_train" and "labels_test" are the corresponding item labels |
#+--------------------------------------------------------------------+
def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """ Compute the accuracy of a decision tree classifier
        (the function name is kept as-is because later cells call it) """
    from time import time
    ### import the sklearn decision tree module
    from sklearn import tree
    from sklearn.metrics import accuracy_score
    ### create classifier
    clf = tree.DecisionTreeClassifier(min_samples_split=40)
    ### fit the classifier on the training features and labels
    t0 = time()
    clf.fit(features_train, labels_train)
    print("Training time  :", round(time()-t0, 3), "s")
    ### use the trained classifier to predict labels for the test features
    t0 = time()
    pred = clf.predict(features_test)
    print("Predicting time:", round(time()-t0, 3), "s")
    ### calculate and return the accuracy on the test data
    ### accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(labels_test, pred)
    return accuracy

In [4]:
NBAccuracy(features_train, labels_train, features_test, labels_test)

Training time  : 87.706 s
Predicting time: 0.047 s


0.97781569965870307

In [5]:
len(features_train[0])

3785

In [6]:
def preprocess(words_file = "../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """
    ### Reducing percentile
    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()
    ### test_size is the fraction of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)
    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=1)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()
    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails :" , len(labels_train)-sum(labels_train))
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test

In [7]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails : 7884


In [8]:
NBAccuracy(features_train, labels_train, features_test, labels_test)

Training time  : 5.854 s
Predicting time: 0.004 s


0.96643913538111492

In [9]:
len(features_train[0])

379
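
To put the two runs side by side, the short sketch below computes the ratios from the numbers printed in the cells above. The values are copied from this notebook's outputs; the dictionary layout and variable names are assumptions made for the example, and timings will differ on other machines.

```python
# Results copied from the two runs in this notebook, keyed by percentile.
runs = {
    10: {"n_features": 3785, "train_s": 87.706, "accuracy": 0.97781569965870307},
    1: {"n_features": 379, "train_s": 5.854, "accuracy": 0.96643913538111492},
}

# How many times fewer features, and how many times faster training was.
feature_ratio = runs[10]["n_features"] / runs[1]["n_features"]
speedup = runs[10]["train_s"] / runs[1]["train_s"]
# Accuracy lost by dropping from percentile=10 to percentile=1.
accuracy_drop = runs[10]["accuracy"] - runs[1]["accuracy"]

print("feature reduction factor:", round(feature_ratio, 2))
print("training speedup factor :", round(speedup, 2))
print("accuracy drop (points)  :", round(100 * accuracy_drop, 2))
```

In this run, about ten times fewer features gave roughly a fifteen-fold shorter training time at a cost of a little over one percentage point of accuracy.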