# Decision Trees mini-project
## From Udacity's course Introduction to Machine Learning

In this mini-project, we'll tackle the exact same email author ID problem as the Naive Bayes mini-project, but now with a decision tree.

### Introduction
A couple of years ago, J.K. Rowling (of Harry Potter fame) tried something interesting. She wrote a book, “The Cuckoo’s Calling,” under the name Robert Galbraith. The book received some good reviews, but no one paid much attention to it--until an anonymous tipster on Twitter said it was J.K. Rowling. The London Sunday Times enlisted two experts to compare the linguistic patterns of “Cuckoo” to Rowling’s “The Casual Vacancy,” as well as to books by several other authors. After the results of their analysis pointed strongly toward Rowling as the author, the Times directly asked the publisher if they were the same person, and the publisher confirmed. The book exploded in popularity overnight.

We’ll do something very similar in this project. We have a set of emails, half of which were written by one person and the other half by another person at the same company.

### Problem statement:
*Our objective is to classify the emails as written by one person or the other based only on the text of the email*.

## Preprocessing

The course provides the following function for preprocessing the data.

In [10]:
#!/usr/bin/python

import pickle
import cPickle
import numpy

from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

def preprocess(words_file = "../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    ### pickle files are binary, so they are opened in "rb" mode
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the fraction of emails assigned to the test set (0.1 = 10%)
    ### (the remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=1)  #when percentile = 10 -> Number of features: 3785
                                                          #when percentile = 1  -> Number of features: 379
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test
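
The three steps in `preprocess` (split, tf-idf vectorization, feature selection) can be tried end to end on a small made-up corpus. The sketch below assumes Python 3 and a recent scikit-learn, where `sklearn.cross_validation` has been replaced by `sklearn.model_selection`. The word lists, the 40 generated "emails", and `percentile=50` are illustrative assumptions chosen so the toy data keeps a few features; they are not values from the course data.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split

# Hypothetical vocabularies standing in for the two authors' emails.
sara_words = ["contract", "legal", "agreement", "review", "draft", "clause"]
chris_words = ["gas", "trading", "price", "market", "deal", "volume"]
shared_words = ["meeting", "schedule", "report", "budget"]

# Build 20 fake emails per author: three author words plus one shared word.
rng = random.Random(0)
word_data, authors = [], []
for label, vocab in ((0, sara_words), (1, chris_words)):  # Sara = 0, Chris = 1
    for _ in range(20):
        words = rng.sample(vocab, 3) + [rng.choice(shared_words)]
        word_data.append(" ".join(words))
        authors.append(label)

# 1. Hold out 10% of the emails for testing.
features_train, features_test, labels_train, labels_test = train_test_split(
    word_data, authors, test_size=0.1, random_state=42)

# 2. Fit tf-idf on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
train_tfidf = vectorizer.fit_transform(features_train)
test_tfidf = vectorizer.transform(features_test)

# 3. Keep the top-scoring half of the features (toy value; the project uses 1).
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(train_tfidf, labels_train)
train_selected = selector.transform(train_tfidf).toarray()
test_selected = selector.transform(test_tfidf).toarray()

print("training matrix:", train_selected.shape)
print("testing matrix:", test_selected.shape)
```

Fitting the vectorizer and the selector on the training split only, as `preprocess` does, keeps information from the test emails out of the features.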

## Decision Tree Algorithm implementation

In [11]:
#!/usr/bin/python
""" 
    This is the code to accompany the Decision Tree mini-project.
    Use a decision tree to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
from sklearn import tree
import numpy as np

### features_train and features_test are the features for the training and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

# Creating and training our tree
clf = tree.DecisionTreeClassifier(min_samples_split=40)  # nodes with fewer than 40 samples are not split; this limits tree growth and reduces overfitting
clf = clf.fit(features_train, labels_train)

## Accuracy
print "Accuracy (method 1):", clf.score(features_test, labels_test)
#########################################################

# Obtaining predictions for specific test emails.
#print "Predicted class for test email #10:", clf.predict([features_test[10]])
#print "Predicted class for test email #26:", clf.predict([features_test[26]])
#print "Predicted class for test email #50:", clf.predict([features_test[50]])
#predictions = clf.predict(features_test)
# How many test emails are predicted to be the "Chris" class (label 1)
#print predictions.sum()

no. of Chris training emails: 7936
no. of Sara training emails: 7884
Accuracy (method 1): 0.9670079635949943
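
The `min_samples_split=40` setting used above tells the tree not to split any node that holds fewer than 40 training samples. A minimal sketch of its effect follows, assuming Python 3, a recent scikit-learn, and synthetic data from `make_classification` in place of the email features (the dataset parameters are arbitrary choices for illustration).

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email features; flip_y adds 10% label noise.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

results = {}
for min_split in (2, 40):  # 2 is the scikit-learn default
    clf = tree.DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    clf.fit(X_train, y_train)
    results[min_split] = {
        "nodes": clf.tree_.node_count,              # size of the fitted tree
        "train_acc": clf.score(X_train, y_train),   # accuracy on seen data
        "test_acc": clf.score(X_test, y_test),      # accuracy on held-out data
    }
    print("min_samples_split =", min_split, results[min_split])
```

With the default of 2, the tree typically keeps splitting until it fits the training points (noise included), while the larger threshold yields a smaller tree. Whether test accuracy improves depends on the data, so it is worth comparing both numbers rather than assuming a gain.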


In [12]:
print "Number of features in data:", len(features_train[0])

Number of features in data: 379
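
The 379 features reported above come from `SelectPercentile(f_classif, percentile=1)`, which keeps roughly the top 1% of tf-idf columns ranked by ANOVA F-score (the comment in `preprocess` notes 3785 features at `percentile=10`). A small sketch of that relationship, assuming Python 3 with NumPy and a recent scikit-learn, and using a random 200 x 1000 matrix as a stand-in for the tf-idf features:

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 1000))   # 200 fake emails, 1000 fake features
y = rng.randint(0, 2, size=200)    # fake author labels (0 or 1)

kept = {}
for percentile in (1, 10):
    selector = SelectPercentile(f_classif, percentile=percentile)
    X_selected = selector.fit_transform(X, y)
    kept[percentile] = X_selected.shape[1]   # columns that survived selection
    print("percentile =", percentile, "-> Number of features:", kept[percentile])
```

Fewer features make training faster and give the tree fewer ways to overfit, at the cost of discarding some signal, so the percentile is a trade-off worth experimenting with.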
