# Udacity UD120 - Final Project
###  by: Josh Haines - Q4 2015


## Project Code & Comments

Import statements

In [67]:
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files should be opened in binary mode
    data_dict = pickle.load(data_file)

##### Examine the Dataset

I'm going to use this Jupyter notebook to take you through a machine learning example.  The example uses corporate Enron data to determine who in the company is a person of interest (for legal investigation) and who is not.

>As a quick note on this document format: What you are seeing is live code inputs and outputs mixed with typed text.  The plain text fields (like this one) were typed in by hand.  The "In [x]" fields on the left denote actual code boxes.  The area right beneath a code box, denoted by "Out [x]", is what was generated by the code directly above it.  Some code boxes have outputs and some don't.  This format is popular in scientific communities because it keeps explanation, code, and code output together in one easy-to-use place.

Let's do some exploring of the data to get a feel for what is in it...
This, specifically, is the Enron corpus dataset.  It includes email, financial, and investigative information on various Enron employees from the days around the company's major scandal.  The data is organized as a dict of dicts, which is a slick way of saying it's a dictionary of dictionaries.  You can think of it as organized like the following:

* Sally
    * Salary: X
    * Emails Sent: Y
    * Person of Interest: true
    * ... 21 "features" total per person
* John
    * Salary: X'
    * Emails Sent: Y'
    * Person of Interest: false
    * ...
* ... 146 People Total

The dataset is mostly complete, but a few places are empty.  In these places the value "NaN" is put in to not break the code.  We'll have to take this into account when we get to work.
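
To get a feel for how much is missing, one option is to count the "NaN" placeholders per feature. The sketch below is only an illustration: `count_missing` and the records in `toy_data` are made up for this example and use the same dict-of-dicts layout; with the real data you would pass `data_dict` instead.

```python
from collections import Counter

def count_missing(dataset, missing="NaN"):
    """Return a Counter mapping feature name -> number of people missing it."""
    missing_counts = Counter()
    for person, features in dataset.items():
        for feature, value in features.items():
            if value == missing:
                missing_counts[feature] += 1
    return missing_counts

# Hypothetical records, not taken from the Enron data.
toy_data = {
    "DOE JANE A": {"salary": 100000, "from_messages": "NaN", "poi": False},
    "ROE RICHARD": {"salary": "NaN", "from_messages": 12, "poi": True},
    "POE SAM": {"salary": "NaN", "from_messages": "NaN", "poi": False},
}

missing = count_missing(toy_data)
print(sorted(missing.items()))
```

Features with many missing values deserve extra care, because any "treat NaN as 0" rule affects them the most.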

A few definitions may help you make sense of this data.

* POI - Person of Interest (in the Enron investigation)
* Names - Formatted as "LAST FIRST MI"
* Emails - Some features have names like "from this person to poi".  The number in that field is the number of emails this person sent to people on the POI list.

In [68]:
print('Number of People in the dataset: {}'.format(len(data_dict)))
print('')  # blank line
print('A list of the first 15 people in the dataset.')
print('{}'.format(list(data_dict)[:15]))
print('')  # blank line
print("All features for a single person, let's use the CEO: Jeff Skilling.")
print('{}'.format(data_dict["SKILLING JEFFREY K"]))


Number of People in the dataset: 146

A list of the first 15 people in the dataset.
['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R']

All features for a single person, let's use the CEO: Jeff Skilling.
{'salary': 1111258, 'to_messages': 3627, 'deferral_payments': 'NaN', 'total_payments': 8682716, 'exercised_stock_options': 19250000, 'bonus': 5600000, 'restricted_stock': 6843672, 'shared_receipt_with_poi': 2042, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 26093672, 'expenses': 29336, 'loan_advances': 'NaN', 'from_messages': 108, 'other': 22122, 'from_this_person_to_poi': 30, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 1920000, 'email_address': 'jeff.skilling@enron.com', 'from_poi_to_this_person': 88}


So we have 146 people in the data set, and you can see their names above.  If we lock in on a single person (say the CEO Jeff Skilling) we can see all of the features in the dataset listed for him.  

What if we take a look at the salaries and the number of emails sent for the first few people?

In [69]:
print("Let's see all the salaries and number of emails sent for the first 15 people in the dataset.")
print('')  # blank line
print('Name          Salary     Emails Sent')
print('')  # blank line
for name in list(data_dict)[:15]:  # first 15 people only
    print('{} -- ${} -- {}'.format(name, data_dict[name]['salary'], data_dict[name]['from_messages']))

Let's see all the salaries and number of emails sent for the first 15 people in the dataset.

Name          Salary     Emails Sent

METTS MARK -- $365788 -- 29
BAXTER JOHN C -- $267102 -- NaN
ELLIOTT STEVEN -- $170941 -- NaN
CORDES WILLIAM R -- $NaN -- 12
HANNON KEVIN P -- $243293 -- 32
MORDAUNT KRISTINA M -- $267093 -- NaN
MEYER ROCKFORD G -- $NaN -- 28
MCMAHON JEFFREY -- $370448 -- 48
HORTON STANLEY C -- $NaN -- 1073
PIPER GREGORY F -- $197091 -- 222
HUMPHREY GENE E -- $130724 -- 17
UMANOFF ADAM S -- $288589 -- 18
BLACHMAN JEREMY M -- $248546 -- 14
SUNDE MARTIN -- $257486 -- 38
GIBBS DANA R -- $NaN -- 12


##### Task 1: Select the Data Features / Task 3: Create New Features

Deciding on features to use to determine POI status is a tough choice.  I'm leaning towards using the email features and other features that check association with POIs as my method.  I'd better do some visualization first.

Let's start with a chart showing total emails with POIs vs being a POI.

In [70]:
import matplotlib
from matplotlib import pyplot as plt
%matplotlib notebook

emails_with_pois = []
is_poi = []

# make x and y axes for a graph
for i in data_dict:
    emails_with_pois.append(data_dict[i]['from_this_person_to_poi'] + data_dict[i]['from_poi_to_this_person'])
    
    # Build the is_poi list
    if data_dict[i]['poi'] == True:
        is_poi.append(1)
    else:
        is_poi.append(0)
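# Clean up NaN issues: a missing value is stored as the string 'NaN', so adding
# two missing values concatenates them into 'NaNNaN'.  Treat that case as 0.
emails_with_pois = [i if i != 'NaNNaN' else 0 for i in emails_with_pois]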


# print('emails_with_pois: {}'.format(emails_with_pois))
# print('is_poi: {}'.format(is_poi))
fig1 = plt.figure()
plt.scatter(is_poi, emails_with_pois)
plt.xlabel('Person is a POI')
plt.ylabel('Total Emails with POIs')
plt.title('POI Status vs Total Emails with POIs')
plt.show()

[Figure: scatter plot "POI Status vs Total Emails with POIs" (x-axis: Person is a POI, y-axis: Total Emails with POIs)]

Eh, that one doesn't really show anything interesting.  Let's try another one.  How about we plot total money (salary, stock, bonus, and long-term incentive) vs emails with POIs, for both POIs and non-POIs?

In [71]:
# POI compensation vs emails
# compensation
from operator import add
poi_compensation = []
non_poi_compensation = []

for i in data_dict:  # Loop all people
    salary = 0
    total_stock_value = 0
    bonus = 0
    incentive = 0
    
    if data_dict[i]['poi'] == True:  # Only act if person is a POI
        if data_dict[i]['salary'] != 'NaN':
            salary = data_dict[i]['salary']
        if data_dict[i]['total_stock_value'] != 'NaN':
            total_stock_value = data_dict[i]['total_stock_value']
        if data_dict[i]['bonus'] != 'NaN':
            bonus = data_dict[i]['bonus']
        if data_dict[i]['long_term_incentive'] != 'NaN':
            incentive = data_dict[i]['long_term_incentive']
        poi_compensation.append(salary + total_stock_value + bonus + incentive)
        non_poi_compensation.append(0)  # Add zero to non_pois
    
    else:  # Person is not a POI
        if data_dict[i]['salary'] != 'NaN':
            salary = data_dict[i]['salary']
        if data_dict[i]['total_stock_value'] != 'NaN':
            total_stock_value = data_dict[i]['total_stock_value']
        if data_dict[i]['bonus'] != 'NaN':
            bonus = data_dict[i]['bonus']
        if data_dict[i]['long_term_incentive'] != 'NaN':
            incentive = data_dict[i]['long_term_incentive']
        non_poi_compensation.append(salary + total_stock_value + bonus + incentive)
        poi_compensation.append(0)  # Add zero to pois

# Emails
emails_with_pois_poi = []
emails_with_pois_non_poi = []

for i in data_dict:  # Loop all people
    if data_dict[i]['poi'] == True:  # Only act if person is a POI
        emails_with_pois_poi.append(data_dict[i]['from_this_person_to_poi'] + data_dict[i]['from_poi_to_this_person'])
        emails_with_pois_non_poi.append(0)
    else:  # Act if person is not a poi
        emails_with_pois_non_poi.append(data_dict[i]['from_this_person_to_poi'] + data_dict[i]['from_poi_to_this_person'])
        emails_with_pois_poi.append(0)
        
# Clean up NaN issues: 'NaN' + 'NaN' is the string 'NaNNaN', so treat it as 0
emails_with_pois_poi = [i if i != 'NaNNaN' else 0 for i in emails_with_pois_poi]
emails_with_pois_non_poi = [i if i != 'NaNNaN' else 0 for i in emails_with_pois_non_poi]

Now we have the data points, let's plot them...
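
A note on the `'NaNNaN'` checks in the cells above: they work because adding two `'NaN'` strings concatenates them, but adding a number to `'NaN'` would raise a `TypeError`. A more defensive way to total features is sketched below. The helper `feature_sum` and the records in `toy_data` are hypothetical and only illustrate the idea; this is not the code that produced the plots.

```python
def feature_sum(record, feature_names, missing="NaN"):
    """Add up the named features, counting missing values as 0."""
    return sum(record[name] for name in feature_names if record[name] != missing)

# Hypothetical records, not taken from the Enron data.
toy_data = {
    "DOE JANE A": {"from_this_person_to_poi": 3, "from_poi_to_this_person": "NaN", "poi": False},
    "ROE RICHARD": {"from_this_person_to_poi": "NaN", "from_poi_to_this_person": "NaN", "poi": True},
    "POE SAM": {"from_this_person_to_poi": 5, "from_poi_to_this_person": 7, "poi": True},
}

email_features = ["from_this_person_to_poi", "from_poi_to_this_person"]
totals = {name: feature_sum(rec, email_features) for name, rec in toy_data.items()}
print(totals)
```

The same helper could total the compensation features, which would replace the repeated `if ... != 'NaN'` blocks with a single call per person.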

In [74]:
fig2 = plt.figure()
plt.scatter(poi_compensation, emails_with_pois_poi, label="POI", color="r")
plt.scatter(non_poi_compensation, emails_with_pois_non_poi, label="Non-POI", color="b")
plt.xlabel('Total Compensation')
plt.ylabel('Total Emails with POIs')
plt.title('Total Compensation vs Total Emails with POIs')
plt.legend()
plt.show()

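[Figure: scatter plot "Total Compensation vs Total Emails with POIs" (x-axis: Total Compensation, y-axis: Total Emails with POIs; POIs in red, Non-POIs in blue)]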

##### Task 2: Remove Outliers

We have a major outlier in the Non-POI total compensation.  Let's figure out what that is, and remove it.

In [77]:
print('Display Non-POI Compensation from largest to smallest: {} ...'.format(sorted(non_poi_compensation, reverse=True)[:15]))

Display Non-POI Compensation from largest to smallest: [607079287, 25079809, 19300128, 15911666, 15541812, 13676415, 10608288, 10397847, 10275432, 7987164, 7857980, 7646839, 7565227, 7538371, 7256648] ...


Someone had a total compensation of $607,079,287.  I'm treating it as a real value rather than a data-entry error, but the outlier doesn't let us examine our feature.  Let's get rid of it, at least for now.
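
Before deleting a value like this, it can help to check which record it belongs to. The sketch below ranks records by total compensation and prints the names at the top. Everything in it is illustrative: `total_compensation`, `largest_totals`, and `toy_data` are made up for this example. A row like this sometimes turns out to be a spreadsheet summary line (for example a key named `TOTAL`) rather than a person; that is an assumption to check against the data, not something the output above shows.

```python
COMPENSATION_FEATURES = ("salary", "total_stock_value", "bonus", "long_term_incentive")

def total_compensation(record, missing="NaN"):
    """Sum the compensation features, counting missing values as 0."""
    return sum(record[f] for f in COMPENSATION_FEATURES if record[f] != missing)

def largest_totals(dataset, n=3):
    """Return the n (name, total) pairs with the highest total compensation."""
    totals = [(name, total_compensation(rec)) for name, rec in dataset.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical records; the "TOTAL" key mimics a spreadsheet summary row.
toy_data = {
    "DOE JANE A": {"salary": 100, "total_stock_value": 400,
                   "bonus": "NaN", "long_term_incentive": 50},
    "ROE RICHARD": {"salary": 200, "total_stock_value": "NaN",
                    "bonus": 300, "long_term_incentive": "NaN"},
    "TOTAL": {"salary": 300, "total_stock_value": 400,
              "bonus": 300, "long_term_incentive": 50},
}

top = largest_totals(toy_data)
print(top)
```

With the real data, `largest_totals(data_dict)` would show the key behind the largest value, which tells you whether to drop it from the plot only or from the dataset itself.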

In [78]:
index = non_poi_compensation.index(607079287)  # position of the outlier
# Delete from both lists so the x and y values stay aligned
del non_poi_compensation[index]
del emails_with_pois_non_poi[index]

...all done, now we can replot...

In [79]:
fig3 = plt.figure()
plt.scatter(poi_compensation, emails_with_pois_poi, label="POI", color="r")
plt.scatter(non_poi_compensation, emails_with_pois_non_poi, label="Non-POI", color="b")
plt.xlabel('Total Compensation')
plt.ylabel('Total Emails with POIs')
plt.title('Total Compensation vs Total Emails with POIs')
plt.legend()
plt.show()

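[Figure: scatter plot "Total Compensation vs Total Emails with POIs" with the outlier removed (x-axis: Total Compensation, y-axis: Total Emails with POIs; POIs in red, Non-POIs in blue)]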

That's better with the outlier removed.  Unfortunately, this feature doesn't seem to split things apart easily either.  As a group, the POIs don't seem to make more money or exchange more emails with POIs than the Non-POIs.  If we had a nice clean split between the red and blue points our job would be a whole lot easier.  Instead, we have to keep trying other things.

So both of my attempts at a custom data feature have failed to give me any type of POI / Non-POI separation.  Instead of continuing to guess, we should start using some machine learning tools to have a go at examining this data.

In [80]:
# features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi".

# Get list of all features available, with 'poi' moved to the front
keys = list(data_dict['SKILLING JEFFREY K'].keys())  # list() so remove/insert work
keys.remove('poi')  # remove poi
keys.remove('email_address')  # remove email_address
keys.insert(0, 'poi')  # put poi back in, but in the front
# print(keys)
features_list = ['poi',
                 # 'salary',
                 # 'to_messages',
                 # 'deferral_payments',
                 # 'total_payments',
                 'exercised_stock_options',
                 'bonus',
                 # 'restricted_stock',
                 # 'shared_receipt_with_poi',
                 # 'restricted_stock_deferred',
                 # 'total_stock_value',
                 'expenses'
                 # 'loan_advances',
                 # 'from_messages',
                 # 'other',
                 # 'from_this_person_to_poi',
                 # 'director_fees',
                 # 'deferred_income',
                 # 'long_term_incentive'
                 # 'from_poi_to_this_person'
                 ]
# list of all features available, but with 'poi' as the first.

# Store to my_dataset for easy export below.
my_dataset = data_dict

# Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

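try:
    from sklearn.model_selection import train_test_split  # scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split  # older scikit-learn releases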
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

OK, all that code above organizes the data, puts it into a nice format for analysis, and splits it into usable chunks.  Namely, we have our test features and labels along with our training features and labels.  These are called:

* features_test
* labels_test

and

* features_train
* labels_train

We'll use the training data set to train our algorithms and then we'll test with the test set to see how close our predictions come to the actual labels.
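
To make the idea of a holdout split concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation and will not pick the same rows as `train_test_split`; the function name `holdout_split` and the toy data are assumptions made for illustration.

```python
import random

def holdout_split(features, labels, test_size=0.3, seed=42):
    """Shuffle the row indices once, then use the first test_size share for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical data: 10 rows, label is 1 when the first feature is odd.
toy_features = [[i, i * 2] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

f_train, f_test, l_train, l_test = holdout_split(toy_features, toy_labels)
print(len(f_train), len(f_test))
```

The key property is that features and labels are selected with the same indices, so each row keeps its own label on both sides of the split.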

You'll also notice that I've selected only 3 features from the data set.  These were chosen partially by thinking about how someone doing something illegal could be selected, and partially by trial and error running the model.  I've selected "bonus", "expenses", and "exercised stock options".  

First I'd like to do some PCA (principal component analysis) to mix and match our features into some usable data that doesn't have so many dimensions.  You can read more about [PCA Here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [81]:
from sklearn.decomposition import PCA  # get access to the module
pca_test = PCA(n_components=2)  # keep only the top 2 principal components
pca_test.fit(features_train)
print(pca_test.explained_variance_ratio_)

[ 0.99774874  0.00224656]


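What this is telling us is that the first principal component, a single combination of our features, captures essentially all of the variance (99.8%), so there is very little need for anything else.  Keep in mind that the features were not scaled before the PCA, so this result probably reflects whichever feature has the largest numeric range.  Either way, PCA isn't going to buy us much here, so we can go straight into using a classifier.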
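
To make the explained-variance ratio concrete, here is a minimal sketch for the two-feature case, where the eigenvalues of the covariance matrix have a closed form. The numbers are made up (one feature in the millions, one in the thousands) and the function names are hypothetical. It also shows how much the ratio can change once the features are standardized, which is worth keeping in mind because the features above went into the PCA unscaled.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Explained-variance ratios of the two principal components of 2-D data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / float(n), sum(ys) / float(n)
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form
    half_trace = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    top, bottom = half_trace + spread, half_trace - spread
    return top / (a + c), bottom / (a + c)

def standardize(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / float(n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

# Hypothetical values, not taken from the Enron data.
stock = [1.0e6, 5.0e6, 2.0e6, 9.0e6, 3.0e6]
expenses = [40e3, 10e3, 55e3, 30e3, 20e3]

raw = explained_variance_ratio_2d(stock, expenses)
scaled = explained_variance_ratio_2d(standardize(stock), standardize(expenses))
print(raw, scaled)
```

On the raw values the first component takes nearly all of the variance simply because one feature's numbers are far larger; after standardizing, the split between components is much more even.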

##### Task 4: Use Classifiers / Task 5: Tune the Classifier

In [83]:
from sklearn.tree import DecisionTreeClassifier  # import the classifier module
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=45)  # Initialize the classifier
clf = clf.fit(features_train, labels_train)  # fit the classifier

I've decided to use a decision tree classifier.  You can read more about them [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).  They work well with labeled data (which ours is) and they are simple to program and train.  I've changed the criterion to "entropy", which seemed to train a little better.  I've also changed the min_samples_split parameter to something larger than the default (which is 2).
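
For a feel of what the "entropy" criterion measures, the sketch below computes Shannon entropy and the information gain of a candidate split on made-up labels. It is a simplified illustration with hypothetical function names, not scikit-learn's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels: 1 = POI, 0 = not a POI.
parent = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0], [1, 1]  # a split that separates the classes perfectly

gain = information_gain(parent, left, right)
print(round(entropy(parent), 4), round(gain, 4))
```

A tree grown with this criterion prefers splits with the highest gain. The min_samples_split setting then limits how small a node can be and still be split, which keeps the tree from chasing a handful of samples.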

After fitting the classifier, the next step is to actually predict labels for the test features.  I can then compare the actual test data labels to the predicted labels to determine the quality of the model.  These are shown below.

In [85]:
pred = clf.predict(features_test)
print('  Actual     Predicted')
print('  Test      Test')
print('  Label     Label')
for i in range(len(pred)):
    print('   {}  --  {}'.format(labels_test[i], pred[i]))

  Actual     Predicted
  Test      Test
  Label     Label
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  1.0
   0.0  --  0.0
   0.0  --  1.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   1.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   0.0  --  1.0
   0.0  --  1.0
   1.0  --  0.0
   0.0  --  0.0
   0.0  --  0.0
   1.0  --  1.0
   1.0  --  1.0


As you can see, we have pretty good alignment.  It is not perfect, but it's close enough to at least tell the prosecutors who to investigate.  Obviously we are working with a data set where we already know who is a person of interest and who isn't.  You can imagine bringing in data for everyone at the company and predicting labels based on their data.  We would know with roughly 84% accuracy who would be worth looking into and who wouldn't.

If you are wondering where the accuracy comes in, the next step is to determine the metrics of our model.  We'll output the data below and run a script to give us accuracy information.

##### Task 6: Output Your Variables

In [86]:
# Dump your classifier, dataset, and features_list so anyone can
# check your results.

dump_classifier_and_data(clf, my_dataset, features_list)

In [87]:
%run tester.py

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=45, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.83807	Precision: 0.42596	Recall: 0.38400	F1: 0.40389	F2: 0.39172
	Total predictions: 14000	True positives:  768	False positives: 1035	False negatives: 1232	True negatives: 10965



In the end we have an accuracy of 84%.  The next two metrics measure the ability to avoid labeling a negative as a positive (precision) and the ability to find all positives (recall).  These metrics are important because, depending on the circumstances, false positives and false negatives can carry very different costs.  You can read more about these metrics [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).
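
The figures reported by tester.py can be re-derived from the four counts on its last output line. The sketch below does that arithmetic; the function name `classification_metrics` is made up for this example, and only the counts come from the output above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion-matrix counts."""
    tp, fp, fn, tn = float(tp), float(fp), float(fn), float(tn)
    precision = tp / (tp + fp)   # of everyone flagged as a POI, how many really are
    recall = tp / (tp + fn)      # of all real POIs, how many were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)  # weights recall more heavily
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "f2": f2}

# Counts taken from the tester.py output above.
m = classification_metrics(tp=768, fp=1035, fn=1232, tn=10965)
for name in sorted(m):
    print('{}: {:.5f}'.format(name, m[name]))
```

Note how the large number of true negatives lifts accuracy to about 84% even though fewer than half of the flagged people are actual POIs.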

## Project Questions

*Q1: Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  
[relevant rubric items: “data exploration”, “outlier investigation”]*

>A1: The goal of the project is to use the skills learned in UD120 to predict an outcome from ambiguous data.  In this case, we'll be trying to find "Persons of Interest" (POI) from various types of information related to employees working at Enron.  We'll have financial, email, and POI data for a list of people.  The data came from the various trials the company and its employees went through.  It is rare that we would have comprehensive corporate data like this.

>Machine learning can make these types of analyses much faster and more accurate than having a team of forensic accountants trying to comb through the data and guess at connections.  We can use the machine learning tools from UD120 to dissect the data, compare features, and become accurate at predicting an outcome based on different types of data inputs.

*Q2: What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  
[relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]*

>A2: I initially tried to create two features.  The first was the total number of emails exchanged with POIs, which is the sum of two existing features.  I plotted the relationship and found it uninteresting.

>My next test was to compare total compensation with total emails sent to and from POIs.  The chart was dominated by one outlier that made examination difficult, so I removed the outlier to view the data better.  To be clear, I treated the data point as an outlier in scale rather than an error, that is, as a valid point.  After replotting it became clear that these new features did not show any separation between POIs and Non-POIs.  I had to throw them out.

>In the end I used "bonus", "expenses", and "exercised stock options".  These were chosen partially by thinking about how someone doing something illegal could be selected, and partially by trial and error running the model.  Adding more features reduced the prediction capability.

>I did run a PCA to try to determine features and found that the first principal component explained over 99% of the variance, so PCA was unlikely to be of much use.  This led me to simply use a decision tree.

*Q3: What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  
[relevant rubric item: “pick an algorithm”]*

>A3: I ended up using a simple decision tree classification.  Unfortunately my route wasn't exactly direct.  

>I began by trying a tool called support vector machines ([SVM](http://scikit-learn.org/stable/modules/svm.html)).  It didn't fit well, so I jumped into the pipeline + gridsearch methods to try to iterate through different parameter combinations.  Even after running the gridsearch, my precision and recall never exceeded about 0.10.  I then tried to run the gridsearch and pipeline system with a decision tree and had trouble getting it to run.  After commenting out the gridsearch and pipeline code, my simple decision tree gave me the best accuracy and performance up to that point, so I deleted all the old code and moved on to tuning the decision tree.

*Q4: What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).
[relevant rubric item: “tune the algorithm”]*

>A4: Tuning the parameters of an algorithm can have a large effect on not only the accuracy but also the runtime of an algorithm.  Essentially you're changing parts of the underlying equation of the algorithm, which in turn tweaks the way it returns predictions.  I found that changing my decision tree criterion from "gini" to "entropy" made a large difference to my results.  This tuning was simply trial and error, as there are only two options.

>The second parameter (min_samples_split) was initially tested across a few values: 1, 5, 10, 100.  From there I bracketed in to around 45, which had the highest accuracy for this specific data set.
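
The coarse-then-fine bracketing described above can be sketched as a small search loop. This is only an illustration: `best_candidate` is a made-up helper, and `toy_score` is a hypothetical stand-in for "train a model with this setting and return its validation score", so its peak at 40 is arbitrary. The coarse grid starts at 2 here rather than 1.

```python
def best_candidate(candidates, evaluate):
    """Return the candidate with the highest score, plus all scores."""
    scores = {value: evaluate(value) for value in candidates}
    best = max(scores, key=scores.get)
    return best, scores

def toy_score(min_samples_split):
    # Hypothetical scoring function: it simply peaks at 40.
    return -abs(min_samples_split - 40)

# First pass: a few widely spaced values.
coarse = [2, 5, 10, 100]
coarse_best, _ = best_candidate(coarse, toy_score)

# Second pass: a finer grid between the neighbours of the coarse winner.
pos = coarse.index(coarse_best)
low = coarse[max(pos - 1, 0)]
high = coarse[min(pos + 1, len(coarse) - 1)]
fine_best, _ = best_candidate(range(low, high + 1, 5), toy_score)

print(coarse_best, fine_best)
```

In a real run, the evaluation function would fit the classifier with the candidate value and score it on held-out data, so that the chosen setting is not tuned to the training set alone.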

*Q5: What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  
[relevant rubric item: “validation strategy”]*

>A5: Validation is the method of checking your predictions against a test set with known labels that the model did not see during training.  A classic mistake is to test on the same data you trained on, which makes an overfit model look better than it really is.  I held out 30% of the data as a test set with train_test_split and then ran the provided tester.py script.  Accuracy is one measure that simply shows how often you are correct and incorrect.  Since these are real-world problems, however, we can't just stop at that point.  Metrics like precision, recall, and F1 account for false positives and false negatives.  Depending on the situation, a false positive may be insignificant but a false negative could be life altering.  We as data scientists need methods to take these into account when tuning our algorithms and can use these metrics to guide our changes.

*Q6: Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.
[relevant rubric item: “usage of evaluation metrics”]*

>A6: 
* Accuracy: 84%  (Project Goal: >70%)
* Precision: 43%  (Project Goal: >30%)
* Recall: 38%  (Project Goal: >30%)

>The accuracy is simply the share of predictions that are correct.  The next two metrics measure the ability to avoid labeling a negative as a positive (precision) and the ability to find all positives (recall).  Here, a precision of 43% means that about 43% of the people the model flags are actually POIs, and a recall of 38% means that it finds about 38% of the real POIs.  In this case, recall is important because we don't want to miss any POIs for the investigation.