# Enron Data Machine Learning

## Data Exploration
The purpose of this project is to analyze Enron employees' financial and email data to identify employees who may have committed fraud. A number of employees are labeled in the data as persons of interest (POIs) based on Udacity research. We'll attempt to use machine learning to create a model that can help us detect persons of interest from this data, so the model can be described as a POI identifier. Given a similar dataset for another company or organization, the same approach could be used to predict POIs.

Machine learning algorithms can find relationships in the data, determine which features are the most influential, and provide us with models that can be used to predict behavior for future data.

Below is a summary of the data:
 
               Number of Data Points: 146
                      Number of POIs: 18
                  Number of Non-POIs: 128
            Total Number of Features: 21

Of the 21 features, 6 relate to email (5 message counts plus an email address label), 14 relate to financial earnings, and the last is the person-of-interest label (poi).
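
The counts above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the data is a dictionary of per-person feature dictionaries; the toy records and the `summarize` helper are made up for illustration and are not taken from the project code.

```python
# Toy stand-in for the dataset: {person_name: {feature: value}}.
# Names and values are invented for illustration only.
data_dict = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 100000},
    "PERSON C": {"poi": False, "salary": 90000, "bonus": 5000},
}


def summarize(data):
    """Return record, POI, non-POI, and feature counts for a dict of dicts."""
    n_points = len(data)
    n_poi = sum(1 for rec in data.values() if rec["poi"])
    # Collect every feature name seen across all records.
    feature_names = set()
    for rec in data.values():
        feature_names.update(rec)
    return {
        "points": n_points,
        "poi": n_poi,
        "non_poi": n_points - n_poi,
        "features": len(feature_names),
    }


summary = summarize(data_dict)
print(summary)
```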

Several features have missing values, summarized below. Some of the counts are quite high (loan_advances is 97% incomplete, for example). Some gaps may exist because the payment type is not available to all employees (director_fees, for example). Either way, this table will be important later when selecting features: I am not sure that features that are mostly incomplete will be good indicators, since the few values present would behave mostly like outliers.

                                 Feature  #NaN   Fraction NaN
                                  salary    50   0.34
                             to_messages    58   0.40
                       deferral_payments   106   0.73
                          total_payments    20   0.14
                 exercised_stock_options    43   0.30
                                   bonus    63   0.43
                           director_fees   128   0.88
               restricted_stock_deferred   127   0.88
                       total_stock_value    19   0.13
                                expenses    50   0.34
                 from_poi_to_this_person    70   0.48
                           loan_advances   141   0.97
                           from_messages    58   0.40
                                   other    52   0.36
                 from_this_person_to_poi    78   0.54
                         deferred_income    96   0.66
                 shared_receipt_with_poi    58   0.40
                        restricted_stock    35   0.24
                     long_term_incentive    79   0.54
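
A table like the one above could be produced as follows. This is a sketch, not the project's code: it assumes missing values are stored as the string `"NaN"`, and the records and the `nan_table` helper are hypothetical.

```python
from collections import Counter

# Hypothetical records; missing values are assumed to be the string "NaN".
records = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "loan_advances": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 100000, "loan_advances": "NaN"},
    "PERSON C": {"salary": 90000, "bonus": 5000, "loan_advances": "NaN"},
    "PERSON D": {"salary": 80000, "bonus": "NaN", "loan_advances": 400000},
}


def nan_table(data, missing="NaN"):
    """Map each feature to (missing count, fraction of records missing)."""
    counts = Counter()
    for rec in data.values():
        for feature, value in rec.items():
            counts[feature] += int(value == missing)
    total = len(data)
    return {f: (n, round(n / total, 2)) for f, n in counts.items()}


table = nan_table(records)
for feature, (n_missing, frac) in sorted(table.items()):
    print(f"{feature:>40} {n_missing:>5} {frac:>6.2f}")
```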

The data points summarized below have NaN for at least 80% of their features. I was not certain whether to cut any of them from the experiment. A value of ~81% means 17 of the 21 features are NaN, but that figure alone does not say which features are missing. LOCKHART EUGENE E, at 95%, only has an email address complete. Another data point of interest is 'THE TRAVEL AGENCY IN THE PARK'. According to the footnotes of the Enron Insider Payments sheet, this was an agency used by Enron for travel purposes and expenses, and it was also owned by Kenneth Lay's sister. http://www.travelweekly.com/Travel-News/Travel-Agent-Issues/Texas-agency-takes-a-huge-hit-from-Enron-s-fall

                              Data Point  Fraction of Features NaN
                           WODRASKA JOHN  0.809523809524
                          WHALEY DAVID A  0.857142857143
                         CLINE KENNETH W  0.809523809524
                            WAKEHAM JOHN  0.809523809524
                            WROBEL BRUCE  0.857142857143
                             GILLIS JOHN  0.809523809524
                       LOCKHART EUGENE E  0.952380952381
           THE TRAVEL AGENCY IN THE PARK  0.857142857143
                       SCRIMSHAW MATTHEW  0.809523809524
                            SAVAGE FRANK  0.809523809524
                           GRAMM WENDY L  0.857142857143
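
The per-record check can be sketched the same way. The 0.8 threshold comes from the text; the `"NaN"` marker, the toy records, and the `sparse_records` helper are assumptions for illustration.

```python
def sparse_records(data, threshold=0.8, missing="NaN"):
    """Return {name: fraction of features missing} for records at or above threshold."""
    flagged = {}
    for name, rec in data.items():
        frac = sum(1 for v in rec.values() if v == missing) / len(rec)
        if frac >= threshold:
            flagged[name] = frac
    return flagged


# Two hypothetical 21-feature records: one with 17 missing values, one with 5.
features = [f"feature_{i}" for i in range(21)]
toy = {
    "SPARSE PERSON": {f: (1 if i < 4 else "NaN") for i, f in enumerate(features)},
    "DENSE PERSON": {f: (1 if i < 16 else "NaN") for i, f in enumerate(features)},
}

flagged = sparse_records(toy)
for name, frac in flagged.items():
    print(f"{name:>40}  {frac:.4f}")
```

A record with 17 of 21 features missing lands at 17/21, which is the ~81% level seen in the table.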

### Records Removed
I decided to remove the following data points:

    THE TRAVEL AGENCY IN THE PARK
    LOCKHART EUGENE E
    TOTAL

The TOTAL record was discovered as an outlier when charting salary and bonus data. It is a summation of the other records and is therefore not relevant to the model. LOCKHART EUGENE E contained no useful information for the model. I removed THE TRAVEL AGENCY IN THE PARK because it does not represent an individual; investigation into agencies or other businesses is perhaps better done in a separate model. Through plotting I found several data points with significantly larger values than the rest in salary, bonus, total payments, etc. I decided these are valid points and should not be removed at this time.
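
Removing the three records is a one-liner per key if the data is held in a dictionary keyed by name, which is an assumption here. The record contents below are placeholders.

```python
# Hypothetical stand-in for the dataset; only the keys matter for this step.
data_dict = {
    "TOTAL": {},
    "THE TRAVEL AGENCY IN THE PARK": {},
    "LOCKHART EUGENE E": {},
    "PERSON A": {},
}

RECORDS_TO_REMOVE = ["THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E", "TOTAL"]

for name in RECORDS_TO_REMOVE:
    # pop with a default avoids a KeyError when a key is already absent.
    data_dict.pop(name, None)

print(sorted(data_dict))
```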


## Data Features
I split the data, holding 80% for training, and used the SelectKBest feature selection algorithm to determine which features to use. The resulting selected features are below.

                      Feature  p-value     Score
              deferred_income  0.00000  29.63748
                       salary  0.00003  18.67421
      shared_receipt_with_poi  0.00014  15.58401
      exercised_stock_options  0.00187  10.14635
          long_term_incentive  0.00013  15.71770
                loan_advances  0.00412   8.58075
                        bonus  0.01313   6.35298
             restricted_stock  0.02975   4.84684
            total_stock_value  0.36255   0.83584
               total_payments  0.75885   0.09471

One standout is loan_advances, given that it has only 3 non-NaN data points.
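
The following is a rough sketch of how a score table like this can be built with scikit-learn. The data is synthetic, the feature names are invented, and `f_classif` (SelectKBest's default score function) is assumed because the text does not name the scoring function used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic, imbalanced data: 18 positives out of 120 samples.
rng = np.random.default_rng(0)
y = np.array([1] * 18 + [0] * 102)
feature_names = ["informative", "weak", "noise_a", "noise_b"]
X = rng.normal(size=(len(y), len(feature_names)))
X[:, 0] += 3.0 * y  # strongly related to the label
X[:, 1] += 0.5 * y  # weakly related to the label

selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)

# Zip names, p-values, and scores together before sorting so that each
# row keeps its own feature name.
ranked = sorted(
    zip(feature_names, selector.pvalues_, selector.scores_),
    key=lambda row: row[2],
    reverse=True,
)
for name, p_value, score in ranked:
    print(f"{name:>30}  {p_value:.5f}  {score:9.5f}")

selected = [n for n, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

Fitting the selector on the training split only, as described above, keeps test records from influencing which features are chosen.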

shared_receipt_with_poi was the only email feature that showed up in the SelectKBest result. I chose to create two features that are the ratios of emails from and to POIs: from_poi_ratio = from_poi_to_this_person / from_messages and to_poi_ratio = from_this_person_to_poi / to_messages. These features describe how frequently a person communicates with POIs relative to their total incoming and outgoing mail. Perhaps POIs communicate more frequently within their own group.
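
The two ratio features can be sketched as below. The numerators and denominators follow the definitions in the text; returning 0.0 when a value is missing or the denominator is zero is an assumption, as are the `"NaN"` marker and the helper names.

```python
def safe_ratio(numerator, denominator, missing="NaN"):
    """Return numerator / denominator, or 0.0 if either is missing or the denominator is 0."""
    if numerator == missing or denominator == missing or denominator == 0:
        return 0.0
    return numerator / denominator


def add_poi_ratios(record):
    """Add from_poi_ratio and to_poi_ratio to a record, as defined in the text."""
    record["from_poi_ratio"] = safe_ratio(
        record["from_poi_to_this_person"], record["from_messages"]
    )
    record["to_poi_ratio"] = safe_ratio(
        record["from_this_person_to_poi"], record["to_messages"]
    )
    return record


# Hypothetical records: one complete, one with no email data.
complete = add_poi_ratios({
    "from_poi_to_this_person": 20, "from_messages": 100,
    "from_this_person_to_poi": 30, "to_messages": 120,
})
no_email = add_poi_ratios({
    "from_poi_to_this_person": "NaN", "from_messages": "NaN",
    "from_this_person_to_poi": "NaN", "to_messages": "NaN",
})
print(complete["from_poi_ratio"], complete["to_poi_ratio"])
print(no_email["from_poi_ratio"], no_email["to_poi_ratio"])
```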

total_stock_value and total_payments were both selected and together represent an individual's total wealth. Each is the sum of its respective earning area. I created total_exp_wealth as the sum of these two values, which generalizes a data point's wealth to one value. Some records have missing or no information for their stock values or payments, so a general total income value may improve our model's identification of POIs compared with relying on these fields individually.
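
A minimal sketch of the combined feature, assuming that a missing total (stored as `"NaN"`) contributes nothing to the sum. That handling is my assumption; the text does not say how missing totals were treated.

```python
def total_exp_wealth(record, missing="NaN"):
    """Sum total_stock_value and total_payments, skipping missing values."""
    parts = [
        record.get("total_stock_value", missing),
        record.get("total_payments", missing),
    ]
    return sum(value for value in parts if value != missing)


# Hypothetical records.
both = {"total_stock_value": 1000000, "total_payments": 250000}
stock_only = {"total_stock_value": 400000, "total_payments": "NaN"}
neither = {"total_stock_value": "NaN", "total_payments": "NaN"}

print(total_exp_wealth(both), total_exp_wealth(stock_only), total_exp_wealth(neither))
```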

After adding the features to the main dataset I ran SelectKBest again, this time selecting 13 features to see how my new features scored. to_poi_ratio and total_exp_wealth were both selected, as was from_poi_to_this_person. While the new feature from_poi_ratio did not score well, from_poi_to_this_person improved.

              deferred_income  0.00000 29.63748
                       salary  0.00003 18.67421
          long_term_incentive  0.00013 15.71770
      from_poi_to_this_person  0.00014 15.58401
                 to_poi_ratio  0.00113 11.16549
      exercised_stock_options  0.00187 10.14635
             total_exp_wealth  0.00301  9.20181
                loan_advances  0.00412  8.58075
                        bonus  0.01313  6.35298
             restricted_stock  0.02975  4.84684
      shared_receipt_with_poi  0.05540  3.74758
            total_stock_value  0.36255  0.83584
               total_payments  0.75885  0.09471

I decided to use the above list as the features in my dataset. The dataset is scaled prior to fitting because of the variety of data formats (emails are counts, finances are in USD, and the user-created features are ratios) and the differing scales across features. However, after testing a variety of classifiers with the full 13 features, I decided to trim the feature list, starting with total_stock_value and total_payments, given their representation in total_exp_wealth and their low scores.
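
To illustrate why scaling matters here, the sketch below applies min-max scaling to a dollar-valued column and a ratio column. Min-max is an assumption, since the text does not say which scaler was used, and the values are invented.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical columns on very different scales.
salary = [50000, 200000, 1000000]  # USD
to_poi_ratio = [0.0, 0.1, 0.5]     # unitless ratio

scaled_salary = min_max_scale(salary)
scaled_ratio = min_max_scale(to_poi_ratio)
print(scaled_salary)
print(scaled_ratio)
```

After scaling, both columns span 0 to 1, so a distance-based method such as K-Means is not dominated by the dollar-valued features.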



## Algorithms
The final algorithm I chose was Adaboost. It was covered in class as an optional algorithm for investigation and is typically built on decision trees. I learned it is a strong classifier for binary classification problems such as the one at hand. It works by building a strong classifier from several weaker classifiers. It uses a weighting system that adjusts iteratively, giving more weight to incorrectly predicted instances and less weight to correctly predicted ones, so that each subsequent learner gives more attention to the instances that were previously misclassified.
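
The reweighting step can be made concrete with one round of the textbook discrete AdaBoost update for labels in {-1, +1}. This is an illustrative sketch of the classic formulation, not scikit-learn's exact implementation; the function name and toy labels are made up.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """Run one round of the discrete AdaBoost weight update (labels are -1 or +1)."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 1:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # The learner's vote grows as its weighted error shrinks.
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction (weight shrinks) and -1 for a
    # wrong one (weight grows).
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [+1, +1, -1, -1, -1]
y_pred = [+1, -1, -1, -1, -1]  # the weak learner misses the second instance
weights = [0.2] * 5            # start from uniform weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3))                      # 0.693
print([round(w, 3) for w in new_weights])   # [0.125, 0.5, 0.125, 0.125, 0.125]
```

After one round the single misclassified instance holds half of the total weight, so the next weak learner is pushed to get it right.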

I set up my algorithm testing to use data split by a KFold splitter with n_splits = 5. A few other algorithms tested were GaussianNB and SVM, tuned with GridSearchCV. I did not have success with GridSearchCV. With a simple train_test_split of 0.3 I could get ~80% accuracy but less than 20% precision and recall. With the KFold splitter the scores dropped significantly. GaussianNB performed relatively well in my KFold test and had even better results with the supplied tester.py. K-Means clustering performed well, with interesting results for different numbers of features, and is an interesting algorithm given its number of tunable parameters.

## Parameters
It is important to tune the parameters for the algorithm being used and for the data at hand. We want our algorithms to perform as well as possible with regard to our validation metrics. Different parameters can affect how an algorithm learns as well as the intended outcome, as we will see with K-Means.

Below were the parameters tuned for each algorithm. 

* Adaboost:
    * <b>n_estimators</b>: maximum number of estimators used
    * <b>learning_rate</b>: weight applied to each classifier's contribution
* K-Means:
    * <b>n_clusters</b>: number of clusters/centroids to form
    * <b>n_init</b>: number of times to run the algorithm with different seeds
    * <b>max_iter</b>: maximum number of iterations for a single run
    * <b>tol</b>: tolerance used to declare convergence

With K-Means it is important to set n_clusters = 2 for our Enron dataset, since we want two centroids, one for POI and one for non-POI. I had the best success with n_init = 200. Although the data is scaled, with several features I felt that more initializations increased the likelihood of finding the best fit. I used max_iter = 600; although my KFold test had good scores at lower values, the results were worse using tester.py. I found that lowering the tolerance tended to give better results and settled on tol = 0.00005.
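
As a rough sketch, the settings above map onto scikit-learn's `KMeans` as shown below. The data is synthetic, and the step that maps cluster indices to POI labels is an assumption: K-Means numbers its clusters arbitrarily, and the text does not describe how clusters were matched to labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-scaled data: 40 "non-POI" points and 8 "POI" points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.2, 0.05, size=(40, 2)),
    rng.normal(0.8, 0.05, size=(8, 2)),
])
y = np.array([0] * 40 + [1] * 8)

kmeans = KMeans(n_clusters=2, n_init=200, max_iter=600, tol=0.00005, random_state=42)
clusters = kmeans.fit_predict(X)

# Cluster ids 0 and 1 carry no meaning on their own; flip them if they
# disagree with the known labels more often than they agree.
agreement = (clusters == y).mean()
pred = clusters if agreement >= 0.5 else 1 - clusters
print((pred == y).mean())
```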

With Adaboost I used n_estimators = 60 and learning_rate = 1.25. For the number of estimators, too few gave lower precision and recall, and more had no effect. Adaboost will stop iterating before n_estimators is reached if a perfect fit is reached first. At first I tested doubling the learning_rate and found accuracy and precision would drop. I then elected to test from 0.5 to 1.5 in steps of 0.25 and then 0.1, and found 1.25 to return the best precision and recall.
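
A learning-rate sweep along these lines might look like the following. It runs on synthetic data from `make_classification`, so the scores will not match the ones reported here, and pooling the out-of-fold predictions with `cross_val_predict` is my simplification rather than the project's clf_tester.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in with roughly 13% positives and 11 features.
X, y = make_classification(
    n_samples=143, n_features=11, n_informative=4,
    weights=[0.87, 0.13], random_state=0,
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for learning_rate in [0.5, 0.75, 1.0, 1.25, 1.5]:
    clf = AdaBoostClassifier(
        n_estimators=60, learning_rate=learning_rate, random_state=42
    )
    # Each sample is predicted once, by a model that never saw it in training.
    pred = cross_val_predict(clf, X, y, cv=cv)
    results[learning_rate] = (
        precision_score(y, pred, zero_division=0),
        recall_score(y, pred, zero_division=0),
    )

for learning_rate, (precision, recall) in results.items():
    print(f"learning_rate={learning_rate:<5} precision={precision:.3f} recall={recall:.3f}")
```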

## Validation
Validation is used to ensure an algorithm is performing well, that is, that it generalizes beyond the data it was trained on. A common mistake is overfitting the training data: the model performs very well against training data but poorly against test data. It is biased toward the training data and doesn't react well to new data points.

For validation I used these two testing methods:

1. <b>tester.py</b> - the Udacity-supplied script, which uses StratifiedShuffleSplit to split the dataset into train and test data. The default number of splits and reshuffles passed to the function is 1000.
2. <b>clf_tester</b> - a function I developed in poi_id.py to test classifiers by using KFold to split the data. I used 5 as the number of splits in every case.
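
The two splitters treat the rare POI class differently, which may help explain why the same classifier scores differently under each. The sketch below only counts POIs per test split; the labels mimic the class balance in the summary (18 of 146), while `test_size=0.1` and the shuffle settings are assumptions rather than values taken from tester.py or clf_tester.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedShuffleSplit

# Labels with the class balance from the data summary: 18 POIs of 146.
y = np.array([1] * 18 + [0] * 128)
X = np.zeros((len(y), 1))  # the features are irrelevant to how splits are drawn

# KFold ignores the labels, so the POI count per test fold can vary.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_pois = [int(y[test_idx].sum()) for _, test_idx in kfold.split(X)]

# StratifiedShuffleSplit keeps the POI proportion close to constant in
# every one of its randomly drawn test sets.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)
sss_pois = [int(y[test_idx].sum()) for _, test_idx in sss.split(X, y)]

print("POIs per KFold test fold:", kfold_pois)
print("POIs per stratified test split (min, max):", min(sss_pois), max(sss_pois))
```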

I used <b>precision</b> and <b>recall</b> as primary metrics. Precision measures the ratio of correctly flagged POIs to the total number of records flagged as POIs, which gives us an idea of how often false alarms occur in this model. Recall measures the ratio of correctly flagged POIs to the number of actual POIs, so we have an idea of how sensitive the model is. Accuracy is not especially useful here because of the low number of POIs in the dataset: with 12% of the population being POIs, a model that labeled every record as non-POI would still reach about 88% accuracy.
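
A small worked example makes the accuracy argument concrete. The label counts (18 POIs, 128 non-POIs) come from the data summary; both sets of predictions are hypothetical.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary labels (1 = POI)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


y_true = [1] * 18 + [0] * 128

# Baseline: never flag anyone.
all_non_poi = [0] * 146
baseline = precision_recall_accuracy(y_true, all_non_poi)
print([round(m, 3) for m in baseline])  # [0.0, 0.0, 0.877]

# Hypothetical classifier: flags 10 records, 6 of them true POIs.
flags = [1] * 6 + [0] * 12 + [1] * 4 + [0] * 124
flagged = precision_recall_accuracy(y_true, flags)
print([round(m, 3) for m in flagged])   # [0.6, 0.333, 0.89]
```

The baseline scores nearly 88% accuracy while finding no POIs at all, which is why precision and recall are the more informative metrics here.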

#### Validation with KFold, n_splits = 5
<table>
<tr><td>Classifier</td><td>Precision</td><td>Recall</td><td># of Features</td></tr>
<tr><td>Adaboost</td><td>0.414</td><td>0.353</td><td>13</td></tr>
<tr><td>K-Means</td><td>0.353</td><td>0.353</td><td>13</td></tr>
<tr><td>Adaboost</td><td>0.377</td><td>0.366</td><td>11</td></tr>
<tr><td>K-Means</td><td>0.311</td><td>0.553</td><td>11</td></tr>
</table>

#### Validation with StratifiedShuffleSplit, n_splits = 1000
<table>
<tr><td>Classifier</td><td>Precision</td><td>Recall</td><td># of Features</td></tr>
<tr><td>Adaboost</td><td>0.304</td><td>0.249</td><td>13</td></tr>
<tr><td>K-Means</td><td>0.183</td><td>0.126</td><td>13</td></tr>
<tr><td>Adaboost</td><td>0.342</td><td>0.294</td><td>11</td></tr>
<tr><td>K-Means</td><td>0.205</td><td>0.115</td><td>11</td></tr>
</table>

Both algorithms performed well considering the size and quality of the data. Adaboost showed better predictions overall. (As stated above, the omitted features were total_stock_value and total_payments, due to the addition of total_exp_wealth.) In my opinion, recall is the more important metric for this model. From an investigation standpoint, we'd want to cast a larger net over potential POIs. Given this dataset, with very few POIs and in some cases high variance in the data, establishing a larger baseline group for future investigation is more important than identifying 'for sure' POIs.

## Conclusion
The main challenges in this dataset are not only its small size but also the availability of data for each data point. Many records had limited data and were perhaps oversimplified by the fact that their values were totals. Perhaps if there were also a time component to the payments, it could add another dimension for determining a POI. We could look for spikes or other unusual payments and activity, similar to how a bank may detect fraud, and create features to describe this type of activity.