# Identify Fraud from Enron Email
Project for Intro to Machine Learning
## Goal:

### Project Overview
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

Provided is a combined Enron email and financial dataset in the form of a dictionary, where each key-value pair corresponds to one person. The dictionary key is the person's name, and the value is another dictionary that contains the names of all the features and their values for that person. The features fall into three major types: financial features, email features and POI labels.
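
As a rough illustration of that structure, the sketch below builds a tiny dictionary of the same shape. The names and values are made up, and the use of the string `"NaN"` for a missing value is an assumption about how the dataset marks unknown entries.

```python
# Hypothetical miniature of the dataset: names and values are invented,
# and "NaN" as the missing-value marker is an assumption.
data_dict = {
    "DOE JOHN": {"salary": 250000, "bonus": "NaN", "to_messages": 1200, "poi": False},
    "ROE JANE": {"salary": 400000, "bonus": 1000000, "to_messages": "NaN", "poi": True},
}

# Each key is a person's name; each value maps feature names to values.
for name, features in data_dict.items():
    print(name, "poi:", features["poi"], "salary:", features["salary"])

# Collect the names of everyone labelled as a person of interest.
poi_names = [name for name, features in data_dict.items() if features["poi"]]
print(poi_names)
```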

### Project goal
The goal of the project is to see whether I can build a person of interest identifier that reaches reasonable accuracy and other metric scores (such as precision and recall) when identifying persons of interest from the financial and email data made public as a result of the Enron scandal.

### Project Plan
The steps to achieve this will be:

* Dataset exploration / Question
* Features
* Algorithms
* Evaluation


## Dataset exploration / Question




The dataset mixes three kinds of data: payment, stock and email information.

Payment features:  

| Feature | Meaning |
|:--------|:--------|
| salary | Reflects items such as base salary, executive cash allowances, and benefits payments |
| bonus | Reflects annual cash incentives paid based upon company performance. Also may include other retention payments. |
| long_term_incentive |Reflects long-term incentive cash payments from various long-term incentive programs designed to tie executive compensation to long-term success as measured against key performance drivers and business objectives over a multi-year period, generally 3 to 5 years. |
| deferred_income | Reflects voluntary executive deferrals of salary, annual cash incentives, and long-term cash incentives as well as cash fees deferred by non-employee directors under a deferred compensation arrangement. May also reflect deferrals under a stock option or phantom stock unit in lieu of cash arrangement.  |
| deferral_payments  |Reflects distributions from a deferred compensation arrangement due to termination of employment or due to in-service withdrawals as per plan provisions. |
| loan_advances  | Reflects total amount of loan advances, excluding repayments, provided by the Debtor in return for a promise of repayment. In certain instances, the terms of the promissory notes allow for the option to repay with stock of the company. |
| other  | Reflects items such as payments for severance, consulting services, relocation costs, tax advances and allowances for employees on international assignment (e.g. housing allowances, cost of living allowances, payments under Enron’s Tax Equalization Program, etc.). May also include payments provided with respect to employment agreements, as well as imputed income amounts for such things as use of corporate aircraft. |
| expenses  | Reflects reimbursements of business expenses. May include fees paid for consulting services |
| director_fees | Reflects cash payments and/or value of stock grants made in lieu of cash payments to non-employee directors. | 
| total_payments  | Sum of all payments |
  
Stock features:  

| Feature | Meaning |
|:--------|:--------|
| exercised_stock_options | Reflects amounts from exercised stock options which equal the market value in excess of the exercise price on the date the options were exercised either through cashless (same-day sale), stock swap or cash exercises. The reflected gain may differ from that realized by the insider due to fluctuations in the market price and the timing of any subsequent sale of the securities. |
| restricted_stock | Reflects the gross fair market value of shares and accrued dividends (and/or phantom units and dividend equivalents) on the date of release due to lapse of vesting periods, regardless of whether deferred. |
| restricted_stock_deferred  | Reflects value of restricted stock voluntarily deferred prior to release under a deferred compensation arrangement. |
| total_stock_value | Sum of all stock values |
  
Email features:  
  
| Feature | Meaning |
|:--------|:--------|
| from_poi_to_this_person  | Emails sent from a POI to this person |
| shared_receipt_with_poi  | Emails received by this person that were also sent to a POI |
| from_this_person_to_poi  | Emails sent from this person to a POI |
| to_messages | Total messages to this person |
| from_messages | Total messages from this person |
  
The last feature is the label: whether the person is a POI. In this case a POI is someone who was sentenced or took a plea deal, as noted in this article: http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm and as listed in poi_names.txt.

| Feature | meaning |
|---------|---------|
| poi  | person of interest |

### Cleaning and outliers
The dataset had some small problems:
* **Non-persons in the data**: a Total row and a travel agency had been added as data points; both were removed
* **Person with no data**: this entry was removed
* **Two people with shifted columns**: their data was corrected manually
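
The first two clean-up steps could be sketched as follows. The keys in `sample` are placeholders rather than the actual entries in the dataset, the helper names are hypothetical, and `"NaN"` as the missing-value marker is an assumption. The manual correction of the shifted columns is not shown, since it depends on the specific records.

```python
def remove_entries(data_dict, keys_to_drop):
    """Return a copy of data_dict without the given keys (absent keys are ignored)."""
    return {name: feats for name, feats in data_dict.items() if name not in keys_to_drop}


def has_no_data(features, label="poi", missing="NaN"):
    """True when every feature except the label is marked as missing."""
    return all(value == missing for key, value in features.items() if key != label)


# Placeholder records; the real keys and values differ.
sample = {
    "TOTAL": {"salary": 1000, "bonus": 500, "poi": False},
    "EXAMPLE TRAVEL AGENCY": {"salary": "NaN", "bonus": 20, "poi": False},
    "EMPTY PERSON": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "DOE JOHN": {"salary": 600, "bonus": "NaN", "poi": True},
}

# Step 1: drop entries that are not persons.
cleaned = remove_entries(sample, {"TOTAL", "EXAMPLE TRAVEL AGENCY"})
# Step 2: drop persons for whom no feature value is known.
cleaned = {name: feats for name, feats in cleaned.items() if not has_no_data(feats)}
print(sorted(cleaned))
```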

Apart from these problems, I kept all outliers: they are an important part of the dataset, since the biggest outliers are the biggest POIs.

The cleaned dataset can be summarised by the total number of data points, the allocation across classes (POI/non-POI), the number of features available and the number of known (non-missing) values per feature:

There are 18 POIs identified in the dataset  
There are 125 non-POIs identified in the dataset  
  
Number of features available: 20, total number of data points: 143  
  
Feature : poi, known values : 18  
Feature : salary, known values : 94  
Feature : bonus, known values : 81  
Feature : long_term_incentive, known values : 65  
Feature : deferred_income, known values : 49  
Feature : deferral_payments, known values : 37  
Feature : loan_advances, known values : 3  
Feature : other, known values : 90  
Feature : expenses, known values : 96  
Feature : director_fees, known values : 15  
Feature : total_payments, known values : 123  
Feature : exercised_stock_options, known values : 100  
Feature : restricted_stock, known values : 110  
Feature : restricted_stock_deferred, known values : 17  
Feature : total_stock_value, known values : 125  
Feature : from_poi_to_this_person, known values : 74  
Feature : shared_receipt_with_poi, known values : 86  
Feature : from_this_person_to_poi, known values : 66  
Feature : to_messages, known values : 86  
Feature : from_messages, known values : 86  
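
A listing like the one above can be produced with a few lines of Python. This is a minimal sketch on made-up records; the function names are hypothetical and it assumes missing values are stored as the string `"NaN"`. The label is left out of the per-feature count and summarised separately as the class allocation.

```python
from collections import Counter


def count_known_values(data_dict, label="poi", missing="NaN"):
    """Count, per feature, how many persons have a non-missing value."""
    counts = Counter()
    for features in data_dict.values():
        for feature, value in features.items():
            if feature != label and value != missing:
                counts[feature] += 1
    return counts


def class_allocation(data_dict, label="poi"):
    """Count POIs (True) and non-POIs (False)."""
    return Counter(bool(features[label]) for features in data_dict.values())


# Made-up records for illustration only.
sample = {
    "DOE JOHN": {"salary": 250000, "bonus": "NaN", "poi": False},
    "ROE JANE": {"salary": 400000, "bonus": 1000000, "poi": True},
    "POE ANNE": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

known = count_known_values(sample)
allocation = class_allocation(sample)
for feature in ("salary", "bonus"):
    print("Feature : %s, known values : %d" % (feature, known[feature]))
print("POI: %d, non-POI: %d" % (allocation[True], allocation[False]))
```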

## Feature Selection/Engineering
Of the original features in the dataset, the email addresses were removed, since they carry no information about whether someone is a POI.

To create more meaningful features, I introduced features relative to the total (salary / total payments, for instance). I used SelectPercentile from sklearn to score the features and to check whether the relative features score better than the original ones.
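
The scoring step might look roughly like the sketch below. It runs on synthetic data rather than the Enron features, the column names are placeholders, and the choice of `f_classif` (the default score function of `SelectPercentile`) and of `percentile=50` are assumptions, not necessarily the settings used in the project.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data: one column that tracks the label, one pure-noise column.
rng = np.random.RandomState(0)
n_samples = 100
labels = (rng.rand(n_samples) < 0.3).astype(int)
informative = labels * 2.0 + rng.normal(scale=0.5, size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([informative, noise])
names = ["informative_r", "noise"]  # placeholder feature names

# Score every feature against the label and keep the top half.
selector = SelectPercentile(score_func=f_classif, percentile=50)
selector.fit(X, labels)

scores = dict(zip(names, selector.scores_))
selected = [name for name, keep in zip(names, selector.get_support()) if keep]
print(scores)
print(selected)
```

Comparing the score of an original feature with the score of its relative counterpart is then a matter of looking up both names in `scores`.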

Where the relative version scored better, I removed the original feature and kept the relative one:

<code> features_list = ['poi',
                 'from_messages',
                 'restricted_stock_deferred',
                 'to_messages',
                 'director_fees',
                 'other',
                 'from_poi_to_this_person',
                 'expenses',
                 'loan_advances',
                 'restricted_stock',
                 'total_payments',
                 'deferred_income',
                 'salary',
                 'bonus',
                 'exercised_stock_options',
                 'total_stock_value',
                 'deferral_payments_r',
                 'shared_receipt_with_poi_r',
                 'long_term_incentive_r',
                 'from_this_person_to_poi_r']</code>
    
The features with the `_r` suffix are relative to the total (e.g. long_term_incentive_r = long_term_incentive / total_payments).
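
A minimal sketch of how such a relative feature could be computed is shown below. The helper name `add_relative_feature` is hypothetical, the records are made up, and treating `"NaN"` as the missing-value marker (and returning `"NaN"` when the total is missing or zero) is an assumption.

```python
def add_relative_feature(data_dict, feature, total, missing="NaN"):
    """Add '<feature>_r' = feature / total to every record, in place."""
    new_name = feature + "_r"
    for features in data_dict.values():
        value = features.get(feature, missing)
        denominator = features.get(total, missing)
        if value == missing or denominator == missing or denominator == 0:
            # The ratio is undefined, so keep the missing-value marker.
            features[new_name] = missing
        else:
            features[new_name] = float(value) / float(denominator)
    return data_dict


# Made-up records for illustration only.
sample = {
    "DOE JOHN": {"long_term_incentive": 100, "total_payments": 400},
    "ROE JANE": {"long_term_incentive": "NaN", "total_payments": 900},
    "POE ANNE": {"long_term_incentive": 50, "total_payments": 0},
}

add_relative_feature(sample, "long_term_incentive", "total_payments")
for name, features in sample.items():
    print(name, features["long_term_incentive_r"])
```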

## Algorithm

### Pick and Tune an Algorithm

To determine which algorithm performs best, I created a for loop that runs GridSearchCV for each of the selected classifiers. For the parameters to be tuned, I tried to keep the parameter grids as consistent as possible between the algorithms.

I've selected the following classifiers:
* SVC
* DecisionTreeClassifier
* RandomForestClassifier
* AdaBoostClassifier
* GradientBoostingClassifier
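
Such a loop could be organised along the following lines. This is a sketch under several assumptions: it uses synthetic data, only two of the classifiers and two scalers, small illustrative parameter grids, F1 as the search criterion and 3-fold cross-validation. None of these are necessarily the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real feature matrix and POI labels.
X, y = make_classification(n_samples=200, n_features=8, weights=[0.85, 0.15],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

scalers = {
    "min-max": MinMaxScaler(),
    "quantile (uniform)": QuantileTransformer(n_quantiles=50,
                                              output_distribution="uniform"),
}
# Classifier name -> (estimator, parameter grid); grids are illustrative only.
classifiers = {
    "SVC": (SVC(), {"clf__kernel": ["rbf", "poly"], "clf__C": [1, 50]}),
    "DecisionTreeClassifier": (DecisionTreeClassifier(random_state=0),
                               {"clf__min_samples_split": [2, 8]}),
}

results = []
for scaler_name, scaler in scalers.items():
    for clf_name, (clf, grid) in classifiers.items():
        pipe = Pipeline([("scaler", scaler), ("clf", clf)])
        search = GridSearchCV(pipe, grid, scoring="f1", cv=3)
        search.fit(X_train, y_train)
        results.append({
            "classifier": clf_name,
            "scaling": scaler_name,
            "best_params": search.best_params_,
            "f1_score": f1_score(y_test, search.predict(X_test)),
        })

# Best combination first.
results.sort(key=lambda row: row["f1_score"], reverse=True)
for row in results:
    print(row)
```

Collecting one row per classifier and scaler combination makes it straightforward to load the results into a data frame and rank them afterwards.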

The five best results by F1 score, from `result_df.nlargest(5, 'f1_score')`:

| row | accuracy_test_set | accuracy_training_set | best_params | classifier | f1_score | precision_score | recall_score | scaling |
|:----|:------------------|:----------------------|:------------|:-----------|:---------|:----------------|:-------------|:--------|
| 25 | 0.925161 | 0.931852 | {u'kernel': u'poly', u'C': 50} | SVC | 0.571429 | 1.0 | 0.4 | Data after quantile transformation (uniform pdf) |
| 9 | 0.907197 | 1.0 | {u'max_features': u'sqrt', u'loss': u'deviance... | GradientBoostingClassifier | 0.5 | 0.666667 | 0.4 | Data after min-max scaling |
| 21 | 0.890645 | 1.0 | {u'min_samples_split': 2, u'splitter': u'rando... | DecisionTreeClassifier | 0.444444 | 0.5 | 0.4 | Data after quantile transformation (gaussian pdf) |
| 6 | 0.853078 | 0.989636 | {u'min_samples_split': 8, u'splitter': u'best'... | DecisionTreeClassifier | 0.428571 | 0.333333 | 0.6 | Data after min-max scaling |
| 11 | 0.853078 | 1.0 | {u'min_samples_split': 2, u'splitter': u'best'... | DecisionTreeClassifier | 0.428571 | 0.333333 | 0.6 | Data after max-abs scaling |


Of these, the SVC performed best, based on:
* **F1 score**: the F1 score combines precision and recall, so it guards against results that are strongly imbalanced between the two

In addition, it also shows:
* **High precision and a more modest recall**, which means that all of the POIs it identified are indeed POIs, but it misses quite a few (false negatives) because it is so picky. For this purpose I think that is a good thing, since we want to be as sure as we can before accusing someone, as opposed to, for instance, a medical application where a false positive would be more acceptable if it means more coverage.
* **Test-set accuracy (0.925) close to the training-set accuracy (0.932)**, which suggests it is not overfitting the training set.

## Test
Implementing all of this in [*poi_id.py*](poi_id.py) and testing it with tester.py gives the following output:
<code>Pipeline(memory=None,
     steps=[('scaler', QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=1000,
          output_distribution='uniform', random_state=None,
          subsample=100000)), ('SVC', SVC(C=50, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])
        Accuracy: 0.90480       Precision: 0.77186      Recall: 0.40600 F1: 0.53211     F2: 0.44852
        Total predictions: 15000        True positives:  812    False positives:  240   False negatives: 1188   True negatives: 12760</code>
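
For reference, a pipeline with the same two steps can be constructed as sketched below. The step names, `C=50`, `kernel='poly'`, `degree=3` and the uniform output distribution are taken from the printed output; the synthetic data and `n_quantiles=100` are assumptions made so that the example runs on a small sample, and `gamma` is left at the default of whichever sklearn version is installed, which may differ from the `'auto_deprecated'` value shown above.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import SVC

# Synthetic stand-in for the real features and POI labels.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

clf = Pipeline(steps=[
    # n_quantiles is lowered from the printed 1000 to suit the small sample.
    ("scaler", QuantileTransformer(n_quantiles=100, output_distribution="uniform")),
    ("SVC", SVC(C=50, kernel="poly", degree=3)),
])

clf.fit(X, y)
predictions = clf.predict(X)
print(predictions[:10])
```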



So we have:
* Total predictions: 15000        
* True positives:  812    
* False positives:  240   
* False negatives: 1188   
* True negatives: 12760


* Accuracy: 0.90480       
* Precision: 0.77186      
* Recall: 0.40600 
* F1: 0.53211
* F2: 0.44852  
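
These scores follow directly from the four confusion counts. The short check below recomputes them using the standard definitions of accuracy, precision, recall and the F-beta score; only the counts themselves come from the tester output.

```python
# Confusion counts as reported by the tester output above.
tp, fp, fn, tn = 812, 240, 1188, 12760
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of flagged persons that are POIs
recall = tp / (tp + fn)        # share of POIs that are flagged


def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)

print("Total predictions: %d" % total)
print("Accuracy: %.5f  Precision: %.5f  Recall: %.5f  F1: %.5f  F2: %.5f"
      % (accuracy, precision, recall, f1, f2))
```

F2 weights recall more heavily than precision, which is why it comes out lower than F1 for a classifier whose recall is the weaker of the two.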
  
  

So our classifier seems to be doing a reasonably good job:
* 90% of the cases are predicted correctly, either positive or negative (accuracy)
* the precision, TP/(TP+FP), is 77% and higher than the recall, TP/(TP+FN), of 41%, which is what we wanted. Both are high enough to keep the classifier useful: roughly three out of four flagged persons are POIs, and about four out of ten POIs are found