# Enron Fraud

## Initial Load
### Import packages:

In [1]:
import sys
import pickle
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import ExtraTreesClassifier
import seaborn as sns
import matplotlib.pyplot as plt
sys.path.append("../tools/")
%matplotlib inline
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data



### Load dataset and create dataframe

In [2]:
features_email = ['to_messages', 'from_messages',  'from_poi_to_this_person',
           'from_this_person_to_poi', 'shared_receipt_with_poi']
# finance data
features_finance = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
             'deferral_payments', 'loan_advances', 'other', 'expenses',
             'director_fees', 'total_payments',
             'exercised_stock_options', 'restricted_stock',
             'restricted_stock_deferred', 'total_stock_value']
# all features
features_list = features_email + features_finance
# all features column names
features_column_names = ['poi'] + ['email_address'] + features_email + features_finance
# all features data type
features_dtype = [bool] + [str] + list(np.repeat(float, 19))

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickles are binary files
    data_dict = pickle.load(data_file)

# converting the data into a data frame
df = DataFrame.from_dict(data_dict, orient='index')

# reordering the columns
df = df.loc[:, features_column_names]

# converting the data type, column by column
for name, dtype in zip(features_column_names, features_dtype):
    df[name] = df[name].astype(dtype, errors='ignore')
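
The effect of the type conversion is easiest to see on a small example. The sketch below uses a hypothetical two-person dictionary (`toy_dict`, not the real dataset) and assumes that missing values are stored as the string `'NaN'`; casting a column to `float` then turns those strings into real missing values that `isnull()` can count.

```python
import pandas as pd

# Hypothetical stand-in for data_dict; assumes missing values are the string 'NaN'.
toy_dict = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": "NaN", "bonus": 50000},
}

toy_df = pd.DataFrame.from_dict(toy_dict, orient="index")

# Casting to float parses the 'NaN' strings into real missing values.
for column in ["salary", "bonus"]:
    toy_df[column] = toy_df[column].astype(float)

print(toy_df.dtypes)
print(toy_df.isnull().sum())
```

Columns that stay as strings are not affected by this, so a `'NaN'` string in such a column would still look like a regular value.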

## Descriptive analysis & wrangling of dataset

### Raw table

In [3]:
print df

                                 poi                     email_address  \
ALLEN PHILLIP K                False           phillip.allen@enron.com   
BADUM JAMES P                  False                               NaN   
BANNANTINE JAMES M             False        james.bannantine@enron.com   
BAXTER JOHN C                  False                               NaN   
BAY FRANKLIN R                 False               frank.bay@enron.com   
BAZELIDES PHILIP J             False                               NaN   
BECK SALLY W                   False              sally.beck@enron.com   
BELDEN TIMOTHY N                True              tim.belden@enron.com   
BELFER ROBERT                  False                               NaN   
BERBERIAN DAVID                False         david.berberian@enron.com   
BERGSIEKER RICHARD P           False         rick.bergsieker@enron.com   
BHATNAGAR SANJAY               False        sanjay.bhatnagar@enron.com   
BIBI PHILIPPE A                False  


### Add ratios of email & receipt sharing with POIs to df
Three ratios are added per person: the share of received emails that came from POIs, the share of sent emails that went to POIs, and the share of received emails that had a shared receipt with a POI.

In [4]:
# calculate ratio
df['recieved_from_poi_ratio'] = df['from_poi_to_this_person'] / df['to_messages']
df['sent_to_poi_ratio'] = df['from_this_person_to_poi'] / df['from_messages']
df['shared_receipt_with_poi_ratio'] = df['shared_receipt_with_poi'] / df['to_messages']
# add labels to df
features_email_new = ['recieved_from_poi_ratio', 'sent_to_poi_ratio', 'shared_receipt_with_poi_ratio']
features_all = features_list + features_email_new
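
As a quick check of how these divisions behave, here is a minimal sketch on a hypothetical three-person frame (`toy`, with made-up values). A missing numerator or denominator propagates to a missing ratio, which is why the new columns have the same number of missing values as the email counts they are built from.

```python
import numpy as np
import pandas as pd

# Hypothetical values, only to illustrate the arithmetic.
toy = pd.DataFrame(
    {
        "from_poi_to_this_person": [10.0, 0.0, np.nan],
        "to_messages": [100.0, 50.0, np.nan],
    },
    index=["A", "B", "C"],
)

# Same column name and formula as in the cell above.
toy["recieved_from_poi_ratio"] = toy["from_poi_to_this_person"] / toy["to_messages"]
print(toy)
```

A zero denominator would produce `inf` or `NaN`; in this dataset the minimum of `to_messages` is 57 and the minimum of `from_messages` is 12 (see the `describe()` output further down), so that case does not occur.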

### Remove invalid rows
The `TOTAL` and `THE TRAVEL AGENCY IN THE PARK` rows are mistakes in the dataset, as discovered and discussed during the Nanodegree course. They are removed here as well; for reference, please check the ML course on Udacity.

In [5]:
df = df[df.index != 'TOTAL']
df = df[df.index != 'THE TRAVEL AGENCY IN THE PARK']
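
An alternative way to write the same step is `DataFrame.drop` with a list of labels. The sketch below runs on a hypothetical three-row frame (`toy`); `errors="ignore"` is used so that the cell can be re-run without raising when the rows are already gone.

```python
import pandas as pd

# Hypothetical frame with one real person and the two invalid index labels.
toy = pd.DataFrame(
    {"salary": [1.0, 2.0, 3.0]},
    index=["PERSON A", "TOTAL", "THE TRAVEL AGENCY IN THE PARK"],
)

non_person_rows = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]

# Drop by index label; labels that are absent are silently skipped.
cleaned = toy.drop(non_person_rows, errors="ignore")
print(cleaned)
```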

### Shape of the table:

In [6]:
#Dataset Shape:
df.shape

(144, 24)

The shape of the dataset shows that there are 144 rows (persons) and 24 columns (features), of which 3 are the POI interaction ratios that were just added.

### Missing features

In [7]:
df_null_value_ratio = (df.isnull().sum() / df.shape[0]).sort_values(ascending=False)
df_null_values = (df.isnull().sum()).sort_values(ascending=False)
frames = [df_null_values, df_null_value_ratio]
print pd.concat(frames, axis=1, join_axes=[df_null_values.index])

                                 0         1
loan_advances                  141  0.979167
director_fees                  128  0.888889
restricted_stock_deferred      127  0.881944
deferral_payments              106  0.736111
deferred_income                 96  0.666667
long_term_incentive             79  0.548611
bonus                           63  0.437500
to_messages                     58  0.402778
from_messages                   58  0.402778
from_poi_to_this_person         58  0.402778
from_this_person_to_poi         58  0.402778
shared_receipt_with_poi         58  0.402778
shared_receipt_with_poi_ratio   58  0.402778
sent_to_poi_ratio               58  0.402778
recieved_from_poi_ratio         58  0.402778
other                           53  0.368056
expenses                        50  0.347222
salary                          50  0.347222
exercised_stock_options         43  0.298611
restricted_stock                35  0.243056
total_payments                  21  0.145833
total_stoc

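This is good news! Even though many feature values are missing, every person in the dataset has a POI flag. The `email_address` column also reports no missing values, but the raw table above shows `NaN` entries in it; these are presumably stored as the string `'NaN'`, which `isnull()` does not count as missing.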
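
The missing-value table can also be built with named columns instead of the `0` and `1` headers. This is a sketch on a hypothetical frame (`toy`); the column names `missing_count` and `missing_ratio` are my own choice. It avoids the `join_axes` argument of `pd.concat`, which was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values.
toy = pd.DataFrame(
    {
        "salary": [1.0, np.nan, 3.0, np.nan],
        "bonus": [np.nan, np.nan, np.nan, 4.0],
        "poi": [True, False, False, False],
    }
)

# isnull().sum() counts missing values; isnull().mean() gives their share.
missing = pd.DataFrame(
    {
        "missing_count": toy.isnull().sum(),
        "missing_ratio": toy.isnull().mean(),
    }
).sort_values("missing_count", ascending=False)

print(missing)
```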

### POI characteristics

In [8]:
# count number of pois
print 'Number of POIs:'
print df.loc[df['poi'] == True].count().sort_values()

Number of POIs:
restricted_stock_deferred         0
director_fees                     0
loan_advances                     1
deferral_payments                 5
deferred_income                  11
exercised_stock_options          12
long_term_incentive              12
recieved_from_poi_ratio          14
sent_to_poi_ratio                14
shared_receipt_with_poi_ratio    14
shared_receipt_with_poi          14
from_this_person_to_poi          14
from_poi_to_this_person          14
from_messages                    14
to_messages                      14
bonus                            16
salary                           17
restricted_stock                 17
other                            18
expenses                         18
total_payments                   18
total_stock_value                18
email_address                    18
poi                              18
dtype: int64


The table above shows the number of POIs as well as the count of available values per feature. At first I was very happy about the features that are missing for all POIs, until I realized that they are also missing for most of the rest of the dataset.

POIs make up only 12.5% of the dataset (18 of 144), so the class distribution is skewed. This needs to be considered when evaluating the classification algorithms. If a trivial classifier that always predicts POI = False were deployed, its accuracy would already be 87.5% (126 of 144). Developing an algorithm with an accuracy of 87.5% in less than 5 minutes does not sound too bad, but it doesn't help in identifying the POIs at all.
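
Working the numbers through with the counts shown above (144 persons, 18 POIs) makes the problem concrete. The sketch below scores a hypothetical classifier that always predicts "not a POI"; it is plain arithmetic, not a model from this notebook.

```python
# Counts taken from the tables above.
n_total = 144
n_poi = 18
n_non_poi = n_total - n_poi

# A classifier that always predicts POI = False:
true_positives = 0           # it never flags anyone
true_negatives = n_non_poi   # every non-POI is classified correctly
false_negatives = n_poi      # every POI is missed

accuracy = (true_positives + true_negatives) / float(n_total)
recall = true_positives / float(true_positives + false_negatives)

print(accuracy)  # 0.875
print(recall)    # 0.0
```

Accuracy looks respectable while recall is zero, which is why precision and recall are more informative than accuracy for this dataset.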

### Positional Parameters
Outliers and extreme values are defined as values above Q3 + 1.5 IQR and Q3 + 3 IQR, respectively.

In [9]:
print df.describe()

        to_messages  from_messages  from_poi_to_this_person  \
count     86.000000      86.000000                86.000000   
mean    2073.860465     608.790698                64.895349   
std     2582.700981    1841.033949                86.979244   
min       57.000000      12.000000                 0.000000   
25%      541.250000      22.750000                10.000000   
50%     1211.000000      41.000000                35.000000   
75%     2634.750000     145.500000                72.250000   
max    15149.000000   14368.000000               528.000000   

       from_this_person_to_poi  shared_receipt_with_poi        salary  \
count                86.000000                86.000000  9.400000e+01   
mean                 41.232558              1176.465116  2.840875e+05   
std                 100.073111              1178.317641  1.771311e+05   
min                   0.000000                 2.000000  4.770000e+02   
25%                   1.000000               249.750000  2.118020e+

#### Outlier and extreme value detection
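
A minimal sketch of how the two thresholds could be applied to a single feature. The helper name `iqr_flags` and the toy series are hypothetical, and only the upper fences are checked, following the definition given for this section.

```python
import pandas as pd


def iqr_flags(series):
    """Flag values above Q3 + 1.5 IQR (outlier) and Q3 + 3 IQR (extreme)."""
    values = series.dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame(
        {
            "outlier": values > q3 + 1.5 * iqr,
            "extreme": values > q3 + 3.0 * iqr,
        }
    )


# Hypothetical values: Q1 = 3.25, Q3 = 7.75, IQR = 4.5
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 20, 100], dtype=float)
flags = iqr_flags(toy)
print(flags[flags["outlier"]])
```

Applied to the `to_messages` quartiles printed above (Q1 = 541.25, Q3 = 2634.75, IQR = 2093.5), the thresholds would be 5775.0 for outliers and 8915.25 for extreme values, so the maximum of 15149 exceeds both.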

### Correlation coefficients:

In [10]:
print df.corr()

                                    poi  to_messages  from_messages  \
poi                            1.000000     0.058954      -0.074308   
to_messages                    0.058954     1.000000       0.475450   
from_messages                 -0.074308     0.475450       1.000000   
from_poi_to_this_person        0.167722     0.525667       0.186708   
from_this_person_to_poi        0.112940     0.568506       0.588687   
shared_receipt_with_poi        0.228313     0.847990       0.230855   
salary                         0.264976     0.187047      -0.003541   
bonus                          0.302384     0.372997       0.052725   
long_term_incentive            0.254723     0.134277      -0.071958   
deferred_income               -0.265698    -0.350815      -0.319995   
deferral_payments             -0.098428     0.310129       0.321947   
loan_advances                  0.999851     0.739805      -0.213768   
other                          0.120270     0.040580      -0.101686   
expens
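
One caveat when reading this table: `corr()` uses only the rows where both features are present. `loan_advances` has just 3 non-missing values (141 of 144 are missing), so its correlation of 0.999851 with `poi` rests on three data points. A sketch of how such pairs could be suppressed with the `min_periods` argument, on a hypothetical frame (`toy`) and with an arbitrary threshold of 5:

```python
import numpy as np
import pandas as pd

# Hypothetical data: loan_advances has only 3 non-missing values.
toy = pd.DataFrame(
    {
        "poi": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        "bonus": [1.0, 2.0, 6.0, 2.5, 7.0, 1.5, 5.0, 3.0],
        "loan_advances": [np.nan, np.nan, 9.0, 1.0, np.nan, 1.5, np.nan, np.nan],
    }
)

default_corr = toy.corr()               # any pair with enough points to compute
guarded_corr = toy.corr(min_periods=5)  # pairs with fewer than 5 shared rows become NaN

print(default_corr["poi"])
print(guarded_corr["poi"])
```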
