## Identify Fraud from Enron Email

### Overview

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.

The goal of this project is to use machine learning to identify Enron employees who may have committed fraud, based on the public Enron financial and email dataset. I work through the process end to end, investigating the data through a machine learning lens.

### Dataset

The data is combined with a hand-generated list of persons of interest (POIs) in the fraud case: individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.

In [25]:
# %load poi_id.py
#!/usr/bin/python

import sys
import pickle

ospath = 'C:\\Users\\jubin\\Documents\\GitHub\\DAND-Nanodegree\\ud-ml-projects-master\\tools\\'
sys.path.append(ospath)

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi', 'salary', 'exercised_stock_options', 'expenses', 'total_payments', 'total_stock_value']  # You will need to use more features

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files must be read in binary mode
    data_dict = pickle.load(data_file)

In [26]:
import pandas as pd
import numpy as np

enron_data = pd.DataFrame.from_dict(data_dict, orient='index')

In [27]:
enron_data.head()

| | salary | to_messages | deferral_payments | total_payments | exercised_stock_options | bonus | restricted_stock | shared_receipt_with_poi | restricted_stock_deferred | total_stock_value | ... | loan_advances | from_messages | other | from_this_person_to_poi | poi | director_fees | deferred_income | long_term_incentive | email_address | from_poi_to_this_person |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 201955.0 | 2902.0 | 2869717.0 | 4484442 | 1729541.0 | 4175000.0 | 126027.0 | 1407.0 | -126027.0 | 1729541 | ... | NaN | 2195.0 | 152.0 | 65.0 | False | NaN | -3081055.0 | 304805.0 | phillip.allen@enron.com | 47.0 |
| BADUM JAMES P | NaN | NaN | 178980.0 | 182466 | 257817.0 | NaN | NaN | NaN | NaN | 257817 | ... | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN |
| BANNANTINE JAMES M | 477.0 | 566.0 | NaN | 916197 | 4046157.0 | NaN | 1757552.0 | 465.0 | -560222.0 | 5243487 | ... | NaN | 29.0 | 864523.0 | 0.0 | False | NaN | -5104.0 | NaN | james.bannantine@enron.com | 39.0 |
| BAXTER JOHN C | 267102.0 | NaN | 1295738.0 | 5634343 | 6680544.0 | 1200000.0 | 3942714.0 | NaN | NaN | 10623258 | ... | NaN | NaN | 2660303.0 | NaN | False | NaN | -1386055.0 | 1586055.0 | NaN | NaN |
| BAY FRANKLIN R | 239671.0 | NaN | 260455.0 | 827696 | NaN | 400000.0 | 145796.0 | NaN | -82782.0 | 63014 | ... | NaN | NaN | 69.0 | NaN | False | NaN | -201641.0 | NaN | frank.bay@enron.com | NaN |


In [28]:
print("There are total {} people in the dataset.".format(enron_data.shape[0]))
print("Out of which {} are POI and {} are Non-POI.".format(enron_data.poi.value_counts()[True],
                                                          enron_data.poi.value_counts()[False]))
print("Total number of email plus financial features are {}.".format(enron_data.columns.shape[0]-1))
print("Label is 'poi' column.")

There are total 146 people in the dataset.
Out of which 18 are POI and 128 are Non-POI.
Total number of email plus financial features are 20.
Label is 'poi' column.


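The Enron dataset is messy and has many missing values, stored as the string 'NaN'. Almost every feature has missing values, and some features are missing more than 50% of their values. The `freq` column in the table below shows this: it counts the most frequent value of each feature, which is NaN. For example, `deferral_payments` is missing for 107 of 146 people and `restricted_stock_deferred` for 128 of 146.
I convert NaN to 0 so that all values are numeric and can be used to train a machine learning algorithm later.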
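
The share of missing values per feature can also be computed directly from the raw dictionary. The following is a minimal sketch, assuming the dataset is a dict of per-person dicts in which missing entries are the string 'NaN' (as the `to_replace='NaN'` call in this notebook implies). The helper name `missing_fraction` and the toy records are hypothetical and are not taken from the dataset.

```python
def missing_fraction(data_dict, missing="NaN"):
    """Return {feature: fraction of records whose value equals `missing`}."""
    total = len(data_dict)
    if total == 0:
        return {}
    counts = {}
    for record in data_dict.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value == missing:
                counts[feature] += 1
    return {feature: n / float(total) for feature, n in counts.items()}


# Toy records for illustration only (hypothetical values).
toy = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 300, "bonus": 50, "poi": False},
    "PERSON D": {"salary": 400, "bonus": "NaN", "poi": False},
}

fractions = missing_fraction(toy)
# Features that are missing for more than half of the records.
mostly_missing = sorted(f for f, frac in fractions.items() if frac > 0.5)
print(fractions)
print(mostly_missing)
```

Applied to the real dictionary, a list like `mostly_missing` would be one way to decide which features are too sparse to be useful, although replacing NaN with 0 (the approach taken here) keeps every feature available.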

In [29]:
enron_data.describe().transpose()

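| | count | unique | top | freq |
|---|---|---|---|---|
| salary | 146 | 95 | NaN | 51 |
| to_messages | 146 | 87 | NaN | 60 |
| deferral_payments | 146 | 40 | NaN | 107 |
| total_payments | 146 | 126 | NaN | 21 |
| exercised_stock_options | 146 | 102 | NaN | 44 |
| bonus | 146 | 42 | NaN | 64 |
| restricted_stock | 146 | 98 | NaN | 36 |
| shared_receipt_with_poi | 146 | 84 | NaN | 60 |
| restricted_stock_deferred | 146 | 19 | NaN | 128 |
| total_stock_value | 146 | 125 | NaN | 20 |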


In [30]:
enron_data.replace(to_replace='NaN', value=0, inplace=True)
enron_data.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| salary | 146.0 | 365811.4 | 2203575.0 | 0.0 | 0.0 | 210596.0 | 270850.5 | 26704229.0 |
| to_messages | 146.0 | 1221.589 | 2226.771 | 0.0 | 0.0 | 289.0 | 1585.75 | 15149.0 |
| deferral_payments | 146.0 | 438796.5 | 2741325.0 | -102500.0 | 0.0 | 0.0 | 9684.5 | 32083396.0 |
| total_payments | 146.0 | 4350622.0 | 26934480.0 | 0.0 | 93944.75 | 941359.5 | 1968286.75 | 309886585.0 |
| exercised_stock_options | 146.0 | 4182736.0 | 26070400.0 | 0.0 | 0.0 | 608293.5 | 1714220.75 | 311764000.0 |
| bonus | 146.0 | 1333474.0 | 8094029.0 | 0.0 | 0.0 | 300000.0 | 800000.0 | 97343619.0 |
| restricted_stock | 146.0 | 1749257.0 | 10899950.0 | -2604490.0 | 8115.0 | 360528.0 | 814528.0 | 130322299.0 |
| shared_receipt_with_poi | 146.0 | 692.9863 | 1072.969 | 0.0 | 0.0 | 102.5 | 893.5 | 5521.0 |
| restricted_stock_deferred | 146.0 | 20516.37 | 1439661.0 | -7576788.0 | 0.0 | 0.0 | 0.0 | 15456290.0 |
| total_stock_value | 146.0 | 5846018.0 | 36246810.0 | -44093.0 | 228869.5 | 965955.0 | 2319991.25 | 434509511.0 |


In [31]:
### Task 2: Remove outliers
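
The summary statistics above hint at where to look: the maximum salary (26,704,229) is roughly a hundred times the 75th percentile (270,850.5), so at least one record is far out of line with the rest. Below is a minimal sketch of how such records could be surfaced and dropped, assuming the same dict-of-dicts layout with 'NaN' for missing values. The helper `top_records`, the toy names, and the decision to drop the top record are illustrative assumptions, not the method used in this project.

```python
def top_records(data_dict, feature, n=3, missing="NaN"):
    """Return the n (name, value) pairs with the largest value for `feature`."""
    present = [
        (name, record[feature])
        for name, record in data_dict.items()
        if record.get(feature, missing) != missing
    ]
    return sorted(present, key=lambda pair: pair[1], reverse=True)[:n]


# Toy records for illustration only (hypothetical names and values).
toy = {
    "PERSON A": {"salary": 200000},
    "PERSON B": {"salary": 250000},
    "PERSON C": {"salary": "NaN"},
    "SUSPECT ROW": {"salary": 26000000},
}

ranked = top_records(toy, "salary")
print(ranked)

# After inspecting the ranked list by hand, drop a record judged invalid.
# pop(key, None) avoids a KeyError if the key is already gone.
toy.pop("SUSPECT ROW", None)
print(sorted(toy))
```

Whether a flagged record is a data error or a genuine extreme value has to be decided by inspection; very high earners may be exactly the people the classifier should find.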


In [None]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
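# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn.model_selection import train_test_split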

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)
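
Task 5 asks for precision and recall above 0.3 rather than accuracy, and the class counts printed earlier show why: with 18 POIs among 146 records, a classifier that labels everyone non-POI is right 128/146 ≈ 87.7% of the time yet finds no POI at all. The sketch below works through that baseline in plain Python. The function `precision_recall` is a hypothetical helper written for illustration (it is not the code in `tester.py`), and it assumes that an undefined ratio is reported as 0.0. The second set of predictions is invented purely to show the arithmetic.

```python
def precision_recall(labels, predictions):
    """Compute precision and recall for the positive class (1 = POI)."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    # Assumption: report 0.0 when a denominator is zero.
    flagged_total = true_pos + false_pos
    actual_total = true_pos + false_neg
    precision = true_pos / float(flagged_total) if flagged_total else 0.0
    recall = true_pos / float(actual_total) if actual_total else 0.0
    return precision, recall


# Class counts taken from the notebook output: 18 POI, 128 non-POI.
labels = [1] * 18 + [0] * 128

# Baseline that never flags anyone.
always_negative = [0] * len(labels)
accuracy = sum(1 for y, p in zip(labels, always_negative) if y == p) / float(len(labels))
baseline_precision, baseline_recall = precision_recall(labels, always_negative)
print(round(accuracy, 3), baseline_precision, baseline_recall)

# Hypothetical predictions: 9 true POIs flagged, 12 non-POIs flagged by mistake.
flagged = [1] * 9 + [0] * 9 + [1] * 12 + [0] * 116
precision, recall = precision_recall(labels, flagged)
print(round(precision, 3), round(recall, 3))
```

The hypothetical predictions have lower accuracy than the do-nothing baseline (125 of 146 correct) but clear both 0.3 thresholds, which is why accuracy alone is a poor guide on a dataset this imbalanced.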