* * * 

<div align="Right">
  ©    Josefin Axberg 2017<br>
 </div>

# Final Project: Discovering the Enron Fraud

## Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The resulting federal investigation put a significant amount of typically confidential information into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I play detective and put my new skills to use by building a person of interest identifier from the email and financial data for 146 Enron executives that was made public as a result of the scandal. A person of interest (POI) is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity. This report documents the machine learning techniques used in building the POI identifier.

There are four major steps in my project:
1. **Explore data**
2. **Feature engineering**
3. **Classifiers**
4. **Validation**

![](enron.jpg)

In [1]:
import time
print("Today is %s" % time.strftime("%Y-%m-%d"))

Today is 2017-10-12


In [2]:
#!/usr/bin/python

import os
import pickle
import re
import sys
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit

# Features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi".
features_list = ['poi'] 

Let me start by loading the data and looking at how it is structured.

In [3]:
# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files are read in binary mode
    data_dict = pickle.load(data_file)

my_dataset = data_dict
df_enron = pd.DataFrame(my_dataset) # Load pickle data into a DataFrame for feature engineering etc.
df_enron = df_enron.T               # Transpose: names become the index, features become columns


df_enron.head()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
ALLEN PHILLIP K,4175000.0,2869717.0,-3081055.0,,phillip.allen@enron.com,1729541.0,13868,2195.0,47.0,65.0,...,304805.0,152.0,False,126027.0,-126027.0,201955.0,1407.0,2902.0,4484442,1729541
BADUM JAMES P,,178980.0,,,,257817.0,3486,,,,...,,,False,,,,,,182466,257817
BANNANTINE JAMES M,,,-5104.0,,james.bannantine@enron.com,4046157.0,56301,29.0,39.0,0.0,...,,864523.0,False,1757552.0,-560222.0,477.0,465.0,566.0,916197,5243487
BAXTER JOHN C,1200000.0,1295738.0,-1386055.0,,,6680544.0,11200,,,,...,1586055.0,2660303.0,False,3942714.0,,267102.0,,,5634343,10623258
BAY FRANKLIN R,400000.0,260455.0,-201641.0,,frank.bay@enron.com,,129142,,,,...,,69.0,False,145796.0,-82782.0,239671.0,,,827696,63014


In [4]:
df_enron.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,...,146.0,146.0,146,146.0,146.0,146.0,146.0,146.0,146.0,146.0
unique,42.0,40.0,45.0,18.0,112.0,102.0,95.0,65.0,58.0,42.0,...,53.0,93.0,2,98.0,19.0,95.0,84.0,87.0,126.0,125.0
top,,,,,,,,,,,...,,,False,,,,,,,
freq,64.0,107.0,97.0,129.0,35.0,44.0,51.0,60.0,60.0,60.0,...,80.0,53.0,128,36.0,128.0,51.0,60.0,60.0,21.0,20.0


In [5]:
df_enron.shape

(146, 21)

In [7]:
print "There are ", len(data_dict.keys()), " executives in Enron Dataset."

There are  146  executives in Enron Dataset.
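The `describe()` output above lists `False` as the most frequent `poi` value, with 128 of the 146 rows, which leaves 18 POIs. A minimal sketch of checking that class balance directly is shown here. It uses a small made-up dictionary in the same shape as `data_dict`; the names and values in `toy_dict` are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical miniature of data_dict: names and values are made up for illustration.
toy_dict = {
    "PERSON A": {"poi": False, "salary": 200000, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": 1000000, "bonus": 5000000},
    "PERSON C": {"poi": False, "salary": "NaN", "bonus": 300000},
    "PERSON D": {"poi": False, "salary": 250000, "bonus": 400000},
}

# Same construction as in the notebook: people become rows, features become columns.
toy_df = pd.DataFrame(toy_dict).T

# Count the positive class and derive the rest from the row count.
n_poi = int(toy_df["poi"].astype(bool).sum())
n_non_poi = len(toy_df) - n_poi
poi_share = float(n_poi) / len(toy_df)

print("POIs: %d, non-POIs: %d, POI share: %.2f" % (n_poi, n_non_poi, poi_share))
```

Knowing the share of POIs up front matters later, because a strongly imbalanced label makes plain accuracy a weak measure of classifier quality.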


In [8]:
print data_dict.keys()

['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY 

# Explore Data

First, I want to inspect the data visually and check whether it contains NaN values or outliers.

In [10]:
df_enron = df_enron.replace('NaN', np.nan)

In [11]:
print "Amount of NaN values in the dataset: ", df_enron.isnull().sum().sum()


df_enron.loc[(df_enron['email_address'].isnull()) |
              (df_enron['deferral_payments'].isnull()) |
              (df_enron['bonus'].isnull()) |
              (df_enron['from_messages'].isnull())]

Amount of NaN values in the dataset:  1358


Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
BADUM JAMES P,,178980.0,,,,257817.0,3486.0,,,,...,,,False,,,,,,182466.0,257817.0
BANNANTINE JAMES M,,,-5104.0,,james.bannantine@enron.com,4046157.0,56301.0,29.0,39.0,0.0,...,,864523.0,False,1757552.0,-560222.0,477.0,465.0,566.0,916197.0,5243487.0
BAXTER JOHN C,1200000.0,1295738.0,-1386055.0,,,6680544.0,11200.0,,,,...,1586055.0,2660303.0,False,3942714.0,,267102.0,,,5634343.0,10623258.0
BAY FRANKLIN R,400000.0,260455.0,-201641.0,,frank.bay@enron.com,,129142.0,,,,...,,69.0,False,145796.0,-82782.0,239671.0,,,827696.0,63014.0
BAZELIDES PHILIP J,,684694.0,,,,1599641.0,,,,,...,93750.0,874.0,False,,,80818.0,,,860136.0,1599641.0
BECK SALLY W,700000.0,,,,sally.beck@enron.com,,37172.0,4343.0,144.0,386.0,...,,566.0,False,126027.0,,231330.0,2639.0,7315.0,969068.0,126027.0
BELFER ROBERT,,-102500.0,,3285.0,,3285.0,,,,,...,,,False,,44093.0,,,,102500.0,-44093.0
BERBERIAN DAVID,,,,,david.berberian@enron.com,1624396.0,11892.0,,,,...,,,False,869220.0,,216582.0,,,228474.0,2493616.0
BERGSIEKER RICHARD P,250000.0,,-485813.0,,rick.bergsieker@enron.com,,59175.0,59.0,4.0,0.0,...,180250.0,427316.0,False,659249.0,,187922.0,233.0,383.0,618850.0,659249.0
BHATNAGAR SANJAY,,,,137864.0,sanjay.bhatnagar@enron.com,2604490.0,,29.0,0.0,1.0,...,,137864.0,False,-2604490.0,15456290.0,,463.0,523.0,15456290.0,
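The total NaN count says little about where the gaps are. As a hedged sketch, the same `isnull()` call can be summed per feature and per person to see which columns and rows are mostly empty. The frame below is a hypothetical toy example, not the Enron data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the string 'NaN' as the missing-value marker.
toy_df = pd.DataFrame(
    {
        "salary": [200000, "NaN", 250000, "NaN"],
        "bonus": ["NaN", 5000000, 300000, "NaN"],
        "email_address": ["a@example.com", "NaN", "NaN", "NaN"],
        "poi": [False, True, False, False],
    },
    index=["PERSON A", "PERSON B", "PERSON C", "PERSON D"],
)

# Convert the string marker to a real missing value, as done for df_enron.
toy_df = toy_df.replace("NaN", np.nan)

# Missing values per feature (column), most incomplete feature first.
nan_per_feature = toy_df.isnull().sum().sort_values(ascending=False)

# Missing values per person (row).
nan_per_person = toy_df.isnull().sum(axis=1)

# Grand total, matching the isnull().sum().sum() call used in the notebook.
total_nan = int(toy_df.isnull().sum().sum())

print(nan_per_feature)
print(nan_per_person)
print("Total NaN values: %d" % total_nan)
```

Features that are missing for most people are weak candidates for the classifier, and rows that are missing almost everything are worth a manual look.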


In [16]:
df_enron = df_enron.fillna(0, axis=1)

df_enron.head(5)

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
ALLEN PHILLIP K,4175000.0,2869717.0,-3081055.0,0.0,phillip.allen@enron.com,1729541.0,13868.0,2195.0,47.0,65.0,...,304805.0,152.0,False,126027.0,-126027.0,201955.0,1407.0,2902.0,4484442.0,1729541.0
BADUM JAMES P,0.0,178980.0,0.0,0.0,0,257817.0,3486.0,0.0,0.0,0.0,...,0.0,0.0,False,0.0,0.0,0.0,0.0,0.0,182466.0,257817.0
BANNANTINE JAMES M,0.0,0.0,-5104.0,0.0,james.bannantine@enron.com,4046157.0,56301.0,29.0,39.0,0.0,...,0.0,864523.0,False,1757552.0,-560222.0,477.0,465.0,566.0,916197.0,5243487.0
BAXTER JOHN C,1200000.0,1295738.0,-1386055.0,0.0,0,6680544.0,11200.0,0.0,0.0,0.0,...,1586055.0,2660303.0,False,3942714.0,0.0,267102.0,0.0,0.0,5634343.0,10623258.0
BAY FRANKLIN R,400000.0,260455.0,-201641.0,0.0,frank.bay@enron.com,0.0,129142.0,0.0,0.0,0.0,...,0.0,69.0,False,145796.0,-82782.0,239671.0,0.0,0.0,827696.0,63014.0


In [17]:
sb.pairplot(df_enron, hue='poi')

<seaborn.axisgrid.PairGrid at 0x7fa329a5c610>

In [18]:
df_enron.plot.scatter(x='salary', y='bonus', color='steelblue');

In [19]:
# Remove any outliers before proceeding further.
# Pattern adapted from an iris example, which dropped 'Iris-setosa' rows
# with a sepal width under 2.5 cm using a boolean mask on .loc:
#   iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
# Here the same idea drops the 'TOTAL' row, a spreadsheet sum rather than a person (no-op if absent).
df_enron = df_enron.loc[df_enron.index != 'TOTAL']
df_enron['salary'].astype(float).hist()
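One hedged way to find outlier candidates before removing anything is the interquartile range (IQR) rule: flag values that lie more than 1.5 IQRs outside the middle half of the data. The sketch below uses made-up salary and bonus figures, and the row named `AGGREGATE ROW` is a hypothetical stand-in for a non-person entry. The helper `iqr_outliers` is illustrative and not part of the project code.

```python
import pandas as pd

# Hypothetical values; 'AGGREGATE ROW' stands in for an entry that is not a person.
toy_df = pd.DataFrame(
    {
        "salary": [200000.0, 250000.0, 300000.0, 275000.0, 1025000.0],
        "bonus": [400000.0, 600000.0, 800000.0, 700000.0, 2500000.0],
    },
    index=["PERSON A", "PERSON B", "PERSON C", "PERSON D", "AGGREGATE ROW"],
)


def iqr_outliers(series, k=1.5):
    """Return the entries lying more than k IQRs below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Flag candidates first, inspect them, and only then drop them.
salary_outliers = iqr_outliers(toy_df["salary"])
toy_clean = toy_df.drop(salary_outliers.index)

print(salary_outliers)
print(toy_clean)
```

In a fraud dataset an extreme value may be exactly the signal of interest, so flagged rows deserve a manual check rather than automatic removal.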

# Feature engineering

In [None]:
# To do:
# - Compute correlations between features
# - Select features
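A possible starting point for the correlation step is to rank each numeric feature by the absolute value of its correlation with the `poi` label. This is only a sketch on hypothetical numbers: the values in `toy_df` are invented, and a linear correlation is a rough screening tool rather than a full feature-selection method.

```python
import pandas as pd

# Hypothetical values chosen so that salary tracks the label and expenses does not.
toy_df = pd.DataFrame(
    {
        "poi": [False, False, False, True, True, False],
        "salary": [200000.0, 210000.0, 190000.0, 400000.0, 420000.0, 205000.0],
        "expenses": [50000.0, 10000.0, 30000.0, 20000.0, 40000.0, 25000.0],
    }
)

# Cast everything to float so the boolean label takes part in the correlation.
numeric = toy_df.astype(float)

# Pearson correlation of every feature with the label, excluding the label itself.
corr_with_poi = numeric.corr()["poi"].drop("poi")

# Strongest relationships first, regardless of sign.
ranked = corr_with_poi.abs().sort_values(ascending=False)

print(ranked)
```

The same `corr()` matrix can also show pairs of features that are strongly correlated with each other, which is a hint that one of the two may be redundant.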

In [None]:
features_list = ['poi','salary', 'bonus', 'total_stock_value', 'from_poi_to_this_person', 'from_this_person_to_poi'] 




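# Candidate feature list to try later. 'fraction_from_poi_email' and 'fraction_to_poi_email'
# are not columns of the dataset yet and would have to be engineered first.
# features_list = ["poi", "salary", "bonus", "fraction_from_poi_email", "fraction_to_poi_email", 'deferral_payments', 'total_payments', 'loan_advances', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value']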
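The names `fraction_from_poi_email` and `fraction_to_poi_email` do not appear among the dataset columns shown earlier, so they would need to be derived. One plausible definition, stated here as an assumption, divides the POI-related message counts by the person's total received and sent messages. The sketch uses the email counts of the first three rows printed by `df_enron.head()`; the helper `fraction` is hypothetical.

```python
import numpy as np
import pandas as pd

# Email counts copied from the first three rows of the head() output above.
emails = pd.DataFrame(
    {
        "from_messages": [2195.0, np.nan, 29.0],
        "to_messages": [2902.0, np.nan, 566.0],
        "from_poi_to_this_person": [47.0, np.nan, 39.0],
        "from_this_person_to_poi": [65.0, np.nan, 0.0],
    },
    index=["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M"],
)


def fraction(numerator, denominator):
    """Element-wise ratio; 0.0 where a value is missing or the denominator is 0."""
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(0.0)


# Assumed definitions: share of received mail that came from POIs,
# and share of sent mail that went to POIs.
emails["fraction_from_poi_email"] = fraction(
    emails["from_poi_to_this_person"], emails["to_messages"]
)
emails["fraction_to_poi_email"] = fraction(
    emails["from_this_person_to_poi"], emails["from_messages"]
)

print(emails[["fraction_from_poi_email", "fraction_to_poi_email"]])
```

Ratios like these put heavy and light email users on the same scale, which raw message counts do not.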

# Classifiers

In [None]:
# To do:
# - Feature importance
# - Precision/recall

In [None]:
# Placeholder: the column(s) to drop have not been chosen yet; dropping '' raises a KeyError.
# df_enron = df_enron.drop([''], axis=1)

In [None]:
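# Create the classifier
from sklearn.ensemble import RandomForestClassifier as RFC
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.svm import SVC

#clf = DecisionTreeClassifier()
clf = RFC(n_estimators=10)
#clf = SVC(C=500)
clf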

In [None]:
feature_list = [f for f in features_list if f != 'poi']  # predictors only; 'poi' is the label

In [None]:
X = df_enron[feature_list].values.astype(float)  # numeric feature matrix
Y = df_enron['poi'].values.astype(bool)          # boolean labels instead of object dtype

In [None]:
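from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing and train on the remaining 75%.
(training_inputs, testing_inputs, training_scores, testing_scores) = train_test_split(X, Y, test_size=0.25, random_state=43)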

In [None]:
clf.fit(training_inputs, training_scores);
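For the feature importance item on the to-do list, a fitted random forest in scikit-learn exposes `feature_importances_`, one value per input column. The sketch below shows the pattern on synthetic data in which only the first column determines the label. The feature names and the data are hypothetical, and the importances of a forest on the real Enron features may be far less clear-cut.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column carries information about the label.
rng = np.random.RandomState(0)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# One importance per column; the values sum to 1.
importances = forest.feature_importances_

# Pair names with importances and list the most important feature first.
ranking = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print("%-12s %.3f" % (name, score))
```

With the notebook's own variables, the same pairing would use `feature_list` and `clf.feature_importances_` after `clf.fit(...)`.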

In [None]:
clf.score(testing_inputs, testing_scores)
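`clf.score` reports accuracy, which can look good on this dataset for the wrong reason: the `describe()` output shows 128 of 146 rows labelled `False`. The sketch below builds labels with that balance and scores a baseline that never flags anyone. The helper `precision_recall` is written out by hand for illustration; scikit-learn offers `precision_score` and `recall_score` in `sklearn.metrics` for the same purpose.

```python
import numpy as np

# Labels with the class balance reported by describe(): 128 non-POIs, 18 POIs.
y_true = np.array([False] * 128 + [True] * 18)

# Baseline "classifier" that predicts non-POI for everyone.
y_pred = np.zeros_like(y_true)


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (POI) class; 0.0 when undefined."""
    tp = np.sum(y_true & y_pred)    # POIs correctly flagged
    fp = np.sum(~y_true & y_pred)   # non-POIs wrongly flagged
    fn = np.sum(y_true & ~y_pred)   # POIs that were missed
    precision = tp / float(tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / float(tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


accuracy = float(np.mean(y_true == y_pred))
precision, recall = precision_recall(y_true, y_pred)

print("accuracy %.3f, precision %.3f, recall %.3f" % (accuracy, precision, recall))
```

The baseline is right about every non-POI and still finds no POI at all, which is why precision and recall are the more informative measures for a POI identifier.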

# Validation

In [None]:
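# Dump the classifier, dataset and features_list so anyone can run/check the results.
# Binary mode ("wb") is what pickle expects; the with-blocks close each file.
with open("my_classifier.pkl", "wb") as clf_file:
    pickle.dump(clf, clf_file)
with open("my_dataset.pkl", "wb") as dataset_file:
    pickle.dump(data_dict, dataset_file)
with open("my_feature_list.pkl", "wb") as features_file:
    pickle.dump(features_list, features_file)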
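With only 146 rows and few POIs, a single train/test split gives a noisy estimate. One common alternative, sketched here as an assumption rather than as the project's chosen method, is to repeat a stratified shuffle split many times and average precision and recall. The data is synthetic: `X` and `y` are random stand-ins that only mimic the 128/18 class balance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Enron features: 128 negatives, 18 shifted positives.
rng = np.random.RandomState(42)
n_non_poi, n_poi = 128, 18
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_non_poi, 3)),
    rng.normal(loc=1.5, scale=1.0, size=(n_poi, 3)),
])
y = np.array([0] * n_non_poi + [1] * n_poi)

# Each split keeps the POI share roughly equal in the train and test parts.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

precisions, recalls = [], []
for train_idx, test_idx in splitter.split(X, y):
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    predicted = model.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], predicted))
    recalls.append(recall_score(y[test_idx], predicted))

mean_precision = float(np.mean(precisions))
mean_recall = float(np.mean(recalls))
print("precision %.2f, recall %.2f over %d splits"
      % (mean_precision, mean_recall, len(precisions)))
```

Stratification matters here because an unstratified split of so few POIs could leave a test set with almost none, making recall meaningless for that split.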

In [None]:
#!jupyter nbconvert --to script FinalProject_TheEnronFraud.ipynb