# Identifying Fraud from Enron Data
### - Marty VanHoof

In [1]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

<a id='exploration'></a>
## Data Exploration / Outlier Removal

We begin with some exploratory data analysis (EDA) to get a better understanding of the dataset and to remove a few outliers. The Enron email and financial data have been preprocessed and combined into a Python dictionary in which each key-value pair corresponds to one person. The key is the person's name, and the value is another dictionary that maps each feature name to its value for that person.
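To make the nested structure concrete, here is a minimal sketch using a hypothetical two-person dictionary. The names and values are invented for illustration, and only three of the real feature names are used.

```python
# Hypothetical miniature of the dataset: names and values are made up,
# but the layout (name -> {feature: value}) follows the description above.
toy_data = {
    'DOE JANE': {'salary': 250000, 'bonus': 'NaN', 'poi': False},
    'ROE RICHARD': {'salary': 'NaN', 'bonus': 1000000, 'poi': True},
}

# Look up one feature for one person
jane_salary = toy_data['DOE JANE']['salary']

# Collect the names of everyone whose 'poi' flag is True
poi_names = [name for name, features in toy_data.items() if features['poi']]

print(jane_salary)
print(poi_names)
```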

To make the EDA easier, we first use the pandas library to convert the dictionary into a dataframe.

In [39]:
# load the dictionary containing the dataset
with open('enron_dataset.pkl', 'rb') as data_file:
    data_dict = pickle.load(data_file)
    
# create a dataframe from data_dict and set the index column to employees
df = pd.DataFrame.from_dict(data_dict, orient='index')

# coerce every column to numeric: 'NaN' strings and email addresses become NaN,
# then replace all NaN values with 0
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df = df.fillna(0)

The first five rows of the dataset are shown below. Each row corresponds to one Enron employee, and each column is a feature (email or financial information for that employee). Many values were missing from the original financial data; they are encoded in the Python dictionary as the string 'NaN' and become 0 in the dataframe. The `email_address` column holds text, so the numeric coercion turns it into 0 as well.
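The effect of the coercion step can be seen on a small example. The sketch below uses hypothetical records (invented names, values, and email address) and applies the same two calls as the cell above.

```python
import pandas as pd

# Hypothetical records; 'NaN' is a string here, as in the original dictionary
toy_data = {
    'DOE JANE': {'salary': 250000, 'bonus': 'NaN',
                 'email_address': 'jane.doe@example.com', 'poi': False},
    'ROE RICHARD': {'salary': 'NaN', 'bonus': 1000000,
                    'email_address': 'NaN', 'poi': True},
}

toy_df = pd.DataFrame.from_dict(toy_data, orient='index')

# With errors='coerce', the 'NaN' strings and the email address text
# both end up as NaN instead of raising an error
toy_df = toy_df.apply(lambda col: pd.to_numeric(col, errors='coerce'))

# Replace every NaN with 0
toy_df = toy_df.fillna(0)

print(toy_df)
```

After this step a missing salary and a salary of exactly zero are indistinguishable, which is worth keeping in mind when zeros are later counted as missing values.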

In [40]:
df.head()

| | salary | to_messages | deferral_payments | total_payments | loan_advances | bonus | email_address | restricted_stock_deferred | deferred_income | total_stock_value | ... | from_poi_to_this_person | exercised_stock_options | from_messages | other | from_this_person_to_poi | poi | long_term_incentive | shared_receipt_with_poi | restricted_stock | director_fees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 201955.0 | 2902.0 | 2869717.0 | 4484442.0 | 0.0 | 4175000.0 | 0.0 | -126027.0 | -3081055.0 | 1729541.0 | ... | 47.0 | 1729541.0 | 2195.0 | 152.0 | 65.0 | False | 304805.0 | 1407.0 | 126027.0 | 0.0 |
| BADUM JAMES P | 0.0 | 0.0 | 178980.0 | 182466.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 257817.0 | ... | 0.0 | 257817.0 | 0.0 | 0.0 | 0.0 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| BANNANTINE JAMES M | 477.0 | 566.0 | 0.0 | 916197.0 | 0.0 | 0.0 | 0.0 | -560222.0 | -5104.0 | 5243487.0 | ... | 39.0 | 4046157.0 | 29.0 | 864523.0 | 0.0 | False | 0.0 | 465.0 | 1757552.0 | 0.0 |
| BAXTER JOHN C | 267102.0 | 0.0 | 1295738.0 | 5634343.0 | 0.0 | 1200000.0 | 0.0 | 0.0 | -1386055.0 | 10623258.0 | ... | 0.0 | 6680544.0 | 0.0 | 2660303.0 | 0.0 | False | 1586055.0 | 0.0 | 3942714.0 | 0.0 |
| BAY FRANKLIN R | 239671.0 | 0.0 | 260455.0 | 827696.0 | 0.0 | 400000.0 | 0.0 | -82782.0 | -201641.0 | 63014.0 | ... | 0.0 | 0.0 | 0.0 | 69.0 | 0.0 | False | 0.0 | 0.0 | 145796.0 | 0.0 |


### Some Dataset Characteristics

In [55]:
print('number of data points: ', df.shape[0])
print('number of features: ', df.shape[1])
print('number of POIs: ', df.query('poi == True').shape[0])
print('number of non POIs: ', df.query('poi == False').shape[0])

number of data points:  146
number of features:  21
number of POIs:  18
number of non POIs:  128
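The counts above imply that POIs make up roughly 12% of the data points. As a small worked check, the sketch below rebuilds only the label column from the reported counts (it does not load the real dataset) and computes the share of each class.

```python
import pandas as pd

# Rebuild just the label column from the counts reported above:
# 18 POIs and 128 non-POIs
poi = pd.Series([True] * 18 + [False] * 128, name='poi')

# Share of each class; normalize=True divides the counts by the total
class_share = poi.value_counts(normalize=True)
print(class_share)

# The mean of a boolean Series is the fraction of True values
poi_fraction = poi.mean()
print(f'POI fraction: {poi_fraction:.3f}')  # 18 / 146, about 0.123
```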


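Below is the number of zeros in each feature, which we use as a stand-in for the number of missing values, since missing values were converted to 0. We exclude `email_address` because it carries no useful information and was converted entirely to zeros. Two caveats apply: a genuine zero is counted as missing, and the 128 reported for `poi` is the number of non-POIs (`False` counts as zero), not a count of missing labels.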

In [109]:
# count the zeros (our stand-in for missing values) in each feature,
# then show the counts for every feature except 'email_address'
zero_counts = (df == 0).sum(axis=0)
zero_counts.drop('email_address')

salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
loan_advances                142
bonus                         64
restricted_stock_deferred    128
deferred_income               97
total_stock_value             20
expenses                      51
from_poi_to_this_person       72
exercised_stock_options       44
from_messages                 60
other                         53
from_this_person_to_poi       80
poi                          128
long_term_incentive           80
shared_receipt_with_poi       60
restricted_stock              36
director_fees                129
dtype: int64
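Because zeros are used as a proxy for missing values, the counts above cannot separate a true zero from a missing entry. One alternative, sketched below on a hypothetical three-row frame (invented names and values), is to count NaN values before they are filled. This is an illustration of the difference, not the approach taken in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'bonus' has one real zero and one missing entry
toy_df = pd.DataFrame(
    {'salary': [250000.0, np.nan, 90000.0],
     'bonus': [0.0, np.nan, 50000.0],
     'poi': [False, True, False]},
    index=['DOE JANE', 'ROE RICHARD', 'POE EDGAR'],
)

# Count missing entries before they are replaced with zeros
nan_counts = toy_df.isna().sum()

# Count zeros after filling; real zeros and False values are included
zero_counts = (toy_df.fillna(0) == 0).sum()

print(nan_counts)
print(zero_counts)
```

In this toy frame `bonus` has one missing entry but two zeros after filling, and `poi` has no missing entries but two zeros, because `False` compares equal to 0.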

### Outliers