# Identifying Fraud in Enron Email

## Table of contents

## Code Summary

## Exploratory Analysis

In [67]:
%matplotlib inline
import pandas as pd
import ggplot as gg
import numpy as np
import matplotlib.pyplot as plt

from sklearn.svm import LinearSVC
from sklearn.preprocessing import scale,StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA

from scoring import calc_score
from preprocess_data import pkl_to_df,extract_df,linearsvc_outlier_rm,FeatureSel

In [68]:
# Load the dataset and convert it to a pandas dataframe
df=pkl_to_df()
df=df.convert_objects(convert_numeric=True)

The rows and columns of data:

In [69]:
df.shape

(146, 21)

This data set has 146 entries and 21 features, including ```poi```.

Next, we use the ```describe()``` function to summarize the ```poi``` feature:

In [70]:
df["poi"].describe()

count          146
mean     0.1232877
std      0.3298989
min          False
25%              0
50%              0
75%              0
max           True
Name: poi, dtype: object

The mean indicates that only about 12.3% of the ```poi``` values are ```True```. ```count``` is the number of non-NaN values; since the ```count``` of ```poi``` equals the total number of rows, every row has a ```poi``` value.
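As a quick cross-check of the class balance, here is a minimal sketch on a small made-up frame (the name `toy` and its values are illustrative assumptions, not the Enron data). It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy labels (not the Enron data): 2 of 8 people are POIs.
toy = pd.DataFrame(
    {"poi": [True, False, False, False, True, False, False, False]},
    index=["person_%d" % i for i in range(8)],
)

n_poi = int(toy["poi"].sum())        # True counts as 1, False as 0
poi_ratio = float(toy["poi"].mean())  # fraction of rows labelled True

print("POIs: %d of %d (%.1f%%)" % (n_poi, len(toy), 100 * poi_ratio))
```

Applied to the real dataframe, the same two expressions would give the number of POIs and the ratio reported by ```describe()```.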

We can then use ```transpose()``` and ```count``` to compute the fraction of non-NaN values in each feature:

In [71]:
df.describe().transpose()["count"]/df.shape[0]

bonus                         0.5616438
deferral_payments             0.2671233
deferred_income               0.3356164
director_fees                 0.1164384
exercised_stock_options       0.6986301
expenses                      0.6506849
from_messages                 0.5890411
from_poi_to_this_person       0.5890411
from_this_person_to_poi       0.5890411
loan_advances                0.02739726
long_term_incentive           0.4520548
other                         0.6369863
poi                                   1
restricted_stock              0.7534247
restricted_stock_deferred     0.1232877
salary                        0.6506849
shared_receipt_with_poi       0.5890411
to_messages                   0.5890411
total_payments                0.8561644
total_stock_value             0.8630137
Name: count, dtype: object

This result shows that every feature listed above has NaN values except ```poi```. The feature ```loan_advances``` has the lowest fraction of non-NaN data, only about 2.7%.
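An alternative way to get the same kind of coverage table is sketched below on made-up data (the frame `toy` and its column values are assumptions for illustration). It uses `notna().mean()` instead of `describe()`, which also covers non-numeric columns that `describe()` omits by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values (not the Enron data).
toy = pd.DataFrame({
    "salary": [100.0, np.nan, 300.0, np.nan],
    "bonus": [np.nan, np.nan, np.nan, 50.0],
    "poi": [False, True, False, False],
})

# notna() marks each present cell as True; the column mean is then
# the fraction of non-NaN values. Sort so the sparsest feature is first.
coverage = toy.notna().mean().sort_values()

print(coverage)
```

Sorting the result puts the sparsest features at the top, which makes candidates such as ```loan_advances``` easy to spot.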

## Removing outliers manually

In this section, we remove data points that are not appropriate for further analysis.
As discussed in the lessons, the dataset includes an entry, ```TOTAL```, that is the sum of the financial features over all people. We therefore remove this entry before proceeding:

In [72]:
# remove "TOTAL"
df.drop("TOTAL",axis=0,inplace=True)
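If the aggregate row were not known in advance, one way to spot it might be to check which row holds the maximum of each financial column. The sketch below illustrates this on a made-up table (the frame `toy`, its labels, and its values are assumptions); it is a heuristic, not the method used in this analysis.

```python
import pandas as pd

# Hypothetical toy payroll table (not the Enron data) that includes
# a spreadsheet-style total row.
toy = pd.DataFrame(
    {"salary": [100.0, 200.0, 300.0, 600.0],
     "bonus": [10.0, 20.0, 30.0, 60.0]},
    index=["A", "B", "C", "TOTAL"],
)

# For each column, find the row label holding the largest value.
top_rows = toy.idxmax()

# A single row that dominates every column is a likely aggregate.
suspects = list(top_rows.unique())
print(suspects)

cleaned = toy.drop(suspects, axis=0)
```

A row flagged this way still needs a manual look, since a genuine person could also hold the maximum in several columns.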

In [73]:
# Count the non-NaN values in each row and list the rows with fewer than 3
tests=df.transpose().count()
tests[tests<3]

LOCKHART EUGENE E    2
dtype: int64

Counting the non-NaN values in each row shows that some entries are highly incomplete. For example, all the features of *LOCKHART EUGENE E* are NaN except ```poi```:

In [74]:
df.loc["LOCKHART EUGENE E",:]

bonus                          NaN
deferral_payments              NaN
deferred_income                NaN
director_fees                  NaN
email_address                  NaN
exercised_stock_options        NaN
expenses                       NaN
from_messages                  NaN
from_poi_to_this_person        NaN
from_this_person_to_poi        NaN
loan_advances                  NaN
long_term_incentive            NaN
other                          NaN
poi                          False
restricted_stock               NaN
restricted_stock_deferred      NaN
salary                         NaN
shared_receipt_with_poi        NaN
to_messages                    NaN
total_payments                 NaN
total_stock_value              NaN
Name: LOCKHART EUGENE E, dtype: object

We therefore exclude this entry from further analysis.

In [75]:
df.drop("LOCKHART EUGENE E",axis=0,inplace=True)
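The row-completeness check and removal can also be written as one reusable step. The following is a minimal sketch on made-up data; the frame `toy` and the threshold `min_values` are assumptions chosen for illustration, not values from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Enron data); row "B" has only its label.
toy = pd.DataFrame(
    {
        "salary": [100.0, np.nan, 300.0],
        "bonus": [10.0, np.nan, np.nan],
        "poi": [False, False, True],
    },
    index=["A", "B", "C"],
)

min_values = 2  # assumed threshold: keep rows with at least this many non-NaN values

# count(axis=1) gives the number of non-NaN cells in each row.
filled = toy.count(axis=1)
sparse_rows = list(filled[filled < min_values].index)

cleaned = toy.drop(sparse_rows, axis=0)
print(sparse_rows)
```

`toy.dropna(thresh=min_values)` would keep the same rows in a single call; the explicit version above has the advantage of listing which rows are dropped.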