Enron Fraud Identification
===

Python
---

---

**Author :** Matheus Willian Machado  
**Date :** Jul 30, 2018

---

Project Overview
---

>Play detective and put your machine learning skills to use by building an algorithm to identify Enron employees who may have committed fraud. Your starting point is a public dataset of Enron financial and email data.
> 
> (Udacity).

---

## Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy because of a fraud that involved a large part of the corporation. As a result of the federal investigation that followed, a large amount of normally confidential data became public, including tens of thousands of emails and financial details for the company's top executives. In this project, you will play detective and put your skills to work building a predictive model that aims to determine whether or not an employee is a person of interest (POI). A person of interest is an employee who took part in the Enron scandal. To help with this detective work, the financial and email data were combined for the employees investigated in this fraud case, meaning those who were indicted, reached settlements with the government, or testified in exchange for immunity from prosecution.

---

## Libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pickle
import pandas as pd
import numpy as np

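# Note: Imputer was removed in scikit-learn 0.22 (superseded by sklearn.impute.SimpleImputer).
from sklearn.preprocessing import Imputer, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; this notebook targets an older release.
from sklearn.cross_validation import StratifiedShuffleSplit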
from sklearn.pipeline import Pipeline

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, test_classifier

---

## Data Exploration

In [4]:
with open('final_project_dataset.pkl', 'rb') as f:
    dic = pickle.load(f)

In [5]:
data = pd.DataFrame.from_dict(dic, orient='index')
data.head()


| | salary | to_messages | deferral_payments | total_payments | loan_advances | bonus | email_address | restricted_stock_deferred | deferred_income | total_stock_value | ... | from_poi_to_this_person | exercised_stock_options | from_messages | other | from_this_person_to_poi | poi | long_term_incentive | shared_receipt_with_poi | restricted_stock | director_fees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 201955.0 | 2902.0 | 2869717.0 | 4484442 | NaN | 4175000.0 | phillip.allen@enron.com | -126027.0 | -3081055.0 | 1729541 | ... | 47.0 | 1729541.0 | 2195.0 | 152.0 | 65.0 | False | 304805.0 | 1407.0 | 126027.0 | NaN |
| BADUM JAMES P | NaN | NaN | 178980.0 | 182466 | NaN | NaN | NaN | NaN | NaN | 257817 | ... | NaN | 257817.0 | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN |
| BANNANTINE JAMES M | 477.0 | 566.0 | NaN | 916197 | NaN | NaN | james.bannantine@enron.com | -560222.0 | -5104.0 | 5243487 | ... | 39.0 | 4046157.0 | 29.0 | 864523.0 | 0.0 | False | NaN | 465.0 | 1757552.0 | NaN |
| BAXTER JOHN C | 267102.0 | NaN | 1295738.0 | 5634343 | NaN | 1200000.0 | NaN | NaN | -1386055.0 | 10623258 | ... | NaN | 6680544.0 | NaN | 2660303.0 | NaN | False | 1586055.0 | NaN | 3942714.0 | NaN |
| BAY FRANKLIN R | 239671.0 | NaN | 260455.0 | 827696 | NaN | 400000.0 | frank.bay@enron.com | -82782.0 | -201641.0 | 63014 | ... | NaN | NaN | NaN | 69.0 | NaN | False | NaN | NaN | 145796.0 | NaN |


In [6]:
data.replace('NaN', np.nan, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary                       95 non-null float64
to_messages                  86 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
loan_advances                4 non-null float64
bonus                        82 non-null float64
email_address                111 non-null object
restricted_stock_deferred    18 non-null float64
deferred_income              49 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
from_poi_to_this_person      86 non-null float64
exercised_stock_options      102 non-null float64
from_messages                86 non-null float64
other                        93 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          146 non-null bool
long_term_incentive          66 non-null float64

In [7]:
label = 'poi'
data[label].value_counts()

False    128
True      18
Name: poi, dtype: int64
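The counts above show a strongly imbalanced label. As a small worked example (plain arithmetic on the two counts printed above, nothing else assumed), the snippet below computes the POI share and the accuracy that a hypothetical classifier would reach by always predicting "not a POI". This is why accuracy alone is a weak yardstick for this dataset.

```python
# Class counts copied from the value_counts() output above.
n_non_poi, n_poi = 128, 18
total = n_non_poi + n_poi

# Fraction of records labelled as POI.
poi_share = n_poi / total

# Accuracy of a hypothetical model that always predicts "not a POI".
majority_baseline_accuracy = n_non_poi / total

print(f"POI share: {poi_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```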

In [8]:
data['email_address'].nunique()


111

In [9]:
del data['email_address']

In [10]:
data.isnull().sum()

salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
loan_advances                142
bonus                         64
restricted_stock_deferred    128
deferred_income               97
total_stock_value             20
expenses                      51
from_poi_to_this_person       60
exercised_stock_options       44
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
long_term_incentive           80
shared_receipt_with_poi       60
restricted_stock              36
director_fees                129
dtype: int64

In [11]:
s = data[data[label] == 1].isnull().sum()
s

salary                        1
to_messages                   4
deferral_payments            13
total_payments                0
loan_advances                17
bonus                         2
restricted_stock_deferred    18
deferred_income               7
total_stock_value             0
expenses                      0
from_poi_to_this_person       4
exercised_stock_options       6
from_messages                 4
other                         0
from_this_person_to_poi       4
poi                           0
long_term_incentive           6
shared_receipt_with_poi       4
restricted_stock              1
director_fees                18
dtype: int64

In [12]:
limit = data[label].sum()/3  # one third of the POI count (18/3 = 6)
few_poi_values = s[s > limit].index.tolist()
few_poi_values

['deferral_payments',
 'loan_advances',
 'restricted_stock_deferred',
 'deferred_income',
 'director_fees']
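The filter above keeps only columns that are reasonably well populated among POIs. A minimal sketch of the same idea on toy data follows; the helper name `sparse_columns_for_class`, its parameters, and the toy frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def sparse_columns_for_class(df, label_col, positive=True, max_missing_fraction=1 / 3):
    """Return columns whose missing count among positive rows exceeds the limit."""
    positives = df[df[label_col] == positive]
    missing = positives.drop(columns=[label_col]).isnull().sum()
    limit = len(positives) * max_missing_fraction
    return missing[missing > limit].index.tolist()


# Toy data: 'bonus' is sparse overall but complete for the positive class,
# while 'loan_advances' is missing for two of the three positive rows.
toy = pd.DataFrame({
    "poi": [True, True, True, False, False, False],
    "salary": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
    "loan_advances": [np.nan, np.nan, 7.0, 1.0, 2.0, 3.0],
    "bonus": [3.0, 4.0, 5.0, np.nan, np.nan, 1.0],
})

sparse = sparse_columns_for_class(toy, "poi")
print(sparse)
```

Only the column that is mostly empty for the positive class is flagged, which mirrors how `few_poi_values` is built from `s` and `limit`.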

In [13]:
payments = ['salary',
            'deferral_payments',
            'loan_advances',
            'bonus',
            'deferred_income',
            'expenses',
            'long_term_incentive',
            'other',
            'director_fees',
            'total_payments']

In [14]:
data[payments] = data[payments].fillna(0)

In [15]:
data[data[payments[:-1]].sum(axis=1) != data.total_payments][payments]

| | salary | deferral_payments | loan_advances | bonus | deferred_income | expenses | long_term_incentive | other | director_fees | total_payments |
|---|---|---|---|---|---|---|---|---|---|---|
| BELFER ROBERT | 0.0 | -102500.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3285.0 | 102500.0 |
| BHATNAGAR SANJAY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 137864.0 | 137864.0 | 15456290.0 |


In [16]:
correct = ['deferred_income','deferral_payments', 'expenses', 'director_fees', 'total_payments']
data.loc['BELFER ROBERT',correct] = np.array([-102500, 0, 3285, 102500, 3285])


In [17]:
correct = ['other', 'expenses', 'director_fees', 'total_payments']
data.loc['BHATNAGAR SANJAY',correct] = np.array([0, 137864, 0, 137864])

In [18]:
data[data[payments[:-1]].sum(axis=1) != data.total_payments][payments]

| | salary | deferral_payments | loan_advances | bonus | deferred_income | expenses | long_term_incentive | other | director_fees | total_payments |
|---|---|---|---|---|---|---|---|---|---|---|
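The check used in this section compares the sum of the component columns with the reported total and lists the rows that disagree. The sketch below repeats that check on toy data; the helper `inconsistent_rows` and the toy values are hypothetical, and `np.isclose` is used instead of `!=` as a precaution against floating-point rounding.

```python
import numpy as np
import pandas as pd


def inconsistent_rows(df, components, total):
    """Return rows where the component columns do not add up to the total column."""
    component_sum = df[components].fillna(0).sum(axis=1)
    mismatch = ~np.isclose(component_sum, df[total].fillna(0))
    return df.loc[mismatch, components + [total]]


# Toy records: row "C" reports a total that its components do not support.
toy = pd.DataFrame(
    {
        "salary": [100.0, 200.0, np.nan],
        "bonus": [50.0, np.nan, 30.0],
        "total_payments": [150.0, 200.0, 300.0],
    },
    index=["A", "B", "C"],
)

bad = inconsistent_rows(toy, ["salary", "bonus"], "total_payments")
print(bad.index.tolist())
```

An empty result after a correction is the signal that every remaining record is internally consistent.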


In [19]:
stock = ['restricted_stock_deferred',
         'restricted_stock',
         'exercised_stock_options',
         'total_stock_value']

In [20]:
data[stock] = data[stock].fillna(0)

In [21]:
data[data[stock[:-1]].sum(axis=1) != data.total_stock_value][stock]

| | restricted_stock_deferred | restricted_stock | exercised_stock_options | total_stock_value |
|---|---|---|---|---|
| BELFER ROBERT | 44093.0 | 0.0 | 3285.0 | -44093.0 |
| BHATNAGAR SANJAY | 15456290.0 | -2604490.0 | 2604490.0 | 0.0 |


In [22]:
correct = ['restricted_stock_deferred','restricted_stock', 'exercised_stock_options', 'total_stock_value']
data.loc['BELFER ROBERT',correct] = np.array([-44093, 44093, 0, 0])

In [23]:
correct = ['restricted_stock_deferred','restricted_stock', 'exercised_stock_options', 'total_stock_value']
data.loc['BHATNAGAR SANJAY',correct] = np.array([-2604490, 2604490, 15456290, 15456290])

In [24]:
data[data[stock[:-1]].sum(axis=1) != data.total_stock_value][stock]

| | restricted_stock_deferred | restricted_stock | exercised_stock_options | total_stock_value |
|---|---|---|---|---|


In [25]:
email = ['to_messages',
         'from_poi_to_this_person',
         'from_messages',
         'from_this_person_to_poi',
         'shared_receipt_with_poi']

In [26]:
data[email].info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 5 columns):
to_messages                86 non-null float64
from_poi_to_this_person    86 non-null float64
from_messages              86 non-null float64
from_this_person_to_poi    86 non-null float64
shared_receipt_with_poi    86 non-null float64
dtypes: float64(5)
memory usage: 11.8+ KB


In [27]:
data[email+[label]].groupby(label).mean()

| poi | to_messages | from_poi_to_this_person | from_messages | from_this_person_to_poi | shared_receipt_with_poi |
|---|---|---|---|---|---|
| False | 2007.111111 | 58.5 | 668.763889 | 36.277778 | 1058.527778 |
| True | 2417.142857 | 97.785714 | 300.357143 | 66.714286 | 1783.0 |


In [28]:
# Mean-impute the missing email counts, separately for POIs and for non-POIs.
imp = Imputer(np.nan)
data.loc[data[label] == 1, email] = imp.fit_transform(data[email][data[label]==1])
data.loc[data[label] == 0, email] = imp.fit_transform(data[email][data[label]==0])
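The cell above fills missing email counts with the mean of the record's own class. A minimal sketch of that class-conditional fill, next to a label-free overall mean, is shown below using plain pandas on hypothetical toy values. One caveat worth keeping in mind: the class-conditional variant needs the label, so it cannot be applied to unlabelled records and it may carry label information into the features.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing count in each class.
toy = pd.DataFrame({
    "poi": [True, True, False, False, False],
    "to_messages": [100.0, np.nan, 10.0, 30.0, np.nan],
})

# Option 1: class-conditional mean, mirroring the cell above (uses the label).
by_class = toy.groupby("poi")["to_messages"].transform(lambda s: s.fillna(s.mean()))

# Option 2: overall mean, which does not look at the label.
overall = toy["to_messages"].fillna(toy["to_messages"].mean())

print(pd.DataFrame({"by_class": by_class, "overall": overall}))
```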

In [29]:
payments = list(set(payments)-set(few_poi_values))
stock    = list(set(stock)-set(few_poi_values))
email    = list(set(email)-set(few_poi_values))

In [30]:
data = data[payments+stock+email+[label]]
data.shape

(146, 15)

In [31]:
data.head()

| | expenses | total_payments | bonus | long_term_incentive | salary | other | restricted_stock | exercised_stock_options | total_stock_value | to_messages | from_poi_to_this_person | shared_receipt_with_poi | from_this_person_to_poi | from_messages | poi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 13868.0 | 4484442.0 | 4175000.0 | 304805.0 | 201955.0 | 152.0 | 126027.0 | 1729541.0 | 1729541.0 | 2902.0 | 47.0 | 1407.0 | 65.0 | 2195.0 | False |
| BADUM JAMES P | 3486.0 | 182466.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 257817.0 | 257817.0 | 2007.111111 | 58.5 | 1058.527778 | 36.277778 | 668.763889 | False |
| BANNANTINE JAMES M | 56301.0 | 916197.0 | 0.0 | 0.0 | 477.0 | 864523.0 | 1757552.0 | 4046157.0 | 5243487.0 | 566.0 | 39.0 | 465.0 | 0.0 | 29.0 | False |
| BAXTER JOHN C | 11200.0 | 5634343.0 | 1200000.0 | 1586055.0 | 267102.0 | 2660303.0 | 3942714.0 | 6680544.0 | 10623258.0 | 2007.111111 | 58.5 | 1058.527778 | 36.277778 | 668.763889 | False |
| BAY FRANKLIN R | 129142.0 | 827696.0 | 400000.0 | 0.0 | 239671.0 | 69.0 | 145796.0 | 0.0 | 63014.0 | 2007.111111 | 58.5 | 1058.527778 | 36.277778 | 668.763889 | False |


---

## Outliers Investigation

In [32]:
data[data.drop(label, axis=1).isnull().all(1)]

| | expenses | total_payments | bonus | long_term_incentive | salary | other | restricted_stock | exercised_stock_options | total_stock_value | to_messages | from_poi_to_this_person | shared_receipt_with_poi | from_this_person_to_poi | from_messages | poi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [33]:
data.drop('LOCKHART EUGENE E', inplace=True)


In [34]:
data.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK'], inplace=True)

In [35]:
data.shape

(143, 15)

---

## Feature Engineering

In [3]:
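def KBestTable(sel, df, features):
    # scores_ holds one score per input feature, so index it by all feature
    # names first and only then keep the features the selector retained.
    scores = pd.Series(sel.scores_, index=df[features].columns)
    return scores[sel.get_support()].sort_values(ascending=False)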

In [36]:
features = payments+stock+email
sel = SelectKBest(f_classif, k = 'all').fit(data[features], data[label])
KBestTable(sel, data, features)

total_stock_value          22.510549
exercised_stock_options    22.348975
bonus                      20.792252
salary                     18.289684
shared_receipt_with_poi    10.409148
long_term_incentive         9.922186
total_payments              9.283874
restricted_stock            8.825442
from_poi_to_this_person     5.478692
expenses                    5.418900
other                       4.202436
from_this_person_to_poi     2.445551
from_messages               1.050952
to_messages                 0.660154
dtype: float64
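The ranking above comes from the ANOVA F-statistic (`f_classif`) computed per feature against the label. The sketch below reproduces the mechanics on synthetic data, assuming a scikit-learn installation that provides `SelectKBest` and `f_classif`; the column names `informative` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)

# Imbalanced toy label: 40 negatives, 10 positives.
y = np.array([0] * 40 + [1] * 10)

# One feature whose mean shifts with the label, and one pure-noise feature.
X = pd.DataFrame({
    "informative": rng.normal(size=50) + 3.0 * y,
    "noise": rng.normal(size=50),
})

# k='all' keeps every feature, so the selector is used only for its scores.
selector = SelectKBest(f_classif, k="all").fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```

A higher F-score means the class means differ more relative to the within-class spread, which is how the score table for this dataset can be read.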

In [37]:
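# Share of this person's sent messages that went to a POI.
data['ratio_from_poi'] = data.from_this_person_to_poi/data.from_messages
# Share of this person's received messages that came from a POI.
data['ratio_to_poi']   = data.from_poi_to_this_person/data.to_messages
# Shared receipts with a POI relative to this person's received messages.
data['ratio_with_poi'] = data.shared_receipt_with_poi/data.to_messages
new = ['ratio_with_poi', 'ratio_to_poi', 'ratio_from_poi']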

In [38]:
features = email+new
sel = SelectKBest(f_classif, k = 'all').fit(data[features], data[label])
KBestTable(sel, data, features)

ratio_from_poi             25.878195
ratio_with_poi             15.693633
shared_receipt_with_poi    10.409148
from_poi_to_this_person     5.478692
ratio_to_poi                2.592766
from_this_person_to_poi     2.445551
from_messages               1.050952
to_messages                 0.660154
dtype: float64
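The engineered features are ratios of POI-related message counts to total message counts. A minimal sketch of such a ratio with a guard against zero denominators follows. The helper `safe_ratio`, the column name `share_sent_to_poi`, and the choice to map undefined ratios to 0.0 are assumptions made for illustration; the notebook itself divides directly, which works there because the imputed counts are nonzero.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields 0.0 where the denominator is zero or missing."""
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(0.0)


# Toy counts: the last row has no sent messages at all.
toy = pd.DataFrame({
    "from_this_person_to_poi": [65.0, 0.0, 5.0],
    "from_messages": [2195.0, 29.0, 0.0],
})

toy["share_sent_to_poi"] = safe_ratio(toy["from_this_person_to_poi"], toy["from_messages"])
print(toy)
```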

In [39]:
email = new

---

## Feature Selection

In [40]:
features = payments+stock+email
sel = SelectKBest(f_classif, k = 'all').fit(data[features], data[label])
KBestTable(sel, data, features)

ratio_from_poi             25.878195
total_stock_value          22.510549
exercised_stock_options    22.348975
bonus                      20.792252
salary                     18.289684
ratio_with_poi             15.693633
long_term_incentive         9.922186
total_payments              9.283874
restricted_stock            8.825442
expenses                    5.418900
other                       4.202436
ratio_to_poi                2.592766
dtype: float64

In [41]:
features_list = [label]+features
features_list

['poi',
 'expenses',
 'total_payments',
 'bonus',
 'long_term_incentive',
 'salary',
 'other',
 'restricted_stock',
 'exercised_stock_options',
 'total_stock_value',
 'ratio_with_poi',
 'ratio_to_poi',
 'ratio_from_poi']

---

## Feature Scaling

In [42]:
data[features] = MinMaxScaler().fit_transform(data[features])
data[features].head()

| | expenses | total_payments | bonus | long_term_incentive | salary | other | restricted_stock | exercised_stock_options | total_stock_value | ratio_with_poi | ratio_to_poi | ratio_from_poi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 0.060622 | 0.043303 | 0.521875 | 0.059238 | 0.181735 | 1.5e-05 | 0.008537 | 0.050353 | 0.035218 | 0.47464 | 0.074518 | 0.029613 |
| BADUM JAMES P | 0.015238 | 0.001762 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.007506 | 0.00525 | 0.517937 | 0.134104 | 0.054246 |
| BANNANTINE JAMES M | 0.246111 | 0.008847 | 0.0 | 0.0 | 0.000429 | 0.08345 | 0.119062 | 0.117798 | 0.10677 | 0.81726 | 0.317034 | 0.0 |
| BAXTER JOHN C | 0.048959 | 0.054407 | 0.15 | 0.308245 | 0.24036 | 0.256793 | 0.267091 | 0.194494 | 0.216315 | 0.517937 | 0.134104 | 0.054246 |
| BAY FRANKLIN R | 0.564523 | 0.007992 | 0.05 | 0.0 | 0.215675 | 7e-06 | 0.009877 | 0.0 | 0.001283 | 0.517937 | 0.134104 | 0.054246 |
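`MinMaxScaler` with default settings maps each column to the range [0, 1] using (x - min) / (max - min). The short check below, on a hypothetical three-value column, confirms that the manual formula and the scaler agree.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy column (hypothetical values), shaped as (n_samples, 1).
column = np.array([[477.0], [201955.0], [267102.0]])

# Manual min-max scaling, column by column.
col_min = column.min(axis=0)
col_max = column.max(axis=0)
manual = (column - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(column)

print(np.hstack([manual, scaled]))
```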


---

## Algorithms

In [43]:
my_dataset = data.to_dict(orient='index')
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [44]:
from sklearn.naive_bayes  import GaussianNB
from sklearn.svm          import SVC
from sklearn.tree         import DecisionTreeClassifier
from sklearn.ensemble     import RandomForestClassifier

---

## Parameter Tuning

In [45]:
cv = StratifiedShuffleSplit(labels, 1000,random_state = 42)
skb = {'SKB__k': ['all'] + list(range(6,len(features[0]),2))}
pca = {'PCA__n_components': [None] + list(range(2,4))}
scoring = 'f1'

params = {}
params.update(pca)
params.update(skb)
params

{'PCA__n_components': [None, 2, 3], 'SKB__k': ['all', 6, 8, 10]}
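The cell above builds the splitter with the legacy `sklearn.cross_validation` signature, where the labels and the number of iterations are passed to the constructor. In current scikit-learn releases the splitter lives in `sklearn.model_selection`, takes `n_splits`, and receives the labels only when splitting. The sketch below shows one way the same search could look with that API. The synthetic data from `make_classification`, the reduced `n_splits=20`, and the explicit `test_size=0.1` are assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 12 scaled features and the imbalanced label.
X, y = make_classification(n_samples=140, n_features=12, n_informative=4,
                           weights=[0.85, 0.15], random_state=42)

# Stratified random splits; labels are supplied later, by GridSearchCV.fit.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)

# Same step names as in the notebook, so the grid keys stay identical.
pipe = Pipeline(steps=[("SKB", SelectKBest(f_classif)),
                       ("PCA", PCA()),
                       ("clf", GaussianNB())])
param_grid = {"SKB__k": ["all", 6, 8, 10],
              "PCA__n_components": [None, 2, 3]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=cv, scoring="f1").fit(X, y)
print(search.best_params_)
```

Because selection and PCA sit inside the pipeline, they are refitted on the training part of every split rather than on the full data.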

In [46]:
nb = dict(params)  # GaussianNB adds no parameters of its own, so only the shared grid is searched
pipe = Pipeline(steps=[('SKB', SelectKBest()),('PCA', PCA()), ('clf', GaussianNB())])
clf = GridSearchCV(pipe, param_grid = nb, cv=cv, scoring=scoring).fit(features, labels)
clf.best_params_

{'PCA__n_components': None, 'SKB__k': 'all'}

In [47]:
test_classifier(clf.best_estimator_,my_dataset,features_list)

Pipeline(memory=None,
     steps=[('SKB', SelectKBest(k='all', score_func=<function f_classif at 0x7f3c8b7e81e0>)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', GaussianNB(priors=None))])
	Accuracy: 0.83460	Precision: 0.36343	Recall: 0.32000	F1: 0.34034	F2: 0.32784
	Total predictions: 15000	True positives:  640	False positives: 1121	False negatives: 1360	True negatives: 11879
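The metrics printed by `test_classifier` can be recomputed from the four confusion counts on the line above. The worked example below does this for the Gaussian Naive Bayes result; the function name `classification_metrics` is hypothetical, and the formulas are the standard definitions of accuracy, precision, recall, and F-beta.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute summary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of all predicted POIs, the share that really are POIs.
        "precision": tp / (tp + fp),
        # Of all real POIs, the share the model found.
        "recall": tp / (tp + fn),
        # F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)
        "f1": 2 * tp / (2 * tp + fn + fp),
        "f2": 5 * tp / (5 * tp + 4 * fn + fp),
    }


# Counts copied from the Gaussian Naive Bayes output above.
nb_metrics = classification_metrics(tp=640, fp=1121, fn=1360, tn=11879)
for name, value in nb_metrics.items():
    print(f"{name}: {value:.5f}")
```

Read this way, a recall of 0.32 means the model finds roughly one in three real POIs, and a precision of about 0.36 means roughly one in three of its POI flags is correct.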



In [48]:
svm = {'clf__kernel'       : ['linear', 'poly', 'rbf'],
       'clf__C'            : [1., 10., 100., 1000.],
       'clf__gamma'        : [0.001, 0.0001],
       'clf__random_state' : [42]}

svm.update(params)

pipe = Pipeline(steps=[('SKB', SelectKBest()),('PCA', PCA()), ('clf', SVC())])
clf = GridSearchCV(pipe, param_grid = svm, cv=cv, scoring=scoring).fit(features, labels)
clf.best_params_

{'PCA__n_components': None,
 'SKB__k': 10,
 'clf__C': 1000.0,
 'clf__gamma': 0.001,
 'clf__kernel': 'linear',
 'clf__random_state': 42}

In [49]:
test_classifier(clf.best_estimator_,my_dataset,features_list)

Pipeline(memory=None,
     steps=[('SKB', SelectKBest(k=10, score_func=<function f_classif at 0x7f3c8b7e81e0>)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='linear',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.87240	Precision: 0.57491	Recall: 0.16500	F1: 0.25641	F2: 0.19244
	Total predictions: 15000	True positives:  330	False positives:  244	False negatives: 1670	True negatives: 12756



In [None]:
dt = {'clf__criterion'        : ['gini', 'entropy'],
      'clf__max_depth'        : [None, 1, 2, 4, 8],
      'clf__min_samples_split': [2, 3, 4, 5],
      'clf__min_samples_leaf' : [2, 4, 6, 8],
      'clf__random_state'     : [42]}

dt.update(params)

pipe = Pipeline(steps=[('SKB', SelectKBest()),('PCA', PCA()), ('clf', DecisionTreeClassifier())])
clf = GridSearchCV(pipe, param_grid = dt, cv=cv, scoring=scoring).fit(features, labels)
clf.best_params_

In [None]:
test_classifier(clf.best_estimator_,my_dataset,features_list)

In [None]:
rf = {'clf__min_samples_split':[2, 4, 6],
      'clf__max_depth'        :[2, 4, 6],
      'clf__random_state'     :[42]}
rf.update(params)

pipe = Pipeline(steps=[('SKB', SelectKBest()),('PCA', PCA()), ('clf', RandomForestClassifier())])
clf = GridSearchCV(pipe, param_grid = rf, cv=cv, scoring=scoring).fit(features, labels)
clf.best_params_

In [None]:
test_classifier(clf.best_estimator_,my_dataset,features_list)

In [None]:
clf = DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_split=6, min_samples_leaf=6)
test_classifier(clf,my_dataset,features_list)

## Validation

## Conclusion

> 1. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]
> 1. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]
> 1. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]
> 1. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]
> 1. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]
> 1. Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

