In [1]:
import sys
sys.path.append("../tools/")

In [102]:
from __future__ import print_function
from sklearn.model_selection import train_test_split
import pickle
from feature_format import featureFormat, targetFeatureSplit
import pandas as pd
import numpy as np
import itertools
from tester import dump_classifier_and_data
from sklearn.decomposition import PCA
import tester
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report

Loading the data set

In [3]:
# Use a context manager so the file handle is closed after loading
with open("../final_project/final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
df = pd.DataFrame.from_records(list(data_dict.values()))
employees = pd.Series(list(data_dict.keys()))
# set the index of df to be the employees series:
df.set_index(employees, inplace=True)

# DATA EXPLORATION

We assume that all the email-related data were collected using valid email addresses. Therefore, if a person's email address is missing, the corresponding email features will be missing as well (NaN). The converse is not true, as we are about to see:

In [4]:
# Missing values are stored as the string 'NaN', so count them explicitly
print("Number of people with missing email data: ", (df['from_messages'] == 'NaN').sum())
print("Number of people with missing email address: ", (df['email_address'] == 'NaN').sum())

Number of people with missing email data:  60
Number of people with missing email address:  35
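
The dataset stores missing entries as the literal string 'NaN' rather than as a true missing value, so counting them amounts to comparing against that string. A minimal sketch on a made-up frame (the names and values below are illustrative assumptions, not taken from the Enron data):

```python
import pandas as pd

# Toy records: missing entries are the string 'NaN', as in data_dict.
toy = pd.DataFrame(
    {
        "email_address": ["a@enron.com", "NaN", "c@enron.com", "d@enron.com"],
        "from_messages": [12, "NaN", "NaN", 7],
    },
    index=["PERSON A", "PERSON B", "PERSON C", "PERSON D"],
)

# Count the sentinel strings column by column.
missing_email_data = int((toy["from_messages"] == "NaN").sum())
missing_address = int((toy["email_address"] == "NaN").sum())

# People who have an address but no email statistics.
address_but_no_data = toy.index[
    (toy["email_address"] != "NaN") & (toy["from_messages"] == "NaN")
].tolist()

print(missing_email_data, missing_address, address_but_no_data)
```

The same boolean masks are what the next cell uses to list the people with an address but no email features.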


Who are these people?

In [5]:
strange_cases = df.index[(df['email_address'] != 'NaN') & (df['from_messages']=='NaN')].tolist()
print("People with at least one email address but without email-related features: ")
print()
for i, name in enumerate(strange_cases):
    print(i + 1, name)

People with at least one email address but without email-related features: 

1 ELLIOTT STEVEN
2 MORDAUNT KRISTINA M
3 WESTFAHL RICHARD K
4 WODRASKA JOHN
5 ECHOLS JOHN B
6 KOPPER MICHAEL J
7 BERBERIAN DAVID
8 DETMERING TIMOTHY J
9 GOLD JOSEPH
10 KISHKILL JOSEPH G
11 LINDHOLM TOD A
12 BUTTS ROBERT H
13 HERMANN ROBERT J
14 SCRIMSHAW MATTHEW
15 FASTOW ANDREW S
16 OVERDYKE JR JERE C
17 STABLER FRANK
18 PRENTICE JAMES
19 WHITE JR THOMAS E
20 CHRISTODOULOU DIOMEDES
21 DIMICHELE RICHARD G
22 YEAGER F SCOTT
23 HIRKO JOSEPH
24 PAI LOU L
25 BAY FRANKLIN R


Those are 25 strange cases in which actual Enron employees with a valid email address do not have email-related features.

In [6]:
emailless_people = df.index[df['email_address'] == 'NaN'].tolist()
print("People without an email address: ")
print()
for i, name in enumerate(emailless_people):
    print(i + 1, name)

People without an email address: 

1 BAXTER JOHN C
2 LOWRY CHARLES P
3 WALTERS GARETH W
4 CHAN RONNIE
5 BELFER ROBERT
6 URQUHART JOHN A
7 WHALEY DAVID A
8 MENDELSOHN JOHN
9 CLINE KENNETH W
10 WAKEHAM JOHN
11 DUNCAN JOHN H
12 LEMAISTRE CHARLES
13 SULLIVAN-SHAKLOVITZ COLLEEN
14 WROBEL BRUCE
15 MEYER JEROME J
16 CUMBERLAND MICHAEL S
17 GAHN ROBERT S
18 GATHMANN WILLIAM D
19 GILLIS JOHN
20 BAZELIDES PHILIP J
21 LOCKHART EUGENE E
22 PEREIRA PAULO V. FERRAZ
23 BLAKE JR. NORMAN P
24 GRAY RODNEY
25 THE TRAVEL AGENCY IN THE PARK
26 NOLES JAMES L
27 TOTAL
28 JAEDICKE ROBERT
29 WINOKUR JR. HERBERT S
30 BADUM JAMES P
31 REYNOLDS LAWRENCE
32 YEAP SOON
33 FUGH JOHN L
34 SAVAGE FRANK
35 GRAMM WENDY L


Here we found two entities that are not real people: THE TRAVEL AGENCY IN THE PARK and TOTAL. These entities do not contribute in any meaningful way to the purpose of this study, so from this moment on we mark them for deletion.

The email address is the only field that cannot be converted to a numeric type. We chose to remove it from the data frame because it is of no use for identifying poi from the given data.
Also, the poi column may only contain zeroes and ones: 1 = poi, 0 = non-poi.

In [7]:
# 'coerce' turns anything that cannot be parsed as a number into NaN
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
del df['email_address']
df['poi']=df['poi'].astype(int)
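
To illustrate what the conversion does, here is a small sketch on invented values (the column contents are assumptions made for the example). With `errors='coerce'`, entries that cannot be parsed as numbers become real NaN values, and the boolean poi flag becomes 0 or 1:

```python
import pandas as pd

# Toy frame mixing numbers with unparseable strings.
toy = pd.DataFrame(
    {
        "salary": [1000, "NaN", "unknown", 2500],
        "poi": [True, False, False, True],
    }
)

# Unparseable entries are coerced to NaN instead of raising an error.
numeric = toy.apply(lambda col: pd.to_numeric(col, errors="coerce"))
numeric["poi"] = numeric["poi"].astype(int)

print(numeric)
```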

In [8]:
poi_label = ['poi']
financial_feat_list = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments', 
                       'loan_advances', 'other', 'expenses', 'director_fees', 'total_payments', 
                       'exercised_stock_options', 'restricted_stock','restricted_stock_deferred', 'total_stock_value']
email_feat_list = ['from_messages', 'from_poi_to_this_person','from_this_person_to_poi', 'shared_receipt_with_poi', 
                   'to_messages']
features_list = poi_label + financial_feat_list + email_feat_list
df=df[features_list]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, METTS MARK to GLISAN JR BEN F
Data columns (total 20 columns):
poi                          146 non-null int64
salary                       95 non-null float64
bonus                        82 non-null float64
long_term_incentive          66 non-null float64
deferred_income              49 non-null float64
deferral_payments            39 non-null float64
loan_advances                4 non-null float64
other                        93 non-null float64
expenses                     95 non-null float64
director_fees                17 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
from_messages                86 non-null float64
from_poi_to_this_person      86 non-null float64
from_this_person_to_poi      86 non-null float64

Counting the number of poi and non-poi in the dataset.

In [9]:
df['poi'].value_counts()

0    128
1     18
Name: poi, dtype: int64

In [16]:
print("Number of poi without email data: ", df[(df['poi']==1) & (df.to_messages.isnull())].shape[0])

Number of poi without email data:  4


# DATA CLEANSING

Checking if there are people without data associated.

In [10]:
df[df.isnull().sum(axis=1) >= df.shape[1]-1]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi,to_messages
LOCKHART EUGENE E,0,,,,,,,,,,,,,,,,,,,


It seems that the only thing we know about Eugene E. Lockhart is that he is not a person of interest.

Because we have relatively few data points, we are going to be extra careful about removing them.
For the moment, we will drop only the items we previously marked for deletion plus this last one, and we will consider any further removals case by case as we proceed with our ML algorithms.

In [13]:
df.drop(['TOTAL','THE TRAVEL AGENCY IN THE PARK', 'LOCKHART EUGENE E'], inplace=True)
df.shape

(143, 20)

In the financial data, NaN means zero. We therefore fill the missing values in the 14 financial columns with zeros.

In [14]:
df.iloc[:, 1:15] = df.iloc[:, 1:15].fillna(0)

After this operation, the number of NaN values dropped dramatically, from 1323 to 285, which is exactly the number of missing email-related entries.

In [15]:
print("Number of email-related missing data: ", df.isnull().sum().sum())

Number of email-related missing data:  285


Perhaps the best way to proceed with the remaining NaN values is to impute them with the median of the corresponding class, computed separately for poi and non-poi.

In [17]:
df[email_feat_list]=df[email_feat_list].fillna(df.groupby("poi")[email_feat_list].transform("median"))
print("Amount of remaining NaN entries in the dataframe:", df.isnull().sum().sum())

Amount of remaining NaN entries in the dataframe: 0
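
The class-wise imputation above can be checked on a tiny example. This is only a sketch with invented numbers; it uses the same `groupby(...).transform("median")` pattern as the cell above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "poi": [1, 1, 1, 0, 0, 0],
        "to_messages": [100.0, np.nan, 300.0, 10.0, 20.0, np.nan],
    }
)

# transform("median") returns, for every row, the median of that row's class.
class_median = toy.groupby("poi")["to_messages"].transform("median")

# Only the missing entries are replaced; observed values are left untouched.
toy["to_messages"] = toy["to_messages"].fillna(class_median)

print(toy)
```

The missing poi entry receives the poi median (200) and the missing non-poi entry receives the non-poi median (15).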


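As noted in some of our references, the financial data were entered manually, which could explain some of the mistakes observed. For each person, the individual payments should add up to total_payments and the stock items to total_stock_value; the rows below fail that check.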

In [18]:
payments = financial_feat_list[:9]
df[df[payments].sum(axis = 1) != df.total_payments][financial_feat_list]

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value
BELFER ROBERT,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0
BHATNAGAR SANJAY,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0


In [19]:
stock_value = financial_feat_list[10:13]
# .copy() gives an independent frame, so the corrections below do not act on a view of df
test_df = df[df[stock_value].sum(axis='columns') != df.total_stock_value][financial_feat_list].copy()
test_df

Unnamed: 0,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value
BELFER ROBERT,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0
BHATNAGAR SANJAY,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0


Fortunately, there are errors in just two rows. Checking the .pdf document obtained from FindLaw, we confirmed that in each case the values are shifted by one column, in opposite directions. Let's correct them.

In [20]:
test_df.loc[['BELFER ROBERT']] = test_df.loc[['BELFER ROBERT']].shift(-1, axis =1).fillna(0)
test_df.loc[['BHATNAGAR SANJAY']] = test_df.loc[['BHATNAGAR SANJAY']].shift(1, axis =1).fillna(0)

df.update(test_df)

# Both totals must now match the sum of their components in every row
payments_ok = (df[payments].sum(axis=1) == df.total_payments).all()
stock_ok = (df[stock_value].sum(axis=1) == df.total_stock_value).all()
if payments_ok and stock_ok:
    print("All the financial data has been corrected")
else:
    print("Some errors remain")

All the financial data has been corrected
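
The column-shift repair can be illustrated on a toy frame. The people, columns, and amounts below are invented for the example; only the `shift(..., axis=1).fillna(0)` pattern mirrors the cell above:

```python
import pandas as pd

columns = ["salary", "bonus", "total_payments", "total_stock_value"]
toy = pd.DataFrame(
    [
        [0.0, 100.0, 50.0, 150.0],  # true row [100, 50, 150, 0], entered one column to the right
        [200.0, 200.0, 10.0, 0.0],  # true row [0, 200, 200, 10], entered one column to the left
    ],
    index=["PERSON A", "PERSON B"],
    columns=columns,
)

# Move each row back by one column; the cell that becomes empty is set to 0.
toy.loc[["PERSON A"]] = toy.loc[["PERSON A"]].shift(-1, axis=1).fillna(0)
toy.loc[["PERSON B"]] = toy.loc[["PERSON B"]].shift(1, axis=1).fillna(0)

# After the repair, the components add up to the total again.
consistent = bool((toy["salary"] + toy["bonus"] == toy["total_payments"]).all())
print(toy)
print(consistent)
```

Filling the vacated cell with zero is only correct when the true value at that position is zero, which is worth verifying against the source document before applying the same fix elsewhere.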


# CREATE NEW FEATURES

The most straightforward way to create new features, in this case, is to combine the existing ones. For example, we can create meaningful ratios of two features. A more involved way to achieve the same goal is to work extensively with the full Enron email dataset. Because we were curious about the people who have an email address but no email-related data, we decided to dive into the Enron email data.

### Finding the mysterious missing data

Exploring the Enron email dataset proved to be a time-consuming task. After searching with an intricate set of regular expressions and specific search criteria based on the observed email address patterns, we were able to find up to 424 different email addresses linked to the people under study. Our search methods were far from optimal, as they required final manual adjudication in many cases. For that reason we believe there could be more email addresses than the ones we were able to find, but we leave that as the subject of a more detailed future study.

In any case, our search allowed us to find some of the missing email addresses, and with that information we rebuilt the existing email-based features for the employees, including those "strange 25 cases". The code is too long to be included here, but we provide a text file describing the procedure, along with the script files we used. We are going to load the data we created, organized like data_dict but built directly from the Enron email dataset. It also contains new features.

In [21]:
with open("missing_data_df.pickle", "rb") as data_file:
    missing_data_df = pickle.load(data_file)

missing_data_df = missing_data_df[missing_data_df.index.isin(strange_cases)]
missing_data_df

index,from_messages,from_poi_cc_this_person,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi,to_messages
FASTOW ANDREW S,9,15,49,5,1136,1183
BERBERIAN DAVID,1,1,5,0,158,159
CHRISTODOULOU DIOMEDES,1,0,0,1,448,521
YEAGER F SCOTT,0,0,10,0,81,88
STABLER FRANK,0,0,4,0,36,89
BAY FRANKLIN R,1,0,10,0,47,124
PRENTICE JAMES,6,0,8,2,72,344
OVERDYKE JR JERE C,3,0,33,0,374,465
ECHOLS JOHN B,8,0,8,5,78,90
WODRASKA JOHN,0,0,0,0,13,96


Mystery solved: we were able to find the email-related data belonging to those 25 people with valid email addresses. A quick look at the data reveals a disproportion between the number of emails sent and received. The number of messages sent by these people is suspiciously low (or zero) for the timeframe considered. We also found email data for an additional 19 people (from our second list of 35 shown above), which display the same trend.

One particular case in the above data frame is worth noting: Andrew S. Fastow. It is hard to believe that the chief financial officer of a corporation, who received at least 1183 emails, sent just nine emails in more than a year, including the time when the financial scandal shattered the company.

I believe that the removal of emails from the dataset for privacy reasons, which took place at some point after the first release of the Enron email data, might have something to do with this. As this might be an intentional intervention in the data set, it could definitely affect the outcome of any classification attempt if these data were included. It is, therefore, reasonable to assume that this situation is the reason behind the absence of the email-related features for those 25 "strange cases" we found earlier.

### New features

Having the Enron email dataset in a workable shape makes it possible to create any number of new features. In this study we are going to try some that are very easy to create. We propose four new features of two different kinds. Two are ratios of existing features, as mentioned earlier, and the other two are the result of working with the entire email dataset.
In the second case, we created an intermediate feature called pubIndex. It is not used explicitly as a feature, but it is part of the process.

pubIndex accounts for the number of people involved in a given email (To and Cc fields), correcting for cases where people sent emails to themselves. The lowest possible value is zero (someone sent an email just to him or herself with no Cc); it equals one if there is a single person in the To field and none in the Cc field, and so on. Note that there is, in principle, no upper limit for this feature.
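
The original scripts are not reproduced in this notebook, so the following is only a sketch of how pubIndex could be computed from the description above. The function name `pub_index`, the address normalization, and the deduplication of people listed in both To and Cc are assumptions made for illustration:

```python
def pub_index(sender, to_addrs, cc_addrs):
    """Count the distinct recipients in To and Cc, excluding the sender.

    Hypothetical helper: addresses are lower-cased and stripped before
    comparison, and a person listed in both To and Cc is counted once.
    """
    recipients = {addr.strip().lower() for addr in list(to_addrs) + list(cc_addrs)}
    recipients.discard(sender.strip().lower())
    return len(recipients)


# An email sent only to oneself has pubIndex 0.
print(pub_index("a@enron.com", ["a@enron.com"], []))
# A single person in To and nobody in Cc gives pubIndex 1.
print(pub_index("a@enron.com", ["b@enron.com"], []))
# Two people in To, one more in Cc, and the sender copied: pubIndex 3.
print(pub_index("a@enron.com", ["b@enron.com", "c@enron.com"], ["d@enron.com", "a@enron.com"]))
```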

### Ratios of existing features:

- to_poi_rate: ratio of from_this_person_to_poi / from_messages

- from_poi_rate: ratio of from_poi_to_this_person / to_messages
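
As a quick illustration of these ratios, the sketch below computes to_poi_rate on invented numbers. The guard against a zero denominator is an assumption added for the example; the notebook itself divides the columns directly:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "from_this_person_to_poi": [5.0, 0.0, 2.0],
        "from_messages": [50.0, 0.0, 8.0],
    }
)

# Replace zero denominators with NaN so the division never yields inf,
# then treat "no messages sent" as a rate of zero (an assumption).
rate = toy["from_this_person_to_poi"] / toy["from_messages"].replace(0, np.nan)
toy["to_poi_rate"] = rate.fillna(0.0)

print(toy)
```

from_poi_rate follows the same pattern with from_poi_to_this_person and to_messages.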

### New features from the email dataset:

- from_messages_median_pubIndex: We grouped all the emails sent by a given person and took the median of the pubIndex feature.

- to_poi_median_pubIndex: The same as above but considering just when sending messages to poi.
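
A minimal sketch of the two aggregations, assuming a per-email table with hypothetical columns `sender`, `pubIndex`, and a boolean `to_poi` flag (the real intermediate tables are not shown in this notebook):

```python
import pandas as pd

# One row per sent email; values are invented for illustration.
emails = pd.DataFrame(
    {
        "sender": ["a", "a", "a", "b", "b"],
        "pubIndex": [1, 3, 8, 2, 2],
        "to_poi": [True, False, True, False, True],
    }
)

# Median pubIndex over all emails sent by each person.
from_messages_median_pubIndex = emails.groupby("sender")["pubIndex"].median()

# Same statistic, restricted to emails addressed to a poi.
to_poi_median_pubIndex = emails[emails["to_poi"]].groupby("sender")["pubIndex"].median()

print(from_messages_median_pubIndex)
print(to_poi_median_pubIndex)
```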



In [22]:
with open("new_data_df.pickle", "rb") as data_file:
    df_new = pickle.load(data_file)
    df_new.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK', 'LOCKHART EUGENE E'], inplace = True)


In [23]:
df['to_poi_rate'] = df['from_this_person_to_poi']/df['from_messages']
df['from_poi_rate'] = df['from_poi_to_this_person']/df['to_messages']
new_feat_list = ['from_messages_median_pubIndex', 'to_poi_median_pubIndex']
df = pd.concat([df, df_new], axis=1)
df[new_feat_list]=df[new_feat_list].fillna(df.groupby("poi")[new_feat_list].transform("median"))

In [24]:
print("Number of email-related missing data: ", df.isnull().sum().sum())

Number of email-related missing data:  0


In [25]:
features_list = poi_label + financial_feat_list + email_feat_list + ['to_poi_rate', 'from_poi_rate' ] + new_feat_list
print("Total number of features: ", len(features_list)-1)

Total number of features:  23


In [26]:
### Store to my_dataset for easy export below.
my_dataset = df.to_dict(orient = 'index')

# INITIAL ALGORITHM TRAINING

### Extract features and labels from dataset for local testing

In [103]:
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, 
                                                                            test_size=0.3, random_state=42)

In [40]:
# Cross-validation for parameter tuning in grid search 
sss = StratifiedShuffleSplit(n_splits=3, test_size = 0.25, random_state = 0)
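
With so few poi in the data, a plain random split can easily leave a validation fold with almost no positives. The sketch below, on synthetic labels chosen for the example, shows that StratifiedShuffleSplit keeps the class proportion in every test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels: 12 positives out of 100 samples.
labels = np.array([1] * 12 + [0] * 88)
features = np.arange(100).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# Count the positives that land in each 25-sample test fold.
poi_in_test = [int(labels[test_idx].sum()) for _, test_idx in sss.split(features, labels)]
print(poi_in_test)
```

Each fold contains 3 positives, that is, 12% of 25 samples, matching the overall class ratio.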

### Create a pipeline for Naive Bayes

In [109]:
scaler = StandardScaler()
select = SelectKBest()
clf = GaussianNB()

steps = [
         # Preprocessing
         ('standard_scaler', scaler),
         
         # Feature selection
         ('feature_selection', select),
         
         # Classifier
         ('clf', clf)
         ]
# Create pipeline
pipeline = Pipeline(steps)

parameters = dict(feature_selection__k=[2, 3, 5, 6, 7, 8, 9, 10, 12])


# Create, fit, and make predictions with grid search
gs = GridSearchCV(pipeline,
                  param_grid=parameters,
                  scoring="f1",
                  cv=sss.split(features_train, labels_train),
                  error_score=0)
gs.fit(features_train, labels_train)

labels_predictions = gs.predict(features_test)

#clf = gs.best_estimator_

print ("\n", "Best parameters: ", gs.best_params_, "\n")
print(" Best score: ", gs.best_score_ , "\n")

classif_report = classification_report(labels_test, labels_predictions)
print(classif_report)

scores = gs.best_estimator_.named_steps['feature_selection'].scores_
mask = gs.best_estimator_.named_steps['feature_selection'].get_support()

kselect_features = [] 
feat_importance = []
# 'selected' avoids shadowing the built-in bool
for selected, feature, score in zip(mask, features_list[1:], scores):
    if selected:
        kselect_features.append(feature)
        feat_importance.append([feature, round(score, 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()

kselect_features.insert(0, "poi")


 Best parameters:  {'feature_selection__k': 2} 

 Best score:  0.374074074074 

             precision    recall  f1-score   support

        0.0       0.89      0.89      0.89        38
        1.0       0.20      0.20      0.20         5

avg / total       0.81      0.81      0.81        43

bonus ===> 30.73
to_poi_rate ===> 23.94



In [105]:

tester.dump_classifier_and_data(pipeline, my_dataset, features_list)
tester.main();

Pipeline(memory=None,
     steps=[('standard_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x1a10f669b0>)), ('clf', GaussianNB(priors=None))])
	Accuracy: 0.83967	Precision: 0.37809	Recall: 0.31400	F1: 0.34308	F2: 0.32502
	Total predictions: 15000	True positives:  628	False positives: 1033	False negatives: 1372	True negatives: 11967
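
The metrics reported by the tester can be recomputed from the confusion counts printed above. The sketch below uses the standard definitions of accuracy, precision, recall, and the F-beta scores; the variable names are chosen for the example:

```python
# Confusion counts taken from the tester output above.
tp, fp, fn, tn = 628, 1033, 1372, 11967

total = tp + fp + fn + tn
accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)

# F-beta weighs recall beta times as much as precision.
f1 = 2 * precision * recall / (precision + recall)
f2 = 5 * precision * recall / (4 * precision + recall)

print(round(accuracy, 5), round(precision, 5), round(recall, 5), round(f1, 5), round(f2, 5))
```

The rounded values agree with the figures in the tester output (0.83967, 0.37809, 0.31400, 0.34308, 0.32502).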



In [110]:
best_pipe = Pipeline([('standard_scaler', StandardScaler()),
                          ('feature_selection', SelectKBest(k=2)),
                          ('clf', GaussianNB())])

tester.dump_classifier_and_data(best_pipe, my_dataset, kselect_features)
tester.main();

Pipeline(memory=None,
     steps=[('standard_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('feature_selection', SelectKBest(k=2, score_func=<function f_classif at 0x1a10f669b0>)), ('clf', GaussianNB(priors=None))])
	Accuracy: 0.83343	Precision: 0.34250	Recall: 0.18050	F1: 0.23641	F2: 0.19936
	Total predictions: 14000	True positives:  361	False positives:  693	False negatives: 1639	True negatives: 11307



### Create a pipeline for Decision Tree

In [50]:
scaler = StandardScaler()
select = SelectKBest()
dtc = DecisionTreeClassifier()

steps = [
         # Preprocessing
         ('standard_scaler', scaler),
         
         # Feature selection
         ('feature_selection', select),
         
         # Classifier
         ('dtc', dtc)
         # ('svc', svc)
         # ('knc', knc)
         ]
# Create pipeline
pipeline = Pipeline(steps)

# Parameters to try in grid search
parameters = dict(
                  feature_selection__k=[2, 3, 5, 6, 8, 10, 12], 
                  dtc__criterion=['gini', 'entropy'],
                  dtc__splitter=['best', 'random'],
                  dtc__max_depth=[None, 1, 2, 3, 4],
                  dtc__min_samples_split=[2, 3, 4, 25],
                  dtc__min_samples_leaf=[1, 2, 3, 4],
                  dtc__min_weight_fraction_leaf=[0, 0.25, 0.5],
                  dtc__class_weight=[None, 'balanced'],
                  dtc__random_state=[45]
                  )


In [51]:
import time
start = time.time()
# Create, fit, and make predictions with grid search
gs = GridSearchCV(pipeline,
                  param_grid=parameters,
                  scoring="f1",
                  cv=sss.split(features_train, labels_train),
                  error_score=0)
gs.fit(features_train, labels_train)

labels_predictions = gs.predict(features_test)

#clf = gs.best_estimator_
print ("\n", "Best parameters are: ", gs.best_params_, "\n")
print()
print ('It took', time.time()-start, 'seconds.')


 Best parameters are:  {'dtc__max_depth': None, 'dtc__criterion': 'gini', 'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 3, 'dtc__splitter': 'best', 'dtc__class_weight': None, 'feature_selection__k': 10, 'dtc__random_state': 45, 'dtc__min_weight_fraction_leaf': 0} 


It took 745.417464972 seconds.
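
The running time follows from the size of the grid. As a rough check, the sketch below counts the parameter combinations with scikit-learn's ParameterGrid, using the same grid as the cell above, and multiplies by the three cross-validation splits:

```python
from sklearn.model_selection import ParameterGrid

parameters = dict(
    feature_selection__k=[2, 3, 5, 6, 8, 10, 12],
    dtc__criterion=['gini', 'entropy'],
    dtc__splitter=['best', 'random'],
    dtc__max_depth=[None, 1, 2, 3, 4],
    dtc__min_samples_split=[2, 3, 4, 25],
    dtc__min_samples_leaf=[1, 2, 3, 4],
    dtc__min_weight_fraction_leaf=[0, 0.25, 0.5],
    dtc__class_weight=[None, 'balanced'],
    dtc__random_state=[45],
)

# Number of distinct parameter combinations in the grid.
n_candidates = len(ParameterGrid(parameters))

# Each candidate is fitted once per cross-validation split (n_splits=3).
n_fits = n_candidates * 3
print(n_candidates, n_fits)
```

That is 13440 candidates and 40320 cross-validation fits, plus one final refit of the best candidate. Trimming the least informative parameters, or switching to a randomized search, would shorten the run considerably.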


In [52]:
print(" Best score: ", gs.best_score_ , "\n")

 Best score:  0.85 



In [53]:
finalFeatureIndices = gs.best_estimator_.named_steps['feature_selection'].get_support(indices=True)
finalFeatureIndices

array([ 0,  1,  3,  9, 10, 11, 13, 17, 19, 22])

In [54]:
gs.best_estimator_.named_steps['dtc'].feature_importances_

array([ 0.066313  ,  0.02095606,  0.066313  ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.44112559,  0.40529236,  0.        ])

In [49]:
best_pipe = Pipeline([('standard_scaler', StandardScaler()),
                          ('feature_selection', SelectKBest(k=8)),
                          ('clf', DecisionTreeClassifier(max_depth = 4, criterion = 'gini', 
                                                        min_samples_leaf = 1, min_samples_split = 2, 
                                                        splitter = 'best', class_weight = None, 
                                                        min_weight_fraction_leaf = 0, random_state = 45))])

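# Note: this call evaluates `pipeline` (the untuned template) with the full feature list, not `best_pipe`
tester.dump_classifier_and_data(pipeline, my_dataset, features_list)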
tester.main();

Pipeline(memory=None,
     steps=[('standard_scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x1a10f669b0>)), ('dtc', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
	Accuracy: 0.88887	Precision: 0.59858	Recall: 0.50550	F1: 0.54812	F2: 0.52173
	Total predictions: 15000	True positives: 1011	False positives:  678	False negatives:  989	True negatives: 12322



In [None]:
##### TESTING ZONE #####

In [56]:
scaler = StandardScaler()
s_scaled = scaler.fit_transform(df.iloc[:,1:])
scaled_df = pd.DataFrame(s_scaled, index=df.index)
scaled_df.insert(0, "poi", df.poi)
scaled_df.columns = features_list
scaled_data_dict = scaled_df.to_dict(orient = 'index')

In [95]:
clf = DecisionTreeClassifier()
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.05:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
tree_feat_list = [x[0] for x in feat_importance]
tree_feat_list.insert(0, 'poi')
tree_lim_df = scaled_df[tree_feat_list]
lim_feat_data_dict = tree_lim_df.to_dict(orient = 'index')



to_poi_rate ===> 0.35
shared_receipt_with_poi ===> 0.22
expenses ===> 0.17
other ===> 0.16
from_poi_rate ===> 0.11



In [96]:
sdata = featureFormat(lim_feat_data_dict, tree_feat_list, sort_keys = True)
labels, features = targetFeatureSplit(sdata)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, 
                                                                            test_size=0.3, random_state=42)

In [99]:
# Parameters to try in grid search
parameters = dict(
                  criterion=['gini', 'entropy'],
                  splitter=['best', 'random'],
                  max_depth=[None, 1, 2, 3, 4],
                  min_samples_split=[2, 3, 4, 25],
                  min_samples_leaf=[1, 2, 3, 4],
                  min_weight_fraction_leaf=[0, 0.25, 0.5],
                  class_weight=[None, 'balanced'],
                  random_state=[45]
                  )

clf = GridSearchCV(DecisionTreeClassifier(random_state = 45), param_grid = parameters, cv=sss.split(
                                          features_train, labels_train),scoring='f1')
clf.fit(features_train, labels_train)
labels_predictions = clf.predict(features_test)
clf.best_params_

{'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0,
 'random_state': 45,
 'splitter': 'best'}

In [101]:

classif_report = classification_report(labels_test, labels_predictions)
print(classif_report)

             precision    recall  f1-score   support

        0.0       0.97      0.92      0.95        38
        1.0       0.57      0.80      0.67         5

avg / total       0.93      0.91      0.91        43



In [98]:
clf = DecisionTreeClassifier(class_weight = None, criterion = 'entropy', max_depth = 3, 
                            min_samples_leaf = 3, min_samples_split = 2, min_weight_fraction_leaf = 0, 
                            random_state = 45, splitter = 'best')
dump_classifier_and_data(clf, lim_feat_data_dict, tree_feat_list )
tester.main()

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0, presort=False, random_state=45,
            splitter='best')
	Accuracy: 0.93047	Precision: 0.70868	Recall: 0.81250	F1: 0.75705	F2: 0.78937
	Total predictions: 15000	True positives: 1625	False positives:  668	False negatives:  375	True negatives: 12332



In [79]:
AdaBoostClassifier().get_params().keys()

['n_estimators',
 'base_estimator',
 'random_state',
 'learning_rate',
 'algorithm']

In [87]:
clf = AdaBoostClassifier(random_state = 45)
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.04:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
feat_list = [x[0] for x in feat_importance]
feat_list.insert(0, 'poi')
lim_feat_sdata_dict = {}
for key in scaled_data_dict:
    lim_feat_sdata_dict[key] = {}
    for feat in scaled_data_dict[key]:
        if feat in feat_list:
            lim_feat_sdata_dict[key][feat]= scaled_data_dict[key][feat]

shared_receipt_with_poi ===> 0.14
to_poi_rate ===> 0.1
expenses ===> 0.08
from_this_person_to_poi ===> 0.08
from_poi_rate ===> 0.08
deferred_income ===> 0.06
other ===> 0.06
total_payments ===> 0.06
exercised_stock_options ===> 0.06



In [88]:
sdata = featureFormat(lim_feat_sdata_dict, feat_list, sort_keys = True)
labels, features = targetFeatureSplit(sdata)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, 
                                                                            test_size=0.3, random_state=42)

In [94]:
start = time.time()
param_grid = {
              "base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "base_estimator__max_depth" : [None, 1, 2, 3, 4],
              "base_estimator__min_samples_leaf" : [1, 2, 3, 4], 
              "base_estimator__min_samples_split" : [2, 3, 4, 5, 6],
              "n_estimators": [50, 200, 400, 600, 800],
              "learning_rate": [0.01, 0.1, 1]
             }

DTC = DecisionTreeClassifier(random_state = 45, class_weight = None, min_weight_fraction_leaf = 0)
ABC = AdaBoostClassifier(base_estimator = DTC)

# run grid search
clf = GridSearchCV(ABC, param_grid=param_grid, scoring = 'f1')

clf.fit(features_train, labels_train)
print(clf.best_params_)

print()
print ('It took', time.time()-start, 'seconds.')

KeyboardInterrupt: 

In [91]:
clf.best_params_

{'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': 1,
 'base_estimator__min_samples_leaf': 3,
 'base_estimator__min_samples_split': 2,
 'base_estimator__splitter': 'random',
 'learning_rate': 0.1,
 'n_estimators': 600}

In [92]:
dtc = DecisionTreeClassifier(criterion = 'gini', max_depth = 1, min_samples_leaf = 3, min_samples_split = 2, 
                            splitter = 'random', random_state = 45)
clf = AdaBoostClassifier(base_estimator = dtc, learning_rate = 0.1, n_estimators = 600)
tester.dump_classifier_and_data(clf, lim_feat_sdata_dict, feat_list)
tester.main();

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=45,
            splitter='random'),
          learning_rate=0.1, n_estimators=600, random_state=None)
	Accuracy: 0.89473	Precision: 0.62274	Recall: 0.53400	F1: 0.57497	F2: 0.54967
	Total predictions: 15000	True positives: 1068	False positives:  647	False negatives:  932	True negatives: 12353



Right out of the box, only the Gaussian Naive Bayes classifier failed to meet the minimum requirement of 0.3 or higher in Accuracy, Recall and Precision. Let's see if that can be fixed by scaling and changing the number of features.

### Three classifiers, data scaled, reduced number of features.

When working with a collection of features, their values sometimes span very different ranges. In our case, the email-related features reach the thousands at most, while the financial features can be several orders of magnitude larger. This could make the financial features appear more relevant than the email-related ones just because of their size. To correct for this, we applied standard feature scaling to the data.

In [None]:
scaler = StandardScaler()
s_scaled = scaler.fit_transform(df.iloc[:,1:])
scaled_df = pd.DataFrame(s_scaled, index=df.index)
scaled_df.insert(0, "poi", df.poi)
scaled_df.columns = features_list
scaled_data_dict = scaled_df.to_dict(orient = 'index')
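
To see what standard scaling does, the sketch below applies StandardScaler to two invented columns whose magnitudes differ by several orders. After scaling, each column has zero mean and unit standard deviation, so neither dominates because of its units:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 mimics a financial feature, column 1 an email count (made-up values).
toy = np.array(
    [
        [1e6, 10.0],
        [3e6, 30.0],
        [5e6, 20.0],
    ]
)

# Each column is transformed to (x - mean) / std.
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Note that fitting the scaler on the whole data set, as in the cell above, lets information from the test rows influence the scaling; placing the scaler inside a Pipeline avoids that.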

It is well known that a large number of features can hurt the overall performance of a model and tends to cause overfitting. Next, we will try to reduce the dimensionality of our data in different ways.

In [None]:
# PCA: find the smallest number of components that retains at least 90% of the variance
for nc in range(1, scaled_df.shape[1]-1):
    my_model = PCA(n_components=nc)
    ft = my_model.fit_transform(scaled_df.iloc[:,1:])
    retained = my_model.explained_variance_ratio_.sum()
    if retained >= 0.9:
        pca_df = pd.DataFrame(ft, index=scaled_df.index)
        pca_df.insert(0, "poi", scaled_df.poi)
        pca_data_dict = pca_df.to_dict(orient = 'index')
        pca_feat_list = ['poi'] + list(range(nc))
        print("Number of components: ", nc)
        print("Retain", retained * 100, "% of the variance")
        break


In [None]:
clf = GaussianNB()
dump_classifier_and_data(clf, pca_data_dict, pca_feat_list)
tester.main()

Using Principal Component Analysis (PCA), we reduced the number of features while retaining at least 90% of the variance. Note that the PCA components are new features, different from the ones we have already defined. That was barely enough to produce a project-compliant Naive Bayes classifier. 
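
The selection rule itself is simple: accumulate the explained variance ratios, largest first, and stop at the first component where the running total reaches the threshold. The sketch below illustrates the rule on made-up ratios; the function name `components_for_variance` and the numbers are assumptions for illustration, not values from our data.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    running_totals = accumulate(explained_variance_ratio)
    for n_components, cumulative in enumerate(running_totals, start=1):
        if cumulative >= threshold:
            return n_components
    return None  # threshold not reachable with the components given

# Hypothetical ratios, sorted in decreasing order as PCA reports them.
ratios = [0.45, 0.25, 0.15, 0.08, 0.04, 0.03]
n_needed = components_for_variance(ratios)
print("Number of components:", n_needed)
```

As far as we know, scikit-learn's `PCA` can also apply a rule of this kind directly when `n_components` is given as a float between 0 and 1 (with `svd_solver='full'`), which would avoid refitting the model once per candidate value.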

For the Naive Bayes classifier, we also searched manually for the optimal number of features using SelectKBest as the feature selection method, which produced slightly better results than PCA.
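
The score function used with SelectKBest here, `f_classif`, ranks each feature by its one-way ANOVA F statistic: the variance of the class means relative to the variance within each class. The following is a minimal pure-Python sketch of that statistic for a single feature. The toy data and the helper name `anova_f_score` are hypothetical, and scikit-learn's own implementation is vectorized rather than written like this.

```python
def anova_f_score(feature_values, labels):
    """One-way ANOVA F statistic of a single feature across the classes in labels."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(label, []).append(value)

    n_samples = len(feature_values)
    grand_mean = sum(feature_values) / n_samples

    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the samples around their own class mean
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = n_samples - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical toy data: 1 marks a poi, 0 a non-poi.
labels = [0, 0, 0, 1, 1, 1]
separating_feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noisy_feature = [1.0, 5.0, 3.0, 2.0, 6.0, 4.0]

f_separating = anova_f_score(separating_feature, labels)
f_noisy = anova_f_score(noisy_feature, labels)
print(f_separating, f_noisy)
```

A feature whose values differ clearly between the two classes gets a high score, while one whose class means are close relative to the within-class spread gets a low score and is dropped first.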

In [None]:
kbest_feat = 8
selector = SelectKBest(f_classif, k=kbest_feat)
select_k_best_classifier = selector.fit_transform(s_scaled , df.poi)
scores = selector.scores_
mask = selector.get_support() 
kselect_features = [] 
feat_importance = []
for selected, feature, score in zip(mask, cols[1:], scores):  # avoid shadowing the built-in bool
    if selected:
        kselect_features.append(feature)
        feat_importance.append([feature, round(score, 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
kselect_df = pd.DataFrame(select_k_best_classifier, index = df.index, columns = kselect_features)
kselect_df.insert(0, "poi", df.poi)
kselect_data_dict = kselect_df.to_dict(orient = 'index')  
clf = GaussianNB()
dump_classifier_and_data(clf, kselect_data_dict, ["poi"] + kselect_features )
tester.main()


In [None]:
clf = DecisionTreeClassifier(random_state = 45)
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.1:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
tree_feat_list = [x[0] for x in feat_importance]
tree_feat_list.insert(0, 'poi')
tree_lim_df = scaled_df[tree_feat_list]
lim_feat_data_dict = tree_lim_df.to_dict(orient = 'index')

dump_classifier_and_data(clf, lim_feat_data_dict, tree_feat_list )
tester.main()

In [None]:
clf = AdaBoostClassifier(random_state = 45)
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.02:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
feat_list = [x[0] for x in feat_importance]
feat_list.insert(0, 'poi')
lim_feat_sdata_dict = {}
for key in scaled_data_dict:
    lim_feat_sdata_dict[key] = {}
    for feat in scaled_data_dict[key]:
        if feat in feat_list:
            lim_feat_sdata_dict[key][feat]= scaled_data_dict[key][feat]
dump_classifier_and_data(clf, lim_feat_sdata_dict, feat_list )
tester.main()

Interestingly, after feature scaling and selection, all three classifiers met the minimum requirements of this project. Although Naive Bayes showed the largest performance increase, it still falls short of the other two classifiers.

Before going further with algorithm tuning, let us create a collection of new features.

### Scaling the features and running the classifiers (again)

In [None]:
scaler = StandardScaler()
s_scaled = scaler.fit_transform(df.iloc[:,1:])
scaled_df = pd.DataFrame(s_scaled, index=df.index)
scaled_df.insert(0, "poi", df.poi)
scaled_df.columns = full_feat_cols
scaled_data_dict = scaled_df.to_dict(orient = 'index')

In [None]:
kbest_feat = 9
selector = SelectKBest(f_classif, k=kbest_feat)
select_k_best_classifier = selector.fit_transform(s_scaled , df.poi)
scores = selector.scores_
mask = selector.get_support() 
kselect_features = [] 
feat_importance = []
for selected, feature, score in zip(mask, full_feat_cols[1:], scores):  # avoid shadowing the built-in bool
    if selected:
        kselect_features.append(feature)
        feat_importance.append([feature, round(score, 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
kselect_df = pd.DataFrame(select_k_best_classifier, index = df.index, columns = kselect_features)
kselect_df.insert(0, "poi", df.poi)
kselect_data_dict = kselect_df.to_dict(orient = 'index')  
clf = GaussianNB()
dump_classifier_and_data(clf, kselect_data_dict, ["poi"] + kselect_features)
tester.main()

In [None]:
clf = DecisionTreeClassifier(random_state = 45)
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.1:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
tree_feat_list = [x[0] for x in feat_importance]
tree_feat_list.insert(0, 'poi')
tree_lim_df = scaled_df[tree_feat_list]
lim_feat_data_dict = tree_lim_df.to_dict(orient = 'index')

dump_classifier_and_data(clf, lim_feat_data_dict, tree_feat_list )
tester.main()

In [None]:
clf = AdaBoostClassifier(random_state = 45)
clf.fit(scaled_df.iloc[:,1:], scaled_df["poi"])

feat_importance = []
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.02:
        feat_importance.append([scaled_df.columns[i+1], round(clf.feature_importances_[i], 2)])
feat_importance.sort(key=lambda x: x[1], reverse = True)
for item in feat_importance:
    print('{} ===> {}'.format(item[0], item[1]))
print()
feat_list = [x[0] for x in feat_importance]
feat_list.insert(0, 'poi')
lim_feat_sdata_dict = {}
for key in scaled_data_dict:
    lim_feat_sdata_dict[key] = {}
    for feat in scaled_data_dict[key]:
        if feat in feat_list:
            lim_feat_sdata_dict[key][feat]= scaled_data_dict[key][feat]
dump_classifier_and_data(clf, lim_feat_sdata_dict, feat_list )
tester.main()

Not all the classifiers reacted in the same way to the addition of the new features. The Decision Tree classifier was the one that improved the most. For that reason, and because it is faster than the AdaBoost classifier, we picked it for further optimization.

# ALGORITHM TUNING

In [None]:
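# Testing Parameters. Decision Tree Classifier
# Note: 'auto' is left out of max_features because, for classifiers, it is the same as 'sqrt'
# and recent scikit-learn releases no longer accept it.
param_grid = dict(criterion = ['gini', 'entropy'] , 
                  min_samples_split = [2, 4, 8, 12, 16, 20, 24, 26, 28],
                  max_depth = [None, 1, 2, 3, 4, 5, 6, 7],
                  max_features = [None, 'sqrt', 'log2'])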
clf = GridSearchCV(DecisionTreeClassifier(random_state = 45), param_grid = param_grid, cv=10,
                       scoring='f1')
clf.fit(tree_lim_df.iloc[:,1:], tree_lim_df.poi)
clf.best_params_

In [None]:
clf = DecisionTreeClassifier(criterion = 'entropy', max_depth = None, max_features = None, 
                            min_samples_split = 20)
dump_classifier_and_data(clf, lim_feat_data_dict, tree_feat_list )
tester.main()

After tuning the Decision Tree Classifier parameters, we obtained decent values for all the relevant metrics (all of them at or above 0.7). It is worth mentioning that we ran this with a fixed random state. If we remove this restriction, both the optimal parameters and the results given by tester.py change from one run to the next. Realistically, we should expect our metrics to be, on average, a little lower than the ones shown, but still well above the minimum acceptable value of 0.3.
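
One way to gauge how much the random state matters is to repeat the evaluation over several seeds and look at the average and the spread. The sketch below does this on synthetic data generated with `make_classification`, used here only as a stand-in for our data set (similar size and class imbalance). The numbers it prints say nothing about the Enron results, and the choice of `StratifiedShuffleSplit` with 50 splits is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 144 samples, roughly 12% positives.
X, y = make_classification(n_samples=144, n_features=6, n_informative=3,
                           weights=[0.88, 0.12], random_state=0)

mean_f1_per_seed = []
for seed in range(5):
    # Same tuned parameters as above, but with a different seed on each pass.
    clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=20,
                                 random_state=seed)
    cv = StratifiedShuffleSplit(n_splits=50, test_size=0.3, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring='f1')
    mean_f1_per_seed.append(scores.mean())

print("Mean F1 per seed:", np.round(mean_f1_per_seed, 3))
print("Average: {:.3f}, spread (std): {:.3f}".format(
    np.mean(mean_f1_per_seed), np.std(mean_f1_per_seed)))
```

Reporting the average over seeds, rather than the result of a single fixed seed, gives a more honest estimate of the performance to expect.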

In [None]:
print("Final Results")
pd.DataFrame([[0.93, 0.71, 0.81, 0.76]],
             columns = ['Accuracy','Precision', 'Recall', 'F1'], 
             index = ['Decision Tree Classifier'])

# CONCLUSIONS

We applied three well-known machine learning algorithms to a combination of financial data (FindLaw) and email-related data from the Enron dataset in an attempt to find persons of interest (poi) in the Enron scandal case.
As is almost always the case in data analysis, the preprocessing of the data played an essential role in the development and final results of our project.

The data set provided contained information about 144 people involved (18 of them poi), which is, by any measure, a small amount of data for the intended task. With an initial count of 1323 missing entries, including 25 "strange cases" of missing email information, the prospects for success were not encouraging. Fortunately, we learned that the missing (NaN) values in the financial data were in reality zeros, which dramatically reduced our missing entries to a more manageable 290. After that, making use of the Enron email dataset, we were able to unravel the mystery of those 25 missing cases and acted accordingly. We chose to impute the remaining NaN values with the median of their respective columns, computed separately for poi and non-poi. That decision was not taken blindly: we chose it because it maximized the performance of our classifiers.

After applying Gaussian Naive Bayes, Decision Tree and AdaBoost classifiers to a reduced number of features of the previously scaled data, we observed that all three of them fulfilled the minimum requirements for the project, with a slight advantage for Decision Tree. The incorporation of the new features ended up tipping the balance toward the Decision Tree Classifier. We tuned this classifier using GridSearchCV and a fixed random_state to make our results consistent from one run to the next. In the end, after applying tester.py, our best values were as follows: Accuracy: 0.93, Precision: 0.71, Recall: 0.81 and F1: 0.76. 
We believe that this set of values is indicative of a solid performance by the Decision Tree classifier in this problem, but there is still room for further improvement. 


## Limitations and future work

As we mentioned above, one of the inherent limitations of this project stems from the small size of the data set, just 144 people (18 poi). The quality of the data also played a significant role: the government's intervention over privacy concerns substantially modified part of the data, making it useless for this study.
What is exciting about this project is that, by using the full email data set, it is possible to create, at least in principle, any number of new and meaningful features. I believe that if we work through this process in more detail, we could end up creating features that further improve the performance of our classifiers. For instance, some features could take into account the flow of emails around the critical dates, or be the result of sentiment analysis applied to the email texts. This is an endeavor I will happily pursue in the future.


# REFERENCES

1. http://www.ahschulz.de/enron-email-data/
2. https://enrondata.readthedocs.io/en/latest/data/custodian-names-and-titles/
3. http://www.infosys.tuwien.ac.at/staff/dschall/email/enron-employees.txt
4. https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/
5. https://codereview.stackexchange.com/questions/146834/function-to-find-all-occurrences-of-substring
6. https://regex101.com
7. https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
8. https://rodgersnotes.wordpress.com/2013/11/19/enron-email-analysis-persons-of-interest/
