Udacity - Intro to Machine Learning


ENRON SCANDAL

Summary link - Wikipedia

The Enron scandal was a financial scandal that eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.

Enron was formed in 1985 by Kenneth Lay after merging Houston Natural Gas and InterNorth. Several years later, when Jeffrey Skilling was hired, he developed a staff of executives that – by the use of accounting loopholes, special purpose entities, and poor financial reporting – were able to hide billions of dollars in debt from failed deals and projects. Chief Financial Officer Andrew Fastow and other executives not only misled Enron's Board of Directors and Audit Committee on high-risk accounting practices, but also pressured Arthur Andersen to ignore the issues.

Enron shareholders filed a 40 billion dollar lawsuit after the company's stock price, which achieved a high of US$90.75 per share in mid-2000, plummeted to less than 1 dollar by the end of November 2001. The U.S. Securities and Exchange Commission (SEC) began an investigation, and Houston rival Dynegy offered to purchase the company at a very low price. The deal failed, and on December 2, 2001, Enron filed for bankruptcy under Chapter 11 of the United States Bankruptcy Code. Enron's 63.4 billion dollars in assets made it the largest corporate bankruptcy in U.S. history until WorldCom's bankruptcy the next year.

Many executives at Enron were indicted for a variety of charges and some were later sentenced to prison. Enron's auditor, Arthur Andersen, was found guilty in a United States District Court of illegally destroying documents relevant to the SEC investigation which voided its license to audit public companies, effectively closing the business. By the time the ruling was overturned at the U.S. Supreme Court, the company had lost the majority of its customers and had ceased operating. Enron employees and shareholders received limited returns in lawsuits, despite losing billions in pensions and stock prices. As a consequence of the scandal, new regulations and legislation were enacted to expand the accuracy of financial reporting for public companies. One piece of legislation, the Sarbanes–Oxley Act, increased penalties for destroying, altering, or fabricating records in federal investigations or for attempting to defraud shareholders. The act also increased the accountability of auditing firms to remain unbiased and independent of their clients.

(Image: Trial of Ken Lay and Jeff Skilling)

Final Project: Identify Fraud from Enron Email

By: Marissa Schmucker

May 2018


Table of Contents

Project Goal |
Dataset Questions |
Dataset Information |
Feature Statistics |
Explore Features |

Salary

Bonus

Total Payments

Exercised Stock Options

Total Stock Value

Total Bonus and Exercised Stock Options

Total Payments and Stock Value in Millions

Shared Receipt with POI

To Messages

From Messages

Fraction to POI

Fraction from POI

Outliers |
Transform, Select, and Scale |
Algorithm Selection |
Evaluation Metrics |
Performance Test |
Parameter Tuning |
Final Analysis |
Validating Our Analysis |
Final Thoughts |


Project Goal

The goal of this project is to use the Enron dataset to train a machine learning algorithm to detect the possibility of fraud (i.e., to identify persons of interest). Since we know the persons of interest (POIs) in our dataset, we can use supervised learning algorithms to construct our POI identifier. This will be done by picking the features within our dataset that best separate our POIs from our non-POIs.

We will start our analysis by answering some questions about our data. Then, we will explore our features further by visualizing correlations and outliers. Next, we will transform/scale our features and select those that will be most useful in our POI identifier, engineering new features and adding them to the dataset if they prove useful for our analysis. We will identify at least two algorithms that may be best suited for our particular set of data and test them, tuning our parameters until optimal performance is reached. In our final analysis, the algorithm we have fit will be validated using our training/testing data. Using performance metrics to evaluate our results, any problems will be addressed and modifications made. In our final thoughts, the performance of our final algorithm will be discussed.

"""Import pickle and sklearn to get started.
Load the data as enron_dict"""

import pickle
import sklearn

enron_dict = pickle.load(open("final_project_dataset.pkl", "rb"))

Dataset Questions

After getting our data dictionary loaded, we can start exploring our data. We'll answer the following questions:

  1. How many people do we have in our dataset?
  2. What are their names?
  3. What information do we have about these people?
  4. Who are the POIs in our dataset?
  5. Who are the highest earners? Are they POIs?
  6. Whose stock options had the highest value (max exercised_stock_options)?
  7. Are there any features we can ignore due to missing data?
  8. What is the mean salary for non-POIs and POIs?
  9. What features might be useful for training our algorithm?
  10. Are there any features we may need to scale?

Top

print 'Number of People in Dataset: ', len(enron_dict) 
Number of People in Dataset:  146
import pprint

pretty = pprint.PrettyPrinter()
names = sorted(enron_dict.keys())  #sort names of Enron employees in dataset by first letter of last name

print 'Sorted list of Enron employees by last name'
pretty.pprint(names) 
Sorted list of Enron employees by last name
['ALLEN PHILLIP K',
 'BADUM JAMES P',
 'BANNANTINE JAMES M',
 'BAXTER JOHN C',
 'BAY FRANKLIN R',
 'BAZELIDES PHILIP J',
 'BECK SALLY W',
 'BELDEN TIMOTHY N',
 'BELFER ROBERT',
 'BERBERIAN DAVID',
 'BERGSIEKER RICHARD P',
 'BHATNAGAR SANJAY',
 'BIBI PHILIPPE A',
 'BLACHMAN JEREMY M',
 'BLAKE JR. NORMAN P',
 'BOWEN JR RAYMOND M',
 'BROWN MICHAEL',
 'BUCHANAN HAROLD G',
 'BUTTS ROBERT H',
 'BUY RICHARD B',
 'CALGER CHRISTOPHER F',
 'CARTER REBECCA C',
 'CAUSEY RICHARD A',
 'CHAN RONNIE',
 'CHRISTODOULOU DIOMEDES',
 'CLINE KENNETH W',
 'COLWELL WESLEY',
 'CORDES WILLIAM R',
 'COX DAVID',
 'CUMBERLAND MICHAEL S',
 'DEFFNER JOSEPH M',
 'DELAINEY DAVID W',
 'DERRICK JR. JAMES V',
 'DETMERING TIMOTHY J',
 'DIETRICH JANET R',
 'DIMICHELE RICHARD G',
 'DODSON KEITH',
 'DONAHUE JR JEFFREY M',
 'DUNCAN JOHN H',
 'DURAN WILLIAM D',
 'ECHOLS JOHN B',
 'ELLIOTT STEVEN',
 'FALLON JAMES B',
 'FASTOW ANDREW S',
 'FITZGERALD JAY L',
 'FOWLER PEGGY',
 'FOY JOE',
 'FREVERT MARK A',
 'FUGH JOHN L',
 'GAHN ROBERT S',
 'GARLAND C KEVIN',
 'GATHMANN WILLIAM D',
 'GIBBS DANA R',
 'GILLIS JOHN',
 'GLISAN JR BEN F',
 'GOLD JOSEPH',
 'GRAMM WENDY L',
 'GRAY RODNEY',
 'HAEDICKE MARK E',
 'HANNON KEVIN P',
 'HAUG DAVID L',
 'HAYES ROBERT E',
 'HAYSLETT RODERICK J',
 'HERMANN ROBERT J',
 'HICKERSON GARY J',
 'HIRKO JOSEPH',
 'HORTON STANLEY C',
 'HUGHES JAMES A',
 'HUMPHREY GENE E',
 'IZZO LAWRENCE L',
 'JACKSON CHARLENE R',
 'JAEDICKE ROBERT',
 'KAMINSKI WINCENTY J',
 'KEAN STEVEN J',
 'KISHKILL JOSEPH G',
 'KITCHEN LOUISE',
 'KOENIG MARK E',
 'KOPPER MICHAEL J',
 'LAVORATO JOHN J',
 'LAY KENNETH L',
 'LEFF DANIEL P',
 'LEMAISTRE CHARLES',
 'LEWIS RICHARD',
 'LINDHOLM TOD A',
 'LOCKHART EUGENE E',
 'LOWRY CHARLES P',
 'MARTIN AMANDA K',
 'MCCARTY DANNY J',
 'MCCLELLAN GEORGE',
 'MCCONNELL MICHAEL S',
 'MCDONALD REBECCA',
 'MCMAHON JEFFREY',
 'MENDELSOHN JOHN',
 'METTS MARK',
 'MEYER JEROME J',
 'MEYER ROCKFORD G',
 'MORAN MICHAEL P',
 'MORDAUNT KRISTINA M',
 'MULLER MARK S',
 'MURRAY JULIA H',
 'NOLES JAMES L',
 'OLSON CINDY K',
 'OVERDYKE JR JERE C',
 'PAI LOU L',
 'PEREIRA PAULO V. FERRAZ',
 'PICKERING MARK R',
 'PIPER GREGORY F',
 'PIRO JIM',
 'POWERS WILLIAM',
 'PRENTICE JAMES',
 'REDMOND BRIAN L',
 'REYNOLDS LAWRENCE',
 'RICE KENNETH D',
 'RIEKER PAULA H',
 'SAVAGE FRANK',
 'SCRIMSHAW MATTHEW',
 'SHANKMAN JEFFREY A',
 'SHAPIRO RICHARD S',
 'SHARP VICTORIA T',
 'SHELBY REX',
 'SHERRICK JEFFREY B',
 'SHERRIFF JOHN R',
 'SKILLING JEFFREY K',
 'STABLER FRANK',
 'SULLIVAN-SHAKLOVITZ COLLEEN',
 'SUNDE MARTIN',
 'TAYLOR MITCHELL S',
 'THE TRAVEL AGENCY IN THE PARK',
 'THORN TERENCE H',
 'TILNEY ELIZABETH A',
 'TOTAL',
 'UMANOFF ADAM S',
 'URQUHART JOHN A',
 'WAKEHAM JOHN',
 'WALLS JR ROBERT H',
 'WALTERS GARETH W',
 'WASAFF GEORGE',
 'WESTFAHL RICHARD K',
 'WHALEY DAVID A',
 'WHALLEY LAWRENCE G',
 'WHITE JR THOMAS E',
 'WINOKUR JR. HERBERT S',
 'WODRASKA JOHN',
 'WROBEL BRUCE',
 'YEAGER F SCOTT',
 'YEAP SOON']
print 'Example Value Dictionary of Features'
pretty.pprint(enron_dict['ALLEN PHILLIP K']) 
Example Value Dictionary of Features
{'bonus': 4175000,
 'deferral_payments': 2869717,
 'deferred_income': -3081055,
 'director_fees': 'NaN',
 'email_address': 'phillip.allen@enron.com',
 'exercised_stock_options': 1729541,
 'expenses': 13868,
 'from_messages': 2195,
 'from_poi_to_this_person': 47,
 'from_this_person_to_poi': 65,
 'loan_advances': 'NaN',
 'long_term_incentive': 304805,
 'other': 152,
 'poi': False,
 'restricted_stock': 126027,
 'restricted_stock_deferred': -126027,
 'salary': 201955,
 'shared_receipt_with_poi': 1407,
 'to_messages': 2902,
 'total_payments': 4484442,
 'total_stock_value': 1729541}

Before we go any further, let's transform our dictionary into a pandas dataframe to explore further.

import csv

"""Write Enron Dictionary to CSV File for Possible Future Use and Easily Read into Dataframe"""

fieldnames = ['name'] + enron_dict['METTS MARK'].keys()

with open('enron.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for name in enron_dict.keys():
        if name != 'TOTAL':
            n = {'name':name}
            n.update(enron_dict[name])
            writer.writerow(n)      
#read csv into pandas dataframe 
import pandas as pd
enron = pd.read_csv('enron.csv')
#added/combined feature, total bonus and exercised_stock_options
enron['total_be'] = enron['bonus'].fillna(0.0) + enron['exercised_stock_options'].fillna(0.0)
#added feature, fraction of e-mails to and from poi
enron['fraction_to_poi'] = enron['from_this_person_to_poi'].fillna(0.0)/enron['from_messages'].fillna(0.0)
enron['fraction_from_poi'] = enron['from_poi_to_this_person'].fillna(0.0)/enron['to_messages'].fillna(0.0)
#added feature, scaled total compensation
enron['total_millions'] = (enron['total_payments'].fillna(0.0) + enron['total_stock_value'].fillna(0.0))/1000000

Dataset Information

#data information/types

enron.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145 entries, 0 to 144
Data columns (total 26 columns):
name                         145 non-null object
salary                       94 non-null float64
to_messages                  86 non-null float64
deferral_payments            38 non-null float64
total_payments               124 non-null float64
exercised_stock_options      101 non-null float64
bonus                        81 non-null float64
restricted_stock             109 non-null float64
shared_receipt_with_poi      86 non-null float64
restricted_stock_deferred    17 non-null float64
total_stock_value            125 non-null float64
expenses                     94 non-null float64
loan_advances                3 non-null float64
from_messages                86 non-null float64
other                        92 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          145 non-null bool
director_fees                16 non-null float64
deferred_income              48 non-null float64
long_term_incentive          65 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
total_be                     145 non-null float64
fraction_to_poi              86 non-null float64
fraction_from_poi            86 non-null float64
total_millions               145 non-null float64
dtypes: bool(1), float64(23), object(2)
memory usage: 28.5+ KB

Top

Just by looking at our dataset information above, we can quickly point out a few ways to narrow down our feature selection. Some of our features have a lot of missing data, so those may be ones that we can remove; features like "restricted_stock_deferred", "loan_advances", and "director_fees" may be ones we can take out altogether. There are also a few features that seem to give us the same information: "shared_receipt_with_poi", "to_messages", "from_messages", "from_this_person_to_poi", and "from_poi_to_this_person" all describe a person's e-mail behavior and all have the same non-null count, 86. We may be able to narrow those features down to just one or two, or create a new feature from them (see the features added above).

Features that may be most useful, since we're dealing with corporate fraud, are those features that tell us about the money. Let's follow the money! Features that will give us that money trail will be "salary", "total_payments", "exercised_stock_options", "bonus", "restricted_stock", and "total_stock_value".

For now, let's continue to explore our dataset before making our final selection.

Feature Statistics

#number of POI in dataset
print 'There are 18 POI in our Dataset as you can see by our "True" count'
enron['poi'].value_counts()
There are 18 POI in our Dataset as you can see by our "True" count





False    127
True      18
Name: poi, dtype: int64
#set a baseline by extracting non-POIs and printing stats

non_poi = enron[enron.poi.isin([False])]

non_poi_money = non_poi[['salary','bonus','exercised_stock_options','total_stock_value',\
                         'total_payments','total_be','total_millions']].describe()
non_poi_money
|  | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 7.700000e+01 | 6.500000e+01 | 8.900000e+01 | 1.070000e+02 | 1.060000e+02 | 1.270000e+02 | 127.000000 |
| mean | 2.621515e+05 | 9.868249e+05 | 1.947752e+06 | 2.374085e+06 | 1.725091e+06 | 1.870028e+06 | 3.440052 |
| std | 1.392317e+05 | 1.173880e+06 | 2.547068e+06 | 3.535017e+06 | 2.618288e+06 | 2.693997e+06 | 4.839977 |
| min | 4.770000e+02 | 7.000000e+04 | 3.285000e+03 | -4.409300e+04 | 1.480000e+02 | 0.000000e+00 | 0.000000 |
| 25% | 2.061210e+05 | 4.000000e+05 | 4.365150e+05 | 4.246845e+05 | 3.304798e+05 | 2.213790e+05 | 0.507935 |
| 50% | 2.516540e+05 | 7.000000e+05 | 1.030329e+06 | 1.030329e+06 | 1.056092e+06 | 8.862310e+05 | 1.884748 |
| 75% | 2.885890e+05 | 1.000000e+06 | 2.165172e+06 | 2.307584e+06 | 2.006025e+06 | 2.250522e+06 | 4.317325 |
| max | 1.060932e+06 | 8.000000e+06 | 1.536417e+07 | 2.381793e+07 | 1.725253e+07 | 1.636417e+07 | 31.874715 |
non_poi_email_behavior = non_poi[['shared_receipt_with_poi','to_messages',\
                                  'from_messages','fraction_from_poi','fraction_to_poi']].describe()
non_poi_email_behavior
|  | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
| --- | --- | --- | --- | --- | --- |
| count | 72.000000 | 72.000000 | 72.000000 | 72.000000 | 72.000000 |
| mean | 1058.527778 | 2007.111111 | 668.763889 | 0.036107 | 0.152669 |
| std | 1132.503757 | 2693.165955 | 1978.997801 | 0.041929 | 0.206057 |
| min | 2.000000 | 57.000000 | 12.000000 | 0.000000 | 0.000000 |
| 25% | 191.500000 | 513.750000 | 20.500000 | 0.007760 | 0.000000 |
| 50% | 594.000000 | 944.000000 | 41.000000 | 0.022741 | 0.053776 |
| 75% | 1635.500000 | 2590.750000 | 216.500000 | 0.050705 | 0.225000 |
| max | 4527.000000 | 15149.000000 | 14368.000000 | 0.217341 | 1.000000 |

I thought it was interesting to see someone with 100% of their e-mails going to persons of interest. Below, I printed out some features associated with this person. After a little research, I found that Gene Humphrey was one of the first employees of Enron. So, it makes sense that all of his e-mails were to persons of interest who had been with the company from the beginning. Those were the only people he worked with.

enron[(enron['fraction_to_poi']>0.9)][['name','salary','total_be',\
                                       'restricted_stock','total_stock_value','to_messages','poi']]
|  | name | salary | total_be | restricted_stock | total_stock_value | to_messages | poi |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | HUMPHREY GENE E | 130724.0 | 2282768.0 | NaN | 2282768.0 | 128.0 | False |
#POI stats

poi_info = enron[enron.poi.isin([True])]

poi_money = poi_info[['salary','bonus','exercised_stock_options','total_stock_value',\
                      'total_payments','total_be','total_millions']].describe()
poi_money
|  | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1.700000e+01 | 1.600000e+01 | 1.200000e+01 | 1.800000e+01 | 1.800000e+01 | 1.800000e+01 | 18.000000 |
| mean | 3.834449e+05 | 2.075000e+06 | 1.046379e+07 | 9.165671e+06 | 7.913590e+06 | 8.820307e+06 | 17.079261 |
| std | 2.783597e+05 | 2.047437e+06 | 1.238259e+07 | 1.384117e+07 | 2.396549e+07 | 1.222914e+07 | 35.289434 |
| min | 1.584030e+05 | 2.000000e+05 | 3.847280e+05 | 1.260270e+05 | 9.109300e+04 | 8.000000e+05 | 1.765324 |
| 25% | 2.401890e+05 | 7.750000e+05 | 1.456581e+06 | 1.016450e+06 | 1.142396e+06 | 1.262500e+06 | 3.140359 |
| 50% | 2.786010e+05 | 1.275000e+06 | 3.914557e+06 | 2.206836e+06 | 1.754028e+06 | 2.079817e+06 | 4.434161 |
| 75% | 4.151890e+05 | 2.062500e+06 | 1.938604e+07 | 1.051133e+07 | 2.665345e+06 | 7.990914e+06 | 11.274354 |
| max | 1.111258e+06 | 7.000000e+06 | 3.434838e+07 | 4.911008e+07 | 1.035598e+08 | 4.134838e+07 | 152.669871 |
poi_email_behavior = poi_info[['shared_receipt_with_poi','to_messages', \
                               'from_messages','fraction_from_poi','fraction_to_poi']].describe()
poi_email_behavior
|  | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
| --- | --- | --- | --- | --- | --- |
| count | 14.000000 | 14.000000 | 14.000000 | 14.000000 | 14.000000 |
| mean | 1783.000000 | 2417.142857 | 300.357143 | 0.047507 | 0.345470 |
| std | 1264.996625 | 1961.858101 | 805.844574 | 0.032085 | 0.156894 |
| min | 91.000000 | 225.000000 | 16.000000 | 0.021339 | 0.173611 |
| 25% | 1059.250000 | 1115.750000 | 33.000000 | 0.026900 | 0.228580 |
| 50% | 1589.000000 | 1875.000000 | 44.500000 | 0.030639 | 0.276389 |
| 75% | 2165.250000 | 2969.250000 | 101.500000 | 0.059118 | 0.427083 |
| max | 5521.000000 | 7991.000000 | 3069.000000 | 0.136519 | 0.656250 |
#difference in non-poi compensation and poi compensation

difference_in_money = poi_money - non_poi_money
difference_in_money
|  | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | -60.000000 | -4.900000e+01 | -7.700000e+01 | -8.900000e+01 | -8.800000e+01 | -1.090000e+02 | -109.000000 |
| mean | 121293.375859 | 1.088175e+06 | 8.516041e+06 | 6.791586e+06 | 6.188499e+06 | 6.950279e+06 | 13.639208 |
| std | 139128.027285 | 8.735576e+05 | 9.835520e+06 | 1.030615e+07 | 2.134720e+07 | 9.535142e+06 | 30.449457 |
| min | 157926.000000 | 1.300000e+05 | 3.814430e+05 | 1.701200e+05 | 9.094500e+04 | 8.000000e+05 | 1.765324 |
| 25% | 34068.000000 | 3.750000e+05 | 1.020066e+06 | 5.917658e+05 | 8.119162e+05 | 1.041121e+06 | 2.632424 |
| 50% | 26947.000000 | 5.750000e+05 | 2.884228e+06 | 1.176506e+06 | 6.979350e+05 | 1.193586e+06 | 2.549413 |
| 75% | 126600.000000 | 1.062500e+06 | 1.722087e+07 | 8.203751e+06 | 6.593195e+05 | 5.740393e+06 | 6.957028 |
| max | 50326.000000 | -1.000000e+06 | 1.898422e+07 | 2.529215e+07 | 8.630726e+07 | 2.498422e+07 | 120.795156 |

We can see from the table above that money matters! The mean difference is especially telling. And, although upper management tends to have greater compensation, you can't help but be shocked by the tremendous gap seen here.

#difference in non-poi email behavior and poi behavior

difference_in_email = poi_email_behavior - non_poi_email_behavior
difference_in_email
|  | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
| --- | --- | --- | --- | --- | --- |
| count | -58.000000 | -58.000000 | -58.000000 | -58.000000 | -58.000000 |
| mean | 724.472222 | 410.031746 | -368.406746 | 0.011399 | 0.192800 |
| std | 132.492868 | -731.307854 | -1173.153226 | -0.009844 | -0.049163 |
| min | 89.000000 | 168.000000 | 4.000000 | 0.021339 | 0.173611 |
| 25% | 867.750000 | 602.000000 | 12.500000 | 0.019140 | 0.228580 |
| 50% | 995.000000 | 931.000000 | 3.500000 | 0.007898 | 0.222613 |
| 75% | 529.750000 | 378.500000 | -115.000000 | 0.008413 | 0.202083 |
| max | 994.000000 | -7158.000000 | -11299.000000 | -0.080822 | -0.343750 |

My original email behavior table was a bit less telling than the money table, but I was able to scale features to reflect e-mail behavior more accurately. The updated tables can be seen above with the fraction of emails sent to and from POIs.

#poi name, salary, bonus, stock options, total bonus and options, from messages, and fraction to poi, ordered by total descending

poi_info[['name','salary','bonus','exercised_stock_options','total_be','total_millions',\
          'from_messages','fraction_to_poi']].sort_values('total_millions',ascending=False)
|  | name | salary | bonus | exercised_stock_options | total_be | total_millions | from_messages | fraction_to_poi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 65 | LAY KENNETH L | 1072321.0 | 7000000.0 | 34348384.0 | 41348384.0 | 152.669871 | 36.0 | 0.444444 |
| 95 | SKILLING JEFFREY K | 1111258.0 | 5600000.0 | 19250000.0 | 24850000.0 | 34.776388 | 108.0 | 0.277778 |
| 125 | HIRKO JOSEPH | NaN | NaN | 30766064.0 | 30766064.0 | 30.857157 | NaN | NaN |
| 88 | RICE KENNETH D | 420636.0 | 1750000.0 | 19794175.0 | 21544175.0 | 23.047589 | 18.0 | 0.222222 |
| 124 | YEAGER F SCOTT | 158403.0 | NaN | 8308552.0 | 8308552.0 | 12.245058 | NaN | NaN |
| 60 | DELAINEY DAVID W | 365163.0 | 3000000.0 | 2291113.0 | 5291113.0 | 8.362240 | 3069.0 | 0.198436 |
| 4 | HANNON KEVIN P | 243293.0 | 1500000.0 | 5538001.0 | 7038001.0 | 6.679747 | 32.0 | 0.656250 |
| 82 | BELDEN TIMOTHY N | 213999.0 | 5249999.0 | 953136.0 | 6203135.0 | 6.612335 | 484.0 | 0.223140 |
| 53 | SHELBY REX | 211844.0 | 200000.0 | 1624396.0 | 1824396.0 | 4.497501 | 39.0 | 0.358974 |
| 141 | CAUSEY RICHARD A | 415189.0 | 1000000.0 | NaN | 1000000.0 | 4.370821 | 49.0 | 0.244898 |
| 85 | FASTOW ANDREW S | 440698.0 | 1300000.0 | NaN | 1300000.0 | 4.218495 | NaN | NaN |
| 41 | KOPPER MICHAEL J | 224305.0 | 800000.0 | NaN | 800000.0 | 3.637644 | NaN | NaN |
| 134 | KOENIG MARK E | 309946.0 | 700000.0 | 671737.0 | 1371737.0 | 3.507476 | 61.0 | 0.245902 |
| 30 | RIEKER PAULA H | 249201.0 | 700000.0 | 1635238.0 | 2335238.0 | 3.017987 | 82.0 | 0.585366 |
| 76 | BOWEN JR RAYMOND M | 278601.0 | 1350000.0 | NaN | 1350000.0 | 2.921644 | 27.0 | 0.555556 |
| 16 | COLWELL WESLEY | 288542.0 | 1200000.0 | NaN | 1200000.0 | 2.188586 | 40.0 | 0.275000 |
| 144 | GLISAN JR BEN F | 274975.0 | 600000.0 | 384728.0 | 984728.0 | 2.050830 | 16.0 | 0.375000 |
| 87 | CALGER CHRISTOPHER F | 240189.0 | 1250000.0 | NaN | 1250000.0 | 1.765324 | 144.0 | 0.173611 |

Although we don't have the salary and bonus data for Joseph Hirko, his exercised stock options are second only to Kenneth Lay's. Since "exercised_stock_options" seems to be a key indicator of a POI when salary/bonus is unavailable, that is definitely a feature we'll want to include in our final feature selection. These features may be even more robust when taking the total of bonus and options, so let's add that feature, total_be, to our dataset; maybe it will come in handy. I went back and added this feature to the top of my code in order to include it in the info and analysis.
Also, it was interesting to see that POIs don't have as many "from_messages" as non-POIs. David Delainey is the only one with well over 500 sent messages. This may be a telling behavior of POIs, as they may prefer to talk face-to-face with others.
Top

Explore Features

In this section, we'll visualize some of our features in order to explore them further.

import matplotlib.pyplot as plt
%matplotlib inline
%pylab inline
import seaborn as sns
Populating the interactive namespace from numpy and matplotlib

Salary

average_salary = enron.groupby('poi').mean()['salary']
average_salary
poi
False    262151.506494
True     383444.882353
Name: salary, dtype: float64
sns.boxplot(x='poi',y='salary',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xd353a20>

(figure: boxplot of salary by POI status)

Bonus

average_bonus = enron.groupby('poi').mean()['bonus']
average_bonus
poi
False    9.868249e+05
True     2.075000e+06
Name: bonus, dtype: float64
sns.boxplot(x='poi',y='bonus',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe0ac860>

(figure: boxplot of bonus by POI status)

Wow! An 8 million dollar bonus seems a bit much, but you can see the difference between John Lavorato and Ken Lay in other ways. Below, you can see that the difference across all of the other financial features is significant. And, bonuses among POIs are still higher on average than among non-POIs. So, despite our non-POI outlier, this feature may still be useful in training our algorithm.

enron[(enron['bonus']>6000000)][['name','salary','bonus','exercised_stock_options','restricted_stock','total_stock_value','poi']]
|  | name | salary | bonus | exercised_stock_options | restricted_stock | total_stock_value | poi |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 43 | LAVORATO JOHN J | 339288.0 | 8000000.0 | 4158995.0 | 1008149.0 | 5167144.0 | False |
| 65 | LAY KENNETH L | 1072321.0 | 7000000.0 | 34348384.0 | 14761694.0 | 49110078.0 | True |

Total Payments

average_total_payments = enron.groupby('poi').mean()['total_payments']
average_total_payments
poi
False    1.725091e+06
True     7.913590e+06
Name: total_payments, dtype: float64
sns.boxplot(x='poi',y='total_payments',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe2289e8>

(figure: boxplot of total_payments by POI status)

Wow! That last boxplot has an outlier that is obviously pulling the mean waaaaaaaay up. Who is that?

enron[(enron['total_payments']>40000000)][['name','total_payments','poi']]
|  | name | total_payments | poi |
| --- | --- | --- | --- |
| 65 | LAY KENNETH L | 103559793.0 | True |
#take Ken Lay out of the poi boxplot

kl_not_in = enron[(enron['total_payments']<40000000)]

sns.boxplot(x='poi',y='total_payments',data=kl_not_in)
<matplotlib.axes._subplots.AxesSubplot at 0xe3e44e0>

(figure: boxplot of total_payments by POI status, Ken Lay removed)

Well, at least now we can see the boxplots more clearly. There's not much of a difference between POI and non-POI here when we take Ken Lay out, so total_payments probably won't be a feature we'll use.

Exercised Stock Options

average_optionsvalue = enron.groupby('poi').mean()['exercised_stock_options']
average_optionsvalue
poi
False    1.947752e+06
True     1.046379e+07
Name: exercised_stock_options, dtype: float64
sns.boxplot(x='poi',y='exercised_stock_options',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe353080>

(figure: boxplot of exercised_stock_options by POI status)

Exercised stock options definitely look higher among POIs, so this is a feature to include in our list of features for the algorithm.

Total Bonus and Exercised Stock Options

average_total_sbe = enron.groupby('poi').mean()['total_be']
average_total_sbe
poi
False    1.870028e+06
True     8.820307e+06
Name: total_be, dtype: float64
sns.boxplot(x='poi',y='total_be',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe6089b0>

(figure: boxplot of total_be by POI status)

Total Bonus and Exercised Stock Options might be useful, but it might also just add to the noise. So, maybe we won't use this one.

Total Stock Value

average_stockvalue = enron.groupby('poi').mean()['total_stock_value']
average_stockvalue
poi
False    2.374085e+06
True     9.165671e+06
Name: total_stock_value, dtype: float64
sns.boxplot(x='poi',y='total_stock_value',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xeba4358>

(figure: boxplot of total_stock_value by POI status)

Total stock value for POIs on average is much higher than non-POIs. This feature is another good option for our POI identifier.

Total Payments and Stock Value in Millions

average_total_comp = enron.groupby('poi').mean()['total_millions']
average_total_comp
poi
False     3.440052
True     17.079261
Name: total_millions, dtype: float64
sns.boxplot(x='poi',y='total_millions',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0xef21668>

(figure: boxplot of total_millions by POI status)

Let's try that one again without Ken Lay...

sns.boxplot(x='poi',y='total_millions',data= kl_not_in)
<matplotlib.axes._subplots.AxesSubplot at 0xeec3390>

(figure: boxplot of total_millions by POI status, Ken Lay removed)

Hmmmm, maybe we didn't need to add this feature. We can look closer by using lmplot and pairplot later on in our analysis.

Shared Receipt with POI

average_shared_receipt = enron.groupby('poi').mean()['shared_receipt_with_poi']
average_shared_receipt
poi
False    1058.527778
True     1783.000000
Name: shared_receipt_with_poi, dtype: float64
sns.boxplot(x='poi',y='shared_receipt_with_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10734b00>

(figure: boxplot of shared_receipt_with_poi by POI status)

To Messages

average_to = enron.groupby('poi').mean()['to_messages']
average_to
poi
False    2007.111111
True     2417.142857
Name: to_messages, dtype: float64
sns.boxplot(x='poi',y='to_messages',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10b46e80>

(figure: boxplot of to_messages by POI status)

From Messages

average_from = enron.groupby('poi').mean()['from_messages']
average_from
poi
False    668.763889
True     300.357143
Name: from_messages, dtype: float64
sns.boxplot(x='poi',y='from_messages',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x106998d0>

(figure: boxplot of from_messages by POI status)

Fraction to POI

average_fraction_to = enron.groupby('poi').mean()['fraction_to_poi']
average_fraction_to
poi
False    0.152669
True     0.345470
Name: fraction_to_poi, dtype: float64
sns.boxplot(x='poi',y='fraction_to_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10e21c88>

(figure: boxplot of fraction_to_poi by POI status)

Fraction_to_poi looks like a good feature to add to our list, since most of the poi distribution is in the upper range of the non-poi distribution.

Fraction from POI

average_fraction_from = enron.groupby('poi').mean()['fraction_from_poi']
average_fraction_from
poi
False    0.036107
True     0.047507
Name: fraction_from_poi, dtype: float64
sns.boxplot(x='poi',y='fraction_from_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x112cf2e8>

(figure: boxplot of fraction_from_poi by POI status)

Top

Pairplot Analysis

Now, let's take a look at some of our features in the following pairplot. Maybe it will help us make our final decisions for our features list.

import seaborn as sns; sns.set(style="ticks", color_codes=True)

g = sns.pairplot(enron, vars=['bonus','exercised_stock_options','from_messages','fraction_to_poi'],
                 dropna=True, diag_kind='kde', hue='poi', markers=['x','o'])

(figure: pairplot of bonus, exercised_stock_options, from_messages, and fraction_to_poi, colored by POI status)

Outliers

When looking at the stats for poi and non-poi for the first time, I noticed that the non-poi stats were much higher than the poi stats. That's when I remembered I didn't account for the "TOTAL" key. So, I went back and skipped over that key when writing to my csv. I figured I'd just pop it out of my dictionary later if I need to. After doing that, my stats were as expected. Now, let's see what other outliers we can find. In the pairplot above, there were two POIs that really stood out. Let's take a closer look in the following lmplot.

sns.lmplot(x='bonus', y= 'salary', hue='poi', data=enron, palette='Set1',size=10,markers=['x','o'])
plt.title('Salary/Bonus for POI and non-POI', fontsize=18)
plt.xlabel('Bonus', fontsize=16)
plt.ylabel('Salary', fontsize=16)
<matplotlib.text.Text at 0x14e43e10>

(figure: Salary/Bonus for POI and non-POI lmplot)

#Who are the two outliers in blue with the high salary AND high bonus?  Ken Lay and Jeff Skilling of course!

enron[(enron['salary']>1000000)][['name','salary','bonus','poi']]
|  | name | salary | bonus | poi |
| --- | --- | --- | --- | --- |
| 65 | LAY KENNETH L | 1072321.0 | 7000000.0 | True |
| 95 | SKILLING JEFFREY K | 1111258.0 | 5600000.0 | True |
| 128 | FREVERT MARK A | 1060932.0 | 2000000.0 | False |

These are two of our persons of interest, so we definitely don't want to take them out of our dataset.

According to Executive Excess 2002

Top executives at 23 companies under investigation for their accounting practices earned far more during the past three years than the average CEO at large companies. CEOs at the firms under investigation earned an average of 62.2 million dollars during 1999-2001, 70 percent more than the average of 36.5 million dollars for all leading executives for that period.

We may also be able to find a few datapoints that are just causing noise by checking for lots of missing values in rows.

#check for more than 20 missing values for each datapoint

i = 0

for row in enron.isnull().sum(axis=1):
    if row > 20:
        print enron.iloc[i]
    i+=1
name                         LOCKHART EUGENE E
salary                                     NaN
to_messages                                NaN
deferral_payments                          NaN
total_payments                             NaN
exercised_stock_options                    NaN
bonus                                      NaN
restricted_stock                           NaN
shared_receipt_with_poi                    NaN
restricted_stock_deferred                  NaN
total_stock_value                          NaN
expenses                                   NaN
loan_advances                              NaN
from_messages                              NaN
other                                      NaN
from_this_person_to_poi                    NaN
poi                                      False
director_fees                              NaN
deferred_income                            NaN
long_term_incentive                        NaN
email_address                              NaN
from_poi_to_this_person                    NaN
total_be                                     0
fraction_to_poi                            NaN
fraction_from_poi                          NaN
total_millions                               0
Name: 90, dtype: object
#check for missing values in features

enron.isnull().sum()
name                           0
salary                        51
to_messages                   59
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       59
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 59
other                         53
from_this_person_to_poi       59
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 34
from_poi_to_this_person       59
total_be                       0
fraction_to_poi               59
fraction_from_poi             59
total_millions                 0
dtype: int64

Loan advances has 142 missing values! That's definitely a feature we can remove before we run our tests.

Money and Messages Regression Model

sns.lmplot(x='bonus', y='fraction_to_poi', hue='poi', data=enron, palette='Set1',size=10,markers=['x','o'])
plt.title('Money & Messages', fontsize=18)
<matplotlib.text.Text at 0x1517fa90>

(figure: Money & Messages lmplot of bonus vs. fraction_to_poi)

Top

Transform, Select, and Scale

Now let's transform, select, and scale our features.

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
features_full_list = enron.columns.tolist()
features_full_list.pop(0) #take out 'name'
features_full_list.pop(19) #take out 'email_address'
features_full_list.pop(11) #take out 'loan_advances' because of missing values
features_full_list.pop(15) #take out 'director_fees' because of missing values
features_full_list.pop(14) #take out 'poi' for now and add to beginning of list
features_list = ['poi']
for n in features_full_list:
    features_list.append(n)
features_list
['poi',
 'salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person',
 'total_be',
 'fraction_to_poi',
 'fraction_from_poi',
 'total_millions']
### Remove outliers that corrupt the data
enron_dict.pop('TOTAL', 0)
{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}
#remove datapoints that create noise

enron_dict.pop('LOCKHART EUGENE E',0)
{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}
#take out all 'loan_advances' because of missing values

for name in enron_dict:
    enron_dict[name].pop('loan_advances',0)
### Create new feature(s)

#add fraction of emails from and to poi
#idea for this added feature taken from course materials

def computeFraction( poi_messages, all_messages ):
    """ given a number messages to/from POI (numerator) 
        and number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
   """
    fraction = 0.
    if poi_messages != 'NaN' and all_messages != 'NaN':
        fraction = float(poi_messages)/all_messages


    return fraction
for name in enron_dict:

    data_point = enron_dict[name]

    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction( from_poi_to_this_person, to_messages )
    
    enron_dict[name]["fraction_from_poi"] = fraction_from_poi
  
    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )

    enron_dict[name]["fraction_to_poi"] = fraction_to_poi
#add total_be to dictionary

for name in enron_dict:
    data_point = enron_dict[name]
    
    bonus = data_point['bonus']
    if bonus == 'NaN':
        bonus = 0.0
    options = data_point['exercised_stock_options']
    if options == 'NaN':
        options = 0.0
    total = bonus+options

    enron_dict[name]['total_be'] = total
    
    
#add total compensation in millions to dataset

for name in enron_dict:
    data_point = enron_dict[name]
    
    total_payments = data_point['total_payments']
    if total_payments == 'NaN':
        total_payments = 0.0
    total_stock = data_point['total_stock_value']
    if total_stock == 'NaN':
        total_stock = 0.0
    total = (total_payments + total_stock)/1000000

    enron_dict[name]['total_millions'] = total

SELECT FEATURES

I've selected a few lists that may be useful in training our classifiers. Each of the features selected may be able to give us some insight into the compensation and behavior of a POI. The total compensation (total_millions) shows us that, on average, POIs are compensated more highly than non-POIs. The same holds true for individual payments, like salary and bonus. And, when it comes to stock behavior, POIs are more active in exercising their stock options (exercised_stock_options). Other features, like from_messages, show a kind of pattern in e-mail behavior: POIs do not send many messages, but the ones they do send are often to other POIs (fraction_to_poi). These are all features we'll test before making our final feature selection.

We will start out with our full list as a baseline and test it against our selected lists' metrics in order to find the final feature list for our POI identifier. Our lists were chosen based on the stats and plots generated in our analysis: the features that showed a greater overall difference between POI and non-POI stats were chosen.

print features_list #list of features available for testing
['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_be', 'fraction_to_poi', 'fraction_from_poi', 'total_millions']
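Before settling on hand-picked lists, a univariate selector such as SelectKBest can act as a sanity check on which features best separate POIs from non-POIs, and min-max scaling keeps the large dollar-valued features from swamping the small e-mail fractions in a distance-based classifier like KNeighbors. The following is only a rough sketch of that idea; it was not part of the original selection process, and the names check_data, check_labels, and check_features are introduced here purely for illustration.

#Hedged sketch (not run in the original analysis): score every candidate feature
#with SelectKBest as a cross-check on the hand-picked lists below
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

check_data = featureFormat(enron_dict, features_list, sort_keys=True)
check_labels, check_features = targetFeatureSplit(check_data)

#min-max scaling so dollar-valued features don't dominate the e-mail fractions
scaled = MinMaxScaler().fit_transform(check_features)

selector = SelectKBest(f_classif, k='all').fit(scaled, check_labels)
for feature, score in sorted(zip(features_list[1:], selector.scores_),
                             key=lambda pair: -pair[1]):
    print feature, round(score, 2)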
### Select what features to use
first_list = ['poi','total_millions','fraction_to_poi','from_messages']
second_list = ['poi','total_be','fraction_to_poi','from_messages']
third_list = ['poi','salary','bonus','fraction_to_poi','from_messages']
fourth_list = ['poi','bonus','exercised_stock_options','fraction_to_poi']
features_final_list = fourth_list
print "Final List", features_final_list
Final List ['poi', 'bonus', 'exercised_stock_options', 'fraction_to_poi']

Algorithm Selection

Evaluation Metrics

#Evaluation metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

import scikitplot as skplt
import numpy as np  #np is used in the test helper below (pylab also imports it in the notebook)

from sklearn.cross_validation import StratifiedShuffleSplit
#Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# %load 'my_test.py'
def test_list(classifier, feature_list, enron_dict):
    
    my_dataset = enron_dict
    
    data = featureFormat(my_dataset, feature_list, sort_keys = True) 
    labels, features = targetFeatureSplit(data) 
    
    X = np.array(features)
    y = np.array(labels)
    sss = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.3, random_state=42)      
    for train_index, test_index in sss:
        features_train, features_test = X[train_index], X[test_index]
        labels_train, labels_test = y[train_index], y[test_index]
        
    clf = classifier
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)
    
    if isinstance(classifier, DecisionTreeClassifier):  #isinstance so feature importances are reported for tree classifiers
        return {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
                'Recall': recall_score(labels_test,pred), 'Feature Importance': clf.feature_importances_}
    
    return {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
            'Recall': recall_score(labels_test,pred)}
    
    
   

Performance: Accuracy, Precision, and Recall

Below, we will test each list using three different algorithms: Naive Bayes, Decision Tree, and KNearest Neighbors.

  • Our accuracy score shows the ratio of correctly predicted observations to the total observations.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision is the ratio of correctly predicted positive observations to all predicted positive observations.
    Precision = TP / (TP + FP)
  • And, recall is the ratio of correctly predicted positive observations to all actual positive observations.
    Recall = TP / (TP + FN)

When trying to identify POIs, we want to see as few false positives as possible; we don't want to falsely identify anyone as a POI. So, I'd say precision is a bit more important here. Let's see how each list does using each classifier.
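To make those formulas concrete, here is a minimal sketch (not part of the original analysis) that pulls TP, TN, FP, and FN out of a confusion matrix and computes the same three scores by hand; summarize_scores is a hypothetical helper name.

#Hedged sketch: compute accuracy, precision, and recall directly from the
#confusion matrix to mirror the formulas above (summarize_scores is hypothetical)
from sklearn.metrics import confusion_matrix

def summarize_scores(labels_test, pred):
    #for binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()
    accuracy = float(tp + tn) / (tp + tn + fp + fn)
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall}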

#full list
print features_list
print 'GaussianNB: ', test_list(GaussianNB(),features_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),features_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),features_list,enron_dict)
['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_be', 'fraction_to_poi', 'fraction_from_poi', 'total_millions']
GaussianNB:  {'Recall': 0.83333333333333337, 'Precision': 0.26315789473684209, 'Accuracy': 0.65909090909090906}
DecisionTree:  {'Recall': 0.16666666666666666, 'Precision': 0.10000000000000001, 'Accuracy': 0.68181818181818177}
KNeighbors:  {'Recall': 0.33333333333333331, 'Precision': 0.66666666666666663, 'Accuracy': 0.88636363636363635}
#first list
print first_list
print 'GaussianNB: ', test_list(GaussianNB(),first_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),first_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),first_list,enron_dict)
['poi', 'total_millions', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.5, 'Precision': 0.59999999999999998, 'Accuracy': 0.87804878048780488}
DecisionTree:  {'Recall': 0.0, 'Precision': 0.0, 'Accuracy': 0.75609756097560976}
KNeighbors:  {'Recall': 0.0, 'Precision': 0.0, 'Accuracy': 0.85365853658536583}
#second list
print second_list
print 'GaussianNB: ', test_list(GaussianNB(),second_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),second_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),second_list,enron_dict)
['poi', 'total_be', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.33333333333333331, 'Precision': 0.33333333333333331, 'Accuracy': 0.79487179487179482}
DecisionTree:  {'Recall': 0.33333333333333331, 'Precision': 0.33333333333333331, 'Accuracy': 0.79487179487179482}
KNeighbors:  {'Recall': 0.33333333333333331, 'Precision': 0.40000000000000002, 'Accuracy': 0.82051282051282048}
#third list
print third_list
print 'GaussianNB: ', test_list(GaussianNB(),third_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),third_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),third_list,enron_dict)
['poi', 'salary', 'bonus', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.40000000000000002, 'Precision': 0.2857142857142857, 'Accuracy': 0.76470588235294112}
DecisionTree:  {'Recall': 0.80000000000000004, 'Precision': 0.40000000000000002, 'Accuracy': 0.79411764705882348}
KNeighbors:  {'Recall': 0.40000000000000002, 'Precision': 0.66666666666666663, 'Accuracy': 0.88235294117647056}
#fourth_list
print fourth_list
print 'GaussianNB: ', test_list(GaussianNB(),fourth_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),fourth_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),fourth_list,enron_dict)
['poi', 'bonus', 'exercised_stock_options', 'fraction_to_poi']
GaussianNB:  {'Recall': 0.20000000000000001, 'Precision': 0.14285714285714285, 'Accuracy': 0.73684210526315785}
DecisionTree:  {'Recall': 0.40000000000000002, 'Precision': 0.22222222222222221, 'Accuracy': 0.73684210526315785}
KNeighbors:  {'Recall': 0.59999999999999998, 'Precision': 0.75, 'Accuracy': 0.92105263157894735}

With an accuracy score of 92%, a precision score of 75%, and a recall score of 60% ...

Our Final List includes 'poi', 'bonus', 'exercised stock options', and 'fraction to poi'
Our Final Classifier will be KNeighbors

Validation

We've already implemented our validation process, but here we will discuss its importance. Without validating our classifier on held-out testing data, we have no way of measuring its accuracy and reliability; training and testing the classifier on the same data will only yield overfit, overly optimistic results. This is why validation is important. By using StratifiedShuffleSplit to split our data into training and testing sets, we make sure that the POI/non-POI class ratio is preserved in each split and that the datapoints in each split are randomly selected. Because our dataset is so small, setting the number of iterations to 1000 gives us more reliable results in the end, as we will have trained and tested on almost all of our datapoints. The only downside is the run time.
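One caveat: the loop below overwrites the train/test arrays on each iteration, so the classifier is ultimately fit and scored on only the final split. A minimal sketch (not part of the original notebook) that averages the scores over every fold, closer to how tester.py evaluates, might look like this; average_over_folds is a hypothetical helper.

#Hedged sketch: average precision/recall over every StratifiedShuffleSplit fold
#instead of evaluating on just the final split (average_over_folds is hypothetical)
def average_over_folds(classifier, features, labels, n_iter=100):
    X = np.array(features)
    y = np.array(labels)
    folds = StratifiedShuffleSplit(labels, n_iter=n_iter, test_size=0.3,
                                   random_state=42)
    precisions, recalls = [], []
    for train_index, test_index in folds:
        classifier.fit(X[train_index], y[train_index])
        pred = classifier.predict(X[test_index])
        precisions.append(precision_score(y[test_index], pred))
        recalls.append(recall_score(y[test_index], pred))
    return np.mean(precisions), np.mean(recalls)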

### Store to my_dataset for easy export below.
my_dataset = enron_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_final_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
#Validation using StratifiedShuffleSplit in order to evenly distribute the classes between training and test data
X = np.array(features)
y = np.array(labels)
sss = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.3, random_state=42)      
for train_index, test_index in sss:
    features_train, features_test = X[train_index], X[test_index]
    labels_train, labels_test = y[train_index], y[test_index]

#check for accuracy
    
clf = KNeighborsClassifier()
clf.fit(features_train, labels_train)

print clf.score(features_test, labels_test)
0.921052631579

Top

Tuning

Even though we've all but settled on KNeighbors, let's see if tuning the parameters of our Decision Tree Classifier would make a difference. Tuning parameters can significantly change our performance metrics, since parameters control how strongly the model overfits or underfits the training data.

DecisionTreeClassifier().get_params()
{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_split': 1e-07,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': None,
 'splitter': 'best'}
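Those are the knobs available to us. As a hedged sketch (not run in the original analysis), a small GridSearchCV over a few of them could automate the manual search performed below; the parameter grid here is just an illustrative guess.

#Hedged sketch (not run in the original analysis): grid-search a few Decision Tree
#parameters instead of adjusting min_samples_split by hand as done below
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

tree_params = {'min_samples_split': [2, 3, 5, 9, 15],
               'max_depth': [None, 2, 4, 6],
               'criterion': ['gini', 'entropy']}

tree_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           tree_params, scoring='f1', cv=5)
tree_search.fit(features_train, labels_train)
print tree_search.best_params_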
#Decision Tree
from sklearn.tree import DecisionTreeClassifier

#set min_samples_split to 3 and increase until no longer helpful
clf = DecisionTreeClassifier(min_samples_split=9)
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)
"""  Can we do better?  [0.71052631578947367, 0.20000000000000001, 0.40000000000000002]"""

#print performance metrics
print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.40000000000000002, 'Precision': 0.25, 'Accuracy': 0.76315789473684215}
#plot confusion matrix
skplt.metrics.plot_confusion_matrix(labels_test, pred, normalize=True)
<matplotlib.axes._subplots.AxesSubplot at 0x177c7828>

(figure: normalized confusion matrix for the tuned Decision Tree)

How about our KNeighbors Classifier? What are the parameters? If we changed any of them, would it make a difference?

KNeighborsClassifier().get_params()
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': 1,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
"""Can we do better without overfitting?  [0.92105263157894735, 0.75, 0.59999999999999998]"""

#set n_neighbors to 2 and increase until metrics show overfitting
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.59999999999999998, 'Precision': 1.0, 'Accuracy': 0.94736842105263153}

^OVERFIT!

"""Can we still do better?  [0.92105263157894735, 0.75, 0.59999999999999998]"""

#set n_neighbors to 2,3, and 4

def test_param(n):
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)

    return [accuracy_score(labels_test,pred), precision_score(labels_test,pred), recall_score(labels_test,pred)]
print "2: ", test_param(2)
print "3: ", test_param(3)
print "4: ", test_param(4)
2:  [0.92105263157894735, 0.75, 0.59999999999999998]
3:  [0.89473684210526316, 0.59999999999999998, 0.59999999999999998]
4:  [0.92105263157894735, 0.75, 0.59999999999999998]

GridSearchCV

When using GridSearchCV to find the best parameters for KNeighbors, our estimator gives us the same results.

from sklearn.model_selection import GridSearchCV

k = np.arange(10)+1
leaf = np.arange(30)+1
params = {'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute'),'leaf_size': leaf,'n_neighbors': k}

clf_params = GridSearchCV(KNeighborsClassifier(), params, cv=5)
clf_params.fit(features_train,labels_train)
GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), 'leaf_size': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), 'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
clf_params.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Let's see what kind of results we get using the parameters "suggested" by GridSearchCV.

clf = KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
                           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                           weights='uniform')
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.59999999999999998, 'Precision': 0.75, 'Accuracy': 0.92105263157894735}

These are the same results as our default settings, so that's what we'll use.

Top

FINAL FEATURES and ALGORITHM SELECTION

Our final features, based on our feature analysis and testing, will be:

  • Bonus
  • Exercised Stock Options
  • Fraction to POI

The KNeighbors Classifier, even without any parameter tuning, had a higher accuracy score than the Decision Tree Classifier with its min_samples_split tuned to 9. So, we're going to use KNeighbors in our POI identifier. K-Nearest Neighbors will help us zero in on pockets of POIs/non-POIs within our testing data. The features I selected work well with this particular classifier because 'bonus' and 'exercised stock options' are good for training the algorithm to pick up on POI compensation trends, and 'fraction_to_poi' will help it pick up on POI e-mail behavior. I narrowed it down to three features so as not to create noise. When running our longer lists of features, we saw precision and recall drop sharply (to zero for some classifiers), so cutting down to three features was the strategy moving forward. As we had already gotten rid of features with lots of missing data, it was easier to narrow them down and test each list against our chosen classifiers.

clf = KNeighborsClassifier()
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print accuracy_score(labels_test,pred)
print precision_score(labels_test,pred)
print recall_score(labels_test,pred)
0.921052631579
0.75
0.6
#plot confusion matrix
skplt.metrics.plot_confusion_matrix(labels_test, pred, normalize=True)
<matplotlib.axes._subplots.AxesSubplot at 0x12c6bcf8>

(figure: normalized confusion matrix for the final KNeighbors classifier)

### Task 6: Dump your classifier, dataset, and features_list 
features_list = features_final_list
dump_classifier_and_data(clf, my_dataset, features_list)
#test performance using tester.py

%run 'tester.py'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.87854	Precision: 0.71113	Recall: 0.35450	F1: 0.47314	F2: 0.39402
	Total predictions: 13000	True positives:  709	False positives:  288	False negatives: 1291	True negatives: 10712

Final Thoughts

With an accuracy of 85-95%, precision of 65-75%, and recall of 35-60%, I think our algorithm has done well considering the small amount of data we had to work with, and our validation methods proved useful in creating a reliable algorithm. Still, I don't think I'd want to bet anyone's life in prison on any algorithm trained on this data. We could use more data! Our Naive Bayes and Decision Tree Classifiers didn't perform as well. Although overfitting became a problem as we increased 'n_neighbors' in our KNeighbors Classifier, we can avoid this by keeping the default setting of 5. That left us with a working algorithm and some pretty solid evaluation metrics. According to our confusion matrix, we were able to identify 97% of non-POIs and 60% of POIs. I'd rather miss a few POIs than falsely identify a non-POI as a POI. There may come a day when people are convicted based on machine learning, so it's important that we be as accurate as possible. Even so, this identifier still misses around 40% of the actual POIs.

%run 'poi_id.py'
%run 'tester.py'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.87854	Precision: 0.71113	Recall: 0.35450	F1: 0.47314	F2: 0.39402
	Total predictions: 13000	True positives:  709	False positives:  288	False negatives: 1291	True negatives: 10712