Udacity - Intro to Machine Learning


ENRON SCANDAL

Summary (adapted from Wikipedia)

The Enron scandal was a financial scandal that eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.

Enron was formed in 1985 by Kenneth Lay after merging Houston Natural Gas and InterNorth. Several years later, when Jeffrey Skilling was hired, he developed a staff of executives that – by the use of accounting loopholes, special purpose entities, and poor financial reporting – were able to hide billions of dollars in debt from failed deals and projects. Chief Financial Officer Andrew Fastow and other executives not only misled Enron's Board of Directors and Audit Committee on high-risk accounting practices, but also pressured Arthur Andersen to ignore the issues.

Enron shareholders filed a 40 billion dollar lawsuit after the company's stock price, which achieved a high of US$90.75 per share in mid-2000, plummeted to less than 1 dollar by the end of November 2001. The U.S. Securities and Exchange Commission (SEC) began an investigation, and Houston rival Dynegy offered to purchase the company at a very low price. The deal failed, and on December 2, 2001, Enron filed for bankruptcy under Chapter 11 of the United States Bankruptcy Code. Enron's 63.4 billion dollars in assets made it the largest corporate bankruptcy in U.S. history until WorldCom's bankruptcy the next year.

Many executives at Enron were indicted for a variety of charges and some were later sentenced to prison. Enron's auditor, Arthur Andersen, was found guilty in a United States District Court of illegally destroying documents relevant to the SEC investigation which voided its license to audit public companies, effectively closing the business. By the time the ruling was overturned at the U.S. Supreme Court, the company had lost the majority of its customers and had ceased operating. Enron employees and shareholders received limited returns in lawsuits, despite losing billions in pensions and stock prices. As a consequence of the scandal, new regulations and legislation were enacted to expand the accuracy of financial reporting for public companies. One piece of legislation, the Sarbanes–Oxley Act, increased penalties for destroying, altering, or fabricating records in federal investigations or for attempting to defraud shareholders. The act also increased the accountability of auditing firms to remain unbiased and independent of their clients.

[Image: Trial of Ken Lay and Jeff Skilling]

Final Project: Identify Fraud from Enron Email

By: Marissa Schmucker

May 2018


Table of Contents

  • Project Goal
  • Dataset Questions
  • Dataset Information
  • Feature Statistics
  • Explore Features
      • Salary
      • Bonus
      • Total Payments
      • Exercised Stock Options
      • Total Stock Value
      • Total Bonus and Exercised Stock Options
      • Total Payments and Stock Value in Millions
      • Shared Receipt with POI
      • To Messages
      • From Messages
      • Fraction to POI
      • Fraction from POI
  • Outliers
  • Transform, Select, and Scale
  • Algorithm Selection
  • Evaluation Metrics
  • Performance Test
  • Parameter Tuning
  • Final Analysis
  • Validating Our Analysis
  • Final Thoughts


Project Goal

The goal of this project is to use the Enron dataset to train a machine learning algorithm to detect the possibility of fraud (i.e., identify persons of interest). Since we already know the persons of interest (POIs) in our dataset, we can use supervised learning algorithms to construct our POI identifier. We will do this by picking the features in our dataset that best separate our POIs from our non-POIs.

We will start our analysis by answering some questions about our data. Then we will explore our features further by visualizing correlations and outliers. Next, we will transform/scale our features and select those that will be most useful in our POI identifier, engineering new features and adding them to the dataset if they prove useful. We will identify at least two algorithms that may be suited to this particular dataset and test them, tuning parameters until optimal performance is reached. In our final analysis, the fitted algorithm will be validated using our training/testing data; using performance metrics to evaluate the results, any problems will be addressed and modifications made. In our final thoughts, the performance of the final algorithm will be discussed.

"""Import pickle and sklearn to get started.
Load the data as enron_dict"""

import pickle
import sklearn

enron_dict = pickle.load(open("final_project_dataset.pkl", "r"))

Dataset Questions

After getting our data dictionary loaded, we can start exploring our data. We'll answer the following questions:

  1. How many people do we have in our dataset?
  2. What are their names?
  3. What information do we have about these people?
  4. Who are the POIs in our dataset?
  5. Who are the highest earners? Are they POIs?
  6. Whose stock options had the highest value (max exercised_stock_options)?
  7. Are there any features we can ignore due to missing data?
  8. What is the mean salary for non-POIs and POIs?
  9. What features might be useful for training our algorithm?
  10. Are there any features we may need to scale?

Top

print 'Number of People in Dataset: ', len(enron_dict) 
Number of People in Dataset:  146
import pprint

pretty = pprint.PrettyPrinter()
names = sorted(enron_dict.keys())  #sort names of Enron employees alphabetically by last name

print 'Sorted list of Enron employees by last name'
pretty.pprint(names) 
Sorted list of Enron employees by last name
['ALLEN PHILLIP K',
 'BADUM JAMES P',
 'BANNANTINE JAMES M',
 'BAXTER JOHN C',
 'BAY FRANKLIN R',
 'BAZELIDES PHILIP J',
 'BECK SALLY W',
 'BELDEN TIMOTHY N',
 'BELFER ROBERT',
 'BERBERIAN DAVID',
 'BERGSIEKER RICHARD P',
 'BHATNAGAR SANJAY',
 'BIBI PHILIPPE A',
 'BLACHMAN JEREMY M',
 'BLAKE JR. NORMAN P',
 'BOWEN JR RAYMOND M',
 'BROWN MICHAEL',
 'BUCHANAN HAROLD G',
 'BUTTS ROBERT H',
 'BUY RICHARD B',
 'CALGER CHRISTOPHER F',
 'CARTER REBECCA C',
 'CAUSEY RICHARD A',
 'CHAN RONNIE',
 'CHRISTODOULOU DIOMEDES',
 'CLINE KENNETH W',
 'COLWELL WESLEY',
 'CORDES WILLIAM R',
 'COX DAVID',
 'CUMBERLAND MICHAEL S',
 'DEFFNER JOSEPH M',
 'DELAINEY DAVID W',
 'DERRICK JR. JAMES V',
 'DETMERING TIMOTHY J',
 'DIETRICH JANET R',
 'DIMICHELE RICHARD G',
 'DODSON KEITH',
 'DONAHUE JR JEFFREY M',
 'DUNCAN JOHN H',
 'DURAN WILLIAM D',
 'ECHOLS JOHN B',
 'ELLIOTT STEVEN',
 'FALLON JAMES B',
 'FASTOW ANDREW S',
 'FITZGERALD JAY L',
 'FOWLER PEGGY',
 'FOY JOE',
 'FREVERT MARK A',
 'FUGH JOHN L',
 'GAHN ROBERT S',
 'GARLAND C KEVIN',
 'GATHMANN WILLIAM D',
 'GIBBS DANA R',
 'GILLIS JOHN',
 'GLISAN JR BEN F',
 'GOLD JOSEPH',
 'GRAMM WENDY L',
 'GRAY RODNEY',
 'HAEDICKE MARK E',
 'HANNON KEVIN P',
 'HAUG DAVID L',
 'HAYES ROBERT E',
 'HAYSLETT RODERICK J',
 'HERMANN ROBERT J',
 'HICKERSON GARY J',
 'HIRKO JOSEPH',
 'HORTON STANLEY C',
 'HUGHES JAMES A',
 'HUMPHREY GENE E',
 'IZZO LAWRENCE L',
 'JACKSON CHARLENE R',
 'JAEDICKE ROBERT',
 'KAMINSKI WINCENTY J',
 'KEAN STEVEN J',
 'KISHKILL JOSEPH G',
 'KITCHEN LOUISE',
 'KOENIG MARK E',
 'KOPPER MICHAEL J',
 'LAVORATO JOHN J',
 'LAY KENNETH L',
 'LEFF DANIEL P',
 'LEMAISTRE CHARLES',
 'LEWIS RICHARD',
 'LINDHOLM TOD A',
 'LOCKHART EUGENE E',
 'LOWRY CHARLES P',
 'MARTIN AMANDA K',
 'MCCARTY DANNY J',
 'MCCLELLAN GEORGE',
 'MCCONNELL MICHAEL S',
 'MCDONALD REBECCA',
 'MCMAHON JEFFREY',
 'MENDELSOHN JOHN',
 'METTS MARK',
 'MEYER JEROME J',
 'MEYER ROCKFORD G',
 'MORAN MICHAEL P',
 'MORDAUNT KRISTINA M',
 'MULLER MARK S',
 'MURRAY JULIA H',
 'NOLES JAMES L',
 'OLSON CINDY K',
 'OVERDYKE JR JERE C',
 'PAI LOU L',
 'PEREIRA PAULO V. FERRAZ',
 'PICKERING MARK R',
 'PIPER GREGORY F',
 'PIRO JIM',
 'POWERS WILLIAM',
 'PRENTICE JAMES',
 'REDMOND BRIAN L',
 'REYNOLDS LAWRENCE',
 'RICE KENNETH D',
 'RIEKER PAULA H',
 'SAVAGE FRANK',
 'SCRIMSHAW MATTHEW',
 'SHANKMAN JEFFREY A',
 'SHAPIRO RICHARD S',
 'SHARP VICTORIA T',
 'SHELBY REX',
 'SHERRICK JEFFREY B',
 'SHERRIFF JOHN R',
 'SKILLING JEFFREY K',
 'STABLER FRANK',
 'SULLIVAN-SHAKLOVITZ COLLEEN',
 'SUNDE MARTIN',
 'TAYLOR MITCHELL S',
 'THE TRAVEL AGENCY IN THE PARK',
 'THORN TERENCE H',
 'TILNEY ELIZABETH A',
 'TOTAL',
 'UMANOFF ADAM S',
 'URQUHART JOHN A',
 'WAKEHAM JOHN',
 'WALLS JR ROBERT H',
 'WALTERS GARETH W',
 'WASAFF GEORGE',
 'WESTFAHL RICHARD K',
 'WHALEY DAVID A',
 'WHALLEY LAWRENCE G',
 'WHITE JR THOMAS E',
 'WINOKUR JR. HERBERT S',
 'WODRASKA JOHN',
 'WROBEL BRUCE',
 'YEAGER F SCOTT',
 'YEAP SOON']
print 'Example Value Dictionary of Features'
pretty.pprint(enron_dict['ALLEN PHILLIP K']) 
Example Value Dictionary of Features
{'bonus': 4175000,
 'deferral_payments': 2869717,
 'deferred_income': -3081055,
 'director_fees': 'NaN',
 'email_address': 'phillip.allen@enron.com',
 'exercised_stock_options': 1729541,
 'expenses': 13868,
 'from_messages': 2195,
 'from_poi_to_this_person': 47,
 'from_this_person_to_poi': 65,
 'loan_advances': 'NaN',
 'long_term_incentive': 304805,
 'other': 152,
 'poi': False,
 'restricted_stock': 126027,
 'restricted_stock_deferred': -126027,
 'salary': 201955,
 'shared_receipt_with_poi': 1407,
 'to_messages': 2902,
 'total_payments': 4484442,
 'total_stock_value': 1729541}

Before we go any further, let's transform our dictionary into a pandas DataFrame for easier exploration.
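(A sketch of a more direct route, for reference: pandas can build the DataFrame straight from the dictionary, skipping the CSV round-trip. The notebook below takes the CSV path instead; the variable direct here is purely illustrative.)

#hypothetical alternative (not used below): build the DataFrame directly
#from the dictionary and convert the 'NaN' strings to real missing values
import numpy as np
import pandas as pd

direct = pd.DataFrame.from_dict(enron_dict, orient='index')
direct = direct.replace('NaN', np.nan)
direct = direct.drop('TOTAL')   #spreadsheet total row, not a person
direct.index.name = 'name'
print direct.shape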

import csv

"""Write Enron Dictionary to CSV File for Possible Future Use and Easily Read into Dataframe"""

fieldnames = ['name'] + enron_dict['METTS MARK'].keys()

with open('enron.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for name in enron_dict.keys():
        if name != 'TOTAL':
            n = {'name':name}
            n.update(enron_dict[name])
            writer.writerow(n)      
#read csv into pandas dataframe 
import pandas as pd
enron = pd.read_csv('enron.csv')
#added/combined feature, total bonus and exercised_stock_options
enron['total_be'] = enron['bonus'].fillna(0.0) + enron['exercised_stock_options'].fillna(0.0)
#added feature, fraction of e-mails to and from poi
enron['fraction_to_poi'] = enron['from_this_person_to_poi'].fillna(0.0)/enron['from_messages'].fillna(0.0)
enron['fraction_from_poi'] = enron['from_poi_to_this_person'].fillna(0.0)/enron['to_messages'].fillna(0.0)
#added feature, scaled total compensation
enron['total_millions'] = (enron['total_payments'].fillna(0.0) + enron['total_stock_value'].fillna(0.0))/1000000

Dataset Information

#data information/types

enron.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145 entries, 0 to 144
Data columns (total 26 columns):
name                         145 non-null object
salary                       94 non-null float64
to_messages                  86 non-null float64
deferral_payments            38 non-null float64
total_payments               124 non-null float64
exercised_stock_options      101 non-null float64
bonus                        81 non-null float64
restricted_stock             109 non-null float64
shared_receipt_with_poi      86 non-null float64
restricted_stock_deferred    17 non-null float64
total_stock_value            125 non-null float64
expenses                     94 non-null float64
loan_advances                3 non-null float64
from_messages                86 non-null float64
other                        92 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          145 non-null bool
director_fees                16 non-null float64
deferred_income              48 non-null float64
long_term_incentive          65 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
total_be                     145 non-null float64
fraction_to_poi              86 non-null float64
fraction_from_poi            86 non-null float64
total_millions               145 non-null float64
dtypes: bool(1), float64(23), object(2)
memory usage: 28.5+ KB

Top

Just by looking at our dataset information above, we can quickly point out a few ways to narrow down our feature selection. Some of our features have lots of missing data, so those may be ones we can remove; "restricted_stock_deferred", "loan_advances", and "director_fees" are candidates to take out altogether. There are also a few features that seem to give us the same information: "shared_receipt_with_poi", "to_messages", "from_messages", "from_this_person_to_poi", and "from_poi_to_this_person" all describe a person's e-mail behavior, and all have the same non-null count of 86. We may be able to narrow those down to just one or two, or create a new feature from them (see the features added above, and the correlation sketch below).
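As a quick sketch (using the enron DataFrame built above), we can measure how much those e-mail features overlap; correlations near 1 would confirm that one or two columns carry most of the signal:

#pairwise correlations among the e-mail behavior features
email_cols = ['shared_receipt_with_poi', 'to_messages', 'from_messages',
              'from_this_person_to_poi', 'from_poi_to_this_person']
print enron[email_cols].corr()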

Features that may be most useful, since we're dealing with corporate fraud, are those features that tell us about the money. Let's follow the money! Features that will give us that money trail will be "salary", "total_payments", "exercised_stock_options", "bonus", "restricted_stock", and "total_stock_value".

For now, let's continue to explore our dataset before making our final selection.

Feature Statistics

#number of POI in dataset
print 'There are 18 POI in our Dataset as you can see by our "True" count'
enron['poi'].value_counts()
There are 18 POI in our Dataset as you can see by our "True" count
False    127
True      18
Name: poi, dtype: int64
#set a baseline by extracting non-POIs and printing stats

non_poi = enron[enron.poi.isin([False])]

non_poi_money = non_poi[['salary','bonus','exercised_stock_options','total_stock_value',\
                         'total_payments','total_be','total_millions']].describe()
non_poi_money
|       | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
|-------|--------|-------|-------------------------|-------------------|----------------|----------|----------------|
| count | 7.700000e+01 | 6.500000e+01 | 8.900000e+01 | 1.070000e+02 | 1.060000e+02 | 1.270000e+02 | 127.000000 |
| mean  | 2.621515e+05 | 9.868249e+05 | 1.947752e+06 | 2.374085e+06 | 1.725091e+06 | 1.870028e+06 | 3.440052 |
| std   | 1.392317e+05 | 1.173880e+06 | 2.547068e+06 | 3.535017e+06 | 2.618288e+06 | 2.693997e+06 | 4.839977 |
| min   | 4.770000e+02 | 7.000000e+04 | 3.285000e+03 | -4.409300e+04 | 1.480000e+02 | 0.000000e+00 | 0.000000 |
| 25%   | 2.061210e+05 | 4.000000e+05 | 4.365150e+05 | 4.246845e+05 | 3.304798e+05 | 2.213790e+05 | 0.507935 |
| 50%   | 2.516540e+05 | 7.000000e+05 | 1.030329e+06 | 1.030329e+06 | 1.056092e+06 | 8.862310e+05 | 1.884748 |
| 75%   | 2.885890e+05 | 1.000000e+06 | 2.165172e+06 | 2.307584e+06 | 2.006025e+06 | 2.250522e+06 | 4.317325 |
| max   | 1.060932e+06 | 8.000000e+06 | 1.536417e+07 | 2.381793e+07 | 1.725253e+07 | 1.636417e+07 | 31.874715 |
non_poi_email_behavior = non_poi[['shared_receipt_with_poi','to_messages',\
                                  'from_messages','fraction_from_poi','fraction_to_poi']].describe()
non_poi_email_behavior
|       | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
|-------|-------------------------|-------------|---------------|-------------------|-----------------|
| count | 72.000000 | 72.000000 | 72.000000 | 72.000000 | 72.000000 |
| mean  | 1058.527778 | 2007.111111 | 668.763889 | 0.036107 | 0.152669 |
| std   | 1132.503757 | 2693.165955 | 1978.997801 | 0.041929 | 0.206057 |
| min   | 2.000000 | 57.000000 | 12.000000 | 0.000000 | 0.000000 |
| 25%   | 191.500000 | 513.750000 | 20.500000 | 0.007760 | 0.000000 |
| 50%   | 594.000000 | 944.000000 | 41.000000 | 0.022741 | 0.053776 |
| 75%   | 1635.500000 | 2590.750000 | 216.500000 | 0.050705 | 0.225000 |
| max   | 4527.000000 | 15149.000000 | 14368.000000 | 0.217341 | 1.000000 |

I thought it was interesting to see someone with 100% of their e-mails going to persons of interest. Below, I printed out some features associated with this person. After a little research, I found that Gene Humphrey was one of the first employees of Enron. So, it makes sense that all of his e-mails were to persons of interest who had been with the company from the beginning. Those were the only people he worked with.

enron[(enron['fraction_to_poi']>0.9)][['name','salary','total_be',\
                                       'restricted_stock','total_stock_value','to_messages','poi']]
|    | name | salary | total_be | restricted_stock | total_stock_value | to_messages | poi |
|----|------|--------|----------|------------------|-------------------|-------------|-----|
| 10 | HUMPHREY GENE E | 130724.0 | 2282768.0 | NaN | 2282768.0 | 128.0 | False |
#POI stats

poi_info = enron[enron.poi.isin([True])]

poi_money = poi_info[['salary','bonus','exercised_stock_options','total_stock_value',\
                      'total_payments','total_be','total_millions']].describe()
poi_money
|       | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
|-------|--------|-------|-------------------------|-------------------|----------------|----------|----------------|
| count | 1.700000e+01 | 1.600000e+01 | 1.200000e+01 | 1.800000e+01 | 1.800000e+01 | 1.800000e+01 | 18.000000 |
| mean  | 3.834449e+05 | 2.075000e+06 | 1.046379e+07 | 9.165671e+06 | 7.913590e+06 | 8.820307e+06 | 17.079261 |
| std   | 2.783597e+05 | 2.047437e+06 | 1.238259e+07 | 1.384117e+07 | 2.396549e+07 | 1.222914e+07 | 35.289434 |
| min   | 1.584030e+05 | 2.000000e+05 | 3.847280e+05 | 1.260270e+05 | 9.109300e+04 | 8.000000e+05 | 1.765324 |
| 25%   | 2.401890e+05 | 7.750000e+05 | 1.456581e+06 | 1.016450e+06 | 1.142396e+06 | 1.262500e+06 | 3.140359 |
| 50%   | 2.786010e+05 | 1.275000e+06 | 3.914557e+06 | 2.206836e+06 | 1.754028e+06 | 2.079817e+06 | 4.434161 |
| 75%   | 4.151890e+05 | 2.062500e+06 | 1.938604e+07 | 1.051133e+07 | 2.665345e+06 | 7.990914e+06 | 11.274354 |
| max   | 1.111258e+06 | 7.000000e+06 | 3.434838e+07 | 4.911008e+07 | 1.035598e+08 | 4.134838e+07 | 152.669871 |
poi_email_behavior = poi_info[['shared_receipt_with_poi','to_messages', \
                               'from_messages','fraction_from_poi','fraction_to_poi']].describe()
poi_email_behavior
|       | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
|-------|-------------------------|-------------|---------------|-------------------|-----------------|
| count | 14.000000 | 14.000000 | 14.000000 | 14.000000 | 14.000000 |
| mean  | 1783.000000 | 2417.142857 | 300.357143 | 0.047507 | 0.345470 |
| std   | 1264.996625 | 1961.858101 | 805.844574 | 0.032085 | 0.156894 |
| min   | 91.000000 | 225.000000 | 16.000000 | 0.021339 | 0.173611 |
| 25%   | 1059.250000 | 1115.750000 | 33.000000 | 0.026900 | 0.228580 |
| 50%   | 1589.000000 | 1875.000000 | 44.500000 | 0.030639 | 0.276389 |
| 75%   | 2165.250000 | 2969.250000 | 101.500000 | 0.059118 | 0.427083 |
| max   | 5521.000000 | 7991.000000 | 3069.000000 | 0.136519 | 0.656250 |
#difference in non-poi compensation and poi compensation

difference_in_money = poi_money - non_poi_money
difference_in_money
|       | salary | bonus | exercised_stock_options | total_stock_value | total_payments | total_be | total_millions |
|-------|--------|-------|-------------------------|-------------------|----------------|----------|----------------|
| count | -60.000000 | -4.900000e+01 | -7.700000e+01 | -8.900000e+01 | -8.800000e+01 | -1.090000e+02 | -109.000000 |
| mean  | 121293.375859 | 1.088175e+06 | 8.516041e+06 | 6.791586e+06 | 6.188499e+06 | 6.950279e+06 | 13.639208 |
| std   | 139128.027285 | 8.735576e+05 | 9.835520e+06 | 1.030615e+07 | 2.134720e+07 | 9.535142e+06 | 30.449457 |
| min   | 157926.000000 | 1.300000e+05 | 3.814430e+05 | 1.701200e+05 | 9.094500e+04 | 8.000000e+05 | 1.765324 |
| 25%   | 34068.000000 | 3.750000e+05 | 1.020066e+06 | 5.917658e+05 | 8.119162e+05 | 1.041121e+06 | 2.632424 |
| 50%   | 26947.000000 | 5.750000e+05 | 2.884228e+06 | 1.176506e+06 | 6.979350e+05 | 1.193586e+06 | 2.549413 |
| 75%   | 126600.000000 | 1.062500e+06 | 1.722087e+07 | 8.203751e+06 | 6.593195e+05 | 5.740393e+06 | 6.957028 |
| max   | 50326.000000 | -1.000000e+06 | 1.898422e+07 | 2.529215e+07 | 8.630726e+07 | 2.498422e+07 | 120.795156 |

We can see from the table above that money matters! The mean difference is especially telling. And, although upper management tends to have greater compensation, you can't help but be shocked by the tremendous gap seen here.

#difference in non-poi email behavior and poi behavior

difference_in_email = poi_email_behavior - non_poi_email_behavior
difference_in_email
|       | shared_receipt_with_poi | to_messages | from_messages | fraction_from_poi | fraction_to_poi |
|-------|-------------------------|-------------|---------------|-------------------|-----------------|
| count | -58.000000 | -58.000000 | -58.000000 | -58.000000 | -58.000000 |
| mean  | 724.472222 | 410.031746 | -368.406746 | 0.011399 | 0.192800 |
| std   | 132.492868 | -731.307854 | -1173.153226 | -0.009844 | -0.049163 |
| min   | 89.000000 | 168.000000 | 4.000000 | 0.021339 | 0.173611 |
| 25%   | 867.750000 | 602.000000 | 12.500000 | 0.019140 | 0.228580 |
| 50%   | 995.000000 | 931.000000 | 3.500000 | 0.007898 | 0.222613 |
| 75%   | 529.750000 | 378.500000 | -115.000000 | 0.008413 | 0.202083 |
| max   | 994.000000 | -7158.000000 | -11299.000000 | -0.080822 | -0.343750 |

My original e-mail behavior table was a bit less telling than the money table, but engineering the fraction features let me reflect e-mail behavior more accurately. The updated tables above include the fraction of e-mails sent to and received from POIs.

#poi name, salary, bonus, stock options, total bonus and options, from messages, and fraction to poi, ordered by total descending

poi_info[['name','salary','bonus','exercised_stock_options','total_be','total_millions',\
          'from_messages','fraction_to_poi']].sort_values('total_millions',ascending=False)
|     | name | salary | bonus | exercised_stock_options | total_be | total_millions | from_messages | fraction_to_poi |
|-----|------|--------|-------|-------------------------|----------|----------------|---------------|-----------------|
| 65  | LAY KENNETH L | 1072321.0 | 7000000.0 | 34348384.0 | 41348384.0 | 152.669871 | 36.0 | 0.444444 |
| 95  | SKILLING JEFFREY K | 1111258.0 | 5600000.0 | 19250000.0 | 24850000.0 | 34.776388 | 108.0 | 0.277778 |
| 125 | HIRKO JOSEPH | NaN | NaN | 30766064.0 | 30766064.0 | 30.857157 | NaN | NaN |
| 88  | RICE KENNETH D | 420636.0 | 1750000.0 | 19794175.0 | 21544175.0 | 23.047589 | 18.0 | 0.222222 |
| 124 | YEAGER F SCOTT | 158403.0 | NaN | 8308552.0 | 8308552.0 | 12.245058 | NaN | NaN |
| 60  | DELAINEY DAVID W | 365163.0 | 3000000.0 | 2291113.0 | 5291113.0 | 8.362240 | 3069.0 | 0.198436 |
| 4   | HANNON KEVIN P | 243293.0 | 1500000.0 | 5538001.0 | 7038001.0 | 6.679747 | 32.0 | 0.656250 |
| 82  | BELDEN TIMOTHY N | 213999.0 | 5249999.0 | 953136.0 | 6203135.0 | 6.612335 | 484.0 | 0.223140 |
| 53  | SHELBY REX | 211844.0 | 200000.0 | 1624396.0 | 1824396.0 | 4.497501 | 39.0 | 0.358974 |
| 141 | CAUSEY RICHARD A | 415189.0 | 1000000.0 | NaN | 1000000.0 | 4.370821 | 49.0 | 0.244898 |
| 85  | FASTOW ANDREW S | 440698.0 | 1300000.0 | NaN | 1300000.0 | 4.218495 | NaN | NaN |
| 41  | KOPPER MICHAEL J | 224305.0 | 800000.0 | NaN | 800000.0 | 3.637644 | NaN | NaN |
| 134 | KOENIG MARK E | 309946.0 | 700000.0 | 671737.0 | 1371737.0 | 3.507476 | 61.0 | 0.245902 |
| 30  | RIEKER PAULA H | 249201.0 | 700000.0 | 1635238.0 | 2335238.0 | 3.017987 | 82.0 | 0.585366 |
| 76  | BOWEN JR RAYMOND M | 278601.0 | 1350000.0 | NaN | 1350000.0 | 2.921644 | 27.0 | 0.555556 |
| 16  | COLWELL WESLEY | 288542.0 | 1200000.0 | NaN | 1200000.0 | 2.188586 | 40.0 | 0.275000 |
| 144 | GLISAN JR BEN F | 274975.0 | 600000.0 | 384728.0 | 984728.0 | 2.050830 | 16.0 | 0.375000 |
| 87  | CALGER CHRISTOPHER F | 240189.0 | 1250000.0 | NaN | 1250000.0 | 1.765324 | 144.0 | 0.173611 |

Although we don't have salary and bonus data for Joseph Hirko, his exercised stock options are second only to Kenneth Lay's. Since "exercised_stock_options" seems to be a key indicator of a POI when salary/bonus is unavailable, that is definitely a feature we'll want in our final selection. These features may be even more robust when we take the total of bonus and options, so let's add that feature, total_be, to our dataset; maybe it will come in handy. (I went back and added it at the top of my code so it would be included in the info and analysis.)
Also, it was interesting to see that POIs don't send as many "from_messages" as non-POIs; David Delainey is the only one with well over 500 sent e-mails. This may be a telling POI behavior, as they may prefer to talk face-to-face with others.
Top

Explore Features

In this section, we'll visualize some of our features in order to explore them further.

import matplotlib.pyplot as plt
%matplotlib inline
%pylab inline
import seaborn as sns
Populating the interactive namespace from numpy and matplotlib

Salary

average_salary = enron.groupby('poi').mean()['salary']
average_salary
poi
False    262151.506494
True     383444.882353
Name: salary, dtype: float64
sns.boxplot(x='poi',y='salary',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xd353a20>

[Boxplot: salary by POI]

Bonus

average_bonus = enron.groupby('poi').mean()['bonus']
average_bonus
poi
False    9.868249e+05
True     2.075000e+06
Name: bonus, dtype: float64
sns.boxplot(x='poi',y='bonus',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe0ac860>

[Boxplot: bonus by POI]

Wow! An 8 million dollar bonus seems a bit much, but you can see the difference between John Lavorato and Ken Lay in other ways. Below, the gap across all the other financial features is significant. And bonuses among POIs are still higher on average than among non-POIs, so despite our non-POI outlier, this feature may still be useful in training our algorithm.

enron[(enron['bonus']>6000000)][['name','salary','bonus','exercised_stock_options','restricted_stock','total_stock_value','poi']]
|    | name | salary | bonus | exercised_stock_options | restricted_stock | total_stock_value | poi |
|----|------|--------|-------|-------------------------|------------------|-------------------|-----|
| 43 | LAVORATO JOHN J | 339288.0 | 8000000.0 | 4158995.0 | 1008149.0 | 5167144.0 | False |
| 65 | LAY KENNETH L | 1072321.0 | 7000000.0 | 34348384.0 | 14761694.0 | 49110078.0 | True |

Total Payments

average_total_payments = enron.groupby('poi').mean()['total_payments']
average_total_payments
poi
False    1.725091e+06
True     7.913590e+06
Name: total_payments, dtype: float64
sns.boxplot(x='poi',y='total_payments',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe2289e8>

[Boxplot: total_payments by POI]

Wow! That last boxplot has an outlier that is obviously pulling the mean way up. Who is that?

enron[(enron['total_payments']>40000000)][['name','total_payments','poi']]
|    | name | total_payments | poi |
|----|------|----------------|-----|
| 65 | LAY KENNETH L | 103559793.0 | True |
#take Ken Lay out of the poi boxplot

kl_not_in = enron[(enron['total_payments']<40000000)]

sns.boxplot(x='poi',y='total_payments',data=kl_not_in)
<matplotlib.axes._subplots.AxesSubplot at 0xe3e44e0>

[Boxplot: total_payments by POI, Ken Lay removed]

Well, at least now we can see the boxplots more clearly. There's not much of a difference between POI and non-POI here when we take Ken Lay out, so total_payments probably won't be a feature we'll use.

Exercised Stock Options

average_optionsvalue = enron.groupby('poi').mean()['exercised_stock_options']
average_optionsvalue
poi
False    1.947752e+06
True     1.046379e+07
Name: exercised_stock_options, dtype: float64
sns.boxplot(x='poi',y='exercised_stock_options',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe353080>

[Boxplot: exercised_stock_options by POI]

Exercised stock options look to be clearly higher among POIs, so this is definitely a feature to include in our list for our algorithm.

Total Bonus and Exercised Stock Options

average_total_sbe = enron.groupby('poi').mean()['total_be']
average_total_sbe
poi
False    1.870028e+06
True     8.820307e+06
Name: total_be, dtype: float64
sns.boxplot(x='poi',y='total_be',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xe6089b0>

[Boxplot: total_be by POI]

Total Bonus and Exercised Stock Options might be useful, but it might also just add to the noise. So, maybe we won't use this one.

Total Stock Value

average_stockvalue = enron.groupby('poi').mean()['total_stock_value']
average_stockvalue
poi
False    2.374085e+06
True     9.165671e+06
Name: total_stock_value, dtype: float64
sns.boxplot(x='poi',y='total_stock_value',data=enron)
<matplotlib.axes._subplots.AxesSubplot at 0xeba4358>

[Boxplot: total_stock_value by POI]

Total stock value for POIs on average is much higher than non-POIs. This feature is another good option for our POI identifier.

Total Payments and Stock Value in Millions

average_total_comp = enron.groupby('poi').mean()['total_millions']
average_total_comp
poi
False     3.440052
True     17.079261
Name: total_millions, dtype: float64
sns.boxplot(x='poi',y='total_millions',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0xef21668>

[Boxplot: total_millions by POI]

Let's try that one again without Ken Lay...

sns.boxplot(x='poi',y='total_millions',data= kl_not_in)
<matplotlib.axes._subplots.AxesSubplot at 0xeec3390>

[Boxplot: total_millions by POI, Ken Lay removed]

Hmmmm, maybe we didn't need to add this feature. We can look closer by using lmplot and pairplot later on in our analysis.

Shared Receipt with POI

average_shared_receipt = enron.groupby('poi').mean()['shared_receipt_with_poi']
average_shared_receipt
poi
False    1058.527778
True     1783.000000
Name: shared_receipt_with_poi, dtype: float64
sns.boxplot(x='poi',y='shared_receipt_with_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10734b00>

[Boxplot: shared_receipt_with_poi by POI]

To Messages

average_to = enron.groupby('poi').mean()['to_messages']
average_to
poi
False    2007.111111
True     2417.142857
Name: to_messages, dtype: float64
sns.boxplot(x='poi',y='to_messages',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10b46e80>

[Boxplot: to_messages by POI]

From Messages

average_from = enron.groupby('poi').mean()['from_messages']
average_from
poi
False    668.763889
True     300.357143
Name: from_messages, dtype: float64
sns.boxplot(x='poi',y='from_messages',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x106998d0>

[Boxplot: from_messages by POI]

Fraction to POI

average_fraction_to = enron.groupby('poi').mean()['fraction_to_poi']
average_fraction_to
poi
False    0.152669
True     0.345470
Name: fraction_to_poi, dtype: float64
sns.boxplot(x='poi',y='fraction_to_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x10e21c88>

[Boxplot: fraction_to_poi by POI]

Fraction_to_poi looks like a good feature to add to our list, since most of the poi distribution is in the upper range of the non-poi distribution.

Fraction from POI

average_fraction_from = enron.groupby('poi').mean()['fraction_from_poi']
average_fraction_from
poi
False    0.036107
True     0.047507
Name: fraction_from_poi, dtype: float64
sns.boxplot(x='poi',y='fraction_from_poi',data= enron)
<matplotlib.axes._subplots.AxesSubplot at 0x112cf2e8>

[Boxplot: fraction_from_poi by POI]

Top

Pairplot Analysis

Now, let's take a look at some of our features in the following pairplot. Maybe it will help us make our final decisions for our features list.

import seaborn as sns; sns.set(style="ticks", color_codes=True)

g = sns.pairplot(enron, vars=['bonus','exercised_stock_options','from_messages','fraction_to_poi'],
                 dropna=True, diag_kind='kde', hue='poi', markers=['x','o'])

[Pairplot: bonus, exercised_stock_options, from_messages, fraction_to_poi, colored by POI]

Outliers

When looking at the stats for POIs and non-POIs for the first time, I noticed that the non-POI stats were much higher than the POI stats. That's when I remembered I hadn't accounted for the "TOTAL" key, so I went back and skipped that key when writing my CSV (I'll pop it out of the dictionary later as well). After doing that, my stats were as expected. Now, let's see what other outliers we can find. In the pairplot above, two POIs really stood out; let's take a closer look in the following lmplot.

sns.lmplot(x='bonus', y= 'salary', hue='poi', data=enron, palette='Set1',size=10,markers=['x','o'])
plt.title('Salary/Bonus for POI and non-POI', fontsize=18)
plt.xlabel('Bonus', fontsize=16)
plt.ylabel('Salary', fontsize=16)
<matplotlib.text.Text at 0x14e43e10>

[lmplot: Salary vs. Bonus for POI and non-POI]

#Who are the two outliers in blue with the high salary AND high bonus?  Ken Lay and Jeff Skilling of course!

enron[(enron['salary']>1000000)][['name','salary','bonus','poi']]
|     | name | salary | bonus | poi |
|-----|------|--------|-------|-----|
| 65  | LAY KENNETH L | 1072321.0 | 7000000.0 | True |
| 95  | SKILLING JEFFREY K | 1111258.0 | 5600000.0 | True |
| 128 | FREVERT MARK A | 1060932.0 | 2000000.0 | False |

These are two of our persons of interest, so we definitely don't want to take them out of our dataset.

According to Executive Excess 2002

Top executives at 23 companies under investigation for their accounting practices earned far more during the past three years than the average CEO at large companies. CEOs at the firms under investigation earned an average of 62.2 million dollars during 1999-2001, 70 percent more than the average of 36.5 million dollars for all leading executives for that period.

We may also be able to find a few datapoints that are just causing noise by checking for lots of missing values in rows.

#check for more than 20 missing values for each datapoint

i = 0

for row in enron.isnull().sum(axis=1):
    if row > 20:
        print enron.iloc[i]
    i+=1
name                         LOCKHART EUGENE E
salary                                     NaN
to_messages                                NaN
deferral_payments                          NaN
total_payments                             NaN
exercised_stock_options                    NaN
bonus                                      NaN
restricted_stock                           NaN
shared_receipt_with_poi                    NaN
restricted_stock_deferred                  NaN
total_stock_value                          NaN
expenses                                   NaN
loan_advances                              NaN
from_messages                              NaN
other                                      NaN
from_this_person_to_poi                    NaN
poi                                      False
director_fees                              NaN
deferred_income                            NaN
long_term_incentive                        NaN
email_address                              NaN
from_poi_to_this_person                    NaN
total_be                                     0
fraction_to_poi                            NaN
fraction_from_poi                          NaN
total_millions                               0
Name: 90, dtype: object
#check for missing values in features

enron.isnull().sum()
name                           0
salary                        51
to_messages                   59
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       59
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 59
other                         53
from_this_person_to_poi       59
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 34
from_poi_to_this_person       59
total_be                       0
fraction_to_poi               59
fraction_from_poi             59
total_millions                 0
dtype: int64

loan_advances has 142 missing values! That's definitely a feature we can remove before we run our tests.

Money and Messages Regression Model

sns.lmplot(x='bonus', y='fraction_to_poi', hue='poi', data=enron, palette='Set1',size=10,markers=['x','o'])
plt.title('Money & Messages', fontsize=18)
<matplotlib.text.Text at 0x1517fa90>

[lmplot: Money & Messages (bonus vs. fraction_to_poi), colored by POI]

Top

Transform, Select, and Scale

Now let's transform, select, and scale our features.

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
features_full_list = enron.columns.tolist()
#drop the identifier columns, the 'poi' label, and the features with
#too many missing values ('loan_advances', 'director_fees')
for col in ['name', 'email_address', 'loan_advances', 'director_fees', 'poi']:
    features_full_list.remove(col)
features_list = ['poi'] + features_full_list  #'poi' label must come first
features_list
['poi',
 'salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person',
 'total_be',
 'fraction_to_poi',
 'fraction_from_poi',
 'total_millions']
### Remove outliers that corrupt the data
enron_dict.pop('TOTAL', 0)
{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}
#remove datapoints that create noise

enron_dict.pop('LOCKHART EUGENE E',0)
{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}
#take out all 'loan_advances' because of missing values

for name in enron_dict:
    enron_dict[name].pop('loan_advances',0)
### Create new feature(s)

#add fraction of emails from and to poi
#idea for this added feature taken from course materials

def computeFraction( poi_messages, all_messages ):
    """ Given the number of messages to/from a POI (numerator)
        and the number of all messages to/from a person (denominator),
        return the fraction of that person's messages involving a POI.
    """
    fraction = 0.
    if poi_messages != 'NaN' and all_messages != 'NaN':
        fraction = float(poi_messages)/all_messages
    return fraction
for name in enron_dict:

    data_point = enron_dict[name]

    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction( from_poi_to_this_person, to_messages )
    
    enron_dict[name]["fraction_from_poi"] = fraction_from_poi
  
    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )

    enron_dict[name]["fraction_to_poi"] = fraction_to_poi
#add total_be to dictionary

for name in enron_dict:
    data_point = enron_dict[name]
    
    bonus = data_point['bonus']
    if bonus == 'NaN':
        bonus = 0.0
    options = data_point['exercised_stock_options']
    if options == 'NaN':
        options = 0.0
    total = bonus+options

    enron_dict[name]['total_be'] = total
    
    
#add total compensation in millions to dataset

for name in enron_dict:
    data_point = enron_dict[name]
    
    total_payments = data_point['total_payments']
    if total_payments == 'NaN':
        total_payments = 0.0
    total_stock = data_point['total_stock_value']
    if total_stock == 'NaN':
        total_stock = 0.0
    total = (total_payments + total_stock)/1000000

    enron_dict[name]['total_millions'] = total

SELECT FEATURES

I've put together a few candidate lists that may be useful in training our classifiers. Each of the selected features can give us some insight into the compensation and behavior of a POI. Total compensation (total_millions) shows us that, on average, POIs are compensated more highly than non-POIs; the same holds for individual payments like salary and bonus. When it comes to stock behavior, POIs are more active in exercising their stock options (exercised_stock_options). Other features, like from_messages, show a pattern in e-mail behavior: POIs do not send many messages, but the ones they do send are often to other POIs (fraction_to_poi). These are all features we'll test before making our final selection.

We will start with our full list as a baseline and test it against each selected list's metrics in order to find our final feature list for our POI identifier. The lists were chosen based on the stats and plots generated in our analysis: features that showed a greater overall difference between POI and non-POI stats made the cut. (A univariate ranking, sketched below, offers a second opinion.)
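As a sanity check on these hand-picked lists, here is a hedged sketch that ranks all of the candidates with univariate selection (SelectKBest with the ANOVA F-score), reusing the featureFormat/targetFeatureSplit helpers imported earlier. This ranking did not drive the lists above; it is only a second opinion:

from sklearn.feature_selection import SelectKBest, f_classif

#score every candidate feature against the 'poi' label
kbest_data = featureFormat(enron_dict, features_list, sort_keys=True)
kbest_labels, kbest_features = targetFeatureSplit(kbest_data)

selector = SelectKBest(f_classif, k='all')
selector.fit(kbest_features, kbest_labels)

#print features from highest to lowest F-score
for fname, score in sorted(zip(features_list[1:], selector.scores_),
                           key=lambda pair: -pair[1]):
    print fname, round(score, 2)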

print features_list #list of features available for testing
['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_be', 'fraction_to_poi', 'fraction_from_poi', 'total_millions']
### Select what features to use
first_list = ['poi','total_millions','fraction_to_poi','from_messages']
second_list = ['poi','total_be','fraction_to_poi','from_messages']
third_list = ['poi','salary','bonus','fraction_to_poi','from_messages']
fourth_list = ['poi','bonus','exercised_stock_options','fraction_to_poi']
features_final_list = fourth_list
print "Final List", features_final_list
Final List ['poi', 'bonus', 'exercised_stock_options', 'fraction_to_poi']

Algorithm Selection

Evaluation Metrics

#Evaluation metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

import scikitplot as skplt

from sklearn.cross_validation import StratifiedShuffleSplit
#Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# %load 'my_test.py'
def test_list(classifier, feature_list, enron_dict):

    my_dataset = enron_dict

    data = featureFormat(my_dataset, feature_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)

    X = np.array(features)
    y = np.array(labels)
    #note: only the last of the 1000 shuffled splits is kept for the
    #single fit/predict below (tester.py is what averages over all of them)
    sss = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.3, random_state=42)
    for train_index, test_index in sss:
        features_train, features_test = X[train_index], X[test_index]
        labels_train, labels_test = y[train_index], y[test_index]

    clf = classifier
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)

    #use isinstance here (== against a fresh instance is never True), so
    #feature importances actually get reported for decision trees
    if isinstance(classifier, DecisionTreeClassifier):
        return {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
                'Recall': recall_score(labels_test,pred), 'Feature Importance': clf.feature_importances_}

    return {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
            'Recall': recall_score(labels_test,pred)}

Performance: Accuracy, Precision, and Recall

Below, we will test each list using three different algorithms: Naive Bayes, Decision Tree, and KNearest Neighbors.

  • Our accuracy score shows the ratio of correctly predicted observations to total observations:
    Accuracy = (TP + TN) / (TP + FP + FN + TN)
  • Precision is the ratio of correctly predicted positive observations to all predicted positive observations:
    Precision = TP / (TP + FP)
  • Recall is the ratio of correctly predicted positive observations to all actual positive observations:
    Recall = TP / (TP + FN)

When trying to identify POIs, we want as few false positives as possible; we don't want to falsely identify anyone as a POI. So I'd say precision is a bit more important here. Let's see how each list does using each classifier.
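First, a toy illustration of those formulas on made-up labels (hypothetical data, not our Enron set):

#3 true positives in the labels; predictions give TP=2, FP=1, FN=1, TN=6
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print 'Accuracy: ', accuracy_score(y_true, y_pred)    #(2+6)/10 = 0.8
print 'Precision: ', precision_score(y_true, y_pred)  #2/(2+1) = 0.67
print 'Recall: ', recall_score(y_true, y_pred)        #2/(2+1) = 0.67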

#full list
print features_list
print 'GaussianNB: ', test_list(GaussianNB(),features_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),features_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),features_list,enron_dict)
['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'from_messages', 'other', 'from_this_person_to_poi', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_be', 'fraction_to_poi', 'fraction_from_poi', 'total_millions']
GaussianNB:  {'Recall': 0.83333333333333337, 'Precision': 0.26315789473684209, 'Accuracy': 0.65909090909090906}
DecisionTree:  {'Recall': 0.16666666666666666, 'Precision': 0.10000000000000001, 'Accuracy': 0.68181818181818177}
KNeighbors:  {'Recall': 0.33333333333333331, 'Precision': 0.66666666666666663, 'Accuracy': 0.88636363636363635}
#first list
print first_list
print 'GaussianNB: ', test_list(GaussianNB(),first_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),first_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),first_list,enron_dict)
['poi', 'total_millions', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.5, 'Precision': 0.59999999999999998, 'Accuracy': 0.87804878048780488}
DecisionTree:  {'Recall': 0.0, 'Precision': 0.0, 'Accuracy': 0.75609756097560976}
KNeighbors:  {'Recall': 0.0, 'Precision': 0.0, 'Accuracy': 0.85365853658536583}
#second list
print second_list
print 'GaussianNB: ', test_list(GaussianNB(),second_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),second_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),second_list,enron_dict)
['poi', 'total_be', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.33333333333333331, 'Precision': 0.33333333333333331, 'Accuracy': 0.79487179487179482}
DecisionTree:  {'Recall': 0.33333333333333331, 'Precision': 0.33333333333333331, 'Accuracy': 0.79487179487179482}
KNeighbors:  {'Recall': 0.33333333333333331, 'Precision': 0.40000000000000002, 'Accuracy': 0.82051282051282048}
#third list
print third_list
print 'GaussianNB: ', test_list(GaussianNB(),third_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),third_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),third_list,enron_dict)
['poi', 'salary', 'bonus', 'fraction_to_poi', 'from_messages']
GaussianNB:  {'Recall': 0.40000000000000002, 'Precision': 0.2857142857142857, 'Accuracy': 0.76470588235294112}
DecisionTree:  {'Recall': 0.80000000000000004, 'Precision': 0.40000000000000002, 'Accuracy': 0.79411764705882348}
KNeighbors:  {'Recall': 0.40000000000000002, 'Precision': 0.66666666666666663, 'Accuracy': 0.88235294117647056}
#fourth_list
print fourth_list
print 'GaussianNB: ', test_list(GaussianNB(),fourth_list,enron_dict)
print 'DecisionTree: ', test_list(DecisionTreeClassifier(),fourth_list,enron_dict)
print 'KNeighbors: ', test_list(KNeighborsClassifier(),fourth_list,enron_dict)
['poi', 'bonus', 'exercised_stock_options', 'fraction_to_poi']
GaussianNB:  {'Recall': 0.20000000000000001, 'Precision': 0.14285714285714285, 'Accuracy': 0.73684210526315785}
DecisionTree:  {'Recall': 0.40000000000000002, 'Precision': 0.22222222222222221, 'Accuracy': 0.73684210526315785}
KNeighbors:  {'Recall': 0.59999999999999998, 'Precision': 0.75, 'Accuracy': 0.92105263157894735}

With an accuracy score of 92%, a precision score of 75%, and a recall score of 60%...

Our final list includes 'poi', 'bonus', 'exercised_stock_options', and 'fraction_to_poi'.
Our final classifier will be KNeighbors.

Validation

We've already implemented our validation process above; here we'll discuss why it matters. Without validating our classifier on held-out testing data, we have no way of measuring its accuracy and reliability; training and testing on the same data only yields overly optimistic, overfit results. By using StratifiedShuffleSplit to split our data, we make sure the classes keep the same POI/non-POI ratio in training and testing and that datapoints are randomly assigned. Because our dataset is small, shuffling and splitting 1,000 times makes results more reliable, since nearly every datapoint gets used for both training and testing across the splits. Note, though, that the cell below keeps only the final split for its single fit; tester.py is what actually averages over all 1,000 splits (a sketch of that follows the cell). The only downside is the run time.

### Store to my_dataset for easy export below.
my_dataset = enron_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_final_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
#Validation using StratifiedShuffleSplit in order to evenly distribute the classes between training and test data
X = np.array(features)
y = np.array(labels)
sss = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.3, random_state=42)      
for train_index, test_index in sss:
    features_train, features_test = X[train_index], X[test_index]
    labels_train, labels_test = y[train_index], y[test_index]

#check for accuracy
    
clf = KNeighborsClassifier()
clf.fit(features_train, labels_train)

print clf.score(features_test, labels_test)
0.921052631579
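Since the loop above keeps only its final split for the single fit, here is a sketch of what using all 1,000 splits would look like: refit and score on every split and average the results (this mirrors what tester.py does, with precision and recall as well). It is slow, which is the run-time downside mentioned above.

#average accuracy over all 1000 stratified splits instead of just the last
split_scores = []
for train_index, test_index in StratifiedShuffleSplit(labels, n_iter=1000,
                                                      test_size=0.3,
                                                      random_state=42):
    clf_i = KNeighborsClassifier()
    clf_i.fit(X[train_index], y[train_index])
    split_scores.append(clf_i.score(X[test_index], y[test_index]))
print 'Mean accuracy over 1000 splits: ', np.mean(split_scores)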

Top

Tuning

Even though we've all but settled on KNeighbors, let's see if tuning the parameters of our Decision Tree Classifier makes a difference. Tuning can significantly change our performance metrics: parameters control for overfitting/underfitting, so adjusting them can certainly move the numbers.

DecisionTreeClassifier().get_params()
{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_split': 1e-07,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': None,
 'splitter': 'best'}
#Decision Tree
from sklearn.tree import DecisionTreeClassifier

#set min_samples_split to 3 and increase until no longer helpful
clf = DecisionTreeClassifier(min_samples_split=9)
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)
"""  Can we do better?  [0.71052631578947367, 0.20000000000000001, 0.40000000000000002]"""

#print performance metrics
print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.40000000000000002, 'Precision': 0.25, 'Accuracy': 0.76315789473684215}
#plot confusion matrix
skplt.metrics.plot_confusion_matrix(labels_test, pred, normalize=True)
<matplotlib.axes._subplots.AxesSubplot at 0x177c7828>

[Normalized confusion matrix: tuned Decision Tree predictions]

How about our KNeighbors Classifier? What are the parameters? If we changed any of them, would it make a difference?

KNeighborsClassifier().get_params()
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': 1,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
"""Can we do better without overfitting?  [0.92105263157894735, 0.75, 0.59999999999999998]"""

#set n_neighbors to 2 and increase until metrics show overfitting
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.59999999999999998, 'Precision': 1.0, 'Accuracy': 0.94736842105263153}

^OVERFIT! (Perfect precision on this small test set looks too good to trust.)

"""Can we still do better?  [0.92105263157894735, 0.75, 0.59999999999999998]"""

#set n_neighbors to 2,3, and 4

def test_param(n):
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)

    return [accuracy_score(labels_test,pred), precision_score(labels_test,pred), recall_score(labels_test,pred)]
print "2: ", test_param(2)
print "3: ", test_param(3)
print "4: ", test_param(4)
2:  [0.92105263157894735, 0.75, 0.59999999999999998]
3:  [0.89473684210526316, 0.59999999999999998, 0.59999999999999998]
4:  [0.92105263157894735, 0.75, 0.59999999999999998]

GridSearchCV

When we use GridSearchCV to find the best parameters for KNeighbors, the chosen estimator gives us essentially the same results as the defaults.

from sklearn.model_selection import GridSearchCV

k = np.arange(10)+1
leaf = np.arange(30)+1
params = {'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute'),'leaf_size': leaf,'n_neighbors': k}

clf_params = GridSearchCV(KNeighborsClassifier(), params, cv=5)
clf_params.fit(features_train,labels_train)
GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), 'leaf_size': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), 'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
clf_params.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Let's see what kind of results we get using the parameters "suggested" by GridSearchCV.

clf = KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
                           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                           weights='uniform')
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print {'Accuracy': accuracy_score(labels_test,pred),'Precision': precision_score(labels_test,pred),
       'Recall': recall_score(labels_test,pred)}
{'Recall': 0.59999999999999998, 'Precision': 0.75, 'Accuracy': 0.92105263157894735}

These are the same results as our default settings, so that's what we'll use.
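One refinement worth sketching (an assumption on my part; we didn't run this above): GridSearchCV optimizes plain accuracy by default, which can look fine on classes this imbalanced even when recall is poor. Scoring the grid on F1 and reusing stratified shuffle splits as the cv argument targets the metrics we actually care about. Expect it to be slow:

#hedged sketch: grid search scored on F1 over stratified shuffle splits
sss_cv = StratifiedShuffleSplit(labels, n_iter=10, test_size=0.3,
                                random_state=42)
clf_f1 = GridSearchCV(KNeighborsClassifier(), params, scoring='f1', cv=sss_cv)
clf_f1.fit(X, y)
print clf_f1.best_params_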

Top

FINAL FEATURES and ALGORITHM SELECTION

Our final features, based on our feature analysis and testing, will be:

  • Bonus
  • Exercised Stock Options
  • Fraction to POI

The KNeighbors Classifier, even without any parameter tuning, had a higher accuracy score than the Decision Tree Classifier with its min_samples_split tuned to 9, so we're going to use KNeighbors in our POI identifier. KNearest Neighbors helps us zero in on pockets of POIs/non-POIs within our testing data. The features I selected work well with this classifier because 'bonus' and 'exercised_stock_options' train the algorithm to pick up on POI compensation trends, while 'fraction_to_poi' helps it pick up on POI e-mail behavior. I narrowed it down to three features so as not to create noise: when running our longer feature lists, we saw precision and recall drop to zero, so cutting down to three was the strategy moving forward. Having already gotten rid of features with lots of missing data, it was easier to narrow the rest down and test each list against our chosen classifiers.
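One caveat before the final run: KNeighbors is distance-based, and our three features live on very different scales (bonus in the millions, fraction_to_poi between 0 and 1), so raw distances are dominated by the money features. As a hedged sketch, and not the pipeline validated above, MinMaxScaler would put them on an equal footing:

from sklearn.preprocessing import MinMaxScaler

#fit the scaler on training data only, to avoid leaking test information
scaler = MinMaxScaler()
features_train_scaled = scaler.fit_transform(features_train)
features_test_scaled = scaler.transform(features_test)

clf_scaled = KNeighborsClassifier()
clf_scaled.fit(features_train_scaled, labels_train)
print 'Accuracy with scaled features: ', clf_scaled.score(features_test_scaled, labels_test)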

clf = KNeighborsClassifier()
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

print accuracy_score(labels_test,pred)
print precision_score(labels_test,pred)
print recall_score(labels_test,pred)
0.921052631579
0.75
0.6
#plot confusion matrix
skplt.metrics.plot_confusion_matrix(labels_test, pred, normalize=True)
<matplotlib.axes._subplots.AxesSubplot at 0x12c6bcf8>

[Normalized confusion matrix: final KNeighbors predictions]

### Task 6: Dump your classifier, dataset, and features_list 
features_list = features_final_list
dump_classifier_and_data(clf, my_dataset, features_list)
#test performance using tester.py

%run 'tester.py'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.87854	Precision: 0.71113	Recall: 0.35450	F1: 0.47314	F2: 0.39402
	Total predictions: 13000	True positives:  709	False positives:  288	False negatives: 1291	True negatives: 10712

Final Thoughts

With an accuracy of 85-95%, precision of 65-75%, and recall of 35-60%, I think our algorithm has done well considering the small amount of data we had to work with, and our validation methods proved useful in creating a reliable algorithm. Still, I don't think I'd want to bet anyone's life in prison on any algorithm trained on this data; we could use more data! Our Naive Bayes and Decision Tree classifiers didn't perform as well. And although overfitting becomes a problem as we change 'n_neighbors' in our KNeighbors Classifier, we can avoid it by keeping the default setting of 5. That left us with a working algorithm and some pretty solid evaluation metrics. According to our confusion matrix, we were able to correctly identify 97% of non-POIs and 60% of POIs. I'd rather miss a few POIs than falsely identify a non-POI as a POI. There may come a day when people are convicted based on machine learning, so it's important that we be as accurate as possible; as it stands, this identifier still misses around 40% of actual POIs.

%run 'poi_id.py'
%run 'tester.py'
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
	Accuracy: 0.87854	Precision: 0.71113	Recall: 0.35450	F1: 0.47314	F2: 0.39402
	Total predictions: 13000	True positives:  709	False positives:  288	False negatives: 1291	True negatives: 10712
