# Enron Email Dataset Mini-Project

Datasets and Questions Mini-Project
---
The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We’ve combined the email and finance data into a single dataset, which you’ll explore in this mini-project.

Quizzes: Size of the Enron Dataset and Features in the Enron Dataset
--
The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
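
As a rough illustration of the layout described above, the sketch below pickles and reloads a tiny stand-in dictionary. The names `toy_data`, `PERSON A` and `PERSON B` and all values are made up; only the dict-of-dicts shape and the use of pickle come from the description.

```python
import pickle

# Hypothetical stand-in for the E+F dictionary: person name -> feature dict.
toy_data = {
    "PERSON A": {"salary": 100000, "poi": False, "email_address": "a@example.com"},
    "PERSON B": {"salary": "NaN", "poi": True, "email_address": "NaN"},
}

# Round-trip through pickle bytes, much as a .pkl file on disk would.
loaded = pickle.loads(pickle.dumps(toy_data))

num_people = len(loaded)
# Assumes every person carries the same feature keys, so one record is enough.
num_features = len(next(iter(loaded.values())))
print(num_people, num_features)
```

With the real file, the same two `len` calls would answer the question below without needing pandas.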

How many data points (people) are in the dataset? And for each person, how many features are available?

In [1]:
import pandas as pd

df = pd.DataFrame(pd.read_pickle('../final_project/final_project_dataset_unix2dos.pkl')).T
print('There are {} data points (i.e. people) and {} features available for each person.'.format(*df.shape))

There are 146 data points (i.e. people) and 21 features available for each person.


After loading the dataset, I started exploring its features.

In [2]:
df.columns.tolist()

['bonus',
 'deferral_payments',
 'deferred_income',
 'director_fees',
 'email_address',
 'exercised_stock_options',
 'expenses',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'loan_advances',
 'long_term_incentive',
 'other',
 'poi',
 'restricted_stock',
 'restricted_stock_deferred',
 'salary',
 'shared_receipt_with_poi',
 'to_messages',
 'total_payments',
 'total_stock_value']

Quiz: Finding POIs in the Enron Data
--
The "poi" feature records whether the person is a person of interest, according to our definition. 

How many POIs are there in the E+F dataset?

In [3]:
num_pois = sum(df.poi == 1)
print('There are {} POIs in the dataset.'.format(num_pois))

There are 18 POIs in the dataset.


Quiz: How Many POIs Exist?
--
We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).

How many POIs were there in total?

In [4]:
pois = pd.read_csv('../final_project/poi_names.txt')

print('There are {} POIs in total.'.format(len(pois)))

There are 35 POIs in total.
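
`pd.read_csv` arrives at this count by treating the text file as a table. A more explicit alternative is to count the name lines directly. The sketch below assumes that each name line starts with a `(y)` or `(n)` marker and that any other lines (a source URL, blank lines) should be skipped. That layout and the sample text are assumptions, so check them against the actual file before relying on the result.

```python
import re

# Made-up sample in the assumed layout; not the real contents of poi_names.txt.
sample_text = """http://example.com/source

(y) Doe, Jane
(n) Roe, Richard
(n) Poe, Alex
"""

# A name line is assumed to begin with "(y)" or "(n)" followed by a name.
name_line = re.compile(r"^\((y|n)\)\s+\S")

poi_names = [line for line in sample_text.splitlines() if name_line.match(line)]
print(len(poi_names))
```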


Quiz: Query the Dataset 1
--
What is the total value of the stock belonging to James Prentice?

In [5]:
print('The total value of the stock belonging to James Prentice is {}.'.format(df.loc['PRENTICE JAMES'].total_stock_value))

The total value of the stock belonging to James Prentice is 1095040.


Quiz: Query the Dataset 2
--
How many email messages do we have from Wesley Colwell to persons of interest?

In [6]:
print('We have {} email messages from Wesley Colwell to persons of interest.'.format(df.loc['COLWELL WESLEY'].from_this_person_to_poi))

We have 11 email messages from Wesley Colwell to persons of interest.


Quiz: Query the Dataset 3
--
What’s the value of stock options exercised by Jeffrey K Skilling?

In [7]:
print('The value of stock options exercised by Jeffrey K Skilling is {}.'.format(df.loc['SKILLING JEFFREY K'].exercised_stock_options))

The value of stock options exercised by Jeffrey K Skilling is 19250000.


Quiz: Follow the Money
--
Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of ``total_payments`` feature)?

How much money did that person get?

In [8]:
# Map display names to the keys used in the dataset.
people = {'Jeffrey Skilling': 'SKILLING JEFFREY K',
          'Kenneth Lay': 'LAY KENNETH L',
          'Andrew Fastow': 'FASTOW ANDREW S'}
total_payments = {name: df.loc[key].total_payments for name, key in people.items()}
for name, amount in total_payments.items():
    print('{}\'s total payments: {}'.format(name, amount))
# Pick the person with the largest total_payments value.
top = max(total_payments, key=total_payments.get)
print('{} took home the most money: {}.'.format(top, total_payments[top]))

Jeffrey Skilling's total payments: 8682716
Kenneth Lay's total payments: 103559793
Andrew Fastow's total payments: 2424083
Kenneth Lay took home the most money: 103559793.


Quiz: Dealing with Unfilled Features
--
How many folks in this dataset have a quantified salary? What about a known email address?

In [9]:
display(df[df.salary != 'NaN'].head())
display(df[df.email_address != 'NaN'].head())
print('{} folks in this dataset have a quantified salary and {} have a known email address.'\
      .format(len(df[df.salary != 'NaN']), len(df[df.email_address != 'NaN'])))

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
METTS MARK,600000,,,,mark.metts@enron.com,,94299,29.0,38.0,1.0,...,,1740,False,585062,,365788,702.0,807.0,1061827,585062
BAXTER JOHN C,1200000,1295738.0,-1386055.0,,,6680544.0,11200,,,,...,1586055.0,2660303,False,3942714,,267102,,,5634343,10623258
ELLIOTT STEVEN,350000,,-400729.0,,steven.elliott@enron.com,4890344.0,78552,,,,...,,12961,False,1788391,,170941,,,211725,6678735
HANNON KEVIN P,1500000,,-3117011.0,,kevin.hannon@enron.com,5538001.0,34039,32.0,32.0,21.0,...,1617011.0,11350,True,853064,,243293,1035.0,1045.0,288682,6391065
MORDAUNT KRISTINA M,325000,,,,kristina.mordaunt@enron.com,,35018,,,,...,,1411,False,208510,,267093,,,628522,208510


Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
METTS MARK,600000.0,,,,mark.metts@enron.com,,94299.0,29.0,38.0,1.0,...,,1740.0,False,585062,,365788.0,702.0,807.0,1061827.0,585062
ELLIOTT STEVEN,350000.0,,-400729.0,,steven.elliott@enron.com,4890344.0,78552.0,,,,...,,12961.0,False,1788391,,170941.0,,,211725.0,6678735
CORDES WILLIAM R,,,,,bill.cordes@enron.com,651850.0,,12.0,10.0,0.0,...,,,False,386335,,,58.0,764.0,,1038185
HANNON KEVIN P,1500000.0,,-3117011.0,,kevin.hannon@enron.com,5538001.0,34039.0,32.0,32.0,21.0,...,1617011.0,11350.0,True,853064,,243293.0,1035.0,1045.0,288682.0,6391065
MORDAUNT KRISTINA M,325000.0,,,,kristina.mordaunt@enron.com,,35018.0,,,,...,,1411.0,False,208510,,267093.0,,,628522.0,208510


95 folks in this dataset have a quantified salary and 111 have a known email address.
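
The same kind of count can be taken from the raw dictionary without pandas. This is a minimal sketch on made-up data: `count_filled` is a hypothetical helper, and it relies on missing values being stored as the string `'NaN'`, as the comparisons above do.

```python
def count_filled(data, feature):
    """Count people whose value for `feature` is not the 'NaN' placeholder."""
    return sum(1 for features in data.values() if features[feature] != "NaN")


# Made-up records in the dict-of-dicts layout.
toy_data = {
    "PERSON A": {"salary": 100000, "email_address": "a@example.com"},
    "PERSON B": {"salary": "NaN", "email_address": "b@example.com"},
    "PERSON C": {"salary": "NaN", "email_address": "NaN"},
}

print(count_filled(toy_data, "salary"), count_filled(toy_data, "email_address"))
```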


Quiz: Missing POIs 1
--
How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [10]:
display(df[df.total_payments == 'NaN'].head())
num_total_payments_nan = len(df[df.total_payments == 'NaN'])
print('{} people ({}%) in the dataset have "NaN" for their total payments.'\
     .format(num_total_payments_nan, round(num_total_payments_nan/len(df)*100)))

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
CORDES WILLIAM R,,,,,bill.cordes@enron.com,651850.0,,12.0,10.0,0.0,...,,,False,386335.0,,,58.0,764.0,,1038185.0
LOWRY CHARLES P,,,,,,372205.0,,,,,...,,,False,153686.0,-153686.0,,,,,372205.0
CHAN RONNIE,,,-98784.0,98784.0,,,,,,,...,,,False,32460.0,-32460.0,,,,,
WHALEY DAVID A,,,,,,98718.0,,,,,...,,,False,,,,,,,98718.0
CLINE KENNETH W,,,,,,,,,,,...,,,False,662086.0,-472568.0,,,,,189518.0


21 people (14%) in the dataset have "NaN" for their total payments.


Quiz: Missing POIs 2
--
How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this?

In [11]:
total_payments_nan_and_poi = df[(df.total_payments == 'NaN') & (df.poi == True)]
display(total_payments_nan_and_poi)

num_total_payments_nan_and_poi = len(total_payments_nan_and_poi)
# The percentage is taken over the number of POIs (num_pois), not over everyone in the dataset.
print('{} POIs ({}%) in the E+F dataset have "NaN" for their total payments.'\
      .format(num_total_payments_nan_and_poi, num_total_payments_nan_and_poi/num_pois*100))

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value


0 POIs (0.0%) in the E+F dataset have "NaN" for their total payments.
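
A percentage of POIs is measured against the number of POIs, while a percentage of everyone is measured against the whole dataset. The sketch below makes the denominator explicit on made-up records. `missing_share` is a hypothetical helper, not part of the course code.

```python
def missing_share(data, feature, only_poi=False):
    """Return (missing_count, group_size, percent_missing) for `feature`."""
    # Restrict to POIs when requested; otherwise use every record.
    group = [f for f in data.values() if f["poi"] or not only_poi]
    missing = sum(1 for f in group if f[feature] == "NaN")
    # Guard against an empty group to avoid dividing by zero.
    percent = 100.0 * missing / len(group) if group else 0.0
    return missing, len(group), percent


# Made-up records in the dict-of-dicts layout.
toy_data = {
    "PERSON A": {"poi": True, "total_payments": 500},
    "PERSON B": {"poi": False, "total_payments": "NaN"},
    "PERSON C": {"poi": False, "total_payments": "NaN"},
    "PERSON D": {"poi": True, "total_payments": "NaN"},
}

print(missing_share(toy_data, "total_payments"))
print(missing_share(toy_data, "total_payments", only_poi=True))
```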


Quiz: Missing POIs 4
--
If you added in, say, 10 more data points which were all POI's, and put "NaN" for the total payments for those folks, the numbers you just calculated would change.

What is the new number of people in the dataset? What is the new number of folks with "NaN" for total payments?

In [12]:
num_new_pois = 10
print('The dataset now has {} data points and the number of folks with "NaN" for total payments is {}.'\
      .format(len(df) + num_new_pois, num_total_payments_nan + num_new_pois))

The dataset now has 156 data points and the number of folks with "NaN" for total payments is 31.


Quiz: Missing POIs 5
--
What is the new number of POIs in the dataset? What is the new number of POIs with NaN for total_payments?

In [13]:
print('The new number of POIs in the dataset is {} and the new number of POIs with NaN for total_payments is {}.'\
     .format(num_pois + num_new_pois, num_total_payments_nan_and_poi + num_new_pois))

The new number of POIs in the dataset is 28 and the new number of POIs with NaN for total_payments is 10.
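
To see how far the ten hypothetical points would shift things, the arithmetic can be spelled out from the counts in the outputs above: 146 people, 18 POIs, none of them with "NaN" total payments, and 31 - 10 = 21 people with "NaN" overall. This is only a worked calculation on those numbers; the `percent` helper is introduced here for illustration.

```python
# Counts taken from the outputs earlier in this notebook.
num_people, num_nan = 146, 21
num_pois, num_nan_pois = 18, 0
num_new_pois = 10  # hypothetical extra POIs, all with 'NaN' total payments


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Each tuple is (share among everyone, share among POIs).
before = (percent(num_nan, num_people), percent(num_nan_pois, num_pois))
after = (
    percent(num_nan + num_new_pois, num_people + num_new_pois),
    percent(num_nan_pois + num_new_pois, num_pois + num_new_pois),
)
print("NaN share (everyone, POIs) before:", before)
print("NaN share (everyone, POIs) after: ", after)
```

Among POIs the share of missing values goes from 0% to roughly 36%, so a learning algorithm could end up treating a missing `total_payments` as a hint that someone is a POI.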
