The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We’ve combined the email and finance data into a single dataset, which you’ll explore in this mini-project.

#!/usr/bin/python

""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

In [1]:
import pickle

In [2]:
# open the pickle in binary mode and close the file handle when done
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
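To make the structure concrete, here is a minimal sketch of a dict of dicts being pickled and loaded back. The two records and their values are invented for illustration (they are not taken from the real dataset), and an in-memory buffer stands in for the .pkl file.

```python
import io
import pickle

# Toy records, invented for illustration only.
toy_data = {
    "DOE JANE A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "ROE RICHARD": {"salary": "NaN", "bonus": 5000, "poi": True},
}

# Serialize to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(toy_data, buffer)

# Rewind and load the object back, the same way pickle.load reads a .pkl file.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == toy_data)
print(sorted(restored["DOE JANE A"].keys()))
```

The loaded object is an ordinary dictionary again, so everything that follows is plain dictionary manipulation.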
How many data points (people) are in the dataset?

In [12]:
print 'people = %s' % len(enron_data)

people = 146


For each person, how many features are available?

In [14]:
print 'features = %s' % len(enron_data[enron_data.keys()[0]])

features = 21


The “poi” feature records whether the person is a person of interest, according to our definition. How many POIs are there in the E+F dataset?

In [42]:
print 'poi\'s in dataset = %s' % sum(1 for _,v in enron_data.iteritems() if v['poi'] == True)

poi's in dataset = 18


We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).
How many POIs were there in total? (Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.)

In [46]:
f = open('../final_project/poi_names.txt','r')
print 'poi\'s names = %s' % sum(1 for line in f.readlines()[1:] if line.rstrip())
f.close()

poi's names = 35


As you can see, we have many of the POIs in our E+F dataset, but not all of them. Why is that a potential problem?
We will return to this later to explain how a POI could end up not being in the Enron E+F dataset, so you fully understand the issue before moving on.
Like any dict of dicts, individual people/features can be accessed like so:
enron_data["LASTNAME FIRSTNAME"]["feature_name"]
or, sometimes 
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]
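Because some keys carry a middle initial and others do not, guessing the exact key can fail with a KeyError. One possible workaround, sketched here with a hypothetical helper named find_people (it is not part of the starter code), is to search the keys by prefix. The toy dictionary reuses only values quoted elsewhere in this notebook.

```python
def find_people(data, last_name, first_name):
    """Return all keys for "LASTNAME FIRSTNAME", with or without a middle initial."""
    prefix = "%s %s" % (last_name.upper(), first_name.upper())
    return sorted(
        name for name in data
        if name == prefix or name.startswith(prefix + " ")
    )

# Toy subset of the dataset, using the same key style.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000},
    "PRENTICE JAMES": {"total_stock_value": 1095040},
}

print(find_people(toy_data, "Skilling", "Jeffrey"))
print(find_people(toy_data, "Prentice", "James"))
```

The helper returns a list rather than a single key, since a prefix could in principle match more than one person.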
What is the total value of the stock belonging to James Prentice?

In [52]:
print 'total value of the stock belonging to James Prentice = %s' % enron_data['PRENTICE JAMES']['total_stock_value']

total value of the stock belonging to James Prentice = 1095040


How many email messages do we have from Wesley Colwell to persons of interest?

In [54]:
print 'email messages from Wesley Colwell to persons of interest = %s' % enron_data['COLWELL WESLEY']['from_this_person_to_poi']

email messages from Wesley Colwell to persons of interest = 11


What’s the value of stock options exercised by Jeffrey Skilling?

In [59]:
print 'value of stock options exercised by Jeffrey Skilling = %s' % enron_data['SKILLING JEFFREY K']['exercised_stock_options']

value of stock options exercised by Jeffrey Skilling = 19250000


In the coming lessons, we’ll talk about how the best features are often motivated by our human understanding of the problem at hand. In this case, that means knowing a little about the story of the Enron fraud.
If you have an hour and a half to spare, “Enron: The Smartest Guys in the Room” is a documentary that gives an amazing overview of the story. Alternatively, there are plenty of archival newspaper stories that chronicle the rise and fall of Enron.
Which of these schemes was Enron not involved in?
    - selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
    - causing electrical grid failures in California
    - illegally obtaining a government report that enabled them to corner the market on frozen concentrated orange juice futures
    - conspiring to give a Saudi prince expedited American citizenship
    - a plan in collaboration with Blockbuster movies to stream movies over the internet

Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)?

How much money did that person get?

In [73]:
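from operator import itemgetter

ceo = 'SKILLING JEFFREY K'
chairman = 'LAY KENNETH L'
cfo = 'FASTOW ANDREW S'
# pair each name with its total_payments value, then pick the largest
payments = [(name, enron_data[name]['total_payments']) for name in [ceo, chairman, cfo]]
# print the tuple directly: '%s' % a_two_tuple would treat it as two format arguments and raise TypeError
print max(payments, key=itemgetter(1))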

('LAY KENNETH L', 103559793)

For nearly every person in the dataset, not every feature has a value. How is it denoted when a feature doesn’t have a well-defined value?

In [74]:
enron_data[enron_data.keys()[0]]

{'bonus': 600000,
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'mark.metts@enron.com',
 'exercised_stock_options': 'NaN',
 'expenses': 94299,
 'from_messages': 29,
 'from_poi_to_this_person': 38,
 'from_this_person_to_poi': 1,
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 1740,
 'poi': False,
 'restricted_stock': 585062,
 'restricted_stock_deferred': 'NaN',
 'salary': 365788,
 'shared_receipt_with_poi': 702,
 'to_messages': 807,
 'total_payments': 1061827,
 'total_stock_value': 585062}

How many folks in this dataset have a quantified salary? What about a known email address?

In [107]:
# this should be done with pandas isnull
print 'quantified salary = %s\nknown email address = %s' % \
    (sum(1 for _,v in enron_data.iteritems() if isinstance(v['salary'],int)),
     sum(1 for _,v in enron_data.iteritems() if v['email_address'] != 'NaN'))

quantified salary = 95
known email address = 111
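Note that missing values are stored as the string 'NaN', not as a floating-point NaN, so a plain string comparison works for every feature, including non-numeric ones such as email_address. A small sketch of that idea follows; count_known is a hypothetical helper and the records are invented for illustration.

```python
def count_known(data, feature):
    """Count people whose value for `feature` is not the string 'NaN'."""
    return sum(
        1 for features in data.values()
        if features.get(feature, "NaN") != "NaN"
    )

# Invented records; the names and addresses are placeholders.
toy_data = {
    "DOE JANE A": {"salary": 100000, "email_address": "NaN"},
    "ROE RICHARD": {"salary": "NaN", "email_address": "richard.roe@example.com"},
    "POE EDGAR": {"salary": 250000, "email_address": "edgar.poe@example.com"},
}

print(count_known(toy_data, "salary"))
print(count_known(toy_data, "email_address"))
```

Using .items() or .values() instead of .iteritems() keeps the same logic working on both Python 2 and Python 3.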


A Python dictionary can’t be read directly into an sklearn classification or regression algorithm; instead, it needs a numpy array or a list of lists, where each inner list is one data point and its elements are the features of that point.
We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.
In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
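The conversion described above can be sketched roughly as follows. This is not the implementation in tools/feature_format.py: to_feature_rows and split_target are hypothetical stand-ins, they return plain lists rather than a numpy array, and treating the first column as the target is an assumption of this sketch.

```python
def to_feature_rows(data, feature_names):
    """Build one row per person (in sorted key order), replacing 'NaN' with 0."""
    rows = []
    for name in sorted(data):
        row = []
        for feature in feature_names:
            value = data[name][feature]
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows

def split_target(rows):
    """Treat the first column as the target and the remaining columns as features."""
    targets = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return targets, features

# Invented records for illustration.
toy_data = {
    "DOE JANE A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "ROE RICHARD": {"poi": True, "salary": "NaN", "bonus": 5000},
}

rows = to_feature_rows(toy_data, ["poi", "salary", "bonus"])
targets, features = split_target(rows)
print(targets)
print(features)
```

Replacing a missing value with 0 is convenient, but it makes "unknown" indistinguishable from a genuine zero, which matters for the questions that follow.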

As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POIs (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POIs, we do have emails.
While it would be straightforward to add these POIs and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.
How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [120]:
import math
print 'percentage of people with “NaN” for their total payments = %s' % \
    (100 * sum(1 for _,v in enron_data.iteritems() if math.isnan(float(v['total_payments']))) / float(len(enron_data)))


percentage of people with “NaN” for their total payments = 14.3835616438


How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this?

In [125]:
import math
# count the POIs, then the POIs whose total_payments is missing
n_poi = sum(1 for _,v in enron_data.iteritems() if v['poi'])
n_poi_nan = sum(1 for _,v in enron_data.iteritems() if v['poi'] and math.isnan(float(v['total_payments'])))
print 'percentage of POIs with "NaN" for their total payments = %s' % (100 * n_poi_nan / float(n_poi))

percentage of POIs with "NaN" for their total payments = 0.0


If you added in, say, 10 more data points which were all POIs, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.
What is the new number of people in the dataset? What is the new number of folks with “NaN” for total payments?

In [129]:
print 'new number of people in the dataset = %s\nnew number of people with "NaN" for their total payments = %s' % \
    (len(enron_data) + 10,
    10 + sum(1 for _,v in enron_data.iteritems() if math.isnan(float(v['total_payments']))))

new number of people in the dataset = 156
new number of people with "NaN" for their total payments = 31


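What is the new number of POIs in the dataset? What percentage of them have “NaN” for their total payments?

In [132]:
# the 10 hypothetical POIs all have missing total_payments; count them alongside the existing POIs
new_poi = 10 + sum(1 for _,v in enron_data.iteritems() if v['poi'])
new_poi_nan = 10 + sum(1 for _,v in enron_data.iteritems() if v['poi'] and math.isnan(float(v['total_payments'])))
print 'new number of POIs in dataset = %s\nnew number of POIs with "NaN" for their total payments = %s\nnew percentage of POIs with "NaN" for their total payments = %s' % \
    (new_poi, new_poi_nan, 100 * new_poi_nan / float(new_poi))

new number of POIs in dataset = 28
new number of POIs with "NaN" for their total payments = 10
new percentage of POIs with "NaN" for their total payments = 35.7142857143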
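The arithmetic behind this thought experiment can be sketched using only the counts reported in the cells above: 146 people, 18 POIs, 21 people with “NaN” total payments (14.38% of 146), none of them POIs. The function name nan_rates is made up for this illustration.

```python
def nan_rates(n_people, n_poi, n_nan, n_poi_nan):
    """Return (NaN rate among POIs, NaN rate among non-POIs) as percentages."""
    poi_rate = 100.0 * n_poi_nan / n_poi
    non_poi_rate = 100.0 * (n_nan - n_poi_nan) / (n_people - n_poi)
    return poi_rate, non_poi_rate

# Counts taken from the outputs earlier in this notebook.
n_people, n_poi, n_nan, n_poi_nan = 146, 18, 21, 0

before = nan_rates(n_people, n_poi, n_nan, n_poi_nan)

# Hypothetical: add 10 POIs whose total_payments are all 'NaN'.
added = 10
after = nan_rates(n_people + added, n_poi + added, n_nan + added, n_poi_nan + added)

print("before: POI %.1f%%, non-POI %.1f%%" % before)
print("after:  POI %.1f%%, non-POI %.1f%%" % after)
```

Before the addition, a missing total_payments value occurs only among non-POIs; afterwards it is more than twice as common among POIs as among non-POIs. A classifier could then learn to treat “NaN” as evidence of being a POI, even though that pattern would come from how the data was assembled rather than from anything the people did.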
