In [1]:
#!/usr/bin/python

""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

import pickle

# Open the pickle in binary mode and close the file handle once it is loaded
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)


## Getting an idea of the dataset

The aggregated _Enron email + financial dataset_ is stored in a dictionary, where each key is a person’s name and each value is another dictionary containing all the features of that person.

In other words, it is a nested dictionary (a dict of dicts).
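To make the nesting concrete, here is a minimal sketch that uses a hand-copied subset of two records shown in this notebook. The name `sample_data` is illustrative only and is not part of the starter code.

```python
# Hand-copied subset of two records from this notebook (not the full dataset).
sample_data = {
    "METTS MARK": {"salary": 365788, "bonus": 600000, "poi": False},
    "SKILLING JEFFREY K": {"salary": 1111258, "bonus": 5600000, "poi": True},
}

# The outer key selects a person; the inner key selects one of their features.
skilling_bonus = sample_data["SKILLING JEFFREY K"]["bonus"]

# Iterating over the outer dict yields the names, so they can be sorted
# by any inner feature, here by bonus in ascending order.
names_by_bonus = sorted(sample_data, key=lambda name: sample_data[name]["bonus"])
```

The same two-level indexing works on the full `enron_data` dictionary.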

In [2]:
type(enron_data)

dict

In [3]:
from pprint import pprint
c = 0
for i in enron_data.items():
    if c < 3:
        c += 1
        pprint(i)
        print('\n')
    else:
        break
    

('METTS MARK',
 {'bonus': 600000,
  'deferral_payments': 'NaN',
  'deferred_income': 'NaN',
  'director_fees': 'NaN',
  'email_address': 'mark.metts@enron.com',
  'exercised_stock_options': 'NaN',
  'expenses': 94299,
  'from_messages': 29,
  'from_poi_to_this_person': 38,
  'from_this_person_to_poi': 1,
  'loan_advances': 'NaN',
  'long_term_incentive': 'NaN',
  'other': 1740,
  'poi': False,
  'restricted_stock': 585062,
  'restricted_stock_deferred': 'NaN',
  'salary': 365788,
  'shared_receipt_with_poi': 702,
  'to_messages': 807,
  'total_payments': 1061827,
  'total_stock_value': 585062})


('BAXTER JOHN C',
 {'bonus': 1200000,
  'deferral_payments': 1295738,
  'deferred_income': -1386055,
  'director_fees': 'NaN',
  'email_address': 'NaN',
  'exercised_stock_options': 6680544,
  'expenses': 11200,
  'from_messages': 'NaN',
  'from_poi_to_this_person': 'NaN',
  'from_this_person_to_poi': 'NaN',
  'loan_advances': 'NaN',
  'long_term_incentive': 1586055,
  'other': 2660303,
  'poi

# Number of data points (people) in the dataset

In [4]:
len(enron_data) 

146

In [5]:
len(enron_data.keys())

146

In [6]:
len(enron_data.values())

146

## Number of features per person in the Enron dataset

In [7]:
len(enron_data['METTS MARK']) 

21

## Finding the number of POIs (persons of interest) in the Enron data

In [8]:
poi_count = 0
for key1 in enron_data.keys():
    if enron_data[key1]['poi'] == True:
        poi_count += 1

poi_count

18
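The counting loop above can also be written as a single generator expression. This is a sketch on a four-person subset copied from records shown in this notebook; `sample_data` and `sample_poi_count` are illustrative names, not part of the starter code.

```python
# 'poi' flags copied from the records displayed in this notebook.
sample_data = {
    "METTS MARK": {"poi": False},
    "LAY KENNETH L": {"poi": True},
    "SKILLING JEFFREY K": {"poi": True},
    "FASTOW ANDREW S": {"poi": True},
}

# 'poi' is already a boolean, so it can be tested directly without '== True'.
sample_poi_count = sum(1 for features in sample_data.values() if features["poi"])
```

Applied to the full `enron_data`, the same expression should reproduce the count from the loop.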

# Querying the dataset

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]

or

enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]

### What is the total value of the stock belonging to James Prentice?

In [9]:
enron_data['PRENTICE JAMES']['total_stock_value']

1095040

### How many email messages do we have from Wesley Colwell to persons of interest?

In [10]:
enron_data['COLWELL WESLEY']['from_this_person_to_poi']

11

### What’s the value of stock options exercised by Jeffrey K Skilling?

In [11]:
enron_data['SKILLING JEFFREY K']['exercised_stock_options']

19250000

## The 3 central figures of the Enron scandal

In [12]:
enron_data['LAY KENNETH L']

{'bonus': 7000000,
 'deferral_payments': 202911,
 'deferred_income': -300000,
 'director_fees': 'NaN',
 'email_address': 'kenneth.lay@enron.com',
 'exercised_stock_options': 34348384,
 'expenses': 99832,
 'from_messages': 36,
 'from_poi_to_this_person': 123,
 'from_this_person_to_poi': 16,
 'loan_advances': 81525000,
 'long_term_incentive': 3600000,
 'other': 10359729,
 'poi': True,
 'restricted_stock': 14761694,
 'restricted_stock_deferred': 'NaN',
 'salary': 1072321,
 'shared_receipt_with_poi': 2411,
 'to_messages': 4273,
 'total_payments': 103559793,
 'total_stock_value': 49110078}

In [13]:
enron_data['SKILLING JEFFREY K']

{'bonus': 5600000,
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'jeff.skilling@enron.com',
 'exercised_stock_options': 19250000,
 'expenses': 29336,
 'from_messages': 108,
 'from_poi_to_this_person': 88,
 'from_this_person_to_poi': 30,
 'loan_advances': 'NaN',
 'long_term_incentive': 1920000,
 'other': 22122,
 'poi': True,
 'restricted_stock': 6843672,
 'restricted_stock_deferred': 'NaN',
 'salary': 1111258,
 'shared_receipt_with_poi': 2042,
 'to_messages': 3627,
 'total_payments': 8682716,
 'total_stock_value': 26093672}

In [14]:
enron_data['FASTOW ANDREW S']

{'bonus': 1300000,
 'deferral_payments': 'NaN',
 'deferred_income': -1386055,
 'director_fees': 'NaN',
 'email_address': 'andrew.fastow@enron.com',
 'exercised_stock_options': 'NaN',
 'expenses': 55921,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 1736055,
 'other': 277464,
 'poi': True,
 'restricted_stock': 1794412,
 'restricted_stock_deferred': 'NaN',
 'salary': 440698,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 2424083,
 'total_stock_value': 1794412}

### How many folks in this dataset have a quantified salary?

In [15]:
count_qsal = 0
for k in enron_data.keys():
    if enron_data[k]['salary'] != 'NaN':
        count_qsal += 1

count_qsal

95

### How many folks in this dataset have a known email address?

In [16]:
count_email = 0
for k in enron_data.keys():
    if enron_data[k]['email_address'] != 'NaN':
        count_email += 1

count_email

111
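The salary and email counts follow the same pattern: count the people whose value for a feature is not the string 'NaN'. A small helper captures this. `count_known` is a hypothetical name introduced here, and the sample values are copied from records shown in this notebook.

```python
def count_known(data, feature):
    """Count people whose value for `feature` is not the string 'NaN'."""
    return sum(1 for features in data.values() if features[feature] != "NaN")


# Subset of three records, with values as displayed in this notebook.
sample_data = {
    "METTS MARK": {"email_address": "mark.metts@enron.com", "from_messages": 29},
    "BAXTER JOHN C": {"email_address": "NaN", "from_messages": "NaN"},
    "FASTOW ANDREW S": {"email_address": "andrew.fastow@enron.com", "from_messages": "NaN"},
}

known_emails = count_known(sample_data, "email_address")
known_from_messages = count_known(sample_data, "from_messages")
```

Note that missing values in this dataset are stored as the string `'NaN'`, not as a float NaN, so a plain string comparison is the right test.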

### How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [17]:
from __future__ import division   ## without this, Python 2 integer division makes 21/146 return 0

c_NaN_total_payment = 0
for k in enron_data.keys():
    if enron_data[k]['total_payments'] == 'NaN':
        c_NaN_total_payment += 1

percent = (c_NaN_total_payment/len(enron_data))*100
print 'percentage of people having NaN for their total payments : ', percent 

percentage of people having NaN for their total payments :  14.3835616438
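The count and the percentage can be computed together, and multiplying by a float avoids the need for the `__future__` import. This is a sketch only: `nan_share` is a hypothetical helper, and the sample uses the `deferral_payments` values of five records shown in this notebook.

```python
def nan_share(data, feature):
    """Return (count, percent) of people whose `feature` is the string 'NaN'."""
    nan_count = sum(1 for features in data.values() if features[feature] == "NaN")
    # 100.0 is a float, so the division is a true division in Python 2 and 3.
    return nan_count, 100.0 * nan_count / len(data)


# 'deferral_payments' values as displayed in this notebook.
sample_data = {
    "METTS MARK": {"deferral_payments": "NaN"},
    "BAXTER JOHN C": {"deferral_payments": 1295738},
    "LAY KENNETH L": {"deferral_payments": 202911},
    "SKILLING JEFFREY K": {"deferral_payments": "NaN"},
    "FASTOW ANDREW S": {"deferral_payments": "NaN"},
}

sample_count, sample_percent = nan_share(sample_data, "deferral_payments")
```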


### How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of all POIs is this?

In [18]:
poi_count_NaN_TP = 0
for k in enron_data.keys():
    if enron_data[k]['poi']==True and enron_data[k]['total_payments']=='NaN':
        poi_count_NaN_TP += 1

poi_count_NaN_TP

0

##### 0% of POIs have “NaN” for total_payments; every POI has this value filled in.

In [19]:
c = 0
for k in enron_data.keys():
    if enron_data[k]['poi']==True and c < 3:
        print enron_data[k]
        print '\n'
        c += 1
    

{'salary': 243293, 'to_messages': 1045, 'deferral_payments': 'NaN', 'total_payments': 288682, 'exercised_stock_options': 5538001, 'bonus': 1500000, 'restricted_stock': 853064, 'shared_receipt_with_poi': 1035, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 6391065, 'expenses': 34039, 'loan_advances': 'NaN', 'from_messages': 32, 'other': 11350, 'from_this_person_to_poi': 21, 'poi': True, 'director_fees': 'NaN', 'deferred_income': -3117011, 'long_term_incentive': 1617011, 'email_address': 'kevin.hannon@enron.com', 'from_poi_to_this_person': 32}


{'salary': 288542, 'to_messages': 1758, 'deferral_payments': 27610, 'total_payments': 1490344, 'exercised_stock_options': 'NaN', 'bonus': 1200000, 'restricted_stock': 698242, 'shared_receipt_with_poi': 1132, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 698242, 'expenses': 16514, 'loan_advances': 'NaN', 'from_messages': 40, 'other': 101740, 'from_this_person_to_poi': 11, 'poi': True, 'director_fees': 'NaN', 'deferred_income':

## If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs?

ANS: non-POIs  
No training point with the class label "POI" has "NaN" for total_payments, so every "NaN" the algorithm sees belongs to a non-POI.

******


## If you added, say, 10 more data points that were all POIs, and put “NaN” for the total payments of those folks, the numbers you just calculated would change.
What is the new number of people of the dataset? What is the new number of folks with “NaN” for total payments?

In [20]:
len(enron_data)+10

156

In [21]:
count_total_payment = 0
for k in enron_data.keys():
    if enron_data[k]['total_payments'] == 'NaN':
        count_total_payment += 1
        
count_total_payment + 10

31

## What is the new number of POIs in the dataset? What is the new number of POIs with “NaN” for total_payments?

In [22]:
poi_count + 10

28

New number of POIs with “NaN” for total_payments = 10

## Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI?

##### Yes
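The shift can be made concrete with the counts computed above (146 people, 18 POIs, 21 people with "NaN" for total_payments, none of them POIs). This is a back-of-the-envelope sketch, not a trained classifier. The 10 added POIs are the hypothetical ones from the question, and the function name is illustrative.

```python
def share_of_nan_that_are_poi(n_poi_with_nan, n_with_nan):
    """Fraction of people with 'NaN' total_payments who are POIs."""
    return float(n_poi_with_nan) / n_with_nan


# Counts from the cells above.
n_people, n_poi, n_nan, n_poi_nan = 146, 18, 21, 0
n_added = 10  # hypothetical POIs, all with 'NaN' total_payments

before = share_of_nan_that_are_poi(n_poi_nan, n_nan)
after = share_of_nan_that_are_poi(n_poi_nan + n_added, n_nan + n_added)

# Overall share of POIs after the addition, for comparison.
poi_base_rate_after = float(n_poi + n_added) / (n_people + n_added)
```

Before the addition no "NaN" holder is a POI. Afterwards 10 of 31 are, which is well above the overall POI share of 28 out of 156, so "NaN" starts to look like evidence for the POI class.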

# Mixing Data Sources

Adding the new POIs in this example, none of whom we have financial information for, has introduced a subtle problem: an algorithm can pick up our lack of financial information about them as a clue that they’re POIs. Another way to think about this is that there’s now a difference in how we generated the data for our two classes: non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do. Suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn’t on the spreadsheet. Then all their financial data would contain “NaN”, yet the person is very likely not a POI (there are many more non-POIs than POIs in the world, and even at Enron). You’d be likely to accidentally identify them as a POI, though!

This goes to show that, when generating or augmenting a dataset, you should be exceptionally careful if your data are coming from different sources for different classes. It can easily lead to the type of bias or mistake that we showed here. There are ways to deal with this. For example, you wouldn’t have to worry about this problem if you used only email data: in that case, discrepancies in the financial data wouldn’t matter because financial features aren’t being used. There are also more sophisticated ways of estimating how much of an effect these biases can have on your final answer; those are beyond the scope of this course.
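One way to follow the email-only suggestion is to strip every non-email feature before training. The sketch below assumes the five email-derived feature names that appear in the records above. `EMAIL_FEATURES` and `email_only` are hypothetical names, and the sample record is copied from this notebook.

```python
# Assumed split: these five keys are the email-derived features in the records.
EMAIL_FEATURES = [
    "to_messages",
    "from_messages",
    "from_poi_to_this_person",
    "from_this_person_to_poi",
    "shared_receipt_with_poi",
]


def email_only(data):
    """Return a copy of `data` that keeps only the email features per person."""
    return {
        name: {feature: features[feature] for feature in EMAIL_FEATURES}
        for name, features in data.items()
    }


# One record with values as displayed in this notebook.
sample_data = {
    "METTS MARK": {
        "salary": 365788,
        "total_payments": 1061827,
        "to_messages": 807,
        "from_messages": 29,
        "from_poi_to_this_person": 38,
        "from_this_person_to_poi": 1,
        "shared_receipt_with_poi": 702,
    },
}

email_data = email_only(sample_data)
```

Email features can still be "NaN" for some people, so this removes the financial-source bias but not the need to handle missing values.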

## For now, the takeaway message is __to be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes__.