# Explore Enron Data

In [1]:
""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).
    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }
    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:
    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

import pickle

with open("final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)


## Size of the Enron Dataset

In [2]:
len(enron_data)

146

## Features in the Enron Dataset

In [3]:
lengths = [len(v) for v in enron_data.values()]
print(lengths)

[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]


## Find Persons of Interest (POIs) in the Enron Dataset

In [4]:
poi = 0
for person in enron_data:
    if enron_data[person]["poi"]:
        poi += 1
print(poi)

18


## How Many POIs Exist?

In [5]:
count = 0
with open("poi_names.txt", "r") as f:
    for line in f:
        if "(y)" in line or "(n)" in line:
            count += 1

print(count)

35


As we can see, only 18 of the 35 known POIs appear in our email and financial (E+F) dataset. The main problem with this gap is that 18 POI data points don't give a learning algorithm many examples to learn from.

* In general, more data is better!
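
To put a number on the coverage gap, here is a quick calculation using only the two counts printed above (18 POIs in the dataset, 35 names in `poi_names.txt`). The variable names are illustrative and not part of the starter code.

```python
# Counts taken from the cell outputs above.
pois_in_dataset = 18
pois_in_names_file = 35

# Share of known POIs that made it into the E+F dataset, and how many did not.
coverage = pois_in_dataset / pois_in_names_file
missing = pois_in_names_file - pois_in_dataset
print(f"{coverage:.1%} of known POIs are in the dataset; {missing} are missing")
```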

## Query the Dataset
### What is the total value of the stock belonging to James Prentice?

In [6]:
for name in enron_data:
    if "PRENTICE" in name:
        print(enron_data[name])
        print()
        print(enron_data[name]['total_stock_value'])

{'salary': 'NaN', 'to_messages': 'NaN', 'deferral_payments': 564348, 'total_payments': 564348, 'loan_advances': 'NaN', 'bonus': 'NaN', 'email_address': 'james.prentice@enron.com', 'restricted_stock_deferred': 'NaN', 'deferred_income': 'NaN', 'total_stock_value': 1095040, 'expenses': 'NaN', 'from_poi_to_this_person': 'NaN', 'exercised_stock_options': 886231, 'from_messages': 'NaN', 'other': 'NaN', 'from_this_person_to_poi': 'NaN', 'poi': False, 'long_term_incentive': 'NaN', 'shared_receipt_with_poi': 'NaN', 'restricted_stock': 208809, 'director_fees': 'NaN'}

1095040


### How many email messages do we have from Wesley Colwell to persons of interest?

In [7]:
for name in enron_data:
    if "COLWELL" in name:
        print(enron_data[name]['from_this_person_to_poi'])

11


### What's the value of stock options exercised by Jeffrey K Skilling?

In [8]:
for name in enron_data:
    if "SKILLING" in name:
        print(enron_data[name]['exercised_stock_options'])

19250000


### Of the three individuals CEO, Chairman, CFO (Skilling, Lay, Fastow), who took home the most money?

In [9]:
for name in enron_data:
    if "SKILLING" in name:
        print(name, enron_data[name]["total_payments"])
    if "LAY" in name:
        print(name, enron_data[name]["total_payments"])
    if "FASTOW" in name:
        print(name, enron_data[name]["total_payments"])
        
# Alternative solution
mykeys = ["SKILLING JEFFREY K", "LAY KENNETH L", "FASTOW ANDREW S"]
[(k,v["total_payments"]) for k, v in enron_data.items() if k in mykeys]

LAY KENNETH L 103559793
FASTOW ANDREW S 2424083
SKILLING JEFFREY K 8682716


[('LAY KENNETH L', 103559793),
 ('FASTOW ANDREW S', 2424083),
 ('SKILLING JEFFREY K', 8682716)]

### How many folks in this dataset have a quantified salary? What about a known email address?

In [10]:
has_salary = 0
has_email = 0
for person in enron_data:
    if enron_data[person]['salary'] != 'NaN':
        has_salary += 1
    if enron_data[person]['email_address'] != 'NaN':
        has_email += 1

print(has_salary)
print(has_email)

95
111
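
The same idea extends to every feature at once. The sketch below counts the `"NaN"` placeholder per feature with `collections.Counter`. It runs on a small made-up dictionary (`toy_data`, with invented names and values) so that it is self-contained; swapping in `enron_data` should work the same way, assuming every missing value is stored as the string `"NaN"`, as in the record printed earlier.

```python
from collections import Counter

# Toy stand-in for enron_data; the names and values are made up.
toy_data = {
    "DOE JANE A": {"salary": 250000, "email_address": "jane.doe@example.com", "bonus": "NaN"},
    "ROE RICHARD B": {"salary": "NaN", "email_address": "NaN", "bonus": "NaN"},
    "POE ALEX C": {"salary": 180000, "email_address": "NaN", "bonus": 50000},
}

# Count, for each feature, how many people carry the "NaN" placeholder string.
missing = Counter(
    feature
    for features in toy_data.values()
    for feature, value in features.items()
    if value == "NaN"
)
print(missing.most_common())
```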


## Dictionary to Array Conversion

In [11]:
""" 
    A general tool for converting data from the
    dictionary format to an (n x k) numpy array that's 
    ready for training an sklearn algorithm
    
    n--no. of key-value pairs in dictionary
    k--no. of features being extracted
    
    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person
        
    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for
    
    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:
    
    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)
    
    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""


import numpy as np

def featureFormat(dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys=False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """


    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        with open(sort_keys, "rb") as f:
            keys = pickle.load(f)
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print("error: key ", feature, " not present")
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)
        return targets and features as separate lists
        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features
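
To make the conversion concrete, here is a plain-Python sketch of what the default settings do to a tiny dictionary. It is a simplified stand-in written for illustration, not the functions above: it uses lists instead of numpy arrays, always sorts the keys, and only mimics `remove_NaN=True` and `remove_all_zeroes=True`. The records in `toy_data` are made up.

```python
# Toy records in the same shape as enron_data; all names and numbers are made up.
toy_data = {
    "DOE JANE A": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "ROE RICHARD B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "POE ALEX C": {"poi": False, "salary": 180000, "bonus": 50000},
}
feature_list = ["poi", "salary", "bonus"]  # label first, as the docstring requires

rows = []
for person in sorted(toy_data):
    # Replace the "NaN" placeholder with 0.0 and cast everything else to float.
    row = [0.0 if toy_data[person][f] == "NaN" else float(toy_data[person][f])
           for f in feature_list]
    # Keep the row only if at least one non-label feature is non-zero.
    if any(value != 0 for value in row[1:]):
        rows.append(row)

# Split the label (first column) from the remaining features.
labels = [row[0] for row in rows]
features = [row[1:] for row in rows]
print(labels)
print(features)
```

The person with no salary and no bonus is dropped entirely, which is worth keeping in mind when the number of rows coming out is smaller than the number of people going in.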

### How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments?
### What percentage of people in the dataset as a whole is this?

In [12]:
NaN_total_payment = 0
for person in enron_data:
    if enron_data[person]['total_payments'] == 'NaN':
        NaN_total_payment += 1

print(f"{NaN_total_payment} or {round(NaN_total_payment * 100 / len(enron_data), 1)}%")

21 or 14.4%


### How many POIs in the E+F dataset have “NaN” for their total payments?
### What percentage of POIs as a whole is this?

In [13]:
NaN_total_payment_poi = 0
poi_total = 0
for person in enron_data:
    if enron_data[person]['poi']:
        poi_total += 1
        if enron_data[person]['total_payments'] == 'NaN':
            NaN_total_payment_poi += 1

# Divide by the counted number of POIs rather than a hard-coded 18.
print(f"{NaN_total_payment_poi} or {round(NaN_total_payment_poi * 100 / poi_total, 1)}%")

0 or 0.0%


### If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs?

* Non-POIs. No training point has "NaN" for total_payments when the class label is "POI".

### If you added, say, 10 more data points that were all POIs, and put “NaN” for the total payments of each, the numbers you just calculated would change. What is the new number of people in the dataset? What is the new number of people with “NaN” for total payments?

* Number in dataset = 156
* 'NaN' for total payments = 31

### What is the new number of POIs in the dataset? What is the new number of POIs with “NaN” for total payments?

* Number of POIs = 28
* 'NaN' for total payments = 10

### Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI?

* Yes

Adding the new POIs in this example, none of whom we have financial information for, has introduced a subtle problem: an algorithm can pick up our lack of financial information about them as a clue that they’re POIs. Another way to think about this is that there’s now a difference in how we generated the data for our two classes: non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do. Suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn’t on the spreadsheet. Then all their financial data would be “NaN”, so you’d be likely to identify them as a POI, even though the person is very likely not one (there are many more non-POIs than POIs in the world, and even at Enron).
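
The size of this effect can be read straight off the counts computed earlier. The short calculation below uses only numbers from the cells above (21 people with "NaN" total payments, none of them POIs) plus the 10 hypothetical hand-added POIs from the question; the variable names are illustrative.

```python
# Counts from the cells above: 21 people have "NaN" total_payments, 0 of them POIs.
nan_total, nan_poi = 21, 0
added_pois = 10  # hypothetical POIs added by hand, all with "NaN" total_payments

# Fraction of "NaN" rows that are POIs, before and after the hand-added points.
share_before = nan_poi / nan_total
share_after = (nan_poi + added_pois) / (nan_total + added_pois)
print(f"POI share among 'NaN' rows: {share_before:.1%} -> {share_after:.1%}")
```

Before the addition, "NaN" never coincides with the POI label; afterwards roughly a third of the "NaN" rows are POIs, which is the kind of signal a classifier can latch onto.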

This goes to show that, when generating or augmenting a dataset, you should be exceptionally careful if your data come from different sources for different classes. It can easily lead to the type of bias or mistake that we showed here. There are ways to deal with this. For example, you wouldn’t have to worry about this problem if you used only email data: in that case, discrepancies in the financial data wouldn’t matter because financial features aren’t being used. There are also more sophisticated ways of estimating how much of an effect these biases can have on your final answer; those are beyond the scope of this course.
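
As a rough illustration of the email-only option, the sketch below builds a feature list from the email-derived feature names that appear in the record printed earlier. Which features count as "email" is an assumption based on their names, and `email_address` is left out because it is a string rather than a number.

```python
# Email-derived feature names, as they appear in the record printed earlier.
email_features = [
    "to_messages",
    "from_messages",
    "from_poi_to_this_person",
    "from_this_person_to_poi",
    "shared_receipt_with_poi",
]

# "poi" goes first because targetFeatureSplit treats the first column as the label.
feature_list = ["poi"] + email_features
print(feature_list)
```

A list like this could then be passed to `featureFormat` in place of a list that mixes in financial features.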

For now, the takeaway message is to be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes.