# Datasets and Questions Mini-Project

The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We’ve combined the email and finance data into a single dataset, which you’ll explore in this mini-project.

Getting started:

    Clone this git repository: https://github.com/udacity/ud120-projects
    Open the starter code: datasets_questions/explore_enron_data.py

In [None]:
"""
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000

"""

import joblib

# Open the pickle in a context manager so the file handle is closed after loading.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = joblib.load(f)

# Size of the Enron Dataset

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
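
To make the pickle round trip concrete, here is a minimal sketch that saves and reloads a small dict of dicts. The names and values in `toy_data` are made up for illustration and are not taken from the Enron dataset; the standard-library `pickle` module is used here in place of joblib.

```python
import os
import pickle
import tempfile

# Toy stand-in for the E+F dictionary; names and values are hypothetical.
toy_data = {
    "DOE JANE": {"salary": 100000, "poi": False},
    "ROE RICHARD A": {"salary": "NaN", "poi": True},
}

# Write the dictionary to a temporary pickle file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an ordinary dict with the same keys and values.
print(restored == toy_data)
print(len(restored))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.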

How many data points (people) are in the dataset?

In [None]:
print(len(enron_data))

For each person, how many features are available?

In [None]:
# Count the features recorded for the first person in the dictionary.
first_person = next(iter(enron_data.values()))
keyCount = len(first_person)
print(keyCount)

# Finding POIs in the Enron Data

The “poi” feature records whether the person is a person of interest, according to our definition. How many POIs are there in the E+F dataset?

In other words, count the number of entries in the dictionary where
data[person_name]["poi"]==1

In [None]:
# Count the people whose "poi" flag is set (True compares equal to 1).
poiCount = sum(1 for person in enron_data.values() if person.get("poi") == 1)
print(poiCount)

# How Many POIs Exist?

We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).

How many POIs were there in total? (Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.)

In [None]:
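totalPoi = 0
with open('../final_project/poi_names.txt', 'r') as f:
    # Iterate over the file directly; calling f.read() first would consume
    # every line and leave nothing to count.
    for line in f:
        # Assumption: each POI entry starts with a "(y)" or "(n)" marker,
        # so header and blank lines are skipped.
        if line.startswith('('):
            totalPoi += 1
print(totalPoi)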

# Query the Dataset 1

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]
or, sometimes
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]

What is the total value of the stock belonging to James Prentice?

In [None]:
print(enron_data["PRENTICE JAMES"]["total_stock_value"])
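
Because some keys include a middle initial and others do not, it can help to look up the exact key before indexing. The sketch below uses a hypothetical helper, `find_people`, and made-up toy names; it is not part of the starter code.

```python
def find_people(data, last_name):
    """Return every key whose first word equals last_name (case-insensitive)."""
    wanted = last_name.upper()
    # name.split()[:1] is an empty list for an empty key, so no IndexError.
    return sorted(name for name in data if name.split()[:1] == [wanted])


# Toy dictionary in the same "LASTNAME FIRSTNAME [MIDDLEINITIAL]" key format.
toy_data = {
    "DOE JANE": {"total_stock_value": 1000},
    "DOE JOHN Q": {"total_stock_value": 2500},
    "ROE RICHARD A": {"total_stock_value": "NaN"},
}

matches = find_people(toy_data, "Doe")
print(matches)  # ['DOE JANE', 'DOE JOHN Q']
```

With the real dataset, calling `find_people(enron_data, "PRENTICE")` would list the candidate keys to index with.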

# Query the Dataset 2

How many email messages do we have from Wesley Colwell to persons of interest?

In [None]:
print(enron_data["COLWELL WESLEY"]["from_this_person_to_poi"])

# Query the Dataset 3

What’s the value of stock options exercised by Jeffrey K Skilling?

In [None]:
print(enron_data["SKILLING JEFFREY K"]["exercised_stock_options"])

# Follow the Money

Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)?

How much money did that person get?

In [1]:
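# Collect total_payments for the three executives, matched by last name.
executives = ("LAY ", "SKILLING ", "FASTOW ")
totalPayments = [(name, enron_data[name]["total_payments"]) for name in enron_data
                 if name.startswith(executives)]
# max() returns the (name, total_payments) pair with the largest payment.
print(max(totalPayments, key=lambda pair: pair[1]))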

# Dealing with Unfilled Features

How many folks in this dataset have a quantified salary? What about a known email address?

In [None]:
knownEmail = [name for name in enron_data if enron_data[name]["email_address"] != 'NaN']
quantifiedSalary = [name for name in enron_data if enron_data[name]["salary"] != 'NaN']

print("With known email address:", len(knownEmail))
print("With quantified salary:", len(quantifiedSalary))
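
The same check generalizes to every feature at once. As a rough sketch, assuming missing values are always stored as the string 'NaN' (as in the comparisons above), the hypothetical helper `count_known` tallies how many people have a real value for each feature. The toy records are made up for illustration.

```python
from collections import Counter


def count_known(data):
    """Count, per feature, how many people have a value other than 'NaN'."""
    counts = Counter()
    for features in data.values():
        for feature, value in features.items():
            if value != "NaN":
                counts[feature] += 1
    return counts


# Toy records; names, salaries and addresses are hypothetical.
toy_data = {
    "DOE JANE": {"salary": 100000, "email_address": "jane.doe@example.com"},
    "DOE JOHN Q": {"salary": 85000, "email_address": "NaN"},
    "ROE RICHARD A": {"salary": "NaN", "email_address": "NaN"},
}

known = count_known(toy_data)
print(known["salary"])         # 2
print(known["email_address"])  # 1
```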

# Dict-to-array conversion

A Python dictionary can’t be read directly into an sklearn classification or regression algorithm. Instead, the algorithm needs a NumPy array or a list of lists, where each inner list is one data point and its elements are the features of that point.

We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.

When a feature has no value for a particular person, featureFormat() also replaces the missing value with 0 (zero).
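
To illustrate the idea, here is a simplified sketch of such a conversion using plain lists of lists. It is not the code in tools/feature_format.py: the functions `dict_to_rows` and `split_target` are hypothetical stand-ins, the toy records are made up, and the real helpers may differ in their details.

```python
def dict_to_rows(data, feature_names):
    """Build one row per person, with 'NaN' or absent features replaced by 0."""
    rows = []
    for name in sorted(data):  # sort keys so the row order is reproducible
        row = []
        for feature in feature_names:
            value = data[name].get(feature, "NaN")
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows


def split_target(rows):
    """Treat the first column as the target and the rest as features."""
    targets = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return targets, features


# Toy records; names and values are hypothetical.
toy_data = {
    "DOE JANE": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "ROE RICHARD A": {"poi": True, "salary": "NaN", "bonus": 5000},
}

rows = dict_to_rows(toy_data, ["poi", "salary", "bonus"])
targets, features = split_target(rows)
print(targets)   # [0.0, 1.0]
print(features)  # [[100000.0, 0.0], [0.0, 5000.0]]
```

Putting the label first in the feature list keeps the split trivial: the first column becomes the target and everything after it becomes the input features.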