# Identify Fraud from Enron Email Project
## June 2017, by Jude Moon
<br />

# Project Overview
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. 

In this project, I play detective and put my new machine learning skills to use by building a person of interest (POI) identifier from the financial and email data made public as a result of the Enron scandal. I used [the provided dataset](link) from the [Udacity Intro to Machine Learning Course](https://www.udacity.com/course/intro-to-machine-learning--ud120), which was combined with a hand-generated list of POIs in the fraud case. POIs are individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

This document keeps my notes as I work through the project and compose answers to [a series of questions](https://docs.google.com/document/d/1NDgi1PrNJP7WTbfSUuRUnz8yzs5nGVTSzpO7oeNTEWA/pub?embedded=true) provided by Udacity, showing my thought process and my approach to solving the problem.
***

# Data Exploration
## Q1-1: Summarize the goal of this project
The goal of the Enron project is to build a valid algorithm that identifies Enron employees who may have committed fraud (labeled as a person of interest, or POI), using features from their financial and email data.

## Q1-2: Give some background on the dataset 

In [25]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [2]:
# loads up the dataset (pickled dict of dicts)
data_dict = pickle.load(open("final_project_dataset.pkl", "r"))

### Enron dataset (emails + finances) has the form:
    
    data_dict["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }
    
The data dictionary is stored as a **pickle** file, which is a handy way to store and load python objects directly.

### How many data points (people) are in the dataset?

In [3]:
len(data_dict)

146

### How many POI?
In other words, count the number of entries in the dictionary where
data[person_name]["poi"]==1 
- 1 means POI 
- 0 means non-POI

In [53]:
count_poi = 0
for person in data_dict:
    if data_dict[person]["poi"] == 1:
        count_poi += 1
print "Number of POIs : %i" %count_poi
print "Number of non-POIs : %i" %(146-count_poi)

Number of POIs : 18
Number of non-POIs : 128


### Do we have sufficient data points?

In [54]:
# Udacity course provided a compiled list of all POI names from Enron corpus
# poi_names.txt is newline delimited
# read poi_names.txt file: each newline to string in a list
poi_names_txt = open("poi_names.txt", "r").read().splitlines()

print "1st line: " + poi_names_txt[0]
print "2nd line: " + poi_names_txt[1]
print "3rd line: " + poi_names_txt[2]
print "37th line: " + poi_names_txt[36]
print "Number of POIs from Enron corpus: %i"%(len(poi_names_txt)-2)

1st line: http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm
2nd line: 
3rd line: (y) Lay, Kenneth
37th line: (n) Loehr, Christopher
Number of POIs from Enron corpus: 35


The POI name list extracted from the Enron corpus (emails of 158 employees in total) contains 35 POIs, whereas the combined financial and email dataset contains only 18.

About half of the POIs are missing from the email + finance data dictionary. This may make it harder to see the full range of patterns between the features and the POI label.

However, adding the missing POIs from the email data and leaving "NaN" for all of their financial features would introduce a bias driven by "NaN" values.
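The comparison between the name list and the data dictionary can be sketched in a few lines. This is only an illustration, assuming the `(flag) Last, First` line format seen in the output above; `parse_poi_names`, the placeholder URL, and the sample keys are hypothetical and not part of the Udacity starter code.

```python
def parse_poi_names(lines):
    """Return (flag, 'LAST FIRST') pairs from poi_names.txt-style lines.

    Assumes the first two lines are a source URL and a blank line,
    and every later line looks like '(y) Lay, Kenneth'.
    """
    parsed = []
    for line in lines[2:]:
        line = line.strip()
        if not line:
            continue
        flag, name = line.split(") ", 1)
        last, first = [part.strip() for part in name.split(",", 1)]
        parsed.append((flag.lstrip("("), (last + " " + first).upper()))
    return parsed


# hypothetical sample; only the two names come from the output shown above
sample_lines = [
    "http://example.com/source",
    "",
    "(y) Lay, Kenneth",
    "(n) Loehr, Christopher",
]
# keys in data_dict may carry a middle initial, so match on the prefix
dataset_keys = ["LAY KENNETH L", "METTS MARK"]

parsed = parse_poi_names(sample_lines)
missing = [name for _, name in parsed
           if not any(key.startswith(name) for key in dataset_keys)]
print(missing)
```

Prefix matching is a simplification; two people sharing a first and last name would be confused, so a real comparison would need a manual check.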

### For each person, how many features are available?

In [4]:
len(data_dict[data_dict.keys()[0]])

21

### What are the features?

In [9]:
# the key of features for the first key
features_list = data_dict[data_dict.keys()[0]].keys() 
pprint.pprint(features_list)

['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']


### How many NaN (Not a Number) exist per feature?

In [14]:
# create a dictionary of feature and count of NaN pairs
count_NaN = {}
for feature in features_list:
    count_NaN[feature] = 0

for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            count_NaN[feature] +=1

# sort the dictionary by ascending ordering of values 
count_NaN = sorted(count_NaN.items(), key=operator.itemgetter(1))
pprint.pprint(count_NaN)

[('poi', 0),
 ('total_stock_value', 20),
 ('total_payments', 21),
 ('email_address', 35),
 ('restricted_stock', 36),
 ('exercised_stock_options', 44),
 ('salary', 51),
 ('expenses', 51),
 ('other', 53),
 ('to_messages', 60),
 ('shared_receipt_with_poi', 60),
 ('from_messages', 60),
 ('from_poi_to_this_person', 60),
 ('from_this_person_to_poi', 60),
 ('bonus', 64),
 ('long_term_incentive', 80),
 ('deferred_income', 97),
 ('deferral_payments', 107),
 ('restricted_stock_deferred', 128),
 ('director_fees', 129),
 ('loan_advances', 142)]


### Would NaN introduce bias to the features?

In [41]:
# create a dictionary showing the number of NaN and 
# number of POI with NaN each feature
NaN_dict = {}
keys = ['NaN_total', 'NaN_poi']

for key in keys:
    NaN_dict[key] = {}
    for feature in features_list:
        NaN_dict[key][feature] = 0
        
for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            NaN_dict['NaN_total'][feature] +=1
        
        if data_dict[person][feature] == "NaN" and data_dict[person]['poi'] == True:
            NaN_dict['NaN_poi'][feature] +=1

# convert from a dictionary to a panda dataframe
NaN_df = pd.DataFrame(NaN_dict)
NaN_df['NaN_non-poi'] = NaN_df['NaN_total']-NaN_df['NaN_poi']
NaN_df['%NaN_in_poi'] = (NaN_df['NaN_poi']/18)*100 # from total 18 POI
NaN_df['%NaN_in_non-poi'] = (NaN_df['NaN_non-poi']/128)*100 # from total 128 non-POI
NaN_df['diff_%'] = NaN_df['%NaN_in_poi'] - NaN_df['%NaN_in_non-poi']
NaN_df = NaN_df.sort_values('diff_%')  # DataFrame.sort() is gone from newer pandas
NaN_df



| feature | NaN_poi | NaN_total | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | diff_% |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| other | 0 | 53 | 53 | 0.0 | 41.40625 | -41.40625 |
| expenses | 0 | 51 | 51 | 0.0 | 39.84375 | -39.84375 |
| bonus | 2 | 64 | 62 | 11.111111 | 48.4375 | -37.326389 |
| salary | 1 | 51 | 50 | 5.555556 | 39.0625 | -33.506944 |
| deferred_income | 7 | 97 | 90 | 38.888889 | 70.3125 | -31.423611 |
| email_address | 0 | 35 | 35 | 0.0 | 27.34375 | -27.34375 |
| long_term_incentive | 6 | 80 | 74 | 33.333333 | 57.8125 | -24.479167 |
| restricted_stock | 1 | 36 | 35 | 5.555556 | 27.34375 | -21.788194 |
| to_messages | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |
| shared_receipt_with_poi | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |


I first thought that features with many "NaN" values (e.g. 'loan_advances', 'director_fees', 'restricted_stock_deferred') would introduce bias. However, a disproportion in the share of "NaN" values between the POI group and the non-POI group may be more problematic. Features with a large difference between % NaN in the POI group and % NaN in the non-POI group, such as 'other' and 'expenses', are likely biased by "NaN" values. For example, no POI has "NaN" for 'other' while about 41% of non-POIs do, so a supervised classifier that uses 'other' could learn to treat "NaN" as a clue that a person is a non-POI.

I am not sure whether it is acceptable to associate a lack of information, such as a "NaN" value, with a particular label. I will keep this in mind and consider excluding the NaN-biased features at the feature selection stage.
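The per-group comparison can be reduced to a small helper. The sketch below is a dependency-free illustration on made-up records; `nan_rate_by_label` and the `toy` dictionary are hypothetical names, and the function assumes both groups contain at least one person.

```python
def nan_rate_by_label(data, feature):
    """Return (% NaN among POIs, % NaN among non-POIs) for one feature."""
    counts = {True: [0, 0], False: [0, 0]}  # label -> [n_NaN, n_people]
    for record in data.values():
        label = bool(record["poi"])
        counts[label][1] += 1
        if record[feature] == "NaN":
            counts[label][0] += 1
    return tuple(100.0 * counts[label][0] / counts[label][1]
                 for label in (True, False))


# hypothetical toy records, not taken from the Enron data
toy = {
    "A": {"poi": True, "other": 10},
    "B": {"poi": True, "other": 25},
    "C": {"poi": False, "other": "NaN"},
    "D": {"poi": False, "other": 5},
    "E": {"poi": False, "other": "NaN"},
    "F": {"poi": False, "other": 7},
}

poi_rate, non_poi_rate = nan_rate_by_label(toy, "other")
# a strongly negative difference means "NaN" is concentrated in non-POIs
print(poi_rate - non_poi_rate)
```

A large gap between the two rates, in either direction, is the signal that a classifier could exploit missingness itself rather than the feature's value.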


## Summary of data exploration
- Total number of data points: 146
- Total number of data points labeled as POI: 18
- Total number of data points labeled as non-POI: 128
- Number of missing POIs: 17
- Number of initial features: 21
- List of features with more than 73 "NaN" values (50% cut-off): 

| feature name  | number of NaN  |
|:---:|:---:|
| 'loan_advances' | 142  |
| 'director_fees'  | 129  |
| 'restricted_stock_deferred'  | 128  |
|  'deferral_payments' | 107  |
| 'deferred_income'  | 97  |
| 'long_term_incentive'  |  80 |
    

- List of features with "NaN" value disproportionally distributed between POI vs. non-POI groups:

|    feature_name   | NaN_total | NaN_poi | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | %Difference|
|:-----------------:|:---------:|:-------:|:-----------:|:-----------:|:---------------:|:---------------:|
|      'other'      |     53    |    0    |      53     |      0      |        41       |       -41       |
|     'expenses'    |     51    |    0    |      51     |      0      |        40       |       -40       |
|      'bonus'      |     64    |    2    |      62     |      11     |        48       |       -37       |
|      'salary'     |     51    |    1    |      50     |      6      |        39       |       -34       |
| 'deferred_income' |     97    |    7    |      90     |      39     |        70       |       -31       |

## Q1-3: How machine learning is useful in trying to accomplish the project goal and answer the project question

It is uncertain whether the existing financial and email dataset provides good predictors for identifying POIs. Data exploration revealed some limitations, such as NaN-driven bias and about half of the POIs being missing.

Despite these limitations, machine learning can be useful for discovering hidden patterns in the features associated with POI labels and for understanding the relationship between a feature, or a set of features, and the POI label. After validating and evaluating the performance of a machine learning algorithm, we can answer whether the features in the dataset can predict who is a POI.

Following the scikit-learn algorithm cheat-sheet below (predicting a category: yes; labeled data: yes; fewer than 100K samples: yes), the options are:


- Linear SVC 
- KNeighbors Classifier 
- SVC
- Ensemble Classifiers

![image](http://scikit-learn.org/stable/_static/ml_map.png)

# Outlier Investigation

### Who has the most NaN?

In [165]:
# create a dictionary of person and count of NaN pairs
missing_value = {}

for person in data_dict:
    missing_value[person] = 0
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            missing_value[person] +=1

# sort the dictionary by ascending ordering of values 
missing_value = sorted(missing_value.items(), key=operator.itemgetter(1))

# print top 5 those who have the most NaN
pprint.pprint(missing_value[-5:])

[('WHALEY DAVID A', 18),
 ('WROBEL BRUCE', 18),
 ('THE TRAVEL AGENCY IN THE PARK', 18),
 ('GRAMM WENDY L', 18),
 ('LOCKHART EUGENE E', 20)]


### Glance at numerical variable distributions

In [166]:
# to get summary statistics of each feature, I use a pandas dataframe
# convert a python dictionary to a dataframe 
# with features as columns and people as rows
df = pd.DataFrame(data_dict)
df_trans = df.transpose()

In [167]:
# to get numerical statistics, replace "NaN" to zero (0)
def to_zero(v):
    if v == 'NaN':
        v = 0
    return v
df_trans = df_trans.applymap(to_zero)
df_trans.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0
mean,1333474.0,438796.5,-382762.2,19422.49,4182736.0,70748.27,358.60274,38.226027,24.287671,1149658.0,664683.9,585431.8,1749257.0,20516.37,365811.4,692.986301,1221.589041,4350622.0,5846018.0
std,8094029.0,2741325.0,2378250.0,119054.3,26070400.0,432716.3,1441.259868,73.901124,79.278206,9649342.0,4046072.0,3682345.0,10899950.0,1439661.0,2203575.0,1072.969492,2226.770637,26934480.0,36246810.0
min,0.0,-102500.0,-27992890.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,-7576788.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,-37926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,0.0,93944.75,228869.5
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,16.5,2.5,0.0,0.0,0.0,959.5,360528.0,0.0,210596.0,102.5,289.0,941359.5,965955.0
75%,800000.0,9684.5,0.0,0.0,1714221.0,53740.75,51.25,40.75,13.75,0.0,375064.8,150606.5,814528.0,0.0,270850.5,893.5,1585.75,1968287.0,2319991.0
max,97343620.0,32083400.0,0.0,1398517.0,311764000.0,5235198.0,14368.0,528.0,609.0,83925000.0,48521930.0,42667590.0,130322300.0,15456290.0,26704230.0,5521.0,15149.0,309886600.0,434509500.0


## Q1-4: Are there any outliers in the dataset?

In [169]:
# I define outliers here as values above the 99% quantile
# get lists of people above the 99% quantile for each feature
highest = {}
for column in df_trans.columns:
    if df_trans[column].dtypes == "int64":
        q = df_trans[column].quantile(0.99)
        # filter df_trans by its own column (data_df was never defined)
        highest[column] = df_trans[df_trans[column] > q].index.tolist()
    
pprint.pprint(highest)

{'bonus': ['LAVORATO JOHN J', 'TOTAL'],
 'deferral_payments': ['FREVERT MARK A', 'TOTAL'],
 'deferred_income': [],
 'director_fees': ['BHATNAGAR SANJAY', 'TOTAL'],
 'exercised_stock_options': ['LAY KENNETH L', 'TOTAL'],
 'expenses': ['MCCLELLAN GEORGE', 'TOTAL'],
 'from_messages': ['KAMINSKI WINCENTY J', 'KEAN STEVEN J'],
 'from_poi_to_this_person': ['DIETRICH JANET R', 'LAVORATO JOHN J'],
 'from_this_person_to_poi': ['DELAINEY DAVID W', 'LAVORATO JOHN J'],
 'loan_advances': ['LAY KENNETH L', 'TOTAL'],
 'long_term_incentive': ['MARTIN AMANDA K', 'TOTAL'],
 'other': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock_deferred': ['BELFER ROBERT', 'BHATNAGAR SANJAY'],
 'salary': ['SKILLING JEFFREY K', 'TOTAL'],
 'shared_receipt_with_poi': ['BELDEN TIMOTHY N', 'SHAPIRO RICHARD S'],
 'to_messages': ['KEAN STEVEN J', 'SHAPIRO RICHARD S'],
 'total_payments': ['LAY KENNETH L', 'TOTAL'],
 'total_stock_value': ['LAY KENNETH L', 'TOTAL']}


### What are the outliers repeatedly shown among the features?

In [170]:
# summarize the previous dictionary, highest
# create a dictionary of outliers and the frequency of being outlier
highest_count = {}
for feature in highest:
    for person in highest[feature]:
        if person not in highest_count:
            highest_count[person] = 1
        else:
            highest_count[person] += 1
            
highest_count = sorted(highest_count.items(), key=operator.itemgetter(1))   
highest_count

[('DELAINEY DAVID W', 1),
 ('MARTIN AMANDA K', 1),
 ('SKILLING JEFFREY K', 1),
 ('BELDEN TIMOTHY N', 1),
 ('DIETRICH JANET R', 1),
 ('FREVERT MARK A', 1),
 ('KAMINSKI WINCENTY J', 1),
 ('BELFER ROBERT', 1),
 ('MCCLELLAN GEORGE', 1),
 ('KEAN STEVEN J', 2),
 ('BHATNAGAR SANJAY', 2),
 ('SHAPIRO RICHARD S', 2),
 ('LAVORATO JOHN J', 3),
 ('LAY KENNETH L', 6),
 ('TOTAL', 12)]

## Summary of outlier Investigation

- Top 5 people who have the most "NaN":

|          person name          | number of NaN |
|:-----------------------------:|:-------------:|
|       LOCKHART EUGENE E       |       20      |
|         GRAMM WENDY L         |       18      |
| THE TRAVEL AGENCY IN THE PARK |       18      |
|          WROBEL BRUCE         |       18      |
|         WHALEY DAVID A        |       18      |

- Top 3 people repeatedly shown as outliers:

|   person name   | frequency of being outlier |
|:---------------:|:--------------------------:|
|      TOTAL      |             12             |
|  LAY KENNETH L  |              6             |
| LAVORATO JOHN J |              3             |

### Take a look at outliers

In [178]:
df[['LOCKHART EUGENE E', 'GRAMM WENDY L', \
    'THE TRAVEL AGENCY IN THE PARK', \
    'WROBEL BRUCE', 'WHALEY DAVID A', \
    'TOTAL', 'LAY KENNETH L', 'LAVORATO JOHN J']]

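| feature | LOCKHART EUGENE E | GRAMM WENDY L | THE TRAVEL AGENCY IN THE PARK | WROBEL BRUCE | WHALEY DAVID A | TOTAL | LAY KENNETH L | LAVORATO JOHN J |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| bonus | NaN | NaN | NaN | NaN | NaN | 97343619 | 7000000 | 8000000 |
| deferral_payments | NaN | NaN | NaN | NaN | NaN | 32083396 | 202911 | NaN |
| deferred_income | NaN | NaN | NaN | NaN | NaN | -27992891 | -300000 | NaN |
| director_fees | NaN | 119292 | NaN | NaN | NaN | 1398517 | NaN | NaN |
| email_address | NaN | NaN | NaN | NaN | NaN | NaN | kenneth.lay@enron.com | john.lavorato@enron.com |
| exercised_stock_options | NaN | NaN | NaN | 139130 | 98718 | 311764000 | 34348384 | 4158995 |
| expenses | NaN | NaN | NaN | NaN | NaN | 5235198 | 99832 | 49537 |
| from_messages | NaN | NaN | NaN | NaN | NaN | NaN | 36 | 2585 |
| from_poi_to_this_person | NaN | NaN | NaN | NaN | NaN | NaN | 123 | 528 |
| from_this_person_to_poi | NaN | NaN | NaN | NaN | NaN | NaN | 16 | 411 |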


## Q1-5: How to handle outliers?

'TOTAL' appears to be an outlier introduced by a spreadsheet quirk: it is the sum of all entries in the [pdf financial data](enron61702insiderpay.pdf), not a person. It needs to be removed from the dataset.

In addition, 'LOCKHART EUGENE E' should be removed as well, because every one of his features is NaN and he is labeled as non-POI, so the record carries no information.

Among the outliers and the data points with too many missing values, only 'LAY KENNETH L' is labeled as a POI, and he was chairman of the Enron board of directors. So I think the extreme values for this individual are meaningful rather than the result of typos or technical errors.

'LAVORATO JOHN J' is an interesting individual: he received the largest bonus and exchanged the most emails with POIs, yet he is not labeled as a POI. I therefore expect him to lie near the classification boundary or to be misclassified.

I will keep the other detected outliers, including 'THE TRAVEL AGENCY IN THE PARK'. According to the footnote in the [pdf financial data](enron61702insiderpay.pdf), the travel agency was co-owned by the sister of Enron's former Chairman, and I have no solid reason to exclude it from the dataset.

- List of data points to remove:
    
    - 'TOTAL'
    - 'LOCKHART EUGENE E'
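Records like 'LOCKHART EUGENE E' can also be found programmatically rather than by inspection. The following is a minimal sketch on made-up records; `all_nan_people` and the `toy` dictionary are hypothetical, and the label key `poi` is ignored because it is never "NaN".

```python
def all_nan_people(data, ignore=("poi",)):
    """Return names whose every feature (except those in `ignore`) is 'NaN'."""
    return sorted(
        name for name, record in data.items()
        if all(value == "NaN"
               for key, value in record.items() if key not in ignore)
    )


# hypothetical toy records, not taken from the Enron data
toy = {
    "PERSON A": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": 100, "bonus": "NaN"},
    "TOTAL": {"poi": False, "salary": 500, "bonus": 900},
}

# the aggregate row is dropped by name, empty records by the check above
to_remove = ["TOTAL"] + all_nan_people(toy)
cleaned = {name: rec for name, rec in toy.items() if name not in to_remove}
print(sorted(cleaned))
```

Building a new dictionary instead of popping keys leaves the original data untouched, which makes it easy to rerun the exploration cells.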

In [179]:
### remove the two data points identified above
data_dict.pop("TOTAL", 0)
data_dict.pop("LOCKHART EUGENE E", 0)

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

***
# Create new features

 As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) 

I calculate "fraction_from_poi" and "fraction_to_poi", the fractions of a person's received and sent messages that involve a POI, using "from_poi_to_this_person", "from_this_person_to_poi", "to_messages", and "from_messages".

A python dictionary can’t be read directly into an sklearn classification or regression algorithm; instead, it needs a numpy array or a list of lists (each element of the list (itself a list) is a data point, and the elements of the smaller list are the features of that point).

Udacity has written helper functions (featureFormat() and targetFeatureSplit() in feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.

In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
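To make the conversion concrete, here is a dependency-free sketch of turning a dictionary of per-person dictionaries into a list of lists. It is not the Udacity `featureFormat()` implementation and has none of its row-filtering options; `dict_to_rows` and the `toy` records are hypothetical.

```python
def dict_to_rows(data, feature_names):
    """Turn a dict of per-person dicts into a list of rows (list of lists).

    'NaN' strings become 0.0, and rows follow the sorted order of the keys
    so that the output is reproducible.
    """
    rows = []
    for name in sorted(data):
        row = []
        for feature in feature_names:
            value = data[name][feature]
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows


# hypothetical toy records, not taken from the Enron data
toy = {
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 300},
    "PERSON A": {"poi": True, "salary": 100, "bonus": "NaN"},
}

rows = dict_to_rows(toy, ["poi", "salary", "bonus"])
# the label is listed first, so it can be split off from the features
labels = [row[0] for row in rows]
features = [row[1:] for row in rows]
print(labels)
print(features)
```

Putting the label first in the feature list mirrors how `total_feature_list` is built in this notebook, with 'poi' at index 0.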

In [None]:
def computeFraction( poi_messages, all_messages ):
    """ given a number of messages to/from POI (numerator) 
        and number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
    """

    ### beware of "NaN" when there is no known email address (and so
    ### no filled email features), and integer division!
    ### if all_messages is "NaN" or 0, return 0; a "NaN" poi_messages counts as 0.
    fraction = 0.
    if all_messages == 'NaN' or all_messages == 0:
        return fraction
    
    if poi_messages == 'NaN':
        poi_messages = 0
    
    fraction = float(poi_messages)/float(all_messages)

    return fraction


newfeature_dict = {}
for name in data_dict:
    
    data_point = data_dict[name]
    
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction(from_poi_to_this_person, to_messages)
    data_point["fraction_from_poi"] = fraction_from_poi

    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
    data_point["fraction_to_poi"] = fraction_to_poi
    
    newfeature_dict[name]={"fraction_from_poi":fraction_from_poi,
                       "fraction_to_poi":fraction_to_poi}
    data_dict[name]["fraction_from_poi"] = fraction_from_poi
    data_dict[name]["fraction_to_poi"] = fraction_to_poi
    

print newfeature_dict['METTS MARK']

In [None]:
print data_dict['METTS MARK']

In [None]:
len(data_dict['METTS MARK'])

Now I have a total of 23 key-value pairs per name. Before adding the two new features, there were 21 key-value pairs per name.

In [None]:
finacial_feature_list = ['salary', 'deferral_payments', 'total_payments', 
                'loan_advances', 'bonus', 'restricted_stock_deferred',
               'deferred_income', 'total_stock_value', 'expenses',
               'exercised_stock_options', 'other', 'long_term_incentive',
               'restricted_stock', 'director_fees'] 

# numeric feature list, which excludes email address
email_feature_list = ['to_messages', 'from_poi_to_this_person', 'from_messages',
                     'from_this_person_to_poi', 'shared_receipt_with_poi', 
                      "fraction_from_poi", "fraction_to_poi"]
label = ['poi']

total_feature_list = label + finacial_feature_list + email_feature_list

print total_feature_list
print len(total_feature_list)
print len(finacial_feature_list)
print len(email_feature_list)

Of the 23 key-value pairs per name, 1 (email_address) is excluded, leaving 22: 14 financial features, 7 email features, and 1 label.

In [None]:
# Select all numeric features and convert dictionary to numpy array of features
data_nparray = featureFormat(data_dict, total_feature_list)
poi, total_features = targetFeatureSplit(data_nparray)

In [None]:
# How many data points (people) are in the dataset?
# number of keys
len(data_dict)

In [None]:
len(data_nparray)

In [None]:
data_nparray[0]

The number of keys is 146 - 1 ('TOTAL') - 1 ('LOCKHART EUGENE E', all NaN) = 144.

# Feature scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
rescaled_data_nparray = scaler.fit_transform(data_nparray)
rescaled_data_nparray[0]

Except for the 'poi' label, a zero value is not truly zero: it stands for a missing value ("NaN"). I need to figure out how to scale the features without treating missing values as zeros.


In [None]:
# call for salary
data_nparray[:, 1]

In [None]:
# to figure out how to apply MinMaxScaler() while excluding NaN 
from sklearn.preprocessing import MinMaxScaler

# use a real np.nan; the string 'NaN' would turn the whole array into strings
array = np.array([[115.], [140.], [175.], [np.nan]])
# keep only the rows that contain no NaN
new_nparray = array[~np.isnan(array).any(axis=1)]
print new_nparray
scaler = MinMaxScaler()
rescaled_array = scaler.fit_transform(new_nparray)
rescaled_array

A better option may be imputation of missing values (see http://scikit-learn.org/stable/modules/preprocessing.html). The Imputer class provides basic strategies for imputing missing values, using the mean, the median, or the most frequent value of the row or column in which the missing values are located.
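The idea behind column-wise median imputation can be shown without any library. The sketch below is an illustration of the concept only, not scikit-learn's implementation; `column_median` and `impute_median` are hypothetical helpers, and every column is assumed to have at least one observed value.

```python
import math


def column_median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def impute_median(rows):
    """Replace NaN entries with the median of the observed values
    in the same column; returns a new list of rows."""
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        observed = [row[col] for row in rows
                    if not math.isnan(row[col])]
        medians.append(column_median(observed))
    return [[medians[col] if math.isnan(value) else value
             for col, value in enumerate(row)]
            for row in rows]


nan = float("nan")
a = [[1, 2], [nan, 3], [9, 6]]
print(impute_median(a))
```

The missing entry in the first column is filled with the median of the two observed values, 1 and 9. In scikit-learn 0.22 and later the `Imputer` class is no longer available, and `sklearn.impute.SimpleImputer` fills the same role.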

In [None]:
# figure out how to use imputer
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
a = [[1, 2], [np.nan, 3], [9, 6]]
imp.fit(a)
imp.transform(a)

In [None]:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
c = [[1, 2], [np.nan, 3], [9, 6]]
data_imp = imp.fit_transform(c)

data_imp

In [None]:
# Convert dictionary to numpy array of features with NaN
data_nparray_wNaN = featureFormat(data_dict, total_feature_list, remove_NaN=False, remove_all_zeroes=True)


In [None]:
len(data_nparray_wNaN)

In [None]:
#data_nparray_wNaN[:, 1]
data_nparray_wNaN[0]

In [None]:
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='median', axis=0)
data_nparray_wNaN_imp = imp.fit_transform(data_nparray_wNaN)
data_nparray_wNaN_imp[0]                    

Why are there negative values?

In [None]:
# min of deferral_payments
min(data_nparray[:, 2])

In [None]:
# min of deferred_income
min(data_nparray[:, 7])

In [None]:
# min of total_stock_value
min(data_nparray[:, 8])

The financial data contain negative values.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
rescaled_data_nparray_imp = scaler.fit_transform(data_nparray_wNaN_imp)
rescaled_data_nparray_imp[0]

Why is the median value not around 0.5?

In [None]:
imp_mean = Imputer(missing_values='NaN', strategy='mean', axis=0)
data_nparray_wNaN_imp_mean = imp_mean.fit_transform(data_nparray_wNaN)
print data_nparray_wNaN_imp_mean[0] 
print 
rescaled_data_nparray_imp_mean = scaler.fit_transform(data_nparray_wNaN_imp_mean)
print rescaled_data_nparray_imp_mean[0]

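Why is the mean value not around 0.5? My first guess was that the mean was calculated without counting the missing values. A more likely reason is that min-max scaling maps a value v to (v - min)/(max - min), so the mean or median lands at 0.5 only when it sits halfway between the minimum and the maximum. With a few very large values in a column, both are pushed toward 0.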
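Where the median and mean land after min-max scaling can be checked on a small skewed column. This is a dependency-free sketch with made-up numbers; `min_max_scale`, `median`, and `skewed` are hypothetical names, and the scaling formula assumes the column is not constant.

```python
def min_max_scale(values):
    """Scale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / float(hi - lo) for v in values]


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# hypothetical skewed column: one value dwarfs the rest
skewed = [1, 2, 3, 4, 100]
scaled = min_max_scale(skewed)

scaled_median = median(scaled)              # (3 - 1) / 99
scaled_mean = sum(scaled) / len(scaled)     # (22 - 1) / 99
print(scaled_median)
print(scaled_mean)
```

Here the scaled median is about 0.02 and the scaled mean about 0.21, both far below 0.5, because a single extreme value sets the top of the range. The Enron financial features are similarly skewed, judging by the summary statistics shown earlier.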

In [None]:
from sklearn.preprocessing import StandardScaler

stand = StandardScaler()
stand_data_nparray_imp = stand.fit_transform(data_nparray_wNaN_imp)
stand_data_nparray_imp[0]

# Feature selection

In [None]:
# convert numpy array to list
poi, total_rescaled_features = targetFeatureSplit(rescaled_data_nparray_imp)

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state = 44)

clf.fit(total_rescaled_features, poi)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

print 'Total Feature Ranking: '
for i in range(10):
    print "{}: no.{}, {} ({})".format(i+1,indices[i], 
                                      total_feature_list[indices[i]+1],
                            importances[indices[i]])


In [None]:
indices[9]

In [None]:
poi_array = rescaled_data_nparray_imp[:, 0]
rescaled_total_data_nparray_imp = rescaled_data_nparray_imp[:, 1:]
rescaled_financial_data_nparray_imp = rescaled_data_nparray_imp[:, 1:15]
rescaled_email_data_nparray_imp = rescaled_data_nparray_imp[:, 15:]


In [None]:
# important financial features

clf.fit(rescaled_financial_data_nparray_imp, poi_array)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

print 'Financial Feature Ranking: '
for i in range(10):
    print "{}: no.{}, {} ({})".format(i+1,indices[i],
                                      finacial_feature_list[indices[i]], 
                                      importances[indices[i]])


In [None]:
indices[0]

In [None]:
# important email features

clf.fit(rescaled_email_data_nparray_imp, poi_array)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

print 'Email Feature Ranking: '
for i in range(5):
    print "{}: no.{}, {} ({})".format(i+1,indices[i],
                                      email_feature_list[indices[i]], 
                                      importances[indices[i]])


# PCA

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(rescaled_total_data_nparray_imp)

print pca.explained_variance_ratio_ # % of variance explained


In [None]:
# financial features
pca.fit(rescaled_financial_data_nparray_imp)

print pca.explained_variance_ratio_

In [None]:
# email features
pca.fit(rescaled_email_data_nparray_imp)

print pca.explained_variance_ratio_