# Identify Fraud from Enron Email Project
## June 2017, by Jude Moon
<br />

# Project Overview
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. 

In this project, I play detective and put my new machine learning skills to use by building a person of interest (POI) identifier based on financial and email data made public as a result of the Enron scandal. I used [the provided dataset](link) from the [Udacity Intro to Machine Learning course](https://www.udacity.com/course/intro-to-machine-learning--ud120), which was combined with a hand-generated list of POIs in the fraud case. POIs are individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.

This document keeps my notes as I work through the project and compose answers to [a series of questions](https://docs.google.com/document/d/1NDgi1PrNJP7WTbfSUuRUnz8yzs5nGVTSzpO7oeNTEWA/pub?embedded=true) provided by Udacity. It shows my thought process and my approach to the problem.
***

# Part1. Data Exploration
## Q1-1: Summarize the goal of this project
The goal of the Enron project is to build a valid algorithm that identifies Enron employees who may have committed fraud (labeled as persons of interest, or POIs), using features from their financial and email data.

## Q1-2: Give some background on the dataset 

In [105]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [68]:
# loads up the dataset (pickled dict of dicts)
# pickle files are binary, so open in "rb" mode and close the file when done
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

### Enron dataset (emails + finances) has the form:
    
    data_dict["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }
    
The data dictionary is stored as a **pickle** file, which is a handy way to store and load python objects directly.
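
As a minimal sketch of that round trip, the snippet below pickles a small made-up dictionary in the same `name -> features` shape and loads it back. The names, values, and file name are hypothetical and are not taken from the Enron data.

```python
import os
import pickle
import tempfile

# Hypothetical miniature of the dataset: person name -> feature dict
toy_dict = {
    "DOE JANE A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "ROE RICHARD B": {"salary": "NaN", "bonus": 250000, "poi": True},
}

# Write to a throwaway location so nothing in the project folder is touched
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")

# pickle files are binary, so use "wb" to write and "rb" to read
with open(path, "wb") as f:
    pickle.dump(toy_dict, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(len(loaded))
print(loaded["ROE RICHARD B"]["poi"])
```

The loaded object is an ordinary Python dictionary again, so it can be indexed by person name exactly like `data_dict`.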

### How many data points (people) are in the dataset?

In [69]:
len(data_dict)

146

### How many POI?
In other words, count the number of entries in the dictionary where
`data_dict[person_name]["poi"] == 1`:
- 1 means POI 
- 0 means non-POI

In [53]:
count_poi = 0
for person in data_dict:
    if data_dict[person]["poi"] == 1:
        count_poi += 1
print "Number of POIs : %i" %count_poi
print "Number of non-POIs : %i" %(146-count_poi)

Number of POIs : 18
Number of non-POIs : 128


### Do we have sufficient data points?

In [54]:
# Udacity course provided a compiled list of all POI names from Enron corpus
# poi_names.txt is newline delimited
# read poi_names.txt file: each newline to string in a list
poi_names_txt = open("poi_names.txt", "r").read().splitlines()

print "1st line: " + poi_names_txt[0]
print "2nd line: " + poi_names_txt[1]
print "3rd line: " + poi_names_txt[2]
print "37th line: " + poi_names_txt[36]
print "Number of POIs from Enron corpus: %i"%(len(poi_names_txt)-2)

1st line: http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm
2nd line: 
3rd line: (y) Lay, Kenneth
37th line: (n) Loehr, Christopher
Number of POIs from Enron corpus: 35


The list of POI names compiled from the Enron corpus (emails of 158 employees in total) contains 35 POIs, whereas the combined financial and email dataset contains only 18.

About half of the POIs are therefore missing from the email + finance data dictionary. This could make it harder to understand the full scope of the patterns between the features and the POI label.

However, adding the missing POIs from the email data and leaving "NaN" for all of their financial features would introduce NaN-driven bias.

### For each person, how many features are available?

In [4]:
len(data_dict[data_dict.keys()[0]])

21

### What are the features?

In [13]:
# the key of features for the first key
features_list = data_dict[data_dict.keys()[0]].keys() 
pprint.pprint(features_list)

['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']


### How many NaN (Not a Number) exist per feature?

In [14]:
# create a dictionary of feature and count of NaN pairs
count_NaN = {}
for feature in features_list:
    count_NaN[feature] = 0

for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            count_NaN[feature] +=1

# sort the (feature, count) pairs in ascending order of count
count_NaN = sorted(count_NaN.items(), key=operator.itemgetter(1))
pprint.pprint(count_NaN)

[('poi', 0),
 ('total_stock_value', 20),
 ('total_payments', 21),
 ('email_address', 35),
 ('restricted_stock', 36),
 ('exercised_stock_options', 44),
 ('salary', 51),
 ('expenses', 51),
 ('other', 53),
 ('to_messages', 60),
 ('shared_receipt_with_poi', 60),
 ('from_messages', 60),
 ('from_poi_to_this_person', 60),
 ('from_this_person_to_poi', 60),
 ('bonus', 64),
 ('long_term_incentive', 80),
 ('deferred_income', 97),
 ('deferral_payments', 107),
 ('restricted_stock_deferred', 128),
 ('director_fees', 129),
 ('loan_advances', 142)]


### Would NaN introduce bias to the features?

In [41]:
# create a dictionary holding, for each feature, the total number of NaN
# and the number of NaN among POIs
NaN_dict = {}
keys = ['NaN_total', 'NaN_poi']

for key in keys:
    NaN_dict[key] = {}
    for feature in features_list:
        NaN_dict[key][feature] = 0
        
for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            NaN_dict['NaN_total'][feature] +=1
        
        if data_dict[person][feature] == "NaN" and data_dict[person]['poi'] == True:
            NaN_dict['NaN_poi'][feature] +=1

# convert from a dictionary to a pandas dataframe
NaN_df = pd.DataFrame(NaN_dict)
NaN_df['NaN_non-poi'] = NaN_df['NaN_total']-NaN_df['NaN_poi']
NaN_df['%NaN_in_poi'] = (NaN_df['NaN_poi']/18)*100 # from total 18 POI
NaN_df['%NaN_in_non-poi'] = (NaN_df['NaN_non-poi']/128)*100 # from total 128 non-POI
NaN_df['diff_%'] = NaN_df['%NaN_in_poi'] - NaN_df['%NaN_in_non-poi']
# DataFrame.sort was removed from newer pandas; sort_values is the replacement
NaN_df = NaN_df.sort_values('diff_%')
NaN_df



| feature | NaN_poi | NaN_total | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | diff_% |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| other | 0 | 53 | 53 | 0.0 | 41.40625 | -41.40625 |
| expenses | 0 | 51 | 51 | 0.0 | 39.84375 | -39.84375 |
| bonus | 2 | 64 | 62 | 11.111111 | 48.4375 | -37.326389 |
| salary | 1 | 51 | 50 | 5.555556 | 39.0625 | -33.506944 |
| deferred_income | 7 | 97 | 90 | 38.888889 | 70.3125 | -31.423611 |
| email_address | 0 | 35 | 35 | 0.0 | 27.34375 | -27.34375 |
| long_term_incentive | 6 | 80 | 74 | 33.333333 | 57.8125 | -24.479167 |
| restricted_stock | 1 | 36 | 35 | 5.555556 | 27.34375 | -21.788194 |
| to_messages | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |
| shared_receipt_with_poi | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |


I initially thought that the features with the most "NaN" values (e.g. 'loan_advances', 'director_fees', 'restricted_stock_deferred') would introduce bias. However, a disproportionate share of "NaN" values between the POI and non-POI groups may be more problematic. Features with a large gap between the % NaN in the POI group and the % NaN in the non-POI group, such as 'other' and 'expenses', are likely biased by "NaN" values. If a supervised classifier used 'other' as a feature, it could treat "NaN" for 'other' as a clue that a person is a non-POI, and so learn to associate a "NaN" value with the non-POI label.

I am not sure whether it is acceptable to associate a lack of information, such as a "NaN" value, with a particular label. I will keep this in mind and consider excluding the NaN-biased features at the feature selection stage.
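
The per-group comparison described above can also be sketched more compactly with a pandas `groupby`. The toy frame below is invented for illustration (it is not the Enron data), and the two feature columns are an arbitrary choice:

```python
import pandas as pd

# Made-up records: missing values are stored as the string "NaN",
# mirroring how the project dataset encodes them
toy = pd.DataFrame({
    "poi":   [True, True, False, False, False, False],
    "other": [10, 20, "NaN", "NaN", 5, "NaN"],
    "bonus": ["NaN", 1, 2, "NaN", 3, 4],
})
features = ["other", "bonus"]

# True wherever a feature value is the "NaN" placeholder
is_missing = toy[features].eq("NaN")

# Readable group labels instead of True/False
group = toy["poi"].map({True: "poi", False: "non-poi"})

# The mean of a boolean column is the fraction of True values
pct_missing = is_missing.groupby(group).mean() * 100
diff = pct_missing.loc["poi"] - pct_missing.loc["non-poi"]
print(diff.sort_values())
```

A strongly negative difference means the feature is missing far more often for non-POIs than for POIs, which is the pattern flagged for 'other' and 'expenses'.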


## Summary of data exploration
- Total number of data points: 146
- Total number of data points labeled as POI: 18
- Total number of data points labeled as non-POI: 128
- Imbalanced classes
- Number of missing POIs: 17
- Number of initial features: 21
- Features with more than 73 "NaN" values (50% cut-off):

| feature name  | number of NaN  |
|:---:|:---:|
| 'loan_advances' | 142  |
| 'director_fees'  | 129  |
| 'restricted_stock_deferred'  | 128  |
|  'deferral_payments' | 107  |
| 'deferred_income'  | 97  |
| 'long_term_incentive'  |  80 |
    

- Features whose "NaN" values are disproportionately distributed between the POI and non-POI groups:

|    feature_name   | NaN_total | NaN_poi | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | %Difference|
|:-----------------:|:---------:|:-------:|:-----------:|:-----------:|:---------------:|:---------------:|
|      'other'      |     53    |    0    |      53     |      0      |        41       |       -41       |
|     'expenses'    |     51    |    0    |      51     |      0      |        40       |       -40       |
|      'bonus'      |     64    |    2    |      62     |      11     |        48       |       -37       |
|      'salary'     |     51    |    1    |      50     |      6      |        39       |       -34       |
| 'deferred_income' |     97    |    7    |      90     |      39     |        70       |       -31       |

## Q1-3: How machine learning is useful in trying to accomplish the project goal and answer the project question

It is uncertain whether the existing financial and email dataset provides good indicators/predictors for identifying POIs. After data exploration, I realized that there are limitations such as NaN-driven bias and the absence of about half of the POIs.

Despite these limitations, machine learning can be useful for discovering hidden patterns in the features associated with POI labels and for understanding the relationship between a feature, or a bundle of features, and the POI label. After validating and evaluating the performance of a machine learning algorithm, we can answer whether these simple numeric features can indicate or predict who is a POI.

Following the scikit-learn algorithm cheat-sheet below (predicting a category > yes > do you have labeled data > yes > fewer than 100k samples > yes), the options are:


- Linear SVC 
- KNeighbors 
- SVC
- Ensemble classifiers

![image](http://scikit-learn.org/stable/_static/ml_map.png)

To review the algorithms covered in the Udacity lectures, I will also try:

- Gaussian Naive Bayes
- Decision Trees
- AdaBoost (boosted decision tree)
- Random Forest
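
As a rough sketch, the candidates listed above map onto the following scikit-learn estimators. The synthetic data from `make_classification` is only a stand-in with a similar size and class imbalance, and every estimator uses its default parameters, which is an assumption rather than a tuned choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (not the Enron data): 144 samples with
# roughly 12.5% of them in the positive class
X, y = make_classification(n_samples=144, n_features=5,
                           weights=[0.875], random_state=42)

candidates = {
    "Linear SVC": LinearSVC(),
    "KNeighbors": KNeighborsClassifier(),
    "SVC": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

fitted = {}
for name, clf in candidates.items():
    fitted[name] = clf.fit(X, y)  # fit() returns the estimator itself
    print(name + ": fitted")
```

Fitting on the full toy set only confirms that each estimator runs; it says nothing about how well any of them would identify POIs, which requires proper validation.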


# Outlier Investigation

### Who has the most NaN?

In [165]:
# create a dictionary of person and count of NaN pairs
missing_value = {}

for person in data_dict:
    missing_value[person] = 0
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            missing_value[person] +=1

# sort the (person, count) pairs in ascending order of count
missing_value = sorted(missing_value.items(), key=operator.itemgetter(1))

# print the 5 people with the most NaN
pprint.pprint(missing_value[-5:])

[('WHALEY DAVID A', 18),
 ('WROBEL BRUCE', 18),
 ('THE TRAVEL AGENCY IN THE PARK', 18),
 ('GRAMM WENDY L', 18),
 ('LOCKHART EUGENE E', 20)]


### Glance at numerical variable distributions

In [71]:
# to get summary statistics for each feature, I use a pandas dataframe
# convert a python dictionary to a dataframe 
# with features as columns and people as rows
df = pd.DataFrame(data_dict)
df_trans = df.transpose()

In [4]:
# to get numerical statistics, replace the string "NaN" with zero (0)
def to_zero(v):
    if v == 'NaN':
        v = 0
    return v
df_trans = df_trans.applymap(to_zero)
df_trans.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0
mean,1333474.0,438796.5,-382762.2,19422.49,4182736.0,70748.27,358.60274,38.226027,24.287671,1149658.0,664683.9,585431.8,1749257.0,20516.37,365811.4,692.986301,1221.589041,4350622.0,5846018.0
std,8094029.0,2741325.0,2378250.0,119054.3,26070400.0,432716.3,1441.259868,73.901124,79.278206,9649342.0,4046072.0,3682345.0,10899950.0,1439661.0,2203575.0,1072.969492,2226.770637,26934480.0,36246810.0
min,0.0,-102500.0,-27992890.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,-7576788.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,-37926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,0.0,93944.75,228869.5
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,16.5,2.5,0.0,0.0,0.0,959.5,360528.0,0.0,210596.0,102.5,289.0,941359.5,965955.0
75%,800000.0,9684.5,0.0,0.0,1714221.0,53740.75,51.25,40.75,13.75,0.0,375064.8,150606.5,814528.0,0.0,270850.5,893.5,1585.75,1968287.0,2319991.0
max,97343620.0,32083400.0,0.0,1398517.0,311764000.0,5235198.0,14368.0,528.0,609.0,83925000.0,48521930.0,42667590.0,130322300.0,15456290.0,26704230.0,5521.0,15149.0,309886600.0,434509500.0


## Q1-4: Are there any outliers in the dataset?

In [169]:
# I define outliers here as values above the 99% quantile
# get the list of people above the 99% quantile for each feature
highest = {}
for column in df_trans.columns:
    if df_trans[column].dtypes == "int64":
        q = df_trans[column].quantile(0.99)
        # filter df_trans by its own column (data_df was never defined)
        highest[column] = df_trans[df_trans[column] > q].index.tolist()
    
pprint.pprint(highest)

{'bonus': ['LAVORATO JOHN J', 'TOTAL'],
 'deferral_payments': ['FREVERT MARK A', 'TOTAL'],
 'deferred_income': [],
 'director_fees': ['BHATNAGAR SANJAY', 'TOTAL'],
 'exercised_stock_options': ['LAY KENNETH L', 'TOTAL'],
 'expenses': ['MCCLELLAN GEORGE', 'TOTAL'],
 'from_messages': ['KAMINSKI WINCENTY J', 'KEAN STEVEN J'],
 'from_poi_to_this_person': ['DIETRICH JANET R', 'LAVORATO JOHN J'],
 'from_this_person_to_poi': ['DELAINEY DAVID W', 'LAVORATO JOHN J'],
 'loan_advances': ['LAY KENNETH L', 'TOTAL'],
 'long_term_incentive': ['MARTIN AMANDA K', 'TOTAL'],
 'other': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock_deferred': ['BELFER ROBERT', 'BHATNAGAR SANJAY'],
 'salary': ['SKILLING JEFFREY K', 'TOTAL'],
 'shared_receipt_with_poi': ['BELDEN TIMOTHY N', 'SHAPIRO RICHARD S'],
 'to_messages': ['KEAN STEVEN J', 'SHAPIRO RICHARD S'],
 'total_payments': ['LAY KENNETH L', 'TOTAL'],
 'total_stock_value': ['LAY KENNETH L', 'TOTAL']}


### Which outliers appear repeatedly across the features?

In [170]:
# summarize the previous dictionary, highest
# create a dictionary mapping each outlier to the number of features it is an outlier in
highest_count = {}
for feature in highest:
    for person in highest[feature]:
        if person not in highest_count:
            highest_count[person] = 1
        else:
            highest_count[person] += 1
            
highest_count = sorted(highest_count.items(), key=operator.itemgetter(1))   
highest_count

[('DELAINEY DAVID W', 1),
 ('MARTIN AMANDA K', 1),
 ('SKILLING JEFFREY K', 1),
 ('BELDEN TIMOTHY N', 1),
 ('DIETRICH JANET R', 1),
 ('FREVERT MARK A', 1),
 ('KAMINSKI WINCENTY J', 1),
 ('BELFER ROBERT', 1),
 ('MCCLELLAN GEORGE', 1),
 ('KEAN STEVEN J', 2),
 ('BHATNAGAR SANJAY', 2),
 ('SHAPIRO RICHARD S', 2),
 ('LAVORATO JOHN J', 3),
 ('LAY KENNETH L', 6),
 ('TOTAL', 12)]

## Summary of outlier Investigation

- Top 5 people with the most "NaN" values:

|          person name          | number of NaN |
|:-----------------------------:|:-------------:|
|       LOCKHART EUGENE E       |       20      |
|         GRAMM WENDY L         |       18      |
| THE TRAVEL AGENCY IN THE PARK |       18      |
|          WROBEL BRUCE         |       18      |
|         WHALEY DAVID A        |       18      |

- Top 3 people repeatedly shown as outliers:

|   person name   | frequency of being outlier |
|:---------------:|:--------------------------:|
|      TOTAL      |             12             |
|  LAY KENNETH L  |              6             |
| LAVORATO JOHN J |              3             |

### Take a look at outliers

In [178]:
df[['LOCKHART EUGENE E', 'GRAMM WENDY L', \
    'THE TRAVEL AGENCY IN THE PARK', \
    'WROBEL BRUCE', 'WHALEY DAVID A', \
    'TOTAL', 'LAY KENNETH L', 'LAVORATO JOHN J']]

| feature | LOCKHART EUGENE E | GRAMM WENDY L | THE TRAVEL AGENCY IN THE PARK | WROBEL BRUCE | WHALEY DAVID A | TOTAL | LAY KENNETH L | LAVORATO JOHN J |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| bonus |  |  |  |  |  | 97343619 | 7000000 | 8000000 |
| deferral_payments |  |  |  |  |  | 32083396 | 202911 |  |
| deferred_income |  |  |  |  |  | -27992891 | -300000 |  |
| director_fees |  | 119292 |  |  |  | 1398517 |  |  |
| email_address |  |  |  |  |  |  | kenneth.lay@enron.com | john.lavorato@enron.com |
| exercised_stock_options |  |  |  | 139130 | 98718 | 311764000 | 34348384 | 4158995 |
| expenses |  |  |  |  |  | 5235198 | 99832 | 49537 |
| from_messages |  |  |  |  |  |  | 36 | 2585 |
| from_poi_to_this_person |  |  |  |  |  |  | 123 | 528 |
| from_this_person_to_poi |  |  |  |  |  |  | 16 | 411 |


## Q1-5: How to handle outliers?

'TOTAL' appears to be an outlier introduced by a spreadsheet quirk: it is the sum of all entries in the [pdf financial data](enron61702insiderpay.pdf). It needs to be removed from the dataset.

In addition, 'LOCKHART EUGENE E' should probably be removed as well, because every feature other than his POI label is NaN and he is labeled as a non-POI.

Among the outliers and the data points with many missing values, only 'LAY KENNETH L' is labeled as a POI; he was chairman of the Enron board of directors. I therefore think his extreme values are meaningful rather than the result of typos or technical errors.

'LAVORATO JOHN J' is an interesting individual: he received the largest bonus and communicated with POIs by email most frequently, yet he is not labeled as a POI. I expect this person to lie near the classification boundary or to be misclassified.

I am inclined to keep the other detected outliers, including 'THE TRAVEL AGENCY IN THE PARK'. According to a footnote in the [pdf financial data](enron61702insiderpay.pdf), the travel agency was co-owned by the sister of Enron's former chairman, and I don't have a solid reason to exclude it from the dataset.

- List of data points to remove:
    
    - 'TOTAL'
    - 'LOCKHART EUGENE E'

In [70]:
### there's an outlier--remove it! 
data_dict.pop("TOTAL", 0)
data_dict.pop("LOCKHART EUGENE E", 0)
len(data_dict)

144

The number of keys is now 146 - 1 ('TOTAL') - 1 ('LOCKHART EUGENE E', all NaN) = 144.

In [72]:
df = pd.DataFrame(data_dict)
df_trans = df.transpose()
df_trans = df_trans.applymap(to_zero)
df_trans.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0
mean,675997.4,222089.6,-193683.3,9980.319444,2075802.0,35375.340278,363.583333,38.756944,24.625,582812.5,336957.8,297260.1,868536.3,73417.9,185446.0,702.611111,1238.555556,2259057.0,2909786.0
std,1233155.0,754101.3,606011.1,31300.575144,4795513.0,45309.303038,1450.675239,74.276769,79.778266,6794472.0,687182.6,1131068.0,2016572.0,1301983.0,197042.1,1077.290736,2237.564816,8846594.0,6189018.0
min,0.0,-102500.0,-3504386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,-1787380.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,-37086.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24345.0,0.0,0.0,0.0,0.0,99648.25,244326.5
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,17.5,4.0,0.0,0.0,0.0,959.5,360528.0,0.0,210596.0,114.0,347.5,941359.5,965955.0
75%,800000.0,8535.5,0.0,0.0,1683580.0,53328.25,53.0,41.25,14.0,0.0,374586.2,150507.5,737456.0,0.0,269667.5,933.75,1623.0,1945668.0,2295176.0
max,8000000.0,6426990.0,0.0,137864.0,34348380.0,228763.0,14368.0,528.0,609.0,81525000.0,5145434.0,10359730.0,14761690.0,15456290.0,1111258.0,5521.0,15149.0,103559800.0,49110080.0


***
# Part2. Feature Engineering

As part of the project, I should attempt to engineer my own features that do not come ready-made in the dataset. Before creating new features, I need to explore the existing ones.

## Take a look at features

### 1. Email features

    to_messages, from_poi_to_this_person, from_messages, from_this_person_to_poi, shared_receipt_with_poi


Of the 6 email features (the 5 listed above plus email_address), I think email_address can be removed: it is the only non-numeric feature, and I don't think it carries any meaningful information for classifying the labels.


### 2. Financial features can be grouped into two categories: payments and stock value

| categories  | features with positive values                                                                        | features with negative values | summed to         |
|-------------|------------------------------------------------------------------------------------------------------|-------------------------------|-------------------|
| payments    | salary, bonus, long_term_incentive, deferral_payments, loan_advances, other, expenses, director_fees | deferred_income               | total_payments    |
| stock value | exercised_stock_options, restricted_stock                                                            | restricted_stock_deferred     | total_stock_value |

'total_payments' and 'total_stock_value' are the summary features of each category. They may either represent the latent features of the two categories well or cancel out meaningful patterns in the individual features. So, here are some potential ways to engineer the features.
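
One way to check that the components really add up to a summary feature is to compare the row-wise sum with the reported total. The sketch below uses made-up numbers and only three of the payment components for brevity; `find_total_mismatches` is a hypothetical helper, not part of the project code:

```python
import pandas as pd

def find_total_mismatches(frame, parts, total):
    """Return the rows whose parts do not add up to the reported total."""
    computed = frame[parts].sum(axis=1)
    return frame[computed != frame[total]]

# Made-up records: PERSON A is consistent (100 + 50 - 30 = 120),
# PERSON B is not (200 + 0 + 0 != 150)
toy = pd.DataFrame(
    {
        "salary": [100, 200],
        "bonus": [50, 0],
        "deferred_income": [-30, 0],
        "total_payments": [120, 150],
    },
    index=["PERSON A", "PERSON B"],
)

bad_rows = find_total_mismatches(
    toy, ["salary", "bonus", "deferred_income"], "total_payments")
print(bad_rows.index.tolist())
```

On the real dataframe the same check would take the full list of payment (or stock) components, with "NaN" already replaced by zero so the sums are defined.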

## Brainstorm How to Treat Features

### 1. Treat all the numerical features individually
- Feature transformation using PCA (requires feature scaling prior to PCA), then feature selection
- Feature selection directly, without any transformation

### 2. Treat the numerical features as 3 latent features (payment, stock, and email)
- Feature transformation using PCA separately (each latent feature has its own set of PCA features), then feature selection
- Relativization prior to PCA transformation, then feature selection
- Relativization, then feature selection

Mixing features with absolute values and features with relative values would open up more options for feature engineering, but for now I focus on comparing the feature importances or scores of the 5 combinations described above.

**Relativization can be achieved in two ways:**

1. feature / summed to
2. feature / (summed to - feature with negative values), because the feature with negative values partially cancels out the sum

**For email features, create features for the fraction of messages exchanged with POIs among all messages:**

1. ("from_this_person_to_poi" + "from_poi_to_this_person") / ("from_messages" + "to_messages")
2. "from_poi_to_this_person" / "to_messages"
3. "from_this_person_to_poi" / "from_messages"
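
The two relativization methods can be sketched side by side as follows. The helper name `relativize`, its `negative` parameter, and the toy numbers are assumptions made for illustration; mapping undefined ratios to 0 is one possible convention:

```python
import numpy as np
import pandas as pd

def relativize(frame, features, total, negative=None):
    """Divide each feature by the total (method 1) or, when `negative`
    is given, by the total minus the negative-valued feature (method 2)."""
    denom = frame[total].astype(float)
    if negative is not None:
        # subtracting a negative value adds its magnitude back to the total
        denom = denom - frame[negative]
    ratios = frame[features].div(denom, axis=0)
    # 0/0 gives NaN and x/0 gives +/-inf; map both to 0
    ratios = ratios.replace([np.inf, -np.inf], np.nan).fillna(0)
    return ratios.add_prefix("rel_")

# Made-up records; PERSON B has no payments at all
toy = pd.DataFrame(
    {"salary": [100, 0], "bonus": [50, 0],
     "deferred_income": [-30, 0], "total_payments": [120, 0]},
    index=["PERSON A", "PERSON B"],
)

method1 = relativize(toy, ["salary", "bonus"], "total_payments")
method2 = relativize(toy, ["salary", "bonus"], "total_payments",
                     negative="deferred_income")
print(method1)
print(method2)
```

With method 1 the positive components of PERSON A sum to more than 1 (150 / 120), because the negative feature shrinks the total. With method 2 they sum to exactly 1, since the denominator is the sum of the positive components only.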

# Remove features
email_address is not a numeric variable, so I will remove this feature from the dataframe.

In [222]:
# remove column email_address from df_trans
df_trans = df_trans.drop('email_address', axis=1)

# Create new features

## Q2-1: what features to create and the rationale behind it
I will create 12 new features holding the relative values of the payment and stock features, using relativization method 1, and 3 new features holding the fraction of emails exchanged with POIs.

In [14]:
# to separate the POI label from features_list and remove email_address
features_list.remove('poi')
features_list.remove('email_address')
features_list

['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person']

In [5]:
label = ['poi']

In [107]:
# create new features of relative values of each payment feature to total_payments
payment_features = ['salary', 'bonus', 'long_term_incentive', \
                    'deferral_payments', 'loan_advances', 'other', \
                    'expenses', 'director_fees', 'deferred_income']

rel_payment = []

for feature in payment_features:
    new_feature_name = 'rel_' + feature
    df_trans[new_feature_name] = (df_trans[feature]/df_trans['total_payments']).replace([np.inf, -np.inf, np.nan], 0)
    rel_payment.append(new_feature_name)

In [108]:
rel_payment

['rel_salary',
 'rel_bonus',
 'rel_long_term_incentive',
 'rel_deferral_payments',
 'rel_loan_advances',
 'rel_other',
 'rel_expenses',
 'rel_director_fees',
 'rel_deferred_income']

In [109]:
payment_features.append('total_payments')
payment_features

['salary',
 'bonus',
 'long_term_incentive',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'deferred_income',
 'total_payments']

In [110]:
# create new features of relative values of each stock feature to total_stock_value
stock_features = ['exercised_stock_options', 'restricted_stock', \
                  'restricted_stock_deferred']

rel_stock = []

for feature in stock_features:
    new_feature_name = 'rel_' + feature
    df_trans[new_feature_name] = (df_trans[feature]/df_trans['total_stock_value']).replace([np.inf, -np.inf, np.nan], 0)
    rel_stock.append(new_feature_name)

In [111]:
rel_stock

['rel_exercised_stock_options',
 'rel_restricted_stock',
 'rel_restricted_stock_deferred']

In [112]:
stock_features.append('total_stock_value')
stock_features

['exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value']

In [75]:
# create new features of fraction of emails exchanged with POI
df_trans['fraction_poi']=((df_trans['from_this_person_to_poi']+\
                          df_trans['from_poi_to_this_person'])/\
(df_trans['from_messages']+df_trans['to_messages'])).fillna(0)

df_trans['fraction_to_poi']=(df_trans['from_this_person_to_poi']/\
df_trans['from_messages']).fillna(0)

df_trans['fraction_from_poi']=(df_trans['from_poi_to_this_person']/\
df_trans['to_messages']).fillna(0)

In [121]:
financial_features = payment_features+stock_features
financial_features

['salary',
 'bonus',
 'long_term_incentive',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'deferred_income',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value']

In [122]:
rel_financial_features = rel_payment+rel_stock
rel_financial_features

['rel_salary',
 'rel_bonus',
 'rel_long_term_incentive',
 'rel_deferral_payments',
 'rel_loan_advances',
 'rel_other',
 'rel_expenses',
 'rel_director_fees',
 'rel_deferred_income',
 'rel_exercised_stock_options',
 'rel_restricted_stock',
 'rel_restricted_stock_deferred']

In [123]:
# numeric email feature list, which excludes email_address
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages',
                     'from_this_person_to_poi', 'shared_receipt_with_poi', 
                      'fraction_poi', 'fraction_to_poi', 'fraction_from_poi']

In [124]:
total_features = financial_features + email_features
rel_total_features = rel_financial_features + email_features

In [125]:
print len(total_features)
print len(rel_total_features)

22
20


In [76]:
df_trans.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,...,rel_other,rel_expenses,rel_director_fees,rel_deferred_income,rel_exercised_stock_options,rel_restricted_stock,rel_restricted_stock_deferred,fraction_poi,fraction_to_poi,fraction_from_poi
count,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,...,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0
mean,675997.4,222089.6,-193683.3,9980.319444,2075802.0,35375.340278,363.583333,38.756944,24.625,582812.5,...,0.108559,0.095527,5.914364,-6.082185,0.498924,0.403771,-0.049046,0.028493,0.109922,0.022672
std,1233155.0,754101.3,606011.1,31300.575144,4795513.0,45309.303038,1450.675239,74.276769,79.778266,6794472.0,...,0.221239,0.240176,58.879276,58.868342,0.396188,0.473146,0.255201,0.042827,0.185935,0.036417
min,0.0,-102500.0,-3504386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-701.013514,-0.074502,0.0,-2.493526,0.0,0.0,0.0
25%,0.0,0.0,-37086.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-0.077054,0.0,0.0,0.0,0.0,0.0,0.0
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,17.5,4.0,0.0,0.0,...,0.00072,0.015768,0.0,0.0,0.627935,0.284209,0.0,0.008772,0.0,0.004952
75%,800000.0,8535.5,0.0,0.0,1683580.0,53328.25,53.0,41.25,14.0,0.0,...,0.075646,0.055635,0.0,0.0,0.850136,0.650782,0.0,0.043337,0.198827,0.029918
max,8000000.0,6426990.0,0.0,137864.0,34348380.0,228763.0,14368.0,528.0,609.0,81525000.0,...,1.0,1.0,701.013514,0.0,1.0,3.493526,0.0,0.224352,1.0,0.217341


In [223]:
# check any numpy NaN
df_trans.isnull().sum().sum()

0L

In [78]:
# create subset of dataframe including only original features
original_df = df_trans[features_list]
original_df.columns

Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments',
       u'exercised_stock_options', u'bonus', u'restricted_stock',
       u'shared_receipt_with_poi', u'restricted_stock_deferred',
       u'total_stock_value', u'expenses', u'loan_advances', u'from_messages',
       u'other', u'from_this_person_to_poi', u'director_fees',
       u'deferred_income', u'long_term_incentive', u'from_poi_to_this_person'],
      dtype='object')

In [192]:
original_df

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,expenses,loan_advances,from_messages,other,from_this_person_to_poi,director_fees,deferred_income,long_term_incentive,from_poi_to_this_person
ALLEN PHILLIP K,201955,2902,2869717,4484442,1729541,4175000,126027,1407,-126027,1729541,13868,0,2195,152,65,0,-3081055,304805,47
BADUM JAMES P,0,0,178980,182466,257817,0,0,0,0,257817,3486,0,0,0,0,0,0,0,0
BANNANTINE JAMES M,477,566,0,916197,4046157,0,1757552,465,-560222,5243487,56301,0,29,864523,0,0,-5104,0,39
BAXTER JOHN C,267102,0,1295738,5634343,6680544,1200000,3942714,0,0,10623258,11200,0,0,2660303,0,0,-1386055,1586055,0
BAY FRANKLIN R,239671,0,260455,827696,0,400000,145796,0,-82782,63014,129142,0,0,69,0,0,-201641,0,0
BAZELIDES PHILIP J,80818,0,684694,860136,1599641,0,0,0,0,1599641,0,0,0,874,0,0,0,93750,0
BECK SALLY W,231330,7315,0,969068,0,700000,126027,2639,0,126027,37172,0,4343,566,386,0,0,0,144
BELDEN TIMOTHY N,213999,7991,2144013,5501630,953136,5249999,157569,5521,0,1110705,17355,0,484,210698,108,0,-2334434,0,228
BELFER ROBERT,0,0,-102500,102500,3285,0,0,0,44093,-44093,0,0,0,0,0,3285,0,0,0
BERBERIAN DAVID,216582,0,0,228474,1624396,0,869220,0,0,2493616,11892,0,0,0,0,0,0,0,0


In [79]:
len(original_df.columns)

19

In [80]:
# .values returns the underlying numpy array (as_matrix was removed from newer pandas)
label_nparray = df_trans['poi'].values
label_nparray

array([False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False,  True, False,  True, False, False, False,  True,
       False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False,  True, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True, False, False, False, False,
       False,  True, False, False,  True, False, False, False, False,
       False, False,

# Feature Scaling

## Q2-2: do I have to do any scaling? why or why not?
Yes. I will use **MinMaxScaler** to rescale the financial features (in $) and the email features (counts) to the range 0-1, so that they are equally weighted and no feature dominates simply because of its units.
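To make the rescaling concrete, here is a minimal pure-Python sketch of the min-max formula, `(x - min) / (max - min)`, applied column by column. The helper name `min_max_scale` and the two sample columns are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `MinMaxScaler`.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

# Hypothetical columns on very different scales (dollars vs. message counts).
salaries = [200000, 450000, 1000000]
emails_sent = [20, 500, 2000]

scaled_salaries = min_max_scale(salaries)
scaled_emails = min_max_scale(emails_sent)
print(scaled_salaries)
print(scaled_emails)
```

After scaling, both columns span 0 to 1, so a distance-based classifier such as SVC or KNeighbors no longer treats a dollar difference as vastly larger than a difference in email counts.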

In [238]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_trans), \
                         index=df_trans.index, columns=df_trans.columns)

In [239]:
df_scaled

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,...,rel_other,rel_expenses,rel_director_fees,rel_deferred_income,rel_exercised_stock_options,rel_restricted_stock,rel_restricted_stock_deferred,fraction_poi,fraction_to_poi,fraction_from_poi
ALLEN PHILLIP K,0.521875,0.455199,0.120800,0.000000,0.050353,0.060622,0.152770,0.089015,0.106732,0.0,...,0.000034,0.003092,0.000000,0.999020,1.000000,0.020858,0.970777,0.097943,0.029613,0.074518
BADUM JAMES P,0.000000,0.043109,1.000000,0.000000,0.007506,0.015238,0.000000,0.000000,0.000000,0.0,...,0.000000,0.019105,0.000000,1.000000,1.000000,0.000000,1.000000,0.000000,0.000000,0.000000
BANNANTINE JAMES M,0.000000,0.015698,0.998544,0.000000,0.117798,0.246111,0.002018,0.073864,0.000000,0.0,...,0.943599,0.061451,0.000000,0.999992,0.787486,0.095945,0.957152,0.292158,0.000000,0.317034
BAXTER JOHN C,0.150000,0.214142,0.604480,0.000000,0.194494,0.048959,0.000000,0.000000,0.000000,0.0,...,0.472159,0.001988,0.000000,0.999649,0.654594,0.106236,1.000000,0.000000,0.000000,0.000000
BAY FRANKLIN R,0.050000,0.055587,0.942460,0.000000,0.000000,0.564523,0.000000,0.000000,0.000000,0.0,...,0.000083,0.156026,0.000000,0.999652,0.069336,0.662285,0.473152,0.000000,0.000000,0.000000
BAZELIDES PHILIP J,0.000000,0.120560,1.000000,0.000000,0.046571,0.000000,0.000000,0.000000,0.000000,0.0,...,0.001016,0.000000,0.000000,1.000000,1.000000,0.000000,1.000000,0.000000,0.000000,0.000000
BECK SALLY W,0.087500,0.015698,1.000000,0.000000,0.000000,0.162491,0.302269,0.272727,0.633826,0.0,...,0.000584,0.038359,0.000000,1.000000,0.069336,0.286244,1.000000,0.202639,0.088879,0.090575
BELDEN TIMOTHY N,0.656250,0.344056,0.333854,0.000000,0.027749,0.075865,0.033686,0.431818,0.177340,0.0,...,0.038297,0.003155,0.000000,0.999395,0.867972,0.040608,1.000000,0.176714,0.223140,0.131278
BELFER ROBERT,0.000000,0.000000,1.000000,0.023828,0.000096,0.000000,0.000000,0.000000,0.000000,0.0,...,0.000000,0.000000,0.000046,1.000000,0.000000,0.000000,0.598961,0.000000,0.000000,0.000000
BERBERIAN DAVID,0.000000,0.015698,1.000000,0.000000,0.047292,0.051984,0.000000,0.000000,0.000000,0.0,...,0.000000,0.052050,0.000000,1.000000,0.675591,0.099778,1.000000,0.000000,0.000000,0.000000


In [231]:
df_scaled.shape # returns (number of rows, number of columns)

(144, 35)

# Feature Selection

## Q2-3: what selection process to use?

The goal of feature selection is to select the best 7 or fewer features. The threshold of 7 comes from the curse of dimensionality: as a rule of thumb, you may need exponentially more data points as you add more features, that is, 2^(n_features) = # of data points. I have 144 data points and 2^7 = 128, so 7 is the maximum number of features. Thus, I use the **SelectKBest** process to pick 7 features.
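The arithmetic behind that threshold can be checked in a few lines. This is only a sketch of the 2^n rule of thumb stated above, not a general rule for choosing the number of features; the variable names are illustrative.

```python
n_samples = 144  # number of data points in the scaled dataset

# Largest n such that 2**n <= n_samples under the 2^n rule of thumb.
max_features = 0
while 2 ** (max_features + 1) <= n_samples:
    max_features += 1

print(max_features)
```

With 144 data points the loop stops at 7, because 2^7 = 128 fits and 2^8 = 256 does not.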

## Q2-4: what feature scores to compare and reasons for the choice of parameter values

I choose the **f_classif** scoring function over variance, chi2, and mutual_info_classif. 

- Variance can be useful for unsupervised feature selection. Since I already have labels, using them for scoring should be better than relying solely on the x-variables. 

- The chi-square distribution arises in tests of hypotheses concerning the independence of two random variables and concerning whether a discrete random variable follows a specified distribution. The F-distribution arises in tests of hypotheses concerning whether or not two population variances are equal and concerning whether or not three or more population means are equal. In other words, chi-square is most appropriate for categorical data, whereas f-value can be used for continuous data.

- The mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold
https://discussions.udacity.com/t/f-classif-versus-chi2/245226
https://stats.libretexts.org/Textbook_Maps/General_Statistics/Map%3A_Introductory_Statistics_(Shafer_and_Zhang)/11%3A_Chi-Square_Tests_and_F-Tests
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif
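To show what an ANOVA F-value measures, the sketch below computes a one-way F-statistic for a single feature split by class label: the variance between the class means divided by the variance within the classes. The function name `anova_f` and the toy values are hypothetical; `f_classif` reports this kind of statistic for each feature, but this is not its implementation.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature grouped by class label."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(float(v))

    n_total = len(values)
    grand_mean = sum(float(v) for v in values) / n_total

    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the points around their own class mean
    for members in groups.values():
        mean = sum(members) / len(members)
        ss_between += len(members) * (mean - grand_mean) ** 2
        ss_within += sum((m - mean) ** 2 for m in members)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy feature values (hypothetical); True marks a POI.
feature = [1, 2, 3, 4, 5, 6]
is_poi = [False, False, False, True, True, True]
f_value = anova_f(feature, is_poi)
print(f_value)
```

A large F-value means the class means are far apart relative to the scatter inside each class, which is why features with high F-values are kept.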

In [84]:
# select 7 features that have highest ANOVA F-value with the factor by poi label
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k=7)
original_7selected = selector.fit_transform(original_scaled, label_nparray)
scores = zip(features_list, selector.scores_, selector.pvalues_)
sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
print "features with F-value & p-value:"

# print each feature with its rank, F-value and p-value
for rank, item in enumerate(sorted_scores, 1):
    print rank, item

features with F-value & p-value:
1 ('exercised_stock_options', 25.097541528735491, 1.5945438463623382e-06)
2 ('total_stock_value', 24.467654047526391, 2.1058066490127594e-06)
3 ('bonus', 21.060001707536578, 9.7024743412322453e-06)
4 ('salary', 18.575703268041778, 3.0337961075305315e-05)
5 ('deferred_income', 11.595547659732164, 0.00085980314391924004)
6 ('long_term_incentive', 10.072454529369448, 0.0018454351466116368)
7 ('restricted_stock', 9.3467007910514379, 0.0026699611393240469)
8 ('total_payments', 8.8667215371077805, 0.0034159213705928374)
9 ('shared_receipt_with_poi', 8.7464855321290802, 0.0036344020243633686)
10 ('loan_advances', 7.2427303965360172, 0.0079738162605691599)
11 ('expenses', 6.234201140506757, 0.013673150875383932)
12 ('from_poi_to_this_person', 5.3449415231473347, 0.022220727960811395)
13 ('other', 4.2049708583014187, 0.042144700903259204)
14 ('from_this_person_to_poi', 2.4265081272428799, 0.12152433983710857)
15 ('director_fees', 2.1076559432760891, 0.1487694952

In [85]:
original_7selected.shape

(144L, 7L)

In [86]:
optimized_features_list = list(map(lambda x: x[0], sorted_scores))[0:7]
print(optimized_features_list)

['exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'deferred_income', 'long_term_incentive', 'restricted_stock']


# Part3. Algorithm Selection

# Validation Strategy

## Q3-1: what is validation?
Validation is the process of assessing the performance of a machine-learning algorithm, typically on data that was not used to train it. 

## Q3-2: what is a classic mistake you can make if you do it wrong? 
A classic mistake for my analysis is over-fitting. Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: it leads to an almost perfect score, but the model would fail to predict on unseen data. 
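The mistake described above can be reproduced with a tiny memorizing classifier. The sketch below uses a hand-written 1-nearest-neighbour rule on hypothetical data whose labels are pure noise, so there is nothing real to learn; the helper names are illustrative and not part of the project code.

```python
import random

def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy_of(train_x, train_y, eval_x, eval_y):
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(eval_x, eval_y))
    return hits / float(len(eval_x))

rng = random.Random(0)
# Hypothetical data: one distinct feature value per person, random labels.
xs = list(range(100))
rng.shuffle(xs)
ys = [rng.random() < 0.5 for _ in xs]

train_x, train_y = xs[:70], ys[:70]
test_x, test_y = xs[70:], ys[70:]

train_score = accuracy_of(train_x, train_y, train_x, train_y)
test_score = accuracy_of(train_x, train_y, test_x, test_y)
print(train_score, test_score)
```

Scored on its own training data the classifier is perfect, because every point is its own nearest neighbour. The held-out score is the honest one, and with noise labels it should sit near chance.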

## Q3-3: how did you validate your analysis?  
I think a proper validation method for a dataset with imbalanced classes is to use cross-validation iterators that stratify on the class labels, such as **StratifiedKFold** and **StratifiedShuffleSplit**. This ensures that the relative class frequencies are approximately preserved in each train and test set.
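The idea of stratification can be sketched without scikit-learn: group the sample indices by class, shuffle each group, and deal them out to the folds in turn. The function `stratified_folds` is a hypothetical helper, and the 18 POIs among 144 people are an assumption derived from the 87.5% non-POI share quoted in this write-up.

```python
import random

def stratified_folds(labels, n_splits, seed=0):
    """Assign sample indices to folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Assumed class balance: 18 POIs and 126 non-POIs.
labels = [True] * 18 + [False] * 126
folds = stratified_folds(labels, n_splits=3)
for fold in folds:
    n_poi = sum(labels[i] for i in fold)
    print(len(fold), n_poi)
```

Each of the three folds ends up with 48 people, 6 of them POIs, so every test set has the same 12.5% POI share as the full dataset. A plain random split could easily leave a fold with almost no POIs.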


In [87]:
# generate 3 stratified train-test pairs (only the last pair is kept after the loop)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=44)

for train_index, test_index in skf.split(original_7selected, label_nparray):
   #print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = original_7selected[train_index], original_7selected[test_index]
   y_train, y_test = label_nparray[train_index], label_nparray[test_index]

print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

(96L, 7L) (96L,)
(48L, 7L) (48L,)


In [88]:
from sklearn.model_selection import StratifiedShuffleSplit

#sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.33, random_state=44)
sss = StratifiedShuffleSplit(n_splits=1000, random_state=44)

for train_index, test_index in sss.split(original_7selected, label_nparray):
   #print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = original_7selected[train_index], original_7selected[test_index]
   y_train, y_test = label_nparray[train_index], label_nparray[test_index]

print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

(129L, 7L) (129L,)
(15L, 7L) (15L,)


# Classifier Selection

## Q3-4: what algorithms to begin? 
- SVC
- KNeighbors 
- Gaussian Naive Bayes
- Decision Trees
- Adaboost (boosted decision tree)
- Random Forest

## 1. SVC Classifier

In [106]:
from sklearn import svm
from sklearn.model_selection import cross_val_score

svml = svm.LinearSVC()

scores = cross_val_score(svml, original_7selected, label_nparray, cv=sss)
scores.mean()

0.86926666666666685

In [90]:
svmr = svm.SVC(kernel='rbf', probability=True)

scores = cross_val_score(svmr, original_7selected, label_nparray, cv=sss)
scores.mean()

0.86666666666666692

## 2. KNeighbors Classifier

In [91]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()

scores = cross_val_score(neigh, original_7selected, label_nparray, cv=sss)
scores.mean()

0.85613333333333363

## 3.  GaussianNB Classifier

In [92]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

scores = cross_val_score(gnb, original_7selected, label_nparray, cv=sss)
scores.mean()

0.86100000000000021

## 4. DecisionTree Classifier

In [93]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()

scores = cross_val_score(dtc, original_7selected, label_nparray, cv=sss)
scores.mean()

0.80313333333333337

## 5. AdaBoost Classifier

In [94]:
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier()

scores = cross_val_score(adb, original_7selected, label_nparray, cv=sss)
scores.mean()

0.82126666666666681

## 6. RandomForest Classifier

In [95]:
from sklearn.ensemble import RandomForestClassifier
rdf = RandomForestClassifier()

scores = cross_val_score(rdf, original_7selected, label_nparray, cv=sss)
scores.mean()

0.85580000000000023

## Q3-5: how did model performance differ between algorithms?
Based on accuracy, they are not very different in performance: the mean scores range from about 0.80 (DecisionTree) to about 0.87 (Linear SVC).

# Evaluation Metrics Usage

## Q3-6: give at least 2 evaluation metrics and your average performance for each of them.

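- accuracy: correctly labeled points (predicted label == true label)/total testing data points
- precision: true positives/(true positives + false positives), i.e. the share of people predicted as POI who really are POIs
- recall: true positives/(true positives + false negatives), i.e. the share of real POIs who are predicted as POI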
- average_precision: the area under the precision-recall curve
- f1: 2 * (precision * recall) / (precision + recall)
- f1_weighted: Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
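The first metrics in the list above can all be derived from the four confusion-matrix counts. The sketch below is a minimal illustration with hypothetical counts; the function name is an assumption, and reporting an undefined ratio as 0 is a convention chosen here for simplicity.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    # An undefined ratio (zero denominator) is reported as 0 in this sketch.
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 10 real POIs, 4 people flagged, 2 of them correctly.
metrics = classification_metrics(tp=2, fp=2, fn=8, tn=88)
for name in ("accuracy", "precision", "recall", "f1"):
    print(name, round(metrics[name], 4))
```

In this made-up case accuracy is 0.9 even though only 2 of the 10 POIs are found, which is the gap between accuracy and recall that matters for an imbalanced dataset.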

In [96]:
scorer = ["accuracy", "precision", "recall", "average_precision", "f1", "f1_weighted"]
for score in scorer:
    m_score = cross_val_score(svmr, original_7selected, label_nparray, cv=sss, \
                        scoring=score).mean()
    print score, ':', m_score

#https://stackoverflow.com/questions/35876508/evaluate-multiple-scores-on-sklearn-cross-val-score

accuracy : 0.866666666667
precision : 0.0
recall : 0.0
average_precision : 0.38899757881
f1 : 0.0
f1_weighted : 0.804761904762


In [97]:
for score in scorer:
    m_score = cross_val_score(neigh, original_7selected, label_nparray, cv=sss, \
                        scoring=score).mean()
    print score, ':', m_score


accuracy : 0.856133333333
precision : 0.0195
recall : 0.0105
average_precision : 0.33322078824
f1 : 0.0135
f1_weighted : 0.800994179894


## Q3-7: Explain an interpretation of the metrics that says something human-understandable about the algorithm’s performance.

Overall performance of Linear SVC in identifying POI labels was poor. The accuracy and f1_weighted scores are computed over both POI and non-POI labels and look relatively high because they are dominated by the non-POI class, which makes up 87.5% of the data. If everyone in the testing set (which was split by stratifying) were predicted to be non-POI, the accuracy would be as high as 87.5% regardless of any individual's feature values. Thus, accuracies around 87% are not a meaningful evaluation here and say little about how well the classifier finds POIs. 
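The baseline in the paragraph above is easy to verify. This sketch scores a "classifier" that labels everyone non-POI, assuming 18 POIs among 144 people (the count implied by the 87.5% non-POI share).

```python
n_people = 144
n_poi = 18  # assumption: 12.5% of 144
n_non_poi = n_people - n_poi

# Predict non-POI for everyone: no positives are ever produced.
true_positives = 0
true_negatives = n_non_poi
false_negatives = n_poi

accuracy = (true_positives + true_negatives) / float(n_people)
recall = true_positives / float(true_positives + false_negatives)
print(accuracy, recall)
```

The accuracy comes out at 0.875 while the recall is 0, so a model can match the accuracies reported here without identifying a single POI.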

The precision score showed that 1 out of 3 people predicted as POI was truly a POI, while the majority were false positives. This result would be costly in practice because we would need to investigate many non-POIs to catch a small number of POIs. It also increases the chance that innocent people face legal punishment. 

The recall score showed that only 1 out of 10 true POIs was identified as a POI, while 90% of true POIs were missed. If we relied on this classifier, we would catch only 10% of the bad guys and let the other 90% go. 

f1 is the harmonic mean of the precision and recall scores, so it sits between them and shows that the overall performance of this classifier is very poor. 

# Algorithm Tuning

## Q3-8: what does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?

Machine learning algorithms are parameterized so that their behavior can be tuned for a given problem. It is important to perform parameter tuning here to adjust the precision and recall. 

Parameter tuning refers to adjusting an algorithm's settings during training in order to improve the fit on the test set. Parameters influence the outcome of the learning process: the more heavily they are tuned, the more biased the algorithm becomes toward the training data and test harness. The strategy can be effective, but it can also lead to fragile models that overfit the test harness and do not perform well in practice.

## Q3-9: How did you tune the parameters of your particular algorithm? 

I use automated parameter search processes, such as **GridSearchCV** and **RandomizedSearchCV**.
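To make clear what an exhaustive grid search has to evaluate, the sketch below enumerates every combination of the SVC candidate values used in this section. It only builds the list of candidates with the standard library and does not fit any model; the variable names are illustrative.

```python
import itertools

# Same candidate values as the SVC grid searched in this section.
param_grid = {
    "kernel": ["rbf", "linear", "poly"],
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
}

names = sorted(param_grid)
candidates = [
    dict(zip(names, combo))
    for combo in itertools.product(*(param_grid[name] for name in names))
]
print(len(candidates))
print(candidates[0])
```

The grid holds 3 x 5 x 5 = 75 candidates, and each one is cross-validated. A randomized search instead samples a fixed number of candidates (20 in this notebook), which keeps the cost bounded when the grid grows.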

In [99]:
from sklearn.model_selection import GridSearchCV

clf = svm.SVC()

parameters = {'kernel': ['rbf', 'linear', 'poly'], 'C': [0.1, 1, 10, 100, 1000],\
         'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(clf, parameters)
gird_result = grid_search.fit(original_7selected, label_nparray).best_estimator_

In [100]:
gird_result

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [101]:
for score in scorer:
    m_score = cross_val_score(gird_result, original_7selected, label_nparray, \
                              cv=sss, scoring=score).mean()
    print score, ':', m_score

accuracy : 0.866666666667
precision : 0.0
recall : 0.0
average_precision : 0.311755828893
f1 : 0.0
f1_weighted : 0.804761904762


In [102]:
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats
from time import time

parameters = {'C': scipy.stats.expon(scale=100), \
              'gamma': scipy.stats.expon(scale=.1), \
              'kernel': ['rbf', 'linear', 'poly'], \
              'class_weight':['balanced', None]}

random_search = RandomizedSearchCV(clf, parameters, n_iter=20)
start = time()
random_result = random_search.fit(original_7selected, label_nparray).best_estimator_

#print("RandomizedSearchCV took %.2f seconds for %d candidates"
#      " parameter settings." % ((time() - start), 20))

In [103]:
random_result

SVC(C=7.3021825709018966, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.018054570576361249,
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [104]:
for score in scorer:
    m_score = cross_val_score(random_result, original_7selected, label_nparray, \
                              cv=sss, scoring=score).mean()
    print score, ':', m_score

accuracy : 0.866666666667
precision : 0.0
recall : 0.0
average_precision : 0.438628730297
f1 : 0.0
f1_weighted : 0.804761904762


The parameter search did not improve performance much: precision, recall, and f1 all stayed at 0.

# Part4. Build pipeline

## Pipeline Approach1
Select k number of features using univariate selection method (SelectKBest) with f-value, and then fit to classifier.

In [226]:
# Create a procedure that takes a feature list and the result of a pipeline grid search
# and reports cross-validation evaluation metrics using the tester.py module
import tester

def performance(old_list, grid_result):
    # note: relies on the globals `start`, `time`, `df_scaled` and `label_nparray`
    print "Best estimator:"
    print grid_result
    print "\nThis took %.2f seconds\n" %(time() - start)
    
    if 'select' in grid_result.named_steps:
        # recover the names of the k features chosen by SelectKBest
        selector = grid_result.named_steps['select']
        k_features = selector.get_params(deep=True)['k']
        selected = selector.fit_transform(df_scaled[old_list], label_nparray)
        scores = zip(old_list, selector.scores_, selector.pvalues_)
        sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
        new_list = list(map(lambda x: x[0], sorted_scores))[0:k_features]
    else:
        new_list = old_list
    
    # tester.py expects the label 'poi' as the first feature
    new_list = ['poi']+ new_list
    new_dataset = df_scaled[new_list].to_dict(orient = 'index')  
    new_clf = grid_result.named_steps['clf']
    tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
    tester.main()    

### Approach1 with total_features and SVC

In [240]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

approach1 = Pipeline([('select', SelectKBest()), \
                      ('clf', SVC(kernel='rbf'))])

parameters = {'select__k':[20, 15, 10, 7], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach1, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[total_features], label_nparray).best_estimator_
performance(total_features, gird_result)

Best estimator:
Pipeline(steps=[('select', SelectKBest(k=20, score_func=<function f_classif at 0x000000000AADE2E8>)), ('clf', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 2.26 seconds

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.85547	Precision: 0.39258	Recall: 0.15350	F1: 0.22070	F2: 0.17479
	Total predictions: 15000	True positives:  307	False positives:  475	False negatives: 1693	True negatives: 12525



### Approach1 with rel_total_features and SVC

In [241]:
parameters = {'select__k':[20, 15, 10, 7], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach1, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[rel_total_features], label_nparray).best_estimator_
performance(rel_total_features, gird_result)

Best estimator:
Pipeline(steps=[('select', SelectKBest(k=7, score_func=<function f_classif at 0x000000000AADE2E8>)), ('clf', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 2.19 seconds

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.80000	Precision: 0.43835	Recall: 0.35550	F1: 0.39260	F2: 0.36947
	Total predictions: 11000	True positives:  711	False positives:  911	False negatives: 1289	True negatives: 8089



### Approach1 with financial_features and SVC

In [242]:
parameters = {'select__k':[14, 10, 7], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach1, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[financial_features], label_nparray).best_estimator_
performance(financial_features, gird_result)

Best estimator:
Pipeline(steps=[('select', SelectKBest(k=14, score_func=<function f_classif at 0x000000000AADE2E8>)), ('clf', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 1.68 seconds

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.86367	Precision: 0.47066	Recall: 0.18050	F1: 0.26093	F2: 0.20589
	Total predictions: 15000	True positives:  361	False positives:  406	False negatives: 1639	True negatives: 12594



### Approach1 with rel_financial_features and SVC

In [243]:
parameters = {'select__k':[12, 10, 7], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach1, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[rel_financial_features], label_nparray).best_estimator_
performance(rel_financial_features, gird_result)

Best estimator:
Pipeline(steps=[('select', SelectKBest(k=10, score_func=<function f_classif at 0x000000000AADE2E8>)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 1.68 seconds

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.87493	Precision: 1.00000	Recall: 0.06200	F1: 0.11676	F2: 0.07632
	Total predictions: 15000	True positives:  124	False positives:    0	False negatives: 1876	True negatives: 13000



### Approach1 with email_features and SVC

In [244]:
parameters = {'select__k':[8, 7, 5], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach1, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[email_features], label_nparray).best_estimator_
performance(email_features, gird_result)

Best estimator:
Pipeline(steps=[('select', SelectKBest(k=7, score_func=<function f_classif at 0x000000000AADE2E8>)), ('clf', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 1.68 seconds

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.77867	Precision: 0.22475	Recall: 0.40500	F1: 0.28908	F2: 0.34902
	Total predictions: 9000	True positives:  405	False positives: 1397	False negatives:  595	True negatives: 6603



### Summary of SVM Classifier with Pipeline Approach1

| Features | Accuracy | Precision | Recall | F1 | F2 |
|------------------------|----------|-----------|---------|---------|---------|
| total_features | 0.85547 | 0.39258 | 0.15350 | 0.22070 | 0.17479 |
| rel_total_features | 0.80000 | 0.43835 | 0.35550 | 0.39260 | 0.36947 |
| financial_features | 0.86367 | 0.47066 | 0.18050 | 0.26093 | 0.20589 |
| rel_financial_features | 0.87493 | 1.00000 | 0.06200 | 0.11676 | 0.07632 |
| email_features | 0.77867 | 0.22475 | 0.40500 | 0.28908 | 0.34902 |
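The F2 column in the table is not defined elsewhere in this write-up. As a sketch, the F-beta score can be computed directly from the confusion counts, where beta > 1 weights recall more heavily than precision. The helper name `f_beta` is hypothetical; the counts are the ones tester.py printed for total_features with SVC.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from confusion counts; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * tp / float((1 + b2) * tp + b2 * fn + fp)

# Counts reported by tester.py for total_features with SVC (Approach1).
tp, fp, fn = 307, 475, 1693

f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)
print(round(f1, 5), round(f2, 5))
```

These values agree with the F1 and F2 entries in the total_features row. F2 is lower than F1 here because recall (0.15350) is well below precision (0.39258), and F2 penalizes missed POIs more.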

## Pipeline Approach2
Transform features using PCA, and then fit to classifier.

In [245]:
def performance_w_pca(grid_result):
    # note: relies on the globals `start`, `time`, `df_scaled` and `total_features`
    print "Best estimator:"
    print grid_result
    print "\nThis took %.2f seconds\n" %(time() - start)
    
    # note: the reducer is always refit on total_features here,
    # whichever feature set the grid search was fit on
    reducer = grid_result.named_steps['reducer']
    reduced = pd.DataFrame(reducer.fit_transform(df_scaled[total_features]), index=df_scaled.index)
    new_list = list(reduced.columns)
    # tester.py expects the label 'poi' as the first feature
    new_list = ['poi']+ new_list
    reduced.insert(0, 'poi', df_scaled.poi)
    new_dataset = reduced.to_dict(orient = 'index') 
    new_clf = grid_result.named_steps['clf']
    tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
    tester.main()
    


### Approach2 with total_features and SVC

In [246]:
from sklearn.decomposition import PCA

approach2 = Pipeline([('reducer', PCA()), \
                      ('clf', SVC(kernel='rbf'))])

parameters = {'reducer__n_components':[1, 2, 3, 5, 7, 10], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach2, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[total_features], label_nparray).best_estimator_
performance_w_pca(gird_result)

Best estimator:
Pipeline(steps=[('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 3.62 seconds

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.86447	Precision: 0.47773	Recall: 0.17700	F1: 0.25830	F2: 0.20249
	Total predictions: 15000	True positives:  354	False positives:  387	False negatives: 1646	True negatives: 12613



### Approach2 with rel_total_features and SVC

In [247]:
grid_search = GridSearchCV(approach2, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[rel_total_features], label_nparray).best_estimator_
performance_w_pca(gird_result)

Best estimator:
Pipeline(steps=[('reducer', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 3.77 seconds

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.84640	Precision: 0.30808	Recall: 0.12200	F1: 0.17479	F2: 0.13876
	Total predictions: 15000	True positives:  244	False positives:  548	False negatives: 1756	True negatives: 12452



### Approach2 with financial_features and SVC

In [248]:
grid_search = GridSearchCV(approach2, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[financial_features], label_nparray).best_estimator_
performance_w_pca(gird_result)

Best estimator:
Pipeline(steps=[('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 3.62 seconds

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.87820	Precision: 0.63796	Recall: 0.20000	F1: 0.30453	F2: 0.23183
	Total predictions: 15000	True positives:  400	False positives:  227	False negatives: 1600	True negatives: 12773



### Approach2 with rel_financial_features and SVC

In [249]:
grid_search = GridSearchCV(approach2, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[rel_financial_features], label_nparray).best_estimator_
performance_w_pca(gird_result)

Best estimator:
Pipeline(steps=[('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 3.83 seconds

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.86240	Precision: 0.39809	Recall: 0.06250	F1: 0.10804	F2: 0.07517
	Total predictions: 15000	True positives:  125	False positives:  189	False negatives: 1875	True negatives: 12811



### Approach2 with email_features and SVC

In [250]:
parameters = {'reducer__n_components':[1, 2, 3, 5, 7], \
              'clf__C': [0.1, 1, 10, 100, 1000], \
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid_search = GridSearchCV(approach2, parameters, scoring='f1')
start = time()
gird_result = grid_search.fit(df_scaled[email_features], label_nparray).best_estimator_
performance_w_pca(gird_result)

Best estimator:
Pipeline(steps=[('reducer', PCA(copy=True, iterated_power='auto', n_components=7, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This took 3.04 seconds

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.79233	Precision: 0.24106	Recall: 0.25950	F1: 0.24994	F2: 0.25559
	Total predictions: 15000	True positives:  519	False positives: 1634	False negatives: 1481	True negatives: 11366



### Summary of SVM Classifier with Pipeline Approach1

| Features | Accuracy | Precision | Recall | F1 | F2 |
|------------------------|----------|-----------|---------|---------|---------|
| total_features | 0.85547 | 0.39258 | 0.15350 | 0.22070 | 0.17479 |
| rel_total_features | 0.80000 | 0.43835 | 0.35550 | 0.39260 | 0.36947 |
| financial_features | 0.86367 | 0.47066 | 0.18050 | 0.26093 | 0.20589 |
| rel_financial_features | 0.87493 | 1.00000 | 0.06200 | 0.11676 | 0.07632 |
| email_features | 0.77867 | 0.22475 | 0.40500 | 0.28908 | 0.34902 |

### Summary of SVM Classifier with Pipeline Approach2

| Features | Accuracy | Precision | Recall | F1 | F2 |
|------------------------|----------|-----------|---------|---------|---------|
| total_features | 0.86447 | 0.47773 | 0.17700 | 0.25830 | 0.20249 |
| rel_total_features | 0.84640 | 0.30808 | 0.12200 | 0.17479 | 0.13876 |
| financial_features | 0.87820 | 0.63796 | 0.20000 | 0.30453 | 0.23183 |
| rel_financial_features | 0.86240 | 0.39809 | 0.06250 | 0.10804 | 0.07517 |
| email_features | 0.79233 | 0.24106 | 0.25950 | 0.24994 | 0.25559 |

## Pipeline Approach3
 and then fit to classifier.