# Identify Fraud from Enron Email Project
## June 2017, by Jude Moon
<br />

# Project Overview
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. 

In this project, I play detective and put my new skills to use by building a person of interest (POI) identifier based on financial and email data made public as a result of the Enron scandal. I used [the provided dataset](https://github.com/udacity/ud120-projects/tree/master/final_project) from the [Udacity Intro to Machine Learning course](https://www.udacity.com/course/intro-to-machine-learning--ud120), which was combined with a hand-generated list of POIs in the fraud case. POIs are individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

This document keeps notes as I work through the project and composes answers to [a series of questions](https://docs.google.com/document/d/1NDgi1PrNJP7WTbfSUuRUnz8yzs5nGVTSzpO7oeNTEWA/pub?embedded=true) provided by Udacity, to show my thought process and approach to solving this problem.
***

# Part1. Data Exploration
## Q1-1: Summarize the goal of this project
The goal of the Enron project is to build an algorithm that identifies Enron employees who may have committed fraud (labeled as persons of interest, or POIs), using features from their financial and email datasets.

## Q1-2: Give some background on the dataset 

In [2]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
import scipy.stats
from time import time
sys.path.append("../tools/")
#from feature_format import featureFormat, targetFeatureSplit
import tester

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [3]:
# load the dataset (pickled dict of dicts); binary mode is the safe choice for pickle files
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

### Enron dataset (emails + finances) has the form:
    
    data_dict["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }
    
The data dictionary is stored as a **pickle** file, which is a handy way to store and load python objects directly.
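To make the store/load round trip concrete, here is a minimal sketch. The two-person `mini_dict` is a made-up stand-in for the real dataset, and an in-memory buffer replaces the `.pkl` file so the example runs on its own.

```python
import io
import pickle

# Hypothetical miniature version of the dataset: a dict of dicts keyed by name.
mini_dict = {
    "DOE JANE A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "ROE RICHARD B": {"poi": True, "salary": "NaN", "bonus": 500000},
}

# Serialize to bytes; the in-memory buffer stands in for a .pkl file on disk.
buffer = io.BytesIO()
pickle.dump(mini_dict, buffer)

# Rewind and load the object back; pickle streams are binary data.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == mini_dict)
```

The restored object is an ordinary Python dictionary again, which is why the notebook can index it directly by person name.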

### How many data points (people) are in the dataset?

In [4]:
len(data_dict)

146

### How many POI?
In other words, count the number of entries in the dictionary where
data[person_name]["poi"]==1 
- 1 means POI 
- 0 means non-POI

In [6]:
count_poi = 0
for person in data_dict:
    if data_dict[person]["poi"] == 1:
        count_poi += 1
n_total = len(data_dict)  # use the actual dataset size instead of hard-coding 146
print "Number of POIs : %i" %count_poi
print "Number of non-POIs : %i" %(n_total-count_poi)
print "Percentage of POIs from the total : %i" %(count_poi*100/n_total) 

Number of POIs : 18
Number of non-POIs : 128
Percentage of POIs from the total : 12


### Do we have sufficient data points?

In [7]:
# Udacity course provided a compiled list of all POI names from Enron corpus
# poi_names.txt is newline delimited
# read poi_names.txt file: each newline to string in a list
poi_names_txt = open("poi_names.txt", "r").read().splitlines()

print "1st line: " + poi_names_txt[0]
print "2nd line: " + poi_names_txt[1]
print "3rd line: " + poi_names_txt[2]
print "37th line: " + poi_names_txt[36]
print "Number of POIs from Enron corpus: %i"%(len(poi_names_txt)-2)

1st line: http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm
2nd line: 
3rd line: (y) Lay, Kenneth
37th line: (n) Loehr, Christopher
Number of POIs from Enron corpus: 35


The POI name list extracted from the Enron corpus (emails of 158 employees in total) contains 35 POIs, whereas the combined financial and email dataset contains only 18. 

About half of the POIs are missing from the email + finance data dictionary. This may make it harder to understand the full scope of patterns between the features and the POI label. 

However, adding the missing POIs as data points with "NaN" for all of their financial features would introduce a NaN-driven bias.
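The comparison between the name list and the data dictionary can also be done programmatically. The sketch below is only an illustration: the sample lines imitate the `(y) Last, First` layout printed above, the dictionary keys are invented, and matching by name prefix is my own assumption about how the two name formats line up. The `(y)`/`(n)` flag is kept but not interpreted.

```python
import re

# Hypothetical sample in the same layout as poi_names.txt
# (a source URL, a blank line, then one flagged name per line).
sample_lines = [
    "http://example.com/source",
    "",
    "(y) Lay, Kenneth",
    "(n) Loehr, Christopher",
    "(y) Skilling, Jeffrey",
]

LINE_RE = re.compile(r"^\((y|n)\)\s+([^,]+),\s*(.+)$")

def parse_poi_names(lines):
    """Return (flag, LASTNAME, FIRSTNAME) tuples for every name line."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            flag, last, first = match.groups()
            records.append((flag, last.strip().upper(), first.strip().upper()))
    return records

def find_missing(records, dataset_keys):
    """Names with no key starting with 'LASTNAME FIRSTNAME' in the dataset."""
    missing = []
    for _flag, last, first in records:
        prefix = last + " " + first
        if not any(key.startswith(prefix) for key in dataset_keys):
            missing.append(prefix)
    return missing

records = parse_poi_names(sample_lines)
dataset_keys = ["LAY KENNETH L", "SKILLING JEFFREY K"]  # invented keys
missing = find_missing(records, dataset_keys)
print(len(records), missing)
```

Header and blank lines are skipped by the regular expression, so no hard-coded offset is needed when counting names.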

### For each person, how many features are available?

In [8]:
len(data_dict[data_dict.keys()[0]])

21

### What are the features?

In [36]:
# the key of features for the first key
features_list = data_dict[data_dict.keys()[0]].keys() 
pprint.pprint(features_list)

['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']


### How many NaN (Not a Number) exist per feature?

In [10]:
# create a dictionary of feature and count of NaN pairs
count_NaN = {}
for feature in features_list:
    count_NaN[feature] = 0

for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            count_NaN[feature] +=1

# sort the dictionary by ascending ordering of values 
count_NaN = sorted(count_NaN.items(), key=operator.itemgetter(1))
pprint.pprint(count_NaN)

[('poi', 0),
 ('total_stock_value', 20),
 ('total_payments', 21),
 ('email_address', 35),
 ('restricted_stock', 36),
 ('exercised_stock_options', 44),
 ('salary', 51),
 ('expenses', 51),
 ('other', 53),
 ('to_messages', 60),
 ('shared_receipt_with_poi', 60),
 ('from_messages', 60),
 ('from_poi_to_this_person', 60),
 ('from_this_person_to_poi', 60),
 ('bonus', 64),
 ('long_term_incentive', 80),
 ('deferred_income', 97),
 ('deferral_payments', 107),
 ('restricted_stock_deferred', 128),
 ('director_fees', 129),
 ('loan_advances', 142)]


### Would NaN introduce bias to the features?

In [11]:
# create a dictionary showing the number of NaN and 
# number of POI with NaN each feature
NaN_dict = {}
keys = ['NaN_total', 'NaN_poi']

for key in keys:
    NaN_dict[key] = {}
    for feature in features_list:
        NaN_dict[key][feature] = 0
        
for person in data_dict:
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            NaN_dict['NaN_total'][feature] +=1
        
        if data_dict[person][feature] == "NaN" and data_dict[person]['poi'] == True:
            NaN_dict['NaN_poi'][feature] +=1

# convert from a dictionary to a pandas dataframe
NaN_df = pd.DataFrame(NaN_dict)
NaN_df['NaN_non-poi'] = NaN_df['NaN_total']-NaN_df['NaN_poi']
NaN_df['%NaN_in_poi'] = (NaN_df['NaN_poi']/18)*100 # from total 18 POI
NaN_df['%NaN_in_non-poi'] = (NaN_df['NaN_non-poi']/128)*100 # from total 128 non-POI
NaN_df['diff_%'] = NaN_df['%NaN_in_poi'] - NaN_df['%NaN_in_non-poi']
NaN_df = NaN_df.sort(['diff_%'])
NaN_df



| feature | NaN_poi | NaN_total | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | diff_% |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| other | 0 | 53 | 53 | 0.0 | 41.40625 | -41.40625 |
| expenses | 0 | 51 | 51 | 0.0 | 39.84375 | -39.84375 |
| bonus | 2 | 64 | 62 | 11.111111 | 48.4375 | -37.326389 |
| salary | 1 | 51 | 50 | 5.555556 | 39.0625 | -33.506944 |
| deferred_income | 7 | 97 | 90 | 38.888889 | 70.3125 | -31.423611 |
| email_address | 0 | 35 | 35 | 0.0 | 27.34375 | -27.34375 |
| long_term_incentive | 6 | 80 | 74 | 33.333333 | 57.8125 | -24.479167 |
| restricted_stock | 1 | 36 | 35 | 5.555556 | 27.34375 | -21.788194 |
| to_messages | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |
| shared_receipt_with_poi | 4 | 60 | 56 | 22.222222 | 43.75 | -21.527778 |


I initially thought that features with a large number of "NaN" values (e.g. 'loan_advances', 'director_fees', 'restricted_stock_deferred') would introduce bias. However, a disproportion in the number of "NaN" values between the POI group and the non-POI group might be more problematic. Features with a large difference between the % NaN in the POI group and the % NaN in the non-POI group, such as 'other' and 'expenses', are likely biased by "NaN" values. If a supervised classification algorithm used 'other' as a feature, it could treat a "NaN" in 'other' as a clue that a person is a non-POI and learn to associate "NaN" with the non-POI label.

I am not sure whether it is acceptable to associate a lack of information, such as a "NaN" value, with a particular label. I will keep this in mind and consider excluding the NaN-biased features at the feature selection stage.
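The per-class comparison above boils down to one small calculation. Here is a minimal sketch in plain Python; the six-person `toy` dictionary and the helper name `nan_rate_by_class` are invented for illustration and are not part of the project code.

```python
def nan_rate_by_class(data, feature, missing="NaN"):
    """Return {is_poi: percent of records whose `feature` is missing}."""
    counts = {True: [0, 0], False: [0, 0]}  # label -> [n_missing, n_total]
    for record in data.values():
        label = bool(record["poi"])
        counts[label][1] += 1
        if record[feature] == missing:
            counts[label][0] += 1
    rates = {}
    for label, (n_missing, n_total) in counts.items():
        rates[label] = 100.0 * n_missing / n_total if n_total else 0.0
    return rates

# Toy data: no POI is missing 'other', half of the non-POIs are.
toy = {
    "A": {"poi": True, "other": 10},
    "B": {"poi": True, "other": 20},
    "C": {"poi": False, "other": "NaN"},
    "D": {"poi": False, "other": 5},
    "E": {"poi": False, "other": "NaN"},
    "F": {"poi": False, "other": 7},
}

rates = nan_rate_by_class(toy, "other")
diff = rates[True] - rates[False]  # same sign convention as diff_% above
print(rates, diff)
```

A strongly negative difference, as in this toy case, is the situation described above: a missing value becomes a proxy for the non-POI label.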


## Summary of data exploration
- Total number of data points: 146
- Total number of data points labeled as POI: 18
- Total number of data points labeled as non-POI: 128
- Imbalanced classes
- Number of missing POIs: 17
- Number of initial features: 21
- List of features with the number of "NaN" value greater than 73 (50% cut-off): 

| feature name  | number of NaN  |
|:---:|:---:|
| 'loan_advances' | 142  |
| 'director_fees'  | 129  |
| 'restricted_stock_deferred'  | 128  |
|  'deferral_payments' | 107  |
| 'deferred_income'  | 97  |
| 'long_term_incentive'  |  80 |
    

- List of features with "NaN" value disproportionally distributed between POI vs. non-POI groups:

|    feature_name   | NaN_total | NaN_poi | NaN_non-poi | %NaN_in_poi | %NaN_in_non-poi | %Difference|
|:-----------------:|:---------:|:-------:|:-----------:|:-----------:|:---------------:|:---------------:|
|      'other'      |     53    |    0    |      53     |      0      |        41       |       -41       |
|     'expenses'    |     51    |    0    |      51     |      0      |        40       |       -40       |
|      'bonus'      |     64    |    2    |      62     |      11     |        48       |       -37       |
|      'salary'     |     51    |    1    |      50     |      6      |        39       |       -34       |
| 'deferred_income' |     97    |    7    |      90     |      39     |        70       |       -31       |

## Q1-3: How machine learning is useful in trying to accomplish the project goal and answer the project question

It is uncertain whether the existing financial and email dataset can provide good indicators/predictors for identifying POIs. After data exploration, I realized that there are some limitations, such as NaN-driven bias and about half of the POIs being missing. 

Even with these limitations, machine learning can be useful for discovering hidden patterns in the features associated with POI labels and for understanding the relationship between a feature, or a bundle of features, and the POI label. After validating and evaluating the performance of a machine learning algorithm, we can answer whether these simple numeric features can indicate or predict who is a POI. 

According to the scikit-learn algorithm cheat-sheet below, the path is: predicting a category > yes > do you have labeled data > yes > fewer than 100k samples > yes, and the options are:

- Linear SVC 
- KNeighbors 
- SVC
- Ensemble Classifiers

![image](http://scikit-learn.org/stable/_static/ml_map.png)

To review the algorithms covered in the Udacity lectures, I will also try:

- Gaussian Naive Bayes
- Decision Trees
- Adaboost (boosted decision tree)
- Random Forest


# Outlier Investigation

### Who has the most NaN?

In [12]:
# create a dictionary of person and count of NaN pairs
missing_value = {}

for person in data_dict:
    missing_value[person] = 0
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            missing_value[person] +=1

# sort the dictionary by ascending ordering of values 
missing_value = sorted(missing_value.items(), key=operator.itemgetter(1))

# print top 5 those who have the most NaN
pprint.pprint(missing_value[-5:])

[('WHALEY DAVID A', 18),
 ('WROBEL BRUCE', 18),
 ('THE TRAVEL AGENCY IN THE PARK', 18),
 ('GRAMM WENDY L', 18),
 ('LOCKHART EUGENE E', 20)]


### Glance at numerical variable distributions

In [13]:
# to get summary statistics of each feature, I use a pandas dataframe
# convert a python dictionary to a dataframe 
# with features as columns and people as rows
df = pd.DataFrame(data_dict)
df_trans = df.transpose()

In [15]:
# to get numerical statistics, replace the string "NaN" with zero (0)
def to_zero(v):
    if v == 'NaN':
        v = 0
    return v
df_trans = df_trans.applymap(to_zero)

# check any numpy NaN
print df_trans.isnull().sum().sum()

# summary of variable distribution and center statistics
df_trans.describe()

0


Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0
mean,1333474.0,438796.5,-382762.2,19422.49,4182736.0,70748.27,358.60274,38.226027,24.287671,1149658.0,664683.9,585431.8,1749257.0,20516.37,365811.4,692.986301,1221.589041,4350622.0,5846018.0
std,8094029.0,2741325.0,2378250.0,119054.3,26070400.0,432716.3,1441.259868,73.901124,79.278206,9649342.0,4046072.0,3682345.0,10899950.0,1439661.0,2203575.0,1072.969492,2226.770637,26934480.0,36246810.0
min,0.0,-102500.0,-27992890.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,-7576788.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,-37926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,0.0,93944.75,228869.5
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,16.5,2.5,0.0,0.0,0.0,959.5,360528.0,0.0,210596.0,102.5,289.0,941359.5,965955.0
75%,800000.0,9684.5,0.0,0.0,1714221.0,53740.75,51.25,40.75,13.75,0.0,375064.8,150606.5,814528.0,0.0,270850.5,893.5,1585.75,1968287.0,2319991.0
max,97343620.0,32083400.0,0.0,1398517.0,311764000.0,5235198.0,14368.0,528.0,609.0,83925000.0,48521930.0,42667590.0,130322300.0,15456290.0,26704230.0,5521.0,15149.0,309886600.0,434509500.0


## Q1-4: Are there any outliers in the dataset?

In [16]:
# I defined outliers as values above the 99% quantile here
# get lists of people above 99% quantile for each feature
highest = {}
for column in df_trans.columns:
    if df_trans[column].dtypes == "int64":
        highest[column]=[]
        q = df_trans[column].quantile(0.99)
        highest[column] = df_trans[df_trans[column] > q].index.tolist()
    
pprint.pprint(highest)

{'bonus': ['LAVORATO JOHN J', 'TOTAL'],
 'deferral_payments': ['FREVERT MARK A', 'TOTAL'],
 'deferred_income': [],
 'director_fees': ['BHATNAGAR SANJAY', 'TOTAL'],
 'exercised_stock_options': ['LAY KENNETH L', 'TOTAL'],
 'expenses': ['MCCLELLAN GEORGE', 'TOTAL'],
 'from_messages': ['KAMINSKI WINCENTY J', 'KEAN STEVEN J'],
 'from_poi_to_this_person': ['DIETRICH JANET R', 'LAVORATO JOHN J'],
 'from_this_person_to_poi': ['DELAINEY DAVID W', 'LAVORATO JOHN J'],
 'loan_advances': ['LAY KENNETH L', 'TOTAL'],
 'long_term_incentive': ['MARTIN AMANDA K', 'TOTAL'],
 'other': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock': ['LAY KENNETH L', 'TOTAL'],
 'restricted_stock_deferred': ['BELFER ROBERT', 'BHATNAGAR SANJAY'],
 'salary': ['SKILLING JEFFREY K', 'TOTAL'],
 'shared_receipt_with_poi': ['BELDEN TIMOTHY N', 'SHAPIRO RICHARD S'],
 'to_messages': ['KEAN STEVEN J', 'SHAPIRO RICHARD S'],
 'total_payments': ['LAY KENNETH L', 'TOTAL'],
 'total_stock_value': ['LAY KENNETH L', 'TOTAL']}
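To see why a sum row such as 'TOTAL' keeps appearing, the quantile rule can be traced by hand. This is a minimal sketch with invented numbers; the `quantile` helper uses linear interpolation between neighboring ranks, which is also the default behavior of the pandas `quantile` method used above.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def above_quantile(column, q=0.99):
    """Names whose value is strictly above the q-quantile of the column."""
    threshold = quantile(list(column.values()), q)
    return sorted(name for name, value in column.items() if value > threshold)

# Invented bonus column where 'TOTAL' is the sum of the other four rows.
bonus = {"A": 10, "B": 20, "C": 30, "D": 40, "TOTAL": 100}

threshold = quantile(list(bonus.values()), 0.99)
flagged = above_quantile(bonus)
print(threshold, flagged)
```

Because a sum of non-negative entries is at least as large as every individual entry, the aggregate row lands above the cut-off for most non-negative financial columns.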


### What are the outliers repeatedly shown among the features?

In [17]:
# summarize the previous dictionary, highest
# create a dictionary of outliers and the frequency of being outlier
highest_count = {}
for feature in highest:
    for person in highest[feature]:
        if person not in highest_count:
            highest_count[person] = 1
        else:
            highest_count[person] += 1
            
highest_count = sorted(highest_count.items(), key=operator.itemgetter(1))   
highest_count

[('DELAINEY DAVID W', 1),
 ('MARTIN AMANDA K', 1),
 ('SKILLING JEFFREY K', 1),
 ('BELDEN TIMOTHY N', 1),
 ('DIETRICH JANET R', 1),
 ('FREVERT MARK A', 1),
 ('KAMINSKI WINCENTY J', 1),
 ('BELFER ROBERT', 1),
 ('MCCLELLAN GEORGE', 1),
 ('KEAN STEVEN J', 2),
 ('BHATNAGAR SANJAY', 2),
 ('SHAPIRO RICHARD S', 2),
 ('LAVORATO JOHN J', 3),
 ('LAY KENNETH L', 6),
 ('TOTAL', 12)]

## Summary of Outlier Investigation

- Top 5 people with the most "NaN" values:

|          person name          | number of NaN |
|:-----------------------------:|:-------------:|
|       LOCKHART EUGENE E       |       20      |
|         GRAMM WENDY L         |       18      |
| THE TRAVEL AGENCY IN THE PARK |       18      |
|          WROBEL BRUCE         |       18      |
|         WHALEY DAVID A        |       18      |

- Top 3 people repeatedly shown as outliers:

|   person name   | frequency of being outlier |
|:---------------:|:--------------------------:|
|      TOTAL      |             12             |
|  LAY KENNETH L  |              6             |
| LAVORATO JOHN J |              3             |

### Take a look at outliers

In [18]:
df[['LOCKHART EUGENE E', 'GRAMM WENDY L', \
    'THE TRAVEL AGENCY IN THE PARK', \
    'WROBEL BRUCE', 'WHALEY DAVID A', \
    'TOTAL', 'LAY KENNETH L', 'LAVORATO JOHN J']]

| feature | LOCKHART EUGENE E | GRAMM WENDY L | THE TRAVEL AGENCY IN THE PARK | WROBEL BRUCE | WHALEY DAVID A | TOTAL | LAY KENNETH L | LAVORATO JOHN J |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| bonus | | | | | | 97343619 | 7000000 | 8000000 |
| deferral_payments | | | | | | 32083396 | 202911 | |
| deferred_income | | | | | | -27992891 | -300000 | |
| director_fees | | 119292 | | | | 1398517 | | |
| email_address | | | | | | | kenneth.lay@enron.com | john.lavorato@enron.com |
| exercised_stock_options | | | | 139130 | 98718 | 311764000 | 34348384 | 4158995 |
| expenses | | | | | | 5235198 | 99832 | 49537 |
| from_messages | | | | | | | 36 | 2585 |
| from_poi_to_this_person | | | | | | | 123 | 528 |
| from_this_person_to_poi | | | | | | | 16 | 411 |


## Q1-5: How to handle outliers?

'TOTAL' appears to be an outlier introduced by a spreadsheet quirk: it is the sum of all entries in the [pdf financial data](enron61702insiderpay.pdf). It needs to be removed from the dataset.

In addition, 'LOCKHART EUGENE E' should probably be removed as well, because every one of his features is NaN and he is labeled as non-POI. 

Among the outliers and the data points with too many missing values, only 'LAY KENNETH L' is labeled as POI; he was chairman of the Enron board of directors. So I think the extreme values for this individual are meaningful rather than the result of typos or technical errors.

'LAVORATO JOHN J' is an interesting individual: he received the largest bonus and was among those who communicated most frequently with POIs via email, but he is not labeled as POI. So I expect this person to lie near the classification boundary or to be misclassified.

I am inclined to keep the other detected outliers, including 'THE TRAVEL AGENCY IN THE PARK'. According to the footnote in the [pdf financial data](enron61702insiderpay.pdf), the travel agency was co-owned by the sister of Enron's former Chairman, and I don't have a solid reason to exclude it from the dataset.

- List of data points to remove:
    
    - 'TOTAL'
    - 'LOCKHART EUGENE E'

In [19]:
### there's an outlier--remove it! 
data_dict.pop("TOTAL", 0)
data_dict.pop("LOCKHART EUGENE E", 0)
len(data_dict)

144

Number of keys: 146 - 1 ('TOTAL') - 1 (all features NaN) = 144
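Records like 'LOCKHART EUGENE E' can also be found programmatically rather than by inspection. The following is a minimal sketch on an invented three-entry dictionary; the helper name `all_missing` is hypothetical, and the aggregate row is still listed by hand because nothing in the data marks it as a sum.

```python
def all_missing(data, label_key="poi", missing="NaN"):
    """Names whose every feature, apart from the label, is missing."""
    names = []
    for name, record in data.items():
        features = [v for k, v in record.items() if k != label_key]
        if features and all(v == missing for v in features):
            names.append(name)
    return sorted(names)

# Invented records: one real entry, one empty entry, one aggregate row.
toy = {
    "DOE JANE A": {"poi": False, "salary": 100, "bonus": "NaN"},
    "EMPTY PERSON": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "TOTAL": {"poi": False, "salary": 100, "bonus": 0},
}

to_drop = all_missing(toy) + ["TOTAL"]  # aggregate row named explicitly
cleaned = {name: rec for name, rec in toy.items() if name not in to_drop}
print(sorted(cleaned))
```

Building a new dictionary instead of popping keys keeps the original data intact, which makes it easy to rerun the exploration cells.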

In [20]:
# update dataframe excluding outliers
df = pd.DataFrame(data_dict)
df_trans = df.transpose()
df_trans = df_trans.applymap(to_zero)

***
# Part2. Feature Engineering

As part of the project, I should attempt to engineer my own feature that does not come ready-made in the dataset. Before creating new features, I need to explore the existing ones. 

## Take a look at features

### 1. Email features

    to_messages, from_poi_to_this_person, from_messages, from_this_person_to_poi, shared_receipt_with_poi


Among the 6 email features (the 5 listed above plus email_address), I think email_address can be removed: it is the only non-numerical one, and I don't think it gives any meaningful information for classifying the labels. 


### 2. Financial features can be grouped into two categories: payments and stock value

| categories  | features with positive values                                                                        | features with negative values | summed to         |
|-------------|------------------------------------------------------------------------------------------------------|-------------------------------|-------------------|
| payments    | salary, bonus, long_term_incentive, deferral_payments, loan_advances, other, expenses, director_fees | deferred_income               | total_payments    |
| stock value | exercised_stock_options, restricted_stock                                                            | restricted_stock_deferred     | total_stock_value |

'total_payments' and 'total_stock_value' are the summary features of each category. They can either represent the latent features of the two categories well or cancel out meaningful patterns of the individual features. So, here are some potential ways I can engineer the features.

## Brainstorm How to Treat Features

### 1. Treat all the numerical features individually
    - Feature transformation using PCA (requires feature scaling prior to PCA) then feature selection
    - Feature selection directly without any transformation
### 2. Treat the numerical features as 3 latent features (payment, stock, and email)
    - Feature transformation using PCA separately (each latent feature has a set of PCA feature) then feature selection
    - Relativization prior to PCA transformation then feature selection
    - Relativization then feature selection

**Relativization can be achieved two ways:**
    1. feature/summed to
    2. feature/(summed to - feature with negative values), because the feature with negative values partially cancels out the sum
                
**For email features, create features for the relative fraction of messages exchanged with POI among total messages:**
     1. ("from_this_person_to_poi" + "from_poi_to_this_person")/("from_messages" + "to_messages")
     2. "from_poi_to_this_person"/"to_messages"
     3. "from_this_person_to_poi"/"from_messages"
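A small worked example may help to compare the two relativization methods and the three email fractions. All numbers below are made up for one hypothetical person, and `safe_ratio` is an illustrative helper that returns 0 when the denominator is 0.

```python
def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return float(numerator) / denominator if denominator else 0.0

# Hypothetical person: total_payments = 200 + 300 - 100 = 400
person = {
    "salary": 200, "bonus": 300, "deferred_income": -100,
    "total_payments": 400,
    "from_this_person_to_poi": 10, "from_messages": 50,
    "from_poi_to_this_person": 15, "to_messages": 150,
}

# Method 1: feature / summed-to feature
rel_salary_1 = safe_ratio(person["salary"], person["total_payments"])

# Method 2: subtracting the negative feature restores the sum of the
# positive features (400 - (-100) = 500) before dividing
gross = person["total_payments"] - person["deferred_income"]
rel_salary_2 = safe_ratio(person["salary"], gross)

# Email fractions 1 to 3 from the list above
fraction_poi = safe_ratio(
    person["from_this_person_to_poi"] + person["from_poi_to_this_person"],
    person["from_messages"] + person["to_messages"])
fraction_from_poi = safe_ratio(person["from_poi_to_this_person"],
                               person["to_messages"])
fraction_to_poi = safe_ratio(person["from_this_person_to_poi"],
                             person["from_messages"])

print(rel_salary_1, rel_salary_2)
print(fraction_poi, fraction_from_poi, fraction_to_poi)
```

With method 1 the shares of the positive features can add up to more than 1 whenever a negative feature is present (here 200/400 + 300/400 = 1.25), while method 2 keeps them within 0 to 1.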

# Remove features
email_address is not a numeric variable, so I will remove this feature from the dataframe.

In [21]:
# remove column email_address from df_trans
df_trans = df_trans.drop('email_address', 1)

# Create new features

## Q2-1: what features to create and the rationale behind it
I will create 12 new features holding the relative values of the payment and stock features, using relativization method 1, and 3 new features holding the fraction of emails exchanged with POIs.

In [37]:
# to separate the POI label from features_list and remove email_address
label = ['poi']
features_list.remove('poi')
features_list.remove('email_address')
print len(features_list)
features_list

19


['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person']

In [38]:
# create new features of relative values of each payment feature to total_payments
payment_features = ['salary', 'bonus', 'long_term_incentive', \
                    'deferral_payments', 'loan_advances', 'other', \
                    'expenses', 'director_fees', 'deferred_income']

rel_payment = []
for feature in payment_features:
    new_feature_name = 'rel_' + feature
    df_trans[new_feature_name] = (df_trans[feature]/df_trans['total_payments']).replace([np.inf, -np.inf, np.nan], 0)
    rel_payment.append(new_feature_name)

print len(rel_payment)
rel_payment

9


['rel_salary',
 'rel_bonus',
 'rel_long_term_incentive',
 'rel_deferral_payments',
 'rel_loan_advances',
 'rel_other',
 'rel_expenses',
 'rel_director_fees',
 'rel_deferred_income']

In [39]:
payment_features.append('total_payments')
print len(payment_features)
payment_features

10


['salary',
 'bonus',
 'long_term_incentive',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'deferred_income',
 'total_payments']

In [26]:
# create new features of relative values of each stock feature to total_stock_value
stock_features = ['exercised_stock_options', 'restricted_stock', \
                  'restricted_stock_deferred']

rel_stock = []
for feature in stock_features:
    new_feature_name = 'rel_' + feature
    df_trans[new_feature_name] = (df_trans[feature]/df_trans['total_stock_value']).replace([np.inf, -np.inf, np.nan], 0)
    rel_stock.append(new_feature_name)

rel_stock

['rel_exercised_stock_options',
 'rel_restricted_stock',
 'rel_restricted_stock_deferred']

In [27]:
stock_features.append('total_stock_value')
stock_features

['exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value']

In [40]:
financial_features = payment_features+stock_features
print len(financial_features)
financial_features

14


['salary',
 'bonus',
 'long_term_incentive',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'deferred_income',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value']

In [41]:
rel_financial_features = rel_payment+rel_stock
print len(rel_financial_features)
rel_financial_features

12


['rel_salary',
 'rel_bonus',
 'rel_long_term_incentive',
 'rel_deferral_payments',
 'rel_loan_advances',
 'rel_other',
 'rel_expenses',
 'rel_director_fees',
 'rel_deferred_income',
 'rel_exercised_stock_options',
 'rel_restricted_stock',
 'rel_restricted_stock_deferred']

In [30]:
# create new features of fraction of emails exchanged with POI
df_trans['fraction_poi']=((df_trans['from_this_person_to_poi']+\
                          df_trans['from_poi_to_this_person'])/\
(df_trans['from_messages']+df_trans['to_messages'])).fillna(0)

df_trans['fraction_to_poi']=(df_trans['from_this_person_to_poi']/\
df_trans['from_messages']).fillna(0)

df_trans['fraction_from_poi']=(df_trans['from_poi_to_this_person']/\
df_trans['to_messages']).fillna(0)

In [42]:
# numeric feature list, which excludes the email address
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages',
                     'from_this_person_to_poi', 'shared_receipt_with_poi', 
                      'fraction_poi', 'fraction_to_poi', 'fraction_from_poi']
len(email_features)

8

In [43]:
total_features = financial_features + email_features
rel_total_features = rel_financial_features + email_features

print len(total_features)
print len(rel_total_features)

22
20


In [33]:
df_trans.describe()

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,...,rel_other,rel_expenses,rel_director_fees,rel_deferred_income,rel_exercised_stock_options,rel_restricted_stock,rel_restricted_stock_deferred,fraction_poi,fraction_to_poi,fraction_from_poi
count,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,...,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0
mean,675997.4,222089.6,-193683.3,9980.319444,2075802.0,35375.340278,363.583333,38.756944,24.625,582812.5,...,0.108559,0.095527,5.914364,-6.082185,0.498924,0.403771,-0.049046,0.028493,0.109922,0.022672
std,1233155.0,754101.3,606011.1,31300.575144,4795513.0,45309.303038,1450.675239,74.276769,79.778266,6794472.0,...,0.221239,0.240176,58.879276,58.868342,0.396188,0.473146,0.255201,0.042827,0.185935,0.036417
min,0.0,-102500.0,-3504386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-701.013514,-0.074502,0.0,-2.493526,0.0,0.0,0.0
25%,0.0,0.0,-37086.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-0.077054,0.0,0.0,0.0,0.0,0.0,0.0
50%,300000.0,0.0,0.0,0.0,608293.5,20182.0,17.5,4.0,0.0,0.0,...,0.00072,0.015768,0.0,0.0,0.627935,0.284209,0.0,0.008772,0.0,0.004952
75%,800000.0,8535.5,0.0,0.0,1683580.0,53328.25,53.0,41.25,14.0,0.0,...,0.075646,0.055635,0.0,0.0,0.850136,0.650782,0.0,0.043337,0.198827,0.029918
max,8000000.0,6426990.0,0.0,137864.0,34348380.0,228763.0,14368.0,528.0,609.0,81525000.0,...,1.0,1.0,701.013514,0.0,1.0,3.493526,0.0,0.224352,1.0,0.217341


In [34]:
# check any numpy NaN
df_trans.isnull().sum().sum()

0L

## Summary of Feature Exploration

| List Name | Features | # of Features |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| features_list | ['salary','to_messages','deferral_payments','total_payments','exercised_stock_options','bonus', 'restricted_stock','shared_receipt_with_poi','restricted_stock_deferred','total_stock_value', 'expenses','loan_advances','from_messages','other','from_this_person_to_poi','director_fees', 'deferred_income','long_term_incentive','from_poi_to_this_person'] | 19 |
| rel_payment | ['rel_salary','rel_bonus','rel_long_term_incentive','rel_deferral_payments','rel_loan_advances', 'rel_other','rel_expenses','rel_director_fees','rel_deferred_income'] | 9 |
| payment_features | ['salary','bonus','long_term_incentive','deferral_payments','loan_advances','other','expenses', 'director_fees','deferred_income','total_payments'] | 10 |
| rel_stock | ['rel_exercised_stock_options','rel_restricted_stock','rel_restricted_stock_deferred'] | 3 |
| stock_features | ['exercised_stock_options','restricted_stock','restricted_stock_deferred','total_stock_value'] | 4 |
| financial_features | payment_features+stock_features | 14 |
| rel_financial_features | rel_payment+rel_stock | 12 |
| email_features | ['to_messages', 'from_poi_to_this_person', 'from_messages','from_this_person_to_poi',  'shared_receipt_with_poi','fraction_poi', 'fraction_to_poi', 'fraction_from_poi'] | 8 |
| total_features | financial_features + email_features | 22 |
| rel_total_features | rel_financial_features + email_features | 20 |

# Feature Scaling

## Q2-2: do I have to do any scaling? why or why not?
Yes. The financial features (in $) and the email features (message counts) span very different ranges, so I will use **MinMaxScaler** to rescale every feature to the 0-1 range and give them equal weight.

In [44]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_trans), \
                         index=df_trans.index, columns=df_trans.columns)

In [47]:
df_scaled.shape # returns (number of rows, number of columns)

(144, 35)

In [46]:
df_scaled.describe()

| | bonus | deferral_payments | deferred_income | director_fees | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | loan_advances | ... | rel_other | rel_expenses | rel_director_fees | rel_deferred_income | rel_exercised_stock_options | rel_restricted_stock | rel_restricted_stock_deferred | fraction_poi | fraction_to_poi | fraction_from_poi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | ... | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 | 144.0 |
| mean | 0.0845 | 0.049711 | 0.944731 | 0.072392 | 0.060434 | 0.154638 | 0.025305 | 0.073403 | 0.040435 | 0.007149 | ... | 0.108559 | 0.095527 | 0.008437 | 0.991324 | 0.533666 | 0.115577 | 0.980331 | 0.127001 | 0.109922 | 0.104317 |
| std | 0.154144 | 0.115492 | 0.172929 | 0.22704 | 0.139614 | 0.198062 | 0.100966 | 0.140676 | 0.130999 | 0.083342 | ... | 0.221239 | 0.240176 | 0.083992 | 0.083976 | 0.368718 | 0.135435 | 0.102345 | 0.190891 | 0.185935 | 0.167558 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.015698 | 0.989417 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.99989 | 0.069336 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0375 | 0.015698 | 1.0 | 0.0 | 0.01771 | 0.088222 | 0.001218 | 0.007576 | 0.0 | 0.0 | ... | 0.00072 | 0.015768 | 0.0 | 1.0 | 0.653732 | 0.081353 | 1.0 | 0.039099 | 0.0 | 0.022782 |
| 75% | 0.1 | 0.017005 | 1.0 | 0.0 | 0.049015 | 0.233116 | 0.003689 | 0.078125 | 0.022989 | 0.0 | ... | 0.075646 | 0.055635 | 0.0 | 1.0 | 0.860527 | 0.186282 | 1.0 | 0.193167 | 0.198827 | 0.137655 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


# Feature Selection

## Q2-3: why do I need to select features?

The goal of feature selection is to keep only the most informative features, or to reduce the dimensionality of the feature space. According to [a blog post by Jason Brownlee](http://machinelearningmastery.com/feature-selection-machine-learning-python/), having irrelevant features in the dataset can decrease the accuracy of many models.

Three benefits of performing feature selection before modeling your data are:

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.


## Q2-4: what selection process to use?

1. Univariate Selection such as SelectKBest: statistical tests can be used to select the features that have the strongest relationship with the output variable. For the first trial, I will choose 7 or fewer features. The threshold of 7 comes from the curse of dimensionality: you may need exponentially more data points as you add features, roughly 2^(n_features) = number of data points. I have 144 data points, and 2^7 = 128 fits within 144 while 2^8 = 256 does not, so 7 is the maximum number of features under this rule of thumb. Thus, I use **SelectKBest** to pick 7 features.

2. Dimensionality Reduction such as PCA: PCA (Principal Component Analysis) uses linear algebra to transform the dataset into a compressed form. I think choosing 2-3 dimensions after the PCA transformation could be a good start.
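
The arithmetic behind the 7-feature threshold can be checked with a few lines of Python. This is only a sketch of the 2^n rule of thumb quoted above; the helper name `max_features_for` is hypothetical.

```python
def max_features_for(n_samples):
    """Largest n with 2**n <= n_samples (the rule of thumb used above)."""
    n = 0
    while 2 ** (n + 1) <= n_samples:
        n += 1
    return n

n_samples = 144
k_max = max_features_for(n_samples)
print("max features: {0} (2**{0} = {1}, 2**{2} = {3})".format(
    k_max, 2 ** k_max, k_max + 1, 2 ** (k_max + 1)))
```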

## Q2-5: which feature scores to compare and reasons for the choice of parameter values

I choose the **f_classif** scoring function over variance, chi2, and mutual_info_classif.

- [Variance](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) can be useful for unsupervised learning, because it looks only at the features and ignores the labels. Since I already have labels, using them for scoring should be better than relying solely on the x-variables.

- The chi-square distribution arises in tests of hypotheses concerning the independence of two random variables and concerning whether a discrete random variable follows a specified distribution. The F-distribution arises in tests of hypotheses concerning whether or not two population variances are equal and concerning whether or not three or more population means are equal. In other words, chi-square is most appropriate for categorical data, whereas f-value can be used for continuous data [(read more)](https://discussions.udacity.com/t/f-classif-versus-chi2/245226).

- [The mutual information (MI)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
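
To show what an ANOVA F-value measures, the sketch below computes the one-way F statistic by hand for a single feature split into two groups. It is a pure-Python illustration with made-up values, not scikit-learn's implementation; the names `anova_f_value`, `poi_bonus`, and `non_poi_bonus` are hypothetical.

```python
def anova_f_value(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / float(len(all_values))
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / float(len(g))
        # spread of the group means around the grand mean
        ss_between += len(g) * (mean - grand_mean) ** 2
        # spread of the values inside each group
        ss_within += sum((v - mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# made-up scaled values of one feature for POIs and non-POIs
poi_bonus = [2.0, 3.0, 4.0]
non_poi_bonus = [1.0, 2.0, 3.0]
f_value = anova_f_value([poi_bonus, non_poi_bonus])
print("F = {0:.3f}".format(f_value))
```

A larger F means the group means differ more relative to the scatter within each group, which is why features with a high F-value are ranked first.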

In [53]:
# select the 7 features with the highest ANOVA F-value against the poi label
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=7)  # f_classif is also the default
selected7 = selector.fit_transform(df_scaled[features_list], df_scaled['poi'])
selected7.shape

(144L, 7L)

### Feature Scores

In [54]:
scores = zip(features_list, selector.scores_, selector.pvalues_)
sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
print "features with F-value & p-value:"
for rank, item in enumerate(sorted_scores, 1):
    print rank, item

features with F-value & p-value:
1 ('exercised_stock_options', 25.097541528735491, 1.5945438463623382e-06)
2 ('total_stock_value', 24.467654047526391, 2.1058066490127594e-06)
3 ('bonus', 21.060001707536578, 9.7024743412322453e-06)
4 ('salary', 18.575703268041778, 3.0337961075305315e-05)
5 ('deferred_income', 11.595547659732164, 0.00085980314391924004)
6 ('long_term_incentive', 10.072454529369448, 0.0018454351466116368)
7 ('restricted_stock', 9.3467007910514379, 0.0026699611393240469)
8 ('total_payments', 8.8667215371077805, 0.0034159213705928374)
9 ('shared_receipt_with_poi', 8.7464855321290802, 0.0036344020243633686)
10 ('loan_advances', 7.2427303965360172, 0.0079738162605691599)
11 ('expenses', 6.234201140506757, 0.013673150875383932)
12 ('from_poi_to_this_person', 5.3449415231473347, 0.022220727960811395)
13 ('other', 4.2049708583014187, 0.042144700903259204)
14 ('from_this_person_to_poi', 2.4265081272428799, 0.12152433983710857)
15 ('director_fees', 2.1076559432760891, 0.1487694952

In [52]:
optimized_features_list = list(map(lambda x: x[0], sorted_scores))[0:7]
print(optimized_features_list)

['exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'deferred_income', 'long_term_incentive', 'restricted_stock']


I have noticed that for some features (e.g. bonus, salary, and deferred_income), "NaN" values are distributed disproportionately between the POI and non-POI groups, and these same features show a statistically strong relationship with the labels. It is possible that their F-scores are driven partly by this missing-value pattern rather than by the underlying values.

# Part3. Algorithm Search Planning 

# Validation Strategy

## Q3-1: what is validation?
Validation is the process of assessing how well a machine-learning algorithm performs on data that it was not trained on.

## Q3-2: what is a classic mistake you can make if you do it wrong? 
A classic mistake for my analysis is over-fitting. Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: it leads to an almost perfect score, but the model would fail to predict anything useful on unseen data.
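
The mistake can be illustrated with a deliberately extreme toy model. The `Memorizer` class below is hypothetical and exists only for this sketch: it stores every training point and guesses the majority class for anything else. The labels are random, so there is nothing real to learn, yet the score on the training data looks perfect.

```python
import random

class Memorizer(object):
    """Toy classifier: remembers training points, guesses the majority class otherwise."""
    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.default = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]

def accuracy(y_true, y_pred):
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))

rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]  # random points
y = [rng.randint(0, 1) for _ in X]                      # labels unrelated to X

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

memo = Memorizer().fit(X_train, y_train)
train_accuracy = accuracy(y_train, memo.predict(X_train))
test_accuracy = accuracy(y_test, memo.predict(X_test))
print("train accuracy: {0:.2f}".format(train_accuracy))
print("test accuracy:  {0:.2f}".format(test_accuracy))
```

Only the score on the held-out points says anything about how the model would behave on new data.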

## Q3-3: how to validate algorithm analysis?  
I think a proper validation method for a dataset with imbalanced classes is to use cross-validation iterators that stratify on the class labels, such as **StratifiedKFold** and **StratifiedShuffleSplit**. This ensures that the relative class frequencies are approximately preserved in each train and test set.

In [57]:
# generate an iterator over 3 train-test pairs (test set size = 0.33)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=44)

# note: each pass overwrites the previous split, so only the last pair is kept
for train_index, test_index in skf.split(selected7, df_scaled['poi']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = selected7[train_index], selected7[test_index]
    # .iloc selects by position, matching the integer indices from split()
    y_train, y_test = df_scaled['poi'].iloc[train_index], df_scaled['poi'].iloc[test_index]

print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

(96L, 7L) (96L,)
(48L, 7L) (48L,)


In [58]:
# generate an iterator over 1000 train-test pairs (default test set size = 0.1)
from sklearn.model_selection import StratifiedShuffleSplit

#sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.33, random_state=44)
sss = StratifiedShuffleSplit(n_splits=1000, random_state=44)

# note: each pass overwrites the previous split, so only the last pair is kept
for train_index, test_index in sss.split(selected7, df_scaled['poi']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = selected7[train_index], selected7[test_index]
    # .iloc selects by position, matching the integer indices from split()
    y_train, y_test = df_scaled['poi'].iloc[train_index], df_scaled['poi'].iloc[test_index]

print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

(129L, 7L) (129L,)
(15L, 7L) (15L,)


To evaluate the project, project reviewers use tester.py. Thus, it is convenient for me to use the same validation method. The validation method used in tester.py is below.

>cv = StratifiedShuffleSplit(labels, folds = 1000, random_state = 42)

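This is the old version of StratifiedShuffleSplit, which requires the labels at construction time. In the newer version the labels are passed to `split()` instead, so the `sss` below follows the same scheme (1000 stratified shuffles with the default test size of 0.1). Note that the random seed differs (44 vs. 42), so the individual splits are not identical to those in tester.py.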

>sss = StratifiedShuffleSplit(n_splits=1000, random_state=44)

Thus, I will stick with sss from now on. 

# Algorithm Exploration

## Q3-4: what algorithms to begin? 

When dealing with small amounts of data, it is reasonable to try as many algorithms as possible and to pick the best one, since the cost of experimentation is low, according to a [blog post by Cheng-Tao Chu](http://ml.posthaven.com/machine-learning-done-wrong). I will begin with the following:
- SVC
- KNeighbors 
- Gaussian Naive Bayes
- Decision Trees
- Adaboost (boosted decision tree)
- Random Forest

## Research about algorithms and their parameters

### 1. SVC Classifier

According to a [blog post by Cheng-Tao Chu](http://ml.posthaven.com/machine-learning-done-wrong), SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (number of samples << number of features), which is common in fields like medicine, the richer feature space implies a much higher risk of overfitting the data. In fact, high-variance models should be avoided entirely when n<<p.

According to [Edwin Chen](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/), SVMs offer high accuracy and nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. They are especially popular in classification problems where very high-dimensional spaces are the norm.

I think that SVC might be a good choice of classifier for this project, where we have multivariate features that could have non-linear interactions among features. 
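
As an illustration of what a kernel computes, the sketch below evaluates the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), for two made-up points at several values of gamma. It is a hand-written formula for intuition only; the function name `rbf_kernel` and the sample points are assumptions of this sketch.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

a = (0.1, 0.2)
b = (0.4, 0.6)
for gamma in (0.01, 1, 100):
    print("gamma={0}: K(a, b)={1:.4f}".format(gamma, rbf_kernel(a, b, gamma)))
```

A small gamma makes distant points look similar (a smoother decision boundary), while a large gamma makes similarity fall off quickly (a more flexible boundary that is easier to overfit). This is why gamma is worth searching over together with C.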

#### Parameters

- C (penalty): 'clf__C': [0.1, 1, 10, 100, 1000]
- kernel: 'clf__kernel': ['rbf', 'linear', 'poly']
- gamma (kernel coefficient): 'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001]
- tol (Tolerance for stopping criterion): 'clf__tol': [1e-3, 1e-4, 1e-5]
- class_weight: 'clf__class_weight': ['balanced', None]

In [61]:
# fit into SVC classifier and get accuracy using cross validation
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
svc = SVC()

scores = cross_val_score(svc, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.86666666666666692

### 2. GaussianNB Classifier

According to [Edwin Chen](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/), Naive Bayes (NB) is super simple: you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, an NB classifier still often does a great job in practice. It is a good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can't learn interactions between features.

According to [sklearn documentation](http://scikit-learn.org/stable/modules/naive_bayes.html), NB learners and classifiers can be extremely fast. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality. On the flip side, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

I think that NB might be a good choice of classifier for this project, where we have a limited number of observations (total = 144) and an even smaller training set for cross-validation. NB is a high-bias, low-variance classifier, which gives it an advantage over low-bias, high-variance classifiers like KNeighbors, since the latter tend to overfit small datasets.
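
The idea that each feature's class-conditional distribution is estimated independently can be sketched in pure Python. This is a simplified illustration, not scikit-learn's GaussianNB: the function names, the toy data, and the small `eps` added to each variance for numerical safety are all assumptions of this sketch.

```python
import math

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = float(len(rows))
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(y), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the largest log prior + summed log likelihoods."""
    best_label, best_score = None, None
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # log of the 1-D Gaussian density, one feature at a time
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label

# toy scaled data: two features, label 1 = POI
X = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15),
     (0.80, 0.90), (0.85, 0.75), (0.90, 0.85)]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, (0.12, 0.22)))
print(predict_gaussian_nb(nb_model, (0.88, 0.80)))
```

Because the per-feature terms are simply added, the model has no way to represent an interaction between two features, which is the disadvantage noted above.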

#### Parameters

- priors (Prior probabilities of the classes): default = None; meaning the priors are adjusted according to the data

In [62]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

scores = cross_val_score(gnb, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.86100000000000021

### 3. KNeighbors Classifier

According to [sklearn documentation](http://scikit-learn.org/stable/modules/neighbors.html#classification), the principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). Despite its simplicity, nearest neighbors have been successful in a large number of classification and regression problems. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

I think that KNeighbors can be a risky option since it tends to overfit a small dataset. But it could be a robust method for handling outliers because it is non-parametric.
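
The effect of `n_neighbors` can be seen in a minimal pure-Python version of the nearest-neighbor vote. This is a sketch with made-up points, not scikit-learn's implementation; the function name `knn_predict` is hypothetical.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.9, 0.8), (0.8, 0.9), (0.15, 0.2)]
train_y = [0, 0, 0, 1, 1, 1]  # the last point is a POI sitting among non-POIs

query = (0.15, 0.18)
pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k3 = knn_predict(train_X, train_y, query, k=3)
print("k=1 -> {0}, k=3 -> {1}".format(pred_k1, pred_k3))
```

With k=1 the prediction follows the single unusual neighbor, while k=3 is outvoted by the surrounding points. Small k fits the training data closely; larger k smooths the decision.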

#### Parameters

- n_neighbors : 'clf__n_neighbors': [5, 8, 10, 15]
- weights : 'clf__weights' : ['uniform','distance']
- algorithm : 'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
- metric [(distance metric)](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html) : 'clf__metric' : ['euclidean', 'manhattan', 'minkowski']

In [63]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()

scores = cross_val_score(neigh, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.85613333333333363

### 4. DecisionTree Classifier

According to [Edwin Chen](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/), DecisionTree (DT) is easy to interpret and explain. They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come on. Another disadvantage is that they easily overfit. According to [sklearn documentation](http://scikit-learn.org/stable/modules/tree.html), mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

I think that DT can be a risky classifier, not only because of overfitting, but also because of its tendency to be biased toward the dominant class when the dataset has imbalanced classes.
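
The "gini" criterion listed among the parameters below can be illustrated by hand. The sketch computes Gini impurity for a parent node and for one candidate split; the labels and the split are made up, and the helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = float(len(labels))
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Impurity of a split, weighting each side by its size."""
    n = float(len(left) + len(right))
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0, 0], [0, 1, 1]  # hypothetical split on one feature
parent_impurity = gini(parent)
split_impurity = weighted_gini(left, right)
print("parent: {0:.3f}, after split: {1:.3f}".format(parent_impurity, split_impurity))
```

A tree keeps choosing the split that lowers impurity the most. Without limits such as `max_depth` or `min_samples_leaf`, it can continue until every leaf is pure, which is how it overfits.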

#### Parameters

- criterion (measurement for split quality): 'clf__criterion': ['gini', 'entropy']
- splitter: 'clf__splitter': ['best', 'random']
- max_features : 'clf__max_features': [0.5, 'auto', 'log2', None]
- max_depth : 'clf__max_depth': [3, 5, 10, None]
- min_samples_leaf : 'clf__min_samples_leaf': [5, 4, 3, 2, 1]
- class_weight: 'clf__class_weight': ['balanced', None]

In [65]:
from sklearn import tree
dt = tree.DecisionTreeClassifier()

scores = cross_val_score(dt, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.80040000000000011

### 5. RandomForest Classifier

According to [sklearn documentation](http://scikit-learn.org/stable/modules/tree.html), a DT can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble. Another disadvantage is that DTs easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests (RF) are often the winner for lots of classification problems (usually slightly ahead of SVMs), they're scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

I think that RF can be a good alternative classifier to DT. According to [sklearn documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest), as a result of the randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.


#### Parameters

- n_estimators (the number of trees in the forest) : 'clf__n_estimators': [10, 50, 100, 200]
- criterion (measurement for split quality): 'clf__criterion': ['gini', 'entropy']
- max_features : 'clf__max_features': [0.5, 'auto', 'log2', None]
- max_depth : 'clf__max_depth': [3, 5, 10, None]
- min_samples_leaf : 'clf__min_samples_leaf': [5, 4, 3, 2, 1]
- class_weight: 'clf__class_weight': ['balanced', None]

In [66]:
from sklearn.ensemble import RandomForestClassifier
rdf = RandomForestClassifier()

scores = cross_val_score(rdf, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.85613333333333363

### 6. AdaBoost Classifier

The core principle of AdaBoost (AB) is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. According to [blog post by Cheng-Tao Chu](http://ml.posthaven.com/machine-learning-done-wrong), some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weights on outliers, while DT might simply count each outlier as one false classification. If the data set contains a fair number of outliers, it's important to either use modeling algorithm robust against outliers or filter the outliers out.

AB will run slowly, like RF, because of its iterative steps, but the advantage of both classifiers is high predictive accuracy. I think it is worth trying because we have a relatively small dataset.
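
The way boosting puts extra weight on "hard" cases can be shown with one round of the classic discrete AdaBoost update. This is a textbook sketch, not the SAMME.R variant that scikit-learn reports in the outputs below; the function name `adaboost_reweight` and the example of two misclassified points are assumptions.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the classic discrete AdaBoost weight update.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1.0 - error) / error)  # vote of this weak learner
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha

n = 10
weights = [1.0 / n] * n
misclassified = [False] * 8 + [True] * 2  # e.g. two outliers the weak learner gets wrong
new_weights, alpha = adaboost_reweight(weights, misclassified)
print("weight of a correct point:       {0:.4f}".format(new_weights[0]))
print("weight of a misclassified point: {0:.4f}".format(new_weights[-1]))
```

After a single round the two misclassified points carry half of the total weight, which is why a handful of outliers can dominate later rounds.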

#### Parameters

- base_estimator: default=DecisionTreeClassifier

In [67]:
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier()

scores = cross_val_score(adb, selected7, df_scaled['poi'], cv=sss)
scores.mean()

0.8212666666666667

# Evaluation Metrics Usage

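## Q3-5: give at least 2 evaluation metrics and the average performance for each of them.

- accuracy: correctly labeled points (predicted label == true label)/total testing data points
- precision: true POI/(true POI + false POI), i.e. of everyone flagged as a POI, the fraction who really are POIs
- recall: true POI/(true POI + false non-POI), i.e. of all real POIs, the fraction who were flagged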
- average_precision: the area under the precision-recall curve
- f1: 2 * (precision * recall) / (precision + recall)
- f1_weighted: Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
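
These definitions can be checked against raw confusion counts. The sketch below uses the counts that tester.py prints for the SVC pipeline later in this document (true positives 1101, false positives 2490, false negatives 899, true negatives 10510). The function name `classification_metrics` is hypothetical, and the sketch does not guard against a classifier that flags nobody (division by zero).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and F2 from confusion-matrix counts."""
    total = float(tp + fp + fn + tn)
    accuracy = (tp + tn) / total
    precision = tp / float(tp + fp)  # of those flagged as POI, how many are POIs
    recall = tp / float(tp + fn)     # of the real POIs, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)  # weights recall higher
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "f2": f2}

metrics = classification_metrics(tp=1101, fp=2490, fn=899, tn=10510)
for name in ("accuracy", "precision", "recall", "f1", "f2"):
    print("{0}: {1:.5f}".format(name, metrics[name]))
```

The results agree with the Accuracy, Precision, Recall, F1, and F2 values reported by tester.py for that run.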

See [Evaluate multiple scores on sklearn cross_val_score](https://stackoverflow.com/questions/35876508/evaluate-multiple-scores-on-sklearn-cross-val-score) for the approach used in the code below.

In [73]:
# compare evaluating metrics on SVC 
scorer = ["accuracy", "precision", "recall", "average_precision", "f1", "f1_weighted"]
def print_scores(clf):
    for score in scorer:
        m_score = cross_val_score(clf, selected7, df_scaled['poi'], cv=sss, \
                        scoring=score).mean()
        print score, ':', m_score

print_scores(svc)

accuracy : 0.866666666667
precision : 0.0
recall : 0.0
average_precision : 0.38899757881
f1 : 0.0
f1_weighted : 0.804761904762


In [74]:
# compare evaluating metrics on GaussianNB
print_scores(gnb)

accuracy : 0.861
precision : 0.429035714286
recall : 0.3995
average_precision : 0.483781832751
f1 : 0.389465873016
f1_weighted : 0.8492964868


# Algorithm Tuning

## Q3-6: what does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  

Machine learning algorithms are parameterized so that their behavior can be tuned for a given problem. Here, parameter tuning is important for adjusting the balance between precision and recall.

Parameter tuning refers to adjusting the settings of an algorithm in order to improve how well the trained model scores on held-out data. Parameters influence the outcome of the learning process, but the more the parameters are tuned, the more biased the algorithm becomes toward the training data and test harness. The strategy can be effective, but it can also lead to more fragile models that overfit the test harness and do not perform well in practice.

## Q3-7: How to tune the parameters of your particular algorithm? 

I can use automated parameter search processes, such as **GridSearchCV** and **RandomizedSearchCV**.

In [85]:
# tune parameters of SVC using GridSearchCV
from sklearn.model_selection import GridSearchCV

clf = svc
parameters = {'kernel': ['rbf', 'linear', 'poly'], \
              'C': [0.1, 1, 10, 100, 1000],\
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], \
              'degree': [3, 4, 5], \
              'class_weight':['balanced', None]}

grid_search = GridSearchCV(clf, parameters)
grid_result = grid_search.fit(selected7, df_scaled['poi']).best_estimator_
grid_result

SVC(C=0.1, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=4, gamma=0.1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [86]:
# compare evaluating metrics on best estimator from grid search cv
print_scores(grid_result)

accuracy : 0.133333333333
precision : 0.133333333333
recall : 1.0
average_precision : 0.4336371448
f1 : 0.235294117647
f1_weighted : 0.0313725490196


In [92]:
# tune parameters of SVC using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

parameters = {'kernel': ['rbf', 'linear', 'poly'], \
              'C': scipy.stats.expon(scale=100), \
              'gamma': scipy.stats.expon(scale=.1), \
              'class_weight':['balanced', None]}

random_search = RandomizedSearchCV(clf, parameters, n_iter=20)
start = time()
random_result = random_search.fit(selected7, df_scaled['poi']).best_estimator_
random_result

SVC(C=88.085553333754788, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.064289877113293634,
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [93]:
# compare evaluating metrics on best estimator from randomized search cv
print_scores(random_result)

accuracy : 0.867533333333
precision : 0.013
recall : 0.0065
average_precision : 0.421023055417
f1 : 0.00866666666667
f1_weighted : 0.806304938272
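
The difference between the two search strategies lies mainly in how candidates are generated. The pure-Python sketch below enumerates the SVC grid from the GridSearchCV cell above and then draws a fixed budget of 20 candidates from it at random. It is only an illustration of the search-space size: the actual RandomizedSearchCV cell samples `C` and `gamma` from continuous distributions, and no model is fitted here.

```python
import itertools
import random

param_grid = {
    "kernel": ["rbf", "linear", "poly"],
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "degree": [3, 4, 5],
    "class_weight": ["balanced", None],
}

names = sorted(param_grid)
# exhaustive search: every combination of every value
all_candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(param_grid[n] for n in names))]

# randomized search: a fixed budget of candidates drawn at random
rng = random.Random(0)
sampled_candidates = rng.sample(all_candidates, 20)

print("grid search candidates:       {0}".format(len(all_candidates)))
print("randomized search candidates: {0}".format(len(sampled_candidates)))
```

Each candidate is fitted once per cross-validation fold, so the exhaustive grid costs many more model fits than the randomized budget.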


# Part4. Algorithm Search

Deciding the number of features (e.g. k=7 for SelectKBest) is somewhat arbitrary, and the best number can depend on the classifier: different algorithms have different optimal numbers of features. So, instead of fixing a number up front, I will use a pipeline to optimize the number of features for each choice of classifier.

- Approach1: Select features and Optimize parameters of classifier
- Approach2: Reduce feature dimensions and Optimize parameters of classifier

## Pipeline Approach1

>approach1 = Pipeline([('selector', SelectKBest()), ('clf', classifier)])

Use Pipeline to construct steps that optimize the number of features (via univariate selection) and the parameters of the classifier simultaneously.

>grid_search = GridSearchCV(approach1, parameters[classifier], scoring='f1')

Select the number of features k and find the best estimator, i.e. the one with the highest F1 score (`scoring='f1'`), using GridSearchCV.

>tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
>tester.main()

Finally, get new_clf, new_dataset, and new_list from the grid_search results and plug the optimized results into tester to get evaluation metrics from cross-validation.
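
In the parameter grids used below, keys such as `selector__k` and `clf__C` follow the pipeline convention of `<step name>__<parameter name>`. The sketch below only mimics that naming convention to show how such keys map onto steps; scikit-learn does the routing itself. The helper `split_pipeline_params` and the example values are hypothetical.

```python
def split_pipeline_params(params):
    """Group 'step__param' keys by pipeline step name."""
    grouped = {}
    for key, value in params.items():
        step, _, name = key.partition("__")
        grouped.setdefault(step, {})[name] = value
    return grouped

# hypothetical best parameters, in the shape a pipeline grid search reports them
best_params = {"selector__k": 10, "clf__C": 1000, "clf__kernel": "linear"}
grouped = split_pipeline_params(best_params)
print(grouped["selector"])
print(sorted(grouped["clf"].items()))
```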

In [94]:
# Create a procedure that takes a feature list and the best estimator from a
# pipeline grid search, and prints cross-validation evaluation metrics using the
# tester.py module (relies on `tester` being imported and `start` set by the caller)
def performance(old_list, grid_result):
    selector = grid_result.named_steps['selector']
    k_features = selector.get_params(deep=True)['k']
    print "Number of features selected: %i" %(k_features)
    # refit the selector on the full data to recover the names of the top-k features
    selected = selector.fit_transform(df_scaled[old_list], df_scaled['poi'])
    scores = zip(old_list, selector.scores_, selector.pvalues_)
    sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
    new_list = list(map(lambda x: x[0], sorted_scores))[0:k_features]
    new_list = ['poi']+ new_list
    new_dataset = df_scaled[new_list].to_dict(orient = 'index')
    new_clf = grid_result.named_steps['clf']
    tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
    tester.main()
    print "\nThis took %.2f seconds\n" %(time() - start)
    print "--------------------------------------------------------"

### Test 9 pairs of Classifier-FeatureList
- Pipeline: Approach1 
- Classifiers: SVC, GaussianNB, and KNeighbors
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [84]:
# build pipeline with selector and clf steps
# and iterate 3 classifiers (svc, gnb, and neigh) with their parameter sets
# and iterate 3 feature lists:  1. features_list, 2. total_features, 3. rel_total_features
from sklearn.pipeline import Pipeline

# declare parameter grids, keyed by classifier instance
parameters = {svc: {'selector__k':[19, 15, 10, 7], \
                     'clf__kernel': ['rbf', 'linear', 'poly'], \
                     'clf__C': [0.1, 1, 10, 100, 1000], \
                     'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001], \
                     'clf__class_weight': ['balanced', None]}, \
              gnb: {'selector__k':[19, 15, 10, 7]}, \
              neigh: {'selector__k':[19, 15, 10, 7], \
                      'clf__n_neighbors': [5, 8, 10, 15], \
                      'clf__weights' : ['uniform','distance'], \
                      'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'], \
                      'clf__metric' : ['euclidean', 'manhattan', 'minkowski']}}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    for classifier in parameters:
        approach1 = Pipeline([('selector', SelectKBest()), \
                      ('clf', classifier)])
        grid_search = GridSearchCV(approach1, parameters[classifier], scoring='f1')
        start = time()
        gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
        performance(features, gird_result)
    print "========================================================"
    num += 1
    

1
Number of features selected: 19
SVC(C=1000, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.77407	Precision: 0.30660	Recall: 0.55050	F1: 0.39385	F2: 0.47494
	Total predictions: 15000	True positives: 1101	False positives: 2490	False negatives:  899	True negatives: 10510


This took 108.90 seconds

--------------------------------------------------------
Number of features selected: 10
GaussianNB(priors=None)
	Accuracy: 0.83753	Precision: 0.36829	Recall: 0.30550	F1: 0.33397	F2: 0.31629
	Total predictions: 15000	True positives:  611	False positives: 1048	False negatives: 1389	True negatives: 11952


This took 1.04 seconds

--------------------------------------------------------
Number of features selected: 19
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=N

### Test DecisionTree with 3 FeatureLists
- Pipeline: Approach1 
- Classifier: DecisionTree
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [99]:
approach1 = Pipeline([('selector', SelectKBest()), \
                      ('clf', dt)])

parameters = {'selector__k':[19, 15, 10, 7], \
              'clf__criterion': ['gini', 'entropy'], \
              'clf__splitter': ['best', 'random'], \
              'clf__max_features': [0.5, 'auto', 'log2', None], \
              'clf__max_depth': [3, 5, 10, None], \
              'clf__min_samples_leaf': [5, 4, 3, 2, 1], \
              'clf__class_weight': ['balanced', None]}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach1, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance(features, gird_result)
    num += 1
 

1
Number of features selected: 7
DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random')
	Accuracy: 0.72360	Precision: 0.23493	Recall: 0.47550	F1: 0.31448	F2: 0.39467
	Total predictions: 15000	True positives:  951	False positives: 3097	False negatives: 1049	True negatives: 9903


This took 71.90 seconds

--------------------------------------------------------
2
Number of features selected: 19
DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=5,
            max_features='log2', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=4,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.77820	Pre

### Test RandomForest with 3 FeatureLists
- Pipeline: Approach1 
- Classifier: RandomForest
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [101]:
approach1 = Pipeline([('selector', SelectKBest()), \
                      ('clf', rdf)])

parameters = {'selector__k':[19, 15, 10, 7]}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach1, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance(features, gird_result)
    num += 1
 

1
Number of features selected: 15
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
	Accuracy: 0.85547	Precision: 0.36751	Recall: 0.11650	F1: 0.17692	F2: 0.13493
	Total predictions: 15000	True positives:  233	False positives:  401	False negatives: 1767	True negatives: 12599


This took 36.98 seconds

--------------------------------------------------------
2
Number of features selected: 7
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
        

### Test AdaBoost with 3 FeatureLists
- Pipeline: Approach1 
- Classifier: AdaBoost
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [102]:
approach1 = Pipeline([('selector', SelectKBest()), \
                      ('clf', adb)])

parameters = {'selector__k':[19, 15, 10, 7]}
  
num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach1, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance(features, gird_result)
    num += 1

1
Number of features selected: 15
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.85040	Precision: 0.41528	Recall: 0.29900	F1: 0.34767	F2: 0.31674
	Total predictions: 15000	True positives:  598	False positives:  842	False negatives: 1402	True negatives: 12158


This took 118.86 seconds

--------------------------------------------------------
2
Number of features selected: 7
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.82953	Precision: 0.33491	Recall: 0.28250	F1: 0.30648	F2: 0.29163
	Total predictions: 15000	True positives:  565	False positives: 1122	False negatives: 1435	True negatives: 11878


This took 108.43 seconds

--------------------------------------------------------
3
Number of features selected: 7
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_esti

## Pipeline Approach2

>approach2 = Pipeline([('reducer', PCA()), ('clf', classifier)])

Build a Pipeline that chains a dimensionality reduction step (PCA) with a classifier, so that the number of principal components and the classifier parameters can be optimized together.

>grid_search = GridSearchCV(approach2, parameters[classifier], scoring='f1')

Use GridSearchCV to select the number of components and find the estimator with the highest F1 score.

>tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
>tester.main()

Finally, get new_clf, new_dataset (PCA-transformed data), and new_list (PCA dimensions) from the grid_search results and plug them into tester to get the evaluation metrics from cross-validation.
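
The first two steps above can be sketched end to end on synthetic data. This is a minimal sketch, not the notebook's code: it assumes Python 3 and a recent scikit-learn (where `GridSearchCV` lives in `sklearn.model_selection`), and `X_demo`, `y_demo`, `approach2_demo`, and the small `param_grid_demo` are made-up names and values for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in for df_scaled: 60 samples, 6 features, imbalanced labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 6)
y_demo = np.array([1] * 12 + [0] * 48)
X_demo[y_demo == 1] += 0.5  # shift the positive class so there is signal to find

# Same two-step structure as approach2: a PCA reducer followed by a classifier.
approach2_demo = Pipeline([('reducer', PCA()), ('clf', SVC())])

# Step-prefixed names route each parameter to the right pipeline step.
param_grid_demo = {'reducer__n_components': [1, 2, 3],
                   'clf__C': [1, 10],
                   'clf__class_weight': ['balanced', None]}

search = GridSearchCV(approach2_demo, param_grid_demo, scoring='f1', cv=3)
search.fit(X_demo, y_demo)

best = search.best_estimator_
n_components = best.named_steps['reducer'].n_components
print("Number of components:", n_components)
print("Best cross-validated F1: %.3f" % search.best_score_)
```

Because the reducer and the classifier are tuned inside one search, the number of components is chosen for the classifier it will actually feed, rather than being fixed beforehand.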

### Test 9 pairs of Classifier-FeatureList
- Pipeline: Approach2
- Classifiers: SVC, GaussianNB, and KNeighbors
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [106]:
# Create a procedure that takes a feature list and the result of the pipeline grid search
# and returns cross-validation evaluation metrics using the tester.py module
def performance_w_pca(old_list, grid_result):
    # use the grid_result argument rather than a global variable
    reducer = grid_result.named_steps['reducer']
    n_components = reducer.get_params(deep=True)['n_components']
    print "Number of component: %i" %(n_components)
    # re-fit PCA on the scaled data and keep the transformed components as the new features
    reduced = pd.DataFrame(reducer.fit_transform(df_scaled[old_list]), index=df_scaled.index)
    new_list = list(reduced.columns)
    new_list = ['poi']+ new_list
    reduced.insert(0, 'poi', df_scaled.poi)
    new_dataset = reduced.to_dict(orient = 'index') 
    new_clf = grid_result.named_steps['clf']
    tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
    tester.main()
    # note: start is a global set by the caller just before the grid search
    print "\nThis took %.2f seconds\n" %(time() - start)
    print "--------------------------------------------------------"

In [122]:
# build pipeline with reducer and clf steps
# and iterate 3 classifiers (svc, gnb, and neigh) with their parameter sets
# and iterate 3 feature lists:  1. features_list, 2. total_features, 3. rel_total_features
from sklearn.decomposition import PCA

parameters = {svc: {'reducer__n_components':[1, 2, 3, 5, 7, 10], \
                    'clf__kernel': ['rbf', 'linear', 'poly'], \
                    'clf__C': [0.1, 1, 10, 100, 1000], \
                    'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001], \
                    'clf__class_weight': ['balanced', None]}, \
              gnb: {'reducer__n_components':[1, 2, 3, 5, 7, 10]}, \
              neigh: {'reducer__n_components':[1, 2, 3, 5, 7, 10], \
                      'clf__n_neighbors': [5, 8, 10, 15], \
                      'clf__weights' : ['uniform','distance'], \
                      'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'], \
                      'clf__metric' : ['euclidean', 'manhattan', 'minkowski']}}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    for classifier in parameters:
        approach2 = Pipeline([('reducer', PCA()), \
                      ('clf', classifier)])
        grid_search = GridSearchCV(approach2, parameters[classifier], scoring='f1')
        start = time()
        gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
        performance_w_pca(features, gird_result)
    print "========================================================"
    num += 1

1
Number of component: 1
SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.73587	Precision: 0.31178	Recall: 0.81250	F1: 0.45064	F2: 0.61497
	Total predictions: 15000	True positives: 1625	False positives: 3587	False negatives:  375	True negatives: 9413


This took 31.94 seconds

--------------------------------------------------------
Number of component: 7
GaussianNB(priors=None)
	Accuracy: 0.83107	Precision: 0.38259	Recall: 0.43500	F1: 0.40711	F2: 0.42340
	Total predictions: 15000	True positives:  870	False positives: 1404	False negatives: 1130	True negatives: 11596


This took 1.15 seconds

--------------------------------------------------------
Number of component: 1
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=1, n_neighbors=5, p

### Test DecisionTree with 3 FeatureLists
- Pipeline: Approach2 
- Classifier: DecisionTree
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [112]:
approach2 = Pipeline([('reducer', PCA()), ('clf', dt)])

parameters = {'reducer__n_components':[1, 2, 3, 5, 7, 10], \
              'clf__criterion': ['gini', 'entropy'], \
              'clf__splitter': ['best', 'random'], \
              'clf__max_features': [0.5, 'auto', 'log2', None], \
              'clf__max_depth': [3, 5, 10, None], \
              'clf__min_samples_leaf': [5, 4, 3, 2, 1], \
              'clf__class_weight': ['balanced', None]}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach2, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance_w_pca(features, gird_result)
    num += 1
 

1
Number of component: 3
DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random')
	Accuracy: 0.68093	Precision: 0.23992	Recall: 0.64250	F1: 0.34937	F2: 0.48106
	Total predictions: 15000	True positives: 1285	False positives: 4071	False negatives:  715	True negatives: 8929


This took 124.41 seconds

--------------------------------------------------------
2
Number of component: 10
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=None, max_features=0.5, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random')
	Accuracy: 0.80047	Precision: 0.233

### Test RandomForest with 3 FeatureLists
- Pipeline: Approach2
- Classifier: RandomForest
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [113]:
approach2 = Pipeline([('reducer', PCA()),('clf', rdf)])

parameters = {'reducer__n_components':[1, 2, 3, 5, 7, 10]}

num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach2, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance_w_pca(features, gird_result)
    num += 1

1
Number of component: 1
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
	Accuracy: 0.82613	Precision: 0.28889	Recall: 0.20800	F1: 0.24186	F2: 0.22034
	Total predictions: 15000	True positives:  416	False positives: 1024	False negatives: 1584	True negatives: 11976


This took 35.37 seconds

--------------------------------------------------------
2
Number of component: 1
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=

### Test AdaBoost with 3 FeatureLists
- Pipeline: Approach2
- Classifier: AdaBoost
- FeatureLists:  1. features_list, 2. total_features, 3. rel_total_features

In [114]:
approach2 = Pipeline([('reducer', PCA()),('clf', adb)])

parameters = {'reducer__n_components':[1, 2, 3, 5, 7, 10]}
  
num = 1
for features in [features_list, total_features, rel_total_features]:
    print num
    grid_search = GridSearchCV(approach2, parameters, scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[features], df_scaled['poi']).best_estimator_
    performance_w_pca(features, gird_result)
    num += 1

1
Number of component: 1
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.82207	Precision: 0.26592	Recall: 0.19000	F1: 0.22164	F2: 0.20151
	Total predictions: 15000	True positives:  380	False positives: 1049	False negatives: 1620	True negatives: 11951


This took 104.41 seconds

--------------------------------------------------------
2
Number of component: 3
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.80947	Precision: 0.19003	Recall: 0.13150	F1: 0.15544	F2: 0.14013
	Total predictions: 15000	True positives:  263	False positives: 1121	False negatives: 1737	True negatives: 11879


This took 114.21 seconds

--------------------------------------------------------
3
Number of component: 10
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=

## Summary of Pipeline Results

| Pipeline | Classifier | FeatureList | Accuracy | Precision | Recall | F1 | F2 | Seconds |
|-----------|--------------|--------------------|----------|-----------|-----------|-----------|-----------|---------|
| Approach1 | SVC | features_list | 0.774 | 0.307 | **0.551** | 0.394 | **0.475** | 108.9 |
|  |  | total_features | 0.701 | 0.263 | **0.690** | 0.381 | 0.521 | 77.53 |
|  |  | rel_total_features | 0.854 | **0.595** | **0.619** | **0.607** | **0.614** | 80.02 |
|  | GaussianNB | features_list | 0.838 | 0.368 | 0.306 | 0.334 | 0.316 | 1.04 |
|  |  | total_features | 0.840 | 0.377 | 0.311 | 0.341 | 0.322 | 1.28 |
|  |  | rel_total_features | 0.469 | 0.148 | **0.625** | 0.239 | 0.379 | 1.09 |
|  | Kneighbors | features_list | 0.861 | 0.348 | 0.046 | 0.081 | 0.056 | 13.92 |
|  |  | total_features | 0.868 | **0.528** | 0.095 | 0.161 | 0.114 | 14.12 |
|  |  | rel_total_features | 0.812 | 0.399 | 0.067 | 0.115 | 0.080 | 15.75 |
|  | DecisionTree | features_list | 0.724 | 0.235 | 0.476 | 0.314 | 0.395 | 71.9 |
|  |  | total_features | 0.778 | 0.318 | **0.579** | **0.410** | **0.497** | 72.44 |
|  |  | rel_total_features | 0.838 | 0.335 | 0.220 | 0.266 | 0.236 | 73.66 |
|  | RandomForest | features_list | 0.855 | 0.368 | 0.117 | 0.177 | 0.135 | 36.98 |
|  |  | total_features | 0.862 | 0.458 | 0.172 | 0.250 | 0.196 | 37.75 |
|  |  | rel_total_features | 0.814 | 0.305 | 0.093 | 0.142 | 0.107 | 37.32 |
|  | AdaBoost | features_list | 0.850 | 0.415 | 0.299 | 0.348 | 0.317 | 118.86 |
|  |  | total_features | 0.830 | 0.335 | 0.283 | 0.306 | 0.292 | 108.43 |
|  |  | rel_total_features | 0.771 | 0.306 | 0.204 | 0.244 | 0.218 | 104.48 |
| Approach2 | SVC | features_list | 0.736 | 0.312 | **0.813** | **0.451** | **0.615** | 31.94 |
|  |  | total_features | 0.711 | 0.281 | **0.747** | **0.408** | **0.560** | 31.85 |
|  |  | rel_total_features | 0.741 | 0.310 | **0.764** | **0.441** | **0.591** | 34.72 |
|  | GaussianNB | features_list | 0.819 | 0.350 | 0.415 | 0.379 | 0.400 | 1.09 |
|  |  | total_features | 0.831 | 0.383 | 0.435 | **0.407** | **0.423** | 1.11 |
|  |  | rel_total_features | 0.754 | 0.182 | 0.242 | 0.208 | 0.227 | 1.15 |
|  | Kneighbors | features_list | 0.868 | **0.513** | 0.195 | 0.282 | 0.222 | 20.62 |
|  |  | total_features | 0.847 | 0.025 | 0.004 | 0.007 | 0.005 | 21.2 |
|  |  | rel_total_features | 0.847 | 0.029 | 0.005 | 0.008 | 0.005 | 22.48 |
|  | DecisionTree | features_list | 0.681 | 0.240 | **0.643** | 0.349 | **0.481** | 124.41 |
|  |  | total_features | 0.800 | 0.233 | 0.217 | 0.225 | 0.220 | 125.89 |
|  |  | rel_total_features | 0.824 | 0.258 | 0.172 | 0.206 | 0.184 | 117.21 |
|  | RandomForest | features_list | 0.826 | 0.289 | 0.208 | 0.242 | 0.220 | 35.37 |
|  |  | total_features | 0.823 | 0.298 | 0.241 | 0.267 | 0.251 | 34.91 |
|  |  | rel_total_features | 0.841 | 0.114 | 0.028 | 0.045 | 0.033 | 35.96 |
|  | AdaBoost | features_list | 0.822 | 0.266 | 0.190 | 0.222 | 0.202 | 104.41 |
|  |  | total_features | 0.809 | 0.190 | 0.132 | 0.155 | 0.140 | 114.21 |
|  |  | rel_total_features | 0.802 | 0.172 | 0.127 | 0.146 | 0.134 | 113.89 |

In [124]:
selected7 = selector.fit_transform(df_scaled[rel_total_features], df_scaled['poi'])
scores = zip(rel_total_features, selector.scores_, selector.pvalues_)
sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
print"features with F-value & p-value:"
n=0
while (n < len(sorted_scores)):
    print n+1, sorted_scores[n]
    n +=1

features with F-value & p-value:
1 ('rel_bonus', 20.988768488080161, 1.002166059752133e-05)
2 ('fraction_to_poi', 16.641707070468989, 7.4941540250267645e-05)
3 ('rel_long_term_incentive', 14.014032672700869, 0.00026283167217943732)
4 ('shared_receipt_with_poi', 8.7464855321290802, 0.0036344020243633686)
5 ('fraction_poi', 5.5185055438125357, 0.020194477662584531)
6 ('rel_loan_advances', 5.396395592254871, 0.021598722340364536)
7 ('from_poi_to_this_person', 5.3449415231473347, 0.022220727960811395)
8 ('fraction_from_poi', 3.2107619169667667, 0.075284900599149329)
9 ('rel_salary', 2.7730011744152487, 0.098071056290785164)
10 ('from_this_person_to_poi', 2.4265081272428799, 0.12152433983710857)
11 ('to_messages', 1.6988243485808538, 0.19455111487450777)
12 ('rel_deferral_payments', 1.3381166890229022, 0.24930866438997767)
13 ('rel_restricted_stock', 1.1488763692954786, 0.28560279187395243)
14 ('rel_restricted_stock_deferred', 0.75851839176838753, 0.38526259639292249)
15 ('rel_other', 0.717

In [125]:
finalized_features_list = list(map(lambda x: x[0], sorted_scores))[0:7]
print(finalized_features_list)

['rel_bonus', 'fraction_to_poi', 'rel_long_term_incentive', 'shared_receipt_with_poi', 'fraction_poi', 'rel_loan_advances', 'from_poi_to_this_person']


- rel_bonus: a feature biased by a disproportionate distribution of "NaN" values
- rel_loan_advances: the feature with the most "NaN" values

## What about using email_features?

In [121]:


parameters = {svc: {'selector__k':[8, 7, 6, 5], \
                     'clf__kernel': ['rbf', 'linear', 'poly'], \
                     'clf__C': [0.1, 1, 10, 100, 1000], \
                     'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001], \
                     'clf__class_weight': ['balanced', None]}, \
              gnb: {'selector__k':[8, 7, 6, 5]}, \
              neigh: {'selector__k':[8, 7, 6, 5], \
                      'clf__n_neighbors': [5, 8, 10, 15], \
                      'clf__weights' : ['uniform','distance'], \
                      'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'], \
                      'clf__metric' : ['euclidean', 'manhattan', 'minkowski']}, \
              dt: {'selector__k':[8, 7, 6, 5], \
                  'clf__criterion': ['gini', 'entropy'], \
                  'clf__splitter': ['best', 'random'], \
                  'clf__max_features': [0.5, 'auto', 'log2', None], \
                  'clf__max_depth': [3, 5, 10, None], \
                  'clf__min_samples_leaf': [5, 4, 3, 2, 1], \
                  'clf__class_weight': ['balanced', None]}}


for classifier in parameters:
    approach1 = Pipeline([('selector', SelectKBest()), \
                      ('clf', classifier)])
    grid_search = GridSearchCV(approach1, parameters[classifier], scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[email_features], df_scaled['poi']).best_estimator_
    performance(email_features, gird_result)


Number of features selected: 5
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.86311	Precision: 0.39219	Recall: 0.42200	F1: 0.40655	F2: 0.41568
	Total predictions: 9000	True positives:  422	False positives:  654	False negatives:  578	True negatives: 7346


This took 67.80 seconds

--------------------------------------------------------
Number of features selected: 5
SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.83522	Precision: 0.37134	Recall: 0.69700	F1: 0.48453	F2: 0.59299
	Total predictions: 9000	True positives:  697	Fal

In [123]:
parameters = {svc: {'reducer__n_components':[1, 2, 3, 4, 5], \
                    'clf__kernel': ['rbf', 'linear', 'poly'], \
                    'clf__C': [0.1, 1, 10, 100, 1000], \
                    'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001], \
                    'clf__class_weight': ['balanced', None]}, \
              gnb: {'reducer__n_components':[1, 2, 3, 4, 5]}, \
              neigh: {'reducer__n_components':[1, 2, 3, 4, 5], \
                      'clf__n_neighbors': [5, 8, 10, 15], \
                      'clf__weights' : ['uniform','distance'], \
                      'clf__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'], \
                      'clf__metric' : ['euclidean', 'manhattan', 'minkowski']}, \
              dt: {'reducer__n_components':[1, 2, 3, 4, 5], \
                   'clf__criterion': ['gini', 'entropy'], \
                   'clf__splitter': ['best', 'random'], \
                   'clf__max_features': [0.5, 'auto', 'log2', None], \
                   'clf__max_depth': [3, 5, 10, None], \
                   'clf__min_samples_leaf': [5, 4, 3, 2, 1], \
                   'clf__class_weight': ['balanced', None]}}


for classifier in parameters:
    approach2 = Pipeline([('reducer', PCA()), ('clf', classifier)])
    grid_search = GridSearchCV(approach2, parameters[classifier], scoring='f1')
    start = time()
    gird_result = grid_search.fit(df_scaled[email_features], df_scaled['poi']).best_estimator_
    performance_w_pca(email_features, gird_result)


Number of component: 4
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.82940	Precision: 0.32254	Recall: 0.25400	F1: 0.28420	F2: 0.26527
	Total predictions: 15000	True positives:  508	False positives: 1067	False negatives: 1492	True negatives: 11933


This took 102.93 seconds

--------------------------------------------------------
Number of component: 2
SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.76973	Precision: 0.32094	Recall: 0.65150	F1: 0.43003	F2: 0.54022
	Total predictions: 15000	True positives: 1303	False pos

## Summary of Pipeline Results on email_features

| Pipeline | Classifier | FeatureList | Accuracy | Precision | Recall | F1 | F2 | Seconds |
|-----------|--------------|----------------|----------|-----------|--------|-------|-------|---------|
| Approach1 | SVC | email_features | 0.835 | 0.371 | **0.697** | **0.485** | **0.593** | 18.3 |
|  | GaussianNB | email_features | 0.840 | 0.202 | 0.149 | 0.172 | 0.157 | 0.91 |
|  | Kneighbors | email_features | 0.858 | 0.311 | 0.230 | 0.265 | 0.243 | 12.93 |
|  | DecisionTree | email_features | 0.863 | 0.392 | 0.422 | **0.407** | **0.416** | 67.8 |
| Approach2 | SVC | email_features | 0.770 | 0.321 | **0.652** | **0.430** | **0.540** | 25.53 |
|  | GaussianNB | email_features | 0.783 | 0.105 | 0.084 | 0.093 | 0.087 | 1.1 |
|  | Kneighbors | email_features | 0.845 | 0.289 | 0.112 | 0.161 | 0.127 | 17.84 |
|  | DecisionTree | email_features | 0.829 | 0.323 | 0.254 | 0.284 | 0.265 | 102.93 |

The evaluation metric scores on email_features are relatively higher than those on total_features but do not exceed those on rel_total_features with the SVC classifier.

# Part5. Algorithm Selection

## Best Feature-Classifier Combination

- Pipeline: Approach1
- Classifier: SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
- FeatureLists:  ['rel_bonus', 'fraction_to_poi', 'rel_long_term_incentive', 'shared_receipt_with_poi', 'fraction_poi', 'rel_loan_advances', 'from_poi_to_this_person']
- Result: 

| Accuracy: | Precision: | Recall: | F1: | F2: |
|:---------:|:----------:|:-------:|:-----:|:-----:|
| 0.854 | 0.595 | 0.619 | 0.607 | 0.614 |

| Total predictions: | TRUE positives: | FALSE positives: | FALSE negatives: | TRUE negatives: |
|:------------------:|:---------------:|:----------------:|:----------------:|:---------------:|
| 11000 | 1237 | 842 | 763 | 8158 |

### Finalize my_classifier.pkl, my_dataset.pkl, my_feature_list.pkl

In [128]:
new_list = ['poi']+ finalized_features_list
new_dataset = df_scaled[new_list].to_dict(orient = 'index')  
new_clf = SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0, \
              decision_function_shape=None, degree=3, gamma=1, kernel='poly', \
              max_iter=-1, probability=False, random_state=None, \
              shrinking=True, tol=0.001, verbose=False)
    
tester.dump_classifier_and_data(new_clf, new_dataset, new_list)
tester.main()

SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.85409	Precision: 0.59500	Recall: 0.61850	F1: 0.60652	F2: 0.61365
	Total predictions: 11000	True positives: 1237	False positives:  842	False negatives:  763	True negatives: 8158



##  Q5-1: did I have to do any scaling?
Yes. I used MinMaxScaler to rescale the financial features (in $) and the email features (in counts) to a common range of 0-1 so that they are equally weighted. Here, the scaled dataset is "df_scaled."
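
To illustrate what min-max scaling does, the sketch below rescales two made-up columns with plain Python. `min_max_scale` is a hypothetical helper written for this note, not scikit-learn's `MinMaxScaler`, although it applies the same per-feature formula `(x - min) / (max - min)`. The sample values are invented.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

# Made-up values: dollar amounts and email counts live on very different scales.
salary_demo = [200000, 450000, 1000000]
to_messages_demo = [50, 300, 550]

print(min_max_scale(salary_demo))       # [0.0, 0.3125, 1.0]
print(min_max_scale(to_messages_demo))  # [0.0, 0.5, 1.0]
```

After scaling, a distance-based or kernel-based classifier such as KNeighbors or SVC no longer lets the dollar-valued features dominate the count-valued ones simply because of their units.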

## Q5-2: what features did I end up using in my POI identifier?
finalized_features_list = ['rel_bonus', 'fraction_to_poi', 'rel_long_term_incentive', 'shared_receipt_with_poi', 'fraction_poi', 'rel_loan_advances', 'from_poi_to_this_person']

## Q5-3: what selection process did I use to pick features?
I used SelectKBest() with the f_classif score function to select 7 of the 20 features in rel_total_features. I chose f_classif because the F-value can be used for continuous data, while chi-square is more appropriate for categorical data. The number 7 resulted from the grid search in the pipeline.

>approach1 = Pipeline([('selector', SelectKBest()), ('clf', SVC())])

>parameters = \{'selector\__k':[19, 15, 10, 7], 
                     'clf\__kernel': ['rbf', 'linear', 'poly'], 
                     'clf\__C': [0.1, 1, 10, 100, 1000], 
                     'clf\__gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
                     'clf\__class_weight': ['balanced', None]\}

>grid_search = GridSearchCV(approach1, parameters, scoring='f1')
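
As a small illustration of the selector step on its own, the sketch below ranks features by their ANOVA F-value and keeps the top k. It assumes Python 3 with NumPy and a recent scikit-learn; the data, the names in `feature_names_demo`, and `k=2` are made up and are not taken from the Enron dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 60 samples, 5 features, 15 positives.
rng = np.random.RandomState(42)
y_demo = np.array([1] * 15 + [0] * 45)
X_demo = rng.rand(60, 5)
X_demo[y_demo == 1, 0] += 1.0   # feature 0: strongly related to the label
X_demo[y_demo == 1, 1] += 0.3   # feature 1: weakly related to the label
feature_names_demo = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical names

# Score every feature with f_classif and keep the k highest-scoring ones.
selector_demo = SelectKBest(score_func=f_classif, k=2)
X_selected = selector_demo.fit_transform(X_demo, y_demo)

# Rank features by F-value, highest first, alongside their p-values.
ranked = sorted(zip(feature_names_demo, selector_demo.scores_, selector_demo.pvalues_),
                key=lambda item: item[1], reverse=True)
for name, f_value, p_value in ranked:
    print("%s  F=%.2f  p=%.4f" % (name, f_value, p_value))

# get_support() returns a boolean mask over the original columns.
kept = [name for name, keep in zip(feature_names_demo, selector_demo.get_support()) if keep]
print("Selected:", kept)
```

Inside a pipeline, `k` becomes the tunable `selector__k` parameter, which is how the grid search can pick the number of features together with the classifier settings.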

I also tried PCA to reduce feature dimensions. With PCA, the recall scores of the SVC classifiers improved from about 0.6 to about 0.8, but the precision scores decreased from about 0.5 to about 0.3, and the F1 and F2 scores remained similar to or a little lower than those from the SelectKBest method.

## Q5-4: what algorithm did I end up using? 

Support vector machine classifier with degree 3 polynomial kernel.

SVC(C=100, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


## Q5-5: what other one(s) did I try?

I also tried GaussianNB, KNeighbors, DecisionTrees, RandomForest, and AdaBoost (boosted decision trees).


## Q5-6: how did model performance differ between algorithms?

I expected KNeighbors and DecisionTrees to perform poorly on this dataset because they tend to overfit small datasets and become biased with imbalanced classes. RandomForest and AdaBoost turned out to be the weakest algorithms for this task. Their training and grid search were very slow, so I had to start with default parameters. Optimizing RandomForest and AdaBoost for this dataset seemed to require a large amount of time. I did not investigate them further because their initial scores with default parameters were very low (F1 ranged between 0.1 and 0.3), and I did not think they would exceed the scores I got from SVC.

## Q5-7: how did I tune the parameters of my particular algorithm?

I used an automated parameter search with GridSearchCV. See Q5-3.

## Q5-8: what parameters did I tune?

- C (penalty): 'clf__C': [0.1, 1, 10, **100**, 1000]
- kernel: 'clf__kernel': ['rbf', 'linear', **'poly'**]
- gamma (kernel coefficient): 'clf__gamma': [**1**, 0.1, 0.01, 0.001, 0.0001]
- class_weight: 'clf__class_weight': [**'balanced'**, None]
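
To give a sense of the size of this search, the sketch below counts the candidate combinations in the SVC grid from Q5-3, including the `selector__k` values. It uses only the standard library; `svc_grid`, `names`, and `combinations` are names introduced here for illustration.

```python
import itertools

# The SVC grid from the pipeline, with step-prefixed parameter names.
svc_grid = {
    'selector__k': [19, 15, 10, 7],
    'clf__kernel': ['rbf', 'linear', 'poly'],
    'clf__C': [0.1, 1, 10, 100, 1000],
    'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'clf__class_weight': ['balanced', None],
}

# Every combination of one value per parameter is one candidate model.
names = sorted(svc_grid)
combinations = list(itertools.product(*(svc_grid[name] for name in names)))
print(len(combinations))  # 600

# Show the first candidate as a parameter dictionary.
print(dict(zip(names, combinations[0])))
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is this count multiplied by the number of folds.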

## Q5-9:  how did I validate the algorithm analysis?

I used StratifiedShuffleSplit to validate the algorithms. Because the dataset has imbalanced classes, stratification based on class labels is required to ensure that the relative class frequencies are approximately preserved in each train and test set.

Project reviewers use tester.py to evaluate the project, so it is convenient for me to use a similar validation method. The validation method used in tester.py is shown below.

>cv = StratifiedShuffleSplit(labels, folds=1000, random_state = 42)

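I used random_state = 44 instead, so that my splits differ from the ones the reviewers use and the selected model is less likely to be overfit to one particular set of splits.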
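
The effect of stratification can be checked with a small sketch. It assumes Python 3 with NumPy and a recent scikit-learn, where `StratifiedShuffleSplit` lives in `sklearn.model_selection` and takes `n_splits`, which differs from the older call form quoted above. The labels, the placeholder features, and the split sizes are made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 12.5% positives, mirroring an imbalanced POI / non-POI split.
labels_demo = np.array([1] * 10 + [0] * 70)
features_demo = np.arange(80).reshape(-1, 1)  # placeholder feature column

cv_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=44)

# Record the share of positives in every test set.
test_poi_fractions = []
for train_idx, test_idx in cv_demo.split(features_demo, labels_demo):
    test_poi_fractions.append(labels_demo[test_idx].mean())

print(test_poi_fractions)
```

Every test set holds 16 samples, 2 of them positive, so the 12.5% share is preserved in each split. A plain shuffle split would not guarantee this and could produce test sets with no positives at all.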

## Q5-10: explain an interpretation of the metrics that says something human-understandable about the algorithm’s performance.

The overall performance of the final SVC classifier in identifying POI labels was OK but not excellent. Accuracy counts both correctly predicted POIs and correctly predicted non-POIs, and it looked relatively high because the score is dominated by the non-POI class, which makes up 87.5% of the data points. If everyone in the testing set (which was split by stratifying) were predicted to be non-POI, the accuracy would be as high as 87.5% regardless of any individual's feature values. Thus, accuracy gives little information and can be a meaningless evaluation metric for a dataset with imbalanced classes.
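
The point about accuracy can be made concrete with a few lines of plain Python. The labels below are hypothetical and chosen only to match the 87.5% non-POI share described above; the "classifier" is a stand-in that ignores all features.

```python
# Hypothetical test set: 7 non-POIs (0) for every POI (1), i.e. 87.5% non-POI.
true_labels = ([0] * 7 + [1]) * 10   # 80 people, 10 of them POIs

# A "classifier" that ignores every feature and always answers non-POI.
predicted = [0] * len(true_labels)

# Accuracy: share of predictions that match the true label.
correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
accuracy = correct / float(len(true_labels))

# Recall: share of true POIs that were predicted as POI.
pois_found = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 1)
recall = pois_found / float(sum(true_labels))

print(accuracy)  # 0.875
print(recall)    # 0.0
```

The always-non-POI baseline reaches 87.5% accuracy while finding no POIs at all, which is why precision, recall, and the F-scores are more informative here.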

The precision score showed that about 6 out of 10 people predicted to be POIs were truly POIs, while the rest were false positives. A low precision score is costly in practice because many non-POIs must be investigated to catch a small number of POIs. It also increases the chance that innocent people face legal punishment. The final classifier had the highest precision score among the classifiers I tested, meaning that it would minimize false positives and reduce the cost of investigating non-POIs flagged as POIs.

The recall score showed that only about 6 out of 10 true POIs were identified as POIs, while the rest were missed. If we relied on this classifier, we would catch only 62% of the bad guys and let 38% of them go. Some SVC classifiers with PCA transformation reached recall scores of up to 0.8, but their precision scores dropped to about 0.3 as a trade-off. In some cases, higher recall is valued more than higher precision, but for this project I think that balanced precision and recall scores are more helpful.

F1 (0.607) is the harmonic mean of the precision and recall scores and falls between them, showing that the overall performance of this classifier is OK: neither too bad nor great.
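
The reported scores can be recomputed from the confusion-matrix counts that tester.py printed for the final SVC. The sketch below uses the standard definitions of each metric; `scores_from_counts` is a hypothetical helper written for this note and is not part of tester.py.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and F2 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp)   # of those predicted POI, how many are POI
    recall = tp / float(tp + fn)      # of the true POIs, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    # F-beta with beta = 2 weights recall more heavily than precision.
    beta_sq = 2 ** 2
    f2 = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return accuracy, precision, recall, f1, f2

# Counts reported for the final SVC classifier.
acc, prec, rec, f1, f2 = scores_from_counts(tp=1237, fp=842, fn=763, tn=8158)
print("Accuracy: %.5f Precision: %.5f Recall: %.5f F1: %.5f F2: %.5f"
      % (acc, prec, rec, f1, f2))
# Accuracy: 0.85409 Precision: 0.59500 Recall: 0.61850 F1: 0.60652 F2: 0.61365
```

Because recall is slightly higher than precision for this classifier, F2 (which favors recall) comes out slightly above F1.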
