# Enron Person of Interest Project

## 1. Overview

*Question 1. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it.* 
The project tries to predict the people involved in the Enron scheme, ‘persons of interest’. Machine learning uses information about financial benefits and email communications, trying to figuring out patterns that distinguished the people of interests versus the rest. The dataset given includes individual’s information on financial benefits (payments and stock) involved with the company and email communications. It is obvious that the financial information will be useful for identifying people involved since money is the motive of the scheme. Individuals with especially high rate of communication the people of interests are likely to be one.

This is not a 50-50 bag of labels. There are 18 people of interests out of 144. There is a clear outlier with extreme financial data and this turns out to be the ‘total’ of all the data points, which is not a valid data point itself so I removed it. As for individuals that have extremely large financial values, I still keep the data points because they are people of interest. In terms of features, all features have missing data and some have a lot of missing data. **Therefore, I edited the feature_format.py file so that it properly scales the features using MinMaxScaler and adds an argument replace_median (replace NAs with median values).**

**Note:**
- Since this is a small dataset, I decided to run many grid-searched model. However, if you want to know the end of the movie, I find the model that applies principle component and logistic regression works best.
- This is a project for my Data Science course at Udacity, and the code for feature_format.py(except for a few edits by me) and tester.py is written by them 







In [2]:
#Setting up the environment
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, test_classifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cross_validation import train_test_split

### features_list is a list of strings, each of which is a feature name.
### Task 1: Select what features you'll use.
### The first feature must be "poi".

 
### Load the dictionary containing the / m. 
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
    data = pd.DataFrame.from_dict(data_dict,orient='index', dtype=np.float).drop('email_address',1)

# Remove irrelevant outliers
#data = data[~(data.index == 'TOTAL')]
print 'There are %i pois out of %i samples.' %(sum(data['poi']==1),len(data['poi']))

# Summary of missing values across features:
print data.isnull().sum(axis = 0)

There are 18 pois out of 146 samples.
salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
from_poi_to_this_person       60
dtype: int64


## 2. Feature Engineering

* Question 2. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? *
I ended up keeping the features: "shared_receipt_with_poi","exercised_stock_options", and "from_frac". My strategy of choosing predictors is to run ANOVA F-value for each feature in 1500 times stratified resampling (to take care of possible overfitting) and choosing the ones with the best scores from each category: payment, stock value, and email communication. I narrowed down to some set of features(see below), and the set of these 3 features yields the highest performance. 

I employed MinMaxScaler for SVM algorithm since the financial features have much larger values than email features, we need to scale them otherwise the effect of financial features would be exaggerated in SVM algorithm. Besides, I also edited the feature_format.py file so that it properly scales the features and adds an argument replace_median (replace NAs with median values).

I added 3 new variables:
-	‘from_frac’, ‘to_frac’: proportion of to/ from emails of a person sent from/to a poi. The motivation is that pois may communicate with each other much more frequently. The idea is also stemmed from one of the lesson videos of Udacity. As you will see below, 'from_frac' has the highest performance and turns out to be one of the key features; removing it drives the recall down though slightly increases precision. 'to_frac' is in the feature group that has better scores. However, adding 'to_frac' has near zero effect on recall and drives down the precision.
-	‘total_money’: I added this variable because in case the pois try to hide their financial benefits by diversifying the way they receive the money, then this sum may tell information that the individuals  features don’t. This feature, however, has a middle-range performance score. Adding this feature decreases the model's performance in terms of both precion and recall. 
Please a table below for the effect of these features.


In [3]:
### Create new features: ‘from_frac’, ‘to_frac', and 'total_money':

data = data.fillna(0)

data['to_frac'] = data["from_poi_to_this_person"]/data["to_messages"]
data['from_frac'] = data["from_this_person_to_poi"]/data["from_messages"]
data['total_money'] = data['total_payments'] + data["total_stock_value"]
data = data.fillna(0)

# Convert the dataset to dictionary for later submission:
my_dataset = data.to_dict(orient = 'index')





In [4]:
### Investigate Variable Importance:

select = SelectKBest(k = 'all')

from sklearn.cross_validation import StratifiedShuffleSplit
labels = data['poi']
features = data.drop('poi',axis = 1)

# SelectKBest with multiple resamplings:
from sklearn.cross_validation import StratifiedShuffleSplit
cv = StratifiedShuffleSplit(labels, 1500, random_state = 42)
score = []
pval = []
for train_idx, test_idx in cv: 
    features_train = features.iloc[train_idx,:]
    labels_train = labels[train_idx]
    select.fit(features_train,labels_train)
    score.append(select.scores_)
    pval.append(select.pvalues_)
    
avgscore = np.mean(score, axis = 0)
avgpval = np.mean(pval, axis = 0)


print "Variable Importance Chart"
print pd.DataFrame(
    {'Feature': features.columns,
     'Score': [ '%.2f' % elem for elem in avgscore],
     'P-value': [ '%.2f' % elem for elem in avgpval]
    })


    

Variable Importance Chart
                      Feature P-value  Score
0                      salary    0.85   1.92
1                 to_messages    0.23   1.73
2           deferral_payments    0.64   0.23
3              total_payments    0.52   1.21
4     exercised_stock_options    0.62   2.73
5                       bonus    0.73   2.27
6            restricted_stock    0.78   0.99
7     shared_receipt_with_poi    0.01   8.20
8   restricted_stock_deferred    0.89   0.05
9           total_stock_value    0.66   2.61
10                   expenses    0.79   0.65
11              loan_advances    0.17   2.80
12              from_messages    0.72   0.17
13                      other    0.71   0.52
14    from_this_person_to_poi    0.19   2.49
15              director_fees    0.45   0.64
16            deferred_income    0.62   1.41
17        long_term_incentive    0.81   1.11
18    from_poi_to_this_person    0.04   5.18
19                    to_frac    0.10   3.16
20                  from_frac

I have tried including many features as computation cost is not a problem for this dataset. However, to my surprise, adding features makes the algorithms perform poorer. You will see below that I include all the features that has p-values < 0.3 in set feature_list1 and the same algorithm all perform poorer with this set. A possible explanation is that this dataset has many missing values (which we have filled them with 0 and this is wrong information).


# Model building and Evaluation

*Question 3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?*

The algorithm I employed are a pipeline whose first step is PCA with 1 principle component and the last step is logistic regression. Since this is a small dataset that doesn’t take long to train, I have tried pipeline with PCA with other classifiers: decision tree, k nearest neighbor, random forest. I chose the model of PCA with logistic regression since it has the highest recall and above-30% precision. I would argue that in this problem, we care more about recall than precision. Random forest and decision tree perform decently though inferior to logistic regression. In one model, random forest outperforms logistic regression slightly in precision score but still have a lower recall. K nearest neighbors, to my surprise as I expect 'pois' would have similar behaviors, did not perform very well.



** Model Performance Summary
<pre>
```

| Model         | Features       | Precision | Recall |
|---------------|----------------|-----------|--------|
| K Neighbors   | feature_list2  |  0.24     |  0.14  |
| Logistic Reg  | feature_list2  |  0.32     |  0.89  |** Best performance
| Decision Tree | feature_list2  |  0.31     |  0.65  |
| Random Forest | feature_list2  |  0.34     |  0.74  |
| K Neighbors   | feature_list1  |  0.28     |  0.24  |
| Logistic Reg  | feature_list1  |  0.22     |  0.87  |
| Decision Tree | feature_list1  |  0.18     |  0.82  |
| Random Forest | feature_list1  |  0.26     |  0.75  |
| Logistic Reg  | feature_list21 |  0.27     |  0.77  | *effect of removing 'total_money' from best-performing model
| Logistic Reg  | feature_list21 |  0.30     |  0.89  | *effect of removing 'to_frac' from best-performing model
| Logistic Reg  | feature_list21 |  0.36     |  0.78  | *effect of removing 'from_frac' from best-performing model

```
</pre>

In [None]:
# Sets of features to train on:
feature_list1 = ['to_messages','shared_receipt_with_poi','loan_advances','from_this_person_to_poi',
             'from_poi_to_this_person','to_frac','from_frac'] # features with p-val < 0.3
feature_list2 = ['from_frac','shared_receipt_with_poi','bonus']
feature_list21 = feature_list2 + ['total_money'] # adding 'total_money'
feature_list22 = feature_list2 + ['to_frac'] # adding 'to_frac'
feature_list23 = ['shared_receipt_with_poi','exercised_stock_options'] # removing 'from_frac'


feature_use = feature_list2 #Replace feature set to run on

features = data[feature_use]
features_list = ['poi'] + feature_use


from sklearn.neighbors import KNeighborsClassifier
clf_kn = KNeighborsClassifier()
from sklearn.tree import DecisionTreeClassifier
clf_dectree = DecisionTreeClassifier(class_weight = 'balanced',random_state = 111)
from sklearn.linear_model import LogisticRegression
clf_log = LogisticRegression(class_weight = 'balanced',random_state = 111)
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(class_weight = 'balanced',random_state = 111)


from sklearn.preprocessing import MinMaxScaler
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
pca = PCA()
n_compnt = range(len(feature_use)) + [3]
n_compnt.remove(0)
    
    
pipe_kn = Pipeline(steps = [('pca',pca),('scale',MinMaxScaler()),('kn',clf_kn)])
param_grid_kn = dict(pca__n_components = n_compnt,
                     kn__n_neighbors = [3,5,7,11],
                     kn__weights = ['uniform','distance'])
grid_search_kn = GridSearchCV(pipe_kn, param_grid_kn, scoring = 'recall')
grid_search_kn.fit(features,labels)
test_classifier(grid_search_kn.best_estimator_,my_dataset,features_list)


pipe_log = Pipeline(steps = [('pca',pca),('clf',clf_log)])
param_grid_log = dict(pca__n_components = n_compnt,
                      clf__C = [0.0003,0.001,0.003,0.01,0.03,0.1,0.3,1,3])
grid_search_log = GridSearchCV(pipe_log,param_grid_log,scoring = 'recall')
grid_search_log.fit(features,labels)
test_classifier(grid_search_log.best_estimator_,my_dataset,features_list)

pipe_dectree = Pipeline(steps = [('pca',pca),('clf',clf_dectree)])
param_grid_dectree = dict(pca__n_components = n_compnt,
                          clf__max_depth = [1,2,3,5,7])
grid_search_dectree = GridSearchCV(pipe_dectree,param_grid_dectree,scoring = 'recall')
grid_search_dectree.fit(features,labels)
test_classifier(grid_search_dectree.best_estimator_,my_dataset,features_list)

pipe_rf = Pipeline(steps = [('pca',pca),('clf',clf_rf)])
param_grid_rf = dict(pca__n_components = n_compnt,
                     clf__max_depth = [1,2,3,5,7])
grid_search_rf = GridSearchCV(pipe_rf,param_grid_rf,scoring = 'recall')
grid_search_rf.fit(features,labels)
test_classifier(grid_search_rf.best_estimator_,my_dataset,features_list)




Pipeline(steps=[('pca', PCA(copy=True, n_components=1, whiten=False)), ('scale', MinMaxScaler(copy=True, feature_range=(0, 1))), ('kn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance'))])
	Accuracy: 0.77364	Precision: 0.29885	Recall: 0.18200	F1: 0.22623	F2: 0.19744
	Total predictions: 11000	True positives:  364	False positives:  854	False negatives: 1636	True negatives: 8146

Pipeline(steps=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('clf', LogisticRegression(C=0.0001, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=111,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
	Accuracy: 0.67118	Precision: 0.34132	Recall: 0.86950	F1: 0.49020	F2: 0.66399
	Total predictions: 11000	True positives: 1739	False positives: 3356	Fal

*Question 4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? *

Tuning parameters of an algorithm means selecting the set of parameters to optimize the performance of the algorithm. If tuning parameters isn’t done properly, the model may overfit or underfit.

I used grid search with cross validation to tune the parameters. Since training the dataset takes place quickly, I tuned parameters for all-of-the-above algorithms (but only presented in the code the best three) and select the model that performs the best. I also tune the feature_format function, to see if the result is better if I replace missing values by medians instead of 0s. I find replacing the missing values by 0s yields the best result. 


*5.	What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? *

Validation is evaluating the performance of the model on new data. Without validation, it is very easy to choose an overfitting model that performs very well on known data but poorly on new data. I employ validation in my use of grid search with cross validation and my use of test_classifier of tester.py, which uses stratified shuffle split cross validator.


# 4. Evaluation
*6. Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. 

The precision and recall of my model is 32% and 898% respectively. This means that my algorithm predicts a ‘poi’ accurately 32% of the times; of all people predicted as a ‘poi’, 32% are actually ‘pois’. The recall suggests that my model is able to identify 89% of ‘pois’. 

