# Identify Person of Interest

## Summary

The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure. (source: Wikipedia)

In this project, I build a person-of-interest (POI) identifier using machine learning techniques, based on the financial and email data made public as a result of the Enron scandal. I do not process the emails_by_address data.

Machine learning is used to predict the 'poi' feature. 'poi' is the response variable: it takes the value 1 for a person of interest and 0 otherwise. A person of interest is a person who might have been involved in the fraud that caused the bankruptcy of Enron. This is a classification task. Specifically, I use supervised machine learning, since the dataset is labeled: we know for each data point whether or not it is a POI.

## Source of the data

The source file enron61702insiderpay.pdf provides the financial data and can be found in this GitHub repository.

The Enron email corpus provides the email data. It is not used in this study.

There are missing values in the dataset. When the value of a feature is missing for a given person, it is set to the string "NaN". When the data is transformed into a numpy array, "NaN" is converted to 0 by default.
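As an illustration of this convention, here is a minimal sketch of the "NaN" to 0 replacement on a plain dictionary. The helper `nan_to_zero` and the two-person sample are hypothetical; in the project itself the conversion happens when the data is turned into a numpy array.

```python
def nan_to_zero(data_dict):
    """Return a copy of data_dict with every "NaN" string replaced by 0."""
    cleaned = {}
    for name, features in data_dict.items():
        cleaned[name] = {
            key: (0 if value == "NaN" else value)
            for key, value in features.items()
        }
    return cleaned


# Hypothetical two-person sample in the same shape as the dataset
sample = {
    "PERSON A": {"salary": 200000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 50000, "poi": True},
}
cleaned_sample = nan_to_zero(sample)
print(cleaned_sample["PERSON A"]["bonus"])
```

The original dictionary is left untouched, so the raw "NaN" markers remain available for the missing-value analysis.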

## Setup

In [2]:
# Required Libraries
import sys
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
sys.path.append("./tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, test_classifier
from explore import get_incompletes, display_examples,build_df
from select_features import Select_k_best
from create_new_features import add_fraction_to_dict,create_new_features,drop_features
#from sklearn.cross_validation import train_test_split for previous versions
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

### Versions:
Package|version
--|--
python|2.7.13
scikit-learn|0.18.1

## Data Exploration

### Dataset Study

## Task 1: Select what features you'll use.

I will not use the feature 'email_address'.

-|Feature  | Type | Comment
-|--|--|--
1|bonus|continuous| finance (payment type) feature
2|deferral_payments|continuous| finance (payment type) feature
3|deferred_income|continuous|finance (payment type) feature
4|director_fees|continuous|finance (payment type) feature
5|email_address|nominal|__NOT USED: email (text type) feature__
6|exercised_stock_options|continuous|finance (stock type) feature
7|expenses|continuous|finance (payment type) feature
8|from_messages|continuous|email (number of messages) feature
9|from_poi_to_this_person|continuous|email (number of messages) feature
10|from_this_person_to_poi|continuous|email (number of messages) feature
11|loan_advances|continuous|finance (payment type) feature
12|long_term_incentive|continuous|finance (payment type) feature
13|other|continuous|finance (payment type) feature
14|poi|nominal|the label to identify a person of interest (boolean type)
15|restricted_stock|continuous|finance (stock type) feature
16|restricted_stock_deferred|continuous|finance (stock type) feature
17|salary|continuous|finance (payment type) feature
18|shared_receipt_with_poi|continuous|email (number of messages) feature
19|to_messages|continuous|email (number of messages) feature
20|total_payments|continuous|finance (payment type) feature
21|total_stock_value|continuous|finance (stock type) feature

Here is a summary of the data:

There are initially 146 people in this dataset.

There are initially 21 features per person. The 'poi' feature is the target.

We have a wealth of features but few data points.

There are 18 persons of interest in this dataset. They represent 12.3% of the overall population.

Machine learning algorithms generally work best when the classes are balanced (close to 50% each), but the class distribution in this dataset is imbalanced. To keep the same class proportions in the training and testing datasets, we can use StratifiedShuffleSplit. Another challenge is to find proper metrics for performance evaluation when the classes are imbalanced.
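To make the stratification idea concrete, here is a small sketch using scikit-learn's `StratifiedShuffleSplit` on synthetic labels with the same class counts as the dataset (18 POI out of 146). The placeholder feature matrix and the 30% test size are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with the same class counts as the dataset: 18 POI out of 146
labels = np.array([1] * 18 + [0] * 128)
# Placeholder feature matrix; the values are irrelevant for the split itself
features = np.zeros((len(labels), 1))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels))

# Share of POI in each subset; both should stay close to 18/146
train_ratio = labels[train_idx].mean()
test_ratio = labels[test_idx].mean()
print("POI share in train: %.3f, in test: %.3f" % (train_ratio, test_ratio))
```

Both subsets keep a POI share close to the overall one, which a purely random split does not guarantee on such a small dataset.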

In [3]:
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi" since it is the target.
features_list = ['poi',
                'bonus',
                'deferral_payments',
                'deferred_income',
                'director_fees',
                'exercised_stock_options',
                'expenses',
                'from_messages',
                'from_poi_to_this_person',
                'from_this_person_to_poi',
                'loan_advances',
                'long_term_incentive',
                'other',
                'restricted_stock',
                'restricted_stock_deferred',
                'salary',
                'shared_receipt_with_poi',
                'to_messages',
                'total_payments',
                'total_stock_value'] 

In [4]:
### Load the dictionary containing the dataset
### (binary mode is the safe choice when reading pickle files)
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

### Missing Values

We can assume that if a person did not receive a given type of financial compensation, its value is zero.

My choice is therefore to replace NaN values with zeros.

By the same reasoning, NaN values in email features are replaced with zeros.

Here are the top 5 features with missing values:

-|Feature name | % of missing information (NaN) | Number of non-null values
-|--|--|--
1|loan_advances|97.92| 3
2|director_fees|88.89| 16
3|restricted_stock_deferred|88.19| 17
4|deferral_payments|73.61| 38
5|deferred_income|66.67| 48
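A table of this kind could be produced along the following lines. This is only a sketch: the helper `missing_ratio_by_feature` and the four-person sample are hypothetical and do not reproduce the project's own `explore` module.

```python
def missing_ratio_by_feature(data_dict):
    """Return {feature: percentage of people whose value is "NaN"}."""
    counts = {}
    for features in data_dict.values():
        for key, value in features.items():
            counts.setdefault(key, 0)
            if value == "NaN":
                counts[key] += 1
    total = float(len(data_dict))
    return {key: round(100.0 * n / total, 2) for key, n in counts.items()}


# Hypothetical four-person sample, not the real data
sample = {
    "A": {"loan_advances": "NaN", "salary": 100},
    "B": {"loan_advances": "NaN", "salary": "NaN"},
    "C": {"loan_advances": 5000, "salary": 300},
    "D": {"loan_advances": "NaN", "salary": 400},
}
ratios = missing_ratio_by_feature(sample)

# Sort features from most to least incomplete
for name, pct in sorted(ratios.items(), key=lambda item: -item[1]):
    print(name, pct)
```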

These features do not carry enough information. For this reason, I do not include the first 4 in feature selection. Here are the discarded features:
- loan_advances
- restricted_stock_deferred
- director_fees
- deferral_payments

![Diagram](PearsonCorrelationOfFeatures.png)

__Linear correlations deduced from the Pearson correlation diagram:__

shared_receipt_with_poi and to_messages have a strong positive linear relationship.

exercised_stock_options and total_stock_value have a strong positive linear relationship. This is not a surprise, since exercised stock options are a component of the total stock value.

other and total_payments have a strong positive linear relationship.

exercised_stock_options and restricted_stock have a positive linear relationship.
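For reference, the Pearson coefficient behind such a diagram can be computed as below. This is a plain-Python sketch on made-up values, not the code used to produce the figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for two features that grow together
exercised = [0, 100, 200, 400, 800]
total_stock = [50, 180, 260, 500, 900]
r = pearson(exercised, total_stock)
print("Pearson r = %.3f" % r)
```

A value close to 1 indicates a strong increasing linear relationship, and a value close to -1 a strong decreasing one.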

## Task 2: Remove outliers

### Outlier Investigation

When we list the persons with more than 90% incomplete information, we get the following table.

-|name  | % of incomplete information | poi | bonus| salary|Comment
-|--|--|--|--|--|--
1|WHALEY DAVID A|90.48| false |NaN|NaN|
2|WROBEL BRUCE|90.48| false |NaN|NaN|
3|LOCKHART EUGENE E|100.0| false |NaN|NaN|no information
4|THE TRAVEL AGENCY IN THE PARK|90.48| false |NaN|NaN|not a person obviously
5|GRAMM WENDY L|90.48| false |NaN|NaN|

The person named "LOCKHART EUGENE E" is not an outlier per se, but the record contains no information. It has to be removed from the dataset.

"THE TRAVEL AGENCY IN THE PARK" is clearly not an individual, and more than 90% of its information is missing, so very little is gained from this record. It has to be discarded.

When exploring the data, we can find that a person named "TOTAL" has the highest bonus and the highest salary with the following values:

bonus  =  97,343,619 $

salary = 26,704,229 $

Let's display the graph "Bonus vs Salary". A clear outlier appears.

![Diagram](Outlier_TOTAL.jpg)

The outlier "TOTAL" is caught by looking for the highest bonus and the highest salary.

Its poi value is 0.

The name "TOTAL" is a hint that it is not a person but the sum of the financial features.

It has to be discarded.
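Looking for the highest value of a feature can be sketched as follows. The helper `top_by_feature` and the two regular persons are hypothetical; only the "TOTAL" figures are taken from the text above.

```python
def top_by_feature(data_dict, feature):
    """Return the name with the largest numeric value for the given feature."""
    numeric = {
        name: values[feature]
        for name, values in data_dict.items()
        if values[feature] != "NaN"
    }
    return max(numeric, key=numeric.get)


# Two made-up persons plus the TOTAL row quoted above
sample = {
    "PERSON A": {"salary": 250000, "bonus": 600000},
    "PERSON B": {"salary": "NaN", "bonus": 100000},
    "TOTAL": {"salary": 26704229, "bonus": 97343619},
}
top_salary = top_by_feature(sample, "salary")
top_bonus = top_by_feature(sample, "bonus")
print(top_salary, top_bonus)
```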

Finally, here are the discarded outliers and similar records:
- TOTAL
- THE TRAVEL AGENCY IN THE PARK
- LOCKHART EUGENE E

In [5]:
# removing the outlier called TOTAL
data_dict.pop( "TOTAL", 0 );
# not a person and many missing information above 90%
data_dict.pop( "THE TRAVEL AGENCY IN THE PARK", 0 );
# no information at all for LOCKHART EUGENE E
data_dict.pop( "LOCKHART EUGENE E",0);

In [6]:
df = build_df(data_dict)

After removal, we have the following graph:

![Diagram](bonusVSsalary.png)

total number of data points 143

number of poi:
False    125
True      18
Name: poi, dtype: int64

Percentage of POI  12.5874125874 %

## Task 3: Create new feature(s)

### Direction:
Create new feature(s)

The two created features are ratios of emails sent to or received from POIs. For each person, they show what proportion of their emails was exchanged with POIs, i.e. how intense the exchanges with POIs are regardless of the volume of emails.
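One plausible way to compute such ratios is sketched below. The exact definitions are an assumption inferred from the feature names (perc_to_poi as from_this_person_to_poi over from_messages, perc_from_poi as from_poi_to_this_person over to_messages); the helpers `fraction` and `add_ratio_features` are hypothetical and are not the project's `create_new_features` function.

```python
def fraction(poi_messages, all_messages):
    """Share of messages exchanged with POIs; 0.0 when data is missing."""
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return float(poi_messages) / float(all_messages)


def add_ratio_features(person):
    # Assumed definitions, inferred from the feature names only
    person["perc_to_poi"] = fraction(
        person["from_this_person_to_poi"], person["from_messages"])
    person["perc_from_poi"] = fraction(
        person["from_poi_to_this_person"], person["to_messages"])
    return person


# Made-up message counts for one person
example = add_ratio_features({
    "from_messages": 200, "from_this_person_to_poi": 50,
    "to_messages": 1000, "from_poi_to_this_person": 100,
})
print(example["perc_to_poi"], example["perc_from_poi"])
```

Returning 0.0 for missing counts is consistent with the choice made earlier to treat NaN values as zeros.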

![Diagram](perc_from_poiVSperc_to_poi.png)

In [7]:
# creating new features: perc_to_poi and perc_from_poi
create_new_features(data_dict,df)

In [8]:
# Missing values : do not include these features
drop_features(df)  

## Prepare the data for Machine Learning algorithms

### Feature scaling

Naive Bayes, as well as decision trees and tree-based ensemble methods (Random Forest, XGBoost), are invariant to feature scaling.

Thus, I did not use feature scaling.

Algorithms that are affected by feature scaling include, for example, SVM.

### Selection of features and use of SelectKBest

I also perform feature selection using the percentile and k-best algorithms, to see the top 9 features these algorithms select.

![Diagram](FeatureImportance.png)

After getting rid of correlated features, 13 features are selected in addition to the 'poi' label. Here they are:
- 'poi',
- 'exercised_stock_options',
- 'bonus',
- 'salary',
- 'perc_to_poi',
- 'deferred_income',
- 'long_term_incentive',
- 'total_payments',
- 'shared_receipt_with_poi',
- 'expenses',
- 'from_poi_to_this_person',
- 'perc_from_poi',
- 'from_this_person_to_poi',
- 'from_messages'
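As a rough illustration of k-best selection, the sketch below ranks synthetic features with scikit-learn's `SelectKBest`. The scoring function `f_classif`, the value of `k` and the synthetic data are assumptions; the project's own `Select_k_best` helper may work differently.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(42)
n_samples = 143
labels = np.array([1] * 18 + [0] * 125)

# Synthetic features: the first column depends on the label, the others are noise
informative = labels * 3.0 + rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 3))
features = np.column_stack([informative, noise])
names = ["informative", "noise_1", "noise_2", "noise_3"]

selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features, labels)

# Rank features by their univariate score, highest first
ranked = sorted(zip(names, selector.scores_), key=lambda pair: -pair[1])
for name, score in ranked:
    print("%-12s %.2f" % (name, score))
```

Univariate scores do not account for correlation between features, which is why correlated features still have to be removed separately.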

In [9]:
# to make this notebook's output identical at every run
np.random.seed(42)

In [10]:
# select best features and getting rid of correlated features: 13 are remaining in total
features_list = ['poi',
                'exercised_stock_options',
                'bonus',
                'salary',
                'perc_to_poi',
                'deferred_income',
                'long_term_incentive',
                'total_payments',
                'shared_receipt_with_poi',
                'expenses',
                'from_poi_to_this_person',
                'perc_from_poi',
                'from_this_person_to_poi',
                'from_messages']

In [11]:
### Store to my_dataset for easy export below.
my_dataset = data_dict

In [12]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [13]:
### split data into training and testing datasets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.1, random_state=42)

print("Number train dataset: ", len(features_train))
print("Number test dataset: ", len(features_test))
print("Total number: ", len(features_train)+len(features_test))

('Number train dataset: ', 127)
('Number test dataset: ', 15)
('Total number: ', 142)


## Task 4: Try a variety of classifiers

### Direction:
Try a variety of classifiers

Please name your classifier clf for easy export below.

Note that if you want to do PCA or other multi-stage operations,
you'll need to use Pipelines. For more info:
http://scikit-learn.org/stable/modules/pipeline.html

The process is the following:
1. Instantiate the model
2. Train the model on training data
3. Compute the performance (Accuracy, Precision, Recall, F1) using cross-validation.
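The three steps above can be sketched as follows for Naive Bayes. The data is synthetic, and `cross_validate` assumes a more recent scikit-learn than the 0.18.1 listed earlier. Note also that this sketch averages per-fold scores, which may differ slightly from what tester.py reports.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(42)
labels = np.array([1] * 18 + [0] * 125)
# Synthetic features: one informative column and one noise column
features = np.column_stack([
    labels * 2.0 + rng.normal(size=len(labels)),
    rng.normal(size=len(labels)),
])

# 1. Instantiate the model
clf = GaussianNB()

# 2. and 3. Fit on each training fold and score on the matching test fold
metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)
scores = cross_validate(clf, features, labels, cv=cv, scoring=metrics)

mean_scores = {m: scores["test_" + m].mean() for m in metrics}
for m in metrics:
    print("%-9s %.3f" % (m, mean_scores[m]))
```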

I tried the following algorithms and the results are:

Algorithm|Accuracy|Precision|Recall|F1
--|--|--|--|--
NaiveBayes|0.83993|0.37073|0.28750|0.32385
DecisionTree|0.81713|0.30701|0.29550|0.30115
RandomForest|0.86327|0.44218|0.09750|0.15977
AdaBoost|0.81860|0.31195|0.29900|0.30534

None of them reaches the target of at least 0.3 for both precision and recall.

## Task 5: Tune your classifier to achieve better than .3 precision and recall

#### Direction:
Tune your classifier to achieve better than .3 precision and recall 
using our testing script. Check the tester.py script in the final project
folder for details on the evaluation method, especially the test_classifier
function. 

Because of the small size of the dataset, the script uses
stratified shuffle split cross validation. 
For more info: 
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

### Tuning

Tuning consists in finding the algorithm settings that yield a more accurate machine learning model on both the training set and the testing set.

Model parameters are learned during training (e.g. the coefficients in a linear regression) and should not be confused with hyperparameters.

Hyperparameters are set by the data analyst before training the model.

When there are many hyperparameters, manual tuning can be tedious. In this case, it is more efficient, in cost and time, to use an automatic search such as GridSearchCV.

It is important to fine-tune the model since we want the best performance (e.g. minimal errors, best accuracy, best precision, and so on).

I selected F1 for performance scoring since it balances recall and precision.

When tuning one algorithm fails, there is the option of switching to another algorithm.

__Naive Bayes tuning:__

There is no hyperparameter tuning for Naive Bayes, since the algorithm exposes essentially no hyperparameters to tune.

To improve the performance, we can only act on the dataset (get more data) and on the number of features.

I experimented with the number of features.

I finally selected one feature: exercised_stock_options.

The process followed is:
1. Set the feature list to this single feature
2. Instantiate the Naive Bayes model
3. Train the Naive Bayes model with the training data
4. Evaluate the final Naive Bayes model


__Decision Tree tuning:__

The Decision Tree algorithm has a lot of hyperparameters; I decided to tune the following ones:
'criterion', 'max_depth', 'min_samples_split', 'max_features'.

I also tried a smaller set of features to further improve the performance. The final features were selected using the feature importances from the previous run of the Decision Tree model.

Features giving the best results: 'exercised_stock_options','salary','perc_to_poi','deferred_income','total_payments','shared_receipt_with_poi','expenses','from_poi_to_this_person','from_this_person_to_poi'

The process followed is:
1. Set the feature list to the list above
2. Tune the following hyperparameters: criterion, max_depth, min_samples_split, max_features
3. Use GridSearchCV with F1 as scoring
4. Reuse the best estimator
5. Evaluate the final Decision Tree model
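The steps above can be sketched as follows. The grid values and the synthetic data are assumptions chosen for illustration; the actual grid used in the project is not shown in this report.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
labels = np.array([1] * 18 + [0] * 125)
# Synthetic features: one informative column and two noise columns
features = np.column_stack([
    labels * 2.0 + rng.normal(size=len(labels)),
    rng.normal(size=len(labels)),
    rng.normal(size=len(labels)),
])

# Hypothetical grid over the four hyperparameters named above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 5, 10],
    "max_features": [1, 2, 3],
}

# Stratified splits keep the POI share stable across folds
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=cv)
search.fit(features, labels)

# The best estimator is refit on all the data with the best settings
clf = search.best_estimator_
print(search.best_params_)
print("best mean F1: %.3f" % search.best_score_)
```

The same pattern applies to Random Forest and AdaBoost by swapping the estimator and the grid.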

__Random Forest tuning:__

The Random Forest algorithm has a wealth of hyperparameters; I decided to tune the following ones: 'criterion', 'n_estimators', 'max_depth', 'max_features'.

I also tried a smaller set of features to further improve the performance. The final features were selected using the feature importances from the previous run of Random Forest.

I finally selected two features: 'exercised_stock_options','bonus'

The process followed is:
1. Set the feature list to the list above
2. Tune the following hyperparameters: criterion, n_estimators, max_depth, max_features
3. Use GridSearchCV with F1 as scoring
4. Reuse the best estimator
5. Evaluate the final Random Forest model

__Adaboost tuning:__

The AdaBoost algorithm has a wealth of hyperparameters; I decided to tune the following ones: 'n_estimators', 'algorithm', 'learning_rate'.

I also tried a smaller set of features to further improve the performance. The final features were selected using the feature importances from the previous run of AdaBoost.

Final features giving the best results: 'exercised_stock_options','perc_to_poi','long_term_incentive','total_payments','shared_receipt_with_poi','expenses','from_this_person_to_poi'

The process followed is:
1. Set the feature list to the list above
2. Tune the following hyperparameters: n_estimators, algorithm, learning_rate
3. Use GridSearchCV with F1 as scoring
4. Reuse the best estimator
5. Evaluate the final AdaBoost model

Algorithm|Accuracy|Precision|Recall|F1| pass/fail
--|--|--|--|--|--
Naive Bayes|0.90409|0.46055|0.32100|0.3783|passed
Decision Tree|0.83993|0.41938|0.52150|0.46490|passed
Random Forest|0.81754|0.40663|0.40500|0.40581|passed
AdaBoost|0.87213|0.51576|0.67100|0.58322| passed 

#### DecisionTreeClassifier visualization

![Diagram](DecisionTree02.png)

## Task 6: Dump your classifier, dataset, and features_list

### Direction 

Dump your classifier, dataset, and features_list so anyone can
check your results. You do not need to change anything below, but make sure
that the version of poi_id.py that you submit can be run on its own and
generates the necessary .pkl files for validating your results.

See: poi_id.py file

## Validation

__Usage of Evaluation Metrics__

For classification problems where the dataset is imbalanced (i.e. the numbers of positive and negative instances differ, usually with the negatives outnumbering the positives), accuracy is not the most appropriate metric; alternative metrics such as precision and recall are better suited.

Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.

Precision tells us what proportion of the people predicted as POI actually are POI. For example, a precision of 0.3 means that 30% of the positive predictions are correct.

Recall tells us what proportion of the people who actually are POI were predicted as POI by the algorithm.

Here is an example with RandomForest Classifier with an imbalanced dataset:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
            
__Accuracy: 0.86327	Precision: 0.44218	Recall: 0.09750	F1: 0.15977	F2: 0.11551__

__Total predictions: 15000	True positives:  195	False positives:  246	False negatives: 1805	True negatives: 12754__

The key point is that the accuracy metric fails to capture the poor performance of the classifier on the imbalanced dataset. Accuracy, at 0.86, suggests that the classifier performs well. But recall, at 0.10, shows that it performs poorly, and precision, at 0.44, is only acceptable. Hence, precision and recall reveal weaknesses in performance that go unnoticed when using accuracy.
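The reported scores can be recomputed directly from the confusion counts listed above. The helper `classification_metrics` is a small illustrative function, not part of tester.py.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp)   # correct among predicted POI
    recall = tp / float(tp + fn)      # found among actual POI
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts reported for the Random Forest example above
accuracy, precision, recall, f1 = classification_metrics(
    tp=195, fp=246, fn=1805, tn=12754)
print("accuracy=%.5f precision=%.5f recall=%.5f f1=%.5f"
      % (accuracy, precision, recall, f1))
```

The 12754 true negatives dominate the accuracy, while recall only looks at the 2000 actual POI predictions, of which 1805 were missed.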

For GridSearchCV, I use F1 as the scoring metric since recall and precision are equally important.

__Validation and its importance__

Validation ensures that the model generalizes well. For that purpose, the data is split into two sets: a training set and a test set.

Cross-validation was used here. The concept is the following: instead of training on the whole dataset and then testing on the same data, I randomly divide the data into training and testing datasets, and repeat this over several splits.

If validation is not performed correctly, performance will very likely drop on new data once the model is deployed into production, because the model does not generalize well.

__Algorithm Performance__

An algorithm is validated when its precision and recall are both at least 0.3.

Algorithm|Accuracy|Precision|Recall|F1| Comment
--|--|--|--|--|--
Naive Bayes|0.90409|0.46055|0.32100|0.3783|
Decision Tree|0.83993|0.41938|0.52150|0.46490|
Random Forest|0.81754|0.40663|0.40500|0.40581|
AdaBoost|0.87213|0.51576|0.67100|0.58322| Selected: best F1

When tester.py is used to evaluate performance, precision and recall are both at least 0.3 for each of the tuned models above.

## Conclusion

The dataset is very small, and data matters as much as the algorithm.

Three outliers have been removed, with significant impact.

New features have been added (e.g. perc_to_poi), and for some models they were effective. The feature perc_to_poi is the second most significant feature for the AdaBoost model (the selected model). It is also highly significant for the tuned Decision Tree model.

Relevant data have been used. When using all the features, models performed poorly, in some cases because of overfitting. A good approach was to use a smaller set of features.

There was no way to obtain more training data.

The key influencing features are financial ones (e.g. exercised_stock_options).

Some features carry very little information: loan_advances, director_fees, restricted_stock_deferred. The approach was to not include them in feature selection.

Tuning can be time-consuming, especially when done manually. This is why GridSearchCV helped a lot.

The selected model - the one with the best F1 - is AdaBoost with the following scores:

Accuracy: 0.87213	

Precision: 0.51576	

Recall: 0.67100	


## Documentation/References

[GitHub repository 1]

https://github.com/ageron/handson-ml

[GitHub repository 2]

https://github.com/MarcCollado/Enron

[GitHub repository 3]

https://github.com/travisseal/enron_data_udacity

[GitHub repository 4]

https://github.com/Jacobdudee/EnronModel

[GitHub repository 5]

https://github.com/adazamora/enron_ml/blob/master/ml_project.ipynb

[GitHub repository 6]

https://github.com/WillKoehrsen/Machine-Learning-Projects/tree/master/random_forest_explained

[Hdbk] Python for Data Science Handbook from Blog:
[blog](http://www.datasciencecentral.com/profiles/blogs/book-python-data-science-handbook?utm_content=buffer09a5c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)