# Enron Submission Free-Response Questions

A critical part of machine learning is making sense of your analysis process and communicating it to others. The questions below will help us understand your decision-making process and allow us to give feedback on your project. Please answer each question; your answers should be about 1-2 paragraphs per question. If you find yourself writing much more than that, take a step back and see if you can simplify your response!

When your evaluator looks at your responses, he or she will use a specific list of rubric items to assess your answers. Here is the link to that rubric: [Link] Each question has one or more specific rubric items associated with it, so before you submit an answer, take a look at that part of the rubric. If your response does not meet expectations for all rubric points, you will be asked to revise and resubmit your project. Make sure that your responses are detailed enough that the evaluator will be able to understand the steps you took and your thought processes as you went through the data analysis.

Once you’ve submitted your responses, your coach will take a look and may ask a few more focused follow-up questions on one or more of your answers.  

We can’t wait to see what you’ve put together for this project!


# Question 1
1.Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]

The goal of the project is to build a machine learning model that identifies persons of interest (POIs) among Enron employees. After "learning" from labeled examples in the training data, a classifier can predict the POI/non-POI label of a person from their financial and email features.

The Enron dataset combines email and financial data collected and prepared by the CALO Project. It contains data from about 150 users, mostly senior management of Enron, organized into folders.

### Dataset Overview
- 146 data points, each representing a person (2 are not people)
- 18 of these points are labeled as POIs and 128 as non-POIs

In [35]:
import pickle
import pandas as pd
# Pickle files are binary, so open in "rb" mode (required on Python 3)
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

In [36]:
# Basic information of the data
data_df = pd.DataFrame(data_dict).T
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
bonus                        146 non-null object
deferral_payments            146 non-null object
deferred_income              146 non-null object
director_fees                146 non-null object
email_address                146 non-null object
exercised_stock_options      146 non-null object
expenses                     146 non-null object
from_messages                146 non-null object
from_poi_to_this_person      146 non-null object
from_this_person_to_poi      146 non-null object
loan_advances                146 non-null object
long_term_incentive          146 non-null object
other                        146 non-null object
poi                          146 non-null object
restricted_stock             146 non-null object
restricted_stock_deferred    146 non-null object
salary                       146 non-null object
shared_receipt_with_poi      146 non-null object

In [37]:
data_df.drop('email_address', axis=1, inplace=True)
data_df.head()

| name | bonus | deferral_payments | deferred_income | director_fees | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | loan_advances | long_term_incentive | other | poi | restricted_stock | restricted_stock_deferred | salary | shared_receipt_with_poi | to_messages | total_payments | total_stock_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 4175000.0 | 2869717.0 | -3081055.0 | NaN | 1729541.0 | 13868 | 2195.0 | 47.0 | 65.0 | NaN | 304805.0 | 152.0 | False | 126027.0 | -126027.0 | 201955.0 | 1407.0 | 2902.0 | 4484442 | 1729541 |
| BADUM JAMES P | NaN | 178980.0 | NaN | NaN | 257817.0 | 3486 | NaN | NaN | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | 182466 | 257817 |
| BANNANTINE JAMES M | NaN | NaN | -5104.0 | NaN | 4046157.0 | 56301 | 29.0 | 39.0 | 0.0 | NaN | NaN | 864523.0 | False | 1757552.0 | -560222.0 | 477.0 | 465.0 | 566.0 | 916197 | 5243487 |
| BAXTER JOHN C | 1200000.0 | 1295738.0 | -1386055.0 | NaN | 6680544.0 | 11200 | NaN | NaN | NaN | NaN | 1586055.0 | 2660303.0 | False | 3942714.0 | NaN | 267102.0 | NaN | NaN | 5634343 | 10623258 |
| BAY FRANKLIN R | 400000.0 | 260455.0 | -201641.0 | NaN | NaN | 129142 | NaN | NaN | NaN | NaN | NaN | 69.0 | False | 145796.0 | -82782.0 | 239671.0 | NaN | NaN | 827696 | 63014 |


### Outliers

After plotting salary against bonus in a scatter plot, I found two outliers: TOTAL and THE TRAVEL AGENCY IN THE PARK, neither of which is a person.

I also found a record named LOCKHART EUGENE E whose values are all NaN. It carries no information, so I removed it from the dataset as well.
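
A check like the one described above can be automated. The following is a minimal sketch, assuming the dataset's convention of storing missing values as the string 'NaN'; the toy `records` dict and the helper `find_empty_records` are hypothetical and only mimic the layout of `data_dict`.

```python
def find_empty_records(data, ignore=("poi",)):
    """Return the names whose every feature (except those in `ignore`) is 'NaN'."""
    empty = []
    for name, features in data.items():
        values = [v for k, v in features.items() if k not in ignore]
        if all(v == "NaN" for v in values):
            empty.append(name)
    return sorted(empty)


# Hypothetical toy records that mimic the dataset's layout
records = {
    "PERSON A": {"salary": 201955, "bonus": 4175000, "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "PERSON C": {"salary": "NaN", "bonus": 400000, "poi": True},
}

empty_names = find_empty_records(records)
print(empty_names)
```

The `poi` label is ignored because it is always set, even for a record with no usable features.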

In [38]:
# Remove the two non-person entries and the all-NaN record, leaving 143 records
data_dict.pop('TOTAL')
data_dict.pop('THE TRAVEL AGENCY IN THE PARK')
data_dict.pop('LOCKHART EUGENE E')

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

# Question 2
2.What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]

### Features Used

The final algorithm was DecisionTreeClassifier, with SelectKBest choosing the features for the POI identifier. The eight features are listed below in descending order of decision tree importance; the last column is the SelectKBest score.

| No. | Feature | Importance | SelectKBest score |
|---|---|---|---|
| 1 | shared_receipt_with_poi | 0.403561962807 | 15.7854252775 |
| 2 | from_poi_to_this_person_ratio | 0.147344335528 | 2.48838097173 |
| 3 | salary | 0.145954110897 | 17.7678544529 |
| 4 | bonus | 0.0950254426577 | 34.2129648303 |
| 5 | from_this_person_to_poi_ratio | 0.0942268021027 | 0.215888289188 |
| 6 | bonus_salary_ratio | 0.0780089283869 | 22.1067164085 |
| 7 | exercised_stock_options | 0.0358784176204 | 16.9328653375 |
| 8 | total_stock_value | 0.0 | 16.8651432616 |

By importance, shared_receipt_with_poi, from_poi_to_this_person_ratio, and salary are the three most important features.

### Selection Process
I used SelectKBest in a pipeline with grid search to find the best number of features. SelectKBest keeps only the k highest-scoring features; the value of k was chosen by an exhaustive grid search using "f1" as the scoring metric.
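
The selection step can be sketched roughly as follows. This is a minimal illustration, not the exact project code: it assumes a recent scikit-learn (where these classes live in `sklearn.model_selection`), uses synthetic data as a stand-in for the Enron features, and relies on SelectKBest's default `f_classif` score function. The step names `skb` and `dtc` and the candidate values of k are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the Enron features (assumption)
X, y = make_classification(
    n_samples=144, n_features=12, n_informative=4,
    weights=[0.87, 0.13], random_state=42,
)

pipeline = Pipeline([
    ("skb", SelectKBest(score_func=f_classif)),
    ("dtc", DecisionTreeClassifier(random_state=42)),
])

# Candidate values of k; the "skb__" prefix routes the parameter to that step
param_grid = {"skb__k": [2, 4, 6, 8, 10]}

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_k = search.best_params_["skb__k"]
selector = search.best_estimator_.named_steps["skb"]
selected_scores = selector.scores_[selector.get_support()]
print(best_k, selected_scores)
```

Putting SelectKBest inside the pipeline means the feature scores are recomputed on each training split, so the test split never influences which features are kept.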

### Features Scaling
I tried two classifiers: SVM and DecisionTreeClassifier.

According to [A Practical Guide to Support Vector Classification](http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf), scaling before applying SVM is very important. The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems.

However, scaling isn't required for tree-based algorithms, because each split compares a single feature against a threshold value, so the relative ranges of different features do not affect the result.

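
As an illustration of scaling before an SVM, here is a minimal sketch that puts a `MinMaxScaler` in front of an `SVC` inside a pipeline. The toy data (one dollar-sized feature, one ratio-sized feature) and the step names are assumptions made for the example; the choice of `MinMaxScaler` is also an assumption, not necessarily the scaler used in the project.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Hypothetical toy data: one feature in dollars, one ratio in [0, 1]
rng = np.random.RandomState(0)
salary = rng.uniform(0, 1000000, size=100)
ratio = rng.uniform(0, 1, size=100)
X = np.column_stack([salary, ratio])
y = (ratio > 0.5).astype(int)  # the label depends only on the small-range feature

# The scaler is fitted on the training data only, then applied before the SVM
scaled_svm = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svm", SVC(kernel="rbf")),
])
scaled_svm.fit(X, y)

# After scaling, both features share the same [0, 1] range
X_scaled = scaled_svm.named_steps["scaler"].transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```
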
### Features Engineered
I added three features to the dataset:
- bonus_salary_ratio
- from_poi_to_this_person_ratio
- from_this_person_to_poi_ratio
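
The exact definitions are not spelled out here, so the following is only one plausible construction, assuming each ratio divides the named quantity by its natural total (bonus by salary, mail received from POIs by all mail received, mail sent to POIs by all mail sent). The helpers `safe_ratio` and `add_ratio_features`, and the choice of 0.0 for missing values, are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if a value is 'NaN' or the denominator is 0."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return 0.0
    return float(numerator) / float(denominator)


def add_ratio_features(person):
    """Return a copy of one person's record with the three ratio features added."""
    enriched = dict(person)
    enriched["bonus_salary_ratio"] = safe_ratio(person["bonus"], person["salary"])
    enriched["from_poi_to_this_person_ratio"] = safe_ratio(
        person["from_poi_to_this_person"], person["to_messages"])
    enriched["from_this_person_to_poi_ratio"] = safe_ratio(
        person["from_this_person_to_poi"], person["from_messages"])
    return enriched


# Values taken from the ALLEN PHILLIP K row of the data preview
allen = {
    "bonus": 4175000, "salary": 201955,
    "from_poi_to_this_person": 47, "to_messages": 2902,
    "from_this_person_to_poi": 65, "from_messages": 2195,
}
allen_enriched = add_ratio_features(allen)
print(allen_enriched["from_poi_to_this_person_ratio"])
```

Ratios of this kind normalize for how much mail a person handles overall, so a heavy emailer is not flagged simply for having large raw counts.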

# Question 3
3.What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]

Ultimately I used DecisionTreeClassifier. I also tried SVM.
- The decision tree classifier had the best performance, with a precision of 0.38946 and a recall of 0.38050, both above the 0.3 threshold.
- The SVM had a precision of 0.59259 but a recall of only 0.05600.

With such a low recall, the SVM misses most of the POIs, so it is not a good fit for the heavily imbalanced classes in the Enron dataset.

# Question 4
4.What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]

Tuning the parameters of an algorithm means adjusting its settings to get the best performance on the task at hand. Without tuning, I would end up using the defaults, which are not guaranteed to suit this dataset: the model may underfit or overfit, and its performance may suffer.

I used GridSearchCV to find the best combination of parameters. Performance can be measured in many ways, such as accuracy, precision, and recall.

For the chosen DecisionTreeClassifier, I searched over the following parameter values:
- dtc__criterion = ['gini', 'entropy']
- dtc__min_samples_split = [2, 4, 6, 8, 10, 20]
- dtc__max_depth = [None, 5, 10, 15, 20]
- dtc__max_features = [None, 'sqrt', 'log2', 'auto']
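
The grid above could be passed to GridSearchCV roughly as follows. This is a sketch on synthetic data, assuming a recent scikit-learn; the `'auto'` option for `max_features` is left out because newer releases no longer accept it for decision trees. The cross-validation setting (`cv=3`) is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the Enron features (assumption)
X, y = make_classification(
    n_samples=144, n_features=8, n_informative=4,
    weights=[0.87, 0.13], random_state=0,
)

pipeline = Pipeline([("dtc", DecisionTreeClassifier(random_state=0))])

param_grid = {
    "dtc__criterion": ["gini", "entropy"],
    "dtc__min_samples_split": [2, 4, 6, 8, 10, 20],
    "dtc__max_depth": [None, 5, 10, 15, 20],
    # 'auto' omitted: not accepted by recent scikit-learn releases
    "dtc__max_features": [None, "sqrt", "log2"],
}

# Every combination is cross-validated and ranked by F1
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```
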

# Question 5
5.What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”] 

Model validation is the process of evaluating a trained model on a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. Its main purpose is to test how well the trained model generalizes to data it has not seen.

A classic mistake is to test your algorithm on the same data you trained on. In that case the measured accuracy can be misleadingly high, even 100%, because the model may simply have memorized the training data (overfitting). So we have to separate our data into two parts: the training set and the testing set.

I also used StratifiedShuffleSplit to gauge the algorithm's performance. It creates multiple randomized train/test splits that preserve the class proportions, which gives a more reliable performance estimate on a dataset as small as Enron's.
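
The effect can be seen in a small experiment. The sketch below uses synthetic data with deliberately noisy labels (an assumption made for illustration) and compares a decision tree's accuracy on its own training data with its accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y), so memorizing them does not generalize
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    flip_y=0.2, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1,
)

# An unconstrained tree can memorize every training point
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)  # evaluated on data it has seen
test_accuracy = tree.score(X_test, y_test)     # evaluated on held-out data
print(train_accuracy, test_accuracy)
```

Only the held-out score says anything about how the model would behave on new people.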
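
A minimal sketch of how StratifiedShuffleSplit behaves, assuming a recent scikit-learn (where it lives in `sklearn.model_selection`) and using toy labels with the class balance from the dataset overview. The number of splits and the test size are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with the class balance from the overview: 18 POIs, 128 non-POIs
y = np.array([1] * 18 + [0] * 128)
X = np.zeros((len(y), 1))  # the feature values do not matter for the split itself

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

poi_counts = []
for train_idx, test_idx in splitter.split(X, y):
    # Each split is a fresh random partition with roughly the same POI share
    poi_counts.append((int(y[train_idx].sum()), int(y[test_idx].sum())))

print(poi_counts)  # (POIs in train, POIs in test) for each split
```

Because every split keeps a similar share of POIs in the test set, averaging the scores over many splits is less sensitive to one unlucky partition.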

# Question 6
6.Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]


Two important evaluation metrics are precision and recall. Their average values are 0.38946 and 0.38050, respectively.

Details below:
- Accuracy: 0.83787      
- Precision: 0.38946      
- Recall: 0.38050 
- F1: 0.38493     
- F2: 0.38226

To understand it in a simpler way:
- Precision is the share of our positive predictions that are correct. For this algorithm, we made 1954 positive predictions, of which only 761 are correct, so precision = 761/1954 = 0.38946. In other words, when the model flags someone as a POI, it is right about 39% of the time.
- Recall is the share of actual positive points that we predicted correctly. There are 2000 actual positive points in total, of which we predicted 761 correctly, so recall = 761/2000 = 0.38050. In other words, the model finds about 38% of the real POIs.

Of the 15000 total predictions, 761 are true positives, 1193 are false positives, 1239 are false negatives and 11807 are true negatives.
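
The reported metrics can be recomputed directly from these four counts. The helper `f_beta` below is a hypothetical name; it implements the standard F-beta formula.

```python
# Confusion-matrix counts reported above
tp, fp, fn, tn = 761, 1193, 1239, 11807
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)


def f_beta(p, r, beta):
    """F-beta score: weighted harmonic mean of precision p and recall r."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


f1 = f_beta(precision, recall, 1)
f2 = f_beta(precision, recall, 2)

print(total)               # 15000
print(round(accuracy, 5))  # 0.83787
print(round(precision, 5)) # 0.38946
print(round(recall, 5))    # 0.3805
print(round(f1, 5))        # 0.38493
print(round(f2, 5))        # 0.38226
```

The accuracy looks high mainly because most people are non-POIs, which is why precision and recall are the more informative numbers here.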

It's easy to see there's a tradeoff between precision and recall, which is why I used "f1" as the scoring parameter when applying GridSearchCV.

F1 is the harmonic mean of precision and recall:
- f1 = 2 * true_positives/(2*true_positives + false_positives+false_negatives)

## Reference

https://www.cs.cmu.edu/~./enron/

https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-9863-7_233

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

https://en.wikipedia.org/wiki/F1_score