***
# Enron Project: Free-Response Questions
***

### Question 1
***

> Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it.  
As part of your answer, give some background on the dataset and how it can be used to answer the project question.   
Were there any outliers in the data when you got it, and how did you handle those?

The goal of the project was to apply machine learning to e-mail and corporate financial data to classify Enron employees into two groups: those involved in the financial scandal, named POIs (persons of interest), and those not. Machine learning uses computational power and statistical algorithms to identify patterns in known cases and then assign unknown cases to one group or the other.

To accomplish the task, both financial and e-mail information were summarized per person. A list of POIs was manually generated based on individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution. There were 21 features in total: 14 financial features, 6 e-mail features and the 'POI' feature, which is the label the model tries to predict.

The dataset contains 146 data points, each representing one Enron employee. Of those, 18 were flagged as POIs and 128 as non-POIs.   

Of the 146 data points, 3 outliers were removed: two that are clearly not persons ('TOTAL' and 'THE TRAVEL AGENCY IN THE PARK') and one with all values equal to zero ('LOCKHART EUGENE E'), which could mislead the model.   
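
The removal step can be sketched as follows. This is a minimal illustration, assuming the data is held in a dictionary keyed by person name with missing values stored as the string 'NaN'; the `data_dict` contents, the made-up numbers and the `has_no_data` helper are hypothetical and not taken from the project code.

```python
# Toy stand-in for the project's data: person name -> feature dict.
# All values are made up for illustration only.
data_dict = {
    "DOE JOHN": {"salary": 200000, "bonus": 50000, "poi": False},
    "TOTAL": {"salary": 1000000, "bonus": 2000000, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": 1000, "poi": False},
    "LOCKHART EUGENE E": {"salary": 0, "bonus": "NaN", "poi": False},
}


def has_no_data(features):
    """True when every non-label feature is missing ('NaN') or zero."""
    return all(v in ("NaN", 0) for k, v in features.items() if k != "poi")


# Records that carry no information at all (hypothetical check).
empty_records = [name for name, feats in data_dict.items() if has_no_data(feats)]
print(empty_records)

# The three keys named in the text above.
outliers = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E"]
for name in outliers:
    # pop with a default avoids a KeyError if the key is already gone
    data_dict.pop(name, None)

print(sorted(data_dict))
```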

Before running the models, the feature values were checked for missing information. The aim was to avoid bias caused by 'NaN' values concentrated in POIs.   
For example, if all POIs have 'NaN' for a specific feature, the algorithm may conclude that "every time a feature is missing, the person is a POI".  
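
One way to run such a check is to compare the share of missing values per class. The sketch below assumes missing values are stored as the string 'NaN' and the label under the key `poi`; the helper name `missing_share_by_class` and the toy records are hypothetical.

```python
from collections import defaultdict


def missing_share_by_class(data_dict, feature):
    """Return {poi_flag: fraction of records where `feature` is 'NaN'}."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for features in data_dict.values():
        label = bool(features["poi"])
        total[label] += 1
        # A feature that is absent is treated the same as 'NaN'.
        if features.get(feature, "NaN") == "NaN":
            missing[label] += 1
    return {label: missing[label] / total[label] for label in total}


# Made-up records for illustration only.
toy = {
    "A": {"poi": True, "director_fees": "NaN"},
    "B": {"poi": True, "director_fees": "NaN"},
    "C": {"poi": False, "director_fees": 1000},
    "D": {"poi": False, "director_fees": "NaN"},
}

shares = missing_share_by_class(toy, "director_fees")
print(shares)
```

A feature whose missing share is 1.0 for POIs but much lower for non-POIs is a candidate for the kind of leak described above.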

Two features were dropped because of the 'NaN' issue: `restricted_stock_deferred` and `director_fees`. The `email` feature was also dropped, because it carries no predictive information.

The next step was to create 2 new features to be tested:

1. `to_poi_ratio`: messages sent to a POI / total messages sent;
2. `from_poi_ratio`: messages received from a POI / total messages received.

In theory, those involved in the fraud should be working more closely together and therefore exchange a larger share of their e-mail with each other.  
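
The two ratios can be sketched as below. This assumes that `from_messages` holds the total messages a person sent and `to_messages` the total received, and that missing values are stored as 'NaN'; the `ratio` and `add_ratio_features` helpers are hypothetical names, and returning 0.0 for missing data is one possible convention, not necessarily the one used in the project.

```python
def ratio(numerator, denominator):
    """numerator / denominator, or 0.0 if a value is missing or the total is 0."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return 0.0
    return numerator / denominator


def add_ratio_features(features):
    """Add the two engineered ratios to one person's feature dict."""
    features["to_poi_ratio"] = ratio(
        features["from_this_person_to_poi"], features["from_messages"]
    )
    features["from_poi_ratio"] = ratio(
        features["from_poi_to_this_person"], features["to_messages"]
    )
    return features


# Made-up record for illustration only.
person = add_ratio_features({
    "from_this_person_to_poi": 15,
    "from_messages": 60,
    "from_poi_to_this_person": 30,
    "to_messages": 100,
})
print(person["to_poi_ratio"], person["from_poi_ratio"])
```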

The 19 features in the dataset, ordered by k-best score (from best to worst, after rescaling), are as follows:  

|#   	|Feature   	|K-Best Score   	|
|---	|---	|---	|
|1.   	|exercised_stock_options   	|24.815079733218194   	|
|2.   	|total_stock_value   	|24.182898678566872   	|
|3.   	|bonus   	|20.792252047181538   	|
|4.   	|salary   	|18.289684043404513   	|
|5.   	|deferred_income   	|11.458476579280697   	|
|6.   	|long_term_incentive   	|9.9221860131898385   	|
|7.   	|restricted_stock   	|9.212810621977086   	|
|8.   	|total_payments   	|8.7727777300916809   	|
|9.   	|shared_receipt_with_poi   	|8.5894207316823774   	|
|10.   	|loan_advances   	|7.1840556582887247   	|
|11.   	|expenses   	|6.0941733106389666   	|
|12.   	|from_poi_to_this_person   	|5.2434497133749574   	|
|13.   	|from_poi_ratio   	|5.1239461527568899   	|
|14.   	|other   	|4.1874775069953785   	|
|15.   	|to_poi_ratio   	|4.0946533095769446   	|
|16.   	|from_this_person_to_poi   	|2.3826121082276743   	|
|17.   	|to_messages   	|1.6463411294420094   	|
|18.   	|deferral_payments   	|0.22461127473600509   	|
|19.   	|from_messages   	|0.16970094762175436   	|


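From the feature list above we conclude that financial information is generally better than e-mail information for predicting POIs: the eight highest-scoring features are all financial, although a few financial features ('other' and `deferral_payments`) score lower than several e-mail features.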
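
A ranking like the one above can be produced along these lines. This is a sketch on synthetic data, assuming the scores come from scikit-learn's `SelectKBest` with the `f_classif` scoring function (the text does not state which scoring function was used); the feature names and data are invented.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
feature_names = ["informative", "weak", "noise"]  # hypothetical names
labels = np.array([0] * 40 + [1] * 10)

X = np.column_stack([
    labels * 5.0 + rng.normal(size=50),  # strongly tied to the label
    labels * 0.5 + rng.normal(size=50),  # weakly tied to the label
    rng.normal(size=50),                 # unrelated to the label
])

# Rescale, then score every feature (k="all" keeps them all).
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=f_classif, k="all").fit(X_scaled, labels)

ranking = sorted(zip(feature_names, selector.scores_),
                 key=lambda pair: pair[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.3f}")
```

If `f_classif` is indeed the scoring function, min-max rescaling should not change the scores themselves, since the F-statistic is invariant to a per-feature linear rescaling.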

### Question 2
***

> What features did you end up using in your POI identifier, and what selection process did you use to pick them?   
Did you have to do any scaling? Why or why not?  

> As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.)  

> In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. 

A `Pipeline` was used to guarantee that all steps were run in sequence over the same data sample. The steps were:

1. Scaling using MinMaxScaler
2. PCA transformation (all models tested with and without this step)
3. K-Best selection
4. Classifier (Gaussian Naive Bayes, Decision Tree, Adaboost and Random Forest)

The features were rescaled using MinMaxScaler (except for 'POI') because the financial ones, expressed in USD, had a much wider range of values than the counts of e-mails sent/received, which would distort the model depending on the classifier used. 

All classifiers were tested using the k-best method, with k in [4, 6, 8, 10]. After grid searching all parameter combinations, we ended up using 4 features: `exercised_stock_options`, `total_stock_value`, `bonus` and `salary`.
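
The pipeline described in this answer could look roughly like the sketch below. It runs on synthetic data; the `build_pipeline` helper, the step names and the use of `f_classif` as the scoring function are assumptions made for illustration, not the project's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def build_pipeline(k=4, use_pca=False):
    """Scaler -> optional PCA -> k-best selection -> Gaussian Naive Bayes."""
    steps = [("scaler", MinMaxScaler())]
    if use_pca:
        steps.append(("pca", PCA()))
    steps.append(("kbest", SelectKBest(score_func=f_classif, k=k)))
    steps.append(("clf", GaussianNB()))
    return Pipeline(steps)


# Synthetic stand-in data: 60 samples, 6 features, 12 positives.
rng = np.random.default_rng(0)
y = np.array([0] * 48 + [1] * 12)
X = rng.normal(size=(60, 6))
X[:, 0] += 3 * y  # make the first feature informative

model = build_pipeline(k=4, use_pca=False).fit(X, y)
predictions = model.predict(X)
print(predictions[:10])
```

Because every step lives inside the pipeline, the scaler and the selector are fitted only on whatever data is passed to `fit`, which keeps test folds from leaking into preprocessing during cross-validation.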

### Question 3
***

> What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?

The tested classifiers were Gaussian Naive Bayes, Decision Tree, Adaboost and Random Forest.   
Gaussian Naive Bayes was the classifier of choice.

Best results for each classifier were:
***  

|**Classifier**   	        |**PCA (Y/N)** 	|**Accuracy**   	|**Precision**   	|**Recall**   	|**f1**   	|
|---	                    |---	        |---	            |---	            |---	        |---	|
|**Gaussian NB**   	            |**No**   	        |**0.847**  	        |**0.412**   	        |**0.329**          |**0.366**   	|
|Decision Tree              |No   	        |0.840   	        |0.340  	        |0.215 	        |0.263   	|
|Adaboost   	            |No   	        |0.824   	        |0.294   	        |0.227 	        |0.256   	|
|Random Forest 	            |No   	        |0.849   	        |0.372   	        |0.188          |0.250   	|
|.....      	            |..... 	        |.....   	        |.....   	        |.....          |.....   	|
|Gaussian NB   	            |Yes   	        |0.812  	        |0.312   	        |0.343          |0.327   	|
|Decision Tree              |Yes   	        |0.810   	        |0.275  	        |0.258 	        |0.266   	|
|Adaboost   	            |Yes   	        |0.820   	        |0.291   	        |0.245 	        |0.266  	|
|Random Forest 	            |Yes   	        |0.837   	        |0.331   	        |0.218          |0.263   	|


Conclusions:
- Gaussian NB was the best classifier for the problem at hand, the only one to pass the minimum bar of 0.3 for both precision and recall.
- PCA preprocessing didn't influence the results significantly: F1 changed by less than 0.04 for every classifier.

### Question 4
***

> What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  
How did you tune the parameters of your particular algorithm?

It means testing the algorithm against a wide range of possible parameter combinations in search of the best solution on the training set. If it's not done properly, the model will return avoidable wrong predictions when going live. A model's quality is measured by how well it performs on unseen data. It may overfit or underfit, in what is known as the bias-variance tradeoff. 

Overfitting (high variance, low bias) occurs when the model overreacts to minor fluctuations in the training data, incorporating a large noise component.   
Underfitting (low variance, high bias) happens when the model is overly simplified, unable to capture important trends. 

Algorithm parameters are not solely responsible for model accuracy and speed. Depending on the input data and the number of features available, it's necessary (or recommended) to preprocess the data before loading it: feature rescaling (e.g. MinMax rescaling), feature selection (e.g. k-best), dimensionality reduction (e.g. PCA) and other techniques that may improve the algorithms' predictive power and reduce their potential for an overly complex fit.

`GridSearchCV` was the method used to systematically work through the possible combinations of parameter values, cross-validating as it goes to determine which combination gives the best performance. 

The following parameters were tested on the models:  

Gaussian NB:

* PCA: [True, False]
* k-best: [4, 5, 6, 7, 8, 9, 10]

Decision Tree:

* PCA: [True, False]
* k-best: [4, 6, 8, 10]
* criterion: ["entropy", "gini"]
* min_samples_leaf: [2, 4, 6]
* min_samples_split: [2, 4, 6]

Adaboost:

* PCA: [True, False]
* k-best: [4, 6, 8, 10]
* n_estimators: [40, 50, 60]
* learning_rate: [.6, .8, 1, 1.2, 1.5]

Random Forest:

* PCA: [True, False]
* k-best: [4, 6, 8, 10]
* criterion: ["entropy", "gini"]
* min_samples_leaf: [1, 2, 4]
* min_samples_split: [2, 4, 6]
* n_estimators: [5, 10, 20]
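
As an illustration, the Decision Tree grid above could be expressed for `GridSearchCV` as in the sketch below. It runs on synthetic data from `make_classification`; the step names, the `"passthrough"` trick for switching PCA off, the `f1` scoring choice and the number of splits are assumptions, not details confirmed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the Enron features (assumption).
X, y = make_classification(n_samples=120, n_features=12, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("pca", PCA()),
    ("kbest", SelectKBest()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

param_grid = {
    # "passthrough" skips the step, so PCA is tested both on and off.
    "pca": [PCA(), "passthrough"],
    "kbest__k": [4, 6, 8, 10],
    "clf__criterion": ["entropy", "gini"],
    "clf__min_samples_leaf": [2, 4, 6],
    "clf__min_samples_split": [2, 4, 6],
}

# Stratified random splits keep the class ratio in every fold.
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```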

### Question 5
***

> What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?

Validation is the process of running the chosen algorithm against unseen data to verify model performance.   
A classic mistake is loading test data into the model for training, giving it a perfect fit but returning poor results when going live on unseen data.

Our analysis was first validated with a hold-out split created by `train_test_split`, which splits arrays or matrices into random train and test subsets: 80% of the data points were randomly selected to train the model, while the remaining 20% were kept to test it. Next, it was tested using `StratifiedShuffleSplit` (SSS), a cross-validation object that merges StratifiedKFold and ShuffleSplit and returns stratified randomized folds.
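
The two validation steps can be sketched on toy labels as follows. The array sizes, the number of splits and the random seeds are illustrative assumptions; only the 80/20 proportion comes from the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Toy data: 100 samples, 20 of them positive (made up for illustration).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

# Step 1: a single random 80/20 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

# Step 2: repeated stratified random splits; every test fold keeps
# the same share of positives as the full dataset.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
fold_positive_counts = [int(y[test_idx].sum()) for _, test_idx in sss.split(X, y)]
print(fold_positive_counts)
```

With so few POIs in the data, stratification matters: a purely random split can leave a test fold with almost no positives, which makes precision and recall unstable.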

### Question 6
***

> Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

Accuracy Score:  
(True Positives + True Negatives) / Total Cases.  
Model accuracy score: 0.847  
It means 84.7% of the time the model predicted true or false correctly. Of all people, 15.3% would either answer for a crime while not guilty or get away with it if guilty.


Precision Score:  
True Positives / (True Positives + False Positives)  
Model precision score:  0.412  
It means that when the model predicted a person was a POI, he/she actually was one 41.2% of the time.


Recall Score:  
True Positives / (True Positives + False Negatives)  
Model recall score: 0.329  
It means that the model correctly classified 32.9% of the existing POIs as POIs. Using the model alone by this criterion, 67.1% of POIs would get away with it.   
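
The formulas above translate directly into code. The confusion-matrix counts in this sketch are hypothetical and chosen only to exercise the functions; the last line checks that the reported precision and recall are consistent with the f1 value listed for Gaussian NB in Question 3.

```python
def accuracy(tp, tn, fp, fn):
    """(True Positives + True Negatives) / Total Cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """True Positives / (True Positives + False Negatives)."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 3, 40, 2, 5
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))

# f1 from the reported precision (0.412) and recall (0.329).
reported_f1 = round(f1(0.412, 0.329), 3)
print(reported_f1)
```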