# P5 : Introduction to Machine learning

## Summary : 

This is the report for Udacity's Nanodegree course Introduction to Machine Learning.

Subjects covered:

- Naive Bayes
- SVM
- Decision Trees
- Choose Your Own Algorithm
- Datasets and Questions
- Regressions
- Outliers
- Clustering
- Feature Scaling
- Text Learning
- Feature Selection
- PCA
- Validation
- Evaluation Metrics
- Tying It All Together

Link to the [rubric](https://review.udacity.com/#!/rubrics/27/view) 

## 1
#### Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]

The goal of this project is to turn data into information through a machine learning process. More concretely, we build an algorithm that identifies Enron employees who may have committed fraud (persons of interest, POI), based on the public Enron financial and email dataset.

Machine learning (ML) is useful here because of its supervised learning techniques: each employee in the dataset is already labelled as POI or not, so we can train a classifier on the data we have.

The data originates from the famous Enron corpus[[0](https://en.wikipedia.org/wiki/Enron_Corpus)], declared of public interest by the American justice system in 2002 during the investigations into the Enron scandal.

However, we are given a pre-processed summary based on:
- the PDF of Enron employees' financial data published by FindLaw [[1](http://news.findlaw.com/hdocs/docs/enron/enron61702insiderpay.pdf)]
- the work of Udacity's mentors on email communications (sender, recipient, etc.)

This summary is provided as a Python dictionary with 146 keys, each nominally representing one Enron employee. Here are the 5 entries with the highest salaries.



In [1]:
import numpy as np
import glob
import os
import os.path
import re
import sys
import pickle
import pprint 
pp = pprint.PrettyPrinter(indent=4)
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','salary',"total_stock_value",'expenses','other','total_payments'] # You will need to use more features

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickles are binary files
    data_dict = pickle.load(data_file)

### Task 2: Remove outliers
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict

import pandas 
my_dataset_df = pandas.DataFrame.from_dict(my_dataset,orient='index')
my_dataset_df.replace('NaN',np.nan, inplace=True)







In [4]:
my_dataset_df.sort_values('salary', ascending=False).head(5)

| | salary | to_messages | deferral_payments | total_payments | exercised_stock_options | bonus | restricted_stock | shared_receipt_with_poi | restricted_stock_deferred | total_stock_value | ... | loan_advances | from_messages | other | from_this_person_to_poi | poi | director_fees | deferred_income | long_term_incentive | email_address | from_poi_to_this_person |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TOTAL | 26704229.0 | NaN | 32083396.0 | 309886585.0 | 311764000.0 | 97343619.0 | 130322299.0 | NaN | -7576788.0 | 434509511.0 | ... | 83925000.0 | NaN | 42667589.0 | NaN | False | 1398517.0 | -27992891.0 | 48521928.0 | NaN | NaN |
| SKILLING JEFFREY K | 1111258.0 | 3627.0 | NaN | 8682716.0 | 19250000.0 | 5600000.0 | 6843672.0 | 2042.0 | NaN | 26093672.0 | ... | NaN | 108.0 | 22122.0 | 30.0 | True | NaN | NaN | 1920000.0 | jeff.skilling@enron.com | 88.0 |
| LAY KENNETH L | 1072321.0 | 4273.0 | 202911.0 | 103559793.0 | 34348384.0 | 7000000.0 | 14761694.0 | 2411.0 | NaN | 49110078.0 | ... | 81525000.0 | 36.0 | 10359729.0 | 16.0 | True | NaN | -300000.0 | 3600000.0 | kenneth.lay@enron.com | 123.0 |
| FREVERT MARK A | 1060932.0 | 3275.0 | 6426990.0 | 17252530.0 | 10433518.0 | 2000000.0 | 4188667.0 | 2979.0 | NaN | 14622185.0 | ... | 2000000.0 | 21.0 | 7427621.0 | 6.0 | False | NaN | -3367011.0 | 1617011.0 | mark.frevert@enron.com | 242.0 |
| PICKERING MARK R | 655037.0 | 898.0 | NaN | 1386690.0 | 28798.0 | 300000.0 | NaN | 728.0 | NaN | 28798.0 | ... | 400000.0 | 67.0 | NaN | 0.0 | False | NaN | NaN | NaN | mark.pickering@enron.com | 7.0 |



One outlier was identified ('TOTAL') and removed with the `dict.pop` method: this entry aggregates the financial columns and does not describe an employee.

It is important to note the low proportion of POIs: 18/146 ≈ 12%.
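
A minimal sketch of this removal step, assuming a small stand-in dictionary (`toy_data` is hypothetical; the real `data_dict` is loaded from `final_project_dataset.pkl`, and only the salary values are copied from the table above):

```python
# Hypothetical three-entry stand-in for data_dict.
toy_data = {
    "TOTAL": {"salary": 26704229, "poi": False},
    "SKILLING JEFFREY K": {"salary": 1111258, "poi": True},
    "LAY KENNETH L": {"salary": 1072321, "poi": True},
}

# Passing a default (None) to pop avoids a KeyError if the key is already gone,
# which makes the cell safe to re-run in a notebook.
removed = toy_data.pop("TOTAL", None)

print(sorted(toy_data))
```

Removing the key from the dictionary before building the DataFrame keeps the exported dataset and the exploratory table consistent.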

## 2 

#### What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

Final list of features : ['poi', 'salary', 'total_stock_value', 'deferred_income', 'exercised_stock_options', 'bonus'] 

Feature selection was by far the most iterative part of the project.
1. At first, our intuition was that we should remove any data point with missing values, so as to feed the classifier only clean data.
    1. We built the following matrix of "feature completeness" for POI and non-POI, and selected all features with completeness > 0.85 for POI.

In [2]:
import collections
import pprint
pp = pprint.PrettyPrinter(indent=4)
poi_df = my_dataset_df.loc[my_dataset_df['poi'] == 1]
non_poi_df = my_dataset_df.loc[my_dataset_df['poi'] == 0]
pct_dict = collections.defaultdict(dict)
for column in my_dataset_df.columns.values:
    pct_non_nan_total =  round(my_dataset_df[column].count()/float(len(my_dataset_df[column])),3)
    pct_non_nan_poi = poi_df[column].count()/float(len(poi_df[column]))
    pct_non_nan_non_poi = non_poi_df[column].count()/float(len(non_poi_df[column]))
    
    dict_for_column = {"pct_non_nan_total":pct_non_nan_total,
                      "pct_non_nan_poi":pct_non_nan_poi,
                      "pct_non_nan_non_poi":pct_non_nan_non_poi}
    
    pct_dict[column]= dict_for_column

pct_df = pandas.DataFrame.from_dict(pct_dict,orient='index')

In [3]:
pct_df[['pct_non_nan_total','pct_non_nan_poi','pct_non_nan_non_poi']].sort_values('pct_non_nan_poi', ascending=False)

| | pct_non_nan_total | pct_non_nan_poi | pct_non_nan_non_poi |
|---|---|---|---|
| total_stock_value | 0.863 | 1.0 | 0.84375 |
| other | 0.637 | 1.0 | 0.585938 |
| total_payments | 0.856 | 1.0 | 0.835938 |
| email_address | 0.76 | 1.0 | 0.726562 |
| expenses | 0.651 | 1.0 | 0.601562 |
| poi | 1.0 | 1.0 | 1.0 |
| salary | 0.651 | 0.944444 | 0.609375 |
| restricted_stock | 0.753 | 0.944444 | 0.726562 |
| bonus | 0.562 | 0.888889 | 0.515625 |
| from_this_person_to_poi | 0.589 | 0.777778 | 0.5625 |


2. With the first 8 features (email_address was dropped as presumably uninformative), most (poorly optimized) classifiers did not reach the required performance: precision and recall > 0.3.
3. Classifiers with default parameters started to perform better once data points containing missing values were kept.
4. Ensemble classifiers with default parameters also performed better with data points containing missing values, so we fed the classifier all the features (except from_this_person_to_poi and from_poi_to_this_person, because of the discussions about train/test data leakage surrounding those features).
5. The ensemble classifier performed better with the SelectKBest[3] feature selection method, with k=10 (tested from k=6 to k=14).
6. The ensemble classifier performed worse after parameter fine-tuning with GridSearchCV.
7. Gaussian Naive Bayes showed very promising results with the same set of features from SelectKBest, with k=10.
8. Gaussian Naive Bayes performed even better with SelectKBest, k=5.
9. We used FeatureUnion[2] to combine features obtained by PCA and univariate selection, with slightly worse performance.
10. Definitive feature list: ['poi', 'salary', 'total_stock_value', 'deferred_income', 'exercised_stock_options', 'bonus'], selected by SelectKBest.
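
The univariate selection used in the steps above can be sketched as follows. This is an illustration on synthetic data, not the project's actual run: the feature names, the random data and the choice of `f_classif` (scikit-learn's default scoring function for `SelectKBest`) are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 145 samples, 18 positives, 4 hypothetical features.
rng = np.random.RandomState(42)
labels = np.array([1] * 18 + [0] * 127)
feature_names = ["informative_a", "informative_b", "noise_a", "noise_b"]
features = rng.normal(size=(145, 4))
# Shift the two "informative" columns for the positive class only.
features[labels == 1, 0] += 3.0
features[labels == 1, 1] += 2.0

selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features, labels)

# Pair each score with its feature name and sort from best to worst.
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
selected = [name for name, keep in zip(feature_names, selector.get_support())
            if keep]

for name, score in ranked:
    print("%-15s %.2f" % (name, score))
print(selected)
```

Pairing `scores_` with the feature names makes it possible to report which feature earned which score, rather than a bare list of numbers.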

Note on feature creation: after observing similar performance for GNB + SelectKBest with k=3 and k=5, we built a pipeline combining PCA and SelectKBest:

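```
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Assumed definitions: the step names are inferred from the parameter names below.
combined_features = FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])
pipeline = Pipeline([("features", combined_features), ("gnb", GaussianNB())])
param_grid = dict(features__pca__n_components=[1, 2],
                  features__univ_select__k=[2, 3, 4, 5, 6, 7])
```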

The objective was to condense two or more features into a principal component. However, this approach did not bring a significant improvement in the evaluated metrics.

In [5]:
# selector: the fitted SelectKBest instance (its definition is not shown in this report)
print(sorted(selector.scores_, reverse=True)[:5])

[25.097541528735491, 24.467654047526398, 21.060001707536571, 18.575703268041785, 11.595547659730601]

## 3
#### What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]

We ended up using a default Gaussian Naive Bayes classifier.

We tested, in this order:

1. Gaussian Naive Bayes with remove_any_zeros = True [Precision: 0.36102 Recall: 0.21300 F1: 0.26792]
2. Decision tree + GridSearchCV [Precision: 0.29597 Recall: 0.27200 F1: 0.28348]
3. SVC [Error]
4. AdaBoost default [Precision: 0.34165 Recall: 0.24650 F1: 0.28638]
5. AdaBoost + GridSearchCV [Precision: 0.20610 Recall: 0.65250 F1: 0.31325]
6. Random forest default [Precision: 0.47015 Recall: 0.18900 F1: 0.26961]
7. Random forest + GridSearchCV [Precision: 0.44704 Recall: 0.26800 F1: 0.33510]
8. Gaussian Naive Bayes + SelectKBest k=5 [Precision: 0.48876 Recall: 0.38050 F1: 0.42789]

Overall, our initial strategy with remove_any_zeros = True proved to be wrong. Ensemble classifiers did improve performance, but at a computational cost (10x slower for fitting and testing).

## 4

#### What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? 

Parameter tuning was done systematically with the help of GridSearchCV. It is important because a classifier usually performs better on a specific dataset once its parameters are fine-tuned than with its defaults. In this example, we tune the AdaBoost classifier over `n_estimators`, `learning_rate` and `algorithm`.
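```
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# features, labels: arrays prepared beforehand (e.g. via featureFormat and targetFeatureSplit)
param_grid = {'n_estimators': [10, 45, 50, 55, 60, 100],
              'learning_rate': [1., 2., 5.],
              'algorithm': ['SAMME', 'SAMME.R']}
sss = StratifiedShuffleSplit()
gs = GridSearchCV(AdaBoostClassifier(), param_grid, scoring="f1", cv=sss)
gs.fit(features, labels)
clf = gs.best_estimator_
```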
However, the performance gains were by no means 'game changers'. Feature selection without preconceived assumptions (remove_any_zeros = True) was far more productive.

As GNB has essentially no parameters to tune, we tuned the number of features k selected by SelectKBest instead:

>3 features, Precision: 0.486	Recall: 0.351	F1: 0.408	F2: 0.372

>4 features, Precision: 0.503	Recall: 0.323	F1: 0.393	F2: 0.348

>**5 features, Precision: 0.489	Recall: 0.381	F1: 0.428	F2: 0.398**

>6 features, Precision: 0.457	Recall: 0.370	F1: 0.409	F2: 0.385

>7 features, Precision: 0.457	Recall: 0.384	F1: 0.417	F2: 0.397

>8 features, Precision: 0.404	Recall: 0.318	F1: 0.356	F2: 0.332

>9 features, Precision: 0.328	Recall: 0.316	F1: 0.322	F2: 0.318
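
The search over k reported above could be automated along these lines. This is a sketch on synthetic data under stated assumptions: the step names `select` and `gnb`, the data, and the split settings are illustrative and not the configuration that produced the figures above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 145 samples, 18 positives, 10 features.
rng = np.random.RandomState(0)
labels = np.array([1] * 18 + [0] * 127)
features = rng.normal(size=(145, 10))
features[labels == 1, :3] += 1.5  # only the first three columns carry signal

# Selection happens inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([("select", SelectKBest()), ("gnb", GaussianNB())])
param_grid = {"select__k": [3, 4, 5, 6, 7, 8, 9]}
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(features, labels)

best_k = search.best_params_["select__k"]
print(best_k, round(search.best_score_, 3))
```

Keeping `SelectKBest` inside the pipeline means the features are chosen from the training fold only, so the held-out fold does not influence the selection.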






## 5
#### What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric item: “validation strategy”]

Validation is the set of methods we apply to check that our model does what it was meant to do. Generally, it is done by splitting the dataset into a training set (used to select the algorithm and its parameter settings) and a testing set (typically ~10%-25% of the data, used to measure how well the fitted model generalizes to new data).

One classic mistake to avoid is validating the model on the training data. Performance will probably be (very) good, leading us to think that the model is ready for production. But performance should be measured as how well the model adapts to **new** data, hence the initial separation into training and testing sets.
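
To illustrate this mistake, here is a small sketch on synthetic, deliberately noisy data (the data and the choice of a decision tree are assumptions made for the example, not part of the project):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
features = rng.normal(size=(200, 5))
# Labels depend only weakly on the first feature; the rest is noise.
labels = (features[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42)

# A fully grown tree memorises the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)  # scored on data the model has seen
test_acc = clf.score(X_test, y_test)     # scored on held-out data

print("train accuracy: %.2f, test accuracy: %.2f" % (train_acc, test_acc))
```

The training score says nothing about generalization; only the held-out score estimates how the model would behave on new data.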

In this particular case, splitting the initial dataset is complicated because we have very few POIs (only 18).

So we relied on GridSearchCV with StratifiedShuffleSplit() in order to have proper cross-validation. Why not the classic train/test split? When we have so little data that we cannot afford a separate test set (as is arguably the case in this project), perhaps the best we can do is to go without the final test set. We no longer have a three-part split of the data into training, validation and test sets; instead we keep the data in one group and perform cross-validation on the data as a whole.

Then we evaluated the overall performance with tester.py, provided with the assignment. tester.py runs a test based on a 1000-fold StratifiedShuffleSplit.
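
A sketch of this style of evaluation, on synthetic data. It assumes, without having checked tester.py itself, that the confusion-matrix counts are pooled across all splits before precision and recall are computed; the data and the number of splits are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 145 samples, 18 positives, 5 features.
rng = np.random.RandomState(1)
labels = np.array([1] * 18 + [0] * 127)
features = rng.normal(size=(145, 5))
features[labels == 1] += 1.0

# 100 splits keep the sketch fast; the report states that tester.py uses 1000.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=42)

tp = fp = fn = tn = 0
for train_idx, test_idx in splitter.split(features, labels):
    clf = GaussianNB().fit(features[train_idx], labels[train_idx])
    predicted = clf.predict(features[test_idx])
    actual = labels[test_idx]
    tp += int(np.sum((predicted == 1) & (actual == 1)))
    fp += int(np.sum((predicted == 1) & (actual == 0)))
    fn += int(np.sum((predicted == 0) & (actual == 1)))
    tn += int(np.sum((predicted == 0) & (actual == 0)))

# Metrics are computed once, from the pooled counts.
precision = tp / float(tp + fp) if tp + fp else 0.0
recall = tp / float(tp + fn) if tp + fn else 0.0
print("precision=%.3f recall=%.3f" % (precision, recall))
```

Pooling counts over many stratified splits smooths out the large fold-to-fold variance caused by having only a couple of POIs in each test fold.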

## 6
#### Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

The performance metrics to optimize were precision and recall[4]. Accuracy was much less important, because a model that predicts "non-POI" for everyone would reach 83% accuracy, which does not inspire confidence.

with:

- true positive (TP): equivalent to a hit
- true negative (TN): equivalent to a correct rejection
- false positive (FP): equivalent to a false alarm, Type I error
- false negative (FN): equivalent to a miss, Type II error

Precision : TP / (TP + FP) 

> the fraction of retrieved documents that are relevant to the query

Recall : TP / (TP + FN)

> the fraction of the documents that are relevant to the query that are successfully retrieved.
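
As a worked example of the two formulas, with hypothetical counts chosen only for illustration (they are not results from this project):

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return precision, recall

# Hypothetical case: 18 real POIs, of which the classifier finds 7,
# while also raising 7 false alarms.
tp, fp, fn = 7, 7, 11
precision, recall = precision_recall(tp, fp, fn)
print("precision=%.3f recall=%.3f" % (precision, recall))
# prints "precision=0.500 recall=0.389"
```

In this hypothetical case, half of the people flagged are real POIs, and a little over a third of the real POIs are found.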

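As explained before, our best results were

>  **Precision: 0.489 Recall: 0.381 F1: 0.428 F2: 0.398** 

obtained with a default Gaussian Naive Bayes and features selected with SelectKBest (see details in section 4). In plain terms: of the employees the classifier flags as POI, about 49% really are POIs (precision), and it finds about 38% of all actual POIs (recall).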
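
The F1 and F2 scores quoted above combine precision and recall through the general F-beta formula. The sketch below recomputes them from the rounded precision and recall, so the last digit can differ slightly from the reported values (the helper name `f_beta` is ours):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 gives more weight to recall, beta < 1 to precision.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.489, 0.381  # rounded values reported above
f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)
print("F1=%.3f F2=%.3f" % (f1, f2))
```

F2 is lower than F1 here because it weights recall more heavily, and recall is the weaker of the two metrics.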





-------

- [0] https://en.wikipedia.org/wiki/Enron_Corpus
- [1] http://news.findlaw.com/hdocs/docs/enron/enron61702insiderpay.pdf
- [2] http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
- [3] http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
- [4] https://en.wikipedia.org/wiki/Precision_and_recall
