Kyle Shannon 1/18/16 - Project 5 - Enron POI Predictive Model

### Files:

**poi_id.py:** Main file where I construct my pipeline classifier: MinMax/Scaler/PCA/Classifiers etc.

**data_shape.py:** Python script to remove outliers, print out useful info about the data set and add features.

**data_viz.py:** Python script that takes a data_dict and transforms it into a Pandas DataFrame. It also has several graphing functions that show correlation plots and pairplots, which are useful for identifying feature interactions.

**tester.py:** file generated by Udacity to test out algorithm and print out useful information about performance.

**three pickle files:** files generated by poi_id.py which tester.py uses. The pickle files are essentially my classifier or pipeline along with the transformed data set and feature list.

**references.txt:** text file with the list of references I used.

**ml_results.txt:** I piped the output of tester.py to this file so I could save all of my ML algorithm attempts. This is essentially a record of all my attempts picking an algorithm and trying to fine tune it.

### Understanding the Dataset and Question

##### Data Exploration & Outlier Investigation

I started out with basic exploratory methods (printing out types, counts and percentages, and plotting samples). I noticed several data points that could be considered outliers. Individuals such as Lay and Skilling fall into that category; however, because they are POIs and the classes are severely imbalanced, I did not remove them. There were three outliers I did remove:
	
    del data_dict['TOTAL'] # this is not a person and should be removed.
    del data_dict['THE TRAVEL AGENCY IN THE PARK'] # not identified as a POI and had little financial data.
    del data_dict['LOCKHART EUGENE E'] # All data points were NaN and he was not a POI.
    
Other interesting information I found included:

- Number of People under Investigation: 143
- Number of Data Points: 3718 (this was after outliers were removed and I added new features, derived by: num_of_people * num_of_features)
- Number of Features: 26 (I added new features)
- Num of POIs:  18
- Percentage of data points as NaNs: 35%

I also created a new dict with features as keys and the percentage of NaNs in the data set as values:

    {'to_messages': '39.86%', 'deferral_payments': '73.43%', 'expenses': '34.27%', 'poi_email_reciept_interaction': '0.00%', 'poi': '0.00%', 'deferred_income': '66.43%', 'email_address': '22.38%', 'from_poi_to_this_person': '39.86%', 'restricted_stock_deferred': '88.11%', 'shared_receipt_with_poi': '39.86%', 'loan_advances': '97.90%', 'from_messages': '39.86%', 'other': '36.36%', 'from_this_person_to_poi_fraction': '0.00%', 'director_fees': '88.81%', 'salary': '34.27%', 'bonus': '43.36%', 'total_stock_value': '12.59%', 'poi_email_interaction': '0.00%', 'from_this_person_to_poi': '39.86%', 'restricted_stock': '23.78%', 'adj_compensation': '0.00%', 'total_payments': '13.99%', 'long_term_incentive': '54.55%', 'from_poi_to_this_person_fraction': '0.00%', 'exercised_stock_options': '29.37%'}

The engineered features have no NaNs, and each raw email feature is missing for about 40% of people. The highest NaN rates are in the financial data, especially 'restricted_stock_deferred', 'loan_advances', and 'director_fees'.
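As a rough illustration of how such a per-feature NaN table can be produced, here is a minimal sketch. It assumes the data set is a dict of dicts in which missing values are stored as the string 'NaN'; the function name `nan_percentages` and the three-person sample are hypothetical and do not come from data_shape.py.

```python
def nan_percentages(data_dict):
    """Return {feature: percent of people whose value is the string 'NaN'}."""
    counts = {}
    n_people = len(data_dict)
    for person_features in data_dict.values():
        for feature, value in person_features.items():
            counts.setdefault(feature, 0)
            if value == 'NaN':
                counts[feature] += 1
    return dict((f, '{:.2f}%'.format(100.0 * c / n_people))
                for f, c in counts.items())

# Hypothetical three-person sample, not real Enron values.
sample = {
    'A': {'salary': 100, 'bonus': 'NaN'},
    'B': {'salary': 'NaN', 'bonus': 'NaN'},
    'C': {'salary': 300, 'bonus': 50},
}
result = nan_percentages(sample)
print(result)
```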

### Optimize Feature Selection/Engineering

At first I thought about ways to impute values for the NaNs, for example by using regression to predict them. However, as I looked over the financial PDF it hit me that the financial NaNs should be zeros: if 'salary' was NaN, that was because the person did not receive a salary, and I should not impute one for them.
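The zero-fill idea can be sketched as follows. This is only an illustration under the assumption that missing values are the string 'NaN'; the helper name `nan_to_zero`, the feature subset, and the sample values are hypothetical.

```python
def nan_to_zero(data_dict, features):
    """Return a copy of data_dict where 'NaN' in the given features becomes 0."""
    cleaned = {}
    for name, person in data_dict.items():
        row = dict(person)  # copy so the original record is left untouched
        for feature in features:
            if row.get(feature) == 'NaN':
                row[feature] = 0
        cleaned[name] = row
    return cleaned

# Hypothetical sample; only the listed financial features are zero-filled.
financial_features = ['salary', 'bonus']
sample = {'A': {'salary': 'NaN', 'bonus': 500, 'email_address': 'NaN'}}
cleaned = nan_to_zero(sample, financial_features)
print(cleaned['A'])
```

Non-financial fields such as 'email_address' are left as they were, since a zero has no meaning there.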

#### Create new features

I created 5 new features:
        
1. **'poi_email_interaction'** - People who received emails from POIs probably also responded to POIs and vice versa. I decided to combine emails (from POIs) and (to POIs) into one POI interaction.

2. **'poi_email_reciept_interaction'** - This feature takes poi_email_interaction and multiplies it by the number of receipts shared with POIs. My reasoning was that it is easier to share an email with a POI than a receipt, so the more shared receipts a person has, the more personal the relationship, and the greater the chance that the person is a POI as well.

3. **'adj_compensation'** - I created this feature by combining financial features that might make up an employee's total compensation from Enron. I added: 'salary', 'total_payments', 'exercised_stock_options', 'bonus', 'long_term_incentive', and 'total_stock_value'.

4. **'from_poi_to_this_person_fraction'** - This feature and the following one are the respective fractions of POI emails out of total emails. They tease out individuals who send or receive fewer emails overall but whose emails are mostly interactions with POIs.

5. **'from_this_person_to_poi_fraction'** - See the previous feature description above.
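A sketch of how the two fraction features might be computed is shown here. It assumes that emails from POIs are divided by 'to_messages' and emails to POIs by 'from_messages', and that missing data yields 0.0; both choices are assumptions for illustration, and the helper `poi_fraction` and the sample person are hypothetical.

```python
def poi_fraction(poi_messages, all_messages):
    """Fraction of messages that involve a POI; 0.0 when data is missing."""
    if poi_messages == 'NaN' or all_messages == 'NaN' or all_messages == 0:
        return 0.0
    return float(poi_messages) / all_messages

# Hypothetical person: received 120 emails, 30 of them from POIs;
# the sent-mail counts are missing.
person = {'from_poi_to_this_person': 30, 'to_messages': 120,
          'from_this_person_to_poi': 'NaN', 'from_messages': 'NaN'}

person['from_poi_to_this_person_fraction'] = poi_fraction(
    person['from_poi_to_this_person'], person['to_messages'])
person['from_this_person_to_poi_fraction'] = poi_fraction(
    person['from_this_person_to_poi'], person['from_messages'])
print(person['from_poi_to_this_person_fraction'])
```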

Example of code used to create the 'adj_compensation' feature:
        
        # Sum the non-NaN financial features that make up total compensation.
        if key in ('salary', 'total_payments', 'exercised_stock_options',
                   'bonus', 'long_term_incentive', 'total_stock_value') \
                and value != 'NaN':
            v['adj_compensation'] += value

#### Intelligently select features 

I first tried manually selecting features, based on evidence from a correlation matrix heatmap (see figure 1). I looked for features that were highly correlated with many other features and chose not to use those. I wanted features that were mostly uncorrelated with one another, since each of them then explains variance the others do not. Highly correlated features can unbalance the weights an algorithm assigns when it fits a separating hyperplane: several features that overlap in the same space add redundant weight in that direction, and this imbalance can overshadow genuinely significant features and make them look insignificant. So we want to limit collinearity in our model.

<img src="figure_1.png" alt="inline" style="width: 800px;">
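To make the idea behind the heatmap concrete, each cell is a pairwise correlation coefficient. The sketch below computes the Pearson correlation in plain Python; the function name `pearson` and the toy feature values are hypothetical, and data_viz.py itself may compute the matrix differently (for example through Pandas).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, not real Enron data.
salary = [1.0, 2.0, 3.0, 4.0]
bonus = [2.0, 4.0, 6.0, 8.0]      # exactly twice the salary: redundant feature
expenses = [3.0, 1.0, 4.0, 2.0]   # carries information salary does not

print(pearson(salary, bonus))
print(pearson(salary, expenses))
```

A feature like the toy 'bonus' here adds nothing that 'salary' does not already provide, which is the kind of feature the manual selection tried to leave out.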

I also tried adding SelectKBest to my pipeline to determine computationally which features to use. I searched over several parameters with GridSearch, for example f_classif with k=[2,4,6,8,10]; however, keeping everything else equal, I could not get precision and recall both above 0.30, as I could when manually selecting features. The features I manually chose were:

    features_list = ["poi", #'poi' must be 1st feature in list
                    "salary",
                    "total_payments",
                    "exercised_stock_options",
                    "restricted_stock",
                    "expenses",
                    "director_fees",
                    "deferred_income",
                    "from_poi_to_this_person_fraction",
                    "from_this_person_to_poi_fraction",
                    "poi_email_reciept_interaction"]

The list above was selected after looking at the heatmap correlation matrix and many pairplots.

The function and list below show what SelectKBest chose as the best features.

    def feature_select(clf, feature, label):
        '''
        Args:
            1. clf: select a classifier that selects best features.
            2. feature: features from data set.
            3. label: classification labels.

        Function that selects the best features to use and prints all the features 
        with scores and a boolean value if the feature should be used. 
        '''

        selector = clf.fit(feature, label)
        scores = selector.scores_
        support = selector.get_support()

        # features_list[0] is 'poi' (the label), so skip it when pairing
        # feature names with their scores and selection flags.
        best_features = zip(features_list[1:], scores, support)
        sorted_best_features = sorted(best_features, key=lambda tup: tup[1], reverse=True)

        for tup in sorted_best_features:
            print(tup)

    ### Use SelectKBest to select the 10 best features.
    feature_select(SelectKBest(f_classif), features, labels)

    ('exercised_stock_options', 24.815079733218194, True)
    ('total_stock_value', 24.182898678566879, True)
    ('adj_compensation', 22.292299851749735, True)
    ('bonus', 20.792252047181535, True)
    ('salary', 18.289684043404513, True)
    ('from_this_person_to_poi_fraction', 16.409712548035792, True)
    ('deferred_income', 11.458476579280369, True)
    ('long_term_incentive', 9.9221860131898225, True)
    ('restricted_stock', 9.2128106219771002, True)
    ('total_payments', 8.7727777300916756, True) 
    ('shared_receipt_with_poi', 8.589420731682381, False)
    ('loan_advances', 7.1840556582887247, False)
    ('expenses', 6.0941733106389453, False)
    ('from_poi_to_this_person', 5.2434497133749582, False)
    ('poi_email_reciept_interaction', 4.8964971483426085, False)
    ('poi_email_interaction', 4.8636818394122443, False)
    ('from_poi_to_this_person_fraction', 3.1280917481567192, False)
    ('from_this_person_to_poi', 2.3826121082276739, False)
    ('director_fees', 2.1263278020077054, False)
    ('to_messages', 1.6463411294420076, False)
    ('deferral_payments', 0.22461127473600989, False)
    ('from_messages', 0.16970094762175533, False)
    ('restricted_stock_deferred', 0.065499652909942141, False)


#### Properly Scale Features

I planned to scale features at the beginning of my pipeline, using either MinMaxScaler() or StandardScaler(). MinMaxScaler lets you choose a range to scale all features to, for example [0, 1], whereas StandardScaler standardizes each feature to mean 0 and unit variance (the values are not confined to a fixed range). I believe StandardScaler is better for certain methods, e.g. PCA. On the flip side, if a tree-based method such as a decision tree is used directly on the features, no feature scaling should be necessary.
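The difference between the two kinds of scaling can be sketched in plain Python. This is an illustration of the arithmetic only, not the scikit-learn implementation; the function names and the toy salary values are hypothetical. Standardization here divides by the population standard deviation.

```python
import math

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the smallest maps to low and the largest to high."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]

def standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    n = float(len(values))
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Toy values with one large entry, not real Enron data.
salaries = [0.0, 50.0, 100.0, 1000.0]
scaled = min_max_scale(salaries)
standardized = standardize(salaries)
print(scaled)
print(standardized)
```

With a skewed feature like this one, the min-max result stays inside [0, 1], while the standardized value for the largest entry is well above 1.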

One area of concern I had was the following. My pipeline:

    decision_tree_pipeline = Pipeline([
                ('pca', PCA()),
                ('dt', DecisionTreeClassifier(max_depth=10))
                ])

I achieved:

    Accuracy: 0.82807	 Precision: 0.35575	Recall: 0.35700	F1: 0.35638
    
However when I added:

    decision_tree_pipeline = Pipeline([
                ('scaler', StandardScaler()),
                ('pca', PCA()),
                ('dt', DecisionTreeClassifier(max_depth=10))
                ])

I received a score of: (I got a similar score when I used MinMaxScaler() as well)

    Accuracy: 0.80707	 Precision: 0.27053	Recall: 0.26350	F1: 0.26697

I was not sure why the score went down so much. I assumed PCA would perform better with data that was standardized. Perhaps this is because I am using a decision tree, or maybe it has to do with the parameter tuning of 'scaler' or 'pca'.

### Pick and Tune an Algorithm

#### Try Out Several Algorithms

I tried several 'out of the box' algorithms:

    # DT out of box
    test_pipeline_dt = Pipeline([	
                ('select', SelectKBest(k=3)),
                ('dt', DecisionTreeClassifier()),
                ])
    clf = test_pipeline_dt
    # 	Accuracy: 0.81260	Precision: 0.30663	Recall: 0.32150	F1: 0.31389	F2: 0.31841


    # GaussianNB out of box
    test_pipeline_gnb = Pipeline([	
                ('select', SelectKBest(k=3)),
                ('gnb', GaussianNB())
                ])
    clf = test_pipeline_gnb
    # Accuracy: 0.84147	Precision: 0.35462	Recall: 0.23050	F1: 0.27939	F2: 0.24785

    # AdaBoost out of box
    test_pipeline_adaboost = Pipeline([	
                ('select', SelectKBest(k=3)),
                ('adaboost', AdaBoostClassifier())
                ])
    clf = test_pipeline_adaboost
    # Accuracy: 0.83047	Precision: 0.33535	Recall: 0.27650	F1: 0.30310	F2: 0.28656

    # Linear SVC out of box
    test_pipeline_Lsvc = Pipeline([	
                ('select', SelectKBest(k=3)),
                ('Lsvc', svm.LinearSVC())
                ])
    clf = test_pipeline_Lsvc
    # Accuracy: 0.67860	Precision: 0.10874	Recall: 0.19600	F1: 0.13988	F2: 0.16889

    # Random Forest out of box
    test_pipeline_rf = Pipeline([	
                ('select', SelectKBest(k=3)),
                ('rf', RandomForestClassifier(n_estimators=10))
                ])
    clf = test_pipeline_rf
    # Accuracy: 0.85147	Precision: 0.38000	Recall: 0.18050	F1: 0.24475	F2: 0.20168

#### Pick an algorithm 

Going off F1 scores, decision trees seemed to work best as a baseline model. Because of this, I decided to forge down the Decision Tree path and begin to tune the model using PCA, GridSearch and playing with the features a bit.

#### Tune an algorithm


I had an issue trying to tune PCA. When using all default parameters I received precision and recall scores above 0.30. However, when I tried cycling through n_components in GridSearch (keeping everything else equal):

    param_grid={'pca__n_components': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

I received scores of:

    Accuracy: 0.80527	Precision: 0.25750	Recall: 0.24450	F1: 0.25083

This was with the same feature list as above. I was not sure why the scores would go down; I would have expected them at least to stay the same.

A Decision Tree parameter that I wanted to explore was 'max_depth'. Allowing a tree to grow as deep as it possibly can may cause overfitting. Forcing it to stop earlier probably lowers accuracy; however, it is a reasonable tradeoff. In this case we prefer to gain some bias in order to lose a bit of variance. I attempted to use grid search with

    {'dt__max_depth' : range(12)} 

GridSearchCV chose a 'max_depth' of None, giving me a score of

    Accuracy: 0.83807	Precision: 0.32080	Recall: 0.19200	F1: 0.24023	
    
However, when I manually tried out numbers I got a score of

    Accuracy: 0.82807	Precision: 0.35575	Recall: 0.35700	F1: 0.35638	
    
using 'max_depth' = 10. Again, I am not sure why GridSearch did not work here, as I kept everything else constant.

This was the best decision tree model I was able to come up with. I wound up manually testing the parameters, as I could not get GridSearch to behave. Perhaps I wrongly thought I could pass pipelines and GridSearchCV as a clf to the tester.py file.

    ['poi', 'salary', 'total_payments', 'exercised_stock_options', 'restricted_stock', 'expenses', 'director_fees', 'deferred_income', 'from_poi_to_this_person_fraction', 'from_this_person_to_poi_fraction', 'poi_email_reciept_interaction']
    
    GridSearchCV(cv=None, error_score='raise',
           estimator=Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('dt', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                max_features=None, max_leaf_nodes=None, min_samples_leaf=3,
                min_samples_split=10, min_weight_fraction_leaf=0.0,
                presort=False, random_state=None, splitter='best'))]),
           fit_params={}, iid=True, n_jobs=1, param_grid={},
           pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
        
        Accuracy: 0.84427	Precision: 0.40289	Recall: 0.34850	F1: 0.37373	F2: 0.35817
        Total predictions: 15000	True positives:  697	False positives: 1033	False negatives: 1303	True negatives: 11967

Tuning a model is important because the parameters you feed into each pipeline step (SelectKBest, PCA, DecisionTree etc.) all affect how your final classifier generalizes to unseen data. In order to tune the algorithm we have at our disposal methods such as cross validation and GridSearchCV. A good rule of thumb is to split your data randomly into train, validation and test sets. The train set is used to fit your model, the validation set helps you test and tune your algorithm, and the test set is data that the algorithm has not yet seen. The test set tells you how well your algorithm generalizes to unseen data. This process can be set up in many ways using scikit-learn's cross validation tools.

GridSearchCV allows us to try out many different versions of our algorithm. For example, PCA has a parameter called n_components, which takes an integer (1, 2, 3 ... n). I would not want to change that number by hand, rerun my algorithm, and keep track of the score each time; GridSearch can perform this tedious task for me. For example, I would pass { 'pca__n_components' : [ 2, 4, 6 ] }. GridSearchCV would then construct the pipeline with a PCA step for n_components of 2, 4, and 6, run all three, and report back which value performed best. Furthermore, you can extend this by passing GridSearchCV a pipeline of many steps (e.g. SelectKBest, PCA, DecisionTree) and trying out different parameters for all of them, though it may take some time.

For this project I decided not to do cross validation by itself, because the data set was heavily class imbalanced and there was just not a lot of data. What I did instead was use GridSearchCV and change its 'cv' parameter:
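Conceptually, a grid search expands the parameter dict into every combination and keeps the best-scoring one. The sketch below shows that idea in plain Python; it is not how GridSearchCV is implemented, and `expand_grid`, `grid_search`, and `toy_score` are hypothetical names. The toy scoring function stands in for "fit the pipeline with these parameters and cross-validate it".

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one {name: value} dict per combination in param_grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(param_grid, score_fn):
    """Score every combination and return (best_params, best_score)."""
    best_params, best_score = None, None
    for params in expand_grid(param_grid):
        score = score_fn(params)
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical score that peaks at n_components=4 and max_depth=10.
    return (-abs(params['pca__n_components'] - 4)
            - 0.1 * abs(params['dt__max_depth'] - 10))

grid = {'pca__n_components': [2, 4, 6], 'dt__max_depth': [5, 10]}
candidates = list(expand_grid(grid))
best_params, best_score = grid_search(grid, toy_score)
print(len(candidates), best_params)
```

The number of candidates is the product of the list lengths (here 3 x 2 = 6), which is why searching over many steps at once can take a long time.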

    cv = StratifiedShuffleSplit(labels_df, folds, random_state)
    clf = GridSearchCV(estimator, param_dict, cv=cv)

This way I am at least performing cross validation within grid search while the parameters are being tuned. StratifiedShuffleSplit is one form of cross validation.
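The idea behind a stratified shuffle split can be sketched in plain Python: shuffle the indices of each class separately and take the same fraction of each class for testing, so every split keeps roughly the original POI ratio. This is an illustration only and not the scikit-learn implementation; the function name, the 20% test fraction, and the seed are assumptions.

```python
import random

def stratified_shuffle_split(labels, n_splits, test_fraction, seed=42):
    """Yield (train_indices, test_indices) pairs that preserve class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_splits):
        train, test = [], []
        for label in sorted(by_class):
            shuffled = by_class[label][:]
            rng.shuffle(shuffled)
            # Take the same fraction of every class, at least one sample.
            n_test = max(1, int(round(test_fraction * len(shuffled))))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)

# 18 POIs and 125 non-POIs, matching the counts in this data set.
labels = [1] * 18 + [0] * 125
splits = list(stratified_shuffle_split(labels, n_splits=3, test_fraction=0.2))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each test set here holds 4 POIs and 25 non-POIs, so no split ends up with zero POIs, which a plain random split on such a small data set could easily produce.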


### Validate and Evaluate

#### Usage of Evaluation Metrics

While accuracy can be a good indicator of performance, it is not the only evaluation criterion we can use. Precision, Recall and F1 scores are also available. Precision and Recall are especially well suited for this data set. In our data there is a severe class imbalance (most data points are non-POIs). If we relied only on accuracy, our model could simply select non-POI for every prediction and return a fairly good accuracy. We would never know that no POIs were being correctly classified. Precision and Recall tell us how many POIs were correctly and incorrectly classified. Furthermore, we can tune our model to favor one over the other or attempt to strike a balance between the two.

Precision can be thought of as **exactness**. A high score tells us that we have fewer false positives. In this context, precision is the number of correctly identified POIs divided by the total number of people labeled as POIs (correctly identified POIs plus innocent people mislabeled as POIs). For example, if you identified 5 POIs correctly and labeled 5 non-POIs as POIs, then you would have a precision of 50%.
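The accuracy trap can be shown with the class counts from this data set (143 people, 18 POIs). The sketch below scores a hypothetical "classifier" that never flags anyone; it is a worked example, not output from tester.py.

```python
n_people, n_pois = 143, 18
labels = [1] * n_pois + [0] * (n_people - n_pois)
predictions = [0] * n_people  # always predict non-POI

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / float(n_people)

true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / float(n_pois)

print('Accuracy: %.3f  Recall: %.3f' % (accuracy, recall))
```

Accuracy comes out at 125/143, roughly 0.87, even though not a single POI is found.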

##### *Precision* = (*TP*) / (*TP* + *FP*)

Where:
- TP = True Positive
- FP = False Positive

Recall can be thought of as **completeness**. A high recall score shows us that we have fewer false negatives. Recall in this context tells us the percentage of actual POIs that the classifier identified as POIs; the rest are POIs our classifier missed. E.g. if we correctly identified 3 POIs but missed 1 POI, then our recall would be 75%.

##### *Recall* = (*TP*) / (*TP* + *FN*)

Where:
- TP = True Positive
- FN = False Negative

***F1 Scores*** provide you with a single value that reflects how well recall and precision are balanced. F1 is the harmonic mean of the two, so it is pulled toward the lower value: e.g. if recall was very high and precision was very low, then the F1 score would also be low.


For this prediction algorithm I would argue that we care more about recall, because we are concerned with identifying POIs for further investigation. We would rather catch all POIs, even if that means investigating some innocent people.
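As a check on the formulas, the sketch below recomputes the metrics from the confusion counts that tester.py reported for the final model (697 true positives, 1033 false positives, 1303 false negatives, 11967 true negatives). The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = metrics(tp=697, fp=1033, fn=1303, tn=11967)
print('Accuracy: %.5f  Precision: %.5f  Recall: %.5f  F1: %.5f'
      % (accuracy, precision, recall, f1))
```

The values agree with the reported scores for that model to five decimal places.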

#### Validation Strategy


Validation is very important in machine learning. For example, validating a classifier helps detect overfitting by checking performance on never-before-seen data, which helps ensure generalizability. This mindset goes back to the bias-variance tradeoff. Models may lean toward high bias (think linear regression) or high variance (think SVM). Often an overly biased model does not capture the complexities of the real world, and we thus underfit. On the flip side, a high-variance model that overfits the data may produce great results in training but will often be lackluster in the real world. Thus a balance is needed. Methods like cross validation, in all of its implementations, help alleviate this issue by employing separate training, validation and testing sets, so that models are trained on one set of data and tested on a pure, untouched set, which is usually smaller than the training set.

This methodology can run into issues, though, when your data set is not very large, like the Enron set, or when classes are imbalanced, also like the Enron set (~18 POIs vs. over a hundred non-POIs). To help remedy these issues we can employ strategies such as StratifiedShuffleSplit to repeatedly partition our data set into training and testing subsets, without having to permanently set aside a single training and testing set.

#### Algorithm Performance

The score for my final algorithm was:

    Accuracy: 0.84427	Precision: 0.40289	Recall: 0.34850	F1: 0.37373
    True positives:  697	False positives: 1033	False negatives: 1303	True negatives: 11967
    
The algorithm was:

    GridSearchCV(cv=None, error_score='raise',
           estimator=Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('dt', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                max_features=None, max_leaf_nodes=None, min_samples_leaf=3,
                min_samples_split=10, min_weight_fraction_leaf=0.0,
                presort=False, random_state=None, splitter='best'))]),
           fit_params={}, iid=True, n_jobs=1, param_grid={},
           pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
           
And feature list:

    ['poi', 'salary', 'total_payments', 'exercised_stock_options', 'restricted_stock', 'expenses', 'director_fees', 'deferred_income', 'from_poi_to_this_person_fraction', 'from_this_person_to_poi_fraction', 'poi_email_reciept_interaction']

I was able to get a decent precision score; however, I had trouble getting recall past the .40 mark. Overall I found it difficult to implement feature selection with SelectKBest and StandardScaler() with PCA. Attempting to reproduce results with a Pipeline() and GridSearchCV() also gave me some trouble. However, I was able to produce a simple model with precision and recall both above .30 using manual manipulation techniques.

### References

1. https://www.oreilly.com/learning/handling-missing-data imputation
2. http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues PCA
3. Python Machine Learning by Raschka
4. ISL by Hastie et al.
5. https://discussions.udacity.com/t/using-pipeline-precision-recall-have-been-decreased/45992/5
6. SkLearn documentation
7. PANDAS documentation
8. Seaborn documentation
9. MatPlotLib documentation
10. http://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
11. Many more Stack Exchange, Cross Validated, and Udacity forum posts.