# POI Summary

by Raul Maldonado

___

**Table of Contents**

1. **Summary**

2. **Concepts and Remarks**

    a. Concepts
    
    b. Remarks
    
3. **Resources**

4. **Data Dictionary**

___

## Summary 

### Initial Study
We imported the initial Enron dataset, "data_dict". This dataset has:

1. 21 Features
    
    a. 14 features are in the "Financial" category
    b. 6 features are in the "Email Features" category
    c. 1 feature is in the "POI Label" category
    
2. 146 Entries

3. Of all data points, 18 are labeled POI
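The counts above can be reproduced with a few lines of Python. The following is a minimal sketch, assuming `data_dict` maps each person's name to a dictionary of feature values; the three entries below are invented for illustration and are not taken from the real dataset.

```python
# Toy stand-in for the Enron `data_dict`: {person_name: {feature_name: value}}.
# Names and values are invented for illustration only.
data_dict = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 600000, "poi": True},
    "PERSON C": {"salary": 180000, "bonus": 90000, "poi": False},
}


def summarize(data):
    """Return (number of entries, number of distinct features, number of POI)."""
    feature_names = set()
    for features in data.values():
        feature_names.update(features)
    n_poi = sum(1 for features in data.values() if features.get("poi") is True)
    return len(data), len(feature_names), n_poi


print(summarize(data_dict))
```

Run against the full dataset, the same function should return the entry, feature, and POI counts listed above.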

Thereafter, we assumed that we could predict fraudulent persons of interest in the 2002 Enron crisis with two features, Salary and Bonus.

![Salary vs Bonus Scatterplot](../../Output-Files/Report_Images/SalaryvRaises_0.jpg)
> Described output for Salary vs Raises

>DescribeResult(nobs=95L, minmax=(array([ 477.,    0.]), array([ 26704229.,  97343619.])), 

>mean=array([  562194.29473684,  2049339.34736842]), variance=array([  7.37866138e+12,   9.95751069e+13]), skewness=array([ 9.53125149,  9.31531747]), kurtosis=array([ 89.22989094,  86.42271551]))

>Correlation Coefficient:  0.99321115087 
P-Value 1.06457686411e-88

We observe a strong correlation between the two features. However, we did not exclude the extreme outlier from this brief analysis.

Looking ahead, we need to examine how this extreme value arises, and determine whether we must remove it from our analysis.

### Upper High Value Investigation

After some investigation, we find:

1. Some of the abnormal data points come from persons of interest (POI). I.e., some abnormal points belong to corrupt Enron individuals.

2. Though these corrupt individuals affect our analysis, they are not the cause of the absurd outlying case. There is one more point we must consider in our extreme case...

The data points "TOTAL" and "THE TRAVEL AGENCY IN THE PARK" do not correspond to identifiable Enron employees, so we removed them.

Moreover, observing some of the data through a table, we noticed an Enron employee who was not recorded correctly. I.e., Eugene Lockhart's information is not available in the Enron dataset.

Because this employee's information was not properly tracked, we removed the entry.
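The removal step can be sketched as follows. This is an illustration under stated assumptions, not the project's exact code: the toy entries are invented (apart from the TOTAL salary and bonus, which are the maxima reported above), and the exact spelling of the Lockhart key is assumed.

```python
# Toy stand-in for the Enron `data_dict`; "PERSON A" is invented for illustration.
data_dict = {
    "PERSON A": {"salary": 250000, "bonus": 500000, "poi": False},
    "TOTAL": {"salary": 26704229, "bonus": 97343619, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

# Keys to drop. The spelling of the Lockhart key is an assumption.
NON_PERSON_KEYS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E"]


def remove_entries(data, keys):
    """Return a copy of `data` without the listed keys (missing keys are ignored)."""
    return {name: feats for name, feats in data.items() if name not in keys}


def all_missing(features, label_key="poi"):
    """True when every feature other than the label is the string 'NaN'."""
    return all(value == "NaN" for key, value in features.items() if key != label_key)


cleaned = remove_entries(data_dict, NON_PERSON_KEYS)
print(sorted(cleaned))
```

The hypothetical helper `all_missing` shows one way to spot entries, such as the Lockhart record, that carry no usable information.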

![Salary vs Raises Scatterplot (Removed Undefined Values)](../../Output-Files/Report_Images/SalaryvRaises_1.jpg)

> Described output for Salary vs Raises
DescribeResult(nobs=65L, minmax=(array([ 76399.,  70000.]), array([  440698.,  1500000.])), mean=array([ 263358.16923077,  685671.07692308]),

>variance=array([  4.27305011e+09,   1.25704380e+11]), skewness=array([ 0.56259694,  0.31932616]), kurtosis=array([ 1.16526257, -0.58418445]))

>Correlation Coefficient:  0.99321115087 
P-Value 1.06457686411e-88

We observe a correlation coefficient of 0.993 between Salary and Raises, with a very small probability that a relationship this strong arose by chance. I.e., we observe strong linearity between the two variables, with a p-value of $1.06 \times 10^{-88}$.
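For readers who want to see what the correlation coefficient measures, here is a minimal pure-Python sketch of the Pearson coefficient. The salary and bonus pairs are invented for illustration, and the function is a stand-in for whatever library routine produced the figures above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / float(len(xs))
    mean_y = sum(ys) / float(len(ys))
    # Covariance and variances, each without the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented salary/bonus pairs, for illustration only.
salaries = [100000.0, 200000.0, 300000.0, 400000.0]
bonuses = [150000.0, 420000.0, 580000.0, 810000.0]
print(round(pearson_r(salaries, bonuses), 3))
```

A value near 1 indicates a nearly linear increasing relationship; it says nothing by itself about whether either variable predicts POI status.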

However, these two variables do not answer our original question. Yes, it's nice to go through this analysis. But we do not answer the question, "Is there some indication of fraud from Enron employees in the given information?"

No. We have only done two things, which are far away from our goal.

A bivariate case was too naive, and we needed to expand our considerations to predict fraud in the Enron crisis. However, this scenario led us to observe that we needed to remove the outlier case "TOTAL" from our data.

The "TOTAL" entry introduced dependence amongst variables, and thus affected our findings. Moreover, the accumulation of all entries into this one row gave a misleading picture of which points were outliers. We had considered several entries to be outliers that affected our initial scatterplot visualization.

We took out the several non-significant outliers before factoring the data into our model. Our scatterplot visualization looked cleaner. However, our approach is still naive: we have not yet created a model to determine persons of interest.

### Added Features

Recall the **Feature Selection: A Discussion** section. We observed that the bivariate analysis of Salary versus Bonus was not sufficient for predicting fraudulent (POI) Enron employees, and that these features were not explicitly tailored to answering our question.

We will implement four new features to potentially identify Enron POI.

The **first two** additions are the salary and bonus ratios between an employee and the CEO, _Jeffrey K. Skilling_.

The implementation of these two new features can lead to the following:

- If the ratio in bonus between an employee and the CEO is close to the value 1, then we can suspect some type of corruption occurring.

- If the ratio in salary between an employee and the CEO is close to the value 1, then we can suspect some type of corruption occurring.

These features are linear transformations of the salary and bonus information presented. We factor in how close one's financial information is to the CEO's to determine a POI relationship.

The **remaining two** additional features are the inbound and outbound email ratios between an individual and POI.

The implementation of these two new features can lead to the following:

- If the ratio $\dfrac{\text{Received emails from POI}}{\text{All received Emails}}$ is closer to the value 1, we can suspect an individual is a POI.

- If the ratio $\dfrac{\text{Sent emails to POI}}{\text{All Sent Emails}}$ is closer to the value 1, we can suspect an individual is a POI.
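The four engineered features could be computed along the following lines. This is a sketch under assumptions: the raw key `to_messages`, the `"NaN"` string for missing values, the placeholder key `"CEO"`, and the choice of 0.0 for undefined ratios are all assumed for illustration, and the toy values are invented.

```python
def safe_ratio(numerator, denominator):
    """numerator / denominator, or 0.0 if either is missing ('NaN') or the denominator is 0."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return 0.0
    return float(numerator) / float(denominator)


def add_ratio_features(data, ceo_key):
    """Add the four ratio features to every entry of `data` (in place) and return it."""
    ceo = data[ceo_key]
    for features in data.values():
        features["ceo_to_employee_salary_ratio"] = safe_ratio(features["salary"], ceo["salary"])
        features["ceo_to_employee_bonus_ratio"] = safe_ratio(features["bonus"], ceo["bonus"])
        features["ratio_of_received_messages_to_poi"] = safe_ratio(
            features["from_poi_to_this_person"], features["to_messages"])
        features["ratio_of_sent_messages_to_poi"] = safe_ratio(
            features["from_this_person_to_poi"], features["from_messages"])
    return data


# Invented values; "CEO" is a placeholder key, not the real dataset key.
toy = {
    "CEO": {"salary": 1000, "bonus": 5000, "from_poi_to_this_person": 10,
            "to_messages": 100, "from_this_person_to_poi": 5, "from_messages": 50},
    "EMPLOYEE": {"salary": 500, "bonus": "NaN", "from_poi_to_this_person": 0,
                 "to_messages": "NaN", "from_this_person_to_poi": 2, "from_messages": 8},
}
add_ratio_features(toy, "CEO")
print(toy["EMPLOYEE"]["ceo_to_employee_salary_ratio"])
```

Mapping undefined ratios to 0.0 keeps every entry usable, at the cost of treating "no data" the same as "no interaction"; a different fill value may be more appropriate.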




### Enron Information Evaluation


![http://blog.presentationload.com/optimize-profit-margin-using-value-chain-analysis/](../../Images/analysis.jpg)



We select several numerical features. Moreover, we factor in the four engineered features discussed in the previous section.

We then implement a few algorithms to decide which top three to four features we should use for predicting people of interest in the Enron email scandal.

I.e., we zoom out to consider the following features, then zoom in on a precise feature selection to predict Enron fraudsters.

1. POI

2. CEO to Employee Bonus Ratio

3. Total Payments

4. Exercised Stock Options

5. CEO to Employee Salary Ratio

6. Restricted Stock

7. Shared Receipt with POI

8. FROM POI to This Person

9. From Messages

10. From this Person to POI

11. Ratio of Sent Messages to POI

12. Ratio of Received Messages to POI

13. Deferral Payments

14. Loan Advances

15. Restricted Stock Deferred
                 
16. Deferred Income

17. Expenses

18. Other

19. Long Term Incentive

20. Director Fees

> <a style = "color:red">Note:</a> **We do not** include Salary and Bonus in our feature selection process because of their multicollinearity with CEO to Employee Salary Ratio and CEO to Employee Bonus Ratio, respectively. Moreover, "From POI to This Person," "From Messages," and "From This Person to POI" are collinear with the ratios of interactions with POI.

> **Cutoff Criteria:** We select the top features of each algorithm implementation, with the cutoff being a score difference of 0.05 between two sequential features (e.g. Total Expenses scoring 0.10 and the next feature 0.05). [SelectKBest case: a difference of 5.0 or more.]

> Thereafter, we consider selecting the appropriate features for our final portion of our analysis from the score difference between the first and fifth ranked feature.



#### Import Clean Enron Data

We import the cleaned Enron dataset, and proceed with a full evaluation.

### Feature Selection

We implement two algorithms for our feature selection process. Thereafter, we implement one more algorithm. From the results of these implementations, we narrow down three to four features from a wide selection of features. 

We partition the labels and features dataset, holding out 30% for testing purposes.

We utilize the remaining 70% to train our model.
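A 70/30 hold-out split can be sketched in plain Python as below. The function name, the seed, and the toy rows are assumptions for illustration; a library routine was presumably used in the project itself.

```python
import random


def holdout_split(features, labels, test_fraction=0.3, seed=42):
    """Shuffle row indices once, then hold out `test_fraction` of the rows for testing."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Ten invented rows: one feature each, two positive labels.
features = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
features_train, features_test, labels_train, labels_test = holdout_split(features, labels)
print(len(features_train), len(features_test))
```

A plain shuffle like this does not guarantee that the rare POI class appears in both partitions, which is one reason stratified splitting is attractive for imbalanced data.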

#### Importance of Top Features with SelectKBest

We implement the [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) method for selecting features according to the **k** highest scores. These k factors will be one of our few considerations for feature selection.

The following scenario uses `k='all'`, so every pre-selected feature is scored and kept.

> Added: Per correction/comment, `k` looks at the features after score computation and keeps the top `k` features for the next fitting process.

Our findings within the SelectKBest implementation had Exercised Stock Options as our top feature recommendation.

Moreover, we observe that CEO to Employee Bonus Ratio and CEO to Employee Salary Ratio ranked within top 5 features for our model selection, as seen below:

1. Exercised Stock Options

2. Total Stock Value

3. CEO To Employee Salary Ratio

4. CEO To Employee Bonus Ratio

5. Total Payments

At the core of our question, these engineered variables do indeed help answer the question of finding people of interest from interactions with POI.

Though we had a run time of 0.214 seconds, note that SelectKBest is one of the few univariate feature selection processes within scikit-learn. SelectKBest selects the best features based on univariate statistical tests; here the scores are ANOVA F-values.
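To make the ANOVA score concrete, the following sketch computes the one-way ANOVA F statistic for a single feature grouped by class label, which is the quantity scikit-learn's `f_classif` scoring function reports per feature. The function name and the toy numbers are assumptions for illustration.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature, grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))
    n_total = len(values)
    n_groups = len(groups)
    grand_mean = sum(float(v) for v in values) / n_total
    ss_between = 0.0  # spread of the group means around the grand mean
    ss_within = 0.0   # spread of the samples around their own group mean
    for members in groups.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Invented feature values for three non-POI (0) and three POI (1) entries.
print(anova_f_score([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]))
```

A large F value means the class means differ by much more than the within-class scatter, so the feature separates the classes well when considered on its own.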

The following is an attempt at Extra Trees Implementation for attaining the top k recommended features.

#### Importance of Top Features with Extra Trees Classifier

Our findings within the Extra Trees implementation had Exercised Stock Options  as our top feature recommendation.

Moreover, we observe that CEO to Employee Bonus Ratio and CEO to Employee Salary Ratio ranked within the top 10 features for our model selection.

At the core of our question, these engineered variables do indeed help answer the question of finding people of interest from interactions with POI.

The Extra Trees classifier is a variant of the popular Random Forest algorithm. However, each step of the Extra Trees implementation has random decision boundaries selected, rather than the best one. Moreover, Extra Trees classifier is great for our numerical features!


With SelectKBest and the Extra Trees Classifier implemented, we carry on with two additional algorithms. These algorithms will not only provide a recommended list of features for determining POI, but will also let us evaluate the models' performance.

### Continued Feature Selection w/Model Performance Calibration

We continue our feature selection conversations. However, we verify model accuracy, precision, and recall scores for another algorithm--Decision Tree Algorithm

____

<a style ="color:red">Interlude</a> 

<u>Quick Definitions</u>

**Model Accuracy**

Accuracy, (TP+TN)/(TP+TN+FP+FN), is the fraction of all predictions that are correct. When TP < FP, accuracy will always increase when we change a classification rule to always output the “negative” category. Conversely, when TN < FN, the same will happen when we change our rule to always output “positive”.

> Note: The following section also provides the Precision, Recall, and F1-Score related to our implemented models.

>In our case,

>**Precision** (TP)/(TP+FP) measures whether the positive examples predicted by our model were correct. In our case: of all Enron employees classified as POI, what percentage were classified correctly?

>**Recall** (TP)/(TP+FN) measures whether we have predicted all positive examples in the data. In our case: of all actual POI, what percentage did we correctly identify?


>where TP:=True Positives, FN:=False Negatives, FP:=False Positives, TN:=True Negatives, as seen below



| True State (rows) / Diagnosis (columns) | NOT POI | POI |
|---------------------:|---------|-----|
|              NOT POI | TN      | FP  |
|                  POI | FN      | TP  |
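The definitions above translate directly into code. The sketch below uses hypothetical counts (a 39-person test set containing 5 actual POI) chosen purely for illustration; they are not taken from the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts: 1 POI found, 2 false alarms, 4 POI missed, 32 correct rejections.
acc, prec, rec, f1 = classification_metrics(tp=1, fp=2, fn=4, tn=32)
print("accuracy=%.3f precision=%.2f recall=%.2f f1=%.2f" % (acc, prec, rec, f1))

# Baseline that labels everyone "not POI": the 2 false alarms become correct,
# the 1 true positive is lost.
baseline_acc = classification_metrics(tp=0, fp=0, fn=5, tn=34)[0]
print("always-negative accuracy=%.3f" % baseline_acc)
```

Because TP < FP in this example, the always-negative baseline has higher accuracy than the classifier even though it finds no POI at all, which is why precision and recall are the more informative metrics here.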


<a style ="color:red">End of Interlude</a>
___


#### Decision Tree
So, we considered all features, and then zoomed in on preferred features for prediction. We could have manually made several attempts to create several models. Instead, we processed our features with a more advanced statistical method. Utilizing the Decision Tree, we determined a ranking of the top features to consider for predicting people of interest.

Ranking the importance of all recommended features, the top features are:

1. Total Stock Value

2. CEO to Employee Bonus Ratio

3. Expenses

Our top features share only one feature with the results of our other feature selection methods. This feature is CEO to Employee Bonus Ratio.

Moreover, this scenario has a precision of 0.25 and recall of 0.40 for determining POI. This is partially good to see; however, we should observe that this, and the past outcomes, are not optimal for model implementation. I.e., notice that we ran the Decision Tree algorithm, and the previous algorithms as well, with default parameters. As a result, we do not know whether our feature selection process was optimal.


We address this through **hypertuning**, alongside **cross validation**.

Therefore, we proceed with validating these features with another feature selection process.


**Comments**

In the "Model Metrics and Ranking" source, we observe the following:


>"Calibration dramatically improves the performance
of boosted trees, SVMs, boosted stumps, and
Naive Bayes, and provides a small, but noticeable improvement
for random forests. Neural nets, bagged
trees, memory based methods, and logistic regression
are not significantly improved by calibration.
With excellent performance on all eight metrics, calibrated
boosted trees were the best learning algorithm
overall. 

>Random forests are close second, followed by
uncalibrated bagged trees, calibrated SVMs, and uncalibrated
neural nets. The models that performed
poorest were naive bayes, logistic regression, decision
trees, and boosted stumps. Although some methods
clearly perform better or worse than other methods
on average, there is significant variability across the
problems and metrics. Even the best models sometimes
perform poorly, and models with poor average."


### Model Accuracies with Hypertuning

Hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. Careful tuning helps ensure the model does not, for example, overfit its data, and makes better model performance more likely.

We implement hypertuning using the Decision Tree method. To split the data, we use the Stratified Shuffle Split method to obtain train/test splits. We fine-tune the sample size of our training data and observe how accuracy changes as a result.
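The idea behind a stratified shuffle split can be illustrated without any library: shuffle each class separately and take the same fraction of each for testing, so the POI share stays roughly constant in every split. This is a simplified sketch with assumed names and parameters, not scikit-learn's implementation.

```python
import random


def stratified_shuffle_split(labels, n_splits=3, test_fraction=0.3, seed=11):
    """Yield (train_indices, test_indices) pairs that preserve each class's share."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for label in sorted(by_class):
            members = by_class[label][:]  # copy so the grouping is never mutated
            rng.shuffle(members)
            n_test = int(round(test_fraction * len(members)))
            test_idx.extend(members[:n_test])
            train_idx.extend(members[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Invented, imbalanced labels: 10 positives among 100 entries.
labels = [1] * 10 + [0] * 90
for train_idx, test_idx in stratified_shuffle_split(labels):
    n_pos = sum(labels[i] for i in test_idx)
    print("%d train, %d test, %d positives in test" % (len(train_idx), len(test_idx), n_pos))
```

With only a handful of positive examples, an unstratified split can easily produce a test set with no POI at all, which would make precision and recall undefined.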


#### Feature Selection Breakdown for New Feature List


#### Selected Features from Feature Selection Process and Data Partitioning

We selected "total_payments", "ceo_to_employee_bonus_ratio", "ratio_of_sent_messages_to_poi", "ceo_to_employee_salary_ratio", and "exercised_stock_options" as our features to implement in this hypertuning process.

These features were selected on the basis of being top performers across the three algorithms implemented above.


##### Naive Bayes Algorithm with Manual Tuning Cases

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. More information can be found in my [Machine Learning Basics Documentation](https://github.com/CloudChaoszero/General-Statistics-and-ML-Concepts/blob/master/ML/Machine_Learning_Notes.ipynb).

The following are several manual tuning cases for the Naive Bayes implementation. In each scenario, we modify the following classifier's parameter:

1. Prior
    
    > Note: Priors are the probabilities of the classes. If specified the priors are not adjusted according to the data.

| Metrics |  Gaussian Naive Bayes Algorithm Time before Tuning(seconds)  |Accuracy before Tuning|Precision Before Tuning|Recall before Tuning|F1-Score before Tuning|Gaussian Naive Bayes Algorithm Time after Tuning (seconds) |Accuracy after Tuning|Precision after Tuning|Recall after Tuning|F1-Score after Tuning|
|--------|--------|--------|--------|--------|--------|--------|--------|------|------|----|
|  Default Settings | 0.005| 0.846| 0.33| 0.20| 0.25| **0.005**| **0.846**| **0.33**| **0.20**| **0.25**|
|Non-POI 90% | 0.005| 0.846| 0.33| 0.20| 0.25| **0.002**| **0.128**|**0.13**| **1.00**| **0.23**|
|Non-POI 60%| 0.005| 0.846| 0.33| 0.20| 0.25| **0.001**| **0.795**| **0.20**| **0.20**| **0.20**|
|Non-POI 10% | 0.005| 0.846| 0.33| 0.20| 0.25| **0.001**| **0.846**| **0.33**| **0.20**| **0.25**|

We observe the Gaussian Naive Bayes implementation is best with default settings, or prior probability of Non-POI at/around 10%.


##### Decision Tree Algorithm with Manual tuning Cases

The following are several manual Hypertuning cases of the Decision Tree Classifier. In each scenario, we modify the following classifier's parameters:

1. Max Depth

2. Min Samples Split

3. Min Samples Leaf 

4. Min Weight Fraction Leaf

| Metrics |  Decision Tree Algorithm Time before Tuning(seconds)  |Accuracy before Tuning|Precision Before Tuning|Recall before Tuning|F1-Score before Tuning| Decision Tree Algorithm Time after Tuning (seconds) |Accuracy after Tuning|Precision after Tuning|Recall after Tuning|F1-Score after Tuning|
|--------|--------|--------|--------|--------|--------|--------|--------|------|------|----|
|  Default Settings | 0.003| 0.821| 0.33| 0.40| 0.36| **0.003**| **0.821**| **0.33**| **0.40**| **0.36**|
|min_weight_fraction_leaf=0.0001 | 0.003| 0.821| 0.33| 0.40| 0.36| **0.002**| **0.821**|**0.33**| **0.40**| **0.36**|
|min_samples_split=3| 0.003| 0.821| 0.33| 0.40| 0.36| **0.001**| **0.8210**| **0.33**| **0.40**| **0.36**|
|min_samples_leaf=5 | 0.003| 0.821| 0.33| 0.40| 0.36| **0.002**| **0.8462**| **0.33**| **0.20**| **0.25**|
|max_depth = 7 | 0.003| 0.821| 0.33| 0.40| 0.36| **0.001**| **0.821**| **0.33**| **0.40**| **0.36**|


The bold values in the last five columns are the results for each tuned parameter. The first five columns are the default Decision Tree model output.

We observe our model being more optimal for an alteration in the minimum weight fraction leaf, the maximum depth, or the minimum samples per leaf.

Though we could have done additional hypertuning, we carry on with the process.

>If the reader would like to learn more, they can download the project and data, and continue with this process themselves.

By tuning these parameters, the tester can:

1. Evaluate our data faster
2. Confirm optimal accuracy

##### Automated HyperTuning

Going through this manual tuning is both tedious and time consuming. Luckily, there exist implementations that avoid such hassles and find optimal model parameters for some given features. The following is a Pipeline and Grid Search combination to implement hypertuning.

We implement cross validation and testing to identify optimal model parameters for the Gaussian Naive Bayes and Decision Tree implementations.

Gaussian Naive Bayes HyperTuning

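```python
# Assumed imports: legacy scikit-learn (< 0.20) module paths, matching the
# StratifiedShuffleSplit(labels, n_iter, ...) call signature used below.
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

folds = 100
kbest = SelectKBest()

# A stratified shuffle split is used here to counter the effects of the class imbalance problem
sss = StratifiedShuffleSplit(labels, folds, random_state=11)

# Start from a default Gaussian Naive Bayes classifier; its priors are tuned by the grid search.
gnb = GaussianNB()

# A pipeline is used to chain SelectKBest and Gaussian Naive Bayes
pipeline = Pipeline([('kbest', kbest), ('gnb', gnb)])
param_grid = {'kbest__k': ['all', 2, 3, 5],
              'gnb__priors': [None, [.1, .9], [.9, .1], [.5, .5]]}
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='f1',
                           cv=sss,
                           verbose=1)
grid_search.fit(features, labels)
clf = grid_search.best_estimator_
print(clf)
```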

> Fitting 100 folds for each of 16 candidates, totalling 1600 fits

> Pipeline(steps=[

> ('kbest', SelectKBest(k='all', score_func=<function f_classif at 0x00000000099B1438>)),

> ('gnb', GaussianNB(priors=None))])

> [Parallel(n_jobs=1)]: Done 1600 out of 1600 | elapsed:    8.8s finished

Our results show that the Naive Bayes classifier parameter should be tuned to the following value:

- Priors: None

Decision Tree HyperTuning

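```python
# Assumed imports: legacy scikit-learn (< 0.20) module paths, matching the
# StratifiedShuffleSplit(labels, n_iter, ...) call signature used below.
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

folds = 100
kbest = SelectKBest()

# A stratified shuffle split is used here to counter the effects of the class imbalance problem
sss = StratifiedShuffleSplit(labels, folds, random_state=11)

# Start from a default decision tree classifier; the grid search fine-tunes it.
dtree = DecisionTreeClassifier()

# A pipeline is used to chain the SelectKBest and Decision Tree
pipeline = Pipeline([('kbest', kbest), ('dtree', dtree)])
# 'dtree__min_samples_leaf' was listed twice; only the last value, [5, 6, 7],
# took effect, so the shadowed [3, 5] entry is dropped.
param_grid = {'kbest__k': ['all', 3],
              'dtree__min_samples_split': [6, 7, 8],
              'dtree__max_depth': [3, 4, 5],
              'dtree__min_samples_leaf': [5, 6, 7],
              'dtree__min_weight_fraction_leaf': [0.001, 0.02, 0.04]}
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='f1',
                           cv=sss,
                           verbose=1)
grid_search.fit(features, labels)
clf = grid_search.best_estimator_
print(clf)
```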

> Fitting 100 folds for each of 162 candidates, totalling 16200 fits

> Pipeline(steps=[('kbest', SelectKBest(k='all', score_func=\<function f_classif at 0x00000000099B1438>)),

> ('dtree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,

> max_features=None, max_leaf_nodes=None,

> min_impurity_split=1e-07, min_samples_leaf=7,

> min_samples_split=7, min_weight_fraction_leaf=0.04,

> presort=False, random_state=None, splitter='best'))])

> [Parallel(n_jobs=1)]: Done 16200 out of 16200 | elapsed:  1.8min finished

Our results show that the Decision Tree Classifier parameters should be tuned to the following values:

- class_weight=None
- criterion='gini'
- max_depth=5
- max_features=None
- max_leaf_nodes=None
- min_impurity_split=1e-07
- min_samples_leaf=7
- min_samples_split=7
- min_weight_fraction_leaf=0.04
- splitter='best'
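As a sanity check on the two grid searches, the number of candidates an exhaustive search tries is the product of the number of values per parameter. The sketch below recomputes the counts reported in the logs; the helper name `count_candidates` is hypothetical.

```python
from itertools import product


def count_candidates(param_grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return sum(1 for _ in product(*param_grid.values()))


gnb_grid = {
    "kbest__k": ["all", 2, 3, 5],
    "gnb__priors": [None, [0.1, 0.9], [0.9, 0.1], [0.5, 0.5]],
}
dtree_grid = {
    "kbest__k": ["all", 3],
    "dtree__min_samples_split": [6, 7, 8],
    "dtree__max_depth": [3, 4, 5],
    "dtree__min_samples_leaf": [5, 6, 7],
    "dtree__min_weight_fraction_leaf": [0.001, 0.02, 0.04],
}

folds = 100
for name, grid in [("GaussianNB", gnb_grid), ("DecisionTree", dtree_grid)]:
    n_candidates = count_candidates(grid)
    print("%s: %d candidates, %d fits" % (name, n_candidates, n_candidates * folds))
```

The counts grow multiplicatively with every added parameter, which explains why the Decision Tree search took minutes while the Naive Bayes search took seconds.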

Comparing the two models, we observe that the Decision Tree Classifier brings a 0.20 increase in precision for classifying POI, compared to the Gaussian Naive Bayes implementation.

##### Validating the Decision Tree Algorithm with Hypertuning 

We validate how well our Decision Tree Algorithm performed with HyperTuning.  

```python
from __future__ import print_function  # lets the print() calls below run on Python 2 as well

import time

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

t0 = time.time()

# min_impurity_split and presort exist only in older scikit-learn releases
clf_DTC1 = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=6, min_weight_fraction_leaf=0.02,
            presort=False, random_state=None, splitter='best')
clf_DTC1.fit(features_train, labels_train)
pred_decisionTree = clf_DTC1.predict(features_test)
score = clf_DTC1.score(features_test, labels_test)
print('Accuracy before tuning: ', score)
# scikit-learn metrics expect (y_true, y_pred), in that order
decTree_precision_1 = metrics.precision_score(labels_test, pred_decisionTree)
decTree_recall_1 = metrics.recall_score(labels_test, pred_decisionTree)
print('Precision before tuning: ', decTree_precision_1)
print('Recall before tuning: ', decTree_recall_1)
print("Decision tree algorithm time:", round(time.time() - t0, 3), "s")

print()
### use manual tuning parameter min_samples_split
t0 = time.time()
clf = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=6, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("Decision tree algorithm, with tuning, time in %0.3fs" % (time.time() - t0))

acc = metrics.accuracy_score(labels_test, pred)
print("Accuracy after tuning: ", acc)

# precision: true positives out of all predicted positives (true + false positives)
print('After tuning precision: ', metrics.precision_score(labels_test, pred))

# recall: true positives out of all actual positives (true positives + false negatives)
print('After tuning recall: ', metrics.recall_score(labels_test, pred))
print(classification_report(labels_test, pred))
```



Utilizing the recommended parameters and feature_list features that we manually selected, we receive the following output from tester.py:

![Tester.py](../../Images/tester.jpg)

We have recall and precision scores of .30+. Moreover, our accuracy is ~0.86!

## Conclusion and Remarks

### Conclusion

$1)$ <span style="color:purple">_Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?_</span>
___


We are trying to identify indicators of a Person of Interest (POI) among Enron employees in the 2002 Enron fraud case.


We define a POI as someone who was indicted or settled in the case without admitting guilt. We utilized several techniques to come closer to answering the previously mentioned goal.

Using the inter-quartile range, we removed obscure data from our analysis and imputed some information.

> We removed outlier entries such as "TOTAL" and two other entries due to "NaN" or absurd information recorded. This removal of distorting data enabled us to proceed with the analysis in better fashion than before.

In the case of removing outliers, I set a condition for extracting or neglecting outlier information as we update the dataset.

In the case of imputing data, I created a function to replace existing "NaN" values with the integer 0. Thereafter, we verified the changes in our introductory data analysis of Salary versus Bonus.
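The imputation step might look like the following. This is a sketch of the idea rather than the project's actual function; the name `impute_missing` and the toy entries are assumptions, and missing values are assumed to be stored as the string `"NaN"`.

```python
def impute_missing(data, fill_value=0):
    """Return a copy of `data` with every 'NaN' string replaced by `fill_value`."""
    return {
        name: {key: (fill_value if value == "NaN" else value)
               for key, value in features.items()}
        for name, features in data.items()
    }


# Invented entries, for illustration only.
raw = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 600000, "poi": True},
}
clean = impute_missing(raw)
print(clean["PERSON A"]["bonus"], clean["PERSON B"]["salary"])
```

Returning a copy leaves the raw dictionary untouched, so the imputed and original data can be compared side by side.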
___

$2)$ <span style="color:purple">_What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values._</span>

Initially, we used several features to proceed in the feature selection process. These features were the following:

>1. poi

>2. salary

>3. total_payments

>4. exercised_stock_options

>5. bonus

>6. restricted_stock

>7. shared_receipt_with_poi

>8. from_poi_to_this_person

>9. from_messages

>10. from_this_person_to_poi

>11. ceo_to_employee_bonus_difference

>12. ceo_to_employee_salary_difference


During this broad selection, we considered adding four features into our dataset.


The **first two** additions are the salary and bonus ratios between an employee and the CEO, _Jeffrey K. Skilling_.

The implementation of these two new features can lead to the following:

- If the ratio in bonus between an employee and the CEO is close to the value 1, then we can suspect some type of corruption occurring.

- If the ratio in salary between an employee and the CEO is close to the value 1, then we can suspect some type of corruption occurring.

These features are linear transformations of the salary and bonus information presented. We factor in how close one's financial information is to the CEO's to determine a POI relationship.

The **remaining two** additional features are the inbound and outbound email ratios between an individual and POI.

The implementation of these two new features can lead to the following:

- If the ratio $\dfrac{\text{Received emails from POI}}{\text{All received Emails}}$ is closer to the value 1, we can suspect an individual is a POI.

- If the ratio $\dfrac{\text{Sent emails to POI}}{\text{All Sent Emails}}$ is closer to the value 1, we can suspect an individual is a POI.


Thereafter, we proceeded through three iterations of a _feature selection_ process, where each case utilized a different algorithm.


**Feature Selection with SelectKBest Procedure**

Utilizing the SelectKBest approach, we determined the top features to be:

1. Exercised Stock Options

2. CEO To Employee Salary Ratio

3. CEO To Employee Bonus Ratio


**Feature Selection with Extra Trees Implementation**

Ranking the importance of all recommended features, the top features are:

1. CEO To Employee Bonus Ratio

2. Total Payments

3. Exercised Stock Options

**Feature Selection with Decision Tree Algorithm**

The top important features from feature selection with the Decision Tree process are:

1. CEO to Employee Bonus Ratio

2. Total Payments

3. From POI to This Person

However, there is not sufficient evidence of several recurring features for optimal model consideration. In this latest case, we find no reassurance that we are choosing fit features for our models.

So, with some brief testing, we settled on our selected features for hypertuning testing using the Decision Tree method.

To split the data, we implement the K-fold method, another way to obtain train/test splits. We fine-tune the sample size of our training data and observe how accuracy changes as a result.

We hope to evaluate several scenarios, and select optimal conditions for algorithm processing.


**Feature Selection: Final Decision**


The top recurring features to select from were, in order of frequency:

1. CEO To Employee Bonus Ratio

2. CEO To Employee Salary Ratio

3. Exercised Stock Options

Additionally, we include two additional engineered features:

4. ratio_of_received_messages_to_poi

5. ratio_of_sent_messages_to_poi



___
$3)$ <span style="color:purple">_What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?_</span>

We utilized the Decision Tree algorithm. Other algorithms utilized were Extra Trees and SelectKBest. Each of them ran quickly. However, the results were inconsistent and mixed. Moreover, there were faults in each scenario. Still, the Decision Tree worked more favorably for obtaining optimal features for our classification problem.

The following are the outputs for the Decision Tree implementation with manual tuning:

| Tuned Parameter | Time Before Tuning (s) | Accuracy Before Tuning | Precision Before Tuning | Recall Before Tuning | F1-Score Before Tuning | Time After Tuning (s) | Accuracy After Tuning | Precision After Tuning | Recall After Tuning | F1-Score After Tuning |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Default Settings | 0.004 | 0.8975 | 0.33 | 0.40 | 0.36 | **0.004** | **0.8975** | **0.33** | **0.40** | **0.36** |
| min_weight_fraction_leaf=0.0001 | 0.004 | 0.8975 | 0.33 | 0.40 | 0.36 | **0.005** | **0.8975** | **0.33** | **0.40** | **0.36** |
| min_samples_split=3 | 0.004 | 0.8975 | 0.33 | 0.40 | 0.36 | **0.004** | **0.8210** | **0.67** | **0.40** | **0.50** |
| min_samples_leaf=5 | 0.004 | 0.8975 | 0.33 | 0.40 | 0.36 | **0.001** | **0.8205** | **0.25** | **0.20** | **0.22** |
| max_depth=2 | 0.004 | 0.8975 | 0.33 | 0.40 | 0.36 | **0.005** | **0.8975** | **0.67** | **0.40** | **0.50** |


The last five columns, highlighted in bold, show the results after tuning the parameter named in each row. The first five metric columns show the output of the default Decision Tree model.

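In this table, the model improves when min_samples_split or max_depth is changed: precision rises from 0.33 to 0.67 and F1-score from 0.36 to 0.50, with recall unchanged at 0.40 (although accuracy drops for min_samples_split=3). Changing min_weight_fraction_leaf leaves the metrics unchanged, and min_samples_leaf=5 lowers them.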

> Note: Readers who would like to learn more about manually tuning these algorithms can download the project and data and continue this manual process themselves.

By tuning these parameters, the tester can:
1. Evaluate the data faster
2. Confirm optimal accuracy

Using Pipeline and GridSearchCV, the search fit 1000 folds for each of 432 candidates, totalling 432000 fits, and selected the following estimator:

> Pipeline(steps=[('kbest', SelectKBest(k='all', score_func=\<function f_classif at 0x00000000099DB518>)), 

> ('dtree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,

> max_features=None, max_leaf_nodes=None,

> min_impurity_split=1e-07, min_samples_leaf=5,

> min_samples_split=7, min_weight_fraction_leaf=0.001,

> presort=False, random_state=None, splitter='best'))])

Again, using the recommended parameters and the feature_list features that we manually selected, we receive the following output from tester.py:

![Tester python file](../../Images/tester.jpg)

Both our recall and precision scores exceed 0.30. Moreover, our accuracy is ~0.86!
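
For readers who want to see how such a search might be wired up, here is a minimal sketch. It is not the project's actual code: the data is synthetic (not the Enron dataset), the parameter grid is a small hypothetical one, and scoring on F1 with `StratifiedShuffleSplit` is an assumption rather than the setup used above. Only the step names `kbest` and `dtree` are taken from the printed pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the Enron dataset).
X, y = make_classification(n_samples=140, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=0)

# Feature selection followed by the classifier, as in the printed pipeline.
pipeline = Pipeline(steps=[
    ("kbest", SelectKBest(score_func=f_classif)),
    ("dtree", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical grid: 2 * 2 * 2 = 8 candidates.
param_grid = {
    "kbest__k": [3, "all"],
    "dtree__max_depth": [2, 3],
    "dtree__min_samples_split": [3, 7],
}

# Stratified splits keep the class ratio similar in every fold.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Putting `SelectKBest` inside the pipeline means feature selection is refit on each training fold, so the held-out fold never influences which features are chosen.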
___

$4)$ <span style="color:purple">_What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune?_</span>

Tuning an algorithm's parameters means systematically searching for the parameter values that yield the best-performing predictive model. If we neglect or undervalue tuning, we may fail to find a model that is both precise and accurate on future cases outside our past observations.

Utilizing the Decision Tree implementation we tuned the following parameters:

1. max_depth

2. max_leaf_nodes

3. min_samples_leaf

4. min_samples_split

5. min_weight_fraction_leaf


We selected these parameters based on computation and processing speed. With a better computing system (#IWishIhadCloudComputing), our tuning could have been broader and deeper.

The following details our hyperparameter tuning with Pipeline and GridSearchCV.

By tuning these parameters, the tester can:
1. Evaluate the data faster
2. Confirm optimal accuracy

Using Pipeline and GridSearchCV, the search fit 100 folds for each of 162 candidates, totalling 16200 fits, and selected the following estimator:

> Pipeline(steps=[('kbest', SelectKBest(k='all', score_func=\<function f_classif at 0x00000000099DB518>)), 

> ('dtree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,

> max_features=None, max_leaf_nodes=None,

> min_impurity_split=1e-07, min_samples_leaf=5,

> min_samples_split=7, min_weight_fraction_leaf=0.001,

> presort=False, random_state=None, splitter='best'))])

Again, using the recommended parameters and the feature_list features that we manually selected, we receive the following output from tester.py:

![Tester python file](../../Images/tester.jpg)

Both our recall and precision scores exceed 0.30. Moreover, our accuracy is ~0.86!
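
The fit count reported for the grid search is the number of parameter combinations multiplied by the number of folds. The sketch below reproduces that arithmetic for the five tuned parameters. The value lists are hypothetical: only their lengths (3 * 3 * 3 * 3 * 2 = 162) are chosen to match the reported candidate count, and the actual grid used in the project may differ.

```python
from itertools import product

# Hypothetical value lists for the five tuned parameters; only the list
# lengths are chosen so that the candidate count comes out to 162.
param_grid = {
    "max_depth": [2, 3, None],
    "max_leaf_nodes": [None, 5, 10],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 5, 7],
    "min_weight_fraction_leaf": [0.0, 0.001],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_folds = 100
total_fits = len(candidates) * n_folds

print(len(candidates))  # 162
print(total_fits)       # 16200
```

Because the count grows multiplicatively, adding one more value to a single parameter list here would add 54 or 81 candidates, which is why compute time limits how broad the grid can be.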
___

$5)$ <span style="color:purple">_What is validation, and what’s a classic mistake you can make if you do it wrong?_</span>

Validation is a procedure for testing algorithm performance on data that was held out from training. In our case, we test our trained model's precision, accuracy, and recall rates.

A classic mistake is to validate on the same data used for training, which overstates performance. The model then guesses future cases incorrectly, which can affect production or company performance.
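
One well-known way validation goes wrong is scoring a model on the same data it was trained on. The toy sketch below makes the effect visible with a deliberately bad model that only memorizes its training labels. Everything here is made up for illustration: the `MemorizingClassifier` class, the 100 synthetic samples, and the 70/30 split are assumptions and have nothing to do with the project's data.

```python
from collections import Counter

class MemorizingClassifier:
    """Toy model: memorizes training labels, otherwise guesses the majority class."""

    def fit(self, ids, labels):
        self.lookup = dict(zip(ids, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, ids):
        return [self.lookup.get(i, self.default) for i in ids]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 100 samples, every fifth one is a positive (an imbalanced label).
ids = list(range(100))
labels = [1 if i % 5 == 0 else 0 for i in ids]

train_ids, test_ids = ids[:70], ids[70:]
train_labels, test_labels = labels[:70], labels[70:]

model = MemorizingClassifier().fit(train_ids, train_labels)

train_accuracy = accuracy(train_labels, model.predict(train_ids))
test_accuracy = accuracy(test_labels, model.predict(test_ids))

print(train_accuracy)  # 1.0
print(test_accuracy)   # 0.8
```

Scored on its own training data the model looks perfect, while on held-out data it only reaches 0.8, and it does so by predicting the majority class for every sample, so every held-out positive is missed.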


$6)$ <span style="color:purple">_Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]_</span>



Reiterating,


the search fit 100 folds for each of 162 candidates, totalling 16200 fits, and selected the following estimator:

> Pipeline(steps=[('kbest', SelectKBest(k='all', score_func=\<function f_classif at 0x00000000099DB518>)), 

> ('dtree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,

> max_features=None, max_leaf_nodes=None,

> min_impurity_split=1e-07, min_samples_leaf=5,

> min_samples_split=7, min_weight_fraction_leaf=0.001,

> presort=False, random_state=None, splitter='best'))])

Again, using the recommended parameters and the feature_list features that we manually selected, we receive the following output from tester.py:

![Tester python file](../../Images/tester.jpg)

Both our recall and precision scores exceed 0.30. Moreover, our accuracy is ~0.86!

> Note: This section also describes the Precision, Recall, and F1-Score metrics related to our implemented models.

>In our case,

>**Precision**, TP/(TP+FP), measures whether the positive examples predicted by our model were correct. Here: of all Enron employees classified as POI, what percentage truly are POI?

>**Recall**, TP/(TP+FN), measures whether we have found all positive examples in the data. Here: of all actual POI, what percentage did the model correctly identify?


>where TP := True Positives, FN := False Negatives, FP := False Positives, and TN := True Negatives.
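
To make the formulas concrete, the sketch below computes precision, recall, and F1-score (the harmonic mean of precision and recall) from confusion-matrix counts. The counts are hypothetical, picked only because they round to 0.67, 0.40, and 0.50; the project's real confusion matrix is not reported here, and the function name is made up for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts: 2 POIs caught, 1 non-POI wrongly flagged, 3 POIs missed.
precision, recall, f1 = precision_recall_f1(tp=2, fp=1, fn=3)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.4 0.5
```

Read in plain terms: with these counts, two out of three people flagged as POI really are POI, but only two out of five actual POIs are found.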
    
### Remarks

**Quick Recap**

- We dealt with an imperfect, real-world dataset

- We validated a machine learning result using test data

- We evaluated a machine learning result using quantitative metrics

- We created, selected, and transformed features

- We tuned the hyperparameters of machine learning algorithms for maximum performance

**Interesting Fact**

- Performance measured on test data is a high-variance estimate. In future work, we should watch for overfitting and underfitting, keeping this variance in test results in mind.


**Comparing Cross-Validation with Train/Test splits:**

- Cross Validation:

    + More accurate estimate of out-of-sample accuracy

    + More efficient use of data
    
- Train/Test Split
    + Runs K times faster than K-fold cross validation
    + Simpler to examine detailed results

## Resources

1. [What is Pickling in Python?](https://pythontips.com/2013/08/02/what-is-pickle-in-python/)

    a. [Video on Pickling](https://www.youtube.com/watch?v=2Tw39kZIbhs)
    
2. [Recursive Feature Elimination](http://scikit-learn.org/stable/modules/feature_selection.html)

3. [General feature selection in cross-validation](https://www.youtube.com/watch?v=6dbrR-WymjI)

4. [Hypertuning](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))

5. [SKBest and GridSearch Information](https://www.civisanalytics.com/blog/workflows-in-python-using-pipeline-and-gridsearchcv-for-more-compact-and-comprehensive-code/)

6. [SVM versus Decision Trees](https://stats.stackexchange.com/questions/57438/why-is-svm-not-so-good-as-decision-tree-on-the-same-data)

7. [Model Metrics and Ranking](http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf)

8. [Class Imbalance Problem](http://www.chioka.in/class-imbalance-problem/)

9. [F1-Score](https://datascience.stackexchange.com/questions/11014/why-are-precision-and-recall-used-in-the-f1-score-rather-than-precision-and-npv)

10. [Precision vs Recall](https://www.quora.com/What-is-the-best-way-to-understand-the-terms-precision-and-recall)
    
    a. [Precision vs Recall Blog](http://rushdishams.blogspot.co.id/2011/03/precision-and-recall.html)

11. [Accuracy versus Precision, Recall, and F1-Score](https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/)

12. [Multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)

13. [Stratification in Train and Test data](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn)

## Data Dictionary 

1. **Financial Features (units in US Dollars):**
    
    a. **Salary:** Reflects items such as base salary, executive cash allowances, and benefits payments.
    
    b. **Deferral Payments:** Reflects distributions from a deferred compensation arrangement due to termination of employment or due to in-service withdrawals as per plan provisions.
    
    c. **Total Payments:**
    
    d. **Loan Advances:** Reflects total amount of loan advances, excluding repayments, provided by the Debtor in return for a promise of repayment. In certain instances, the terms of the promissory notes allow for the option to repay with stock of the company.
    
    e. **Bonus:** Reflects annual cash incentives paid based upon company performance. Also may include other retention payments.
    
    f. **Restricted Stock Deferred:** Reflects value of restricted stock voluntarily deferred prior to release under a deferred compensation arrangement.
    
    g. **Deferred Income:** Reflects voluntary executive deferrals of salary, annual cash incentives, and long-term cash incentives as well as cash fees deferred by non-employee directors under a deferred compensation arrangement. May also reflect deferrals under a stock option or phantom stock unit in lieu of cash arrangement.
    
    h. **Total Stock Value:** In 1998, 1999 and 2000, Debtor and non-debtor affiliates were charged for options granted. The Black-Scholes method was used to determine the amount to be charged. Any amounts charged to Debtor and non-debtor affiliates associated with the options exercised related to these three years have not been subtracted from the share value amounts shown.
    
    i. **Expenses:** Reflects reimbursements of business expenses. May include fees paid for consulting services.
    
    j. **Exercised Stock Options:** Reflects amounts from exercised stock options which equal the market value in excess of the exercise price on the date the options were exercised either through cashless (same-day sale), stock swap or cash exercises. The reflected gain may differ from that realized by the insider due to fluctuations in the market price and the timing of any subsequent sale of the securities.
    
    k. **Other:** Reflects items such as payments for severance, consulting services, relocation costs, tax advances and allowances for employees on international assignment.
    
    l. **Long Term Incentive:** Reflects long-term incentive cash payments from various long-term incentive programs designed to tie executive compensation to long-term success as measured against key performance drivers and business objectives over a multi-year period, generally 3 to 5 years.
    
    m. **Restricted Stock:** Reflects the gross fair market value of shares and accrued dividends (and/or phantom units and dividend equivalents) on the date of release due to lapse of vesting periods, regardless of whether deferred.
    
    n. **Director Fees:** Reflects cash payments and/or value of stock grants made in lieu of cash payments to non-employee directors.

2. **Email Features:**

    a. **To Messages**
    
    b. **Email Address**
    
    c. **From POI to this Person**
    
    d. **From Messages**
    
    e. **From This Person To POI**
    
    f. **Shared Receipt With POI**

3. **POI label:**

    a. **POI (person of interest):** Person suspected of fraudulent actions during the 2002 Enron crisis


## End