# Project: Identify Fraud from Enron Email


The ENRON case was one of the most important financial scandals in the history of the United States. The ENRON corporation emerged in 1985 from the merger of Houston Natural Gas and InterNorth. The company started as a regular energy enterprise and, after energy deregulation laws were passed, began to build a complex financial business model. This, together with unethical practices such as false accounting records to hide failed investments, was the key to reporting revenues of nearly \$101 billion during the year 2000. At that time ENRON was one of the largest companies in the United States, employing more than 21000 people with investments in over 40 countries. In the summer of 2000 the share price peaked at around \$90; a year and a half later the company declared bankruptcy.

The goal of this project is to identify persons of interest (POIs) among the employees of ENRON. We consider a POI to be any individual who was responsible for a financial irregularity that led to the corporation's bankruptcy. To do so, we are provided with a dataset that contains financial and email-related data such as salary, long-term bonus, stock information and POI status. We will apply machine-learning techniques to train a predictive model able to classify employees into POIs and non-POIs.

This report is organized as follows: in Section I we describe the dataset and the procedure employed to identify and remove outliers. Feature selection and different machine learning algorithms are explained in Section II, where we compare several classifiers using statistical metrics such as accuracy, recall and precision. Finally, parameter tuning is performed on the best model found to improve its performance.

### Section I: ENRON dataset

#### Dataset description

We start the analysis by providing some insight into the ENRON dataset stored in ```./final_project_dataset.pkl```. Since this file contains human-entered data it is necessary to preprocess it to correct possible errors like incomplete or wrong information. After cross-checking the data with published records (```./enron61702insiderpay.pdf```) three errors were observed:

* The entries 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' do not refer to any individual. In the case of 'TOTAL' we can assume that it is a spreadsheet artifact (presumably the totals row) produced while exporting the data. As shown in Fig. 2, this point is a clear outlier.

* The features of 'BHATNAGAR SANJAY' and 'BELFER ROBERT' were incorrectly filled. The data was corrected and updated with the method ```correct_data``` defined in ```./modules/data_cleaning.py```.

* The data of 'LOCKHART EUGENE E' is empty. This point will be automatically removed by the ```featureFormat``` method when transforming the raw data into a matrix.
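
The first and third items above amount to dropping keys from the dataset dictionary and spotting records without any usable value. The following is a minimal sketch, assuming the pickle loads into a dict keyed by employee name whose values are feature dicts with the string 'NaN' marking missing values. The helpers `drop_entries` and `empty_entries` and the toy records are hypothetical; they are not the project's ```correct_data``` or ```featureFormat``` methods.

```python
def drop_entries(data_dict, names):
    """Return a copy of data_dict without the given keys (absent keys are ignored)."""
    return {name: rec for name, rec in data_dict.items() if name not in names}


def empty_entries(data_dict, ignore=("poi",)):
    """List the names whose features are all 'NaN', apart from the ignored fields."""
    return sorted(
        name for name, rec in data_dict.items()
        if all(value == "NaN" for key, value in rec.items() if key not in ignore)
    )


# Toy records for illustration only; the values are made up.
toy = {
    "TOTAL": {"poi": False, "salary": 1000, "bonus": 500},
    "THE TRAVEL AGENCY IN THE PARK": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "EMPLOYEE A": {"poi": False, "salary": 100, "bonus": "NaN"},
    "EMPLOYEE B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}

cleaned = drop_entries(toy, {"TOTAL", "THE TRAVEL AGENCY IN THE PARK"})
print(sorted(cleaned))         # names that remain after the removal
print(empty_entries(cleaned))  # names with no usable feature
```

Returning a copy rather than deleting in place keeps the raw dictionary available for cross-checking against the published records.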

The dataset contains 144 valid entries with 19 features related to financial (such as salary, bonus, total stock value, etc.) and email information (messages to POIs, messages from POIs). The POI/non-POI status is the label used to binary-classify the employees. Since only 19 of the 144 employees are POIs, the classes are strongly imbalanced. This aspect will be important during the validation of the machine learning model.

Completeness is usually a problem when working with real-world data. The bar plot in Fig. 1 depicts the number of NaNs per feature. Fields like 'poi', 'restricted stock value' or 'total payments' have the fewest missing values, in contrast with 'loan advances', 'director fees' or 'restricted stock deferred'. The presence of NaN values directly affects the quality of a given feature and will influence which features we choose when training the algorithm.

![EDA_plot](./figures/features_nans.pdf)
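
The per-feature counts behind a plot like Fig. 1 can be obtained with a short loop. This is a sketch under the assumption that missing values are stored as the string 'NaN'; the function name `count_missing` and the toy records are hypothetical.

```python
from collections import Counter


def count_missing(data_dict, missing="NaN"):
    """Count, for every feature, how many records hold the missing marker."""
    counts = Counter()
    for record in data_dict.values():
        for feature, value in record.items():
            # Adding a bool keeps features with zero missing values in the counter
            counts[feature] += (value == missing)
    return counts


# Toy records for illustration only; the values are made up.
toy = {
    "A": {"salary": 100, "loan_advances": "NaN", "director_fees": "NaN"},
    "B": {"salary": "NaN", "loan_advances": "NaN", "director_fees": 10},
    "C": {"salary": 300, "loan_advances": "NaN", "director_fees": "NaN"},
}

missing = count_missing(toy)
for feature, n in missing.most_common():
    print(feature, n)
```

Sorting with `most_common()` lists the sparsest features first, which is the order in which they become candidates for exclusion.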


#### Outliers

Outlier identification is essential to distinguish data points that lie far from the general trend and may negatively affect the outcome of the study. A first approach is simply to explore the dispersion of different features. Fig. 2 shows the dependence between bonus and salary, and between exercised_stock_options and total_stock_value, with and without outliers.

![EDA_plot](./figures/plot_EDA.pdf)

The 'TOTAL' entry stands out as a clear outlier in Fig. 2 and is removed from the dataset.

The outlier removal can be pushed further by assuming that the previous features are linearly dependent, that is, that bonus (exercised_stock_options) and salary (total_stock_value) are related by a linear expression. If so, we can perform a linear regression and compare the bonus (exercised_stock_options) predicted for a given salary (total_stock_value) with the observed value. The points with the largest deviation (10% of the data) can be considered outliers. As shown in Fig. 3, removing these points yields a better R$^2$ coefficient.

![EDA_plot](./figures/outlier_removal_regression.pdf)
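
The regression-based procedure can be sketched in plain Python: fit a line, rank the points by absolute residual, drop the worst 10% and compare R$^2$ before and after. The function names and the toy data below are hypothetical and do not reproduce the project's ```outliers_regression``` method.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Coefficient of determination of the least-squares line fitted to (x, y)."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / float(len(y))
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def drop_largest_residuals(x, y, fraction=0.1):
    """Fit a line and drop the given fraction of points with the largest residual."""
    slope, intercept = fit_line(x, y)
    residuals = [abs(yi - (slope * xi + intercept)) for xi, yi in zip(x, y)]
    n_drop = int(round(fraction * len(x)))
    # Indices ordered by residual; keep all but the n_drop largest
    keep = sorted(sorted(range(len(x)), key=lambda i: residuals[i])[: len(x) - n_drop])
    return [x[i] for i in keep], [y[i] for i in keep]


# Toy data: y = 2x + 1 with one corrupted point (made up for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3, 5, 7, 9, 11, 13, 15, 17, 19, 60]

r2_before = r_squared(x, y)
x_clean, y_clean = drop_largest_residuals(x, y, fraction=0.1)
r2_after = r_squared(x_clean, y_clean)
print("R^2 before: %.3f, after: %.3f" % (r2_before, r2_after))
```

Because the line is fitted before the points are dropped, a single extreme point can pull the fit towards itself; a more robust variant would refit and repeat, which is not shown here.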

However, a closer look at the outliers highlighted in blue in Fig. 3 reveals that some of them are POIs, like Kenneth Lay. We do want to keep these points in the machine learning analysis, all the more so because we are facing a classification problem with a class-imbalanced dataset. The outliers removed in Fig. 3 and their POI status are displayed in the following table:

| salary  vs.  bonus | POI status |   | total stock value  vs.  exercised stock options | POI status |
|:------------------:|:----------:|:-:|:-----------------------------------------------:|:----------:|
|   KITCHEN LOUISE   |    False   |   |                  LAY KENNETH L                  |    True    |
|   LAVORATO JOHN J  |    False   |   |                  RICE KENNETH D                 |    True    |
|  DELAINEY DAVID W  |    True    |   |                  KEAN STEVEN J                  |    False   |
|    LAY KENNETH L   |    True    |   |                WHITE JR THOMAS E                |    False   |
|  BELDEN TIMOTHY N  |    True    |   |               DIMICHELE RICHARD G               |    False   |
|  PICKERING MARK R  |    False   |   |                 BHATNAGAR SANJAY                |    False   |
|   ALLEN PHILLIP K  |    False   |   |                   HIRKO JOSEPH                  |    True    |
|   FREVERT MARK A   |    False   |   |               DERRICK JR. JAMES V               |    False   |
|                    |            |   |                     PAI LOU                     |    False   |
|                    |            |   |                 IZZO LAWRENCE L                 |    False   |

The outlier identification was performed combining the methods ```plot_EDA```, ```outliers_regression```, ```outliers_regression_plot``` and ```outliers_identification``` defined in ```./modules/ML_methods.py```.


### Section II: Machine learning algorithm

In this section we discuss the machine learning code and the steps employed to build the classification model. The basic structure of the machine learning algorithm is the following: split the data into train and test sets, fit the selected classifier with the train data and evaluate the model with the test data. One must also take into account the number of features considered, the type of model (Naive Bayes, SVM, Decision Tree...) and its parameters.

#### Feature selection

Once the data is cleaned of NaNs and outliers, it is important to select the number of features to be considered in the classification. This is one possible way to tune the bias-variance tradeoff. An extremely simple model with a small number of features (high bias) will perform poorly on both train and test data; on the other hand, a complex one (high variance) will better describe the particularities of the training set but will not be accurate on the test set.

We can select the k best features from the dataset using the tool ```sklearn.feature_selection.SelectKBest```. This was implemented in the ```get_kbest_features``` function in ```./modules/ML_methods.py```. ```SelectKBest``` provides a score related to the importance of each feature. In Table 3 we highlight the 10 most relevant features found.
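
The report does not state which score function was passed to ```SelectKBest```. Assuming its default, `f_classif`, each score is the one-way ANOVA F statistic of a feature across the two classes. The sketch below computes that statistic by hand; `anova_f`, `k_best` and the toy columns are hypothetical names and data.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the label groups."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / float(n)
    means = {lab: sum(g) / float(len(g)) for lab, g in groups.items()}
    # Variation of the group means around the grand mean
    between = sum(len(g) * (means[lab] - grand) ** 2 for lab, g in groups.items())
    # Variation of the samples around their own group mean
    within = sum((v - means[lab]) ** 2 for lab, g in groups.items() for v in g)
    return (between / (k - 1)) / (within / (n - k))


def k_best(features, labels, k):
    """Return the k feature names with the highest F statistic and all the scores."""
    scores = {name: anova_f(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k], scores


# Toy columns for illustration only; 1 marks a POI, 0 a non-POI.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "separating": [10, 11, 12, 1, 2, 3],
    "noise": [5, 1, 3, 4, 2, 6],
}

ranked, scores = k_best(features, labels, k=2)
for name in ranked:
    print(name, scores[name])
```

A feature whose class means are far apart relative to the spread inside each class gets a large score, which is why strongly separating columns rank first.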

#### Feature engineering

From the best features obtained in the previous part I built new fields with the intention of providing new relevant information to classify the employees. To do so, I selected the four with the highest scores, 'total_stock_value', 'exercised_stock_options', 'bonus' and 'salary', to construct 'salary_bonus' and 'total_value'. 'salary_bonus' is the sum of the 'salary' and 'bonus' fields. 'total_value' is the sum of the 'salary', 'bonus' and 'total_stock_value' of an employee. I chose these three values assuming that a potential POI would have a large salary and bonus as well as a large amount of money from the stock market. Table 3 compares the 10 best features without (left) and with (right) the 'salary_bonus' and 'total_value' fields.

<table>
<tr><th>Table 3 Default features</th><th></th><th>Table 3 Features 'salary_bonus' and 'total_value' included</th><th></th></tr>  
<tr><th>Features</th><th>Score</th><th>Features</th><th>Score</th></tr>
<tr><td>total_stock_value</td><td>22.511</td><td>total_stock_value</td><td>22.511</td></tr>
<tr><td>exercised_stock_options</td><td>22.349</td><td>salary_bonus</td><td>22.456</td></tr>
<tr><td>bonus</td><td>20.792</td><td>exercised_stock_options</td><td>22.349</td></tr>
<tr><td>salary</td><td>18.29</td><td>bonus</td><td>20.792</td></tr>
<tr><td>deferred_income</td><td>11.425</td><td>total_value</td><td>19.225</td></tr>
<tr><td>long_term_incentive</td><td>9.922</td><td>salary</td><td>18.29</td></tr>
<tr><td>total_payments</td><td>9.284</td><td>deferred_income</td><td>11.425</td></tr>
<tr><td>restricted_stock</td><td>8.825</td><td>long_term_incentive</td><td>9.922</td></tr>
<tr><td>shared_receipt_with_poi</td><td>8.589</td><td>total_payments</td><td>9.284</td></tr>
<tr><td>loan_advances</td><td>7.184</td><td>restricted_stock</td><td>8.825</td></tr>
</table> 

The new features were coded with the ```create_salarybonus_feature``` and ```create_value_feature``` methods defined in ```./modules/data_cleaning.py```.
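
The two sums can be sketched as follows. How the project treats missing values is not stated; the sketch assumes 'NaN' counts as zero. The helper names `as_number` and `add_engineered_features` are hypothetical and are not the methods in ```./modules/data_cleaning.py```.

```python
def as_number(value):
    """Treat the 'NaN' marker as zero (an assumption of this sketch)."""
    return 0 if value == "NaN" else value


def add_engineered_features(record):
    """Return a copy of record with 'salary_bonus' and 'total_value' added."""
    out = dict(record)
    salary = as_number(record.get("salary", "NaN"))
    bonus = as_number(record.get("bonus", "NaN"))
    stock = as_number(record.get("total_stock_value", "NaN"))
    out["salary_bonus"] = salary + bonus
    out["total_value"] = salary + bonus + stock
    return out


# Toy record for illustration only; the values are made up.
toy = {"salary": 200000, "bonus": "NaN", "total_stock_value": 1000000}
enriched = add_engineered_features(toy)
print(enriched["salary_bonus"], enriched["total_value"])
```

Since both new fields are sums of existing ones, they are strongly correlated with their inputs; whether they add information is what the comparison in Table 3 is meant to check.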



#### Classification algorithm

During this project we have evaluated the performance of different classifiers: Naive Bayes, Decision Tree, Random Forest and Ada Boost. The Code 1 block shows, as an example, the classification performed using the Naive Bayes model. The steps taken during the classification are the same for all the classifiers tested. First the dataset is reduced to the 10 best features found in the previous subsection (Table 3 right). After that we build the model and assess its generalization using *k*-fold cross-validation (CV), implemented with the ```perform_classification``` method defined in ```./modules/ML_methods.py```.

This method is especially useful for class-imbalanced data. Instead of performing a train-test split once, *k*-fold CV divides the data into *k* sets called folds. The classifier is built using *k-1* folds and evaluated with the remaining one. This process is repeated *k* times, once per fold. The effectiveness of the model is computed from the mean value of different statistical metrics obtained over all the folds. CV is broadly used to estimate how well a model generalizes. Note that Code 1 generates the splits with ```StratifiedShuffleSplit```, which draws repeated stratified random train/test splits instead of disjoint folds. In the following table we display the accuracy, recall, precision and F1 scores for the different classifiers.

|Algorithm|Accuracy score|Recall score|Precision score|F1 score|
|:------------------:|:----------:|:----------:|:----------:|:----------:|
|Naive Bayes         |0.8362|0.3035|0.3218|0.2937|
|Decision Tree       |0.8053|0.2750|0.2347|0.2348|
|Random Forest       |0.8600|0.1650|0.2300|0.2437|
|Ada Boost           |0.8533|0.2750|0.3270|0.2959|

The raw cross-validation output recorded for the Decision Tree (DT), Random Forest (RF) and Ada Boost (AB) classifiers is listed below; these values differ from those in the table above, presumably because they come from a separate run:

|Classifier|test_accuracy|test_recall|test_precision|test_f1_score|
|:------------------:|:----------:|:----------:|:----------:|:----------:|
|DT classifier       |0.8073|0.2890|0.2632|0.2574|
|RF classifier       |0.8533|0.1605|0.2257|0.1797|
|AB classifier       |0.8463|0.2270|0.2846|0.2393|

Accuracy, recall, precision and F1 are defined as follows:

* Accuracy $ \equiv (TP + TN) / (TP + TN + FP + FN)$, the ratio of data classified correctly
* Recall $ \equiv TP / (TP + FN)$
* Precision $ \equiv TP / (TP + FP)$
* F1 $ \equiv 2 \cdot (P \cdot R) / (P + R)$

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative classifications, and P and R denote precision and recall.
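
As a worked example, the definitions above can be applied to the confusion counts listed in the Parameter tuning subsection (656 true positives, 978 false positives, 1344 false negatives and 12022 true negatives). The function name `classification_metrics` is hypothetical, and the sketch does not guard against empty denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


metrics = classification_metrics(tp=656, fp=978, fn=1344, tn=12022)
for name in ("accuracy", "precision", "recall", "f1"):
    print("%s = %.5f" % (name, metrics[name]))
```

The printed values agree with the accuracy, precision, recall and F1 figures reported alongside those counts.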


For imbalanced binary data the accuracy score is not a good metric to evaluate the quality of the model. Since the majority of the ENRON employees are non-POIs, a model that ignores the POI data would correctly classify the ordinary workers and produce a good accuracy score. It is better to consider other metrics such as recall, precision or F1. The recall is the fraction of actual POIs that are correctly classified or, in other words, the model's ability to find all the data points of interest in a dataset. The precision, on the other hand, is the fraction of employees classified as POIs that really are POIs. Increasing the recall typically decreases the precision and vice versa. F1, the harmonic mean of the two, summarizes this tradeoff in a single number.
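
The argument can be made concrete with the class counts quoted in Section I (19 POIs among 144 entries). The short calculation below evaluates a hypothetical classifier that labels every employee as non-POI.

```python
n_total, n_poi = 144, 19  # counts quoted in Section I

# A classifier that never predicts the POI class:
tp, fp = 0, 0                      # no positive predictions at all
fn, tn = n_poi, n_total - n_poi    # every POI is missed, every non-POI is right

accuracy = (tp + tn) / float(n_total)
recall = tp / float(tp + fn)
print("accuracy = %.4f, recall = %.4f" % (accuracy, recall))
# prints: accuracy = 0.8681, recall = 0.0000
```

This trivial model reaches an accuracy above every value in the comparison table while finding no POI at all, which is why recall, precision and F1 are the metrics that matter here.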

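From the values shown in the table above, we see that the Naive Bayes classifier has the best recall, while its precision and F1 scores are within 0.01 of the highest values, those of Ada Boost. Compared with the raw output listed above, the Naive Bayes values are the highest on all three metrics. Therefore, among the tested models, we select the Naive Bayes classifier as the one with the best overall performance.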

**Code 1: Naive Bayes classification**
```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


def perform_classification(clf, steps, features, labels, folds=100):
    '''
    This method performs the cross-validation of the Pipeline built from
    the selected steps (clf is expected to be the last of those steps).
    It returns both the pipeline and a dictionary containing
    the mean value of the accuracy, precision, recall and f1 scores.
    '''

    # Cross-validation method: stratified random splits with a fixed seed
    cv_kfold = StratifiedShuffleSplit(n_splits=folds, random_state=42)
    pipe = Pipeline(steps)

    # Metric scores to be computed
    scoring = {'accuracy'  : make_scorer(accuracy_score),
               'precision' : make_scorer(precision_score),
               'recall'    : make_scorer(recall_score),
               'f1_score'  : make_scorer(f1_score)}

    results = cross_validate(estimator=pipe,
                             X=features,
                             y=labels,
                             cv=cv_kfold,
                             scoring=scoring)

    # Average every entry (scores and timings) over the splits
    for key in list(results):
        results[key] = np.mean(results[key])

    return pipe, results


print("Performing Naive Bayes classification\n")
# Classifier declaration
clf_NB = GaussianNB()
# Algorithm steps: 1) Feature selection, 2) Model construction
# (feature_selection, features and labels are defined elsewhere in the project)
steps_NB = [('feature_selection', feature_selection), ('Naive_Bayes', clf_NB)]
pipe_NB, results_NB = perform_classification(clf_NB, steps_NB, features, labels)
```
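
Code 1 draws its splits with ```StratifiedShuffleSplit```. The idea behind stratified random splitting can be sketched without scikit-learn: shuffle the indices of each class separately and take the same fraction of every class for the test set. The generator `toy_stratified_splits` below is a hypothetical illustration and not scikit-learn's implementation.

```python
import random


def toy_stratified_splits(labels, n_splits=3, test_fraction=0.1, seed=42):
    """Yield (train_idx, test_idx) pairs that keep the class ratio in each test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for _ in range(n_splits):
        test = []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # At least one sample of every class goes to the test set
            n_test = max(1, int(round(test_fraction * len(shuffled))))
            test.extend(shuffled[:n_test])
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        yield train, sorted(test)


# Toy labels: 10 POIs (1) and 90 non-POIs (0), made up for illustration.
labels = [1] * 10 + [0] * 90
splits = list(toy_stratified_splits(labels, n_splits=3, test_fraction=0.1))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

Unlike disjoint folds, the test sets of different splits may overlap; in exchange, the number of splits can be chosen independently of the test-set size.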


#### Parameter tuning

We have performed parameter tuning on the Naive Bayes algorithm to find the set of parameters that improves the POI identification. In this case we can only tune the number of features selected during the fit. As shown in the Code 2 block, this process is implemented with the ```GridSearchCV``` method. The best score was found for the Naive Bayes classifier with 12 features. The following table shows the score metrics before and after the parameter tuning.

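|Algorithm|Accuracy score|Recall score|Precision score|F1 score|
|:------------------:|:----------:|:----------:|:----------:|:----------:|
|Naive Bayes default |0.8467|0.3800|0.4052|0.3649|
|Naive Bayes tuned   | | | | |

The evaluation output recorded for this section (in the format printed by ```test_classifier```, which Code 2 calls on the tuned estimator) reads:

|Accuracy|Precision|Recall|F1|F2|
|:----------:|:----------:|:----------:|:----------:|:----------:|
|0.84520|0.40147|0.32800|0.36103|0.34046|

|Total predictions|True positives|False positives|False negatives|True negatives|
|:----------:|:----------:|:----------:|:----------:|:----------:|
|15000|656|978|1344|12022|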




**Code 2: Parameter tuning method**
```python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline


def grid_search(steps, clf_parameters, features, labels, folds=100):
    '''
    This method performs a cross-validated parameter tuning.
    To do so, it builds a Pipeline and employs GridSearchCV over the provided
    range of parameters.
    '''

    pipe = Pipeline(steps)
    cv_kfold = StratifiedShuffleSplit(n_splits=folds, random_state=42)
    # No scoring argument is passed, so the estimator's default score
    # (accuracy for a classifier) is the quantity being optimized
    cv_grid = GridSearchCV(pipe, param_grid=clf_parameters, cv=cv_kfold)
    cv_grid.fit(features, labels)

    return cv_grid


print("Performing Grid Search of Naive Bayes classification")
# Range of parameters of the classification tuning
param_dict_NB = {'feature_selection__k': list(range(5, len(features_list)))}
gs = grid_search(steps_NB, param_dict_NB, features, labels)
gs_clf_NB = gs.best_estimator_
print('\n Score Metrics Naive Bayes Classifier')
# test_classifier, data_dict and features_list are defined elsewhere in the project
test_classifier(gs_clf_NB, data_dict, features_list, folds=1000)
print("")
```
