# Project: Identify Fraud from ENRON Email


The ENRON case was one of the most important financial scandals in the history of the United States. The corporation emerged in 1985 from the merger of Houston Natural Gas and InterNorth. After energy deregulation laws were passed, ENRON started to build a complex financial business model. This, together with unethical practices such as false accounting records to hide failed investments, was key to reporting revenues of nearly \$101 billion during the year 2000. ENRON was one of the largest companies at that time, employing more than 21000 people with investments in more than 40 countries. In the summer of 2000 the share price peaked at around \$90; a year and a half later the company declared bankruptcy.

The goal of this project is to identify persons of interest (POIs) among the employees of ENRON. We consider a POI to be any individual who was responsible for any financial irregularity that led to the corporation's insolvency. To do so, we are provided with a dataset which contains financial and email-related data such as salary, long-term bonus, stock information and POI status. We will apply machine-learning techniques to train a predictive model able to classify employees as POIs or non-POIs.

This report is organized as follows: in Section I we describe the dataset and the procedures employed to wrangle the data and to identify and remove outliers. Feature selection and different machine learning algorithms are explained in Section II, where we compare different classifiers based on metrics such as accuracy, recall and precision. Parameter tuning is then performed to improve the model's performance. Finally, we conclude and provide a list of references used during this work.


### Section I: ENRON dataset

#### Dataset description

We start the analysis by providing some insight into the ENRON dataset stored in ```./final_project_dataset.pkl```. Since this file contains human-entered data it is necessary to preprocess it to correct possible incomplete or wrong information. After cross-checking the data with published records (```./enron61702insiderpay.pdf```) three errors were observed:

* The entries 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' do not refer to any individual. In the case of 'TOTAL' we can assume that it is a spreadsheet error produced while exporting the data. As shown in Fig. 2, this point is a clear outlier.

* The features of 'BHATNAGAR SANJAY' and 'BELFER ROBERT' were incorrectly filled. The data was corrected and updated with the method ```correct_data``` defined in ```./modules/data_cleaning.py```.

* The data of 'LOCKHART EUGENE E' is empty. This point will be automatically removed by the ```featureFormat``` method when transforming the raw data into a matrix.
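The cleaning steps listed above can be sketched as follows. This is only an illustration, not the project's ```correct_data``` code: it assumes that the dataset is a dictionary mapping each name to a dictionary of features and that missing values are stored as the string ```'NaN'```. The toy records, the entry 'EMPLOYEE A' and the helper ```remove_invalid_entries``` are hypothetical.

```python
# Toy stand-in for the real dataset. The keys 'TOTAL', 'THE TRAVEL AGENCY IN THE PARK'
# and 'LOCKHART EUGENE E' come from the report; all values are made up.
data_dict = {
    "EMPLOYEE A": {"salary": 1000, "bonus": 500, "poi": False},
    "TOTAL": {"salary": 99999, "bonus": 88888, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": 10, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

NON_PERSON_KEYS = ("TOTAL", "THE TRAVEL AGENCY IN THE PARK")


def remove_invalid_entries(records, non_person_keys=NON_PERSON_KEYS, label="poi"):
    """Return a copy without non-person rows and without rows whose features are all 'NaN'."""
    cleaned = {}
    for name, features in records.items():
        if name in non_person_keys:
            continue  # not an individual
        values = [value for key, value in features.items() if key != label]
        if all(value == "NaN" for value in values):
            continue  # no usable information besides the label
        cleaned[name] = dict(features)
    return cleaned


cleaned = remove_invalid_entries(data_dict)
print(sorted(cleaned))
```

Returning a copy instead of deleting keys in place keeps the raw dictionary available for cross-checks against the published records.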

The dataset contains 144 valid entries with 19 features related to financial data (such as salary, bonus and total stock value) and email information (messages to POIs, messages from POIs). The 'poi' field is the label used for the binary classification of the employees. The dataset is strongly imbalanced towards the non-POI class, since only 18 out of 144 employees are POIs. This POI/non-POI ratio will be important during data cleaning and model validation, since it reflects the class imbalance of the data.

Completeness is a common problem when working with real-world data. The bar plot in Fig. 1 depicts the number of NaNs per feature. Fields like 'poi', 'restricted stock value' or 'total payments' have the fewest missing values, in contrast with 'loan advances', 'director fees' or 'restricted stock deferred'. The presence of NaN values directly affects the quality of a given feature and will influence which features are relevant when training the algorithm.

**Figure 1** ![EDA_plot](./figures/features_nans.pdf)
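A count like the one plotted in Fig. 1 could be obtained along these lines. The sketch assumes, as before, that missing values are stored as the string ```'NaN'```; the records and the helper ```count_missing``` are hypothetical and the numbers are made up.

```python
from collections import Counter

# Made-up records, only to illustrate the counting.
records = {
    "EMPLOYEE A": {"salary": 100, "bonus": "NaN", "loan_advances": "NaN"},
    "EMPLOYEE B": {"salary": 200, "bonus": 50, "loan_advances": "NaN"},
    "EMPLOYEE C": {"salary": "NaN", "bonus": 70, "loan_advances": "NaN"},
}


def count_missing(records, missing="NaN"):
    """Count, for every feature, how many entries hold the missing-value marker."""
    counts = Counter()
    for features in records.values():
        for name, value in features.items():
            counts[name] += int(value == missing)
    return counts


missing_per_feature = count_missing(records)
for name, count in missing_per_feature.most_common():
    print(name, count)
```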



#### Outliers

Outlier identification is essential to distinguish the data points that deviate from the general trend and may negatively affect the outcome of the study. A first approach is to explore the relationship between different features. Fig. 2 shows the dependence between bonus, salary, exercised_stock_options and total_stock_value with and without outliers. As we will see in Sec. II, these fields are going to be crucial for identifying POIs.

**Figure 2** ![EDA_plot](./figures/plot_EDA.pdf)

There is one clear outlier in the left panels, which corresponds to the 'TOTAL' point. After the outlier is removed, the plot ranges shrink to values that agree with the employees' data. More outliers can be removed by assuming a linear dependence between the plotted features. In Fig. 3 we show that, after removing the 10% of the points that deviate most from the linear fit, the $R^2$ coefficient increases as expected. However, this is not a good approach when working with strongly imbalanced data. Taking a closer look at the outliers (blue points in Fig. 3), one finds that some of them are POIs, like Kenneth Lay. Removing these points would discard some of the few POI examples available and would not improve the quality of the model. The outliers detected in Fig. 3 and their POI status are displayed in Table 1.

**Figure 3** ![EDA_plot](./figures/outlier_removal_regression.pdf)

**Table 1**

| salary  vs.  bonus | POI status |   | total stock value  vs.  exercised stock options | POI status |
|:------------------:|:----------:|:-:|:-----------------------------------------------:|:----------:|
|   KITCHEN LOUISE   |    False   |   |                  LAY KENNETH L                  |    True    |
|   LAVORATO JOHN J  |    False   |   |                  RICE KENNETH D                 |    True    |
|  DELAINEY DAVID W  |    True    |   |                  KEAN STEVEN J                  |    False   |
|    LAY KENNETH L   |    True    |   |                WHITE JR THOMAS E                |    False   |
|  BELDEN TIMOTHY N  |    True    |   |               DIMICHELE RICHARD G               |    False   |
|  PICKERING MARK R  |    False   |   |                 BHATNAGAR SANJAY                |    False   |
|   ALLEN PHILLIP K  |    False   |   |                   HIRKO JOSEPH                  |    True    |
|   FREVERT MARK A   |    False   |   |               DERRICK JR. JAMES V               |    False   |
|                    |            |   |                     PAI LOU                     |    False   |
|                    |            |   |                 IZZO LAWRENCE L                 |    False   |

The outlier identification was performed combining the methods ```plot_EDA```, ```outliers_regression```, ```outliers_regression_plot``` and ```outliers_identification``` defined in ```./modules/ML_methods.py```.
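The residual-based removal discussed for Fig. 3 can be illustrated with a minimal sketch. It is not the code of ```outliers_regression```: the helpers ```fit_line```, ```r_squared``` and ```drop_largest_residuals``` are hypothetical, the data points are made up, and a plain least-squares line is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the line on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def drop_largest_residuals(xs, ys, fraction=0.10):
    """Remove the given fraction of points with the largest absolute residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    n_drop = int(round(fraction * len(xs)))
    keep = sorted(range(len(xs)), key=lambda i: residuals[i])[: len(xs) - n_drop]
    keep.sort()
    return [xs[i] for i in keep], [ys[i] for i in keep]


# Made-up points close to y = 2x, with one deviating point at x = 8.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 30.0, 18.1, 19.9]

r2_before = r_squared(xs, ys, *fit_line(xs, ys))
xs_clean, ys_clean = drop_largest_residuals(xs, ys)
r2_after = r_squared(xs_clean, ys_clean, *fit_line(xs_clean, ys_clean))
print(r2_before, r2_after)
```

The procedure is blind to the labels: it drops whichever points deviate most, which is why in the ENRON data it would also discard POIs.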



### Section II: Machine learning algorithm

In this section we discuss the structure of the machine learning code and the steps employed to build the model. The basic structure of a machine learning workflow consists of three parts: first the data is split into train and test sets, then the algorithm of our choice (Naive Bayes, SVM, Decision Tree...) is fitted using the training data and, finally, it is validated with the test set. To improve the performance of our model, we add a few extra steps such as feature selection and parameter tuning.

#### Feature selection

Once the data is cleaned of NaNs and outliers, it is important to select the number of relevant features to be considered in the classification. This is one way to tune the bias-variance tradeoff. A simple model with a small number of features (high bias) will perform poorly on both train and test data; on the other hand, a complex one (high variance) will describe the particularities of the training set better but will not be accurate enough on the test set. The k best features were selected with the ```get_kbest_features``` function coded in ```./modules/ML_methods.py```. As shown in Table 3, ```SelectKBest``` provides a score related to the importance of each feature. 'total_stock_value', 'exercised_stock_options', 'bonus' and 'salary' are the most relevant features in the dataset. From them, two new fields were built with the intention of providing additional information to classify the employees: 'salary_bonus' was calculated by adding up 'salary', 'bonus' and 'total_stock_value', and 'total_value' by adding up 'salary' and 'bonus'. In the following table, we compare the 10 best features with and without the 'salary_bonus' and 'total_value' fields.

**Table 3**
<table>
<tr><th>Default features</th><th></th><th>Features 'salary_bonus' and 'total_value' included</th><th></th></tr>  
<tr><th>Features</th><th>Score</th><th>Features</th><th>Score</th></tr>
<tr><td>total_stock_value</td><td>22.511</td><td>total_stock_value</td><td>22.511</td></tr>
<tr><td>exercised_stock_options</td><td>22.349</td><td>salary_bonus</td><td>22.456</td></tr>
<tr><td>bonus</td><td>20.792</td><td>exercised_stock_options</td><td>22.349</td></tr>
<tr><td>salary</td><td>18.29</td><td>bonus</td><td>20.792</td></tr>
<tr><td>deferred_income</td><td>11.425</td><td>total_value</td><td>19.225</td></tr>
<tr><td>long_term_incentive</td><td>9.922</td><td>salary</td><td>18.29</td></tr>
<tr><td>total_payments</td><td>9.284</td><td>deferred_income</td><td>11.425</td></tr>
<tr><td>restricted_stock</td><td>8.825</td><td>long_term_incentive</td><td>9.922</td></tr>
<tr><td>shared_receipt_with_poi</td><td>8.589</td><td>total_payments</td><td>9.284</td></tr>
<tr><td>loan_advances</td><td>7.184</td><td>restricted_stock</td><td>8.825</td></tr>
</table> 

The new features were coded with the ```create_salarybonus_feature``` and ```create_value_feature``` methods defined in ```./modules/data_cleaning.py```.
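To give an idea of where scores like those in Table 3 come from, the sketch below computes a one-way ANOVA F statistic by hand and ranks two toy features with it. The report does not state which score function ```get_kbest_features``` uses; the ANOVA F-value (```f_classif```) is the default of scikit-learn's ```SelectKBest```, so that score is assumed here. The helper ```anova_f_score```, the feature names and all values are hypothetical.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across the label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))
    n_total = len(values)
    grand_mean = sum(sum(group) for group in groups.values()) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for group in groups.values():
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in group)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up data: 1 marks the positive class, 0 the negative class.
labels = [0, 0, 0, 0, 1, 1]
features = {
    "informative": [1, 2, 1, 2, 8, 9],  # separates the two classes well
    "noisy": [5, 1, 4, 2, 3, 4],        # hardly related to the label
}

scores = {name: anova_f_score(column, labels) for name, column in features.items()}
k = 1
k_best = sorted(scores, key=scores.get, reverse=True)[:k]
print(scores, k_best)
```

A feature whose class means are far apart relative to the spread inside each class gets a large score, which is the sense in which the scores of Table 3 measure relevance.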


#### Classification algorithm

During this project, we have evaluated the performance of different classifiers: Naive Bayes, Decision Tree, Random Forest and Ada Boost. Code 1 shows an example of the classification code using the Naive Bayes model. The steps taken during the classification are the same for all the classifiers tested.

Before discussing the details of the code, we should consider whether feature scaling is needed. This technique is employed to make features defined over different ranges comparable. Some algorithms, like KNN, make predictions by measuring distances (usually Euclidean or Minkowski distances) between points, and these distances are highly sensitive to the range of each feature. Depending on the classifier, the lack of feature scaling can therefore affect the outcome. To avoid this, we can rescale the data so that all features take values within a similar range. The most common feature scaling procedures are Standardization, Min-Max Scaling and Unit Vector. Standardization replaces each data point by its Z score, so that the rescaled feature has mean $\mu = 0$ and standard deviation $\sigma = 1$. Min-Max Scaling redistributes the data as follows:

$$ \bar{x} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

and reduces the range to $[0,1]$, where the minimum and maximum values are mapped to 0 and 1, respectively. Unit Vector is the regular vector normalization, which rescales a vector to unit length; for non-negative data its components also lie between 0 and 1. Since we are going to perform the data analysis with classifiers like Naive Bayes or Decision Trees, which are not sensitive to distances between data points or to the ranges of the individual features, we do not need to include scaling in the algorithm.
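The three scaling procedures can be written down in a few lines. The following is a minimal sketch in plain Python, not part of the project code; the helper names and the salary values are made up for illustration.

```python
import math


def standardize(xs):
    """Z score: subtract the mean and divide by the (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]


def min_max_scale(xs):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def unit_vector(xs):
    """Divide by the Euclidean norm so that the vector has length 1."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]


salaries = [200000.0, 250000.0, 400000.0, 1000000.0]  # made-up values

print(standardize(salaries))
print(min_max_scale(salaries))
print(unit_vector(salaries))
```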

All the models were built by performing feature selection (see the right-hand columns of Table 3 in the previous subsection) and using cross-validation, implemented in the ```perform_classification``` method defined in ```./modules/ML_methods.py```. Cross-validation (CV) is especially useful when working with class-imbalanced data, since it systematically validates the model on several subsets of the data and gives a better idea of how the algorithm generalizes. We have used StratifiedShuffleSplit to split and validate the classifier. This method generates *k* train-test splits. For each split the data is shuffled and then divided into a train and a test set that preserve the class proportions of the full dataset. Because every split is drawn independently, the test sets of different splits may overlap. The model is fitted on the train set of each split and validated on the corresponding test set. The following picture illustrates the StratifiedShuffleSplit technique for k = 4 splits.

**Figure 4** <img src="./figures/CV_SSS.png" alt="drawing" width="500"/>
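The idea behind a stratified shuffle split can be sketched by hand. The function below is a simplified illustration, not scikit-learn's implementation: ```stratified_shuffle_splits``` is a hypothetical helper, and the test fraction of 0.1 and the seed are assumptions. Only the class counts (18 POIs among 144 entries) are taken from the dataset description.

```python
import random


def stratified_shuffle_splits(labels, n_splits=4, test_size=0.1, seed=42):
    """Yield (train_idx, test_idx) pairs; every split is drawn independently."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for label in sorted(by_class):
            indices = list(by_class[label])
            rng.shuffle(indices)  # reshuffle each class for every split
            n_test = max(1, int(round(test_size * len(indices))))
            test_idx.extend(indices[:n_test])
            train_idx.extend(indices[n_test:])
        yield sorted(train_idx), sorted(test_idx)


labels = [1] * 18 + [0] * 126  # 18 POIs among 144 entries
splits = list(stratified_shuffle_splits(labels))
for train_idx, test_idx in splits:
    n_poi_test = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_poi_test)
```

With these settings every test set holds 15 entries, 2 of them POIs, so each split keeps roughly the POI ratio of the full dataset.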

The performance of the algorithm is evaluated on all the splits, and the mean value of each metric is provided as a global measure. CV is, in general, a more reliable approach to study the model's generalization than a single train-test split. In the following table, we display the accuracy, recall, precision and F1 scores for the different classifiers.

**Table 4**

|Algorithm|Accuracy score|Recall score|Precision score|F1 score|
|:------------------:|:----------:|:----------:|:----------:|:----------:|
|Naive Bayes         |0.8362|0.3035|0.3218|0.2937|
|Decision Tree       |0.8073|0.2890|0.2631|0.2573|
|Random Forest       |0.8532|0.1605|0.2256|0.1796|
|Ada Boost           |0.8463|0.2270|0.2846|0.2392|

Accuracy, recall, precision and F1 are defined as follows:

* Accuracy $ \equiv (TP + TN) / (TP + TN + FP + FN)$, i.e. the ratio of data classified correctly
* Recall $ \equiv TP / (TP + FN)$
* Precision $ \equiv TP / (TP + FP)$
* F1 $ \equiv 2 \cdot (P \cdot R) / (P + R)$

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative classifications, and P and R denote precision and recall.
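These definitions translate directly into code. The sketch below uses hypothetical confusion counts for a 144-entry dataset with 18 POIs; the helper ```classification_metrics``` is not part of the project, and returning 0 when a denominator vanishes is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 from the four confusion counts."""
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical outcome: 6 of the 18 POIs found, 12 non-POIs wrongly flagged.
metrics = classification_metrics(tp=6, tn=114, fp=12, fn=12)
print(metrics)
```

Because Table 4 reports means over many splits, its F1 column need not equal the F1 obtained by inserting the mean precision and mean recall into the formula.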

In the case of imbalanced data, the accuracy score is not a good metric to evaluate the quality of the model. Since the majority of the ENRON employees are non-POIs, a model that ignores the POI class would still classify the ordinary workers correctly and produce a good accuracy score. It is better to consider other metrics such as recall, precision or F1. The recall is the fraction of actual POIs that are correctly classified or, in other words, the model's ability to find all the data points of interest in a dataset.
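As a worked example with the class counts of the dataset description (18 POIs among 144 entries), a classifier that labels every employee as non-POI has $TP = FP = 0$, $TN = 126$ and $FN = 18$, which gives:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 126}{144} = 0.875, \qquad \text{Recall} = \frac{TP}{TP + FN} = \frac{0}{0 + 18} = 0 $$

Evaluated on the full dataset, this trivial baseline reaches an accuracy above every value reported in Table 4 while not finding a single POI.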

However, the recall alone is not a good metric either. By increasing the number of people predicted as POIs, the recall increases but the precision is typically reduced. The precision is defined as the ratio between the true positives and the sum of true positives and false positives which, in this context, corresponds to the number of correctly predicted POIs divided by the total number of predicted POIs. The precision therefore reflects how reliable the POI predictions of the algorithm are. These two metrics are related: as the recall increases the precision usually decreases, and vice versa. The F1 score, the harmonic mean of precision and recall, summarizes this tradeoff in a single number.
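The tradeoff can be made concrete by lowering a decision threshold step by step. In the sketch below the scores and labels are made up, and ```precision_recall_at``` is a hypothetical helper; it only illustrates how predicting more people as POIs raises the recall at the expense of the precision.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted as POI."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up classifier scores (higher means "more likely a POI") and true labels.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

results = {}
for threshold in (0.75, 0.50, 0.15):
    results[threshold] = precision_recall_at(scores, labels, threshold)
    print(threshold, results[threshold])
```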

From the values shown in the table above, we see that the Naive Bayes classifier has the best recall, precision and F1 scores. Therefore, among the tested models, the Naive Bayes classifier is the one which presents the best performance of all.

**Code 1: Naive Bayes classification**
```python
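import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


def perform_classification(clf, steps, features, labels, folds=100):
    '''
    This method performs the cross validation using the provided clf classifier,
    which is expected to be the final estimator contained in steps.
    It builds a Pipeline with the selected steps.
    It returns both the (unfitted) pipeline and a dictionary containing
    the mean value of the accuracy, precision, recall and f1 scores.
    '''

    # Cross-validation method: `folds` stratified random train/test splits
    cv_splits = StratifiedShuffleSplit(n_splits=folds, random_state=42)
    pipe = Pipeline(steps)

    # Metric scores to be computed
    scoring = {'accuracy': make_scorer(accuracy_score),
               'precision': make_scorer(precision_score),
               'recall': make_scorer(recall_score),
               'f1_score': make_scorer(f1_score)}

    results = cross_validate(estimator=pipe,
                             X=features,
                             y=labels,
                             cv=cv_splits,
                             scoring=scoring)

    # Average every entry (scores and timings) over the splits
    mean_results = {key: np.mean(values) for key, values in results.items()}

    return pipe, mean_results


# `features` and `labels` are the arrays extracted from the cleaned dataset
print("Performing Naive Bayes classification\n")
# Feature selection step (SelectKBest keeps the 10 best features by default)
feature_selection = SelectKBest()
# Classifier declaration
clf_NB = GaussianNB()
# Algorithm steps: 1) Feature selection, 2) Model construction
steps_NB = [('feature_selection', feature_selection), ('Naive_Bayes', clf_NB)]
pipe_NB, results_NB = perform_classification(clf_NB, steps_NB, features, labels)

# Similar code for the rest of the classifiers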
```


#### Parameter tuning

In the previous subsection, we have found that the Naive Bayes algorithm produced the best quality classification. However, this result was obtained by comparing the results of different classifiers with the default parameter settings. Each machine-learning algorithm has a different set of variables that control several aspects of the model's complexity. In order to obtain the optimal configuration, we must explore different parameter configurations for the different classifiers.

Building the model with the right complexity is key to obtaining a good fit during both training and testing. A model that is too simple (high bias) will fail to capture the structure of the data and produce inaccurate predictions on both the train and test sets, while one that is too detailed (high variance) will accurately describe the particularities of the training data but will not generalize well. In machine learning, we usually refer to these limiting cases as underfitted and overfitted models. It is important to avoid both scenarios, and one way to do so is precisely to tune the parameters until we find the ones which provide the optimal model complexity.

We have tested a wide range of parameters for the Naive Bayes, Decision Tree, Random Forest and Ada Boost classifiers.

Code 2 contains the steps followed in the Decision Tree case. For this classifier, we explored two parameters: 'criterion' and 'min_samples_split'. We tested two criteria, 'gini' and 'entropy'. 'min_samples_split' stands for the minimum number of samples required to split an internal node and was tested from 2 to 10. The smaller this quantity is, the more complex the classifier will be. The 'splitter' parameter was left as 'best' to ensure that the code chooses the best split at each node.

We have observed that, after parameter tuning, the Naive Bayes algorithm still has the best generalization. In this case, we can only tune the number of features selected during the fit. As shown in Code 3, this process is implemented with the ```GridSearchCV``` method. The best scores were found for the Naive Bayes classifier considering 12 features. In the following table we show the score metrics before and after the parameter tuning.

**Table 5**

|Algorithm|Accuracy score|Recall score|Precision score|F1 score|
|:------------------:|:----------:|:----------:|:----------:|:----------:|
|Naive Bayes default |0.8467|0.3800|0.4052|0.3649|
|Naive Bayes tuned   |0.8452|0.3280|0.4014|0.3610|





**Code 2: Parameter tuning method**
```python
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# `features`, `labels`, `features_list`, `data_dict` and `test_classifier`
# are defined elsewhere in the project.


def grid_search(steps, clf_parameters, features, labels, folds=100):
    '''
    This method performs a cross-validated parameter tuning.
    To do so, it builds a Pipeline and employs GridSearchCV over the provided
    range of parameters.
    '''

    # Pipeline following the steps of the classification
    pipe = Pipeline(steps)
    # CV method: `folds` stratified random train/test splits
    cv_splits = StratifiedShuffleSplit(n_splits=folds, random_state=42)
    # Scan of the parameters defined in clf_parameters
    cv_grid = GridSearchCV(pipe, param_grid=clf_parameters, cv=cv_splits)
    cv_grid.fit(features, labels)

    return cv_grid


clf_DT = DecisionTreeClassifier()
feature_selection = SelectKBest()

# Steps: 1) Feature selection, 2) Classification
steps_DT = [('feature_selection', feature_selection), ('Decission_Tree', clf_DT)]
print("Performing Grid Search of Decision Tree classification")
param_dict_DT = {'feature_selection__k': list(range(5, len(features_list))),
                 'Decission_Tree__criterion': ['gini', 'entropy'],
                 'Decission_Tree__min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10]}

gs = grid_search(steps_DT, param_dict_DT, features, labels)
# Model with the best parameter configuration
gs_clf_DT = gs.best_estimator_
print('\n Score Metrics Decision Tree Classifier')
test_classifier(gs_clf_DT, data_dict, features_list, folds=1000)
print("")
```
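To get a feeling for the cost of the search in Code 2, the grid can be expanded by hand. The sketch below assumes a hypothetical value of 20 for ```len(features_list)```, which the report does not state; the parameter names and the 100 splits are taken from Code 2.

```python
from itertools import product

n_features = 20  # hypothetical stand-in for len(features_list)
param_grid = {
    "feature_selection__k": list(range(5, n_features)),
    "Decission_Tree__criterion": ["gini", "entropy"],
    "Decission_Tree__min_samples_split": list(range(2, 11)),
}

# Every combination of parameter values is one candidate model.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_splits = 100  # default value of `folds` in grid_search
n_fits = len(combinations) * n_splits
print(len(combinations), n_fits)
```

Under this assumption the search evaluates 270 candidates, each fitted on 100 splits. Note also that ```GridSearchCV```, when no ```scoring``` argument is given as in Code 2, ranks the candidates by the estimator's default score, which is the accuracy for scikit-learn classifiers.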


**Code 3: Parameter tuning method for the Naive Bayes classifier**
```python
print("Performing Grid Search of Naive Bayes classification")
# Range of parameters of the classification tuning
param_dict_NB = {'feature_selection__k': list(range(5, len(features_list)))}
# steps_NB and grid_search are defined in Code 1 and Code 2
gs = grid_search(steps_NB, param_dict_NB, features, labels)
# Model with the best parameter configuration
gs_clf_NB = gs.best_estimator_
print('\n Score Metrics Naive Bayes Classifier')
test_classifier(gs_clf_NB, data_dict, features_list, folds=1000)
print("")
```

### Conclusions

During this project, we have employed machine learning techniques to classify ENRON employees as POIs or non-POIs. To do so, we have cleaned and analyzed financial and email-related data. We have found that the Naive Bayes classifier is the one that generalizes best and, after parameter tuning, we have obtained an accuracy above 0.8 and recall and precision above 0.3. These results could likely be improved with the inclusion of more POI data.

### Bibliography

* https://books.google.at/books/about/Introduction_to_Machine_Learning_with_Py.html?id=qjUVogEACAAJ&redir_esc=y
* http://scikit-learn.org/stable/modules/outlier_detection.html
* http://scikit-learn.org/stable/modules/preprocessing.html
* https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
* https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
* https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
* http://scikit-learn.org/stable/modules/cross_validation.html
* https://stackoverflow.com/questions/45969390/difference-between-stratifiedkfold-and-stratifiedshufflesplit-in-sklearn