# Identify Fraud from Enron Email
<hr>

## 1. Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, we will build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.

## 2. Description of the Dataset 

The project folder contains ***enron61702insiderpay.pdf***, the source of the financial data used in this project. The data presented in the ***pdf*** is stored as a dictionary in a ***pickle*** file, ***final_project_dataset.pkl***, which ***poi_id.py*** uses heavily to describe the dataset. Now let's look at the dataset.

- **It starts as the following**:

  ```python
  {'METTS MARK': {'salary': 365788, 'to_messages': 807, ...} ... }
  ```
    
  That is, each person's name is a key of the outer dictionary, and the features are the keys of the nested dictionaries.
  
- **Converting it to a Pandas DataFrame, the first 5 entries are as follows**:

|                    |   salary |   to_messages |   deferral_payments |   total_payments |   exercised_stock_options |          bonus |   restricted_stock |   shared_receipt_with_poi |   restricted_stock_deferred |   total_stock_value |   expenses |   loan_advances |   from_messages |           other |   from_this_person_to_poi | poi   |   director_fees |   deferred_income |   long_term_incentive | email_address              |   from_poi_to_this_person |
|:-------------------|---------:|--------------:|--------------------:|-----------------:|--------------------------:|---------------:|-------------------:|--------------------------:|----------------------------:|--------------------:|-----------:|----------------:|----------------:|----------------:|--------------------------:|:------|----------------:|------------------:|----------------------:|:---------------------------|--------------------------:|
| ALLEN PHILLIP K    |   201955 |          2902 |         2.86972e+06 |          4484442 |               1.72954e+06 |      4.175e+06 |   126027           |                      1407 |                     -126027 |             1729541 |      13868 |             nan |            2195 |    152          |                        65 | False |             nan |      -3.08106e+06 |      304805           | phillip.allen@enron.com    |                        47 |
| BADUM JAMES P      |      nan |           nan |    178980           |           182466 |          257817           |    nan         |      nan           |                       nan |                         nan |              257817 |       3486 |             nan |             nan |    nan          |                       nan | False |             nan |     nan           |         nan           | NaN                        |                       nan |
| BANNANTINE JAMES M |      477 |           566 |       nan           |           916197 |               4.04616e+06 |    nan         |        1.75755e+06 |                       465 |                     -560222 |             5243487 |      56301 |             nan |              29 | 864523          |                         0 | False |             nan |   -5104           |         nan           | james.bannantine@enron.com |                        39 |
| BAXTER JOHN C      |   267102 |           nan |         1.29574e+06 |          5634343 |               6.68054e+06 |      1.2e+06   |        3.94271e+06 |                       nan |                         nan |            10623258 |      11200 |             nan |             nan |      2.6603e+06 |                       nan | False |             nan |      -1.38606e+06 |           1.58606e+06 | NaN                        |                       nan |
| BAY FRANKLIN R     |   239671 |           nan |    260455           |           827696 |             nan           | 400000         |   145796           |                       nan |                      -82782 |               63014 |     129142 |             nan |             nan |     69          |                       nan | False |             nan | -201641           |         nan           | frank.bay@enron.com        |                       nan |
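The conversion to a DataFrame can be sketched as follows. This is a minimal illustration, not the code from ***poi_id.py***: the `sample` dictionary is a hand-made stand-in for the real pickle contents (two people, three features), and replacing the string `'NaN'` with a real NaN is an assumed cleaning step.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the real dictionary (hypothetical subset of features).
sample = {
    "ALLEN PHILLIP K": {"salary": 201955, "bonus": 4175000, "poi": False},
    "BADUM JAMES P": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

# orient="index" turns each person into a row and each feature into a column.
df = pd.DataFrame.from_dict(sample, orient="index")

# Missing values are stored as the string "NaN"; convert them to real NaN values.
df = df.replace("NaN", np.nan)

print(df.shape)                   # (rows, columns)
print(df["salary"].isna().sum())  # number of missing salaries
```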

  
- **Let's see the total number of employees in the dataset**:
  
  ```python
  Total Employees: 146
  ```
  
  
- **Next we want to see the total number of features or attributes that define each employee**:

  ```python
  Total Features: 21
  ```
  
  
- **Now let us see what the features are**:

  ```python
  ['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']
  ```


- **The features are mixed as Financial, Email and POI. Let's segregate the features as Financial and Email**:

    - ***Financial (14 features)***: 
    ```python
    ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']
    ```
    - ***Email (6 features)***: 
    ```python
    ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']
    ```
    - ***POI (1 feature)***: 
    ```python
    ['poi']
    ```


- **A sneak peek into POI**:

  ```python
  Total POI: 18
  ```
  
  
- **Descriptions of the features**:
  - **Salary**: Reflects items such as base salary, executive cash allowances, and benefits payments.
  - **Bonus**: Reflects annual cash incentives paid based upon company performance. Also may include other retention payments.
  - **Long Term Incentives**: Reflects long-term incentive cash payments from various long-term incentive programs over a multi-year period, generally 3 to 5 years.
  - **Deferred Income**: Reflects voluntary executive deferrals of salary, annual cash incentives, and long-term cash incentives as well as cash fees deferred by non-employee directors under a deferred compensation arrangement. May also reflect deferrals under a stock option or phantom stock unit in lieu of cash arrangement.
  - **Deferral Payments**: Reflects distributions from a deferred compensation arrangement due to termination of employment or due to in-service withdrawals as per plan provisions.
  - **Loan Advances**: Reflects total amount of loan advances, excluding repayments, provided by the Debtor in return for a promise of repayment. In certain instances, the terms of the promissory notes allow for the option to repay with stock of the company.
  - **Other**: Reflects items such as payments for severance, consulting services, relocation costs, tax advances and allowances for employees on international assignment (i.e. housing allowances, cost of living allowances, payments under Enron’s Tax Equalization Program, etc.). May also include payments provided with respect to employment agreements, as well as imputed income amounts for such things as use of corporate aircraft.
  - **Expenses**: Reflects reimbursements of business expenses. May include fees paid for consulting services.
  - **Director Fees**: Reflects cash payments and/or value of stock grants made in lieu of cash payments to non-employee directors.
  - **Exercised Stock Options**: Reflects amounts from exercised stock options which equal the market value in excess of the exercise price on the date the options were exercised either through cashless (same-day sale), stock swap or cash exercises. The reflected gain may differ from that realized by the insider due to fluctuations in the market price and the timing of any subsequent sale of the securities.
  - **Restricted Stock**: Reflects the gross fair market value of shares and accrued dividends (and/or phantom units and dividend equivalents) on the date of release due to lapse of vesting periods, regardless of whether deferred.
  - **Restricted Stock Deferred**: Reflects value of restricted stock voluntarily deferred prior to release under a deferred compensation arrangement.
  - **Total Payments**:
  - **Total Stock Value**:
  
- **Percentage of NaN**:

| Feature                   | NaN (%) |
|:--------------------------|--------:|
| salary                    |   34.93 |
| to_messages               |   41.10 |
| deferral_payments         |   73.29 |
| total_payments            |   14.38 |
| loan_advances             |   97.26 |
| bonus                     |   43.84 |
| email_address             |   23.97 |
| restricted_stock_deferred |   87.67 |
| total_stock_value         |   13.70 |
| shared_receipt_with_poi   |   41.10 |
| long_term_incentive       |   54.79 |
| exercised_stock_options   |   30.14 |
| from_messages             |   41.10 |
| other                     |   36.30 |
| from_poi_to_this_person   |   41.10 |
| from_this_person_to_poi   |   41.10 |
| poi                       |    0.00 |
| deferred_income           |   66.44 |
| expenses                  |   34.93 |
| restricted_stock          |   24.66 |
| director_fees             |   88.36 |
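The counts and NaN percentages in this section can be reproduced with a few lines of plain Python. The sketch below assumes the dictionary layout shown earlier, with missing values stored as the string `'NaN'`; the helper name `nan_percentage_by_feature` and the four-person `sample` are made up for illustration.

```python
def nan_percentage_by_feature(data):
    """Return {feature: percentage of people whose value is the string 'NaN'}."""
    n_people = len(data)
    counts = {}
    for features in data.values():
        for name, value in features.items():
            counts[name] = counts.get(name, 0) + (1 if value == "NaN" else 0)
    return {name: 100.0 * count / n_people for name, count in counts.items()}


# Hypothetical miniature dataset in the same shape as the real dictionary.
sample = {
    "A": {"salary": 100, "bonus": "NaN", "poi": False},
    "B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "C": {"salary": 300, "bonus": 50, "poi": False},
    "D": {"salary": 400, "bonus": "NaN", "poi": False},
}

total_people = len(sample)
total_features = len(next(iter(sample.values())))
total_poi = sum(1 for features in sample.values() if features["poi"])
summary = nan_percentage_by_feature(sample)

print(total_people, total_features, total_poi)
print(summary)
```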

## 3. Outliers

Before we dig into feature selection and classifiers, let us look for outliers.

### 3.1 Salary vs Bonus

We use ***data_wrangling.py*** to generate the plots and perform outlier removal. The idea behind plotting salary vs bonus is to see whether people with a high salary also receive a high bonus. This outlier detection also gives us some insight into feature selection.

![salary_bonus.png](./outlier_plots/salary_bonus1.png)

The plot shows one extreme outlier: ***TOTAL***. That makes sense, because the dictionary contains a key called "TOTAL" that holds column totals rather than a person. It needs to be removed, so we re-plot after removing the item from our dictionary.

![salary_bonus.png](./outlier_plots/salary_bonus2.png)

Now the remaining outliers are real people, so we keep them. However, there is one thing to notice: many entries have a very low bonus. Let us find out why.

The list produced by ***data_wrangling.py*** shows one more entry that is not a real person:
   - THE TRAVEL AGENCY IN THE PARK
  
Here is a quick look at the features and values of "THE TRAVEL AGENCY IN THE PARK":

```python
{'salary': 'NaN', 'to_messages': 'NaN', 'deferral_payments': 'NaN', 'total_payments': 362096, 'exercised_stock_options': 'NaN', 'bonus': 'NaN', 'restricted_stock': 'NaN', 'shared_receipt_with_poi': 'NaN', 'restricted_stock_deferred': 'NaN', 'total_stock_value': 'NaN', 'expenses': 'NaN', 'loan_advances': 'NaN', 'from_messages': 'NaN', 'other': 362096, 'from_this_person_to_poi': 'NaN', 'poi': False, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 'NaN', 'email_address': 'NaN', 'from_poi_to_this_person': 'NaN'}
```

We see that most of its values are NaN. This suggests computing the percentage of NaN values for every employee to check for further outliers. Running ***data_wrangling.py***, the 10 employees with the highest NaN percentages are as follows.

```python
 [['LOCKHART EUGENE E', 95.23809523809523],
 ['WHALEY DAVID A', 85.71428571428571],
 ['WROBEL BRUCE', 85.71428571428571],
 ['THE TRAVEL AGENCY IN THE PARK', 85.71428571428571],
 ['GRAMM WENDY L', 85.71428571428571],
 ['WODRASKA JOHN', 80.95238095238095],
 ['CLINE KENNETH W', 80.95238095238095],
 ['WAKEHAM JOHN', 80.95238095238095],
 ['SCRIMSHAW MATTHEW', 80.95238095238095],
 ['GILLIS JOHN', 80.95238095238095]]
 ```
 
**LOCKHART EUGENE E** has NaN for about 95% of the features (every feature except one), so this entry carries almost no information and can hurt the metrics of our classifier. We treat it as an outlier as well.

> Conclusion: **TOTAL**, **THE TRAVEL AGENCY IN THE PARK** and **LOCKHART EUGENE E** are not needed and are hence removed from our data dictionary.
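The per-person NaN ranking and the removal step can be sketched as below. This is not the code from ***data_wrangling.py***; the helper names `nan_percentage_by_person` and `remove_entries` and the small `sample` dictionary are assumptions made for illustration.

```python
def nan_percentage_by_person(data):
    """Return [[name, percent_nan], ...] sorted from most to least missing."""
    rows = []
    for name, features in data.items():
        n_missing = sum(1 for value in features.values() if value == "NaN")
        rows.append([name, 100.0 * n_missing / len(features)])
    return sorted(rows, key=lambda row: row[1], reverse=True)


def remove_entries(data, names):
    """Return a copy of the dictionary without the given keys."""
    return {key: value for key, value in data.items() if key not in names}


# Hypothetical miniature dataset; only the key names mirror the real one.
sample = {
    "TOTAL": {"salary": 900, "bonus": 500, "other": 10, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "other": "NaN", "poi": False},
    "SOME PERSON": {"salary": 300, "bonus": "NaN", "other": 5, "poi": True},
}

ranking = nan_percentage_by_person(sample)
cleaned = remove_entries(sample, {"TOTAL", "LOCKHART EUGENE E"})

print(ranking)
print(sorted(cleaned))
```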

## 4. Feature Engineering

Not all of the given features in the dictionary may be useful. We may need to find combinations, or transformations, of the original features that make finding POIs easier.

### 4.1 Intuitive Features

Let's try to generate some.

- Exercised stock options are an award of company stock, with the right to sell or transfer the stock. So, chances are that people with high exercised stock options are key leaders who know what is going on. Exercised stock can be sold whenever its holder wants, as there are no restrictions attached to it. So we can define a new feature, **capital_in_hand**, as the **sum** of **total_payments** and **exercised_stock_options**.

- The next feature is **poi_email_engagement**, which is the **ratio** of **total emails from/to poi** to **total to/from messages**.

- We can also think of something like **from_poi_ratio** which is similar to the previous feature and is defined as the **ratio** of **total messages from poi** and **total from messages**.

- Next is **to_poi_ratio** which is similar to the previous feature and is defined as the **ratio** of **total messages to poi** and **total to messages**.

- Finally, I thought of **email_involvement**, which is the **sum** of **total from/to messages** and **total shared receipt with poi**.
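The five features above can be sketched as follows. The formulas follow the definitions in the list; the NaN handling is an assumption of this sketch (a missing value counts as 0, and a ratio with a zero denominator is set to 0.0), and the helper names and the example numbers are hypothetical.

```python
def as_number(value):
    """Treat the string 'NaN' as 0 (assumed convention for this sketch)."""
    return 0 if value == "NaN" else value


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return float(numerator) / denominator if denominator else 0.0


def add_engineered_features(person):
    """Return a copy of one person's feature dict with the five new features."""
    p = {key: as_number(value) for key, value in person.items()}
    from_poi = p["from_poi_to_this_person"]
    to_poi = p["from_this_person_to_poi"]
    all_messages = p["to_messages"] + p["from_messages"]

    new = dict(person)
    new["capital_in_hand"] = p["total_payments"] + p["exercised_stock_options"]
    new["poi_email_engagement"] = safe_ratio(from_poi + to_poi, all_messages)
    new["from_poi_ratio"] = safe_ratio(from_poi, p["from_messages"])
    new["to_poi_ratio"] = safe_ratio(to_poi, p["to_messages"])
    new["email_involvement"] = all_messages + p["shared_receipt_with_poi"]
    return new


# Made-up numbers, chosen so the results are easy to check by hand.
example = add_engineered_features({
    "total_payments": 1000, "exercised_stock_options": 500,
    "from_poi_to_this_person": 10, "from_this_person_to_poi": 20,
    "to_messages": 200, "from_messages": 100, "shared_receipt_with_poi": 50,
})
missing = add_engineered_features({
    "total_payments": 1000, "exercised_stock_options": "NaN",
    "from_poi_to_this_person": "NaN", "from_this_person_to_poi": "NaN",
    "to_messages": "NaN", "from_messages": "NaN", "shared_receipt_with_poi": "NaN",
})
print(example["capital_in_hand"], example["email_involvement"])
```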

### 4.2 Selecting Top 5

Next I used `SelectKBest()` method of `sklearn` to select the top 5 features along with their scores. The table below shows my findings.

|     features            |   score ^ |
|:------------------------|----------:|
| exercised_stock_options |   24.8151 |
| total_stock_value       |   24.1829 |
| bonus                   |   20.7923 |
| salary                  |   18.2897 |
| capital_in_hand         |   16.5432 |

We can see that our feature **capital_in_hand** made it into the top 5. I discard the remaining features for further analysis, as their `SelectKBest()` scores are considerably lower.

The algorithms I use in the following section do not strictly require **feature scaling**, so I skip it.
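A minimal sketch of ranking features with `SelectKBest` is shown below. It runs on synthetic data rather than the Enron dataset, and it assumes the default score function `f_classif` (the ANOVA F-value); the feature names and the choice of `k=2` are made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature that separates the classes and three noise features.
rng = np.random.RandomState(42)
n_samples = 100
y = np.array([0] * 80 + [1] * 20)
informative = 3.0 * y + rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 3))
X = np.column_stack([informative, noise])
names = ["informative", "noise_a", "noise_b", "noise_c"]

# Score every feature against the labels and keep the k best.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)

ranked = sorted(zip(names, selector.scores_), key=lambda pair: pair[1], reverse=True)
selected = [name for name, keep in zip(names, selector.get_support()) if keep]

for name, score in ranked:
    print(name, round(float(score), 4))
```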

## 5. Pick and Tune an Algorithm

Once we have finished selecting the best features, we build classifiers and train them on our outlier-removed dataset with the top 5 features. However, we are always in search of something better, and this is where parameter tuning comes into the picture. Tuning searches for the parameter values with which the algorithm performs best, and it is often necessary for good performance. I tuned manually and also used automatic optimization with `GridSearchCV`. The following sections discuss the classifiers I used along with the parameters I tuned.

### 5.1 Tuning Gaussian Naive Bayes Classifier

Naive Bayes is a simple technique for constructing classifiers - models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

The parameters of Naive Bayes that I have tuned:
- **priors**: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.



### 5.2 Tuning Decision Trees Classifier

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.

The parameters of Decision Trees that I have tuned are:
- **min_samples_leaf**:  The minimum number of samples required to be at a leaf node.
- **min_samples_split**: The minimum number of samples required to split an internal node.
- **criterion**: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
- **max_leaf_nodes**: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
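Tuning these four parameters with `GridSearchCV` can be sketched as below. The data is synthetic (`make_classification` with an imbalanced class ratio), and the grid values, the `f1` scoring, and the 5-fold stratified split are assumptions of this sketch, not the settings actually used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real feature matrix (5 features).
X, y = make_classification(
    n_samples=140, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.87, 0.13], random_state=42,
)

# Hypothetical grid over the parameters listed above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_leaf_nodes": [None, 5, 10],
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, scoring="f1", cv=cv
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Scoring on F1 rather than accuracy is one reasonable choice when the positive class is rare, since a classifier that never predicts the positive class can still reach a high accuracy.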

### 5.3 Tuning Logistic Regression Classifier

In statistics, logistic regression is a regression model where the dependent variable (DV) is categorical. There can be binary dependent variable, that is, where it can take only two values, "0" and "1", which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick. Cases where the dependent variable has more than two outcome categories may be analysed in multinomial logistic regression, or, if the multiple categories are ordered, in ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model.

The parameters I have tuned in this experiment are as follows:
- **C**: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
- **tol**:  Tolerance for stopping criteria.
- **penalty**: Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties.
- **class_weight**: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- **solver**: Algorithm to use in the optimization problem.
- **random_state**: The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator.

### 5.4 Tuning Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of over-fitting to their training set.

The parameters I have tuned for this classifier are as follows:

- **max_depth**: The depth of the trees. More depth may result in over-fitting.
- **max_leaf_nodes**: Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.
- **n_estimators**: The number of trees in the forest taken into account.
- **random_state**: The seed used by the random number generator.

## 6. Validate and Evaluate

### 6.1 Importance of Validation

Validation evaluates the performance of a model against an unseen data set, which guards against over-fitting to the training data. I used the tester code supplied by Udacity to perform the evaluation. As metrics I used Accuracy, Precision, Recall, F1 and F2 to judge the performance of the classifiers. Before defining them, we need 4 terms, **True Positive**, **True Negative**, **False Positive** and **False Negative**, from which the above metrics are computed.

- **True Positive**: A true positive test result is one that detects the condition when the condition is present. 
- **True Negative**: A true negative test result is one that does not detect the condition when the condition is absent.

- **False Positive**: A false positive test result is one that detects the condition when the condition is absent. 
- **False Negative**:  A false negative test result is one that does not detect the condition when the condition is present.

Now with these terms, we define the following metrics as follows:

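- **Accuracy**: It is defined as, $ \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}} $


- **Precision**: It is defined as, $ \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $


- **Recall**: It is defined as, $ \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $


- **F score**: The F1 score is defined as, $ 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $. More generally, $ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} $, so F2 ($ \beta = 2 $) weights recall more heavily than precision.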
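The definitions above translate directly into code. The sketch below is a hand-written illustration, not the Udacity tester; the function names are made up, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from the four confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 gives F2 (recall weighted higher)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Made-up confusion counts: 3 TP, 10 TN, 2 FP, 1 FN.
metrics = confusion_metrics(tp=3, tn=10, fp=2, fn=1)
f1 = f_beta(metrics["precision"], metrics["recall"], beta=1)
f2 = f_beta(metrics["precision"], metrics["recall"], beta=2)
print(metrics, f1, f2)
```

As a consistency check, plugging the Decision Trees precision (0.43259) and recall (0.33850) from the results table into `f_beta` reproduces the reported F1 and F2 values to within rounding.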


### 6.2 Scoring and Performance of the Classifiers

After trying out the classifiers, here are the results I got.

|     Classifiers            |   Accuracy | Precision | Recall | F1    | F2    |
|:---------------------------|:-----------|:----------|:-------|:------|:------|
|Naive Bayes                 |0.82493     |0.34412    |0.34550 |0.34481|0.34522|
|Decision Trees              |0.85260     |0.43259    |0.33850 |0.37980|0.35389|
|Logistic Regression         |0.76060     |0.29835    |0.58850 |0.39596|0.49267|
|Random Forest               |0.86127     |0.46018    |0.23400 |0.31024|0.25951|


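### 6.3 Choosing the best

Looking at the above statistics, Decision Trees gives the best balance. Random Forest has the highest accuracy and precision but the lowest recall (0.23400), and Logistic Regression has the highest recall, F1 and F2 but the lowest accuracy and precision (0.29835). Decision Trees has the second-highest accuracy (0.85260) and precision (0.43259) while keeping recall at 0.33850, and it beats Naive Bayes on accuracy, precision, F1 and F2. So I'll choose it.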

## 7. References

- https://en.wikipedia.org/wiki/Enron
- https://en.wikipedia.org
- https://www.coursera.org/
- https://www.datacamp.com/
- https://www.analyticsvidhya.com/
- http://scikit-learn.org/stable/