# Project: Identify Fraud from Enron Email
by: Murat Gürcü

Company: Airbus

Country: Germany (Munich)

## 1. Project Overview

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I will play detective, and put my new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. 


## 2. Questions

__1.) Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]__

The main goal of this project is to classify each person in the dataset as a Person of Interest (POI) or a non-POI. To do so, I will build a prediction model that can classify POIs and non-POIs quickly and easily. Machine learning is useful here because it can learn, from people already labeled as POIs, which financial and email patterns separate them from everyone else. First, a short summary of the Enron dataset:

As established in Lesson 6, the dataset has 146 entries (people) with 21 features per person. 18 of these people are POIs and 128 are non-POIs.

__An overview of the dataset follows:__

Index: 146 entries

Data columns (total 21 columns):

Name | Non-Null | Type |
--- | --- | --- |
salary | 95 | float64 |
to_messages | 86 | float64 |
deferral_payments | 39 | float64 |
total_payments | 125 | float64 |
exercised_stock_options | 102 | float64 |
bonus | 82 | float64 |
restricted_stock | 110 | float64 |
shared_receipt_with_poi | 86 | float64 |
restricted_stock_deferred | 18 | float64 |
total_stock_value | 126 | float64 |
expenses | 95 | float64 |
loan_advances | 4 | float64 |
from_messages | 86 | float64 |
other | 93 | float64 |
from_this_person_to_poi | 86 | float64 |
poi | 146 | bool |
director_fees | 17 | float64 |
deferred_income | 49 | float64 |
long_term_incentive | 66 | float64 |
email_address | 111 | object |
from_poi_to_this_person | 86 | float64 |

dtypes: bool(1), float64(19), object(1)

memory usage: 24.1+ KB
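An overview like the one above can be produced with pandas. The following is only a minimal sketch: it assumes the course dataset is a dict of dicts in which missing values are stored as the string "NaN", and it uses three hypothetical people instead of the real 146 entries.

```python
import pandas as pd

# Hypothetical three-person excerpt in the assumed dict-of-dicts layout,
# where missing values are stored as the string "NaN".
data_dict = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "poi": True},
    "PERSON B": {"salary": "NaN", "bonus": 100000, "poi": False},
    "PERSON C": {"salary": 180000, "bonus": 50000, "poi": False},
}

# One row per person, one column per feature.
df = pd.DataFrame.from_dict(data_dict, orient="index")

# Convert the feature columns to floats; "NaN" strings become real NaN.
num_cols = [col for col in df.columns if col != "poi"]
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)

df.info()                      # non-null count and dtype per column
n_poi = int(df["poi"].sum())   # number of POIs
n_non_poi = len(df) - n_poi    # number of non-POIs
print(n_poi, n_non_poi)
```

On the real data, the same two counts should correspond to the 18 POIs and 128 non-POIs reported above.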

To identify outliers, we first analyze the total stock values.

Here is the graph with outliers:

<img src="total_stock_values-salary_w outliers.PNG">

##### TOP TOTAL STOCK VALUE #
Name|Value|
---|----|
TOTAL    |             434509511.0
LAY KENNETH L |         49110078.0
HIRKO JOSEPH   |        30766064.0
SKILLING JEFFREY K |    26093672.0
PAI LOU L          |    23817930.0

Here we can already see that _Total_ is an outlier: its value is far above that of any individual. To identify more outliers, we add a new column with the percentage of missing values per record.

##### TOP MISSING VALUES RECORDS #
Name|Value|
---|----|
LOCKHART EUGENE E  |              95.238095
WROBEL BRUCE        |             85.714286
THE TRAVEL AGENCY IN THE PARK |   85.714286
GRAMM WENDY L                  |  85.714286
WHALEY DAVID A                 |  85.714286

Besides _Total_, we also remove _The Travel Agency in the Park_, as it is not an individual and most of its values are missing. Additionally, we remove _Lockhart Eugene E_, as about 95% of its columns are empty.

The dataset now has 143 entries.
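The missing-value percentage can be sketched as follows. This is an illustration on a hypothetical four-row frame, not the original code; the row labels and values are made up. It assumes the percentage is taken over all columns of a row, which matches the figures above (20 of 21 columns empty gives about 95.24%).

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the real one has 146 rows and 21 columns.
df = pd.DataFrame(
    {
        "salary": [1000.0, np.nan, np.nan, 2000.0],
        "bonus": [500.0, np.nan, np.nan, np.nan],
        "total_stock_value": [9000.0, np.nan, 300.0, 4000.0],
        "poi": [False, False, False, True],
    },
    index=["TOTAL", "EMPTY PERSON", "SPARSE PERSON", "NORMAL PERSON"],
)

# Share of missing cells per row, in percent.
df["percent_missing"] = df.isna().mean(axis=1) * 100

# Records with the most gaps first.
print(df["percent_missing"].sort_values(ascending=False).head())

# Drop the aggregate row and the almost empty record by label.
cleaned = df.drop(index=["TOTAL", "EMPTY PERSON"])
print(len(cleaned))
```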

Distribution without outliers:

<img src="total_stock_values-salary_wo outliers.PNG">


__2.) What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]__

Next, we update the feature list by creating two new features, __messages_to_poi__ and __messages_from_poi__, and by deleting features with too many null entries.

The new feature messages_to_poi is the ratio of emails a person sent to POIs, and messages_from_poi is the ratio of emails a person received from POIs. Ratios make people with very different email volumes comparable.
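A minimal sketch of these two ratios is shown below. The exact formulas are an assumption: the counts to and from POIs are divided by a person's total sent and received messages, and the three people are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical email counts; PERSON C has no email data at all.
df = pd.DataFrame(
    {
        "from_messages": [100.0, 40.0, np.nan],
        "from_this_person_to_poi": [25.0, 0.0, np.nan],
        "to_messages": [200.0, 50.0, np.nan],
        "from_poi_to_this_person": [10.0, 5.0, np.nan],
    },
    index=["PERSON A", "PERSON B", "PERSON C"],
)

# Assumed definition: share of a person's sent emails that went to POIs.
df["messages_to_poi"] = df["from_this_person_to_poi"] / df["from_messages"]

# Assumed definition: share of a person's received emails that came from POIs.
df["messages_from_poi"] = df["from_poi_to_this_person"] / df["to_messages"]

print(df[["messages_to_poi", "messages_from_poi"]])
```

Rows without email data stay missing in both new columns, so the new features end up with the same non-null count as the email counts they are built from.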

As seen in Question 1, the features __loan_advances, director_fees, restricted_stock_deferred and deferral_payments__ have the most missing entries and are therefore removed.

After this update, the DataFrame looks as follows:

Index: 143 entries

Data columns (total 20 columns):

Name | Non-Null | Type |
--- | --- | --- |
salary | 94 | float64 |
to_messages | 86 | float64 |
total_payments | 123 | float64 |
exercised_stock_options | 101 | float64 |
bonus | 81 | float64 |
restricted_stock | 109 | float64 |
shared_receipt_with_poi | 86 | float64 |
total_stock_value | 125 | float64 |
expenses | 94 | float64 |
from_messages | 86 | float64 |
other | 91 | float64 |
from_this_person_to_poi | 86 | float64 |
poi | 143 | bool |
deferred_income | 48 | float64 |
long_term_incentive | 65 | float64 |
email_address | 111 | object |
from_poi_to_this_person | 86 | float64 |
percent_missing | 143 | float64 |
messages_to_poi | 86 | float64 |
messages_from_poi | 86 | float64 |

dtypes: bool(1), float64(18), object(1)

memory usage: 22.5+ KB

Next, we check the rows for missing values and delete rows with more than 75% missing values.

These persons are:

- CHAN RONNIE
- WHALEY DAVID A
- CLINE KENNETH W
- WAKEHAM JOHN
- WROBEL BRUCE
- SAVAGE FRANK
- GRAMM WENDY L

After cleaning, the DataFrame looks as follows:

Index: __136 entries__

Data columns (total 20 columns):

Name | Non-Null | Type |
--- | --- | --- |
salary | 94 | float64 |
to_messages | 86 | float64 |
total_payments | 120 | float64 |
exercised_stock_options | 99 | float64 |
bonus | 81 | float64 |
restricted_stock | 107 | float64 |
shared_receipt_with_poi | 86 | float64 |
total_stock_value | 122 | float64 |
expenses | 93 | float64 |
from_messages | 86 | float64 |
other | 91 | float64 |
from_this_person_to_poi | 86 | float64 |
poi | 136 | bool |
deferred_income | 46 | float64 |
long_term_incentive | 65 | float64 |
email_address | 111 | object |
from_poi_to_this_person | 86 | float64 |
percent_missing | 136 | float64 |
messages_to_poi | 86 | float64 |
messages_from_poi | 86 | float64 |

dtypes: bool(1), float64(18), object(1)

The remaining missing values are replaced with the __mean__ of their column.
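Mean imputation of this kind might look as follows in pandas. This is a sketch on hypothetical values, not the original code; it fills only the numeric columns and leaves the boolean `poi` label untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each numeric column.
df = pd.DataFrame(
    {
        "salary": [100.0, np.nan, 300.0],
        "bonus": [np.nan, 20.0, 40.0],
        "poi": [True, False, False],
    }
)

# Numeric columns only; the boolean label is not imputed.
num_cols = df.select_dtypes(include="number").columns

# Replace each gap with the mean of its own column.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
print(df)
```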



We chose scikit-learn's __SelectKBest__ to select the most influential features. SelectKBest keeps the k features with the highest scores (see also scikit-learn.org - sklearn.feature_selection). Before running SelectKBest, we scaled the dataset with sklearn.preprocessing.scale, which standardizes a dataset along any axis (see also scikit-learn.org - sklearn.preprocessing.scale).
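The scaling and scoring steps can be sketched as follows. The data is synthetic, and the score function is an assumption: the sketch uses `f_classif`, the default of SelectKBest, because the report does not name the one that was used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import scale

# Synthetic data: one feature that depends on the label, one pure noise.
rng = np.random.RandomState(42)
n = 60
y = np.array([1] * 10 + [0] * 50)
informative = y * 3.0 + rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
feature_names = ["informative", "noise"]

# Standardize every column to zero mean and unit variance.
X_scaled = scale(X)

# k="all" scores every feature without discarding any of them yet.
selector = SelectKBest(score_func=f_classif, k="all")
selector.fit(X_scaled, y)

# Highest score first, similar in layout to a score table.
ranked = sorted(zip(selector.scores_, feature_names), reverse=True)
for score, name in ranked:
    print(f"{score:6.2f} | {name}")
```

With `f_classif`, rescaling a single feature does not change its score, so the standardization matters mainly for the classifiers that are trained afterwards.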

#### BEST FEATURES 

Score | Feature Name | 
--- | --- | 
27.44 | exercised_stock_options 
19.97 | total_stock_value 
12.72 | messages_to_poi 
10.89 | bonus 
8.95 | salary 
7.19 | total_payments
6.37 | restricted_stock 
5.68 | long_term_incentive 
5.46 | shared_receipt_with_poi 
4.91 | deferred_income 
2.89 | from_poi_to_this_person 
1.82 | other 
1.30 | from_this_person_to_poi 
1.11 | messages_from_poi 
0.56 | from_messages 
0.55 | expenses 
0.35 | to_messages 

As we can see, only 2 of the top 10 features are email features; the rest are financial features. Our engineered features also score very differently: "Messages to POI" ranks third and is therefore significant for further investigation, while "Messages from POI" ranks 14th and is much less significant. We have not fixed the value of K yet, because we want to stay flexible and, if needed, also include our engineered feature "Messages from POI" (with a K of 15).

__3.) What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]__

An ideal system combines high precision with high recall and therefore returns many correctly labeled results. I apply this criterion to my model selection as well. However, the dataset is very small, so the results depend strongly on the particular train/test split (for example, a test set may contain only 1 POI). We therefore need to run the model over many different splits, as is done in test_classifier(). The following table shows the results for 1 run and for 1,000 runs (as in test_classifier()).

I tried three different algorithms. Here is a summary of the models and their accuracy, precision and recall results:

Model | Accuracy | Precision (1,000 runs) | Recall (1,000 runs) | Precision (1 run) | Recall (1 run) | Notes |
--- | --- | --- | --- | --- | --- | --- |
Logistic Regression | 0.84207 | 0.40294 | 0.21900 | 0.60000 | 0.60000 | chosen model |
SVM | no result | no result | no result | 0.50000 | 0.20000 | due to the small sample size, the evaluation over 1,000 runs ends in a division by zero and delivers no results |
Decision Tree | 0.80750 | 0.33237 | 0.34450 | 0.666667 | 0.80000 | |


__Conclusion:__ Based on the results for 1 run and 1,000 runs and on accuracy (the proportion of all predictions that are correct), I choose Logistic Regression, as it shows less deviation between 1 run and 1,000 runs and has the highest accuracy. SVM delivers no results for 1,000 runs because of the small sample size. Decision Tree has good precision and recall at 1,000 runs, but lower accuracy and more volatility between 1 run and 1,000 runs.
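One way a division by zero can arise in such an evaluation is a classifier that never predicts a POI: precision then has a denominator of zero. The sketch below illustrates this with hypothetical labels. Whether this is exactly what happened to the SVM here is an assumption.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical test labels with two POIs, and a classifier
# that predicts "non-POI" for everyone.
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0] * 10

# Precision is TP / (TP + FP); both counts are zero here.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print("precision denominator:", tp + fp)

# zero_division=0 tells scikit-learn to report 0.0 instead of warning.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
print(precision, recall)
```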

__4.) What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]__

Tuning the parameters of an algorithm is the final step toward getting the best results. It means searching for the parameter values that give the model its best performance, so tuning is essentially a search problem. With poor tuning, performance will be low, and the model may overfit or be biased. Scikit-learn offers several search methods; two simple ones are grid search and random search.

I applied grid search, which systematically builds and evaluates a model for every combination of the parameter values specified in a grid.
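A grid search over a logistic regression could be sketched as follows. The parameter grid, the scoring metric and the synthetic data are assumptions chosen for illustration; they are not the values used in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic, imbalanced data: 20 positives among 100 samples.
rng = np.random.RandomState(0)
y = np.array([1] * 20 + [0] * 80)
X = np.column_stack([y * 2.0 + rng.normal(size=100), rng.normal(size=100)])

# Hypothetical grid: 4 values of C times 2 class weights = 8 combinations.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

# Every combination is evaluated on the same 10 stratified splits.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a design choice for imbalanced classes, since accuracy can be high even when few POIs are found.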


__5.) What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]__

Wikipedia defines cross-validation, a common validation technique, as follows:


<font color='blue'> Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc. </font>

<font color='blue'> One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. </font>


<font color='blue'> One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique. </font>


<font color='blue'> In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. </font>

A classic mistake is to evaluate a model on the same data it was trained on, which hides overfitting. To avoid overfitting, I kept my approach simple by tuning just a few parameters. I built a function called _tune_and_eval_clf()_, which applies the cross-validation technique _sklearn.model_selection.StratifiedShuffleSplit()_ to split the data into training and test data 10 times and calculates the accuracy, precision, and recall of each iteration.
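The validation loop described above can be sketched as follows. This is not the original _tune_and_eval_clf()_ function; the data is synthetic and imbalanced, and the test size of 30% is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced data: 18 positives among 136 samples.
rng = np.random.RandomState(1)
y = np.array([1] * 18 + [0] * 118)
X = np.column_stack([y * 2.0 + rng.normal(size=136), rng.normal(size=136)])

# 10 random splits that keep the class ratio in train and test sets.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = {"accuracy": [], "precision": [], "recall": []}

for train_idx, test_idx in sss.split(X, y):
    # Fit on the training part only, then score on the held-out part.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores["accuracy"].append(accuracy_score(y[test_idx], pred))
    scores["precision"].append(precision_score(y[test_idx], pred, zero_division=0))
    scores["recall"].append(recall_score(y[test_idx], pred, zero_division=0))

# Average each metric over the 10 iterations.
for name, values in scores.items():
    print(f"{name}: {np.mean(values):.3f}")
```

Stratification keeps some positives in every test set, which makes the precision and recall averages less erratic on a dataset this small.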



__6.) Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]__

For the evaluation of my model I've used 3 metrics: accuracy, precision, and recall.

__Accuracy__ is the proportion of correct predictions (true positives and true negatives) among all cases examined. Wikipedia gives a more general definition from the field of measurement:
Accuracy has two definitions:
1. More commonly, it is a description of systematic errors, a measure of statistical bias; as these cause a difference between a result and a "true" value, ISO calls this trueness.
2. Alternatively, ISO defines accuracy as describing a combination of both types of observational error above (random and systematic), so high accuracy requires both high precision and high trueness.

__Precision__ is the number of true positives divided by the sum of true positives and false positives.

__Recall__ is the number of true positives divided by the sum of true positives and false negatives.

Wikipedia illustrates both metrics as follows:
<img src="Precisionrecall_svg.PNG">

As explained in 3.), I chose __Logistic Regression__, as it shows less deviation between 1 run and 1,000 runs and has the highest accuracy. An accuracy of 0.84207 means that the proportion of true results (both true positives and true negatives) among the total number of cases examined is 0.84207. A precision of 0.40294 means that out of every 100 persons classified as POIs, about 40 are actually POIs. A recall of 0.21900 means that out of every 100 true POIs in the dataset, about 22 are correctly classified as POIs.
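The three metrics can be computed directly from the four confusion-matrix counts. The counts below are hypothetical and chosen only to show how a high accuracy can coexist with a low recall when POIs are rare; they are not the counts behind the results above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 18 actual POIs, of which only 4 are found.
accuracy, precision, recall = classification_metrics(tp=4, fp=6, fn=14, tn=112)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this made-up case, most predictions are correct simply because most people are non-POIs, while 14 of the 18 POIs are still missed.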

### References:
- https://en.wikipedia.org/wiki/Precision_and_recall
- https://en.wikipedia.org/wiki/Accuracy_and_precision
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- http://scikit-learn.org
- Udacity course Intro to Machine Learning

### Attachments:
- Main function poi_id2.py
- Tester file tester2.py
- Support functions enron2.py and feature_format.py
- Pickle files (final_project_dataset, my_classifier, my_dataset, my_feature_list)
- Dataset files (my_dataframe.xlsl)