# Project : Identify fraud from Enron Email

## Project Overview 

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top Enron executives.

The Enron dataset, comprising emails and financial data, was made publicly available for research and analysis and can be downloaded from: https://www.cs.cmu.edu/~./enron/.

### Goal of the project
The goal of this project is to use machine learning to build a POI (person of interest) identifier based on the financial and email data made public as a result of the Enron scandal. Here a POI is a person who was charged with committing a crime, in this case in connection with the scandal at Enron.  

The overall work done for this project can be divided into four parts:
   1.  __Exploring the Enron Dataset__  
       This involves data cleaning, outlier removal and analysis.

   2.  __Feature processing of the Enron dataset__  
       This includes feature creation, feature scaling, feature selection and feature transformation.
      
   3.  __Choosing the algorithm__   
       Because the dataset is labelled and the expected output is discrete, supervised learning techniques are used in this project. The supervised learning algorithms are tuned to achieve the best performance on the test dataset. 
       
   4. __Evaluation__  
      This step involves validation, followed by an overall performance check using evaluation metrics such as precision, recall and f1 score.


In [None]:
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from time import time

import pandas as pd
from matplotlib import pyplot as plt

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

### 1. Exploring the Enron Dataset

The pickled Enron data is loaded into a pandas dataframe for easy analysis of the dataset. The key, i.e. the Enron employee's name, is used as the index of the dataframe.

In [None]:
# Converting the given pickled Enron data to a pandas dataframe
enron_df = pd.DataFrame.from_records(list(data_dict.values()))

# set the index of df to be the employees series:
employees = pd.Series(list(data_dict.keys()))
enron_df.set_index(employees, inplace=True)
enron_df.head()

In [None]:
print "Size of the enron dataframe : ",enron_df.shape

In [None]:
print "Number of data points(people) in the dataset : ",len(enron_df)

In [None]:
print "Number of features in the Enron dataset : ",len(enron_df.columns)

In [None]:
# Counting the number of POIs and non-POIs in the given dataset
poi_count = enron_df.groupby('poi').size()
print "Total number of POI's in the given dataset : ",poi_count.iloc[1]
print "Total number of non-POI's in the given dataset : ",poi_count.iloc[0]

When you load the data into the dataframe, the data types are strings (or, in pandas, objects).

In [None]:
enron_df.dtypes

The datatypes in the dataframe are converted to floating point for analysis, and NaN values are replaced with zeros.

In [None]:
# Coerce numeric values into floats or ints; also change NaN to zero:
enron_df_new = enron_df.apply(lambda x : pd.to_numeric(x, errors = 'coerce')).copy().fillna(0)
enron_df_new.head()
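
The effect of `errors='coerce'` followed by `fillna(0)` can be seen on a small made-up frame. This is only a sketch: the names and numbers are invented, and it assumes that missing entries in the raw data are stored as the string 'NaN'.

```python
import pandas as pd

# Toy records (hypothetical values), with missing entries stored as the string 'NaN'
toy = pd.DataFrame(
    {"salary": [1000, "NaN", 2500],
     "email_address": ["a@x.com", "NaN", "c@x.com"]},
    index=["PERSON A", "PERSON B", "PERSON C"],
)

# errors='coerce' turns anything that is not a number into NaN;
# fillna(0) then replaces every NaN with zero
toy_numeric = toy.apply(lambda col: pd.to_numeric(col, errors="coerce")).fillna(0)
print(toy_numeric)
```

Note that a purely textual column such as `email_address` becomes all zeros after this step, which is one reason it carries no information afterwards.
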

The email_address column is removed from the dataframe as it is not of much use in this project.

In [None]:
# Dropping column 'email_address' as not required in analysis
enron_df_new.drop('email_address', axis = 1, inplace = True)

# Checking the changed shape of df
enron_df_new.shape

### Outlier Investigation and Analyzing the features of Enron Dataset

#### Financial Features : Bonus and Salary
Drawing scatterplot of Bonus vs Salary of Enron employees

In [None]:
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == True],enron_df_new['bonus'][enron_df_new['poi'] == True], color = 'r',
           label = 'POI')
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == False],enron_df_new['bonus'][enron_df_new['poi'] == False],color = 'b',
           label = 'Not-POI')
    
plt.xlabel("Salary")
plt.ylabel("Bonus")
plt.title("Scatterplot of salary vs bonus w.r.t POI")
plt.legend(loc='upper left')
plt.show() 

In the above figure, one point has a very high salary and bonus. The cell below checks which row it belongs to.

In [None]:
# Finding the index label of the row having the maximum salary
enron_df_new['salary'].idxmax()
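
In recent pandas versions, `Series.idxmax()` returns the index label of the largest value, while `argmax()` returns its integer position. A minimal sketch on made-up numbers (the rows here are hypothetical) shows the difference and how a spreadsheet total row dominates a column:

```python
import pandas as pd

# Hypothetical salaries; the last row is the sum of the others
toy = pd.DataFrame({"salary": [200.0, 350.0, 550.0]},
                   index=["PERSON A", "PERSON B", "TOTAL"])

label = toy["salary"].idxmax()            # index label of the largest value
position = toy["salary"].values.argmax()  # integer position of the largest value
print(label)
print(position)
```
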

#### Removing Outlier 1 : 'TOTAL'
The 'TOTAL' row corresponding to the outlier is removed from the Enron dataframe, and the scatterplot of bonus vs salary is drawn again.

In [None]:
# Deleting the row 'TOTAL' from the dataframe
enron_df_new.drop('TOTAL', axis = 0, inplace = True)

# Drawing scatterplot with the modified dataframe
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == True],enron_df_new['bonus'][enron_df_new['poi'] == True], color = 'r',
           label = 'POI')
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == False],enron_df_new['bonus'][enron_df_new['poi'] == False],color = 'b',
           label = 'Not-POI')
    
plt.xlabel("Salary")
plt.ylabel("Bonus")
plt.title("Scatterplot of salary vs bonus w.r.t POI")
plt.legend(loc='upper left')
plt.show() 

From the above figure, it is observed that the data becomes more spread out and more comprehensible after the outlier removal. It is also observed that the bonuses of POIs tend to be higher than those of non-POIs.

Since the POIs were taking large bonuses in addition to their high salaries, the ratio of bonus to salary may be higher for POIs than for non-POIs. So I create a new feature called __bonus-to-salary_ratio__, hoping that it may aid POI identification in the later parts of this project.

#### Feature created 1 : bonus-to-salary_ratio

In [None]:
# Created a new feature
enron_df_new['bonus-to-salary_ratio'] = enron_df_new['bonus']/enron_df_new['salary']
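
Because missing salaries were replaced with zeros, a ratio like this can produce `inf` (non-zero divided by zero) or `NaN` (zero divided by zero). The following is a minimal sketch of one way to guard against that, on made-up values; it is an illustration, not the exact cleaning used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one normal row, one 0/0 row and one x/0 row
toy = pd.DataFrame(
    {"bonus": [500.0, 0.0, 300.0], "salary": [250.0, 0.0, 0.0]},
    index=["PERSON A", "PERSON B", "PERSON C"],
)

ratio = toy["bonus"] / toy["salary"]   # 2.0, NaN and inf respectively
# Map +/- infinity to NaN first, then replace every NaN with zero
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(ratio)
```
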

#### Removing Outlier 2 : 'THE TRAVEL AGENCY IN THE PARK'

The _enron61702insiderpay.pdf_ provided by findlaw.com contains a record named 'THE TRAVEL AGENCY IN THE PARK'.
From the documentary [Enron The Smartest Guys In The Room](https://www.youtube.com/watch?v=H2f7FunDuTU) I learnt that Enron had made up transactions with bogus companies and people. This record has very low values in all features except _other_ and _total_payments_, so I consider it an outlier and remove it.

In [None]:
# Features of the index 'THE TRAVEL AGENCY IN THE PARK'
enron_df_new.loc['THE TRAVEL AGENCY IN THE PARK']

In [None]:
# Deleting the row with index 'THE TRAVEL AGENCY IN THE PARK'
enron_df_new.drop('THE TRAVEL AGENCY IN THE PARK', axis = 0, inplace = True)

#### Financial Features : deferred_income, deferral_payments and total_payments 

According to [BusinessDictionary.com](http://www.businessdictionary.com/definition/deferred-payment.html) : Deferred payment is "a loan arrangement in which the borrower is allowed to start making payments at some specified time in the future. Deferred payment arrangements are often used in retail settings where a person buys and receives an item with a commitment to begin making payments at a future date."

[Deferred income](https://en.wikipedia.org/wiki/Deferred_income) : (also known as deferred revenue, unearned revenue, or unearned income) is, in accrual accounting, money received for goods or services which have not yet been delivered. According to the revenue recognition principle, it is recorded as a liability until delivery is made, at which time it is converted into revenue.

Because the Enron scam involved many undisclosed assets, and deceived the public by selling assets to shell companies at the end of each month and buying them back at the start of the next month to hide accounting losses, there is a chance that a lot of the company's deferred revenue was used by the POIs.

In [None]:
enron_df_new['deferred_income'].describe()

The __deferred_income__ feature has mostly negative values as it is the money which has to be returned by the company.

In [None]:
# Finding out the integer positions (1-based) of POIs and non-POIs
poi_rs = []
non_poi_rs = []
for i in range(len(enron_df_new['poi'])):
    # .iloc makes the positional lookup explicit on a name-indexed Series
    if enron_df_new['poi'].iloc[i] == True:
        poi_rs.append(i+1)
    else:
        non_poi_rs.append(i+1)

print "length poi list : ",len(poi_rs)
print "length non-poi list : ",len(non_poi_rs)

Drawing scatterplot of Employees with deferred income

In [None]:
plt.scatter(non_poi_rs,
            enron_df_new['deferred_income'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(poi_rs,
            enron_df_new['deferred_income'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('Employees')
plt.ylabel('deferred_income')
plt.title("Scatterplot of Employees with deferred income")
plt.legend(loc='upper right')
plt.show()

The above scatterplot is not very helpful for either detecting outliers or finding patterns, as some POIs as well as non-POIs have high values of deferred income.   

Creating a scatterplot of total_payments vs deferral_payments w.r.t POI.

In [None]:
# scatterplot of total_payments vs deferral_payments w.r.t POI
plt.scatter(enron_df_new['total_payments'][enron_df_new['poi'] == False],
            enron_df_new['deferral_payments'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(enron_df_new['total_payments'][enron_df_new['poi'] == True],
            enron_df_new['deferral_payments'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('total_payments')
plt.ylabel('deferral_payments')
plt.title("Scatterplot of total_payments vs deferral_payments w.r.t POI")
plt.legend(loc='upper right')
plt.show() 

From the above scatterplot it can be observed that the majority of POIs have very low deferral_payments compared to non-POIs.  
There are also two outliers: the one with a high value of total_payments is a POI, and the other, with a high value of deferral_payments, is a non-POI. I am removing the non-POI outlier.

In [None]:
# Finding the index label of the employee having maximum 'deferral_payments'
enron_df_new['deferral_payments'].idxmax()

#### Removing Outlier 3 : 'FREVERT MARK A'

In [None]:
# Removing the non-POI employee having maximum 'deferral_payments'
enron_df_new.drop('FREVERT MARK A', axis = 0, inplace = True)

#### Financial Features : 'long_term_incentive'

Making a scatterplot to check the long_term_incentive of different Enron employees.

In [None]:
# Finding out the integer positions (1-based) of POIs and non-POIs
poi_rs = []
non_poi_rs = []
for i in range(len(enron_df_new['poi'])):
    # .iloc makes the positional lookup explicit on a name-indexed Series
    if enron_df_new['poi'].iloc[i] == True:
        poi_rs.append(i+1)
    else:
        non_poi_rs.append(i+1)

# Making a scatterplot
plt.scatter(non_poi_rs,
            enron_df_new['long_term_incentive'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(poi_rs,
            enron_df_new['long_term_incentive'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('Employees')
plt.ylabel('long_term_incentive')
plt.title("Scatterplot of Employee number with long_term_incentive")
plt.legend(loc='upper left')
plt.show()

From the figure, one employee has a very high value of __long_term_incentive__. I consider this point an outlier and remove it.  

In [None]:
enron_df_new['long_term_incentive'].idxmax()

#### Removing Outlier 4 : 'MARTIN AMANDA K'

In [None]:
enron_df_new.drop('MARTIN AMANDA K', axis = 0, inplace = True)

#### Financial Features : restricted_stock and restricted_stock_deferred

In [None]:
# Scatterplot of restricted_stock vs 'restricted_stock_deferred' w.r.t POI

plt.scatter(enron_df_new['restricted_stock'][enron_df_new['poi'] == False],
            enron_df_new['restricted_stock_deferred'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(enron_df_new['restricted_stock'][enron_df_new['poi'] == True],
            enron_df_new['restricted_stock_deferred'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('restricted_stock')
plt.ylabel('restricted_stock_deferred')
plt.title("Scatterplot of restricted_stock vs 'restricted_stock_deferred' w.r.t POI")
plt.legend(loc='upper right')
plt.show() 

In [None]:
enron_df_new['restricted_stock_deferred'].idxmax()

This gives an outlier in the feature __restricted_stock_deferred__. A quick look at the values of __restricted_stock_deferred__ shows that most are zero and the remaining few are negative. The outlier found here belongs to the Enron employee __BHATNAGAR SANJAY__, who is not a POI, and in this analysis I am removing this datapoint, hoping that it may aid classification.  

On the other axis of the graph, the largest values belong to a POI and a non-POI, so there is no need to remove them. 

#### Removing Outlier 5 : 'BHATNAGAR SANJAY'

In [None]:
enron_df_new.drop('BHATNAGAR SANJAY', axis = 0, inplace = True)

#### Email Features :  from_poi_to_this_person and from_this_person_to_poi

A scam of this size suggests that the POIs were in frequent contact with each other via email, so the number of emails exchanged between an employee and the POIs may hint at that person's involvement in the scam. The plot below therefore relates the mails sent from and to each person with respect to the POIs.

In [None]:
plt.scatter(enron_df_new['from_poi_to_this_person'][enron_df_new['poi'] == False],
            enron_df_new['from_this_person_to_poi'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(enron_df_new['from_poi_to_this_person'][enron_df_new['poi'] == True],
            enron_df_new['from_this_person_to_poi'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('from_poi_to_this_person')
plt.ylabel('from_this_person_to_poi')
plt.title("Scatterplot of count of from and to mails between poi and this_person w.r.t POI")
plt.legend(loc='upper right')
plt.show() 

This scatterplot shows the relationship between the counts of mails sent to and received from POIs for different Enron employees. I think features showing the proportion of an employee's mail exchanged with POIs will be more helpful in finding the POIs, as POIs are likely to communicate more with other POIs than with non-POIs.   
So I create two new features.  

#### Features created : fraction_mail_from_poi and fraction_mail_to_poi

In [None]:
enron_df_new['fraction_mail_from_poi'] = enron_df_new['from_poi_to_this_person']/enron_df_new['from_messages'] 
enron_df_new['fraction_mail_to_poi'] = enron_df_new['from_this_person_to_poi']/enron_df_new['to_messages']

In [None]:
# Scatterplot of fraction of mails from and to between poi and this_person w.r.t POI
plt.scatter(enron_df_new['fraction_mail_from_poi'][enron_df_new['poi'] == False],
            enron_df_new['fraction_mail_to_poi'][enron_df_new['poi'] == False],
            color = 'b', label = 'Not-POI')

plt.scatter(enron_df_new['fraction_mail_from_poi'][enron_df_new['poi'] == True],
            enron_df_new['fraction_mail_to_poi'][enron_df_new['poi'] == True],
            color = 'r', label = 'POI')

    
plt.xlabel('fraction_mail_from_poi')
plt.ylabel('fraction_mail_to_poi')
plt.title("Scatterplot of fraction of mails from and to between poi and this_person w.r.t POI")
plt.legend(loc='upper right')
plt.show() 

From the above figure, POIs and non-POIs can be told apart more clearly: the red dots representing POIs are more distinct, have higher values and are more separate from the blue non-POI points.

In [None]:
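# Clean all infinite values, which arise when the denominator of a ratio (e.g. from_messages) is 0.
# np.inf is used here because replacing the string 'inf' does not match floating-point infinities.
enron_df_new = enron_df_new.replace([np.inf, -np.inf], 0)
enron_df_new = enron_df_new.fillna(0)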
# Converting the above modified dataframe to a dictionary
enron_dict = enron_df_new.to_dict('index')
print "Features of modified data_dictionary :"
print "Total number of datapoints : ",len(enron_dict)
print "Total number of features : ",len(enron_dict['METTS MARK'])

In [None]:
# Store to my_dataset for easy export below.
my_dataset = enron_dict

#### Features chosen by me to be used in the POI identifier

Out of the 23 features available to me, I will be using 18 of them for the POI identification and feature processing:
  - __11 financial features :__ 
  ['salary', 'bonus', 'long_term_incentive', 'bonus-to-salary_ratio', 'deferral_payments', 'expenses','restricted_stock_deferred', 'restricted_stock', 'deferred_income', 'total_payments','other']
  - __6 Email features :__ ['fraction_mail_from_poi', 'fraction_mail_to_poi', 'from_poi_to_this_person', 'from_this_person_to_poi', 'to_messages', 'from_messages']
  - __POI__
  
The features chosen for further processing were mainly based on the graphical analysis done above.

In [None]:
## Selecting features which I think might be important
features_list = ['poi', 'salary', 'bonus', 'long_term_incentive', 'bonus-to-salary_ratio', 'deferral_payments', 'expenses',
                 'restricted_stock_deferred', 'restricted_stock', 'deferred_income','fraction_mail_from_poi', 'total_payments',
                 'other', 'fraction_mail_to_poi', 'from_poi_to_this_person', 'from_this_person_to_poi', 'to_messages', 
                 'from_messages']

In [None]:
# Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

### 2. Feature processing of the Enron dataset

Steps involved :  
__1. Feature scaling :__ Standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. It rescales the features so that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.  

In this project I will be using __MinMaxScaler__ instead, which transforms each feature individually by scaling and translating it so that it lies in a given range on the training set, by default between zero and one.
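
As a small illustration of what min-max scaling does, the sketch below compares `MinMaxScaler` with the formula `(x - min) / (max - min)` applied per column. The input values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features on very different scales
X = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [400.0, 5.0]])

X_scaled = MinMaxScaler().fit_transform(X)

# The same result computed by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```

After scaling, both columns span the same range, so a distance-based classifier such as KNN is not dominated by the feature with the largest raw values.
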


__2. Feature Selection :__ Feature selection/dimensionality reduction on sample sets can be used to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets. Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator.   

In this project I will be using __SelectKBest__ to find the K best, i.e. highest-scoring, features. It takes as input a scoring function that returns univariate scores and p-values. Here __f_classif__ is used as the scoring function; it computes the ANOVA F-value between labels and features for classification tasks.  
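
A minimal sketch of `SelectKBest` with `f_classif` on synthetic data, assuming one feature that shifts with the label and two that are pure noise (all values are generated here, not taken from the Enron data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(42)
y = np.array([0] * 20 + [1] * 20)

informative = y * 5.0 + rng.normal(size=40)  # mean differs strongly between classes
noise_a = rng.normal(size=40)                # unrelated to the label
noise_b = rng.normal(size=40)                # unrelated to the label
X = np.column_stack([noise_a, informative, noise_b])

skb = SelectKBest(f_classif, k=1).fit(X, y)
mask = skb.get_support()   # boolean mask of the selected columns
print(skb.scores_)         # ANOVA F-value of each column
print(mask)
```

The feature whose class means differ most relative to its within-class spread gets the largest F-value and is the one kept.
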


__3. Dimensionality Reduction using PCA :__ PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.
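
The idea can be sketched on synthetic data in which two of three columns are strongly correlated, so that most of the variance lies along one direction. The data below is generated for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,                                   # base signal
    2 * t + 0.1 * rng.normal(size=200),  # almost a multiple of the first column
    0.1 * rng.normal(size=200),          # small independent noise
])

pca = PCA(n_components=2).fit(X)
X_proj = pca.transform(X)                # project the data onto the two components

print(pca.explained_variance_ratio_)     # share of variance explained by each component
print(X_proj.shape)
```
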


__Pipeline__ is used to sequentially apply the processing steps: scaling, selection, dimensionality reduction and classification.

Tuning aims to find the parameter values that give the best performance. Sklearn's __GridSearchCV__ class automates this process by performing a grid search over a range of parameter values for an estimator.

__StratifiedShuffleSplit__ is used as the cross-validation method. 'GridSearchCV' will fit and validate over multiple test/train splits. 
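
A short sketch of what stratification means in practice, using made-up labels with 10% positives (the class ratio here is an assumption for illustration, not the ratio in the Enron data): every test split keeps the same share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 10 positives and 90 negatives
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

# Number of positives that land in each 30-sample test split
test_poi_counts = [int(y[test_idx].sum()) for _, test_idx in sss.split(X, y)]
print(test_poi_counts)
```

With a plain random split on such a small, imbalanced dataset, some test sets could contain very few positives or none at all, which makes precision and recall unstable.
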

In [None]:
### split data into training and testing datasets
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, 
                                                                            random_state=42)

# Stratified ShuffleSplit cross-validator
from sklearn.model_selection import StratifiedShuffleSplit
# Don't pass 'labels' to the StratifiedShuffleSplit constructor; they are supplied when split() is called
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3,random_state = 42)

# Importing modules for feature scaling and selection
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Defining features to be used via the pipeline
## 1. Feature scaling
scaler = MinMaxScaler()

## 2. Feature Selection
skb = SelectKBest(f_classif)

## 3. Dimensionality Reduction using PCA
pca = PCA()

### 3. Choosing Algorithm

For this project I used the following algorithms: 
1. Gaussian Naive Bayes (GaussianNB) classifier
2. Decision Tree classifier
3. SVM classifier
4. KNN classifier

#### Table showing the accuracy, precision, recall and f1-score of the algorithms used above

| Algorithm                | Accuracy | Precision | Recall | f1 score |
|--------------------------|:--------:|----------:|--------|----------|
| GaussianNB classifier    | 0.762    | 0.4       | 0.222  | 0.286    |
| Decision Tree classifier | 0.785    | 0.4       | 0.25   | 0.307    |
| SVM classifier           | 0.785    | 0.2       | 0.167  | 0.182    |
| KNN classifier           | 0.857    | 0.6       | 0.428  | 0.5      |


As the KNN algorithm has better precision and recall values than the other algorithms, I have chosen it for this project. The code below is the KNN classifier designed for this project. 


In [None]:
## KNN classifier code

from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier()

# Create pipeline
pipeline = Pipeline(steps = [("scaling", scaler), ("SKB", skb), ("PCA",pca), ("knn",clf_knn)])
# Define the parameters to be passed to GridSearchCV
param_grid = {"SKB__k":[7,8,9,10,11,12], 
              "PCA__n_components":[2,3,4,5,6,7],
              "PCA__whiten":[True],
              "knn__n_neighbors": [3,5,7,9,11]
              }

# Create GridSearchCV object
clf = GridSearchCV(pipeline, param_grid, verbose = 0, cv = sss, scoring = 'f1')

t0 = time()
clf = clf.fit(features_train, labels_train)
print "training time: ", round(time()-t0, 3), "s"

t0 = time()
prediction = clf.predict(features_test)
print "testing time: ", round(time()-t0, 3), "s"

# sklearn metric functions expect the true labels first and the predictions second
print "Accuracy of KNN classifier is  : ",accuracy_score(labels_test, prediction)
print "Precision of KNN classifier is : ",precision_score(labels_test, prediction)
print "Recall of KNN classifier is    : ",recall_score(labels_test, prediction)
print "f1-score of KNN classifier is  : ",f1_score(labels_test, prediction)
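
scikit-learn's metric functions take the true labels as the first argument and the predictions as the second. Accuracy and f1 are unaffected by the order, but precision and recall trade places if the two are swapped. A small check on made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 3 actual positives, 2 predicted positives, 1 of them correct
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 1 / 3
swapped = precision_score(y_pred, y_true)     # arguments reversed

print(precision)
print(recall)
print(swapped)   # equals the recall, not the precision
```
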

#### Finding out the features selected by SelectKBest

In [None]:
# Obtaining the boolean list showing selected features
features_selected_bool = clf.best_estimator_.named_steps['SKB'].get_support()
# Finding the features selected by SelectKBest
features_selected_list = [x for x,y in zip(features_list[1:], features_selected_bool) if y]

print "Total number of features selected by SelectKBest algorithm : ",len(features_selected_list)

# Finding the score of features 
feature_scores =  clf.best_estimator_.named_steps['SKB'].scores_
# Finding the score of features selected by selectKBest
feature_selected_scores = feature_scores[features_selected_bool]

# Creating a pandas dataframe, arranging the features by their scores and ranking them
imp_features_df = pd.DataFrame({'Features_Selected':features_selected_list, 'Features_score':feature_selected_scores})
imp_features_df.sort_values('Features_score', ascending = False,inplace = True)
# Rank runs from 1 to the number of selected features (k is tuned by the grid search, so it is not hard-coded)
Rank = pd.Series(list(range(1, len(imp_features_df) + 1)))
imp_features_df.set_index(Rank, inplace = True)
imp_features_df

#### Tuning the Algorithm 
Tuning a machine learning algorithm means choosing the parameter values that give the best performance. If an algorithm is not tuned well, the decision boundary learned from the training data may not generalize to the test data, because the model can suffer from high bias (underfitting) or high variance (overfitting).

__Grid search__ is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.  
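
A minimal sketch of a grid search over a pipeline, using synthetic data from `make_classification` and a much smaller grid than the one in this notebook. The step names `scaling` and `knn` mirror the ones used above; everything else is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-class data standing in for the real features
X, y = make_classification(n_samples=120, n_features=6, n_informative=3,
                           random_state=42)

pipe = Pipeline([("scaling", MinMaxScaler()), ("knn", KNeighborsClassifier())])

# Parameters are addressed as <step name>__<parameter name>
grid = {"knn__n_neighbors": [3, 5, 7]}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

search = GridSearchCV(pipe, grid, cv=cv, scoring="f1")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # one entry per parameter combination
print(n_candidates)
print(search.best_params_)
```

Every combination in the grid is fitted once per split, so the cost grows multiplicatively: the grid defined above in this notebook has 6 x 6 x 1 x 5 = 180 combinations, each evaluated on 10 splits.
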

After applying GridSearchCV along with the other feature processing techniques, I found that the KNN algorithm gives the best precision, recall and f1 scores.  

I also computed the accuracy, precision, recall and f1 score manually for different values of the parameter __n_neighbors__:

| n_neighbors | Accuracy | Precision | Recall | f1 score |
|-------------|:--------:|----------:|--------|----------|
| 3           | 0.857    | 0.6       | 0.428  | 0.5      |
| 4           | 0.857    | 0.2       | 0.333  | 0.25     |
| 6           | 0.881    | 0.2       | 0.5    | 0.286    |
| 9           | 0.881    | 0.2       | 0.5    | 0.286    |
| 11          | 0.881    | 0.2       | 0.5    | 0.286    |

So for __n_neighbors = 3__, best performance of the algorithm is achieved. 

The best estimator is the one chosen by the search, i.e. the estimator that gave the highest score (or smallest loss, if specified) on the left-out data.

In [None]:
# Estimator chosen by the grid search
# best algorithm-tune combination selected for final analysis
clf.best_estimator_

In [None]:
# Parameter setting that gave the best results on the hold out data.
clf.best_params_

#### Validation

__Model validation__ is defined as the process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of using the testing data set is to test the generalization ability of a trained model. Model validation is carried out after model training.  

Training the prediction model and testing it on the same dataset used for training is a classic mistake. A model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set.  

In this project, [StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html) is used for validation because the dataset is very small. This method performs random-permutation cross-validation, and the stratified variant preserves the percentage of samples of each class in every split.


#### Evaluation Metric

In this project, I am trying to optimize both precision and recall, so I am using the f1 score as the key measure of the algorithm's performance. The f1 score considers both the precision and the recall to compute the score.

- The equation for precision is :  

$ \text{precision} = \frac{TP}{TP + FP} $

- The equation for recall is :

$ \text{recall} = \frac{TP}{TP + FN} $ 

- The equation for f1 is :  

$ f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $  


| Algorithm used | Precision | Recall | f1 score |
|:--------------:|----------:|--------|----------|
| KNN            | 0.6       | 0.428  | 0.5      |
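
The formulas can be checked with a short worked example. The counts below (3 true positives, 2 false positives, 4 false negatives) are an assumption: they are one set of counts consistent with the scores in the table, not values taken from the notebook's output.

```python
# Hypothetical confusion counts consistent with the reported scores
tp, fp, fn = 3, 2, 4

precision = tp / float(tp + fp)   # 3 / 5
recall = tp / float(tp + fn)      # 3 / 7
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3))
print(round(recall, 3))
print(round(f1, 3))
```

Because f1 is the harmonic mean of precision and recall, it sits closer to the lower of the two values than a plain average would.
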

>  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

- Precision can be interpreted as: if a person is classified as a POI by my classifier, there is a 60% chance that the person is actually a POI (i.e., a 60% chance that a positive prediction is a true positive).  

- Recall can be interpreted as: of all the actual POIs, 42.8% are correctly classified as POIs. 


## References :
1. [Documentation of scikit-learn 0.19.1](http://scikit-learn.org/stable/documentation.html)
2. Udacity Forum
3. [Notes on Machine Learning & Artificial Intelligence](https://chrisalbon.com/)
4. [Model Validation, Machine Learning](https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-9863-7_233)
5. [f1-score](https://en.wikipedia.org/wiki/F1_score)