## **Person of Interest Identifier for the Enron Email Dataset**
<p>In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, normally confidential information, including tens of thousands of emails and detailed financial data for top executives, entered the public record.

The objective of this project is to build an algorithm that identifies Enron employees who may have committed fraud, based on the public Enron financial and email dataset. Such employees are referred to as "persons of interest", or POIs.</p>

### **Importing Libraries**
We start by importing the required libraries.

In [None]:
import os
import pickle
import sys

import numpy as np  # linear algebra
import pandas as pd  # data processing
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics, preprocessing
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import (GridSearchCV, cross_val_predict, cross_val_score,
                                     train_test_split, validation_curve)
from sklearn.neighbors import KNeighborsClassifier

# feature_format is imported from the dataset folder, so add that folder to the import path
sys.path.append("/kaggle/input/enron-dataset/")
from feature_format import featureFormat, targetFeatureSplit

# List every file available under the read-only input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### **Importing the dataset**

Next we load the dataset. The data is given as a dictionary containing combined Enron email and financial data. Each dictionary key is a person's name, and the value is another dictionary that maps feature names to that person's values. The features fall into two major types: financial features and email features. Each person is also labelled as a POI or not (boolean).

We load the pickle file and work with it as a pandas dataframe.

In [None]:
with open("/kaggle/input/enron-dataset/final_project_dataset_unix.pkl", "rb") as file:
    dataset = pickle.load(file)

df = pd.DataFrame.from_dict(dataset,orient='index')
df = df.replace('NaN',np.nan)
df.head()
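
The nested-dictionary layout described above can be illustrated with a toy example. The names and values below are made up for illustration and are not taken from the Enron data; the sketch only shows how `orient='index'` turns each person into a row and how the `'NaN'` placeholder strings become real missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical two-person dataset in the same shape as the Enron dictionary
toy_dataset = {
    "DOE JOHN": {"salary": 250000, "bonus": "NaN", "poi": False},
    "ROE JANE": {"salary": "NaN", "bonus": 1000000, "poi": True},
}

# One row per person, one column per feature
toy_df = pd.DataFrame.from_dict(toy_dataset, orient="index")

# Turn the 'NaN' placeholder strings into real missing values
toy_df = toy_df.replace("NaN", np.nan)

print(toy_df)
print(toy_df.isna().sum())  # number of missing values per column
```

After the replacement, pandas functions such as `isna()` and `count()` treat the placeholders as missing rather than as ordinary strings.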

### **Outlier Detection**

To find outliers, we first plot employee salaries against bonuses to see whether any employee received a salary or bonus that is orders of magnitude larger than the others.

In [None]:
sns.set(rc={"font.style":"normal",
            'axes.labelsize':16,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10,
            'figure.figsize': (10, 7)}
       )
df.plot(kind = 'scatter', x = 'salary', y = 'bonus')

We can see an outlier right away. On closer inspection, it comes from a "TOTAL" entry in the dictionary that sums each field over all employees in the dataset.

The "TOTAL" entry is removed, and the same fields are re-plotted below.

In [None]:
dataset.pop('TOTAL')

df = pd.DataFrame.from_dict(dataset,orient='index')
df = df.replace('NaN',np.nan)
df.head()
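
An aggregate row like this can also be located without reading the plot, by asking pandas which index label holds the largest value. The sketch below uses a small made-up dataframe (the names and amounts are hypothetical) rather than the Enron data.

```python
import pandas as pd

# Hypothetical data: two employees plus a spreadsheet-style TOTAL row
toy_df = pd.DataFrame(
    {
        "salary": [200000.0, 300000.0, 500000.0],
        "bonus": [100000.0, 400000.0, 500000.0],
    },
    index=["DOE JOHN", "ROE JANE", "TOTAL"],
)

# idxmax returns the index label of the row with the largest salary
outlier_name = toy_df["salary"].idxmax()
print(outlier_name)

# Drop that row by label to get the cleaned frame
cleaned = toy_df.drop(index=outlier_name)
print(cleaned)
```

Looking up the label first makes it easy to check whether the extreme point is a real person or a bookkeeping artifact before deciding to remove it.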

In [None]:
sns.set(rc={"font.style":"normal",
            'axes.labelsize':16,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10,
            'figure.figsize': (10, 7)}
       )
df.plot(kind = 'scatter', x = 'salary', y = 'bonus')

Next, we plot salary against total payments and still find some outliers.

In [None]:
sns.set(rc={"font.style":"normal",
            'axes.labelsize':16,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10,
            'figure.figsize': (10, 7)}
       )
df.plot(kind = 'scatter', x = 'salary', y = 'total_payments')

To investigate further, we list the employees with a very large salary or bonus and check whether each of them is a POI.

In [None]:
bonus = df[(df.salary > 1000000) | (df.bonus > 5000000)][['bonus','salary','poi','total_payments']]
bonus.sort_values(by = 'total_payments', ascending= False, inplace=True)
bonus

We find that most of these outliers are POIs. Since the dataset contains only 18 POIs, we keep these records rather than removing them. <br>
Kenneth Lay and Jeffrey Skilling both appear in this list, with bonuses far larger than their salaries, a pattern worth noting given their involvement in the fraud.

### **Exploratory Data Analysis**

In [None]:
df.shape, df.columns

The features in the data fall into three major types: financial features, email features and the POI label. They are listed below:

1. **Financial features: (all units are in US dollars)**
   * salary
   * deferral_payments
   * total_payments
   * loan_advances
   * bonus
   * restricted_stock_deferred
   * deferred_income
   * total_stock_value
   * expenses
   * exercised_stock_options
   * other
   * long_term_incentive
   * restricted_stock
   * director_fees
   
2. **Email features: (units are generally number of email messages; the exception is 'email_address', which is a string)**
   * to_messages
   * email_address
   * from_poi_to_this_person
   * from_messages
   * from_this_person_to_poi
   * shared_receipt_with_poi
   
3. **POI label: (boolean value representing whether a person is a POI)**
   * poi

In [None]:
df.dtypes

Every column has the expected data type: email_address is stored as object and the numeric features as float. In the next cell the True/False labels in poi are converted to integers, so that column becomes int64.

In [None]:
df['poi'] = df['poi'].astype(int)  # True/False -> 1/0

poi = df.poi.value_counts()
print(poi)

sns.set(rc={"font.style":"normal",
            'axes.labelsize':16,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10,
            'figure.figsize': (6, 6)}
       )
poi.plot(kind = 'pie')
plt.legend()
plt.xlabel('Person of Interest')
plt.ylabel('Count')

plt.title ('The dataset contains a total of %s POIs.' % sum(df['poi']))
plt.show()

In [None]:
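all_counts = df.count()  # non-missing values per column, all employees
pois = df[df['poi'] == 1].count()  # non-missing values per column, POIs only

result = pd.concat([all_counts, pois], axis=1)
result.columns = ['All Records','POIs']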

sns.set(rc={"font.style":"normal",
            'axes.labelsize':16,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10,
            'figure.figsize': (18, 12)}
       )

fig = result.plot(kind = 'bar',
       figsize = (20, 14),
       color = ['#5cb85c', '#5bc0de'])
fig.set_title('Distribution of Total Records present and number of POI present', fontsize=20)
fig.set_xlabel('Feature', fontsize = '16')
fig.set_ylabel('Count', fontsize = 16)
fig.legend(fontsize = 14)

for p in fig.patches:
    fig.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', 
                va='center', 
                xytext=(0, 10), 
                textcoords='offset points',
                fontsize = 14
               )

In this plot, the 'All Records' bars show the number of non-missing values in each column, and the 'POIs' bars show how many of those values belong to POIs. Every column except poi has missing values, so no single feature is available for all employees. This is expected, as such confidential information is difficult to collect for everyone.
<br>
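
The same information can be summarised numerically as the share of missing values per column. This is a minimal sketch on a made-up dataframe (the values are hypothetical); on the real data the call would be applied to `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a sparse and a less sparse feature
toy_df = pd.DataFrame(
    {
        "salary": [250000.0, np.nan, 300000.0, np.nan],
        "loan_advances": [np.nan, np.nan, np.nan, 1000.0],
        "poi": [0, 1, 0, 0],
    }
)

# isna() marks missing cells; the column mean of those flags is the missing fraction
missing_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Sorting by missing fraction gives a quick shortlist of features that may be too sparse to plot or model.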

### **Pairwise Plots**

Pairwise plots are an easy way to see how features depend on each other. We first plot the financial features, leaving out those with too few entries, and then the email features.

In [None]:
sns.set(rc={"font.style":"normal",
            'axes.labelsize':15,
            'xtick.labelsize':10,
            'font.size':10,
            'ytick.labelsize':10}
       )
financial = ['poi', 'salary', 'deferral_payments', 'total_payments', 'bonus', 'total_stock_value', 'expenses', 'exercised_stock_options','long_term_incentive', 'restricted_stock'] 
sns.pairplot(df[financial], hue = 'poi', palette = 'husl')
plt.show()

In [None]:
sns.set(rc={"font.style":"normal",
            'axes.labelsize':8,
            'xtick.labelsize':8,
            'font.size':8,
            'ytick.labelsize':8}
       )
email = ['to_messages','from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'poi']
sns.pairplot(df[email], hue = 'poi', palette = 'husl')
plt.show()

### **Feature Selection and Engineering**

We leave out the email_address feature, since it is a string and carries little predictive information. We then use the functions in the feature_format module to split the dataset into target and feature variables, and split the result into training and test sets, with the test set holding 30% of the data. Two candidate engineered features are bonus_to_salary, the ratio of bonus to salary, and bonus_to_total, the ratio of bonus to total payments; the cell below uses only the original features.

In [None]:
features_list = ["poi", "salary", "bonus", 'deferral_payments', 'total_payments', 'loan_advances', 
                 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses',
                 'exercised_stock_options', 'long_term_incentive', 'other', 
                 'restricted_stock', 'director_fees', 'to_messages','from_poi_to_this_person', 
                 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']

data = featureFormat(dataset, features_list)
y, X = targetFeatureSplit(data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(y_train), len(y_test))
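
The ratio features mentioned above, bonus_to_salary and bonus_to_total, could be computed on the dictionary before it is converted to arrays. The following is a minimal sketch, not the notebook's own implementation: `safe_ratio` is a hypothetical helper, the toy records are made up, and it assumes the `'NaN'` string is kept as the missing-value marker for downstream code.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 'NaN' when a value is missing or the denominator is zero."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return "NaN"
    return numerator / denominator


# Hypothetical records in the same shape as the Enron dictionary
toy_dataset = {
    "DOE JOHN": {"salary": 200000, "bonus": 400000, "total_payments": 1000000},
    "ROE JANE": {"salary": "NaN", "bonus": 100000, "total_payments": 500000},
}

# Add both ratios to every person's feature dictionary
for person in toy_dataset.values():
    person["bonus_to_salary"] = safe_ratio(person["bonus"], person["salary"])
    person["bonus_to_total"] = safe_ratio(person["bonus"], person["total_payments"])

for name, person in toy_dataset.items():
    print(name, person["bonus_to_salary"], person["bonus_to_total"])
```

To use such features in the models, their names would also have to be appended to `features_list` before the features are extracted.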

### **Logistic Regression**

In [None]:
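# Fit the scaler on the training set only and reuse it on the test set,
# so both sets are standardised with the same mean and variance
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
Xtest_scaled = scaler.transform(X_test)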
lr = LogisticRegression(max_iter = 5000)
grid = {'C':[0.01, 0.03, 0.1, 0.3, 1, 3, 10]}
grid_lr = GridSearchCV(lr,param_grid=grid,scoring='accuracy',cv=5)
grid_lr.fit(X_scaled,y_train)

In [None]:
print(grid_lr.best_params_)
pred = grid_lr.predict(Xtest_scaled)
print('Test Accuracy = ',grid_lr.score(Xtest_scaled,y_test))
print(metrics.classification_report(y_test,pred, zero_division=0))
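
Scaling can also be folded into the grid search itself, so that the scaler is re-fit on the training part of every cross-validation fold. The sketch below assumes scikit-learn's `Pipeline` and runs on synthetic data; `X_toy`, `y_toy` and the step names `scale` and `lr` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a label driven by the first two
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(120, 4)) * [1.0, 10.0, 100.0, 1000.0]
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10.0 > 0).astype(int)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part
pipe = Pipeline([("scale", StandardScaler()), ("lr", LogisticRegression(max_iter=5000))])
grid = GridSearchCV(pipe, param_grid={"lr__C": [0.01, 0.1, 1, 10]}, scoring="accuracy", cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print("Test accuracy =", grid.score(X_te, y_te))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid key is `lr__C`.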

### **Random Forest Classifier**

In [None]:
rf = RandomForestClassifier(n_estimators=200)
grid = {'n_estimators':[1, 10, 50],'max_depth':[25,30,35,40,45,50]}
grid_rf = GridSearchCV(rf,param_grid=grid,scoring='accuracy',cv=5)
grid_rf.fit(X_train,y_train)

In [None]:
print(grid_rf.best_params_)
pred = grid_rf.predict(X_test)
print('Accuracy = ',grid_rf.score(X_test,y_test))
print(metrics.classification_report(y_test,pred, zero_division = 0))


### **K-Nearest Neighbors Classifier**

In [None]:
km = KNeighborsClassifier()
grid = {'n_neighbors':[4,5,6,7,8,9,10,11]}
grid_km = GridSearchCV(km,param_grid=grid,scoring='accuracy',cv=5)
grid_km.fit(X_train,y_train)

In [None]:
print(grid_km.best_params_)
pred = grid_km.predict(X_test)
print('Accuracy = ',grid_km.score(X_test,y_test))
print(metrics.classification_report(y_test,pred, zero_division = 0))

### **AdaBoost Classifier**

In [None]:
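from sklearn.tree import DecisionTreeClassifier
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),n_estimators=200)
# Note: the "base_estimator__" prefix matches older scikit-learn releases;
# from version 1.2 the parameter is named "estimator", so the prefix becomes "estimator__"
param_grid = {"base_estimator__criterion" : ["gini", "entropy"],"base_estimator__splitter":["best", "random"], "n_estimators": [1, 2]}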
grid_ada = GridSearchCV(ada, param_grid=param_grid, scoring = 'accuracy', cv=5)
grid_ada.fit(X_train, y_train)

In [None]:
print(grid_ada.best_estimator_)
pred = grid_ada.predict(X_test)
print('Accuracy = ',grid_ada.score(X_test,y_test))
print(metrics.classification_report(y_test,pred, zero_division = 0))

### **Classification Report**

<table>
    <tr>
        <th>Classifier</th>
        <th>Precision (weighted avg)</th>
        <th>Recall (weighted avg)</th>
        <th>Test Accuracy</th>
    </tr>
    <tr>
        <td>Logistic Regression</td>
        <td>0.83</td>
        <td>0.91</td>
        <td>0.91</td>
    </tr>
    <tr>
        <td>Random Forest Classifier</td>
        <td>0.83</td>
        <td>0.91</td>
        <td>0.91</td>
    </tr>
    <tr>
        <td>AdaBoost Classifier</td>
        <td>0.96</td>
        <td>0.95</td>
        <td>0.95</td>
    </tr>
    <tr>
        <td>K-Nearest Neighbors Classifier</td>
        <td>0.96</td>
        <td>0.95</td>
        <td>0.95</td>
    </tr>
</table>
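
Because POIs are a small minority, accuracy and weighted averages are dominated by the non-POI class. The sketch below uses made-up labels (40 non-POIs and 4 POIs, a hypothetical test set) to show how a classifier that never predicts a POI still scores well on those measures while finding no POIs at all.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set: 40 non-POIs (0) and 4 POIs (1)
y_true = [0] * 40 + [1] * 4

# A trivial classifier that labels everyone as non-POI
y_all_negative = [0] * 44

accuracy = accuracy_score(y_true, y_all_negative)
weighted_precision = precision_score(
    y_true, y_all_negative, average="weighted", zero_division=0
)
poi_recall = recall_score(y_true, y_all_negative, pos_label=1, zero_division=0)

print("Accuracy           =", round(accuracy, 2))
print("Weighted precision =", round(weighted_precision, 2))
print("POI recall         =", poi_recall)
```

For this reason the weighted figures in the table are worth reading alongside the per-class POI rows of each classification report.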

In [None]:
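with open("my_classifier.pkl", "wb") as f:
    pickle.dump(grid_ada.best_estimator_, f)  # fitted, tuned model rather than the unfitted ada
with open("my_dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)
with open("my_feature_list.pkl", "wb") as f:
    pickle.dump(features_list, f)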
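
A saved file can be checked by loading it back and comparing it with the original object. This is a small round-trip sketch that writes to a temporary directory; the file name `toy_feature_list.pkl` and the short feature list are placeholders, not the notebook's outputs.

```python
import os
import pickle
import tempfile

features = ["poi", "salary", "bonus"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_feature_list.pkl")

    # Write the object, closing the file as soon as the block ends
    with open(path, "wb") as f:
        pickle.dump(features, f)

    # Read it back from disk
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == features)
```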