# Intro to Machine Learning, Final Project

In this project, we work with the Enron dataset. The goal is to identify the persons of interest (POIs) in the Enron scandal. The dataset provides financial features (salary, stock options, etc.) and email features that summarize the exchanges between collaborators. We first import the data and clean it. The cleaned data is saved as pickle files, so the classification part can be started directly from [here](#section3).

1. [Importation of the data](#section1) 

2. [Data Cleaning and Outliers](#section2)

3. [Start Classification from cleaned data](#section3)

   3.1. [Feature engineering](#section3.1)

## Importation of the data
<a id='section1'></a>

In [106]:
import sys
import pickle

import numpy as np
import pandas as pd

# Open in binary mode ("rb"): pickle.load needs a bytes stream under Python 3.
with open("../ud120-projects/final_project/final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

# One row per person, one column per feature.
data = pd.DataFrame.from_dict(data_dict, orient='index')
# Missing values are stored as the string 'NaN'; convert them to np.nan.
data.replace('NaN', np.nan, inplace=True)
print(data.head(2))
print('There are {} persons in this dataset, with {} features'.format(data.shape[0], data.shape[1]))
print('The features are: {}'.format(data.columns.values))

                   salary  to_messages  deferral_payments  total_payments  \
ALLEN PHILLIP K  201955.0       2902.0          2869717.0       4484442.0   
BADUM JAMES P         NaN          NaN           178980.0        182466.0   

                 exercised_stock_options      bonus  restricted_stock  \
ALLEN PHILLIP K                1729541.0  4175000.0          126027.0   
BADUM JAMES P                   257817.0        NaN               NaN   

                 shared_receipt_with_poi  restricted_stock_deferred  \
ALLEN PHILLIP K                   1407.0                  -126027.0   
BADUM JAMES P                        NaN                        NaN   

                 total_stock_value           ...            loan_advances  \
ALLEN PHILLIP K          1729541.0           ...                      NaN   
BADUM JAMES P             257817.0           ...                      NaN   

                 from_messages  other  from_this_person_to_poi    poi  \
ALLEN PHILLIP K         2195.

## Data cleaning and outliers
<a id='section2'></a>

We start with some data cleaning, looking for outliers and absurd entries. The first task is to count the number of missing (NaN) values for each feature.

In [107]:
nb_NaN=data.isnull().sum(axis=0)
print(nb_NaN)

salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
dtype: int64
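
Raw counts are easier to judge relative to the number of rows. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `missing_fraction` name are illustrative and not part of the project) that expresses the same count as a fraction of rows:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Enron frame (values are made up).
toy = pd.DataFrame(
    {
        "salary": [100.0, np.nan, 300.0, np.nan],
        "bonus": [np.nan, np.nan, np.nan, 50.0],
        "poi": [False, True, False, False],
    },
    index=["A", "B", "C", "D"],
)

# Number of missing values per column, as in the cell above.
missing_count = toy.isnull().sum(axis=0)

# Fraction of missing values per column: the mean of the boolean mask.
missing_fraction = toy.isnull().mean(axis=0)

print(missing_count)
print(missing_fraction)
```

A fraction makes a threshold easier to carry over if rows are removed later, since the absolute count depends on the current number of rows.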


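Some features have many missing values. We drop every feature with more than 80 missing values ('deferral_payments', 'restricted_stock_deferred', 'loan_advances', 'director_fees' and 'deferred_income'), as well as 'email_address', which is an identifier rather than a predictive feature.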

In [108]:
irrelevant_features=nb_NaN.index[nb_NaN>80].values
irrelevant_features=np.append(irrelevant_features,['email_address'])
data.drop(irrelevant_features,axis=1, inplace=True)
print(data.head(2))

                   salary  to_messages  total_payments  \
ALLEN PHILLIP K  201955.0       2902.0       4484442.0   
BADUM JAMES P         NaN          NaN        182466.0   

                 exercised_stock_options      bonus  restricted_stock  \
ALLEN PHILLIP K                1729541.0  4175000.0          126027.0   
BADUM JAMES P                   257817.0        NaN               NaN   

                 shared_receipt_with_poi  total_stock_value  expenses  \
ALLEN PHILLIP K                   1407.0          1729541.0   13868.0   
BADUM JAMES P                        NaN           257817.0    3486.0   

                 from_messages  other  from_this_person_to_poi    poi  \
ALLEN PHILLIP K         2195.0  152.0                     65.0  False   
BADUM JAMES P              NaN    NaN                      NaN  False   

                 long_term_incentive  from_poi_to_this_person  
ALLEN PHILLIP K             304805.0                     47.0  
BADUM JAMES P                    NaN 

The new feature list is given by:

In [109]:
print(data.columns.values)

['salary' 'to_messages' 'total_payments' 'exercised_stock_options' 'bonus'
 'restricted_stock' 'shared_receipt_with_poi' 'total_stock_value'
 'expenses' 'from_messages' 'other' 'from_this_person_to_poi' 'poi'
 'long_term_incentive' 'from_poi_to_this_person']
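
The same column filter can also be written with `DataFrame.dropna`, whose `thresh` argument is the minimum number of non-missing values a column must have to be kept. The sketch below uses a made-up toy frame and a hypothetical `max_missing` of 2; on the real data, the corresponding call should be `data.dropna(axis=1, thresh=len(data) - 80)`.

```python
import numpy as np
import pandas as pd

# Toy data (made up); the comments give the number of missing values per column.
toy = pd.DataFrame(
    {
        "salary": [100.0, np.nan, 300.0, np.nan],  # 2 missing
        "bonus": [np.nan, np.nan, np.nan, 50.0],   # 3 missing
        "poi": [False, True, False, False],        # 0 missing
    },
    index=["A", "B", "C", "D"],
)

max_missing = 2  # hypothetical threshold: drop columns with MORE than 2 missing values

# Variant 1: boolean mask on the missing-value counts (the approach used above).
too_sparse = toy.columns[toy.isnull().sum(axis=0) > max_missing]
by_mask = toy.drop(too_sparse, axis=1)

# Variant 2: dropna keeps columns with at least `thresh` non-missing values.
by_thresh = toy.dropna(axis=1, thresh=len(toy) - max_missing)

same_columns = list(by_mask.columns) == list(by_thresh.columns)
print(same_columns)
```

Both variants return a new frame instead of modifying `toy` in place, which makes it easier to compare the result with the original.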


Next, we look for outliers in the data. To do this, we select the financial features and print the 5 largest values for each.

In [110]:
financial_features = [
    'salary', 'total_payments', 'exercised_stock_options', 'bonus',
    'restricted_stock', 'total_stock_value', 'expenses', 'other',
    'long_term_incentive',
]
for feature in financial_features:
    print('Largest {}:'.format(feature))
    print(data.nlargest(5,feature)[feature])
    print('')

Largest salary:
TOTAL                 26704229.0
SKILLING JEFFREY K     1111258.0
LAY KENNETH L          1072321.0
FREVERT MARK A         1060932.0
PICKERING MARK R        655037.0
Name: salary, dtype: float64

Largest total_payments:
TOTAL               309886585.0
LAY KENNETH L       103559793.0
FREVERT MARK A       17252530.0
BHATNAGAR SANJAY     15456290.0
LAVORATO JOHN J      10425757.0
Name: total_payments, dtype: float64

Largest exercised_stock_options:
TOTAL                 311764000.0
LAY KENNETH L          34348384.0
HIRKO JOSEPH           30766064.0
RICE KENNETH D         19794175.0
SKILLING JEFFREY K     19250000.0
Name: exercised_stock_options, dtype: float64

Largest bonus:
TOTAL                 97343619.0
LAVORATO JOHN J        8000000.0
LAY KENNETH L          7000000.0
SKILLING JEFFREY K     5600000.0
BELDEN TIMOTHY N       5249999.0
Name: bonus, dtype: float64

Largest restricted_stock:
TOTAL                 130322299.0
LAY KENNETH L          14761694.0
WHITE JR THOMA

The 'TOTAL' row, which tops the lists above, aggregates the other rows rather than describing a person, so we drop it.

In [111]:
data.drop('TOTAL',axis=0,inplace=True)
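
One way to confirm that a suspicious row is an aggregate is to compare it with the column sums of the remaining rows. The sketch below does this on a made-up toy table; whether the real 'TOTAL' row matches the sums exactly was not checked here, and the names `others_sum` and `is_aggregate` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy payroll table (made-up values) with an aggregate row appended,
# the way a spreadsheet export might contain one.
toy = pd.DataFrame(
    {"salary": [100.0, 200.0, np.nan], "bonus": [10.0, np.nan, 30.0]},
    index=["A", "B", "C"],
)
toy.loc["TOTAL"] = toy.sum(axis=0)  # sum() skips NaN values by default

# Column sums over every row except the suspected aggregate.
others_sum = toy.drop("TOTAL", axis=0).sum(axis=0)

# The row is an aggregate if it matches those sums (up to rounding).
is_aggregate = np.allclose(toy.loc["TOTAL"], others_sum)
print(is_aggregate)

# Remove the aggregate row without modifying the original frame.
toy_clean = toy.drop("TOTAL", axis=0)
```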

We also check for rows (persons) that have almost no data:

In [112]:
nb_NaN_index=data.isnull().sum(axis=1)
print(nb_NaN_index.nlargest(40))

LOCKHART EUGENE E                14
CHAN RONNIE                      13
GRAMM WENDY L                    13
SAVAGE FRANK                     13
BLAKE JR. NORMAN P               12
CLINE KENNETH W                  12
MENDELSOHN JOHN                  12
MEYER JEROME J                   12
PEREIRA PAULO V. FERRAZ          12
SCRIMSHAW MATTHEW                12
THE TRAVEL AGENCY IN THE PARK    12
URQUHART JOHN A                  12
WAKEHAM JOHN                     12
WHALEY DAVID A                   12
WINOKUR JR. HERBERT S            12
WODRASKA JOHN                    12
WROBEL BRUCE                     12
BELFER ROBERT                    11
CHRISTODOULOU DIOMEDES           11
DUNCAN JOHN H                    11
FUGH JOHN L                      11
GATHMANN WILLIAM D               11
GILLIS JOHN                      11
LEMAISTRE CHARLES                11
LOWRY CHARLES P                  11
NOLES JAMES L                    11
BADUM JAMES P                    10
GRAY RODNEY                 

We drop every row with 12 or more missing values out of 15 columns (the 'poi' label is never missing, so 14 is the maximum). The list also reveals an entry, 'THE TRAVEL AGENCY IN THE PARK', that is not a person; with 12 missing values, it is removed by the same threshold.

In [113]:
irrelevant_indices=nb_NaN_index.index[nb_NaN_index>=12].values
data.drop(irrelevant_indices,axis=0, inplace=True)
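
The row filter follows the same pattern as the column filter, with the count taken along `axis=1`. A minimal sketch on a made-up toy frame, with a hypothetical threshold of 3 missing values, illustrates which rows are selected:

```python
import numpy as np
import pandas as pd

# Toy data (made up). The label column 'poi' is never missing,
# so a row can have at most 3 missing values here.
toy = pd.DataFrame(
    {
        "salary": [100.0, np.nan, np.nan],
        "bonus": [10.0, np.nan, 30.0],
        "expenses": [1.0, np.nan, np.nan],
        "poi": [False, False, True],
    },
    index=["A", "B", "C"],
)

# Number of missing values per row.
missing_per_row = toy.isnull().sum(axis=1)

max_missing = 3  # hypothetical threshold for this toy frame
sparse_rows = missing_per_row.index[missing_per_row >= max_missing]

# Drop the sparse rows; the original frame is left untouched.
toy_dense = toy.drop(sparse_rows, axis=0)
print(toy_dense)
```

Row 'B' carries no information besides its label and is removed, while row 'C' keeps one financial value and stays.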

We can now work with a cleaner dataset. We separate the 'poi' target from the other features.

In [115]:
labels=data['poi']
features=data.drop('poi', axis=1)
print(labels.head())
print(features.head())
print(features.shape)

ALLEN PHILLIP K       False
BADUM JAMES P         False
BANNANTINE JAMES M    False
BAXTER JOHN C         False
BAY FRANKLIN R        False
Name: poi, dtype: bool
                      salary  to_messages  total_payments  \
ALLEN PHILLIP K     201955.0       2902.0       4484442.0   
BADUM JAMES P            NaN          NaN        182466.0   
BANNANTINE JAMES M     477.0        566.0        916197.0   
BAXTER JOHN C       267102.0          NaN       5634343.0   
BAY FRANKLIN R      239671.0          NaN        827696.0   

                    exercised_stock_options      bonus  restricted_stock  \
ALLEN PHILLIP K                   1729541.0  4175000.0          126027.0   
BADUM JAMES P                      257817.0        NaN               NaN   
BANNANTINE JAMES M                4046157.0        NaN         1757552.0   
BAXTER JOHN C                     6680544.0  1200000.0         3942714.0   
BAY FRANKLIN R                          NaN   400000.0          145796.0   

             

We save this clean dataset to pickle files.

In [116]:
labels.to_pickle('labels.pkl')
features.to_pickle('features.pkl')
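
A quick round trip can confirm that a pickled object reloads unchanged. The sketch below uses a made-up toy Series and writes to a temporary directory so that it does not touch the project files; the file name inside the temporary directory is arbitrary.

```python
import os
import tempfile

import pandas as pd

# Toy labels (made up) standing in for the 'poi' Series.
toy_labels = pd.Series([False, True], index=["A", "B"], name="poi")

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels.pkl")
    toy_labels.to_pickle(path)
    reloaded = pd.read_pickle(path)

# equals() compares values and index, treating NaN as equal to NaN.
round_trip_ok = reloaded.equals(toy_labels)
print(round_trip_ok)
```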

## Start classification from cleaned data 
<a id='section3'></a>

We can now train a classifier on the cleaned data to try to identify persons of interest.

In [3]:
import sys
import pickle
import numpy as np
import pandas as pd
labels=pd.read_pickle('labels.pkl')
features=pd.read_pickle('features.pkl')

### Feature engineering
<a id='section3.1'></a>

We recall the features that we have:

In [4]:
print(features.columns.values)

['salary' 'to_messages' 'total_payments' 'exercised_stock_options' 'bonus'
 'restricted_stock' 'shared_receipt_with_poi' 'total_stock_value'
 'expenses' 'from_messages' 'other' 'from_this_person_to_poi'
 'long_term_incentive' 'from_poi_to_this_person']


Some of those features do not seem particularly relevant to the problem at hand. For instance, 'to_messages' and 'from_messages' are the total numbers of messages received and sent by a person, which say little about POIs on their own. However, we can combine them with the POI-related counts to create features that store the proportion of a person's messages exchanged with a POI.
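
One possible reading of this idea is sketched below on a made-up toy frame. The new column names `fraction_to_poi` and `fraction_from_poi` and the helper `safe_ratio` are assumptions for illustration, not necessarily the definitions used later in the project.

```python
import numpy as np
import pandas as pd

# Toy email counts (made up), using the column names of the dataset.
toy = pd.DataFrame(
    {
        "to_messages": [100.0, 50.0, np.nan, 0.0],
        "from_messages": [20.0, 10.0, np.nan, 5.0],
        "from_this_person_to_poi": [5.0, 0.0, np.nan, 1.0],
        "from_poi_to_this_person": [10.0, 25.0, np.nan, 0.0],
    },
    index=["A", "B", "C", "D"],
)


def safe_ratio(numerator, denominator):
    """Element-wise ratio that is NaN where the denominator is zero or missing."""
    return numerator / denominator.replace(0, np.nan)


# Share of the messages a person sent that went to a POI.
toy["fraction_to_poi"] = safe_ratio(toy["from_this_person_to_poi"], toy["from_messages"])

# Share of the messages a person received that came from a POI.
toy["fraction_from_poi"] = safe_ratio(toy["from_poi_to_this_person"], toy["to_messages"])

print(toy[["fraction_to_poi", "fraction_from_poi"]])
```

Replacing zero denominators with NaN avoids infinite values, so persons without email data end up with missing ratios that can be handled together with the other missing values.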