# Determining the best classifier for the POI in the Enron case using FeatureUnion (PCA and SelectKBest)

## Data Cleaning

**Loading the data**

In [1]:
import pickle
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

#### Features and number of features

In [2]:
all_features = []
c = 0
for key in data_dict:
    if c < 1:
        for feature in data_dict[key]:
            all_features.append(feature)
        c += 1
print "Features: \n{}".format(all_features)
print "Number of features: {}".format(len(all_features))
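
Because every record is expected to carry the same keys, the feature names can also be read from a single record without a counter. A minimal sketch, using a tiny made-up stand-in (`toy_dict`, a hypothetical name) instead of the real dataset:

```python
# Hypothetical two-record stand-in for data_dict (values are made up).
toy_dict = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 50, "poi": True},
}

# Take any one record and list its keys; sorting makes the order stable.
first_record = next(iter(toy_dict.values()))
feature_names = sorted(first_record)

# Sanity check: every record exposes the same set of features.
same_keys = all(set(rec) == set(first_record) for rec in toy_dict.values())

print("Features: {}".format(feature_names))
print("Number of features: {}".format(len(feature_names)))
```

The `same_keys` check is optional, but it guards against the assumption that one record is representative of all of them.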

Features: 
['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']
Number of features: 21


**Number of poi == 1:**

In [3]:
poi = 0
for key in data_dict:
    if data_dict[key]['poi'] == 1:
        poi += 1
print poi

18


**Number of non-poi (poi == 0)**

In [4]:
len(data_dict) - poi

128
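
These two counts are worth turning into proportions, since they show how imbalanced the classes are. A short worked calculation using only the numbers printed above (18 POIs and 128 non-POIs, counted before any records are removed):

```python
n_poi = 18        # count printed by the POI cell above
n_non_poi = 128   # count printed by the non-POI cell above
n_total = n_poi + n_non_poi

poi_share = float(n_poi) / n_total
# Accuracy of a trivial model that labels everyone as non-POI.
majority_baseline = float(n_non_poi) / n_total

print("POI share: {:.3f}".format(poi_share))
print("Always-non-POI accuracy: {:.3f}".format(majority_baseline))
```

Roughly 12% of the records are POIs, so a model that never predicts a POI would already score about 88% accuracy. This is why precision and recall are more informative than accuracy for this dataset.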

**Removal of "TOTAL" and "THE TRAVEL AGENCY IN THE PARK"**

The removal of "TOTAL" was discussed in class: it is the spreadsheet's totals row, not a person.

In [5]:
data_dict.pop("TOTAL")

{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}

Looking through the dictionary, I noticed that "THE TRAVEL AGENCY IN THE PARK" is not a person, so I remove it as well:

In [6]:
data_dict.pop('THE TRAVEL AGENCY IN THE PARK')

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 362096,
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 362096,
 'total_stock_value': 'NaN'}

After seeing how other people have treated the Enron data, I found out that LOCKHART EUGENE E has no data at all (every field is 'NaN'), so that record is removed here, too.

In [7]:
data_dict.pop('LOCKHART EUGENE E') # no data

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}
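
Records like this one can also be found programmatically. A minimal sketch, assuming missing values are stored as the string 'NaN' (as in the output above); `empty_records` and `toy_dict` are hypothetical names, and the values are made up:

```python
def empty_records(dataset, ignore=("poi",)):
    """Return names whose every field (except those in `ignore`) is 'NaN'."""
    names = []
    for name, record in dataset.items():
        values = [v for k, v in record.items() if k not in ignore]
        if values and all(v == "NaN" for v in values):
            names.append(name)
    return sorted(names)

# Made-up stand-in for data_dict.
toy_dict = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": False},
}
print(empty_records(toy_dict))
```

The label `poi` is ignored because it is always set, even for records that carry no other information.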

**Fixing some data points**

I also learned from other people's work that two records contain errors: their total_payments and total_stock_value did not add up. Here, I fix both records manually, using values from the provided PDF file (enron61702insiderpay.pdf).

In [8]:
data_dict['BELFER ROBERT']['director_fees'] = 102500
data_dict['BELFER ROBERT']['exercised_stock_options'] = 0
data_dict['BELFER ROBERT']['expenses'] = 3285
data_dict['BELFER ROBERT']['restricted_stock_deferred'] = -44093
data_dict['BELFER ROBERT']['total_payments'] = 3285
data_dict['BELFER ROBERT']['deferred_income'] = -102500
data_dict['BELFER ROBERT']['deferral_payments'] = 0
data_dict['BELFER ROBERT']['restricted_stock'] = 44093
data_dict['BELFER ROBERT']['total_stock_value'] = 0

data_dict['BHATNAGAR SANJAY']['director_fees'] = 0
data_dict['BHATNAGAR SANJAY']['exercised_stock_options'] = 15456290
data_dict['BHATNAGAR SANJAY']['expenses'] = 137864
data_dict['BHATNAGAR SANJAY']['other'] = 0
data_dict['BHATNAGAR SANJAY']['restricted_stock'] = 2604490
data_dict['BHATNAGAR SANJAY']['restricted_stock_deferred'] = -2604490
data_dict['BHATNAGAR SANJAY']['total_payments'] = 137864
data_dict['BHATNAGAR SANJAY']['total_stock_value'] = 15456290
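
A check of this kind can be sketched as follows. The grouping of fields into payment and stock components is an assumption here, and `as_number` and `totals_consistent` are hypothetical helper names. The example record reuses the corrected BELFER ROBERT values from the cell above and treats every field not listed there as missing:

```python
# Assumed grouping of the financial fields into the two totals.
PAYMENT_FIELDS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                  "deferral_payments", "loan_advances", "other", "expenses",
                  "director_fees"]
STOCK_FIELDS = ["exercised_stock_options", "restricted_stock",
                "restricted_stock_deferred"]

def as_number(value):
    """Treat the dataset's 'NaN' placeholder (or a missing key) as zero."""
    return 0 if value in ("NaN", None) else value

def totals_consistent(record):
    """True if the components add up to both recorded totals."""
    payments = sum(as_number(record.get(f)) for f in PAYMENT_FIELDS)
    stock = sum(as_number(record.get(f)) for f in STOCK_FIELDS)
    return (payments == as_number(record.get("total_payments")) and
            stock == as_number(record.get("total_stock_value")))

# Corrected BELFER ROBERT values, copied from the cell above.
belfer = {
    "director_fees": 102500, "exercised_stock_options": 0, "expenses": 3285,
    "restricted_stock_deferred": -44093, "total_payments": 3285,
    "deferred_income": -102500, "deferral_payments": 0,
    "restricted_stock": 44093, "total_stock_value": 0,
}
print(totals_consistent(belfer))

# A deliberately broken copy, to show that the check can fail.
broken = dict(belfer, total_payments=102500)
print(totals_consistent(broken))
```

Running the same function over every record in the dataset would flag any other person whose totals do not match.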

**Derivation of new features, "fraction_to_poi", "fraction_from_poi"**

These were discussed in class.

In [9]:
def computeFraction(poi_messages, all_messages):
    """ given the number of messages to/from a POI (numerator)
        and the number of all messages to/from a person (denominator),
        return the fraction of that person's messages
        that are from/to a POI
    """
    # Both counts must be present; 'NaN' marks a missing value in this dataset.
    if poi_messages != 'NaN' and all_messages != 'NaN':
        fraction = float(poi_messages) / float(all_messages)
    else:
        fraction = 0
    return fraction

In [10]:
for name in data_dict:
    data_point = data_dict[name]
    
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction(from_poi_to_this_person, to_messages)
    
    data_point["fraction_from_poi"] = fraction_from_poi
    
    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
    
    data_point["fraction_to_poi"] = fraction_to_poi
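
The fraction needs care in two edge cases: a missing count (the string 'NaN') and a zero denominator. A minimal sketch of one way to handle both, written as a standalone helper with the hypothetical name `safe_fraction` and made-up message counts:

```python
def safe_fraction(poi_messages, all_messages):
    """Share of a person's messages that involve a POI; 0.0 if undefined."""
    if poi_messages == "NaN" or all_messages == "NaN":
        return 0.0   # either count is missing
    if all_messages == 0:
        return 0.0   # avoid dividing by zero
    return float(poi_messages) / float(all_messages)

print(safe_fraction(30, 120))      # 0.25
print(safe_fraction("NaN", 120))   # 0.0
print(safe_fraction(5, 0))         # 0.0
```

Returning 0.0 for undefined cases is a design choice: it keeps the feature numeric, at the cost of making "no email data" look the same as "no contact with POIs".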

Checking whether the new features were added:

In [11]:
all_features = []
c = 0
for key in data_dict:
    if c < 1:
        for feature in data_dict[key]:
            all_features.append(feature)
        c += 1

In [12]:
all_features

['to_messages',
 'deferral_payments',
 'expenses',
 'poi',
 'deferred_income',
 'email_address',
 'long_term_incentive',
 'fraction_from_poi',
 'restricted_stock_deferred',
 'shared_receipt_with_poi',
 'loan_advances',
 'from_messages',
 'other',
 'director_fees',
 'bonus',
 'total_stock_value',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'restricted_stock',
 'salary',
 'total_payments',
 'fraction_to_poi',
 'exercised_stock_options']

In [13]:
len(all_features)

23

**Removal of Some Features**

Because of the new features above, it makes sense to remove the features they were derived from. "email_address" is removed because it is not numeric and so cannot go into the numpy arrays used for the analysis. "poi" is removed for the moment because it needs to be the first element in the list, as required by the function used to create the numpy arrays (featureFormat).

In [14]:
features_remove = ["poi", "email_address", "from_poi_to_this_person", "from_this_person_to_poi", "from_messages", "to_messages"]

In [15]:
features_list = ["poi"]
for feature in all_features:
    if feature not in features_remove:
        features_list.append(feature)
print "features_list = {}".format(features_list)

features_list = ['poi', 'deferral_payments', 'expenses', 'deferred_income', 'long_term_incentive', 'fraction_from_poi', 'restricted_stock_deferred', 'shared_receipt_with_poi', 'loan_advances', 'other', 'director_fees', 'bonus', 'total_stock_value', 'restricted_stock', 'salary', 'total_payments', 'fraction_to_poi', 'exercised_stock_options']


In [16]:
len(features_list)

18

**Extracting features and labels from dataset for local testing**

In [17]:
from feature_format import featureFormat, targetFeatureSplit

In [18]:
my_dataset = data_dict
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

## Exploring a variety of classifiers

Here, I explore various classifiers by looping over them and collecting the scores for each classifier in a dictionary.

But before this can be done, the data is split into training and testing sets. As hinted in the provided tester.py, splitting the data is done using StratifiedShuffleSplit to account for the fact that the number of one class (i.e. POI) is a lot lower than the other (non-POI). 

**Splitting the features to test and train, converting to numpy arrays**

In [19]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
features = np.array(features)
labels = np.array(labels)
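cv = StratifiedShuffleSplit(n_splits=1000, random_state=42)
# Each iteration overwrites the previous indices, so only the last of the
# 1000 splits is kept for the train/test sets used below.
for train_idx, test_idx in cv.split(features, labels):
    features_train, features_test = features[train_idx], features[test_idx]
    labels_train, labels_test = labels[train_idx], labels[test_idx]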

In [20]:
len(features_train)

128

In [21]:
len(labels_train)

128

In [22]:
len(features_test)

15

In [23]:
len(labels_test)

15
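
The sizes above follow from `StratifiedShuffleSplit` holding out 10% of the samples by default, while stratification keeps the POI share similar in both parts. A small sketch on synthetic data (143 samples with 18 positives, chosen to mirror the counts in this notebook; the feature values are random and purely illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(143, 4)                    # synthetic features
y = np.array([1] * 18 + [0] * 125)      # 18 positives, 125 negatives

sss = StratifiedShuffleSplit(n_splits=5, random_state=42)
splits = []
for train_idx, test_idx in sss.split(X, y):
    # Record the sizes and the number of positives landing in each part.
    splits.append((len(train_idx), len(test_idx),
                   int(y[train_idx].sum()), int(y[test_idx].sum())))

for row in splits:
    print("train={} test={} train_pos={} test_pos={}".format(*row))
```

Each iteration yields a different random split. A loop that only assigns the indices therefore ends up with whichever split came last, so scoring inside the loop and averaging over the splits would give a more stable estimate than a single 15-sample test set.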

**Importing modules**

In [24]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score

In [25]:
from sklearn.pipeline import FeatureUnion, Pipeline

**Classifiers to be explored**

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html


In [27]:
names = ["K Nearest Neighbors", "RBF SVM", "Linear SVM", "Decision Tree", "Naive Bayes", 
         "AdaBoost", "Random Forest", "Logistic Regression"]

In [28]:
classifiers = [KNeighborsClassifier(), SVC(kernel="rbf", random_state=42),
               SVC(kernel="linear", random_state=42), DecisionTreeClassifier(random_state=42),
               GaussianNB(), AdaBoostClassifier(random_state=42), RandomForestClassifier(random_state=42),
               LogisticRegression(random_state=42)]

**Using FeatureUnion**

http://scikit-learn.org/stable/auto_examples/feature_stacker.html

In [40]:
pca = PCA(n_components=5)

In [41]:
selection = SelectKBest(k=3)

In [42]:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

Use combined features to transform dataset:

In [43]:
features_train_transformed = combined_features.fit(features_train, labels_train).transform(features_train)

In [44]:
features_test_transformed = combined_features.transform(features_test)
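
`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so `PCA(n_components=5)` combined with `SelectKBest(k=3)` should yield 8 columns. A minimal sketch on synthetic data (the array shapes and random values are assumptions for illustration, not the Enron features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

rng = np.random.RandomState(0)
X_demo = rng.rand(60, 17)                # 60 samples, 17 synthetic features
y_demo = np.array([0] * 50 + [1] * 10)   # imbalanced synthetic labels

union = FeatureUnion([("pca", PCA(n_components=5)),
                      ("univ_select", SelectKBest(k=3))])
X_union = union.fit_transform(X_demo, y_demo)

# Look up the fitted selector to see which original columns it kept.
selector = dict(union.transformer_list)["univ_select"]
kept = selector.get_support(indices=True)

print(X_union.shape)   # (60, 8): 5 principal components + 3 selected columns
print(kept)
```

The last three columns of `X_union` are unchanged copies of the selected input columns, while the first five are linear combinations of all inputs. The two blocks can therefore carry overlapping information.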

**Iterating over numerous classifiers**

In [45]:
results = {}
for name, clf in zip(names, classifiers):
    pipe = Pipeline([("scaler", MinMaxScaler()), ("features", combined_features), ("clf", clf)])
    pipe.fit(features_train, labels_train)
    
    scores = {}

    score =  round(pipe.score(features_test, labels_test), 3)
    scores["Accuracy score"] = score

    pred = pipe.predict(features_test)
    
    conf_mat = confusion_matrix(labels_test, pred)
    scores["Confusion matrix"] = conf_mat
    
    precisionscore = round(precision_score(labels_test, pred), 3)
    scores["Precision score"] = precisionscore
    
    recallscore = round(recall_score(labels_test, pred), 3)
    scores["Recall score"] = recallscore
    
    results[name] = scores

In [46]:
import pprint
pprint.pprint(results)

{'AdaBoost': {'Accuracy score': 0.667,
              'Confusion matrix': array([[10,  3],
       [ 2,  0]]),
              'Precision score': 0.0,
              'Recall score': 0.0},
 'Decision Tree': {'Accuracy score': 0.8,
                   'Confusion matrix': array([[12,  1],
       [ 2,  0]]),
                   'Precision score': 0.0,
                   'Recall score': 0.0},
 'K Nearest Neighbors': {'Accuracy score': 0.933,
                         'Confusion matrix': array([[13,  0],
       [ 1,  1]]),
                         'Precision score': 1.0,
                         'Recall score': 0.5},
 'Linear SVM': {'Accuracy score': 0.867,
                'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
                'Precision score': 0.0,
                'Recall score': 0.0},
 'Logistic Regression': {'Accuracy score': 0.933,
                         'Confusion matrix': array([[13,  0],
       [ 1,  1]]),
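                         'Precision score': 1.0,
                         'Recall score': 0.5},
 'Naive Bayes': {'Accuracy score': 0.8,
                 'Confusion matrix': array([[11,  2],
       [ 1,  1]]),
                 'Precision score': 0.333,
                 'Recall score': 0.5},
 'RBF SVM': {'Accuracy score': 0.867,
             'Confusion matrix': array([[13,  0],
       [ 2,  0]]),
             'Precision score': 0.0,
             'Recall score': 0.0},
 'Random Forest': {'Accuracy score': 0.733,
                   'Confusion matrix': array([[10,  3],
       [ 1,  1]]),
                   'Precision score': 0.25,
                   'Recall score': 0.5}}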

**Converting the dictionaries to pandas dataframe to easily see the scores**

In [47]:
import pandas as pd

In [48]:
pd.DataFrame.from_dict(results)

| | AdaBoost | Decision Tree | K Nearest Neighbors | Linear SVM | Logistic Regression | Naive Bayes | RBF SVM | Random Forest |
|---|---|---|---|---|---|---|---|---|
| Accuracy score | 0.667 | 0.8 | 0.933 | 0.867 | 0.933 | 0.8 | 0.867 | 0.733 |
| Confusion matrix | [[10, 3], [2, 0]] | [[12, 1], [2, 0]] | [[13, 0], [1, 1]] | [[13, 0], [2, 0]] | [[13, 0], [1, 1]] | [[11, 2], [1, 1]] | [[13, 0], [2, 0]] | [[10, 3], [1, 1]] |
| Precision score | 0 | 0 | 1 | 0 | 1 | 0.333 | 0 | 0.25 |
| Recall score | 0 | 0 | 0.5 | 0 | 0.5 | 0.5 | 0 | 0.5 |
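
Every score in the results above can be recomputed by hand from the confusion matrix, which for 0/1 labels is laid out as [[TN, FP], [FN, TP]]. A short sketch (`metrics_from_confusion` is a hypothetical helper, not part of scikit-learn):

```python
def metrics_from_confusion(cm):
    """cm is [[TN, FP], [FN, TP]]; returns (accuracy, precision, recall)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = float(tn + tp) / total
    # With no predicted (or no actual) positives the ratio is undefined;
    # report 0.0, matching the zeros in the results above.
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return round(accuracy, 3), round(precision, 3), round(recall, 3)

print(metrics_from_confusion([[11, 2], [1, 1]]))   # Naive Bayes
print(metrics_from_confusion([[13, 0], [2, 0]]))   # Linear SVM
```

The Linear SVM row shows why accuracy is misleading here: it reaches 0.867 accuracy without identifying a single POI. With only two POIs in the test set, recall can only take the values 0, 0.5 or 1, so these scores are very coarse.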


Best ones (the only classifiers with nonzero precision and recall on this test split):

- KNN
- Logistic Regression
- Naive Bayes
- Random Forest


See poi_id_skb_pca_featureunion_knn.py for optimizing parameters.