# Udacity Course - Intro to Machine Learning

## Final Project Overview:

In this project, you will play detective, and put your machine learning skills to use by building an algorithm to identify Enron Employees who may have committed fraud based on the public Enron financial and email dataset.

#### Project Introduction
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement, or plea deal with the government, or testified in exchange for prosecution immunity.

In [45]:

import sys
import pickle
import numpy as np
import pandas as pd
import sklearn
#from ggplot import *
import matplotlib as plt
%matplotlib inline

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import test_classifier, dump_classifier_and_data


### Task 1: Features Selection

The features in the data fall into three major types, namely financial features, email features and POI labels.

financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of emails messages; notable exception is ‘email_address’, which is a text string)

POI label: [‘poi’] (boolean, represented as integer)

I'm going to try to use all features, filter them and choose the best.

In [10]:

### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 
                 'loan_advances', 'bonus', 'restricted_stock_deferred', 
                 'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person', 
                 'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi', 
                 'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees'] 

#financial_features_list = ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']
#email_features_list = ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']


### Load the dictionary containing the dataset
with open("final_project_dataset_modified.pkl", "r") as data_file:
    enron_data = pickle.load(data_file)
    
enron_data['METTS MARK'].keys()

f = open('poi_names.txt', 'r')



In [13]:
df = pd.DataFrame.from_records(list(enron_data.values()))
persons = pd.Series(list(enron_data.keys()))

print persons.head()
print df.head()

0          METTS MARK
1       BAXTER JOHN C
2      ELLIOTT STEVEN
3    CORDES WILLIAM R
4      HANNON KEVIN P
dtype: object
     bonus deferral_payments deferred_income director_fees  \
0   600000               NaN             NaN           NaN   
1  1200000           1295738        -1386055           NaN   
2   350000               NaN         -400729           NaN   
3      NaN               NaN             NaN           NaN   
4  1500000               NaN        -3117011           NaN   

              email_address exercised_stock_options expenses from_messages  \
0      mark.metts@enron.com                     NaN    94299            29   
1                       NaN                 6680544    11200           NaN   
2  steven.elliott@enron.com                 4890344    78552           NaN   
3     bill.cordes@enron.com                  651850      NaN            12   
4    kevin.hannon@enron.com                 5538001    34039            32   

  from_poi_to_this_person from_thi

### Task 2: Remove outliers

In [15]:
# convert to numpy.nan
df.replace(to_replace='NaN', value=np.nan, inplace=True)

# count number of nan for columns
print df.isnull().sum()

# dataframe dimension
df.shape

bonus                         64
deferral_payments            106
deferred_income               96
director_fees                127
email_address                 34
exercised_stock_options       44
expenses                      51
from_messages                 59
from_poi_to_this_person       59
from_this_person_to_poi       59
loan_advances                141
long_term_incentive           80
other                         53
poi                            0
restricted_stock              36
restricted_stock_deferred    126
salary                        51
shared_receipt_with_poi       59
to_messages                   59
total_payments                21
total_stock_value             20
dtype: int64


(143, 21)

In [24]:
# remove column from df if null counter > 65
for column, series in df.iteritems():
    if series.isnull().sum() > 65:
        df.drop(column, axis=1, inplace=True)

# remove email address column
if 'email_address' in list(df.columns.values):
    df.drop('email_address', axis=1, inplace=True)



In [None]:
df.info()

In [27]:
# Impute the missing values
from sklearn.preprocessing import Imputer
#imp = Imputer(missing_values='NaN', strategy='median', axis=0)
#imp.fit(df)
#df_imp = pd.DataFrame(imp.transform(df.copy(deep=True)))

df_imp = df.replace(to_replace=np.nan, value=0)
df_imp = df.fillna(0).copy(deep=True)
df_imp.columns = list(df.columns.values)

print df_imp.isnull().sum()
print df_imp.head()

df_imp.describe()


bonus                      0
exercised_stock_options    0
expenses                   0
from_messages              0
from_poi_to_this_person    0
from_this_person_to_poi    0
other                      0
poi                        0
restricted_stock           0
salary                     0
shared_receipt_with_poi    0
to_messages                0
total_payments             0
total_stock_value          0
dtype: int64
       bonus  exercised_stock_options  expenses  from_messages  \
0   600000.0                      0.0   94299.0           29.0   
1  1200000.0                6680544.0   11200.0            0.0   
2   350000.0                4890344.0   78552.0            0.0   
3        0.0                 651850.0       0.0           12.0   
4  1500000.0                5538001.0   34039.0           32.0   

   from_poi_to_this_person  from_this_person_to_poi      other    poi  \
0                     38.0                      1.0     1740.0  False   
1                      0.0            

Unnamed: 0,bonus,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,other,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0
mean,592612.7,1715504.0,34719.447552,365.118881,37.552448,24.475524,226738.5,723523.5,171473.1,676.384615,1191.972028,1489942.0,2404233.0
std,1036924.0,3694149.0,45235.547286,1455.675655,74.148184,80.080666,755217.8,1571184.0,166040.4,1066.923179,2223.8603,2386745.0,4422592.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89292.5,214825.5
50%,250000.0,601438.0,17355.0,16.0,1.0,0.0,891.0,346663.0,206121.0,91.0,266.0,911453.0,954354.0
75%,800000.0,1636136.0,52688.5,50.5,39.5,12.5,149204.0,680164.0,267097.5,869.0,1504.0,1858492.0,2217787.0
max,8000000.0,30766060.0,228763.0,14368.0,528.0,609.0,7427621.0,13847070.0,1060932.0,5521.0,15149.0,17252530.0,30766060.0


In [28]:
print enron_data.keys()

['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY 

In [31]:
# drop row for 'THE TRAVEL AGENCY IN THE PARK'

park_index = enron_data.keys().index('THE TRAVEL AGENCY IN THE PARK')
print park_index
df_imp_sub = df_imp.drop(df_imp.index[[park_index]])
df_imp_sub.describe()

99


Unnamed: 0,bonus,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,other,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,596786.0,1727585.0,34963.950704,367.690141,37.816901,24.647887,225785.3,728618.7,172680.6,681.147887,1200.366197,1497885.0,2421165.0
std,1039388.0,3704389.0,45300.747867,1460.502581,74.342949,80.337515,757804.8,1575560.0,165996.8,1069.172948,2229.45777,2393296.0,4433593.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,0.0,0.0,0.0,0.0,-44093.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,88392.25,228869.5
50%,275000.0,604637.5,18094.5,16.5,2.5,0.0,882.5,353595.5,208310.5,102.5,289.0,913825.0,955113.5
75%,800000.0,1636585.0,52905.25,51.25,39.75,12.75,145428.5,689203.0,267099.8,871.5,1513.0,1863625.0,2218031.0
max,8000000.0,30766060.0,228763.0,14368.0,528.0,609.0,7427621.0,13847070.0,1060932.0,5521.0,15149.0,17252530.0,30766060.0


In [34]:
print 'Columns:', list(df_imp_sub.columns.values)
print 'Shape:', df_imp_sub.shape
print 'Number of POI in the sub dataset:', (df_imp_sub['poi'] == 1).sum()
print 'Number of non-POI in the sub dataset:', (df_imp_sub['poi'] == 0).sum()

 Columns: ['bonus', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'other', 'poi', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']
Shape: (142, 14)
Number of POI in the sub dataset: 16
Number of non-POI in the sub dataset: 126


### Task 3: Create new feature(s)

Create column 'poi_ratio' to store ratio in percentages. 

Also will be added next features:
the fraction of all emails to a person that were sent from a person of interest
the fraction of all emails that a person sent that were addressed to persons of interest

The hypothesis behind these features was that there might be stronger email connections between POIs than between POIs and non-POIs, and a scatterplot of these two features suggests that there might be some truth to that hypothesis.

Additionally i want to scale 'salary' to range [0,100].

In [43]:
def computeFraction( poi_messages, all_messages ):
    """ given a number messages to/from POI (numerator)
        and number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
   """
    fraction = 0.
    if poi_messages != 0 and all_messages != 0:
        fraction = float(poi_messages) / float(all_messages)

    return fraction * 100

In [44]:
for index, row in df_imp_sub.iterrows():
    row['poi_ratio'] = computeFraction(row['from_poi_to_this_person'] + row['from_this_person_to_poi'], row['from_messages'] + row['to_messages'])
    row['fraction_to_poi'] = computeFraction(row['from_this_person_to_poi'], row['from_messages'])
    row['fraction_from_poi'] = computeFraction(row['from_poi_to_this_person'], row['to_messages'])

df_imp_sub.describe()

Unnamed: 0,bonus,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,other,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value,poi_ratio,fraction_to_poi,fraction_from_poi
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,84.0,84.0,84.0
mean,596786.0,1727585.0,34963.950704,367.690141,37.816901,24.647887,225785.3,728618.7,172680.6,681.147887,1200.366197,1497885.0,2421165.0,4.808472,17.983987,3.823534
std,1039388.0,3704389.0,45300.747867,1460.502581,74.342949,80.337515,757804.8,1575560.0,165996.8,1069.172948,2229.45777,2393296.0,4433593.0,4.694606,21.09143,4.098911
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,0.0,0.0,0.0,0.0,-44093.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,88392.25,228869.5,,,
50%,275000.0,604637.5,18094.5,16.5,2.5,0.0,882.5,353595.5,208310.5,102.5,289.0,913825.0,955113.5,,,
75%,800000.0,1636585.0,52905.25,51.25,39.75,12.75,145428.5,689203.0,267099.8,871.5,1513.0,1863625.0,2218031.0,,,
max,8000000.0,30766060.0,228763.0,14368.0,528.0,609.0,7427621.0,13847070.0,1060932.0,5521.0,15149.0,17252530.0,30766060.0,22.435175,100.0,21.734104


In [42]:
df_imp_sub.head()

Unnamed: 0,bonus,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,other,poi,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value,poi_ratio,fraction_to_poi,fraction_from_poi
0,600000.0,0.0,94299.0,29.0,38.0,1.0,1740.0,False,585062.0,365788.0,702.0,807.0,1061827.0,585062.0,4.665072,3.448276,4.708798
1,1200000.0,6680544.0,11200.0,0.0,0.0,0.0,2660303.0,False,3942714.0,267102.0,0.0,0.0,5634343.0,10623258.0,,,
2,350000.0,4890344.0,78552.0,0.0,0.0,0.0,12961.0,False,1788391.0,170941.0,0.0,0.0,211725.0,6678735.0,,,
3,0.0,651850.0,0.0,12.0,10.0,0.0,0.0,False,386335.0,0.0,58.0,764.0,0.0,1038185.0,1.28866,0.0,1.308901
4,1500000.0,5538001.0,34039.0,32.0,32.0,21.0,11350.0,True,853064.0,243293.0,1035.0,1045.0,288682.0,6391065.0,4.921077,65.625,3.062201


In [37]:
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

poi_ratio = (df_imp_sub['from_poi_to_this_person'] + df_imp_sub['from_this_person_to_poi']) / (df_imp_sub['from_messages'] + df_imp_sub['to_messages'])
fraction_to_poi = (df_imp_sub['from_this_person_to_poi']) / (df_imp_sub['from_messages'])
fraction_from_poi = (df_imp_sub['from_poi_to_this_person']) / (df_imp_sub['to_messages'])

df_imp_sub['poi_ratio'] = pd.Series(poi_ratio) * 100
df_imp_sub['fraction_to_poi'] = pd.Series(fraction_to_poi) * 100
df_imp_sub['fraction_from_poi'] = pd.Series(fraction_from_poi) * 100

scale = sklearn.preprocessing.MinMaxScaler(feature_range=(0, 100), copy=True)
salary_scaled = scale.fit_transform(df_imp_sub['salary'])

df_imp_sub.describe()



Unnamed: 0,bonus,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,other,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value,poi_ratio,fraction_to_poi,fraction_from_poi
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,84.0,84.0,84.0
mean,596786.0,1727585.0,34963.950704,367.690141,37.816901,24.647887,225785.3,728618.7,172680.6,681.147887,1200.366197,1497885.0,2421165.0,4.808472,17.983987,3.823534
std,1039388.0,3704389.0,45300.747867,1460.502581,74.342949,80.337515,757804.8,1575560.0,165996.8,1069.172948,2229.45777,2393296.0,4433593.0,4.694606,21.09143,4.098911
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2604490.0,0.0,0.0,0.0,0.0,-44093.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8115.0,0.0,0.0,0.0,88392.25,228869.5,,,
50%,275000.0,604637.5,18094.5,16.5,2.5,0.0,882.5,353595.5,208310.5,102.5,289.0,913825.0,955113.5,,,
75%,800000.0,1636585.0,52905.25,51.25,39.75,12.75,145428.5,689203.0,267099.8,871.5,1513.0,1863625.0,2218031.0,,,
max,8000000.0,30766060.0,228763.0,14368.0,528.0,609.0,7427621.0,13847070.0,1060932.0,5521.0,15149.0,17252530.0,30766060.0,22.435175,100.0,21.734104


In [None]:
# 
plt.scatter('salary', 'fraction_to_poi', color='blue')
plt.scatter('salary', 'fraction_from_poi', color='red')
plt.xlabel('Salary')
plt.ylabel('Fraction')
plt.show()

### Task 4: Try a varity of classifiers

In [None]:
### Task 4: Try a varity of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html



In [None]:
# Split the data at first on training and test data. We will use Stratified Shuffle Split due to small dataset
labels = df_im_sub['poi'].copy(deep=True).astype(int).as_matrix()
features = (df_im_sub.drop('poi', axis=1)).fillna(0).copy(deep=True).as_matrix()
shuffle = sklearn.cross_validation.StratifiedShuffleSplit(labels, 4, test_size=0.3, random_state=0)

print labels
print features

In [None]:
# Data transformation with held out data
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(features_train)
features_train_transformed = scaler.transform(features_train)
clf = svm.SVC(C=1).fit(features_train_transformed, labels_train)
features_test_transformed = scaler.transform(features_test)
clf.score(features_test_transformed, labels_test)  

In [None]:
# Try GaussianNB
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels)
print scores

In [None]:
# Try ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier

erf_clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
scores = sklearn.cross_validation.cross_val_score(erf_clf, features, labels)
print scores

In [None]:
# Try RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=10)
scores = sklearn.cross_validation.cross_val_score(rf_clf, features, labels)

print scores

In [None]:
# Try AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(ab_clf, features, labels)
print scores 


Choosen Algorithms:
Random Forest
AdaBoost
Why not SVM? It's too slow for tuning, i want more computational time efficient algorithm.
SVM and NB seems not to be good choice from initial scores.


### Task 5: Tune your classifier to achieve better than .3 precision and recall 

In [None]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)



#### Get the best fitting parameters and Testing for RandomForest

In [None]:
# get the best parameters
from sklearn.model_selection import KFold

cv = KFold(n_splits=5)

parameters = {'n_estimators': [10,20,30,40,50],
               'min_samples_split' :[2,3,4,5],
               'min_samples_leaf' : [1,2,3]
             }

rfclf = RandomForestClassifier()
grid_obj = GridSearchCV(rfclf, parameters, cv=cv)
grid_fit = grid_obj.fit(features_train, labels_train)
best_rfclf = grid_fit.best_estimator_ 

best_rfclf.fit(features_train,labels_train)

In [None]:
# RandomForestClassifier use the best parameter
from sklearn.cross_validation import KFold

rf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=40, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

rf.fit(features_train, labels_train)

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(rf, features, labels_train, cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

output = best_rfclf.predict(features_test)
print len(output)

In [None]:
from sklearn import grid_search

cv = sklearn.cross_validation.StratifiedShuffleSplit(labels, n_iter=10)

def scoring(estimator, features_test, labels_test):
     labels_pred = estimator.predict(features_test)
     p = sklearn.metrics.precision_score(labels_test, labels_pred, average='micro')
     r = sklearn.metrics.recall_score(labels_test, labels_pred, average='micro')
     if p > 0.3 and r > 0.3:
            return sklearn.metrics.f1_score(labels_test, labels_pred, average='macro')
     return 0

'''
parameters = {'max_depth': [2,3,4,5,6],'min_samples_split':[2,3,4,5], 'n_estimators':[10,20,50], 'min_samples_leaf':[1,2,3,4], 'criterion':('gini', 'entropy')}
rf_clf = RandomForestClassifier()
rfclf = grid_search.GridSearchCV(rf_clf, parameters, scoring = scoring, cv = cv)
rfclf.fit(features, labels)

print rfclf.best_estimator_
print rfclf.best_score_
'''

#### Get the best fitting parameters and Testing for Adaboost

In [None]:

from sklearn import grid_search
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
parameters = {'n_estimators' : [5, 10, 30, 40, 50, 100,150], 'learning_rate' : [0.1, 0.5, 1, 1.5, 2, 2.5], 'algorithm' : ('SAMME', 'SAMME.R')}
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8))
adaclf = grid_search.GridSearchCV(ada_clf, parameters, scoring = scoring, cv = cv)
adaclf.fit(features, labels)

print adaclf.best_estimator_
print adaclf.best_score_

In [None]:
#prepare parameter for dump classifier
clf_dump = adaclf

features_list = list(df_imp_sub.columns.values)

list_cols = list(df_subset.columns.values)
features_list.remove('poi')
features_list.insert(0, 'poi')
data = df_subset[features_list].fillna(0).to_dict(orient='records')
enron_data_sub = {}
counter = 0
for item in data:
    enron_data_sub[counter] = item
    counter += 1
    
my_dataset = enron_data_sub    

### Task 6: Dump your classifier, dataset, and features_list for checking

In [None]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf_dump, my_dataset, features_list)

### List of Resources


sklearn documentation: http://scikit-learn.org/stable/index.html
pandas documentation: http://pandas.pydata.org
Jason Brownlee on Machine Learning Process:  How to Identify Outliers in your Data http://machinelearningmastery.com/how-to-identify-outliers-in-your-data/
