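# Machine Learning Project Assignment
This report explores the dataset and uses machine learning methods to identify likely persons of interest (POIs).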

## Data Cleaning
First, I load the dataset used in this study.

In [201]:
import sys
import pickle
import pandas as pd
import numpy as np
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from tester import test_classifier

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectPercentile, f_classif

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # binary mode is the safe choice for pickle files
    data_dict = pickle.load(data_file)
    my_dataset = data_dict
    features_list = ['poi',
                 'salary',
                 'bonus',
                 'to_messages',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'restricted_stock_deferred',
                 'total_stock_value',
                 'expenses',
                 'loan_advances',
                 'director_fees', 
                 'deferred_income',
                 'long_term_incentive',
                 'from_poi_to_this_person',
                 'from_this_person_to_poi']
    data = featureFormat(my_dataset, features_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)

In [190]:
print len(my_dataset)

146


### Removing non-person data points
First, I print every name in the dataset and remove the entries that are clearly not people.

In [202]:
for element in my_dataset:
    print element

METTS MARK
BAXTER JOHN C
ELLIOTT STEVEN
CORDES WILLIAM R
HANNON KEVIN P
MORDAUNT KRISTINA M
MEYER ROCKFORD G
MCMAHON JEFFREY
HORTON STANLEY C
PIPER GREGORY F
HUMPHREY GENE E
UMANOFF ADAM S
BLACHMAN JEREMY M
SUNDE MARTIN
GIBBS DANA R
LOWRY CHARLES P
COLWELL WESLEY
MULLER MARK S
JACKSON CHARLENE R
WESTFAHL RICHARD K
WALTERS GARETH W
WALLS JR ROBERT H
KITCHEN LOUISE
CHAN RONNIE
BELFER ROBERT
SHANKMAN JEFFREY A
WODRASKA JOHN
BERGSIEKER RICHARD P
URQUHART JOHN A
BIBI PHILIPPE A
RIEKER PAULA H
WHALEY DAVID A
BECK SALLY W
HAUG DAVID L
ECHOLS JOHN B
MENDELSOHN JOHN
HICKERSON GARY J
CLINE KENNETH W
LEWIS RICHARD
HAYES ROBERT E
MCCARTY DANNY J
KOPPER MICHAEL J
LEFF DANIEL P
LAVORATO JOHN J
BERBERIAN DAVID
DETMERING TIMOTHY J
WAKEHAM JOHN
POWERS WILLIAM
GOLD JOSEPH
BANNANTINE JAMES M
DUNCAN JOHN H
SHAPIRO RICHARD S
SHERRIFF JOHN R
SHELBY REX
LEMAISTRE CHARLES
DEFFNER JOSEPH M
KISHKILL JOSEPH G
WHALLEY LAWRENCE G
MCCONNELL MICHAEL S
PIRO JIM
DELAINEY DAVID W
SULLIVAN-SHAKLOVITZ COLLEEN
WROBEL BRUC

In [203]:
del my_dataset['TOTAL']
del my_dataset['THE TRAVEL AGENCY IN THE PARK']

### Removing data points that are entirely NaN
Next, I remove the records whose values are all NaN.

In [204]:
for element in my_dataset:
    for feature in my_dataset[element]:
        item = my_dataset[element][feature]
        if item == 'NaN' or item == 0:
            NaN = True
        else:
            NaN = False
            break
    if NaN:
        print element

LOCKHART EUGENE E


In [205]:
del my_dataset['LOCKHART EUGENE E']
print len(my_dataset)

143


### Initial data exploration
Next, I use pandas for a first look at the dataset.

In [206]:
data = featureFormat(my_dataset, features_list, sort_keys = True)
## Import pandas and build a DataFrame to start analyzing the data
import pandas as pd
df = pd.DataFrame(data)
len(df) # number of rows in the dataset

143

In [196]:
(df == 0 ).sum()

0     125
1      49
2      62
3      57
4     105
5      20
6      42
7      34
8      57
9     126
10     18
11     49
12    140
13    127
14     95
15     78
16     69
17     77
dtype: int64

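Of the 143 records, four feature columns (4, 9, 12 and 13) are zero in more than two-thirds of the rows, and column 14 sits right at that threshold. Column 0 is the POI label itself, so its 125 zeros are simply the non-POIs. Because featureFormat converts 'NaN' to 0, many of these zeros likely stand for missing values. This should be taken into account when selecting features.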
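
To make the "more than two-thirds zeros" observation concrete, the sketch below recomputes the zero fraction of each column from the counts printed above. The mapping from column index to feature name is an assumption: it takes the column order to be the `features_list` defined in the loading cell. The helper name `sparse_features` is hypothetical.

```python
# Zero counts per column, copied from the (df == 0).sum() output above.
zero_counts = [125, 49, 62, 57, 105, 20, 42, 34, 57,
               126, 18, 49, 140, 127, 95, 78, 69, 77]

# Assumption: columns follow the order of features_list in the loading cell.
feature_names = ['poi', 'salary', 'bonus', 'to_messages', 'deferral_payments',
                 'total_payments', 'exercised_stock_options', 'restricted_stock',
                 'shared_receipt_with_poi', 'restricted_stock_deferred',
                 'total_stock_value', 'expenses', 'loan_advances',
                 'director_fees', 'deferred_income', 'long_term_incentive',
                 'from_poi_to_this_person', 'from_this_person_to_poi']
n_rows = 143


def sparse_features(names, counts, n_rows, threshold=2.0 / 3):
    """Return (name, zero fraction) for features above the zero-fraction threshold.

    The 'poi' column is skipped because it is the label, not a feature.
    """
    result = []
    for name, count in zip(names, counts):
        fraction = count / float(n_rows)
        if name != 'poi' and fraction > threshold:
            result.append((name, fraction))
    return result


flagged = sparse_features(feature_names, zero_counts, n_rows)
for name, fraction in flagged:
    print("{:28s} {:.3f}".format(name, fraction))
```

With these counts, deferred_income (95 of 143) falls just below the two-thirds line, so it is not flagged.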

In [207]:
df.describe()

| statistic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 |
| mean | 0.125874 | 186742.9 | 680724.6 | 1247.216783 | 223642.6 | 2272323.0 | 2090318.0 | 874610.0 | 707.524476 | 73931.31 | 2930134.0 | 35622.72028 | 586888.1 | 10050.111888 | -195037.7 | 339314.2 | 39.027972 | 24.797203 |
| std | 0.332873 | 197117.1 | 1236180.0 | 2243.006069 | 756520.8 | 8876252.0 | 4809193.0 | 2022338.0 | 1079.457016 | 1306545.0 | 6205937.0 | 45370.869604 | 6818177.0 | 31399.349067 | 607922.5 | 689013.9 | 74.466359 | 80.031821 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | -102500.0 | 0.0 | 0.0 | -2604490.0 | 0.0 | -1787380.0 | -44093.0 | 0.0 | 0.0 | 0.0 | -3504386.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 96796.5 | 0.0 | 38276.5 | 0.0 | 0.0 | 254936.0 | 0.0 | 0.0 | 0.0 | -37506.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 210692.0 | 300000.0 | 383.0 | 0.0 | 966522.0 | 608750.0 | 360528.0 | 114.0 | 0.0 | 976037.0 | 21530.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 75% | 0.0 | 270259.0 | 800000.0 | 1639.0 | 9110.0 | 1956978.0 | 1698900.0 | 775992.0 | 967.5 | 0.0 | 2307584.0 | 53534.5 | 0.0 | 0.0 | 0.0 | 374825.5 | 41.5 | 14.0 |
| max | 1.0 | 1111258.0 | 8000000.0 | 15149.0 | 6426990.0 | 103559800.0 | 34348380.0 | 14761690.0 | 5521.0 | 15456290.0 | 49110080.0 | 228763.0 | 81525000.0 | 137864.0 | 0.0 | 5145434.0 | 528.0 | 609.0 |


In [208]:
df.groupby([0])[0].count()

0
0.0    125
1.0     18
Name: 0, dtype: int64

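There are 18 POIs and 125 non-POIs. Because the classes are imbalanced, precision and recall are more appropriate than accuracy for evaluating the algorithms later on.
These summary statistics alone do not reveal how the features relate to POI status, so I turn to machine learning to analyze the data.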
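
A short worked example of why accuracy is misleading at this class ratio: a hypothetical "classifier" that labels everyone as a non-POI already looks accurate while finding no POI at all. The counts come from the groupby output above; the always-negative classifier is an illustration, not one of the models used in this report.

```python
n_poi, n_non_poi = 18, 125

# A trivial baseline that predicts "non-POI" for every person.
true_positives = 0
false_positives = 0
false_negatives = n_poi      # every real POI is missed
true_negatives = n_non_poi   # every non-POI is labelled correctly

accuracy = (true_positives + true_negatives) / float(n_poi + n_non_poi)
recall = true_positives / float(true_positives + false_negatives)

print("accuracy: {:.3f}".format(accuracy))  # accuracy: 0.874
print("recall:   {:.3f}".format(recall))    # recall:   0.000
```
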
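## Feature Engineering Preparation
### New feature
First, from the four values to_messages, from_messages, from_this_person_to_poi and from_poi_to_this_person, I compute the ratio of POI-related emails to all emails and store it as a new feature named poi_email_rate. My reasoning is that a person whose mail is largely exchanged with POIs is more likely to be a POI as well.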

In [308]:
for element in my_dataset:
    record = my_dataset[element]
    to_messages = record['to_messages']
    from_messages = record['from_messages']
    from_this_person_to_poi = record['from_this_person_to_poi']
    from_poi_to_this_person = record['from_poi_to_this_person']
    # the rate is undefined if any of the four counts is missing
    if  to_messages == 'NaN' or \
    from_messages == 'NaN' or \
    from_this_person_to_poi == 'NaN' or \
    from_poi_to_this_person == 'NaN':
        record['poi_email_rate'] = 'NaN'
    else:
        record['poi_email_rate'] = float(from_this_person_to_poi+from_poi_to_this_person)/(to_messages+from_messages)
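
The same ratio can be written as a small standalone function, which makes the missing-value handling easier to test. This is a minimal sketch: the function name `poi_email_rate` and the sample record are hypothetical, and the extra guard against a zero email total is an addition that the cell above does not have.

```python
def poi_email_rate(record):
    """Share of a person's emails exchanged with POIs, or 'NaN' if undefined."""
    keys = ('to_messages', 'from_messages',
            'from_this_person_to_poi', 'from_poi_to_this_person')
    values = [record.get(key, 'NaN') for key in keys]
    if 'NaN' in values:          # any missing count makes the rate undefined
        return 'NaN'
    to_msgs, from_msgs, to_poi, from_poi = values
    total = to_msgs + from_msgs
    if total == 0:               # added guard: avoid dividing by zero
        return 'NaN'
    return float(to_poi + from_poi) / total


# Hypothetical record, not taken from the dataset.
example = {'to_messages': 1000, 'from_messages': 200,
           'from_this_person_to_poi': 30, 'from_poi_to_this_person': 90}
print(poi_email_rate(example))                   # 120 POI emails out of 1200
print(poi_email_rate({'to_messages': 'NaN'}))    # 'NaN'
```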

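Next, I wrote a function that scores the features and keeps the top-ranked ones. The percent argument controls how many features are kept, which makes later tuning easier.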

In [309]:
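from sklearn.cross_validation import StratifiedShuffleSplit  # needed below; otherwise only imported in a later cell

def feature_selection(percent,print_score = False):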
    features_list = ['poi',
                 'to_messages',
                 'from_messages',
                 'from_this_person_to_poi',
                 'from_poi_to_this_person',
                 'salary',
                 'bonus',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'restricted_stock_deferred',
                 'total_stock_value',
                 'expenses',
                 'loan_advances',
                 'director_fees',
                 'deferred_income',
                 'long_term_incentive',
                 'other',
                 'poi_email_rate']
    data = featureFormat(my_dataset, features_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)

    cv = StratifiedShuffleSplit(labels, 1000, random_state = 42)
    
    score_list = []
    
    for train_idx, test_idx in cv: 
        features_train = []
        features_test  = []
        labels_train   = []
        labels_test    = []
        for ii in train_idx:
            features_train.append( features[ii] )
            labels_train.append( labels[ii] )
        for jj in test_idx:
            features_test.append( features[jj] )
            labels_test.append( labels[jj] )
        selector = SelectPercentile()
        selector = selector.fit(features_train,labels_train)
        score_list.append(selector.scores_)
    
    score_average = sum(score_list)/len(score_list)
    feature_score = zip(features_list[1:],score_average)
    feature_score.sort(key = lambda tup:tup[1],reverse=True)
    
    new_feature_list = ['poi']
    
    feature_score = feature_score[:int(len(feature_score)*percent/100)]
    
    for ele,_ in feature_score:
        new_feature_list.append(ele)
    
    if print_score == True:
        print "Features Score:{}".format(feature_score)
        
    return new_feature_list

features_list = feature_selection(100, True)

Features Score:[('exercised_stock_options', 22.484239699561655), ('total_stock_value', 21.924096297344661), ('bonus', 19.158702272676425), ('salary', 16.583003746563765), ('deferred_income', 10.474111674381019), ('long_term_incentive', 9.5538080839163086), ('restricted_stock', 8.6021755972911045), ('total_payments', 8.011715381936348), ('shared_receipt_with_poi', 7.8953356715814165), ('loan_advances', 6.5661652281435723), ('expenses', 5.5798694818050976), ('poi_email_rate', 4.9252304622627996), ('from_poi_to_this_person', 4.9182163726212007), ('other', 4.0899120466075605), ('from_this_person_to_poi', 2.4186632091763709), ('director_fees', 1.8884368442227371), ('to_messages', 1.6505535296368732), ('deferral_payments', 0.2649254972389542), ('from_messages', 0.18041998942131626), ('restricted_stock_deferred', 0.15046529211472459)]


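The scores show that poi_email_rate ranks fairly low (12th of 20), so it may not make it into the final feature set. Whether it does depends on the algorithm and on the percentage of features kept, as discussed later.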
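
The effect of the percent argument can be checked without rerunning the cross-validation. The sketch below mirrors the slicing step inside feature_selection, using the averaged scores printed above rounded to two decimals; the helper name `top_percent` is hypothetical.

```python
# Averaged scores from the output above, rounded to two decimals.
scores = [('exercised_stock_options', 22.48), ('total_stock_value', 21.92),
          ('bonus', 19.16), ('salary', 16.58), ('deferred_income', 10.47),
          ('long_term_incentive', 9.55), ('restricted_stock', 8.60),
          ('total_payments', 8.01), ('shared_receipt_with_poi', 7.90),
          ('loan_advances', 6.57), ('expenses', 5.58), ('poi_email_rate', 4.93),
          ('from_poi_to_this_person', 4.92), ('other', 4.09),
          ('from_this_person_to_poi', 2.42), ('director_fees', 1.89),
          ('to_messages', 1.65), ('deferral_payments', 0.26),
          ('from_messages', 0.18), ('restricted_stock_deferred', 0.15)]


def top_percent(scored, percent):
    """Keep the best-scoring `percent` of features, truncating like int()."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = int(len(ranked) * percent / 100)
    return [name for name, _ in ranked[:keep]]


for percent in (5, 20, 25, 35, 60):
    kept = top_percent(scores, percent)
    print("{:3d}% -> {:2d} features, last kept: {}".format(percent, len(kept), kept[-1]))
```

With 20 candidate features, poi_email_rate (rank 12) is only kept once percent reaches 60.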
### Feature scaling
I use MinMaxScaler to scale the features.

In [154]:
# feature scaling function
def feature_scale(features):
    scaler = MinMaxScaler()
    features = scaler.fit_transform(features)
    return features
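
For reference, min-max scaling maps each feature column to the range [0, 1] using (x - min) / (max - min). The sketch below applies that formula by hand to one column; it is a plain-Python illustration, not scikit-learn's implementation, and returning 0.0 for a constant column is a choice made for this sketch. The sample values are the min, median, 75th percentile and max of column 1 in the describe() table above.

```python
def min_max_scale(column):
    """Scale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 in this sketch.
        return [0.0 for _ in column]
    return [(x - lo) / float(hi - lo) for x in column]


sample = [0.0, 210692.0, 270259.0, 1111258.0]
scaled = min_max_scale(sample)
print(["{:.4f}".format(value) for value in scaled])
```

After scaling, features measured in dollars and features measured in email counts share the same range, which matters for distance-based methods such as PCA and an RBF-kernel SVM.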

### Principal component analysis
I use PCA for principal component analysis.

In [156]:
# principal component analysis
def feature_PCA(features,labels,components_parameter):
    pca = PCA(n_components=components_parameter)
    features = pca.fit_transform(features,labels)
    return features

### Data transformation
Finally, I wrote a single transformation function that applies the functions above to the features:

In [219]:
# data transformation function
def features_transform(features,labels):
    features = feature_scale(features) # feature scaling
    features = feature_PCA(features,labels,components_parameter) # PCA; components_parameter is a global set before each run
    return features

These are the helpers needed for feature engineering.
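
One caveat about features_transform as written: it calls fit_transform on whatever array it receives, so if it is applied separately to a training fold and a test fold, the test fold is scaled by its own minimum and maximum and projected onto its own principal components. A common alternative is to fit the scaler and PCA on the training data only and reuse them for the test data. The sketch below shows one way to do that with scikit-learn's Pipeline. It is an illustration on synthetic data, not the code used to produce the results in this report, and it was not rerun against the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 100 training rows, 20 test rows, 5 features.
rng = np.random.RandomState(42)
X_train = rng.rand(100, 5) * 1000.0
y_train = np.array([0, 1] * 50)      # arbitrary labels, both classes present
X_test = rng.rand(20, 5) * 1000.0

pipeline = Pipeline([
    ("scale", MinMaxScaler()),       # min/max learned from the training data only
    ("pca", PCA(n_components=2)),    # components learned from the training data only
    ("clf", SVC(C=1, gamma=1)),
])

pipeline.fit(X_train, y_train)            # fits scaler, PCA and classifier in order
predictions = pipeline.predict(X_test)    # reuses the fitted scaler and PCA
print(predictions.shape)
```

A pipeline object can also be passed directly to a cross-validation loop, since it exposes the same fit and predict methods as a single classifier.
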
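## Preparing the Machine Learning Algorithms
I try four algorithms on the data: Naive Bayes, Decision Tree, SVM and Random Forest. First, I built a single factory function that returns the chosen algorithm based on its arguments.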

In [158]:
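# GridSearch_test controls whether GridSearchCV is used to search for suitable parameters. The default parameters of each algorithm in this function were found by experiment; the experiments are shown later.
# Pass one of the names "NB", "Decision Tree", "Random Forest", "SVM" to select the algorithm

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # used when GridSearch_test is True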

def classifier(algorithm = "Decision Tree", GridSearch_test = False):
    if algorithm == 'NB':
        ## GaussianNB
        clf = GaussianNB()
    elif algorithm == 'Decision Tree':
        ## Decision Tree
        clf = DecisionTreeClassifier(criterion = "entropy",max_depth = 2,min_samples_leaf = 9)  
        if GridSearch_test:
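            # note: each entry lists explicit candidate values, not a range; range(1, 11) would sweep every depth
            parameters = {'criterion':["entropy","gini"],'max_depth':[1,10],'min_samples_leaf':[1,200,10]}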
            clf = GridSearchCV(clf,parameters)
    elif algorithm == 'Random Forest':
        ## Random Forest
        clf = RandomForestClassifier(n_estimators = 3)
        if GridSearch_test:
            parameters = {'n_estimators':[1,10]}
            clf = GridSearchCV(clf,parameters)
    elif algorithm == 'SVM':
        clf = SVC(C=1,gamma = 1)
        if GridSearch_test:
            parameters = {'C':[0.001, 0.01, 0.1, 1, 10],
            "gamma":[0.001, 0.01, 0.1, 1]}
            clf = GridSearchCV(clf, parameters)
    return clf

clf = classifier()

## Preparing the Evaluation
First, I define the evaluation function.

In [159]:
# When print_score is True, print the classifier's accuracy, precision and recall
def calculate_score(clf,features,labels,print_score = False):
    ## import precision score evaluation
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    
    # predict once; sklearn metrics take (y_true, y_pred) in that order
    predictions = clf.predict(features)
    accuracy = accuracy_score(labels,predictions)
    precision = precision_score(labels,predictions)
    recall = recall_score(labels,predictions)
    
    if print_score:
        print clf
        print "accuracy score: {}".format(accuracy)
        print "precision score: {}".format(precision)
        print "recall score: {}".format(recall)
    
    scores = {"accuracy":accuracy,
           "precision":precision,
           "recall":recall
            }
    return scores

At first I used a simple train_test_split to evaluate performance.

In [160]:
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.4, random_state=42)
clf = classifier()
clf = clf.fit(features_train,labels_train)

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

calculate_score(clf,features_test,labels_test,True)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=9,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
accuracy score: 0.844827586207
precision score: 0.375
recall score: 0.428571428571


{'accuracy': 0.84482758620689657,
 'precision': 0.375,
 'recall': 0.42857142857142855}

Because the classes are strongly imbalanced, a single random split can easily leave the POIs unevenly distributed between the training and test sets, so the trained model is not reliable. The result also differs considerably from the one reported by tester. I therefore switched to KFold: split the data several times, record the score of each fold, and evaluate using the average.

In [161]:
# build the KFold test splits
from sklearn.model_selection import KFold
kf = KFold(10)
cv = kf.split(features)
accuracy_list = []
precision_list = []
recall_list = []
for train_index,test_index in cv:
    features_train = [features[ii] for ii in train_index]
    features_test = [features[ii] for ii in test_index]
    labels_train = [labels[ii] for ii in train_index]
    labels_test = [labels[ii] for ii in test_index]
    clf.fit(features_train,labels_train)
    score = calculate_score(clf,features_test,labels_test) # score this fold and append to the corresponding lists
    accuracy_list.append(score['accuracy'])
    precision_list.append(score['precision'])
    recall_list.append(score['recall'])
# average each list
print clf
print "average accuracy:{}".format(sum(accuracy_list)/len(accuracy_list))
print "average precision:{}".format(sum(precision_list)/len(precision_list))
print "average recall:{}".format(sum(recall_list)/len(recall_list))

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=9,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
average accuracy:0.874761904762
average precision:0.25
average recall:0.133333333333


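This gives a somewhat more stable estimate. I later learned that StratifiedShuffleSplit is a good choice for cross-validating small datasets, so in the end I reused the evaluation code from tester.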

In [260]:
accuracy = []
precision = []
recall = []

true_negatives = 0
false_negatives = 0
true_positives = 0
false_positives = 0


PERF_FORMAT_STRING = "\
\tAccuracy: {:>0.{display_precision}f}\tPrecision: {:>0.{display_precision}f}\t\
Recall: {:>0.{display_precision}f}\tF1: {:>0.{display_precision}f}\tF2: {:>0.{display_precision}f}"
RESULTS_FORMAT_STRING = "\tTotal predictions: {:4d}\tTrue positives: {:4d}\tFalse positives: {:4d}\
\tFalse negatives: {:4d}\tTrue negatives: {:4d}"

folds = 1000

# 
from sklearn.cross_validation import StratifiedShuffleSplit

def test_classifier(clf, dataset, feature_list,transform = True):
    score_list = []
    data = featureFormat(dataset, feature_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)
    cv = StratifiedShuffleSplit(labels, folds, random_state = 42)
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    for train_idx, test_idx in cv: 
        features_train = []
        features_test  = []
        labels_train   = []
        labels_test    = []
        for ii in train_idx:
            features_train.append( features[ii] )
            labels_train.append( labels[ii] )
        for jj in test_idx:
            features_test.append( features[jj] )
            labels_test.append( labels[jj] )
        ### feature engineer the features
        if transform:
            features_train = features_transform(features_train,labels_train)
        clf.fit(features_train, labels_train)
        ### transform test features
        if transform:
            features_test = features_transform(features_test,labels_test)
        predictions = clf.predict(features_test)
        for prediction, truth in zip(predictions, labels_test):
            if prediction == 0 and truth == 0:
                true_negatives += 1
            elif prediction == 0 and truth == 1:
                false_negatives += 1
            elif prediction == 1 and truth == 0:
                false_positives += 1
            elif prediction == 1 and truth == 1:
                true_positives += 1
            else:
                print "Warning: Found a predicted label not == 0 or 1."
                print "All predictions should take value 0 or 1."
                print "Evaluating performance for processed predictions:"
                break

    try:
        
        total_predictions = true_negatives + false_negatives + false_positives + true_positives
        accuracy = 1.0*(true_positives + true_negatives)/total_predictions
        precision = 1.0*true_positives/(true_positives+false_positives)
        recall = 1.0*true_positives/(true_positives+false_negatives)
        f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives)
        f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall)
        print clf
        print PERF_FORMAT_STRING.format(accuracy, precision, recall, f1, f2, display_precision = 5)
        print RESULTS_FORMAT_STRING.format(total_predictions, true_positives, false_positives, false_negatives, true_negatives)
        print ""
    except:
        print "Got a divide by zero when trying out:", clf
        print "Precision or recall may be undefined due to a lack of true positive predicitons."

test_classifier(clf, my_dataset, features_list,transform = False)

Got a divide by zero when trying out: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Precision or recall may be undefined due to a lack of true positive predicitons.


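StratifiedShuffleSplit shuffles the data and splits it in a stratified way, so every split keeps the class proportions of the full dataset. Repeating this over many random splits suits an imbalanced dataset. It costs more computation, but that is acceptable because the sample is small. This keeps each training set representative of the real dataset and reduces the risk that a classifier trained on an unrepresentative split fails to classify correctly.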
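
The idea behind a stratified shuffle split can be sketched in a few lines of plain Python: shuffle the indices of each class separately and take the same fraction of each class for the test set. This is a simplified stand-in for illustration only, not scikit-learn's implementation; the function name and the 10% test fraction are assumptions.

```python
import random


def stratified_shuffle_split(labels, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = max(1, int(round(test_fraction * len(indices))))
        test_idx.extend(indices[:n_test])         # same fraction from every class
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Class sizes from this dataset: 18 POIs and 125 non-POIs.
labels = [1] * 18 + [0] * 125
train_idx, test_idx = stratified_shuffle_split(labels)
poi_in_test = sum(labels[i] for i in test_idx)
print("test size: {}, POIs in test: {}".format(len(test_idx), poi_in_test))
```

An unstratified 10% split of 143 people could easily contain no POI at all, in which case recall on that split is undefined. Stratifying guarantees at least one POI per test set here.
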
## Parameter Tuning
Next, I tune the parameters of the whole model. First, I used GridSearchCV to tune the parameters of each algorithm:
### GridSearchCV

In [221]:
from sklearn.model_selection import GridSearchCV
clf = classifier("Decision Tree",GridSearch_test = True)
clf.fit(features_train,labels_train)
print clf.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')


In [222]:
clf = DecisionTreeClassifier(criterion = "entropy",max_depth = 1,min_samples_leaf = 2)  
test_classifier(clf, my_dataset, features_list,transform = False)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.85980	Precision: 0.01869	Recall: 0.00100	F1: 0.00190	F2: 0.00123
	Total predictions: 15000	True positives:    2	False positives:  105	False negatives: 1998	True negatives: 12895



In [223]:
clf = classifier("Random Forest",GridSearch_test = True)
clf.fit(features_train,labels_train)
print clf.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


In [224]:
clf = RandomForestClassifier(n_estimators = 2)  
test_classifier(clf, my_dataset, features_list,transform = False)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=2, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
	Accuracy: 0.85120	Precision: 0.31703	Recall: 0.10050	F1: 0.15262	F2: 0.11640
	Total predictions: 15000	True positives:  201	False positives:  433	False negatives: 1799	True negatives: 12567



In [225]:
clf = classifier("SVM",GridSearch_test = True)
clf.fit(features_train,labels_train)
print clf.best_estimator_

SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [226]:
clf = SVC(C=0.001,gamma=0.001)  
test_classifier(clf, my_dataset, features_list,transform = False)

Got a divide by zero when trying out: SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Precision or recall may be undefined due to a lack of true positive predicitons.


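Judging from the tests above, GridSearchCV did not give good results in any case. I therefore tuned by hand instead: the parameters of each algorithm, the percentage of features kept (the percent argument of feature_selection), and the number of components produced by PCA (components_parameter).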
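
One plausible contributor to the poor grid-search results (an assumption, not something verified in this notebook) is that GridSearchCV was run with its defaults: for a classifier it scores candidates by accuracy, which on imbalanced data rewards models that rarely predict a POI. The sketch below shows how the search could instead optimize F1 over stratified shuffle splits. It uses the `sklearn.model_selection` API on synthetic data with the same class sizes, and the parameter grid is illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (assumption): 125 negatives, 18 positives, 4 features.
rng = np.random.RandomState(42)
X = rng.rand(143, 4)
y = np.array([0] * 125 + [1] * 18)
X[y == 1, 0] += 0.5   # shift one feature for positives so there is some signal

# Stratified splits keep a few positives in every test set.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

param_grid = {"max_depth": [1, 2, 4, 8], "min_samples_leaf": [1, 3, 9]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",     # balance precision and recall instead of accuracy
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Other scoring strings such as "precision" or "recall" can be used the same way when one of the two metrics matters more.
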
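### Manual tuning
The results of manually tuning the algorithm and feature parameters are as follows:
#### Decision Tree
Final parameters: components_parameter: 2, percent: 25, criterion = "entropy", max_depth = 20, min_samples_leaf = 3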

In [303]:
components_parameter = 2
percent = 25
features_list = feature_selection(percent,print_score = True)
clf = DecisionTreeClassifier(criterion = "entropy",max_depth = 20,min_samples_leaf = 3)  
test_classifier(clf, my_dataset, features_list,transform = True)

[('exercised_stock_options', 22.484239699561655), ('total_stock_value', 21.924096297344661), ('bonus', 19.158702272676425), ('salary', 16.583003746563765), ('deferred_income', 10.474111674381019)]
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=20,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.79621	Precision: 0.32281	Recall: 0.38850	F1: 0.35262	F2: 0.37331
	Total predictions: 14000	True positives:  777	False positives: 1630	False negatives: 1223	True negatives: 10370



#### Naive Bayes
Final parameters: components_parameter: 1, percent: 35

In [282]:
components_parameter = 1
percent = 35
clf = GaussianNB()
features_list = feature_selection(percent)
test_classifier(clf, my_dataset, features_list,transform = True)

GaussianNB(priors=None)
	Accuracy: 0.78329	Precision: 0.31885	Recall: 0.45500	F1: 0.37495	F2: 0.41920
	Total predictions: 14000	True positives:  910	False positives: 1944	False negatives: 1090	True negatives: 10056



#### Random Forest
Final parameters: components_parameter: 1, percent: 20, n_estimators = 3

In [281]:
components_parameter = 1
percent = 20
clf = RandomForestClassifier(n_estimators = 3)
features_list = feature_selection(percent,print_score = True)
test_classifier(clf, my_dataset, features_list,transform = True)

[('exercised_stock_options', 22.484239699561655), ('total_stock_value', 21.924096297344661), ('bonus', 19.158702272676425), ('salary', 16.583003746563765)]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=3, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
	Accuracy: 0.79415	Precision: 0.32192	Recall: 0.30550	F1: 0.31349	F2: 0.30865
	Total predictions: 13000	True positives:  611	False positives: 1287	False negatives: 1389	True negatives: 9713



#### SVM

In [271]:
components_parameter = 1
percent = 5
clf = SVC(kernel = 'rbf',C=1,gamma = 1)
features_list = feature_selection(percent,print_score = True)
test_classifier(clf, my_dataset, features_list,transform = True)

[('exercised_stock_options', 22.484239699561655)]
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.88473	Precision: 0.37796	Recall: 0.41500	F1: 0.39561	F2: 0.40702
	Total predictions: 11000	True positives:  415	False positives:  683	False negatives:  585	True negatives: 9317



After tuning everything together, SVM turned out to be the best of the four algorithms.
Its parameters are:  
- components_parameter = 1  
- percent = 5  
- C=1  
- gamma = 1  

It uses the single feature 'exercised_stock_options', whose score is 22.484239699561655. The new feature poi_email_rate was not among the selected features.

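## Summary of Questions

- Did the data contain any outliers when you obtained it, and how did you handle them? [Relevant rubric items: "data exploration", "outlier investigation"]

Yes, there were some outliers. One kind is an artifact of how the data was compiled, such as the 'TOTAL' row, which I deleted.  
There were also entries that are not people's names, such as 'THE TRAVEL AGENCY IN THE PARK', which I deleted as well.  
Finally, I deleted records whose values are all NaN, such as 'LOCKHART EUGENE E'.  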

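- What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why? As part of the assignment, you should attempt to engineer your own feature rather than use one that comes ready-made in the dataset; explain the feature you tried to create and the rationale behind it. (You do not have to use it in the final analysis, only design and test it.) In your feature selection step, if you used an algorithm such as a decision tree, please also give the feature importances of the features you used; if you used an automated feature selection function such as SelectKBest, please report the feature scores and the reasons for your choice of parameter values. [Relevant rubric items: "create new features", "properly scale features", "intelligently select features"]

I first included as many features as possible so that the algorithms could draw on as much information as possible during training. I then used SelectPercentile to choose among them, which makes it easy to adjust which features are used and so to increase or reduce the amount of information used for training.  
From the total number of emails (sent plus received) and the number of POI-related emails (received from POIs plus sent to POIs), I created a POI email rate feature. My reasoning is that a person with a high POI email rate is more likely to be a POI.  
I scaled the features with MinMaxScaler and then applied PCA; according to the tuning results, this improves performance. 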

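- What algorithm did you end up using? What other algorithms did you try? How did model performance differ between algorithms? [Relevant rubric item: "pick an algorithm"]  

I ended up using SVM; I also tried Naive Bayes, Random Forest and Decision Tree. Naive Bayes had the highest recall (0.455), but its precision (0.319) was hard to improve. Random Forest reached precision 0.322 and recall 0.306 after tuning, but it was comparatively slow. Decision Tree with the entropy criterion reached precision 0.323 and recall 0.389 without much feature adjustment. Overall, SVM had the highest accuracy (0.885), precision (0.378) and F1 (0.396), with recall (0.415) second only to Naive Bayes, and its prediction time was acceptable.  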

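- What does it mean to tune the parameters of an algorithm, and what can happen if you don't do this? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that need tuning. If that is the case for the one you picked, say so and briefly explain what you would do for a model you did not choose in the end, or for a different model that does need parameter tuning, such as a decision tree classifier.) [Relevant rubric item: "tune the algorithm"] 
  
I used GridSearchCV for tuning, but the parameters it selected performed poorly under the tester's cross-validation, which I took to be overfitting. A possible cause is that the sample size was set too large. Another likely factor is that GridSearchCV scores candidates by accuracy by default, which on this imbalanced data favors models that rarely predict a POI. In the end I tuned by hand and recorded each result, which let me adjust parameters with precision and recall in mind and was more efficient.  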

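- What is validation, and what is a classic mistake you can make if you do it wrong? How did you validate your analysis? [Relevant rubric item: "validation strategy"]  

The classic mistake is overfitting: the model performs very well on the data it was trained on but poorly on test data. When this happens, I reduce the depth of the model so that it fits the training data less tightly. I validated the analysis with StratifiedShuffleSplit over 1000 splits, reusing the code from tester.  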

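- Give at least 2 evaluation metrics and your average performance for each of them. Explain, in plain language, what these metrics say about the algorithm's performance. [Relevant rubric item: "usage of evaluation metrics"]

Precision: of the people predicted to be POIs, the fraction who really are POIs; in other words, how often a flagged person is a true POI. The final SVM reaches 0.378.  
Recall: of all real POIs, the fraction that the model identifies; in other words, how complete the detection is. The final SVM reaches 0.415.  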
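
As a worked check, the metrics can be recomputed from the confusion counts that tester printed for the final SVM (415 true positives, 683 false positives, 585 false negatives, 9317 true negatives). The formulas follow the ones used in test_classifier above; the helper name `scores_from_counts` is hypothetical.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion counts."""
    total = tp + fp + fn + tn
    precision = float(tp) / (tp + fp)   # flagged people who are real POIs
    recall = float(tp) / (tp + fn)      # real POIs that were flagged
    return {
        "accuracy": float(tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2.0 * tp / (2 * tp + fp + fn),
        "f2": 5.0 * precision * recall / (4 * precision + recall),
    }


# Counts from the final SVM run reported above.
svm = scores_from_counts(tp=415, fp=683, fn=585, tn=9317)
for name in ("accuracy", "precision", "recall", "f1", "f2"):
    print("{}: {:.5f}".format(name, svm[name]))
```

In plain terms: out of every 100 people the SVM flags, about 38 are real POIs, and out of every 100 real POIs, it finds about 42.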

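That concludes the analysis of the dataset. Several questions remain open:  

- How to tune all parameters automatically, including those of the feature engineering steps  
- Whether other algorithms would be more effective than the ones chosen here  
- Whether time should also be taken into account once enough data points are available  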

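## About the Code
This report includes all of the code relevant to the results. The code used for tuning is attached, with its parameters already set to the best algorithm and settings. To show the effect of feature engineering, I added the feature engineering steps to the tester file for testing.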