In [1]:
"""
    Starter code for the evaluation mini-project.
    Start by copying your trained/tested POI identifier from
    that which you built in the validation mini-project.

    This is the second step toward building your POI identifier!

    Start by loading/formatting the data...
"""

import pickle
import sys
from sklearn import cross_validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
#import numpy as np
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )

### add more features to features_list!
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
    features, labels, test_size=0.3, random_state=42)




In [2]:
clf = DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
prediction = clf.predict(features_test)

In [3]:
print "Accuracy : ", accuracy_score(labels_test, prediction)  # sklearn metrics take (y_true, y_pred)

Accuracy :  0.724137931034


## Number of POIs in Test Set
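How many people in the test set are actually POIs? Note that `sum(labels_test)` counts the true POI labels; the number of POIs the identifier predicts would be `sum(prediction)`.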

In [4]:
sum(labels_test)

4.0

## Number of People in Test Set

In [5]:
len(labels_test)

29

## Accuracy of a Biased Identifier
If your identifier predicted 0. (not POI) for everyone in the test set, what would its accuracy be?

In [6]:
acc = (29.0 - 4.0) / 29.0  # 25 non-POIs out of 29 people in the test set
print "Accuracy of biased identifier : ",acc

Accuracy of biased identifier :  0.862068965517
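
The hard-coded arithmetic above can also be derived from the labels themselves. What follows is a minimal sketch, not part of the starter code: `majority_class_accuracy` is a hypothetical helper name, and `example_labels` is a stand-in list built only to match the counts reported above (29 people, 4 POIs). It is written to run under both Python 2 and Python 3.

```python
def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    n_total = len(labels)
    n_positive = sum(1 for label in labels if label == 1.0)
    n_majority = max(n_positive, n_total - n_positive)
    return float(n_majority) / n_total

# Stand-in labels matching the counts above: 4 POIs and 25 non-POIs.
example_labels = [1.0] * 4 + [0.0] * 25
baseline = majority_class_accuracy(example_labels)
print("Baseline accuracy: {:.4f}".format(baseline))  # Baseline accuracy: 0.8621
```

In the notebook itself, passing `labels_test` instead of the stand-in list should give the same figure. Note that this baseline is higher than the 0.724 accuracy of the decision tree.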


## Number of True Positives
Look at the predictions of your model and compare them to the true test labels. Do you get any true positives? (In this case, we define a true positive as a case where both the actual label and the predicted label are 1.)

In [7]:
TP_count = 0
for i in range(len(labels_test)):
    if labels_test[i] == 1.0 and prediction[i] == 1.0:
        TP_count += 1

print TP_count

0


No, there are no true positives.

## Unpacking Into Precision and Recall

As you may now see, imbalanced classes like those in the Enron dataset (many more non-POIs than POIs) introduce a special challenge: you can guess the more common class label for every point, which is not a very insightful strategy, and still get pretty good accuracy. Here that trivial guess scores about 0.86, higher than the 0.72 of the decision tree.

Precision and recall can help illuminate your performance better. Use the `precision_score` and `recall_score` functions available in `sklearn.metrics` to compute those quantities.

In [8]:
print "Precision : ",precision_score(labels_test, prediction)  # true labels first, then predictions

Precision :  0.0


In [9]:
print "recall : ",recall_score(labels_test, prediction)  # true labels first, then predictions

recall :  0.0


(Note: you may see a message like `UserWarning: The precision and recall are equal to zero for some labels.` As the message says, other metrics such as the F1 score can be problematic to compute when precision and/or recall are zero, and sklearn warns you when that happens.)

Obviously this isn’t a very optimized machine learning strategy (we haven’t tried any algorithms besides the decision tree, or tuned any parameters, or done any feature selection), and now seeing the precision and recall should make that much more apparent than the accuracy did.
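
To make the two zeros above concrete, here is a minimal sketch of how precision and recall follow from the true labels and the predictions. It is an illustration under stated assumptions, not the sklearn implementation: `precision_recall` is a hypothetical helper, returning 0.0 for an empty denominator is an assumed convention, and the stand-in lists are arranged only to match the counts reported above (29 people, 4 POIs, 0 true positives). The 4 false positives are inferred from the reported accuracy of 21/29, not read from the actual predictions.

```python
def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary labels, treating 1 as positive."""
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == 1 and truth == 1:
            tp += 1  # flagged, and really a POI
        elif guess == 1 and truth == 0:
            fp += 1  # flagged, but not a POI
        elif guess == 0 and truth == 1:
            fn += 1  # a real POI that was missed
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Stand-in data: 4 real POIs (all missed) and 4 non-POIs flagged by mistake.
standin_truth = [1] * 4 + [0] * 25
standin_guess = [0] * 4 + [1] * 4 + [0] * 21
precision, recall = precision_recall(standin_truth, standin_guess)
print("Precision: {}, Recall: {}".format(precision, recall))  # Precision: 0.0, Recall: 0.0
```

Note the argument order: true labels come first and predictions second, as in the `sklearn.metrics` functions. Swapping the two exchanges precision and recall.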

## How Many True Positives, True Negatives, False Positives and False Negatives?
In the final project you’ll work on optimizing your POI identifier, using many of the tools learned in this course. Hopefully one result will be that your precision and/or recall will go up, but then you’ll have to be able to interpret them. 

Here are some made-up predictions and true labels for a hypothetical test set; fill in the following boxes to practice identifying true positives, false positives, true negatives, and false negatives. Let’s use the convention that “1” signifies a positive result, and “0” a negative. 

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

__Ans :__ True positives : 6  (1 in true labels has corresponding 1 in predictions)

__Ans :__ True negatives : 9  (0 in true labels has corresponding 0 in predictions)

__Ans :__ False Positives : 3  (0 in true labels has corresponding 1 in predictions)

__Ans :__ False Negatives : 2  (1 in true labels has corresponding 0 in predictions)

In [10]:
predictions_2 = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels_2 = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

In [11]:
TP = 6.0   # floats avoid integer division in Python 2
TN = 9.0
FP = 3.0
FN = 2.0
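
Counting by hand is error-prone, so the four counts can be cross-checked with a short tally over the two made-up lists. This is a minimal sketch; `confusion_counts` is a hypothetical helper name, and the lists are copied from the exercise above. It runs under both Python 2 and Python 3.

```python
def confusion_counts(true_labels, predicted_labels):
    """Tally TP, TN, FP and FN for binary labels, treating 1 as positive."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == 1 and guess == 1:
            counts["TP"] += 1
        elif truth == 0 and guess == 0:
            counts["TN"] += 1
        elif truth == 0 and guess == 1:
            counts["FP"] += 1
        else:  # truth == 1 and guess == 0
            counts["FN"] += 1
    return counts

example_predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
example_true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

counts = confusion_counts(example_true_labels, example_predictions)
print("TP={TP} TN={TN} FP={FP} FN={FN}".format(**counts))  # TP=6 TN=9 FP=3 FN=2
```

The four counts sum to 20, the length of the lists, which is a quick sanity check on any hand tally.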

In [12]:
print "Precision in this case : ",(TP/(TP + FP))

Precision in this case :  0.666666666667


In [13]:
print "Recall in this case : ",(TP/(TP + FN))

Recall in this case :  0.75


## Quiz
1. My true positive rate is high, which means that when a __POI__ is present in the test data, I am good at flagging him or her.  
2. My identifier doesn’t have great __precision__, but it does have good __recall__. That means that, nearly every time a POI shows up in my test set, I am able to identify him or her. The cost of this is that I sometimes get some false positives (false alarms), where non-POIs get flagged.

3. My identifier doesn’t have great __recall__, but it does have good __precision__. That means that whenever a POI gets flagged in my test set, I know with a lot of confidence that it’s very likely to be a real POI and not a false alarm. On the other hand, the price I pay for this is that I sometimes miss real POIs, since I’m effectively reluctant to pull the trigger on edge cases.

4. My identifier has a really great __F1 score__. This is the best of both worlds. Both my false positive and false negative rates are __low__, which means that I can identify POIs reliably and accurately. If my identifier finds a POI then the person is almost certainly a POI, and if the identifier does not flag someone, then they are almost certainly not a POI.

In statistical analysis of binary classification, the __F1 score (also F-score or F-measure)__ is a measure of a test's accuracy. It considers both the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of samples that are actually positive. The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r); it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
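
As a worked example of the definition above, the sketch below computes the F1 score as the harmonic mean of precision and recall, using the counts from the made-up exercise (6 true positives, 3 false positives, 2 false negatives). `f1_from_counts` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention rather than something the source specifies.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score (harmonic mean of precision and recall) from raw counts."""
    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_from_counts(6, 3, 2)  # precision = 6/9, recall = 6/8
print("F1: {:.4f}".format(f1))  # F1: 0.7059
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F1 score requires both precision and recall to be reasonably high.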


There’s usually a tradeoff between precision and recall. Which one do you think is more important in your POI identifier? There’s no right or wrong answer, and there are good arguments either way, but you should be able to interpret both metrics and articulate which one you find most important and why.
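
One way to see the tradeoff is to imagine an identifier that gives each person a score and flags everyone at or above a threshold. The sketch below is purely illustrative: the labels and scores are invented for this example and do not come from the Enron data, and `precision_recall_at` is a hypothetical helper. The decision tree above outputs hard labels, so scores like these are an assumption.

```python
def precision_recall_at(threshold, true_labels, scores):
    """Precision and recall when every score >= threshold is flagged as positive."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for truth, guess in pairs if truth == 1 and guess == 1)
    fp = sum(1 for truth, guess in pairs if truth == 0 and guess == 1)
    fn = sum(1 for truth, guess in pairs if truth == 1 and guess == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical data: 4 positives among 10 people, sorted by descending score.
hypothetical_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
hypothetical_scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

sweep = []
for threshold in (0.9, 0.6, 0.35):
    p, r = precision_recall_at(threshold, hypothetical_labels, hypothetical_scores)
    sweep.append((p, r))
    print("threshold={:.2f} precision={:.2f} recall={:.2f}".format(threshold, p, r))
# threshold=0.90 precision=1.00 recall=0.25
# threshold=0.60 precision=0.75 recall=0.75
# threshold=0.35 precision=0.57 recall=1.00
```

In this toy data, lowering the threshold catches more of the real positives (recall rises) at the price of more false alarms (precision falls), which is the tradeoff described above.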