# Evaluation Metrics

Go back to your code from the last lesson, where you built a simple first iteration of a POI identifier using a decision tree and one feature. Copy the POI identifier that you built into the skeleton code in evaluation/evaluate_poi_identifier.py. Recall that at the end of that project, your identifier had an accuracy (on the test set) of 0.724. Not too bad, right? Let’s dig into your predictions a little more carefully.

In [1]:
"""
    Starter code for the evaluation mini-project.
    Start by copying your trained/tested POI identifier from
    that which you built in the validation mini-project.
    This is the second step toward building your POI identifier!
    Start by loading/formatting the data...
"""

import pickle
from feature_format import featureFormat, targetFeatureSplit

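# Open the dataset in a context manager so the file handle is closed afterwards
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)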

### add more features to features_list!
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list, sort_keys = 'python2_lesson14_keys.pkl')
labels, features = targetFeatureSplit(data)

### your code goes here

from sklearn import tree
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)

# Create classifier
clf = tree.DecisionTreeClassifier()

# Fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# Make prediction - Store predictions in a list named pred
pred = clf.predict(features_test)

# Calculate the accuracy on the test data
print(clf.score(features_test, labels_test))

0.724137931034


### Number of POIs in Test Set
How many POIs are predicted for the test set for your POI identifier?

(Note that we said test set! We are not looking for the number of POIs in the whole dataset.)

The line:

data = featureFormat(data_dict, features_list)

converts True to 1.0 and False to 0.0 for the 'poi' variable, so you can find the number of POIs in 'labels_test' (which now contains zeros and ones) by counting or summing its entries.

In [2]:
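count = 0
# Count the entries of 'labels_test' (the true test labels) that equal 1.
# To count the POIs your identifier *predicted*, loop over 'pred' instead.
for poi in labels_test:
    if poi == 1:
        count += 1
print(count)

# Alternative solution: the labels are 0/1, so their sum is the POI count
print(sum(labels_test))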

4
4.0


### Number of People in Test Set
How many people total are in your test set?

In [3]:
print(len(labels_test))

29


### Accuracy of a Biased Identifier
If your identifier predicted 0 (not POI) for everyone in the test set, what would its accuracy be?
* The right answer is (29 - 4)/29, which is approximately 0.8621.

In [4]:
import sklearn.metrics

pred = [0.] * len(labels_test)

print(labels_test)
print(pred)
print(sklearn.metrics.accuracy_score(labels_test, pred))

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
0.862068965517


* The point of this question is that when the classes are skewed, an identifier can reach high accuracy without learning anything useful. In more complex examples, you might mistake that high accuracy for a good outcome.
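To make the arithmetic concrete, here is a minimal sketch in plain Python (no scikit-learn) that scores an always-predict-0 baseline. The label list reproduces the 29 test labels printed above; the helper name `accuracy` and the variable `baseline_pred` are illustrative assumptions, not part of the starter code.

```python
# Test labels as printed above: 29 people, 4 of them POIs (label 1.0).
labels_test = [0.0] * 22 + [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# A "classifier" that ignores its input and always predicts non-POI.
baseline_pred = [0.0] * len(labels_test)

n_poi = int(sum(labels_test))
print(n_poi, len(labels_test))                         # 4 29
print(round(accuracy(labels_test, baseline_pred), 4))  # 0.8621
```

The baseline never flags a single POI, yet it beats the decision tree's 0.724 accuracy, which is why accuracy alone is a poor yardstick here.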

### Number of True Positives
Look at the predictions of your model and compare them to the true test labels. Do you get any true positives? (In this case, we define a true positive as a case where both the actual label and the predicted label are 1.)
* Nope

In [5]:
import numpy as np

# A 'positive prediction' is not the same as a 'correct positive prediction',
# i.e. a 'true positive'. To count all positive predictions, just sum pred.
# Note: pred still holds the all-zero predictions from the previous cell.

# To count the 'true positives', zip the test labels with pred and keep the
# pairs where both the true label and the prediction are 1:

cpp = [1 for j in zip(labels_test, pred) if j[0] == j[1] and j[1] == 1]

# then sum them to get the number of correct positive predictions:

no_cpp = np.sum(cpp)

print("Number of Correct Positive Predictions")
print(no_cpp)
print()
print(sklearn.metrics.confusion_matrix(labels_test, pred))

Number of Correct Positive Predictions
0.0

[[25  0]
 [ 4  0]]


### Unpacking Into Precision and Recall
As you may now see, having imbalanced classes like we have in the Enron dataset (many more non-POIs than POIs) introduces some special challenges. In particular, you can guess the more common class label for every point, which is not a very insightful strategy, and still get pretty good accuracy!

Precision and recall can help illuminate your performance better. Use the precision_score and recall_score available in sklearn.metrics to compute those quantities.
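For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN count true positives, false positives and false negatives. The sketch below computes both by hand on a small made-up example. The function name `precision_recall` and the toy lists are illustrative assumptions, not part of the starter code, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted positive
    # (precision) or when there are no actual positives (recall).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data (hypothetical): 3 actual positives, 3 predicted positives.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]

# 2 of the 3 flagged points are real positives, and 2 of the 3 real
# positives were flagged, so both values are 2/3 here.
print(precision_recall(toy_true, toy_pred))
```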

What’s the precision?

In [6]:
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = 0.3, random_state=42)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

print("Accuracy:", clf.score(x_test, y_test))

print("Precision:", sklearn.metrics.precision_score(y_test, y_pred))

Accuracy: 0.724137931034
Precision: 0.0


### Recall of Your POI Identifier
What’s the recall? 

(Note: you may see a message like UserWarning: The precision and recall are equal to zero for some labels. Just like the message says, there can be problems in computing other metrics (like the F1 score) when precision and/or recall are zero, and it wants to warn you when that happens.) 

Obviously this isn’t a very optimized machine learning strategy (we haven’t tried any algorithms besides the decision tree, or tuned any parameters, or done any feature selection), and now seeing the precision and recall should make that much more apparent than the accuracy did.

In [7]:
print("Recall:", sklearn.metrics.recall_score(y_test, y_pred))

Recall: 0.0
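
The warning mentioned above arises because the F1 score is the harmonic mean of precision and recall, 2PR / (P + R), which becomes 0/0 when both are zero. A minimal sketch of that edge case follows. The helper `f1_from` is hypothetical, and mapping the undefined case to 0.0 is an assumption made for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, with the 0/0 case mapped to 0.0."""
    if precision + recall == 0:
        # 2 * 0 * 0 / (0 + 0) is undefined; report 0.0 rather than raising.
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_from(0.0, 0.0))  # the situation in this section: 0.0
print(f1_from(0.5, 0.5))  # equal precision and recall: 0.5
```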


### How Many True Positives?
In the final project you’ll work on optimizing your POI identifier, using many of the tools learned in this course. Hopefully one result will be that your precision and/or recall will go up, but then you’ll have to be able to interpret them. 

Here are some made-up predictions and true labels for a hypothetical test set; fill in the following boxes to practice identifying true positives, false positives, true negatives, and false negatives. Let’s use the convention that “1” signifies a positive result, and “0” a negative. 

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 

true labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

How many true positives are there?

* 6

In [8]:
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

cpp = [1 for j in zip(true_labels, predictions) if j[0] == j[1] and j[1] == 1]
no_cpp = np.sum(cpp)

print("Number of True Positives")
print(no_cpp)

# Alternative solution
print(sklearn.metrics.confusion_matrix(true_labels, predictions))

Number of True Positives
6
[[9 3]
 [2 6]]


### How Many True Negatives?
How many true negatives are there in this example?

In [9]:
# Keep the pairs where both the true label and the prediction are 0
tn = [1 for j in zip(true_labels, predictions) if j[0] == j[1] and j[1] == 0]
no_tn = np.sum(tn)

print("Number of True Negatives")
print(no_tn)

# Alternative solution
print(sklearn.metrics.confusion_matrix(true_labels, predictions))

Number of True Negatives
9
[[9 3]
 [2 6]]


### False Positives?
How many false positives are there?
* 3

### False Negatives?
How many false negatives are there?
* 2
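
As a cross-check on all four answers, the following sketch tallies each cell of the confusion matrix in a single pass over the made-up lists. It uses plain Python only, and the `counts` dictionary is an illustrative choice rather than anything from the starter code.

```python
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
for actual, predicted in zip(true_labels, predictions):
    if predicted == 1:
        # Flagged: correct if the person really is a positive, else a false alarm.
        counts["TP" if actual == 1 else "FP"] += 1
    else:
        # Not flagged: a miss if the person is a positive, else a correct rejection.
        counts["FN" if actual == 1 else "TN"] += 1

print(counts)  # {'TP': 6, 'TN': 9, 'FP': 3, 'FN': 2}
```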

### Precision
What's the precision of this classifier?

In [10]:
print("Precision:", sklearn.metrics.precision_score(true_labels, predictions))

Precision: 0.666666666667


### Recall
What's the recall of this classifier?

In [11]:
print("Recall:", sklearn.metrics.recall_score(true_labels, predictions))

Recall: 0.75
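
Both values can also be checked by hand from the counts in the confusion matrix above (6 true positives, 3 false positives, 2 false negatives). The short sketch below does that arithmetic with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

tp, fp, fn = 6, 3, 2  # counts read off the confusion matrix above

precision = Fraction(tp, tp + fp)  # 6/9, which reduces to 2/3
recall = Fraction(tp, tp + fn)     # 6/8, which reduces to 3/4

print(precision, float(precision))
print(recall, float(recall))
```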


### Making Sense of Metrics
* “My true positive rate is high, which means that when a POI is present in the test data, I am good at flagging him or her.”
* “My identifier doesn’t have great PRECISION, but it does have good RECALL. That means that, nearly every time a POI shows up in my test set, I am able to identify him or her. The cost of this is that I sometimes get some false positives, where non-POIs get flagged.”
* “My identifier doesn’t have great RECALL, but it does have good PRECISION. That means that whenever a POI gets flagged in my test set, I know with a lot of confidence that it’s very likely to be a real POI and not a false alarm. On the other hand, the price I pay for this is that I sometimes miss real POIs, since I’m effectively reluctant to pull the trigger on edge cases.”
* “My identifier has a really great F1 SCORE. This is the best of both worlds. Both my false positive and false negative rates are LOW, which means that I can identify POIs reliably and accurately. If my identifier finds a POI then the person is almost certainly a POI, and if the identifier does not flag someone, then they are almost certainly not a POI.”
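
To see why a great F1 score requires both metrics to be high: F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two. A brief sketch follows, using made-up precision/recall pairs (hypothetical values chosen only for illustration) plus the precision and recall from the 20-point example earlier. The helper name `f1` is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (assumes they are not both zero)."""
    return 2 * precision * recall / (precision + recall)

# Same arithmetic mean (0.55), very different F1.
balanced = f1(0.55, 0.55)
lopsided = f1(1.0, 0.10)
print(round(balanced, 3), round(lopsided, 3))  # 0.55 0.182

# Precision 2/3 and recall 3/4 from the made-up example above.
print(round(f1(2 / 3, 3 / 4), 3))  # 0.706
```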

#### There’s usually a tradeoff between precision and recall. Which one do you think is more important in your POI identifier? There’s no right or wrong answer, and there are good arguments either way, but you should be able to interpret both metrics and articulate which one you find most important and why.
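
One common way to see the tradeoff is to vary the score threshold above which someone gets flagged. The sketch below uses invented scores and labels (they are not from the Enron data, and `precision_recall_at` is a hypothetical helper). It only illustrates how raising the threshold tends to trade recall for precision.

```python
# Hypothetical classifier scores (higher = more POI-like) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
truth = [1, 1, 0, 1, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Flag everyone whose score is >= threshold, then score the flags."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, t in zip(flagged, truth) if f and t == 1)
    fp = sum(1 for f, t in zip(flagged, truth) if f and t == 0)
    fn = sum(1 for f, t in zip(flagged, truth) if not f and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.65, 0.85):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With these toy numbers, the lowest threshold catches every positive at the cost of several false alarms, while the highest threshold flags only sure cases and misses half of the positives.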