##Evaluation Metrics

###Workflow

- Evaluate how "well" the model predicts the target
- Usually the model is trained on a **train set**
- Hyperparameters are tuned on a **dev set**
- Model is evaluated on a **test set**
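- A typical train:dev:test split ratio is around 60:20:20
- Evaluation on the **test set** should be cross-validated, i.e. repeated over several different splits, so that the score does not depend on one particular split
- After evaluation, the final model is retrained on the whole data set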
- When data is scarce, one can dispense with the **dev set** and evaluate directly on the **test set**
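
The 60:20:20 split in the workflow can be sketched as below. This is a minimal illustration rather than a prescribed procedure: the helper name `train_dev_test_split`, the fixed seed and the sample count are assumptions made for the example, and it targets Python 3 with NumPy.

```python
import numpy as np

def train_dev_test_split(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return shuffled index arrays for the train, dev and test sets.

    Hypothetical helper for illustration; ratios are train:dev:test.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle before splitting
    n_train = int(round(ratios[0] * n_samples))
    n_dev = int(round(ratios[1] * n_samples))
    train_idx = indices[:n_train]
    dev_idx = indices[n_train:n_train + n_dev]
    test_idx = indices[n_train + n_dev:]  # whatever remains goes to test
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = train_dev_test_split(100)
print(len(train_idx), len(dev_idx), len(test_idx))
```

The index arrays can then be used to slice both the features and the targets, which keeps the three sets disjoint.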

<br>

###Common Metrics
- Accuracy
- Recall
- Precision
- Receiver Operating Characteristic (ROC)
- F1 Score
- Entropy-based

- Multinomial data and predictions

In [1]:
# __future__ imports must come first; this makes `/` true division on Python 2
from __future__ import division

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, f1_score, recall_score, precision_score

# Multinomial
actual    = np.array([0, 1, 2, 3, 0, 0, 2])
predicted = np.array([0, 1, 1, 3, 2, 1, 2])

###Accuracy

$$Accuracy = \frac{\text{number of correctly predicted samples}}{\text{number of samples}}$$


In [2]:
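def accuracy(actual, predicted):
    # Fraction of samples whose predicted label equals the actual label
    return np.mean(actual == predicted)

eval_acc = accuracy(actual, predicted)
eval_acc_sk = accuracy_score(actual, predicted)
# Compare floats with a tolerance rather than exact equality
assert np.isclose(eval_acc, eval_acc_sk)

print('Accuracy: {}'.format(eval_acc))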

Accuracy: 0.571428571429


**Disadvantages**
- **Not informative**
  - One number that does not indicate which labels have been misclassified
- **Does not take into account unequal misclassification cost for each label**
  - Misclassifying some labels is more consequential than misclassifying others
- **Misleading in the case of class imbalance**
  - Predicting every transaction as non-fraud in fraud detection achieves high accuracy while detecting no fraud at all
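
The class-imbalance problem can be made concrete with a small sketch. The data is hypothetical (1000 transactions with a 1% fraud rate, chosen only for illustration), and the "model" is a constant predictor rather than anything trained.

```python
import numpy as np

# Hypothetical data: 1000 transactions, of which 10 are fraud (label 1)
actual_fraud = np.zeros(1000, dtype=int)
actual_fraud[:10] = 1

# A "model" that labels every transaction as non-fraud (label 0)
predicted_fraud = np.zeros(1000, dtype=int)

# Accuracy is high because the majority class dominates the count
accuracy_all_negative = np.mean(actual_fraud == predicted_fraud)
# Number of fraud cases the model actually flags
fraud_caught = int(((actual_fraud == 1) & (predicted_fraud == 1)).sum())

print('Accuracy: {}'.format(accuracy_all_negative))
print('Fraud cases caught: {}'.format(fraud_caught))
```

The accuracy here is 990 / 1000 even though not a single fraud case is detected.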

###Recall

- aka **sensitivity**
- Fraction of the actual positives that the model recalls
- True positives (correctly predicted positives) divided by all the actual positives

$$Recall = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$

In [3]:
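def recall_per_label(actual, predicted, n=1):  # n is the label treated as positive
    # True positives: predicted as n and the prediction is correct
    tp = ((predicted == n) & (actual == predicted)).sum()
    # False negatives: actually n but predicted as some other label
    fn = ((actual == n) & (actual != predicted)).sum()
    return tp / (tp + fn)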

def recall(actual, predicted, method='weighted'):
    if method == 'micro':
        tp = (actual == predicted).sum()
        fn = (actual != predicted).sum()
        return tp / (tp + fn)
    
    if method in ['macro', 'weighted']:
        uniq_nums = np.unique(actual)
        recall_arr = np.array([recall_per_label(actual, predicted, n=n) for n in uniq_nums])
        if method == 'macro':
            return np.mean(recall_arr)
        if method == 'weighted':
            weights = np.bincount(actual)[uniq_nums] / len(actual)
            return np.sum(recall_arr * weights)
    
    raise NotImplementedError()

eval_recall_overall = recall(actual, predicted, method='micro')
eval_recall_avg = recall(actual, predicted, method='macro')
eval_recall_weighted = recall(actual, predicted, method='weighted')

eval_recall_overall_sk = recall_score(actual, predicted, average='micro')
eval_recall_avg_sk = recall_score(actual, predicted, average='macro')
eval_recall_weighted_sk = recall_score(actual, predicted, average='weighted')

# Compare floats with a tolerance rather than exact equality
assert np.isclose(eval_recall_overall, eval_recall_overall_sk)
assert np.isclose(eval_recall_avg, eval_recall_avg_sk)
assert np.isclose(eval_recall_weighted, eval_recall_weighted_sk)

print('Recall (overall): {}'.format(eval_recall_overall))
print('Recall (average): {}'.format(eval_recall_avg))
print('Recall (weighted): {}'.format(eval_recall_weighted))

Recall (overall): 0.571428571429
Recall (average): 0.708333333333
Recall (weighted): 0.571428571429


**Disadvantages**
- Only accounts for the actual positives
  - Does not penalize false positives
- Recall alone is a poor metric: predicting every sample as positive gives a recall of 1
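
A short sketch of why recall should not be read in isolation. The labels are hypothetical and `binary_recall` is a helper defined here for the example (binary case only), not part of the notebook's earlier cells.

```python
import numpy as np

def binary_recall(actual, predicted, positive=1):
    # TP / (TP + FN), with `positive` as the positive label
    tp = ((actual == positive) & (predicted == positive)).sum()
    fn = ((actual == positive) & (predicted != positive)).sum()
    return tp / (tp + fn)

# Hypothetical binary labels: 2 positives among 8 samples
actual_bin = np.array([1, 0, 0, 1, 0, 0, 0, 0])
# Degenerate "model" that predicts positive for everything
predicted_all_pos = np.ones_like(actual_bin)

recall_all_pos = binary_recall(actual_bin, predicted_all_pos)
# Negatives wrongly flagged as positive; recall ignores these entirely
false_positives = int(((actual_bin == 0) & (predicted_all_pos == 1)).sum())

print('Recall: {}'.format(recall_all_pos))
print('False positives: {}'.format(false_positives))
```

The constant predictor reaches perfect recall while mislabeling all six negatives, which is why recall is usually reported together with a metric that counts false positives.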

**Advantages**
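- Given multiple labels, per-label recall can be averaged with equal weights (macro) or weighted by the number of actual samples of each label (weighted)
  - Macro averaging accounts for class imbalance, unlike vanilla accuracy: poor recall on a rare label lowers the average as much as poor recall on a common one
  - Support-weighted recall equals accuracy, as the `weighted` and `micro` results above show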
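
To see how the averaging choice interacts with class imbalance, here is a rough sketch on hypothetical data (95 samples of label 0 and 5 of label 1, with a constant majority-label predictor). The helper `per_label_recall` is defined only for this example.

```python
import numpy as np

def per_label_recall(actual, predicted):
    # Recall of each label that occurs in `actual`, in sorted label order
    labels = np.unique(actual)
    recalls = np.array([
        np.mean(predicted[actual == label] == label) for label in labels
    ])
    return labels, recalls

# Hypothetical imbalanced data: 95 samples of label 0, 5 of label 1
actual_imb = np.array([0] * 95 + [1] * 5)
predicted_imb = np.zeros_like(actual_imb)  # always predict the majority label

labels, recalls = per_label_recall(actual_imb, predicted_imb)
# Support: number of actual samples per label
support = np.array([(actual_imb == label).sum() for label in labels])

macro_recall = recalls.mean()  # every label counts equally
weighted_recall = np.sum(recalls * support / support.sum())  # weighted by support
plain_accuracy = np.mean(actual_imb == predicted_imb)

print('Macro recall: {}'.format(macro_recall))
print('Weighted recall: {}'.format(weighted_recall))
print('Accuracy: {}'.format(plain_accuracy))
```

In this example the macro average drops to 0.5 because the rare label is never recalled, while the support-weighted average stays at 0.95, the same value as accuracy.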

###Precision

###ROC

###F1 Score