# Evaluate System

A script for evaluating the performance of a system (e.g. head detection, cut detection) against ground-truth face and cut labels. The ground truth comes from the hand-coded `txt` files in [the Gaze Data for the Analysis of Attention in Feature Films dataset](http://graphics.stanford.edu/~kbreeden/gazedata.html).

The script takes pickled dictionaries of ground-truth labels and of labels produced by the system being evaluated. Each dictionary maps frame numbers to a binary label: 0 if the frame does not contain the feature, 1 if it does. The two sets of labels are compared frame by frame to compute accuracy, precision, recall, and F1 score.
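
As a minimal sketch of the input format and the frame-by-frame comparison, the toy example below tallies (system label, ground-truth label) pairs for two clips. The clip names `clip_a` and `clip_b` and all label values are invented for illustration and are not taken from the dataset.

```python
import collections

# Hypothetical toy labels: clip name -> {frame number -> 0/1 label}
toy_ground_truth = {"clip_a": {1: 1, 2: 1, 3: 0}, "clip_b": {1: 0, 2: 1}}
toy_system = {"clip_a": {1: 1, 2: 0, 3: 0}, "clip_b": {1: 1, 2: 1}}

# Count each (system label, ground-truth label) pair across all clips and frames
pair_counts = collections.Counter(
    (toy_system[clip][frame], toy_ground_truth[clip][frame])
    for clip in toy_system
    for frame in toy_system[clip]
)

tp = pair_counts[(1, 1)]  # system 1, ground truth 1
tn = pair_counts[(0, 0)]  # system 0, ground truth 0
fp = pair_counts[(1, 0)]  # system 1, ground truth 0
fn = pair_counts[(0, 1)]  # system 0, ground truth 1

accuracy = (tp + tn) / sum(pair_counts.values())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)  # 2 1 1 1
print(accuracy)        # 0.6
```

With these toy labels, precision, recall, and F1 all come out to 2/3, because the single false positive and the single false negative balance each other.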

In [1]:
import os
import pickle
import collections

In [2]:
def load_pickled_labels_into_dict(label_dicts_dir_path, filename_part_to_drop):
    """
    Loads pickled frame-to-feature dictionaries for multiple clips into a single nested dictionary for all clips.
    
    Parameters:
    - label_dicts_dir_path: the path to the folder containing the pickled frame-to-feature dictionaries (of all the clips) for this feature.
    - filename_part_to_drop: the part of the pickled frame-to-feature dictionary filenames we should drop (for determining clip name keys for the single nested dictionary).
    
    Returns a nested dictionary where keys are clip names, and values are dictionaries mapping frame number keys to feature label values (0 if feature is not in the frame; 1 if it is).
    """
    # Get list of all files and directories in the directory with all the pickled frame-to-feature dictionaries for all the clips
    file_list = os.listdir(label_dicts_dir_path)

    # Create dictionary where keys are clip names, values are dictionaries of the frame-to-feature labels
    label_dicts = {}
    # For each pickled dict...
    for filename in file_list:
        # We only want to process our pickled dictionaries, which are pkl files
        if filename.endswith(".pkl"):
            clip = filename.replace(filename_part_to_drop, "")

            # Load (deserialize) pickled data
            with open(os.path.join(label_dicts_dir_path, filename), "rb") as f:
                label_dicts[clip] = pickle.load(f)
                
    return label_dicts
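
To illustrate how the clip name is derived from a filename, the sketch below writes one pickled dictionary to a temporary directory and reads it back with the same steps as the function above. The clip name `example_clip` and its labels are hypothetical; only the suffix matches one used later in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical data for illustration: one clip with three labeled frames
suffix = "_frame_to_head_dict.pkl"
frame_to_label = {0: 0, 1: 1, 2: 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a pickled frame-to-feature dictionary, named <clip name> + suffix
    with open(os.path.join(tmp_dir, "example_clip" + suffix), "wb") as f:
        pickle.dump(frame_to_label, f)

    # Read it back: strip the suffix to recover the clip name, then unpickle
    loaded = {}
    for filename in os.listdir(tmp_dir):
        if filename.endswith(".pkl"):
            clip = filename.replace(suffix, "")
            with open(os.path.join(tmp_dir, filename), "rb") as f:
                loaded[clip] = pickle.load(f)

print(loaded)  # {'example_clip': {0: 0, 1: 1, 2: 1}}
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.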

In [3]:
def evaluate_system(system_labels, ground_truth_labels):
    """
    Evaluates the performance of a system by comparing labels as assigned by the system, versus ground-truth labels.
    
    Parameters:
    - system_labels: feature labels assigned by our system.
    - ground_truth_labels: ground-truth feature labels from the hand coding in the Gaze dataset.
    Each parameter is a dictionary mapping clip names to dictionaries that map frame numbers to binary feature labels (0 if feature is not in the frame; 1 if it is).
    
    Returns a dictionary with the numbers of true positives, true negatives, false positives, and false negatives, as well as the system's accuracy, precision, recall, and F1 score.
    """
    # Create a Counter of (system prediction, ground truth label) tuples
    c = collections.Counter()
    
    # For each clip...
    for clip in system_labels:
        # For each frame...
        for frame_num in system_labels[clip]:
            system = system_labels[clip][frame_num]
            ground_truth = ground_truth_labels[clip][frame_num]
            c[(system, ground_truth)] += 1

    # Count the number of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn)
    # 1 (frame contains the feature) is the positive class, and 0 (frame does not contain the feature) is the negative class
    tp = c[(1, 1)] # system predicted 1, and ground truth was actually 1 
    tn = c[(0, 0)] # system predicted 0, and ground truth was actually 0
    fp = c[(1, 0)] # system predicted 1, but ground truth was actually 0
    fn = c[(0, 1)] # system predicted 0, but ground truth was actually 1
    
    # Compute evaluation metrics: accuracy, precision, recall, and F1
    total = sum(c.values())
    accuracy = (tp + tn) / total if total > 0 else 0.0   # guard against empty input
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0 # of the frames the system labeled 1, what fraction actually contain the feature?
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0    # of the frames that actually contain the feature, what fraction did the system label 1?
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    results = {"truePositives": tp, "trueNegatives": tn, "falsePositives": fp, "falseNegatives": fn,
               "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    return results

In [4]:
# Load all head labels predicted by a Haar Cascade head detection system into a single nested dictionary
head_haar_dicts = load_pickled_labels_into_dict("head_label_dicts_haar", "_frame_to_head_dict.pkl")

# Load all ground-truth face labels from the hand coding in the Gaze dataset into a single nested dictionary
ground_truth_dicts = load_pickled_labels_into_dict("ground_truth_face_label_dicts", "_hcode_frame_to_face_dict.pkl")

In [5]:
# Evaluate performance of head detection using Haar Cascades
evaluate_system(head_haar_dicts, ground_truth_dicts)

{'truePositives': 12896,
 'trueNegatives': 12684,
 'falsePositives': 220,
 'falseNegatives': 28326,
 'accuracy': 0.47260096811144364,
 'precision': 0.98322659347362,
 'recall': 0.3128426568337296,
 'f1': 0.4746586182781846}
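
As a quick arithmetic check, the metrics above can be recomputed directly from the four reported counts. This is only a sketch that re-applies the standard formulas to the numbers in the output; it does not reload any data.

```python
import math

# Counts copied from the output above
tp, tn, fp, fn = 12896, 12684, 220, 28326

total = tp + tn + fp + fn                  # 54126 labeled frames
accuracy = (tp + tn) / total               # 25580 / 54126
precision = tp / (tp + fp)                 # 12896 / 13116
recall = tp / (tp + fn)                    # 12896 / 41222
f1 = 2 * precision * recall / (precision + recall)

# Compare against the reported values
assert math.isclose(accuracy, 0.47260096811144364, rel_tol=1e-6)
assert math.isclose(precision, 0.98322659347362, rel_tol=1e-6)
assert math.isclose(recall, 0.3128426568337296, rel_tol=1e-6)
assert math.isclose(f1, 0.4746586182781846, rel_tol=1e-6)
```

The counts show where the low F1 comes from: false negatives (28326) far outnumber false positives (220), so precision is high while recall is low. One possible contributor, not tested here, is that the system predicts heads while the ground truth labels faces.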