## Scoring   
This notebook takes a list of image SHA-1 hashes with confidence scores and generates cluster-level scores, combining the ad-level scores once by average and once by maximum.

### Inputs
1. cp1_results.csv   
A CSV file mapping image SHA-1 hashes to confidence scores, typically the output of `compute_classifications`.
2. CP1_data.csv   
A CSV file with columns for cluster_id, ad_id, image_sha, and class (class is not used by this notebook). See the 'Data Preparation' notebook for obtaining this file.

### Outputs
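1. scores.avg.jl   
A JSON Lines file with one object per cluster, containing `cluster_id` and `score`. The score is the average of the cluster's ad-level scores, where each ad-level score is the average over the ad's image scores.
2. scores.max.jl   
A JSON Lines file in the same format, where the score is the maximum of the cluster's ad-level scores.

Expects DATA_FILE to be a CSV in the format: image_sha1, score    
Expects TOTAL_DATA_FILE to be a CSV in the format: cluster_id, ad_id, image_sha1, class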
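
To make the two expected input formats concrete, here is a minimal sketch that parses a few in-memory rows the same way the cells below do. The sample rows, the helper names `load_scores` and `load_structure`, and the assumption that neither file has a header row are illustrative and not taken from the real data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real files come from earlier pipeline steps.
# Assumption: neither file has a header row.
SAMPLE_DATA_FILE = "sha_a,0.9\nsha_b,0.1\nsha_c,0.5\n"
SAMPLE_TOTAL_DATA_FILE = (
    "cluster_1,ad_1,sha_a,1\n"
    "cluster_1,ad_1,sha_b,1\n"
    "cluster_1,ad_2,sha_c,1\n"
)


def load_scores(handle):
    """Map each image sha to its confidence score, parsed as a float."""
    return {sha1: float(score) for sha1, score in csv.reader(handle)}


def load_structure(handle):
    """Build cluster -> ad ids and ad -> image shas; the class column is ignored."""
    cluster_to_ads = defaultdict(set)
    ad_to_shas = defaultdict(set)
    for cluster_id, ad_id, image_sha1, _ in csv.reader(handle):
        cluster_to_ads[cluster_id].add(ad_id)
        ad_to_shas[ad_id].add(image_sha1)
    return cluster_to_ads, ad_to_shas


scores = load_scores(io.StringIO(SAMPLE_DATA_FILE))
cluster_to_ads, ad_to_shas = load_structure(io.StringIO(SAMPLE_TOTAL_DATA_FILE))
```

Using sets means a sha listed twice for the same ad is only counted once when scores are aggregated.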

In [None]:
__depends__ = ['cp1_results.csv', 'CP1_data.csv']
__dest__ = ['scores.avg.jl', 'scores.max.jl']

DATA_FILE = 'cp1_results.csv'
TOTAL_DATA_FILE = 'CP1_data.csv'

In [None]:
import csv
import json
import numpy as np
from collections import defaultdict

In [None]:
cluster_id_to_ad_ids = defaultdict(set)
ad_id_to_image_shas = defaultdict(set)
image_sha_scores = {}

In [None]:
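# Load image-level confidence scores. Values are kept as strings here
# and converted to float when ad scores are computed.
with open(DATA_FILE) as infile:
    for (sha1, score) in csv.reader(infile):
        image_sha_scores[sha1] = score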

In [None]:
# Build the cluster -> ads and ad -> images mappings; the class column is ignored.
with open(TOTAL_DATA_FILE) as infile:
    for (cluster_id, ad_id, image_sha1, _) in csv.reader(infile):
        cluster_id_to_ad_ids[cluster_id].add(ad_id)
        ad_id_to_image_shas[ad_id].add(image_sha1)

In [None]:
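def ad_score(ad_id, func=np.average):
    # Aggregate the scores of all images in an ad (average by default).
    image_scores = [float(image_sha_scores[sha]) for sha in ad_id_to_image_shas[ad_id]]
    return func(image_scores)

def cluster_score(cluster_id, func=np.average):
    # func combines the ad-level scores; each ad score itself is always the
    # average over its images, because ad_score is called with its default.
    return func([ad_score(ad_id) for ad_id in cluster_id_to_ad_ids[cluster_id]])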
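
As a worked example of the two-level aggregation, the sketch below reproduces the same logic on toy data. The ids and scores are made up, the `toy_` names are hypothetical, and `statistics.mean` stands in for `np.average` so the example needs only the standard library.

```python
from statistics import mean

# Hypothetical toy data: one cluster with two ads.
image_scores = {"sha_a": 1.0, "sha_b": 0.0, "sha_c": 0.75}
ad_images = {"ad_1": {"sha_a", "sha_b"}, "ad_2": {"sha_c"}}
cluster_ads = {"cluster_1": {"ad_1", "ad_2"}}


def toy_ad_score(ad_id):
    """Average the image scores belonging to one ad."""
    return mean(image_scores[sha] for sha in ad_images[ad_id])


def toy_cluster_score(cluster_id, func=mean):
    """Combine the ad-level averages of one cluster with func."""
    return func([toy_ad_score(ad_id) for ad_id in cluster_ads[cluster_id]])


# Ad averages are 0.5 (ad_1) and 0.75 (ad_2).
avg_score = toy_cluster_score("cluster_1")            # 0.625
max_score = toy_cluster_score("cluster_1", func=max)  # 0.75
```

Note that the maximum variant returns 0.75, the highest ad-level average, not 1.0, the highest single image score in the cluster.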

In [None]:
# Write one JSON object per line: the average of the ad-level scores per cluster.
with open('scores.avg.jl', 'w') as outfile:
    for cluster_id in cluster_id_to_ad_ids:
        outfile.write(json.dumps({'cluster_id': cluster_id,
                                  'score': cluster_score(cluster_id)}) + '\n')

In [None]:
# Write one JSON object per line: the maximum of the ad-level scores per cluster.
with open('scores.max.jl', 'w') as outfile:
    for cluster_id in cluster_id_to_ad_ids:
        outfile.write(json.dumps({'cluster_id': cluster_id,
                                  'score': cluster_score(cluster_id, func=np.max)}) + '\n')
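
The files written above hold one JSON object per line, so they can be read back line by line. The following is a minimal sketch of that; the sample records and the helper name `read_cluster_scores` are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import io
import json

# Hypothetical sample in the same shape the notebook writes.
sample_jl = io.StringIO(
    '{"cluster_id": "cluster_1", "score": 0.625}\n'
    '{"cluster_id": "cluster_2", "score": 0.25}\n'
)


def read_cluster_scores(handle):
    """Parse a JSON Lines stream into a cluster_id -> score dict."""
    scores = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        scores[record["cluster_id"]] = record["score"]
    return scores


cluster_scores = read_cluster_scores(sample_jl)
# Cluster ids ordered from highest to lowest score.
ranked = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
```

For the real outputs, the same function could be called on a file opened with `open('scores.avg.jl')` or `open('scores.max.jl')`.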