The scores below are described in full in the following publication:

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. "Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation." arXiv preprint arXiv:1706.09799 (2017).

In [1]:
import os, sys

import matplotlib.pyplot as plt
import numpy as np
import joblib

In [2]:
sys.path.append('../nlg-eval')
from nlgeval import compute_individual_metrics

In [3]:
predicted_reports = joblib.load('../data/predicted_reports_val.jbl')

In [15]:
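def avg_scores(metrics):
    """Average the per-image score dicts position by position.

    `metrics` maps image name -> list of score dicts (one per unit count).
    Returns a new list of averaged dicts; `metrics` itself is not modified.
    """
    per_image = list(metrics.values())
    # Copy the first image's dicts so the running sums do not overwrite
    # the stored scores (the result would otherwise drift on repeated calls).
    avg = [dict(item) for item in per_image[0]]
    for val in per_image[1:]:
        for idx, item in enumerate(val):
            for key in item:
                avg[idx][key] += item[key]
    for score_set in avg:
        for key in score_set:
            score_set[key] /= len(per_image)
    return avg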
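
As a quick sanity check of the position-wise averaging, the sketch below runs the same idea on toy data. The helper name `average_by_position`, the image names, and the score values are all invented for illustration; it assumes every image carries the same number of score dicts with the same keys.

```python
from copy import deepcopy

def average_by_position(metrics):
    """Average score dicts across images, position by position (hypothetical helper)."""
    per_image = list(metrics.values())
    totals = deepcopy(per_image[0])  # copy so the input is left untouched
    for scores in per_image[1:]:
        for idx, item in enumerate(scores):
            for key, value in item.items():
                totals[idx][key] += value
    return [{k: v / len(per_image) for k, v in item.items()} for item in totals]

# Toy input: two images, two unit counts each (values are invented).
toy = {
    'img_a': [{'GreedyMatchingScore': 0.5}, {'GreedyMatchingScore': 1.0}],
    'img_b': [{'GreedyMatchingScore': 1.5}, {'GreedyMatchingScore': 2.0}],
}
toy_avg = average_by_position(toy)
print(toy_avg)
```

Each position in the returned list corresponds to one entry of the per-image lists, so the first averaged dict summarizes the first entry of every image, and so on.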

In [8]:
# Computing the w2v embedding scores for the DeepMiner reports takes a
# very long time, so a precomputed cache is loaded here.
# The cache holds the similarity metrics for each mammogram's
# (ground-truth report, predicted report) pair:
# EmbeddingAverageCosineSimilairty (the key is spelled this way in the score dicts),
# GreedyMatchingScore, and VectorExtremaCosineSimilarity.
# Skip the next two cells unless you want to recompute these values.
metrics = joblib.load('../data/predicted_reports_w2v_scores_val.jbl')
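
To make the three cached metrics more concrete, here is a minimal NumPy sketch of how embedding-average, vector-extrema, and greedy-matching similarities are commonly defined. It is not the nlg-eval implementation: the function names, the random toy embeddings, and the example sentences are assumptions made for illustration. Because all three are built on cosine similarity, the first two can be negative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_average(ref_vecs, hyp_vecs):
    """Cosine between the mean word vectors of the two sentences."""
    return cosine(ref_vecs.mean(axis=0), hyp_vecs.mean(axis=0))

def vector_extrema(vecs):
    """Per dimension, keep the value with the largest magnitude."""
    rows = np.abs(vecs).argmax(axis=0)
    return vecs[rows, np.arange(vecs.shape[1])]

def vector_extrema_similarity(ref_vecs, hyp_vecs):
    return cosine(vector_extrema(ref_vecs), vector_extrema(hyp_vecs))

def greedy_matching(ref_vecs, hyp_vecs):
    """Match each word to its most similar word in the other sentence,
    average the matches, and symmetrize over both directions."""
    ref_n = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    hyp_n = hyp_vecs / np.linalg.norm(hyp_vecs, axis=1, keepdims=True)
    sims = ref_n @ hyp_n.T
    return float((sims.max(axis=1).mean() + sims.max(axis=0).mean()) / 2)

# Toy vocabulary with random 8-dimensional vectors (invented, not real w2v).
rng = np.random.default_rng(0)
vocab = ['no', 'suspicious', 'mass', 'benign', 'calcification']
toy_embeddings = {word: rng.normal(size=8) for word in vocab}

def embed(sentence):
    """Stack the toy vectors for each whitespace-separated token."""
    return np.stack([toy_embeddings[tok] for tok in sentence.split()])

ref = embed('no suspicious mass')
hyp = embed('benign calcification')
toy_scores = {
    'average': embedding_average(ref, hyp),
    'extrema': vector_extrema_similarity(ref, hyp),
    'greedy': greedy_matching(ref, hyp),
}
print(toy_scores)
```

Comparing a sentence with itself gives a score of 1 for all three functions, which is a convenient check when adapting the sketch.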

In [None]:
# Exhaustive validation-set baseline.
# Note: check the requirements of the submodule containing
# the Maluuba/nlg-eval repo before running this cell or
# the randomized baseline.

metrics = {}
for ind, image_name in enumerate(predicted_reports.keys()):
    scores = []
    for gt, pred in predicted_reports[image_name]:
        try:
            metrics_dict = compute_individual_metrics(' '.join(gt), ' '+' '.join(pred), no_overlap=True,
                                                      no_skipthoughts=True, no_glove=False)
        except Exception:
            # Drop the whole image on failure so that no stale result is stored
            # and every stored list keeps one entry per unit count.
            print('Error! {}, {}'.format(gt, pred))
            scores = None
            break
        scores.append(metrics_dict)
    if scores is not None:
        metrics[image_name] = scores

    if ind % 10 == 0 and metrics:
        print('Scored Reports {} / {}'.format(ind, len(predicted_reports.keys())))
        print(avg_scores(metrics))
        print('')

In [None]:
#joblib.dump(metrics, 'predicted_reports_w2v_scores_val.jbl' )

In [None]:
# Randomized baseline: score each ground-truth report against the
# prediction made for a randomly chosen image.
from random import shuffle
metrics = {}
reportkeys = list(predicted_reports.keys())
rand_inds = list(range(len(reportkeys)))
shuffle(rand_inds)
for ind, image_name in enumerate(reportkeys):
    rind = rand_inds[ind]
    scores = []
    for ind2, (gt, pred_orig) in enumerate(predicted_reports[image_name]):
        _, pred = predicted_reports[reportkeys[rind]][ind2]
        try:
            metrics_dict = compute_individual_metrics(' '.join(gt), ' '+' '.join(pred), no_overlap=True,
                                                      no_skipthoughts=True, no_glove=False)
        except Exception:
            # Drop the whole image on failure so that no stale result is stored
            # and every stored list keeps one entry per unit count.
            print('Error! {}, {}'.format(gt, pred))
            scores = None
            break
        scores.append(metrics_dict)
    if scores is not None:
        metrics[image_name] = scores

    if ind % 10 == 0 and metrics:
        print('Scored Reports {} / {}'.format(ind, len(reportkeys)))
        print(avg_scores(metrics))
        print('')
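
The shuffle in the randomized baseline is unseeded and can leave an image paired with its own prediction. One possible variant, sketched here under the assumption that self-pairings should be excluded and runs should be repeatable, walks a seeded shuffle as a cycle. The helper name `shuffled_partner_indices` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def shuffled_partner_indices(n, seed=0):
    """Return a list where partner[i] != i for every i (hypothetical helper)."""
    if n < 2:
        raise ValueError('need at least two items to avoid self-pairing')
    order = list(range(n))
    random.Random(seed).shuffle(order)  # local RNG, so the global state is untouched
    partner = [None] * n
    # Walk the shuffled order as a cycle: each item is paired with the next one.
    for pos, item in enumerate(order):
        partner[item] = order[(pos + 1) % n]
    return partner

partners = shuffled_partner_indices(5, seed=42)
print(partners)
```

This draws only single-cycle permutations rather than sampling uniformly from all permutations without fixed points, which is usually acceptable for a baseline. The returned list could stand in for `rand_inds` in the cell above.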

In [31]:
# NLP similarity scores between GT mammogram diagnoses and predicted DeepMiner reports
num_units = [1, 4, 8, 20]
for ind, score in enumerate(avg_scores(metrics)):
    print('NLP scores for predicted reports using top {} units: {}\n'.format(num_units[ind], score))

NLP scores for predicted reports using top 1 units: {'EmbeddingAverageCosineSimilairty': -0.05958589730961718, 'GreedyMatchingScore': 0.626612518126469, 'VectorExtremaCosineSimilarity': -0.36652118315803106}

NLP scores for predicted reports using top 4 units: {'EmbeddingAverageCosineSimilairty': -0.3608542462486459, 'GreedyMatchingScore': 0.5415953116633389, 'VectorExtremaCosineSimilarity': -0.36305756516138443}

NLP scores for predicted reports using top 8 units: {'EmbeddingAverageCosineSimilairty': -0.45745307127549334, 'GreedyMatchingScore': 0.5333609578178252, 'VectorExtremaCosineSimilarity': -0.3457480962420635}

NLP scores for predicted reports using top 20 units: {'EmbeddingAverageCosineSimilairty': -0.449112173004515, 'GreedyMatchingScore': 0.5604797010213896, 'VectorExtremaCosineSimilarity': -0.3141268791857631}

