
None-consistent results between corpus and single #10

Closed
AmitMY opened this issue Feb 28, 2018 · 3 comments

Comments

AmitMY commented Feb 28, 2018

Following the README, and running the binary file manually, results in:

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192
SkipThoughtsCosineSimilairty: 0.626149
EmbeddingAverageCosineSimilairty: 0.884690
VectorExtremaCosineSimilarity: 0.568696
GreedyMatchingScore: 0.784205

However, running each sentence individually with the new model-preload method gives (for each of the 2 sentences):

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.697944, 'Bleu_4': 4.939382736523921e-09, 'Bleu_3': 1.609148974162434e-06, 'Bleu_2': 0.18257418581578377, 'Bleu_1': 0.2999999999700001, 'ROUGE_L': 0.36454183266932266, 'METEOR': 0.1556568826170604, 'EmbeddingAverageCosineSimilairty': 0.836663, 'VectorExtremaCosineSimilarity': 0.427065, 'SkipThoughtCS': 0.3743917}

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.870466, 'Bleu_4': 0.35494810555850326, 'Bleu_3': 0.4807498567152745, 'Bleu_2': 0.6666666665962965, 'Bleu_1': 0.7999999999200001, 'ROUGE_L': 0.67966573816155995, 'METEOR': 0.39012536521249613, 'EmbeddingAverageCosineSimilairty': 0.932718, 'VectorExtremaCosineSimilarity': 0.710326, 'SkipThoughtCS': 0.87790722}

Or on average:

  • Bleu_1: 0.5499999999450002
  • Bleu_2: 0.4246204262060401
  • Bleu_3: 0.24037573293212433
  • Bleu_4: 0.177474055248943
  • METEOR: 0.27289112391477827
  • ROUGE_L: 0.52210378541544133
  • CIDEr: 0.0
  • SkipThoughtCS: 0.62614947557449341
  • EmbeddingAverageCosineSimilairty: 0.8846905
  • VectorExtremaCosineSimilarity: 0.5686955
  • GreedyMatchingScore: 0.784205

For CIDEr, I understand it must be 0 because of the note in the README.
For the rest, I don't understand why some metrics deviate so much while others match exactly.

I looked at your scorers' code and saw that most of the scorers checked above average over sentences (e.g. with np.mean), but for METEOR and BLEU I couldn't find any such behavior.

Here is code to reproduce this. I think it could also serve as test code (obviously not with these assertion values, but with values from compute_metrics on the entire corpus):

from nlgeval import NLGEval
from datetime import datetime

startTime = datetime.now()


def passedTime():
    return str(datetime.now() - startTime)


data = [
    ("this is the model generated sentence1 which seems good enough", [
        "this is one reference sentence for sentence1",
        "this is one more reference sentence for sentence1"
    ]),
    ("this is sentence2 which has been generated by your model", [
        "this is a reference sentence for sentence2 which was generated by your model",
        "this is the second reference sentence for sentence2"
    ])
]

if __name__ == "__main__":
    print "Start loading NLG-Eval Model", passedTime()
    nlgeval = NLGEval()  # loads the models
    print "End loading NLG-Eval Model", passedTime()

    metrics = []
    for hyp, refs in data:
        print "Start evaluating a single sentence", passedTime()
        metrics_dict = nlgeval.evaluate(refs, hyp)
        print "End evaluating a single sentence", passedTime()
        print metrics_dict
        metrics.append(metrics_dict)

    # Average the per-sentence scores over the corpus.
    # (Summing Counters with '+' would drop zero-valued keys such as CIDEr, so average explicitly.)
    N = float(len(metrics))
    final_metrics = {k: round(sum(m[k] for m in metrics) / N, 6) for k in metrics[0]}
    print final_metrics

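    # Expected corpus-level scores, taken from the README example above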
    assert final_metrics["Bleu_1"] == 0.550000
    assert final_metrics["Bleu_2"] == 0.428174
    assert final_metrics["Bleu_3"] == 0.284043
    assert final_metrics["Bleu_4"] == 0.201143
    assert final_metrics["METEOR"] == 0.295797
    assert final_metrics["ROUGE_L"] == 0.522104
    assert final_metrics["SkipThoughtsCosineSimilairty"] == 0.626149
    assert final_metrics["EmbeddingAverageCosineSimilairty"] == 0.884690
    assert final_metrics["VectorExtremaCosineSimilarity"] == 0.568696
    assert final_metrics["GreedyMatchingScore"] == 0.784205
kracwarlock (Member) commented:

This is expected behavior. BLEU and METEOR are calculated across the entire corpus and the corpus score is not supposed to be the average of sentence scores. You can search online for corpus BLEU and you will find several explanations available.
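
For a self-contained illustration, here is a minimal sketch using NLTK (not nlg-eval; the tokenization and smoothing below are assumptions, so the exact numbers will not match the ones reported above) that compares corpus-level BLEU with the mean of per-sentence BLEU on the same two example pairs:

from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

data = [
    ("this is the model generated sentence1 which seems good enough",
     ["this is one reference sentence for sentence1",
      "this is one more reference sentence for sentence1"]),
    ("this is sentence2 which has been generated by your model",
     ["this is a reference sentence for sentence2 which was generated by your model",
      "this is the second reference sentence for sentence2"]),
]

hyps = [hyp.split() for hyp, _ in data]
refs = [[ref.split() for ref in ref_group] for _, ref_group in data]

# Smoothing only avoids zero-count warnings here; it is not what nlg-eval uses.
smooth = SmoothingFunction().method1

# Corpus BLEU pools the clipped n-gram counts over all pairs before computing the precisions.
print("corpus BLEU-4:        %.6f" % corpus_bleu(refs, hyps, smoothing_function=smooth))

# Averaging per-sentence BLEU scores gives a different number.
sentence_scores = [sentence_bleu(r, h, smoothing_function=smooth) for r, h in zip(refs, hyps)]
print("mean sentence BLEU-4: %.6f" % (sum(sentence_scores) / len(sentence_scores)))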

kracwarlock (Member) commented:

This is why this repository originally assumed that you would have all the hypotheses and references in advance and would then run compute_metrics over them to report the corpus-level scores, since most papers report only those.
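
For completeness, a corpus-level run along the lines of the README example: all hypotheses in one file, each reference set in its own file, aligned line by line (the paths below are placeholders):

from nlgeval import compute_metrics

# One hypothesis per line; each references file holds the matching reference
# on the same line number. The example paths are placeholders.
metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                               references=['examples/ref1.txt', 'examples/ref2.txt'])
print(metrics_dict)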

AmitMY (Author) commented Mar 2, 2018

Thanks! I wasn't aware.

For future reference:
Instead of averaging the sentence-level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) uses micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pair before the division).
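
A toy sketch of that distinction (plain unigram precision only, with no clipping and no brevity penalty, on made-up sentences, so this is not BLEU itself):

def unigram_counts(hyp, refs):
    # matched unigrams (no clipping) and hypothesis length
    hyp_tokens = hyp.split()
    ref_tokens = set(tok for ref in refs for tok in ref.split())
    matched = sum(1 for tok in hyp_tokens if tok in ref_tokens)
    return matched, len(hyp_tokens)

# Made-up sentences, chosen only so that the hypotheses have different lengths.
toy = [
    ("the cat sat on the mat", ["the cat sat on the mat"]),            # 6 matched / 6 tokens
    ("a completely different and much longer generated sentence here",
     ["a short reference"]),                                           # 1 matched / 9 tokens
]

counts = [unigram_counts(hyp, refs) for hyp, refs in toy]

# Micro-average: pool numerators and denominators over the corpus, then divide once.
micro = float(sum(m for m, _ in counts)) / sum(n for _, n in counts)
# Macro-average: divide per sentence, then take the mean of the per-sentence precisions.
macro = sum(float(m) / n for m, n in counts) / len(counts)

print("micro-averaged unigram precision: %.3f" % micro)  # (6 + 1) / (6 + 9) = 0.467
print("macro-averaged unigram precision: %.3f" % macro)  # (1.0 + 1/9) / 2   = 0.556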
