
None-consistent results between corpus and single #10

Closed
AmitMY opened this issue Feb 28, 2018 · 3 comments

Comments

AmitMY commented Feb 28, 2018

Following the README, and running the binary file manually, results in:

Bleu_1: 0.550000
Bleu_2: 0.428174
Bleu_3: 0.284043
Bleu_4: 0.201143
METEOR: 0.295797
ROUGE_L: 0.522104
CIDEr: 1.242192
SkipThoughtsCosineSimilairty: 0.626149
EmbeddingAverageCosineSimilairty: 0.884690
VectorExtremaCosineSimilarity: 0.568696
GreedyMatchingScore: 0.784205

However, running each sentence individually with the new model-preload method gives (for each of the 2 sentences):

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.697944, 'Bleu_4': 4.939382736523921e-09, 'Bleu_3': 1.609148974162434e-06, 'Bleu_2': 0.18257418581578377, 'Bleu_1': 0.2999999999700001, 'ROUGE_L': 0.36454183266932266, 'METEOR': 0.1556568826170604, 'EmbeddingAverageCosineSimilairty': 0.836663, 'VectorExtremaCosineSimilarity': 0.427065, 'SkipThoughtCS': 0.3743917}

{'CIDEr': 0.0, 'GreedyMatchingScore': 0.870466, 'Bleu_4': 0.35494810555850326, 'Bleu_3': 0.4807498567152745, 'Bleu_2': 0.6666666665962965, 'Bleu_1': 0.7999999999200001, 'ROUGE_L': 0.67966573816155995, 'METEOR': 0.39012536521249613, 'EmbeddingAverageCosineSimilairty': 0.932718, 'VectorExtremaCosineSimilarity': 0.710326, 'SkipThoughtCS': 0.87790722}

Or on average:

  • Bleu_1: 0.5499999999450002
  • Bleu_2: 0.4246204262060401
  • Bleu_3: 0.24037573293212433
  • Bleu_4: 0.177474055248943
  • METEOR: 0.27289112391477827
  • ROUGE_L: 0.52210378541544133
  • CIDEr: 0.0
  • SkipThoughtCS: 0.62614947557449341
  • EmbeddingAverageCosineSimilairty: 0.8846905
  • VectorExtremaCosineSimilarity: 0.5686955
  • GreedyMatchingScore: 0.784205

For CIDEr, I understand it must be 0 because of the note in the README.
For the rest, I don't understand why some metrics deviate so much while others match exactly.

I looked at your scorers' code and saw that most of the scorers checked above average over sentences (e.g. with np.mean), but for METEOR and BLEU I couldn't find any such behavior.

Here is code to reproduce this. I think it could also serve as test code (obviously not with these assertion values, but with values from compute_metrics on the entire corpus):

from nlgeval import NLGEval
from datetime import datetime

startTime = datetime.now()


def passedTime():
    return str(datetime.now() - startTime)


data = [
    ("this is the model generated sentence1 which seems good enough", [
        "this is one reference sentence for sentence1",
        "this is one more reference sentence for sentence1"
    ]),
    ("this is sentence2 which has been generated by your model", [
        "this is a reference sentence for sentence2 which was generated by your model",
        "this is the second reference sentence for sentence2"
    ])
]

if __name__ == "__main__":
    print "Start loading NLG-Eval Model", passedTime()
    nlgeval = NLGEval()  # loads the models
    print "End loading NLG-Eval Model", passedTime()

    metrics = []
    for hyp, refs in data:
        print "Start evaluating a single sentence", passedTime()
        metrics_dict = nlgeval.evaluate(refs, hyp)
        print "End evaluating a single sentence", passedTime()
        print metrics_dict
        metrics.append(metrics_dict)

    # Average the per-sentence scores over the corpus.
    # (Summing Counters with '+' would drop zero-valued keys such as CIDEr, so average explicitly.)
    N = float(len(metrics))
    final_metrics = {k: round(sum(m[k] for m in metrics) / N, 6) for k in metrics[0]}
    print final_metrics

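    # Expected corpus-level scores, taken from the README example above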
    assert final_metrics["Bleu_1"] == 0.550000
    assert final_metrics["Bleu_2"] == 0.428174
    assert final_metrics["Bleu_3"] == 0.284043
    assert final_metrics["Bleu_4"] == 0.201143
    assert final_metrics["METEOR"] == 0.295797
    assert final_metrics["ROUGE_L"] == 0.522104
    assert final_metrics["SkipThoughtsCosineSimilairty"] == 0.626149
    assert final_metrics["EmbeddingAverageCosineSimilairty"] == 0.884690
    assert final_metrics["VectorExtremaCosineSimilarity"] == 0.568696
    assert final_metrics["GreedyMatchingScore"] == 0.784205
kracwarlock (Member) commented:

This is expected behavior. BLEU and METEOR are calculated across the entire corpus and the corpus score is not supposed to be the average of sentence scores. You can search online for corpus BLEU and you will find several explanations available.
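
For a self-contained illustration, here is a minimal sketch using NLTK (not nlg-eval; the tokenization and smoothing below are assumptions, so the exact numbers will not match the ones reported above) that compares corpus-level BLEU with the mean of per-sentence BLEU on the same two example pairs:

from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

data = [
    ("this is the model generated sentence1 which seems good enough",
     ["this is one reference sentence for sentence1",
      "this is one more reference sentence for sentence1"]),
    ("this is sentence2 which has been generated by your model",
     ["this is a reference sentence for sentence2 which was generated by your model",
      "this is the second reference sentence for sentence2"]),
]

hyps = [hyp.split() for hyp, _ in data]
refs = [[ref.split() for ref in ref_group] for _, ref_group in data]

# Smoothing only avoids zero-count warnings here; it is not what nlg-eval uses.
smooth = SmoothingFunction().method1

# Corpus BLEU pools the clipped n-gram counts over all pairs before computing the precisions.
print("corpus BLEU-4:        %.6f" % corpus_bleu(refs, hyps, smoothing_function=smooth))

# Averaging per-sentence BLEU scores gives a different number.
sentence_scores = [sentence_bleu(r, h, smoothing_function=smooth) for r, h in zip(refs, hyps)]
print("mean sentence BLEU-4: %.6f" % (sum(sentence_scores) / len(sentence_scores)))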

kracwarlock (Member) commented:

This is why this repository originally assumed that you would have all the hypotheses and references in advance and would then run compute_metrics over them to report the corpus-level scores, since most papers report only those.
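
For completeness, a corpus-level run along the lines of the README example: all hypotheses in one file, each reference set in its own file, aligned line by line (the paths below are placeholders):

from nlgeval import compute_metrics

# One hypothesis per line; each references file holds the matching reference
# on the same line number. The example paths are placeholders.
metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                               references=['examples/ref1.txt', 'examples/ref2.txt'])
print(metrics_dict)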

AmitMY (Author) commented Mar 2, 2018

Thanks! I wasn't aware.

For future reference:
Instead of averaging the sentence-level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) uses micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pair before the division).
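
A toy sketch of that distinction (plain unigram precision only, with no clipping and no brevity penalty, on made-up sentences, so this is not BLEU itself):

def unigram_counts(hyp, refs):
    # matched unigrams (no clipping) and hypothesis length
    hyp_tokens = hyp.split()
    ref_tokens = set(tok for ref in refs for tok in ref.split())
    matched = sum(1 for tok in hyp_tokens if tok in ref_tokens)
    return matched, len(hyp_tokens)

# Made-up sentences, chosen only so that the hypotheses have different lengths.
toy = [
    ("the cat sat on the mat", ["the cat sat on the mat"]),            # 6 matched / 6 tokens
    ("a completely different and much longer generated sentence here",
     ["a short reference"]),                                           # 1 matched / 9 tokens
]

counts = [unigram_counts(hyp, refs) for hyp, refs in toy]

# Micro-average: pool numerators and denominators over the corpus, then divide once.
micro = float(sum(m for m, _ in counts)) / sum(n for _, n in counts)
# Macro-average: divide per sentence, then take the mean of the per-sentence precisions.
macro = sum(float(m) / n for m, n in counts) / len(counts)

print("micro-averaged unigram precision: %.3f" % micro)  # (6 + 1) / (6 + 9) = 0.467
print("macro-averaged unigram precision: %.3f" % macro)  # (1.0 + 1/9) / 2   = 0.556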
