The README, and manually running the binary file, results in:

However, running each sentence using the new model-preload method gives (for each of the 2 sentences):

Or on average:

For CIDEr, I understand it must be 0 because of the note in the README.
For the rest, I don't understand why there are such fluctuating deviations.
I looked at your scorers' code, and I saw that for most of the scorers checked above you are using an average like `np.mean`, but for METEOR and BLEU I couldn't find any behavior like that.
Here is code for reproduction; I think it would be good to use as test code (obviously not with these assertion values, but with values from `compute_metrics` on the entire corpus):
```python
from nlgeval import NLGEval
from datetime import datetime
from collections import Counter

startTime = datetime.now()

def passedTime():
    return str(datetime.now() - startTime)

data = [
    ("this is the model generated sentence1 which seems good enough", [
        "this is one reference sentence for sentence1",
        "this is one more reference sentence for sentence1"
    ]),
    ("this is sentence2 which has been generated by your model", [
        "this is a reference sentence for sentence2 which was generated by your model",
        "this is the second reference sentence for sentence2"
    ])
]

if __name__ == "__main__":
    print("Start loading NLG-Eval Model", passedTime())
    nlgeval = NLGEval()  # loads the models
    print("End loading NLG-Eval Model", passedTime())

    metrics = []
    for hyp, refs in data:
        print("Start evaluating a single sentence", passedTime())
        metrics_dict = nlgeval.evaluate(refs, hyp)
        print("End evaluating a single sentence", passedTime())
        print(metrics_dict)
        metrics.append(metrics_dict)

    # Average the per-sentence scores over the whole corpus.
    total = sum(map(Counter, metrics), Counter())
    N = len(metrics)
    final_metrics = {k: round(v / N, 6) for k, v in total.items()}
    print(final_metrics)

    assert final_metrics["Bleu_1"] == 0.550000
    assert final_metrics["Bleu_2"] == 0.428174
    assert final_metrics["Bleu_3"] == 0.284043
    assert final_metrics["Bleu_4"] == 0.201143
    assert final_metrics["METEOR"] == 0.295797
    assert final_metrics["ROUGE_L"] == 0.522104
    assert final_metrics["SkipThoughtsCosineSimilairty"] == 0.626149
    assert final_metrics["EmbeddingAverageCosineSimilairty"] == 0.884690
    assert final_metrics["VectorExtremaCosineSimilarity"] == 0.568696
    assert final_metrics["GreedyMatchingScore"] == 0.784205
```
This is expected behavior. BLEU and METEOR are calculated across the entire corpus, and the corpus score is not supposed to be the average of the sentence scores. If you search online for "corpus BLEU", you will find several explanations.
This is why this repository originally assumed that you would have all the hypotheses and references in advance and would run `compute_metrics` over them to report the corpus-level scores, since that is what most papers report.
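A minimal sketch of that corpus-level usage, assuming the file-based `compute_metrics` entry point from the README (the file paths are placeholders):

```python
# Corpus-level scoring: compute_metrics reads the whole hypothesis file and
# reference files at once, so BLEU and METEOR are computed over the full
# corpus rather than averaged per sentence. The paths are placeholders.
from nlgeval import compute_metrics

metrics_dict = compute_metrics(hypothesis='examples/hyp.txt',
                               references=['examples/ref1.txt', 'examples/ref2.txt'])
print(metrics_dict)
```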
For future reference:
Instead of averaging the sentence-level BLEU scores (i.e., macro-averaging the precision), the original BLEU metric (Papineni et al., 2002) uses the micro-averaged precision (i.e., summing the n-gram match numerators and denominators over all hypothesis-reference(s) pairs before dividing).
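To see the difference with made-up numbers: if hypothesis 1 matches 3 of its 4 unigrams and hypothesis 2 matches 1 of its 6, the macro average is (3/4 + 1/6) / 2 ≈ 0.458, while the micro average is (3 + 1) / (4 + 6) = 0.4. The sketch below uses NLTK (not this repository's scorers) to show the same effect on the two sentences from the reproduction script:

```python
# Micro- vs macro-averaged BLEU on the two example sentences, using NLTK.
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

hyps = [
    "this is the model generated sentence1 which seems good enough".split(),
    "this is sentence2 which has been generated by your model".split(),
]
refs = [
    ["this is one reference sentence for sentence1".split(),
     "this is one more reference sentence for sentence1".split()],
    ["this is a reference sentence for sentence2 which was generated by your model".split(),
     "this is the second reference sentence for sentence2".split()],
]

# Macro average: score each sentence separately, then take the mean.
macro = sum(sentence_bleu(r, h, smoothing_function=smooth)
            for r, h in zip(refs, hyps)) / len(hyps)

# Micro average (corpus BLEU): pool the n-gram counts first, divide once.
micro = corpus_bleu(refs, hyps, smoothing_function=smooth)

print(macro, micro)  # these generally differ
```

The same reasoning applies to METEOR's corpus-level aggregation, which is why hand-averaging the per-sentence scores does not reproduce the `compute_metrics` output.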