Hello. I reran the GEC-PD experiment with the provided data and code in the repo. However, the results I got were lower than what is reported in the repo.
Here are several possible reasons I suspect may have led to the performance gap:
Choice of the best model for generating predictions on the test sets and for evaluation (calculating precision / recall / $F_{0.5}$). I used the best checkpoint from training (`checkpoint_best.pt`, generated by fairseq), but the sample code in the repo uses `checkpoint3.pt`. Why is that?
ERRANT version. I used `errant==2.3.0`.
Random seeds. I used `[10, 20, 30]` and took the average.
Since the evaluation script has not been released in the repo, I am not sure how the trained models in the paper were evaluated. Could you kindly provide more details, such as releasing the evaluation script? For reference, my own generation and scoring pipeline is sketched below.
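This is only a rough sketch of what I ran, not the repo's script. The data directory, language-pair names, and file names are placeholders for my local setup, and any tokenization / BPE pre- and post-processing is omitted:

```bash
# Hypothetical paths: adjust to the repo's actual data layout.
DATA_BIN=data-bin/gec_pd            # assumed binarized fairseq data directory
CKPT=checkpoints/checkpoint_best.pt # the checkpoint I evaluated
SRC=test.src                        # raw source sentences, one per line
REF_M2=test.ref.m2                  # gold ERRANT M2 file for the test set

# 1. Generate corrections with the chosen checkpoint.
fairseq-interactive "$DATA_BIN" \
    --path "$CKPT" \
    --source-lang src --target-lang tgt \
    --beam 5 --buffer-size 1024 --batch-size 32 \
    --input "$SRC" > gen.out

# 2. Extract the hypothesis lines (H-<id>\t<score>\t<text>) in input order.
grep '^H-' gen.out | sort -V | cut -f3- > test.hyp

# 3. Score with ERRANT (errant==2.3.0): align hypotheses, then compare to gold.
errant_parallel -orig "$SRC" -cor test.hyp -out test.hyp.m2
errant_compare -hyp test.hyp.m2 -ref "$REF_M2"
```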
Thank you very much.
Apologies for the late reply. I think the main reason is the choice of checkpoint. We select the best checkpoint based on validation-set performance in terms of the F0.5 score, not the training loss. In our experiments, checkpoint3.pt gave the best validation F0.5 score, so please try using that checkpoint. The other reasons should not cause a problem.
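Roughly, the selection procedure looks like the sketch below. This is only an illustration, not the exact script from the paper; the data paths, language-pair names, and generation settings are placeholders:

```bash
# Sweep the per-epoch checkpoints and score each one on the validation set.
DATA_BIN=data-bin/gec_pd        # assumed binarized data directory
VALID_SRC=valid.src             # raw validation source sentences
VALID_REF_M2=valid.ref.m2       # gold ERRANT M2 file for the validation set

for ckpt in checkpoints/checkpoint[0-9]*.pt; do
    fairseq-interactive "$DATA_BIN" \
        --path "$ckpt" \
        --source-lang src --target-lang tgt \
        --beam 5 --buffer-size 1024 --batch-size 32 \
        --input "$VALID_SRC" > valid.gen.out

    # Recover hypotheses in input order, then score with ERRANT.
    grep '^H-' valid.gen.out | sort -V | cut -f3- > valid.hyp
    errant_parallel -orig "$VALID_SRC" -cor valid.hyp -out valid.hyp.m2

    echo "== $ckpt =="
    errant_compare -hyp valid.hyp.m2 -ref "$VALID_REF_M2"
done
```

The checkpoint with the highest validation F0.5 printed by `errant_compare` is the one we then use on the test sets; in our runs that was checkpoint3.pt.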