Rerun results lower than what's reported #4

Closed

Chunngai opened this issue Aug 3, 2022 · 2 comments

Comments

@Chunngai commented Aug 3, 2022

Hello. I reran the GEC-PD experiment with the provided data and code in the repo. However, the results I got were lower than what is reported in the repo.

Results reported in the repo (Precision | Recall | F0.5):

S0: 41.48 | 21.44 | 34.94
S1: 31.11 | 19.37 | 27.74
G0: 42.41 | 23.01 | 36.29
G1: 32.00 | 23.28 | 29.77

S avg: 36.30 | 20.40 | 31.34
G avg: 37.21 | 23.15 | 33.03

Rerun results (Precision | Recall | F0.5):

S0: 38.54 | 19.10 | 31.99
S1: 30.33 | 18.09 | 26.69
G0: 42.38 | 21.19 | 35.30
G1: 32.06 | 21.50 | 29.17

S avg: 34.43 | 18.60 | 29.34
G avg: 37.22 | 21.35 | 32.24

Environment:

  • OS: Ubuntu 18.04.1, 64-bit
  • Python version: 3.7.11
  • PyTorch version: 1.7.1
  • CUDA version: 11.2

Here are several possible reasons that I guess may have led to the performance gap:

  1. Choice of the best model for generating predictions on the test sets and for evaluation (calculating precision / recall / $F_{0.5}$). I used the best checkpoint during training (checkpoint_best.pt generated by fairseq), but the sample code in the repo uses checkpoint3.pt. Why is that?

  2. ERRANT version. I used errant==2.3.0.

  3. Random seeds. I used [10, 20, 30] and took the average.

Since the evaluation script was not released in the repo, I am not sure how the trained models in the paper were evaluated. Could you kindly provide more details, e.g., by releasing the evaluation script?
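
For reference, here is a minimal sketch of how I scored the reruns with errant==2.3.0; the file names (source.txt, hyp.txt, ref.m2) are placeholders for my local paths:

```python
# Minimal scoring sketch, assuming errant==2.3.0 is installed and source.txt /
# hyp.txt / ref.m2 are placeholder names for the source sentences, the model
# predictions, and the gold M2 reference.
import subprocess

# Align source and hypothesis sentences and extract edits into an M2 file.
subprocess.run(
    ["errant_parallel", "-orig", "source.txt", "-cor", "hyp.txt", "-out", "hyp.m2"],
    check=True,
)

# Compare hypothesis edits against the gold edits; this prints span-level
# precision, recall and F0.5 (ERRANT's default beta is 0.5).
subprocess.run(
    ["errant_compare", "-hyp", "hyp.m2", "-ref", "ref.m2"],
    check=True,
)
```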

Thank you very much.

@Chunngai changed the title from "Reran results lower than what's reported" to "Rerun results lower than what's reported" on Aug 3, 2022
@MichaelCaohn (Collaborator) commented

Hi,

Apologies for the late reply. I think the main reason is the choice of checkpoint: we choose the best checkpoint based on validation-set performance in terms of F0.5 score, not the training loss. In our experiments, checkpoint3.pt gives the best validation F0.5 score, so you could try using checkpoint3.pt. The other reasons should not cause a problem.
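
As a rough illustration (not our actual script), selecting the checkpoint by validation F0.5 boils down to something like the sketch below, assuming you already have ERRANT's span-level TP/FP/FN counts on the validation set for each checkpoint (the counts shown are made-up placeholders):

```python
# Rough sketch only: pick the checkpoint with the best validation F0.5 instead
# of the one with the lowest training loss. The TP/FP/FN counts per checkpoint
# are assumed to come from running ERRANT on the validation predictions; the
# numbers below are made-up placeholders.

def f_beta(tp, fp, fn, beta=0.5):
    """F_beta from span-level counts: P = tp/(tp+fp), R = tp/(tp+fn)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

# checkpoint -> (TP, FP, FN) on the validation set (placeholder values).
val_counts = {
    "checkpoint1.pt": (100, 220, 300),
    "checkpoint2.pt": (120, 210, 280),
    "checkpoint3.pt": (140, 200, 260),
}

best = max(val_counts, key=lambda ckpt: f_beta(*val_counts[ckpt]))
print("Best checkpoint by validation F0.5:", best)
```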

@Chunngai (Author) commented

OK, I'll try it. Thank you ^w^
