Rerun results lower than what's reported #4

Closed

Chunngai opened this issue Aug 3, 2022 · 2 comments

Comments

@Chunngai commented Aug 3, 2022

Hello. I reran the GEC-PD experiment with the provided data and code in the repo. However, the results I got were lower than what is reported in the repo.

Results reported in the repo (Precision | Recall | F0.5):

S0: 41.48 | 21.44 | 34.94
S1: 31.11 | 19.37 | 27.74
G0: 42.41 | 23.01 | 36.29
G1: 32.00 | 23.28 | 29.77

S avg: 36.30 | 20.40 | 31.34
G avg: 37.21 | 23.15 | 33.03

Rerun results (Precision | Recall | F0.5):

S0: 38.54 | 19.10 | 31.99
S1: 30.33 | 18.09 | 26.69
G0: 42.38 | 21.19 | 35.30
G1: 32.06 | 21.50 | 29.17

S avg: 34.43 | 18.60 | 29.34
G avg: 37.22 | 21.35 | 32.24

Environment:

  • OS: Ubuntu 18.04.1, 64-bit
  • Python version: 3.7.11
  • PyTorch version: 1.7.1
  • CUDA version: 11.2

Here are several possible reasons that I guess may have led to the performance gap:

  1. Choice of the best model for generating predictions on the test sets and for evaluation (calculating precision / recall / $F_{0.5}$). I used the best checkpoint during training (checkpoint_best.pt generated by fairseq), but the sample code in the repo uses checkpoint3.pt. Why is that?

  2. ERRANT version. I used errant==2.3.0.

  3. Random seeds. I used [10, 20, 30] and took the average.

Since the evaluation script was not released in the repo, I am not sure how the trained models in the paper were evaluated. Could you kindly provide more details, e.g., by releasing the evaluation script?
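
For reference, here is a minimal sketch of how I scored the reruns with errant==2.3.0; the file names (source.txt, hyp.txt, ref.m2) are placeholders for my local paths:

```python
# Minimal scoring sketch, assuming errant==2.3.0 is installed and source.txt /
# hyp.txt / ref.m2 are placeholder names for the source sentences, the model
# predictions, and the gold M2 reference.
import subprocess

# Align source and hypothesis sentences and extract edits into an M2 file.
subprocess.run(
    ["errant_parallel", "-orig", "source.txt", "-cor", "hyp.txt", "-out", "hyp.m2"],
    check=True,
)

# Compare hypothesis edits against the gold edits; this prints span-level
# precision, recall and F0.5 (ERRANT's default beta is 0.5).
subprocess.run(
    ["errant_compare", "-hyp", "hyp.m2", "-ref", "ref.m2"],
    check=True,
)
```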

Thank you very much.

@Chunngai changed the title from "Reran results lower than what's reported" to "Rerun results lower than what's reported" on Aug 3, 2022
@MichaelCaohn (Collaborator) commented

Hi,

Apologies for the late reply. I think the main reason is the choice of checkpoint: we choose the best checkpoint based on validation-set performance in terms of F0.5 score, not the training loss. In our experiments, checkpoint3.pt gives the best validation F0.5 score, so you could try using checkpoint3.pt. The other reasons should not cause a problem.
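
As a rough illustration (not our actual script), selecting the checkpoint by validation F0.5 boils down to something like the sketch below, assuming you already have ERRANT's span-level TP/FP/FN counts on the validation set for each checkpoint (the counts shown are made-up placeholders):

```python
# Rough sketch only: pick the checkpoint with the best validation F0.5 instead
# of the one with the lowest training loss. The TP/FP/FN counts per checkpoint
# are assumed to come from running ERRANT on the validation predictions; the
# numbers below are made-up placeholders.

def f_beta(tp, fp, fn, beta=0.5):
    """F_beta from span-level counts: P = tp/(tp+fp), R = tp/(tp+fn)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

# checkpoint -> (TP, FP, FN) on the validation set (placeholder values).
val_counts = {
    "checkpoint1.pt": (100, 220, 300),
    "checkpoint2.pt": (120, 210, 280),
    "checkpoint3.pt": (140, 200, 260),
}

best = max(val_counts, key=lambda ckpt: f_beta(*val_counts[ckpt]))
print("Best checkpoint by validation F0.5:", best)
```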

@Chunngai (Author) commented

OK, I'll try it. Thank you ^w^
