Reproduction of experimental results #2 (Closed)

SunnyMarkLiu opened this issue Nov 29, 2019 · 7 comments

SunnyMarkLiu commented Nov 29, 2019

First of all, thanks for sharing this clean, object-oriented code! I have learned a lot from this repo. I even want to say: wow, you can really code! ^_^

I trained the model on the CoNLL04 dataset with the default configuration, following the README, and the test results are as follows:

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                 Org        79.43        83.84        81.57          198
                 Loc        91.51        90.87        91.19          427
               Other        76.61        71.43        73.93          133
                Peop        92.17        95.33        93.72          321

               micro        87.70        88.51        88.10         1079
               macro        84.93        85.37        85.10         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        74.36        61.70        67.44           94
                Live        74.04        77.00        75.49          100

               micro        72.84        68.01        70.34          422
               macro        73.72        69.35        71.30          422

With NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        73.08        60.64        66.28           94
                Live        74.04        77.00        75.49          100

               micro        72.59        67.77        70.10          422
               macro        73.46        69.14        71.07          422

The test results are worse than those in the original paper, especially the macro-average metrics.

Is it possible that the random seed makes the difference? I just set seed=42 in example_train.conf.

Thanks!

markus-eberts (Member) commented Nov 29, 2019

Hi,
thanks for your compliments :).

The F1 scores vary by about 2% (we could probably do better with some more hyperparameter tuning) because of several factors such as random initialization and negative sampling. That's why we report the average of 5 runs with random seeds in the paper. Also, we retrained the model on the combined train and dev set after hyperparameter tuning ('datasets/conll04/conll04_train_dev.json').

Could you please report the average of 5 runs with different random seeds, trained on train+dev?
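
For reference, a rough sketch of what those 5 runs could look like (the spert.py train call and the per-seed config files are assumptions for illustration, not the repo's documented interface):

```python
# Rough sketch: train with 5 different seeds on train+dev, then average the
# reported test F1 scores by hand. The "spert.py train" invocation and the
# per-seed config files are illustrative assumptions -- adapt them to the
# repo's actual CLI and config handling.
import subprocess

SEEDS = [1, 2, 3, 4, 5]

for seed in SEEDS:
    # Hypothetical per-seed config: same settings as example_train.conf, but
    # with a different seed and the training data pointing to
    # datasets/conll04/conll04_train_dev.json.
    config = f"configs/conll04_train_dev_seed{seed}.conf"
    subprocess.run(["python", "spert.py", "train", "--config", config], check=True)

# The paper reports the mean of the 5 runs' test scores, not the best run.
```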

SunnyMarkLiu (Author) commented Nov 29, 2019

I retrained on train+dev with the same random seed as before, and the test results are much better now: close to, or in some cases better than, the numbers reported in the paper.

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                Peop        92.79        96.26        94.50          321
                 Loc        91.36        91.57        91.46          427
               Other        80.00        72.18        75.89          133
                 Org        80.09        85.35        82.64          198

               micro        88.37        89.43        88.90         1079
               macro        86.06        86.34        86.12         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

With NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

I think the average of 5 runs with different random seeds, trained on train+dev, should match the paper's results. Thanks again!

@JackySnake

I am trying to reproduce this work and have a couple of questions.
What is the role of the seeds?
The CoNLL04 results in the paper come from a model trained on the train+dev dataset, correct?

@markus-eberts (Member)

I'm not sure what you mean by "role of seeds". By using a random seed, we ensure that weights are initialized differently in each run (also things like random sampling depend on the seed). Yes, we train the final model on the train+dev dataset. This is a common thing to do after hyperparameter tuning.
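
To illustrate in generic PyTorch terms (this is not the repo's exact seeding code):

```python
# Generic PyTorch-style seeding: the seed fixes weight initialization and any
# random/negative sampling, so two runs with the same seed start from the same
# weights and draw the same samples, while different seeds give different runs.
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    random.seed(seed)                 # Python-level sampling (e.g. negative sampling)
    np.random.seed(seed)              # NumPy-based shuffling/sampling
    torch.manual_seed(seed)           # weight initialization, dropout masks
    torch.cuda.manual_seed_all(seed)  # GPU RNGs, if CUDA is used


set_seed(42)
layer = torch.nn.Linear(768, 2)  # same seed -> identical initial weights every run
```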

@JackySnake

> I'm not sure what you mean by "role of seeds". By using a random seed, we ensure that weights are initialized differently in each run (also things like random sampling depend on the seed). Yes, we train the final model on the train+dev dataset. This is a common thing to do after hyperparameter tuning.

Thanks for your reply.
I just didn't understand the effect of the seeds on the method's performance. From your reply, I take it that the seed does not change the method itself; it only controls weight initialization and sampling.
I have always trained the model only on the train dataset. If the model is trained on train+dev, I worried that data might leak into the evaluation, because the code reports performance on the dev set during training. I have tried evaluating the provided model on the test set, and the performance is better than in the paper.
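
For reference, a sketch of how such a test-set evaluation might be invoked (the eval call and config name are assumptions, not necessarily the repo's exact interface):

```python
# Sketch: evaluate a trained (or the provided) model. The "spert.py eval" call
# and the config name are assumptions; the eval config's dataset path would
# need to point at the CoNLL04 test split rather than the dev split.
import subprocess

subprocess.run(
    ["python", "spert.py", "eval", "--config", "configs/example_eval.conf"],
    check=True,
)
```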

PS: This is excellent work and the code is very good. I have been studying it for weeks!

@markus-eberts (Member)

Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (...and due to random weight initialization and sampling the performance varies between runs). That's why you get a better performance compared to the results we reported in our paper.
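
As a toy illustration of the difference (placeholder numbers, not real results):

```python
# Toy numbers only -- not actual results from this repo. Shows why a released
# "best of 5" checkpoint scores higher than the 5-run mean reported in the paper.
from statistics import mean

run_f1 = [70.1, 71.3, 70.8, 72.4, 71.0]  # placeholder F1 scores for 5 seeds
print(f"mean of 5 runs (paper):          {mean(run_f1):.2f}")
print(f"best of 5 runs (released model): {max(run_f1):.2f}")
```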

Thanks :)!

@JackySnake

> Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (...and due to random weight initialization and sampling the performance varies between runs). That's why you get a better performance compared to the results we reported in our paper.
>
> Thanks :)!

I understand. Thanks a lot.
