Reproduction of experimental results #2 (Closed)

SunnyMarkLiu opened this issue Nov 29, 2019 · 7 comments

SunnyMarkLiu commented Nov 29, 2019

First of all, thanks for sharing this clean, object-oriented code! I have learned a lot from this repo. I even want to say: wow, you can really code! ^_^

I trained the model on the CoNLL04 dataset with the default configuration, following the README, and the test results are as follows:

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                 Org        79.43        83.84        81.57          198
                 Loc        91.51        90.87        91.19          427
               Other        76.61        71.43        73.93          133
                Peop        92.17        95.33        93.72          321

               micro        87.70        88.51        88.10         1079
               macro        84.93        85.37        85.10         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        74.36        61.70        67.44           94
                Live        74.04        77.00        75.49          100

               micro        72.84        68.01        70.34          422
               macro        73.72        69.35        71.30          422

With NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        73.08        60.64        66.28           94
                Live        74.04        77.00        75.49          100

               micro        72.59        67.77        70.10          422
               macro        73.46        69.14        71.07          422

The test results are worse than those in the original paper, especially the macro-average metrics.

Is it possible that the random seed makes the difference? I just set seed=42 in example_train.conf.

Thanks!

markus-eberts (Member) commented Nov 29, 2019

Hi,
thanks for your compliments :).

The F1 scores vary by about 2% (we could probably do better with some more hyperparameter tuning) because of several factors such as random initialization and negative sampling. That's why we report the average of 5 runs with random seeds in the paper. Also, we retrained the model on the combined train and dev set after hyperparameter tuning ('datasets/conll04/conll04_train_dev.json').

Could you please report the average of 5 runs with different random seeds, trained on train+dev?
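
For reference, a rough sketch of what those 5 runs could look like (the spert.py train call and the per-seed config files are assumptions for illustration, not the repo's documented interface):

```python
# Rough sketch: train with 5 different seeds on train+dev, then average the
# reported test F1 scores by hand. The "spert.py train" invocation and the
# per-seed config files are illustrative assumptions -- adapt them to the
# repo's actual CLI and config handling.
import subprocess

SEEDS = [1, 2, 3, 4, 5]

for seed in SEEDS:
    # Hypothetical per-seed config: same settings as example_train.conf, but
    # with a different seed and the training data pointing to
    # datasets/conll04/conll04_train_dev.json.
    config = f"configs/conll04_train_dev_seed{seed}.conf"
    subprocess.run(["python", "spert.py", "train", "--config", config], check=True)

# The paper reports the mean of the 5 runs' test scores, not the best run.
```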

SunnyMarkLiu (Author) commented Nov 29, 2019

I retrained on train+dev with the same random seed as before, and the test results are much better now: close to, or in some cases better than, the numbers reported in the paper.

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                Peop        92.79        96.26        94.50          321
                 Loc        91.36        91.57        91.46          427
               Other        80.00        72.18        75.89          133
                 Org        80.09        85.35        82.64          198

               micro        88.37        89.43        88.90         1079
               macro        86.06        86.34        86.12         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

With NER
                type    precision       recall     f1-score      support
               LocIn        78.38        61.70        69.05           94
                Work        66.67        63.16        64.86           76
               OrgBI        65.45        68.57        66.98          105
                Live        71.93        82.00        76.64          100
                Kill        87.23        87.23        87.23           47

               micro        72.18        71.33        71.75          422
               macro        73.93        72.53        72.95          422

I think the average of 5 runs with different random seeds, trained on train+dev, should match the paper's results. Thanks again!

@JackySnake

I am trying to reproduce this work and have a couple of questions.
What is the role of the seeds?
The CoNLL04 results in the paper come from a model trained on the train+dev dataset, correct?

@markus-eberts (Member)

I'm not sure what you mean by "role of seeds". By using a random seed, we ensure that weights are initialized differently in each run (also things like random sampling depend on the seed). Yes, we train the final model on the train+dev dataset. This is a common thing to do after hyperparameter tuning.
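
To illustrate in generic PyTorch terms (this is not the repo's exact seeding code):

```python
# Generic PyTorch-style seeding: the seed fixes weight initialization and any
# random/negative sampling, so two runs with the same seed start from the same
# weights and draw the same samples, while different seeds give different runs.
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    random.seed(seed)                 # Python-level sampling (e.g. negative sampling)
    np.random.seed(seed)              # NumPy-based shuffling/sampling
    torch.manual_seed(seed)           # weight initialization, dropout masks
    torch.cuda.manual_seed_all(seed)  # GPU RNGs, if CUDA is used


set_seed(42)
layer = torch.nn.Linear(768, 2)  # same seed -> identical initial weights every run
```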

@JackySnake

> I'm not sure what you mean by "role of seeds". By using a random seed, we ensure that weights are initialized differently in each run (also things like random sampling depend on the seed). Yes, we train the final model on the train+dev dataset. This is a common thing to do after hyperparameter tuning.

Thanks for your reply.
I just didn't understand the effect of the seeds on the method's performance. From your reply, I take it that the seed does not change the method itself; it only controls weight initialization and sampling.
I have always trained the model only on the train dataset. If the model is trained on train+dev, I worried that data might leak into the evaluation, because the code reports performance on the dev set during training. I have tried evaluating the provided model on the test set, and the performance is better than in the paper.
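
For reference, a sketch of how such a test-set evaluation might be invoked (the eval call and config name are assumptions, not necessarily the repo's exact interface):

```python
# Sketch: evaluate a trained (or the provided) model. The "spert.py eval" call
# and the config name are assumptions; the eval config's dataset path would
# need to point at the CoNLL04 test split rather than the dev split.
import subprocess

subprocess.run(
    ["python", "spert.py", "eval", "--config", "configs/example_eval.conf"],
    check=True,
)
```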

PS: This is excellent work and the code is very good. I have been studying it for weeks!

@markus-eberts (Member)

Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (...and due to random weight initialization and sampling the performance varies between runs). That's why you get a better performance compared to the results we reported in our paper.
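
As a toy illustration of the difference (placeholder numbers, not real results):

```python
# Toy numbers only -- not actual results from this repo. Shows why a released
# "best of 5" checkpoint scores higher than the 5-run mean reported in the paper.
from statistics import mean

run_f1 = [70.1, 71.3, 70.8, 72.4, 71.0]  # placeholder F1 scores for 5 seeds
print(f"mean of 5 runs (paper):          {mean(run_f1):.2f}")
print(f"best of 5 runs (released model): {max(run_f1):.2f}")
```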

Thanks :)!

@JackySnake

> Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (...and due to random weight initialization and sampling the performance varies between runs). That's why you get a better performance compared to the results we reported in our paper.
>
> Thanks :)!

I understand. Thanks a lot.
