
Why is the fine-tuning performance much lower than the benchmark in the paper? #8

Closed
lokeaichirou opened this issue May 6, 2021 · 8 comments

Comments

@lokeaichirou

Hi @ArrowLuo, I am fine-tuning the model on the captioning downstream task, but its evaluation performance is much lower than the benchmark in the paper. I set the number of epochs to 10 and the batch size to 16 (the same for validation), and my best validation scores are: BLEU_1: 0.3759, BLEU_2: 0.2398, BLEU_3: 0.1576, BLEU_4: 0.1069, METEOR: 0.1682, ROUGE_L: 0.3916, CIDEr: 1.2186.
Could this be because I removed distributed training from the code? I kept running into distributed-computation issues on Colab, and I used batch size 16 because a larger batch size caused out-of-memory errors.
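For reference, this is roughly how I score the generated captions (a minimal sketch using the standard pycocoevalcap toolkit; the dictionary layout is my own assumption and not necessarily how this repo's evaluation code is organized):

```python
# Sketch: score generated captions against references with pycocoevalcap.
# references:  {segment_id: ["reference caption 1", ...]}
# hypotheses:  {segment_id: ["generated caption"]}
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(references, hypotheses):
    scorers = [
        (Bleu(4), ["BLEU_1", "BLEU_2", "BLEU_3", "BLEU_4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE_L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(name, list):  # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results
```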

@ArrowLuo
Contributor

ArrowLuo commented May 6, 2021

Hi, I guess your results were obtained on YouCookII. If you run on youcookii_data.no_transcript.pickle directly, the scores you got are as expected under your hyperparameters. Our best scores were generated with the transcript; youcookii_data.no_transcript.pickle is a version without the transcript. See the readme for the following information:

If using video only as input (youcookii_data.no_transcript.pickle),
The results are close to
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725

You should compare your scores with the third line from the bottom of Table 3 in our paper.
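If you want to double-check which data file you actually loaded, one quick way is to open the pickle and look at its keys (the schema below is only a guess for illustration; print whatever fields your copy really contains):

```python
# Peek into the YouCookII data pickle to see whether a transcript field is present.
import pickle

with open("youcookii_data.no_transcript.pickle", "rb") as f:
    data = pickle.load(f)

print(type(data), len(data))
sample_key = next(iter(data))
sample = data[sample_key]
# If each entry is a dict, its keys show whether transcript text is included.
print(sample_key, list(sample.keys()) if isinstance(sample, dict) else type(sample))
```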

@lokeaichirou
Author


Hi @ArrowLuo, thanks for the information. In Table 3 of the paper, the scores for video-only (V) input are: B-3: 16.46, B-4: 11.17, M: 17.57, R-L: 40.09, CIDEr: 1.27. These are much larger than: BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725. May I ask whether the paper's scores were obtained with a better hyperparameter setting, or something else?

@ArrowLuo
Contributor

ArrowLuo commented May 7, 2021

Sorry for the confusion. These metrics are printed as raw values but reported as percentages in the paper (except CIDEr). So your scores are right.

@lokeaichirou
Author


Hi @ArrowLuo, so the metrics printed by the program are correct. May I ask whether all the metrics in Table 3 of the paper are scaled this way, i.e., for every model in the table? And how is the scaling done? I don't have a good sense of performance from the raw values (BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049), since I'd like to compare them with those in the paper. Many thanks~

@ArrowLuo
Contributor

ArrowLuo commented May 7, 2021

There is no normalization operation. Just multiply these metrics by 100 (except CIDEr).
For example, if the printed output is:

BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049

then the scores after multiplying by 100 are:

BLEU_1: 39.21, BLEU_2: 25.22, BLEU_3: 16.55, BLEU_4: 11.17, METEOR: 17.69, ROUGE_L: 40.49
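
In code, the conversion is just a multiplication (trivial sketch):

```python
# Rescale the printed raw metrics to the percentage scale used in the paper.
raw = {"BLEU_1": 0.3921, "BLEU_2": 0.2522, "BLEU_3": 0.1655, "BLEU_4": 0.1117,
       "METEOR": 0.1769, "ROUGE_L": 0.4049, "CIDEr": 1.2725}
reported = {k: (v if k == "CIDEr" else round(v * 100, 2)) for k, v in raw.items()}
print(reported)  # BLEU_4 -> 11.17, METEOR -> 17.69, ROUGE_L -> 40.49; CIDEr stays 1.2725
```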

@lokeaichirou
Author


Hi @ArrowLuo, many thanks, got it! However, I found a mismatch between the metric scores in the UniVL paper and those in the original End-to-End Masked Transformer dense video captioning paper. In their paper, the scores are:
| Method | GT Proposals (B4 / M) | Learned Proposals (B4 / M) |
| --- | --- | --- |
| Bi-LSTM + TempoAttn | 0.87 / 8.15 | 0.08 / 4.62 |
| Our Method | 1.42 / 11.20 | 0.30 / 6.58 |

Meanwhile, Table 3 of the UniVL paper reports B4: 4.38 and M: 11.55 for the E2E masked transformer. Was that result obtained from your own experiments with their released model, using ground-truth proposals during inference?

@ArrowLuo
Contributor

ArrowLuo commented May 8, 2021

This part of the baseline results is copied from Table 4 of https://arxiv.org/pdf/1906.05743.pdf. I notice that it is indeed different from the original paper.

@lokeaichirou
Author


Ok, I see, thanks!
