
Why is the fine-tuning performance much lower than the benchmark in the paper? #8

Closed
lokeaichirou opened this issue May 6, 2021 · 8 comments

Comments

@lokeaichirou

Hi @ArrowLuo, I am fine-tuning the model on the captioning downstream task, but its evaluation performance is much lower than the benchmark in the paper. I set the number of epochs to 10 and the batch size to 16 (the same for validation), and my best validation scores are: BLEU_1: 0.3759, BLEU_2: 0.2398, BLEU_3: 0.1576, BLEU_4: 0.1069, METEOR: 0.1682, ROUGE_L: 0.3916, CIDEr: 1.2186.
Could this be because I removed distributed training from the code? I kept running into distributed-computation issues on Colab, and I used batch size 16 because a larger batch size caused out-of-memory errors.
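For reference, this is roughly how I score the generated captions (a minimal sketch using the standard pycocoevalcap toolkit; the dictionary layout is my own assumption and not necessarily how this repo's evaluation code is organized):

```python
# Sketch: score generated captions against references with pycocoevalcap.
# references:  {segment_id: ["reference caption 1", ...]}
# hypotheses:  {segment_id: ["generated caption"]}
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(references, hypotheses):
    scorers = [
        (Bleu(4), ["BLEU_1", "BLEU_2", "BLEU_3", "BLEU_4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE_L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(name, list):  # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results
```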

@ArrowLuo
Contributor

ArrowLuo commented May 6, 2021

Hi, I guess your results were obtained on YouCookII. If you run on youcookii_data.no_transcript.pickle directly, the scores you got are as expected under your hyperparameters. Our best scores were generated with the transcript; youcookii_data.no_transcript.pickle is a version without the transcript. See the readme for the following information:

If using video only as input (youcookii_data.no_transcript.pickle),
The results are close to
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725

You should compare your scores with the third line from the bottom of Table 3 in our paper.
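If you want to double-check which data file you actually loaded, one quick way is to open the pickle and look at its keys (the schema below is only a guess for illustration; print whatever fields your copy really contains):

```python
# Peek into the YouCookII data pickle to see whether a transcript field is present.
import pickle

with open("youcookii_data.no_transcript.pickle", "rb") as f:
    data = pickle.load(f)

print(type(data), len(data))
sample_key = next(iter(data))
sample = data[sample_key]
# If each entry is a dict, its keys show whether transcript text is included.
print(sample_key, list(sample.keys()) if isinstance(sample, dict) else type(sample))
```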

@lokeaichirou
Author


Hi @ArrowLuo, thanks for the information. In Table 3 of the paper, the scores for video-only (V) input are: B-3: 16.46, B-4: 11.17, M: 17.57, R-L: 40.09, CIDEr: 1.27. These are much larger than: BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725. May I ask whether the paper's scores were obtained with a better hyperparameter setting, or something else?

@ArrowLuo
Contributor

ArrowLuo commented May 7, 2021

Sorry for the confusion. These metrics are printed as raw values but reported as percentages in the paper (except CIDEr). So your scores are right.

@lokeaichirou
Author


Hi @ArrowLuo, so the metrics printed by the program are correct. May I ask whether all the metrics in Table 3 of the paper are scaled this way, i.e., for every model in the table? And how is the scaling done? I don't have a good sense of performance from the raw values (BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049), since I'd like to compare them with those in the paper. Many thanks~

@ArrowLuo
Contributor

ArrowLuo commented May 7, 2021

There is no normalization operation. Just multiply these metrics by 100 (except CIDEr).
For example, if the printed output is:

BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049

then the scores after multiplying by 100 are:

BLEU_1: 39.21, BLEU_2: 25.22, BLEU_3: 16.55, BLEU_4: 11.17, METEOR: 17.69, ROUGE_L: 40.49
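
In code, the conversion is just a multiplication (trivial sketch):

```python
# Rescale the printed raw metrics to the percentage scale used in the paper.
raw = {"BLEU_1": 0.3921, "BLEU_2": 0.2522, "BLEU_3": 0.1655, "BLEU_4": 0.1117,
       "METEOR": 0.1769, "ROUGE_L": 0.4049, "CIDEr": 1.2725}
reported = {k: (v if k == "CIDEr" else round(v * 100, 2)) for k, v in raw.items()}
print(reported)  # BLEU_4 -> 11.17, METEOR -> 17.69, ROUGE_L -> 40.49; CIDEr stays 1.2725
```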

@lokeaichirou
Author


Hi @ArrowLuo, many thanks, got it! However, I found a mismatch between the metric scores in the UniVL paper and those in the original End-to-End Masked Transformer dense video captioning paper. In their paper, the scores are:
| Method | GT Proposals (B4 / M) | Learned Proposals (B4 / M) |
| --- | --- | --- |
| Bi-LSTM + TempoAttn | 0.87 / 8.15 | 0.08 / 4.62 |
| Our Method | 1.42 / 11.20 | 0.30 / 6.58 |

Meanwhile, Table 3 of the UniVL paper reports B4: 4.38 and M: 11.55 for the E2E masked transformer. Was that result obtained from your own experiments with their released model, using ground-truth proposals during inference?

@ArrowLuo
Contributor

ArrowLuo commented May 8, 2021

This part of the baseline results is copied from Table 4 of https://arxiv.org/pdf/1906.05743.pdf. I notice that it is indeed different from the original paper.

@lokeaichirou
Author


Ok, I see, thanks!
