About msrvtt retrieval results #17

zhangliang-04 · 2021-09-01T14:53:10Z

I noticed that the MSRVTT text-to-video retrieval performance under the FT-Joint setting reported in the README is R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0, but the paper reports R@1: 0.206 - R@5: 0.491 - R@10: 0.629 - Median R: 6.0. What accounts for the difference?
Additionally, what should the performance of the FT-Align setting be? It seems to be missing from the README. I tried fine-tuning with the scripts released in the repo, but got a worse score than FT-Joint on MSRVTT.


ArrowLuo · 2021-09-01T16:43:16Z

Hi @zhangliang-04,

Our paper reports results on ‘Training-7K’, which follows the data splits from (Miech et al., 2019). The README, however, reports results on ‘Training-9K’, which follows the data splits from (Gabeur et al., 2020). You can find both files, MSRVTT_train.7k.csv and MSRVTT_train.9k.csv, in our released msrvtt.zip.
Our FT-Align run (‘Training-9K’) used a smaller batch size because of our limited GPUs. As a result, its results on ‘Training-9K’ also show no obvious advantage over FT-Joint. In our experience the fine-tuning hyper-parameters are important, and those for FT-Align may not be the same as those for FT-Joint. You can test on ‘Training-7K’, as reported in our paper.
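
For readers comparing numbers across the two splits, here is a minimal sketch of how R@K and Median R are commonly computed from a text-to-video similarity matrix. It is not the repository's evaluation code. The function name `retrieval_metrics`, the assumption that the ground-truth video for text `i` sits at column `i`, and the optimistic handling of ties are all illustrative assumptions.

```python
import numpy as np


def retrieval_metrics(sim):
    """Text-to-video retrieval metrics from a square similarity matrix.

    Assumption (hypothetical layout): sim[i, j] is the score between
    text i and video j, and the correct video for text i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the ground-truth video for each text query.
    gt = np.diag(sim)[:, None]
    # Rank = 1 + number of videos scoring strictly higher than the
    # ground truth (ties are therefore resolved in the query's favor).
    ranks = (sim > gt).sum(axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Median R": float(np.median(ranks)),
    }
```

Higher R@K and lower Median R are better, so both metrics favor the README numbers over the paper numbers. Because the two sets of results come from models trained on different splits, they are not directly comparable.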

zhangliang-04 closed this as completed Sep 4, 2021
