# Ad-hoc Video Search
We provide frame-level CNN features for the following datasets, which were used by our winning entry for the TRECVID 2018 Ad-hoc Video Search (AVS) task.
- The IACC.3 dataset, which has been the test set for the TRECVID Ad-hoc Video Search (AVS) task since 2016. The dataset contains 4,593 Internet Archive videos (144 GB, 600 hours) with Creative Commons licenses, in MPEG-4/H.264 format, with durations ranging from 6.5 min to 9.5 min and a mean duration of almost 7.8 min. Automated shot boundary detection produced 335,944 shots in total. From each shot we sampled frames uniformly, obtaining 3,845,221 frames in total.
- The MSR-VTT dataset, providing 10K web video clips and 200K natural sentences describing the visual content of the clips, i.e., 20 sentences per clip on average. From each clip we sampled frames uniformly, obtaining 305,462 frames in total.
- The TGIF dataset, containing 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs. From each GIF we sampled frames uniformly, obtaining 1,045,268 frames in total.
- The TRECVID 2016 VTT training set, containing 200 videos and 400 sentences.
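The uniform frame sampling mentioned above can be sketched as follows. Note that the stride value is a hypothetical choice for illustration; the exact sampling rate behind the released features is not specified in this README.

```python
def sample_frames_uniformly(num_frames, stride=15):
    """Return indices of frames taken at a fixed stride from a shot or clip.

    `stride=15` (roughly two frames per second at ~30 fps) is an assumed
    value for illustration only; the actual rate used to produce the
    released features may differ.
    """
    if num_frames <= 0:
        return []
    return list(range(0, num_frames, stride))


# For example, a 60-frame shot sampled with the default stride
# yields the frame indices [0, 15, 30, 45].
print(sample_frames_uniformly(60))
```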
We also provide frame-level CNN features for the datasets used by our winning entry for the TRECVID 2018 Video-to-Text (VTT) Matching and Ranking task.
| Feature | Dimensionality | Datasets (download size) |
| --- | --- | --- |
| ResNeXt-101 | 2,048 | IACC.3 (27GB), MSR-VTT (2GB), TGIF (7GB), MSVD (288M), TV2016VTT-train (42M) |
| ResNet-152 | 2,048 | IACC.3 (26GB), MSR-VTT (2GB), TGIF (7GB), MSVD (283M), TV2016VTT-train (42M) |
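Frame-level features like these are commonly aggregated into a single video-level vector, for example by mean pooling over the sampled frames. A minimal sketch in pure Python (this is a common aggregation strategy, not necessarily the one used by the cited entry):

```python
def mean_pool(frame_features):
    """Average a list of equal-length frame feature vectors into one
    video-level vector.

    frame_features: list of lists, e.g., the 2,048-dim ResNeXt-101 or
    ResNet-152 features for the frames sampled from one shot.
    """
    if not frame_features:
        raise ValueError("need at least one frame feature vector")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Two toy 3-dim "frame features" pooled into one vector.
print(mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
```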
If you find the feature data useful, please consider citing:
- Xirong Li, Jianfeng Dong, Chaoxi Xu, Jing Cao, Xun Wang, Gang Yang. Renmin University of China and Zhejiang Gongshang University at TRECVID 2018: Deep Cross-Modal Embeddings for Video-Text Retrieval. TRECVID Workshop, 2018. [slides]