

PyTorch implementation of MultiModal Transformer (MMT), a method for multimodal (video + subtitle) captioning.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

TVC Dataset and Task

We extended TVR by collecting extra captions for each annotated moment. The resulting dataset, TV show Captions (TVC), is a large-scale multimodal video captioning dataset containing 262K captions paired with 108K moments. We show our annotated captions and model-generated captions below. Similar to TVR, the TVC task requires systems to gather information from both video and subtitles to generate relevant descriptions. (Figure: tvc example)

Method: MultiModal Transformer (MMT)

We designed a MultiModal Transformer (MMT) captioning model that follows the classical encoder-decoder transformer architecture. It takes both video and subtitles as encoder inputs, and the decoder generates the captions.
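To make the "both video and subtitle as encoder inputs" idea concrete, here is a shape-level sketch in plain Python. It assumes the two modalities are joined along the sequence axis and tagged with a modality id; that layout, the function name `build_encoder_input`, and the id values are illustrative assumptions, not the repository's implementation.

```python
# Hypothetical sketch: one plausible way to assemble a multimodal encoder input.
# Nothing here is taken from the repository's code.

def build_encoder_input(video_feats, sub_embeds, ctx_mode="video_sub"):
    """Join video and subtitle sequences along the sequence axis.

    video_feats: list of per-clip vectors (each a list of floats)
    sub_embeds:  list of per-token vectors with the same width
    Returns (sequence, type_ids); type_ids is 0 for video, 1 for subtitle.
    """
    if ctx_mode not in ("video", "sub", "video_sub"):
        raise ValueError("unknown ctx_mode: %r" % (ctx_mode,))

    use_video = ctx_mode in ("video", "video_sub")
    use_sub = ctx_mode in ("sub", "video_sub")

    sequence, type_ids = [], []
    if use_video:
        sequence.extend(video_feats)
        type_ids.extend([0] * len(video_feats))
    if use_sub:
        sequence.extend(sub_embeds)
        type_ids.extend([1] * len(sub_embeds))

    # All positions must share one width so a single encoder can consume them.
    widths = set(len(vec) for vec in sequence)
    if len(widths) > 1:
        raise ValueError("video and subtitle vectors must have the same width")
    return sequence, type_ids
```

With two video clips and one subtitle token, `build_encoder_input` returns a three-position sequence whose `type_ids` are `[0, 0, 1]`. In a real model the vectors would be tensors projected to a shared hidden size.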


Getting started


  1. Clone this repository
git clone --recursive
cd TVCaption
  1. Prepare feature files. Download tvc_feature_release.tar.gz (23GB), then extract it to the data directory.
tar -xf path/to/tvc_feature_release.tar.gz -C data

You should see a video_feature directory under data/tvc_feature_release. It contains the video features (ResNet, I3D, ResNet+I3D), which are the same as the video features we used for TVR/XML. For details on how the features were extracted, read the code: video feature extraction.
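A quick way to confirm the extraction landed where expected is to check for the directory programmatically. The sketch below only tests for the path named above; the helper name `find_video_feature_dir` is hypothetical, and it makes no assumption about the files inside.

```python
import os


def find_video_feature_dir(data_root="data"):
    """Return the video_feature path if it exists under data_root, else None."""
    path = os.path.join(data_root, "tvc_feature_release", "video_feature")
    return path if os.path.isdir(path) else None
```

Calling `find_video_feature_dir()` from the project root returns the path when extraction succeeded and `None` otherwise.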

  1. Install dependencies:
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • h5py
  • tensorboardX
  1. Add project root to PYTHONPATH

Note that you need to do this each time you start a new session.
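The dependency list and the PYTHONPATH step above can be sanity-checked before launching a long job. This is a minimal sketch: the import names in `REQUIRED_MODULES` are the usual ones for the listed packages (PyTorch imports as `torch`), and both helper functions are hypothetical, not part of the repository.

```python
import importlib
import os

# Usual import names for the packages listed above (assumed, not from the repo).
REQUIRED_MODULES = ["torch", "nltk", "easydict", "tqdm", "h5py", "tensorboardX"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def is_on_pythonpath(project_root, pythonpath):
    """Return True if project_root is an entry of the given PYTHONPATH string."""
    root = os.path.abspath(project_root)
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return any(os.path.abspath(p) == root for p in entries)
```

For example, `is_on_pythonpath(os.getcwd(), os.environ.get("PYTHONPATH", ""))` reports whether the current directory was added in this session, and `missing_modules()` lists any packages that still need installing.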

Training and Inference

  1. Build Vocabulary
bash baselines/transformer_captioning/scripts/

Running this command builds the vocabulary file cache/tvc_word2idx.json from the TVC train set.
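For readers unfamiliar with word-to-index vocabularies, the following sketch shows the general idea behind a file like tvc_word2idx.json. It is an illustration only: the whitespace tokenization, the special tokens in `SPECIAL_TOKENS`, the `min_count` threshold, and the function name are assumptions and may differ from what the repository's script does.

```python
from collections import Counter

# Hypothetical special tokens; the repository's tokens and ordering may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]


def build_word2idx(captions, min_count=1):
    """Map each sufficiently frequent word to a unique integer index."""
    counter = Counter()
    for caption in captions:
        # Simplifying assumption: lowercase and split on whitespace.
        counter.update(caption.lower().split())

    word2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent first, ties broken alphabetically, so output is deterministic.
    for word, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and word not in word2idx:
            word2idx[word] = len(word2idx)
    return word2idx
```

The resulting dictionary can be written with `json.dump`. Words below `min_count` are left out and would be mapped to the unknown token at encoding time.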

  1. MMT training
bash baselines/multimodal_transformer/scripts/ CTX_MODE VID_FEAT_TYPE

CTX_MODE is the context to use (video, sub, or video_sub). VID_FEAT_TYPE is the video feature type (resnet, i3d, or resnet_i3d).

Below is an example of training MMT with both video and subtitle, where we use the concatenation of ResNet and I3D features for video.

bash baselines/multimodal_transformer/scripts/ video_sub resnet_i3d
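The resnet_i3d option is described as the concatenation of ResNet and I3D features. A minimal sketch of what that means per clip is shown below, using plain lists and toy dimensions. The function names are hypothetical, and the real features are stored and loaded differently.

```python
def concat_features(resnet_feats, i3d_feats):
    """Concatenate per-clip ResNet and I3D vectors along the feature axis."""
    if len(resnet_feats) != len(i3d_feats):
        raise ValueError("both feature sequences must cover the same clips")
    return [list(r) + list(i) for r, i in zip(resnet_feats, i3d_feats)]


def select_video_features(vid_feat_type, resnet_feats, i3d_feats):
    """Pick the video features for a given VID_FEAT_TYPE value."""
    if vid_feat_type == "resnet":
        return resnet_feats
    if vid_feat_type == "i3d":
        return i3d_feats
    if vid_feat_type == "resnet_i3d":
        return concat_features(resnet_feats, i3d_feats)
    raise ValueError("unknown VID_FEAT_TYPE: %r" % (vid_feat_type,))
```

Each output vector is as wide as the two input vectors combined, while the number of clips stays the same.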

This code loads all the data (~30GB) into RAM to speed up training. Use --no_core_driver to disable this behavior.

Training with the above config stops at around epoch 22, which takes around 3 hours on a single 2080Ti GPU. You should get ~45.0 CIDEr-D and ~10.5 BLEU@4 on the val split. The resulting model and config are saved in a directory matching baselines/multimodal_transformer/results/video_sub-res-*

  1. MMT inference. After training, you can run inference with the saved model on the val or test_public split:
bash baselines/multimodal_transformer/scripts/ MODEL_DIR_NAME SPLIT_NAME

MODEL_DIR_NAME is the name of the directory containing the saved model, e.g., video_sub-res-*. SPLIT_NAME is either val or test_public.
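Because the saved directory name ends in a run-specific suffix, it can be convenient to look up MODEL_DIR_NAME rather than type it. The sketch below picks the most recently modified directory matching the pattern mentioned above. The helper name `latest_model_dir` and the "most recent wins" rule are assumptions made for illustration.

```python
import glob
import os

# Results location as described in the training step above.
RESULTS_DIR = os.path.join("baselines", "multimodal_transformer", "results")


def latest_model_dir(results_dir=RESULTS_DIR, pattern="video_sub-res-*"):
    """Return the name of the newest matching model directory, or None."""
    candidates = [
        path
        for path in glob.glob(os.path.join(results_dir, pattern))
        if os.path.isdir(path)
    ]
    if not candidates:
        return None
    newest = max(candidates, key=os.path.getmtime)
    return os.path.basename(newest)
```

The returned name can then be passed as MODEL_DIR_NAME to the inference script.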

Evaluation and Submission

We only release ground truth for the train and val splits. To get results on the test-public split, please submit your results following the instructions here: standalone_eval/


If you find this code useful for your research, please cite our paper:

  title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
  author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},


This research is supported by grants and awards from NSF, DARPA, ARO and Google.

This code borrows components from the following projects: recurrent-transformer, OpenNMT-py, transformers, and coco-caption. We thank the authors for open-sourcing these great projects!


jielei [at]


[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset




