Video captioning models in PyTorch (work in progress)

This repository contains PyTorch implementations of state-of-the-art video captioning models from 2015 to 2020, evaluated on the MSVD and MSR-VTT datasets. Details are given in the table below.

| Model | Datasets | Paper | Year | Status | Remarks |
|-------|----------|-------|------|--------|---------|
| Mean Pooling | MSVD, MSR-VTT | Translating videos to natural language using deep recurrent neural networks [1] | 2015 | Implemented | No temporal modeling |
| S2VT | MSVD, MSR-VTT | Sequence to Sequence - Video to Text [2] | 2015 | Implemented | Single LSTM as both encoder and decoder |
| SA-LSTM | MSVD, MSR-VTT | Describing videos by exploiting temporal structure [3] | 2015 | Implemented | Good baseline with temporal attention (sketched below) |
| RecNet | MSVD, MSR-VTT | Reconstruction Network for Video Captioning [4] | 2018 | Implemented | Results did not improve over SA-LSTM with either the global or the local reconstruction loss |
| MARN | MSVD, MSR-VTT | Memory-Attended Recurrent Network for Video Captioning [5] | 2019 | Implemented | Memory requirement grows linearly with vocabulary size |
| ORG-TRL | MSVD, MSR-VTT | Object Relational Graph with Teacher-Recommended Learning for Video Captioning [6] | 2020 | In progress | Leverages a GCN for object relational features |

*More recent models will be added in the future.
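
The SA-LSTM remark above refers to soft temporal attention: instead of mean-pooling frame features, the decoder re-weights them at every decoding step. The snippet below is a minimal illustrative sketch of that idea in PyTorch, not code from this repository; the module name, tensor shapes, and attention size are assumptions.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    # Additive (Bahdanau-style) attention over per-frame features.
    # Illustrative only; names and shapes are assumptions, not this repo's API.
    #   feats:  (batch, num_frames, feat_dim)  per-frame CNN features
    #   hidden: (batch, hid_dim)               current decoder LSTM hidden state
    def __init__(self, feat_dim, hid_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)   # (batch, num_frames, 1), sums to 1 over frames
        context = (weights * feats).sum(dim=1)   # weighted frame summary, (batch, feat_dim)
        return context, weights.squeeze(-1)

# Mean Pooling, by contrast, discards temporal structure: context = feats.mean(dim=1)
```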

Environment

  • Ubuntu 18.04
  • CUDA 11.0
  • Nvidia GeForce RTX 2080Ti

Requirements

  • Java 8
  • Python 3.8.5
    • PyTorch 1.7.0
    • Other Python libraries specified in requirements.txt

How to Use

Step 1. Set up a Python virtual environment

$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt

Step 2. Prepare data, paths, and hyperparameter settings

  1. Extract features from the network you want to use and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_APPEARANCE_<NETWORK>_<FRAME_LENGTH>.hdf5. To extract the features yourself, follow the repository here, or simply download the pre-extracted features from the table below and place them in <PROJECT ROOT>/<DATASET>/features/ (a quick way to inspect these files is sketched below).

    | Dataset | Feature Type | Inception-v4 | InceptionResNetV2 | ResNet-101 | ResNeXt-101 |
    |---------|--------------|--------------|-------------------|------------|-------------|
    | MSVD    | Appearance   | link | link | link | -    |
    | MSR-VTT | Appearance   | link | link | link | -    |
    | MSVD    | Motion       | -    | -    | -    | link |
    | MSR-VTT | Motion       | -    | -    | -    | link |
    | MSVD    | Object       | -    | -    | link | -    |
    | MSR-VTT | Object       | -    | -    | link | -    |

You can change hyperparameters by modifying config.py.
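
Once the features are in place, they can be sanity-checked with h5py. The file name below follows the path pattern above but uses a hypothetical frame length, and it assumes the HDF5 file stores one dataset per video id; check config.py and the actual files for the exact layout.

```python
import h5py

# Hypothetical example path following the pattern above (MSVD appearance, Inception-v4).
feature_path = "MSVD/features/MSVD_APPEARANCE_INCEPTIONV4_28.hdf5"

with h5py.File(feature_path, "r") as f:
    video_ids = list(f.keys())      # assumption: one dataset per video id
    feats = f[video_ids[0]][()]     # e.g. an array of shape (num_frames, feat_dim)
    print(len(video_ids), feats.shape)
```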

Step 3. Prepare Evaluation Code

Clone the evaluation code from the official coco-caption repo.

(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption

Alternatively, simply copy the pycocoevalcap folder and its contents into the project root.
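
With pycocoevalcap in the project root, the standard captioning metrics can be computed directly from Python. A minimal sketch with made-up captions (the METEOR scorer is the part that needs Java 8):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map a video id to a list of captions; the values here are made up.
gts = {"vid1": ["a man is playing a guitar", "someone plays the guitar"]}  # references
res = {"vid1": ["a man is playing guitar"]}                                # one hypothesis each

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)   # Bleu(4) returns a list of BLEU-1..BLEU-4 scores
```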

Step 4. Training

Follow the demo given in video_captioning.ipynb.
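
Conceptually, the implemented models are trained with teacher forcing and a per-token cross-entropy loss. The function below is a generic sketch of one such epoch with a hypothetical model interface; it is not the training loop used in this repository, and the padding id and gradient-clipping value are assumptions.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, pad_id=0, clip=5.0):
    """Generic teacher-forced training epoch (illustrative sketch only).

    Assumes model(feats, caption_prefix) -> logits of shape (batch, seq_len, vocab),
    and that `loader` yields (frame features, padded caption token ids).
    """
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)      # ignore <pad> positions
    model.train()
    for feats, captions in loader:
        logits = model(feats, captions[:, :-1])                # feed the gold prefix
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions[:, 1:].reshape(-1))          # predict the next token
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # common for RNN captioners
        optimizer.step()
```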

Step 5. Inference

Follow the demo given in video_captioning.ipynb.
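
At inference time the models generate a caption token by token. The notebook handles decoding itself; the sketch below only illustrates greedy decoding with a hypothetical step_fn interface.

```python
import torch

def greedy_decode(step_fn, state, bos_id, eos_id, max_len=20):
    """Greedy decoding sketch: step_fn(prev_token, state) -> (logits, new_state).

    Illustrative only; beam search is a common drop-in replacement.
    """
    tokens, tok = [], torch.tensor([bos_id])
    for _ in range(max_len):
        logits, state = step_fn(tok, state)
        tok = logits.argmax(dim=-1)      # pick the most likely next word
        if tok.item() == eos_id:         # stop at the end-of-sentence token
            break
        tokens.append(tok.item())
    return tokens
```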

Quantitative Results

*MSVD

| Model | Features | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
|-------|----------|--------|--------|---------|-------|------------|
| Mean Pooling | Inception-v4 | 42.4 | 31.6 | 68.3 | 71.8 | link |
| SA-LSTM | InceptionResNetV2 | 45.5 | 32.5 | 69.0 | 78.0 | link |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4, ResNeXt-101 | 48.5 | 34.4 | 71.4 | 86.4 | link |
| ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |

*MSR-VTT

| Model | Features | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
|-------|----------|--------|--------|---------|-------|------------|
| Mean Pooling | Inception-v4 | 34.9 | 25.5 | 58.12 | 35.76 | link |
| SA-LSTM | Inception-v4 | - | - | - | - | - |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4 | - | - | - | - | - |
| ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |

References

[1] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL-HLT, 2015.

[2] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In ICCV, 2015.

[3] L. Yao et al. Describing Videos by Exploiting Temporal Structure. In ICCV, 2015.

[4] B. Wang et al. Reconstruction Network for Video Captioning. In CVPR, 2018.

[5] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai. Memory-Attended Recurrent Network for Video Captioning. In CVPR, 2019.

[6] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha. Object Relational Graph with Teacher-Recommended Learning for Video Captioning. In CVPR, 2020.

Acknowledgement

Some of the coding ideas were borrowed from hobincar/pytorch-video-feature-extractor. For pre-trained appearance feature extraction I followed this repo, and this repo for 3D motion feature extraction. Many thanks!