Video captioning models in PyTorch (work in progress)

This repository contains PyTorch implementations of state-of-the-art video captioning models from 2015 to 2020, evaluated on the MSVD and MSR-VTT datasets. Details are given in the table below.

| Model | Datasets | Paper | Year | Status | Remarks |
|-------|----------|-------|------|--------|---------|
| Mean Pooling | MSVD, MSR-VTT | Translating videos to natural language using deep recurrent neural networks [1] | 2015 | Implemented | No temporal modeling |
| S2VT | MSVD, MSR-VTT | Sequence to Sequence - Video to Text [2] | 2015 | Implemented | Single LSTM as both encoder and decoder |
| SA-LSTM | MSVD, MSR-VTT | Describing videos by exploiting temporal structure [3] | 2015 | Implemented | Good baseline with temporal attention (sketched below) |
| RecNet | MSVD, MSR-VTT | Reconstruction Network for Video Captioning [4] | 2018 | Implemented | Results did not improve over SA-LSTM with either the global or the local reconstruction loss |
| MARN | MSVD, MSR-VTT | Memory-Attended Recurrent Network for Video Captioning [5] | 2019 | Implemented | Memory requirement grows linearly with vocabulary size |
| ORG-TRL | MSVD, MSR-VTT | Object Relational Graph with Teacher-Recommended Learning for Video Captioning [6] | 2020 | In progress | Leverages a GCN for object relational features |

*More recent models will be added in the future.
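
The SA-LSTM remark above refers to soft temporal attention: instead of mean-pooling frame features, the decoder re-weights them at every decoding step. The snippet below is a minimal illustrative sketch of that idea in PyTorch, not code from this repository; the module name, tensor shapes, and attention size are assumptions.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    # Additive (Bahdanau-style) attention over per-frame features.
    # Illustrative only; names and shapes are assumptions, not this repo's API.
    #   feats:  (batch, num_frames, feat_dim)  per-frame CNN features
    #   hidden: (batch, hid_dim)               current decoder LSTM hidden state
    def __init__(self, feat_dim, hid_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)   # (batch, num_frames, 1), sums to 1 over frames
        context = (weights * feats).sum(dim=1)   # weighted frame summary, (batch, feat_dim)
        return context, weights.squeeze(-1)

# Mean Pooling, by contrast, discards temporal structure: context = feats.mean(dim=1)
```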

Environment

  • Ubuntu 18.04
  • CUDA 11.0
  • Nvidia GeForce RTX 2080Ti

Requirements

  • Java 8
  • Python 3.8.5
    • PyTorch 1.7.0
    • Other Python libraries specified in requirements.txt

How to Use

Step 1. Set up a Python virtual environment

$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt

Step 2. Prepare data, paths, and hyperparameter settings

  1. Extract features from the network you want to use and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_APPEARANCE_<NETWORK>_<FRAME_LENGTH>.hdf5. To extract the features yourself, follow the repository here, or simply download the pre-extracted features from the table below and place them in <PROJECT ROOT>/<DATASET>/features/ (a quick way to inspect these files is sketched below).

    | Dataset | Feature Type | Inception-v4 | InceptionResNetV2 | ResNet-101 | ResNeXt-101 |
    |---------|--------------|--------------|-------------------|------------|-------------|
    | MSVD    | Appearance   | link | link | link | -    |
    | MSR-VTT | Appearance   | link | link | link | -    |
    | MSVD    | Motion       | -    | -    | -    | link |
    | MSR-VTT | Motion       | -    | -    | -    | link |
    | MSVD    | Object       | -    | -    | link | -    |
    | MSR-VTT | Object       | -    | -    | link | -    |

You can change hyperparameters by modifying config.py.
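
Once the features are in place, they can be sanity-checked with h5py. The file name below follows the path pattern above but uses a hypothetical frame length, and it assumes the HDF5 file stores one dataset per video id; check config.py and the actual files for the exact layout.

```python
import h5py

# Hypothetical example path following the pattern above (MSVD appearance, Inception-v4).
feature_path = "MSVD/features/MSVD_APPEARANCE_INCEPTIONV4_28.hdf5"

with h5py.File(feature_path, "r") as f:
    video_ids = list(f.keys())      # assumption: one dataset per video id
    feats = f[video_ids[0]][()]     # e.g. an array of shape (num_frames, feat_dim)
    print(len(video_ids), feats.shape)
```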

Step 3. Prepare Evaluation Code

Clone the evaluation code from the official coco-caption repo.

(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption

Alternatively, simply copy the pycocoevalcap folder and its contents into the project root.
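
With pycocoevalcap in the project root, the standard captioning metrics can be computed directly from Python. A minimal sketch with made-up captions (the METEOR scorer is the part that needs Java 8):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map a video id to a list of captions; the values here are made up.
gts = {"vid1": ["a man is playing a guitar", "someone plays the guitar"]}  # references
res = {"vid1": ["a man is playing guitar"]}                                # one hypothesis each

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)   # Bleu(4) returns a list of BLEU-1..BLEU-4 scores
```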

Step 4. Training

Follow the demo given in video_captioning.ipynb.
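
Conceptually, the implemented models are trained with teacher forcing and a per-token cross-entropy loss. The function below is a generic sketch of one such epoch with a hypothetical model interface; it is not the training loop used in this repository, and the padding id and gradient-clipping value are assumptions.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, pad_id=0, clip=5.0):
    """Generic teacher-forced training epoch (illustrative sketch only).

    Assumes model(feats, caption_prefix) -> logits of shape (batch, seq_len, vocab),
    and that `loader` yields (frame features, padded caption token ids).
    """
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)      # ignore <pad> positions
    model.train()
    for feats, captions in loader:
        logits = model(feats, captions[:, :-1])                # feed the gold prefix
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions[:, 1:].reshape(-1))          # predict the next token
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # common for RNN captioners
        optimizer.step()
```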

Step 5. Inference

Follow the demo given in video_captioning.ipynb.
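
At inference time the models generate a caption token by token. The notebook handles decoding itself; the sketch below only illustrates greedy decoding with a hypothetical step_fn interface.

```python
import torch

def greedy_decode(step_fn, state, bos_id, eos_id, max_len=20):
    """Greedy decoding sketch: step_fn(prev_token, state) -> (logits, new_state).

    Illustrative only; beam search is a common drop-in replacement.
    """
    tokens, tok = [], torch.tensor([bos_id])
    for _ in range(max_len):
        logits, state = step_fn(tok, state)
        tok = logits.argmax(dim=-1)      # pick the most likely next word
        if tok.item() == eos_id:         # stop at the end-of-sentence token
            break
        tokens.append(tok.item())
    return tokens
```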

Quantitative Results

*MSVD

| Model | Features | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
|-------|----------|--------|--------|---------|-------|------------|
| Mean Pooling | Inception-v4 | 42.4 | 31.6 | 68.3 | 71.8 | link |
| SA-LSTM | InceptionResNetV2 | 45.5 | 32.5 | 69.0 | 78.0 | link |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4, ResNeXt-101 | 48.5 | 34.4 | 71.4 | 86.4 | link |
| ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |

*MSR-VTT

| Model | Features | BLEU-4 | METEOR | ROUGE-L | CIDEr | Checkpoint |
|-------|----------|--------|--------|---------|-------|------------|
| Mean Pooling | Inception-v4 | 34.9 | 25.5 | 58.12 | 35.76 | link |
| SA-LSTM | Inception-v4 | - | - | - | - | - |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4 | - | - | - | - | - |
| ORG-TRL | InceptionResNetV2, ResNeXt-101 | - | - | - | - | - |

References

[1] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL-HLT, 2015.

[2] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In ICCV, 2015.

[3] L. Yao et al. Describing Videos by Exploiting Temporal Structure. In ICCV, 2015.

[4] B. Wang et al. Reconstruction Network for Video Captioning. In CVPR, 2018.

[5] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai. Memory-Attended Recurrent Network for Video Captioning. In CVPR, 2019.

[6] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha. Object Relational Graph with Teacher-Recommended Learning for Video Captioning. In CVPR, 2020.

Acknowledgement

Some of the coding ideas were borrowed from hobincar/pytorch-video-feature-extractor. For pre-trained appearance feature extraction I followed this repo, and this repo for 3D motion feature extraction. Many thanks!