VT-TWINS

This repository is the implementation of "Video-Text Representation Learning via Differentiable Weak Temporal Alignment" (CVPR 2022).

Preparation

Requirements

  • Python 3
  • PyTorch (>= 1.0)
  • python-ffmpeg (requires ffmpeg to be installed on the system)
  • pandas
  • numpy
  • tqdm
  • scikit-learn
  • numba (0.53.1)
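
A typical environment can be set up with pip. The PyPI package names below are assumptions based on the list above, and ffmpeg itself must be installed separately on the system:

pip install torch pandas numpy tqdm scikit-learn numba==0.53.1 python-ffmpeg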

Dataset

The annotation files (.csv) for all datasets are in './data'. After downloading the downstream datasets, place the video files as follows:

data
 ├─ downstream
 │   ├─ ucf
 │   │   └─ ucf101
 │   │       ├─ label1
 │   │       │   ├─ video1.mp4
 │   │       │   :
 │   │       :
 │   ├─ hmdb
 │   │   ├─ label1
 │   │   │   ├─ video1.avi
 │   │   │   :
 │   │   :
 │   ├─ youcook
 │   │   ├─ task1
 │   │   │   ├─ video1.mp4
 │   │   │   :
 │   │   :
 │   ├─ msrvtt
 │   │   └─ TestVideo
 │   │       ├─ video1.mp4
 │   │       :
 │   └─ crosstask
 │       └─ videos
 │           ├─ 105222
 │           │   ├─ 4K4PnQ66LQ8.mp4
 │           │   :
 │           :
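
As an optional sanity check (this script is illustrative, not part of the repository), the sketch below verifies that the directories above exist; the paths are taken directly from the tree:

import os

# Check that the downstream dataset folders are in place
# (paths follow the layout above; adjust if you store data elsewhere).
expected_dirs = [
    'data/downstream/ucf/ucf101',
    'data/downstream/hmdb',
    'data/downstream/youcook',
    'data/downstream/msrvtt/TestVideo',
    'data/downstream/crosstask/videos',
]
for d in expected_dirs:
    print(f"{d}: {'OK' if os.path.isdir(d) else 'MISSING'}")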

Pretrained Weight

The pretrained weight of our model, word2vec, and the tokenizer can be found here. Place the pretrained weight of our model in './checkpoint', and word2vec and the tokenizer in './data'.
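
If you only want to inspect the downloaded weight, a minimal sketch is shown below. The 'state_dict' key is an assumption about a typical PyTorch checkpoint layout, not taken from the repository code; the evaluation scripts below load the weight themselves via --pretrain_cnn_path.

import torch

# Illustrative only: peek inside the pretrained checkpoint.
# The 'state_dict' key is assumed; adjust if the checkpoint uses other keys.
ckpt = torch.load('./checkpoint/pretrained.pth.tar', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
print(f'{len(state_dict)} entries in the checkpoint state dict')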

Evaluation

Action Recognition on UCF101

python src/eval_ucf.py --pretrain_cnn_path ./checkpoint/pretrained.pth.tar

Action Recognition on HMDB

python src/eval_hmdb.py --pretrain_cnn_path ./checkpoint/pretrained.pth.tar

Text-to-Video Retrieval on YouCook2

python src/eval_youcook.py --pretrain_cnn_path ./checkpoint/pretrained.pth.tar

Text-to-Video Retrieval on MSRVTT

python src/eval_msrvtt.py --pretrain_cnn_path ./checkpoint/pretrained.pth.tar

Action Step Localization on CrossTask

python src/eval_crosstask.py --pretrain_cnn_path ./checkpoint/pretrained.pth.tar

Citation

@inproceedings{ko2022video,
  title={Video-Text Representation Learning via Differentiable Weak Temporal Alignment},
  author={Ko, Dohwan and Choi, Joonmyung and Ko, Juyeon and Noh, Shinyeong and On, Kyoung-Woon and Kim, Eun-Sol and Kim, Hyunwoo J},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}
