LFB

Long-term feature banks for detailed video understanding

Abstract

To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

Results and Models

AVA2.1

| frame sampling strategy | resolution | gpus | backbone | pretrain | mAP | gpu_mem(M) | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Nonlocal LFB) | Kinetics-400 | 24.11 | 8620 | config | ckpt | log |
| 4x16x1 | raw | 8 | SlowOnly ResNet50 (with Max LFB) | Kinetics-400 | 22.15 | 8425 | config | ckpt | log |

Note:

  1. gpus indicates the number of GPUs used to obtain the checkpoint. According to the Linear Scaling Rule, you should set the learning rate proportional to the total batch size if you use a different number of GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.
  2. We use slowonly_r50_4x16x1 as the backbone of LFB instead of the I3D-R50-NL used in the original paper, and achieve a similar improvement (ours: 20.1 -> 24.05 vs. authors: 22.1 -> 25.8).
  3. Because the long-term features are randomly sampled during testing, the test accuracy may vary slightly between runs.
  4. Before training or testing LFB, you need to infer the feature bank with slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py. For more details on inferring the feature bank, refer to the Train part.
  5. The ROIHead now supports single-label classification (i.e., the network outputs at most one label per actor). This can be enabled by setting multilabel=False during training and setting test_cfg.rcnn.action_thr for testing.
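The Linear Scaling Rule in note 1 can be sketched as a small helper. This is only an illustration: the function name `scaled_lr` is hypothetical and not part of MMAction2, and the defaults simply encode the reference point quoted in the note (lr=0.01 at a total batch size of 4 GPUs x 2 videos = 8).

```python
def scaled_lr(num_gpus, videos_per_gpu, base_lr=0.01, base_batch_size=8):
    """Scale the learning rate linearly with the total batch size.

    The defaults are the reference point from note 1:
    lr=0.01 for 4 GPUs x 2 videos per GPU (total batch size 8).
    `scaled_lr` is a hypothetical helper, not an MMAction2 function.
    """
    total_batch_size = num_gpus * videos_per_gpu
    return base_lr * total_batch_size / base_batch_size
```

With these defaults, `scaled_lr(16, 4)` reproduces the second example in the note (lr=0.08 at a total batch size of 64).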

Train

a. Infer long-term feature bank for training

Before training or testing LFB, you need to infer the long-term feature bank first. Alternatively, you can download a long-term feature bank from AVA_train_val_float32_lfb or AVA_train_val_float16_lfb and put it in lfb_prefix_path, in which case you can skip this step.

Specifically, run the test on the training, validation, or testing dataset with the config file slowonly-lfb_ava-pretrained-r50_infer-4x16x1_ava21-rgb.py. By default, the config file infers only the feature bank of the training dataset; set dataset_mode = 'val' in the config file to infer the feature bank of the validation dataset. The shared head LFBInferHead generates the feature bank.

A long-term feature bank file of the AVA training and validation datasets with float32 precision occupies 3.3 GB. If the features are stored with float16 precision, the feature bank occupies 1.65 GB.
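The relationship between the two sizes follows from the per-value width of each dtype. The sketch below is a back-of-the-envelope estimate only: it assumes the file size is dominated by the raw feature values and ignores any serialization overhead, and `float16_size_gb` is a hypothetical helper name.

```python
import struct

# Standard sizes of IEEE 754 single and half precision values.
BYTES_FLOAT32 = struct.calcsize("<f")  # 4 bytes per value
BYTES_FLOAT16 = struct.calcsize("<e")  # 2 bytes per value


def float16_size_gb(float32_size_gb):
    """Estimate the float16 size of a feature bank stored as float32.

    Assumes the size scales with bytes per value (no container overhead).
    """
    return float32_size_gb * BYTES_FLOAT16 / BYTES_FLOAT32
```

Under that assumption, `float16_size_gb(3.3)` gives the 1.65 GB figure quoted above.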

You can use the following commands to infer the feature banks of the AVA training and validation datasets. The feature banks will be stored in lfb_prefix_path/lfb_train.pkl and lfb_prefix_path/lfb_val.pkl.

# set `dataset_mode = 'train'` in slowonly-lfb-infer_r50_ava21-rgb.py
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

# set `dataset_mode = 'val'` in slowonly-lfb-infer_r50_ava21-rgb.py
python tools/test.py configs/detection/lfb/slowonly-lfb-infer_r50_ava21-rgb.py \
    checkpoints/YOUR_BASELINE_CHECKPOINT.pth

We use the slowonly_r50_4x16x1 checkpoint from slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb to infer the feature bank.
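Since both training and testing depend on the inferred files, it can help to check that they exist before launching a run. The following is a minimal sketch, assuming only the two file names mentioned above; `missing_feature_banks` is a hypothetical helper, not an MMAction2 utility.

```python
from pathlib import Path

# File names taken from the text above.
EXPECTED_FILES = ("lfb_train.pkl", "lfb_val.pkl")


def missing_feature_banks(lfb_prefix_path):
    """Return the expected feature bank files that are absent.

    `lfb_prefix_path` is the directory the feature banks were written to.
    An empty list means both files are present.
    """
    root = Path(lfb_prefix_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling it with the value of lfb_prefix_path from your config returns an empty list when both feature banks are in place, or the names of the files that still need to be inferred or downloaded.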

b. Train LFB

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the LFB model on AVA with a half-precision long-term feature bank.

python tools/train.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
  --seed 0 --deterministic

For more details and information on the optional arguments, refer to the Training part in the Training and Test Tutorial.

Test

a. Infer long-term feature bank for testing

Before testing LFB, you also need to infer the long-term feature bank. If you have already generated the feature bank file, you can skip this step.

The procedure is the same as in the "Infer long-term feature bank for training" part of Train.

b. Test LFB

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the LFB model on AVA with a half-precision long-term feature bank and dump the result to a pkl file.

python tools/test.py configs/detection/lfb/slowonly-lfb-nl_kinetics400-pretrained-r50_8xb12-4x16x1-20e_ava21-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, refer to the Test part in the Training and Test Tutorial.
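The dumped result.pkl can then be inspected offline. The sketch below assumes the file is a standard Python pickle; the layout of the stored object is not described in this README, so the helper only loads it and leaves interpretation to the caller. `load_results` is a hypothetical name, and if the dump contains framework-specific objects (for example tensors), the corresponding library must be installed for unpickling to succeed.

```python
import pickle


def load_results(path):
    """Load a dumped result file, assuming it is a plain pickle.

    Only unpickle files you produced yourself: pickle can execute
    arbitrary code when loading untrusted data.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `results = load_results("result.pkl")` followed by `type(results)` is a reasonable first step to see what structure the dump contains.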

Citation

@inproceedings{gu2018ava,
  title={Ava: A video dataset of spatio-temporally localized atomic visual actions},
  author={Gu, Chunhui and Sun, Chen and Ross, David A and Vondrick, Carl and Pantofaru, Caroline and Li, Yeqing and Vijayanarasimhan, Sudheendra and Toderici, George and Ricco, Susanna and Sukthankar, Rahul and others},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6047--6056},
  year={2018}
}
@inproceedings{wu2019long,
  title={Long-term feature banks for detailed video understanding},
  author={Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={284--293},
  year={2019}
}