X-CLIP

Overview

The X-CLIP model was proposed in Expanding Language-Image Pretrained Models for General Video Recognition by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. X-CLIP is a minimal extension of CLIP for video. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.

The abstract from the paper is the following:

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited.

Tips:

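  • Usage of X-CLIP mirrors CLIP: a processor prepares the text and visual inputs, and the model returns similarity scores between them. The main difference is that the visual input is a sequence of video frames rather than a single image.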
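
As with CLIP, zero-shot classification comes down to comparing one visual embedding against several text embeddings. The following is a minimal, library-free sketch of that scoring step, not the model's actual implementation: the helper names `l2_normalize` and `zero_shot_probs` and the toy 3-dimensional embeddings are made up for illustration, and the default scale is assumed to be `exp(2.6592)`, the initial logit scale used in CLIP-style configurations. X-CLIP additionally conditions the text embeddings on the video through its prompt generator, which this sketch leaves out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(video_embed, text_embeds, logit_scale=math.exp(2.6592)):
    """Softmax over scaled cosine similarities between one video and several prompts.

    Hypothetical helper for illustration only; the default logit_scale is an assumption.
    """
    v = l2_normalize(video_embed)
    logits = []
    for text_embed in text_embeds:
        t = l2_normalize(text_embed)
        logits.append(logit_scale * sum(a * b for a, b in zip(v, t)))
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings: the video vector points mostly along the first prompt's direction.
video_embed = [0.9, 0.1, 0.0]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = zero_shot_probs(video_embed, text_embeds)
print(probs)
```

In practice these probabilities would come from applying a softmax to the model's `logits_per_video` output rather than from hand-written code.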

X-CLIP architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with X-CLIP.

  • Demo notebooks for X-CLIP can be found here.

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

XCLIPProcessor

[[autodoc]] XCLIPProcessor

XCLIPConfig

[[autodoc]] XCLIPConfig
    - from_text_vision_configs
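
To illustrate how the text and vision sub-configurations combine, here is a small sketch that builds a deliberately tiny configuration with `from_text_vision_configs`. All sizes are arbitrary values chosen for illustration, not those of any released checkpoint, and the sketch assumes an installed `transformers` version that still provides this class method.

```python
from transformers import XCLIPConfig, XCLIPTextConfig, XCLIPVisionConfig

# Tiny text encoder settings (illustrative values only).
text_config = XCLIPTextConfig(
    vocab_size=100,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=16,
)

# Tiny vision encoder settings; the mit_* fields size the
# multi-frame integration Transformer.
vision_config = XCLIPVisionConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    mit_hidden_size=64,
    mit_intermediate_size=128,
    mit_num_hidden_layers=1,
    mit_num_attention_heads=4,
    image_size=32,
    patch_size=16,
    num_frames=4,
)

# Extra keyword arguments such as projection_dim are passed on to XCLIPConfig.
config = XCLIPConfig.from_text_vision_configs(
    text_config, vision_config, projection_dim=64
)
print(config.projection_dim, config.vision_config.num_frames)
```

The resulting object holds both sub-configurations as `config.text_config` and `config.vision_config`.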

XCLIPTextConfig

[[autodoc]] XCLIPTextConfig

XCLIPVisionConfig

[[autodoc]] XCLIPVisionConfig

XCLIPModel

[[autodoc]] XCLIPModel
    - forward
    - get_text_features
    - get_video_features
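
The sketch below shows the input and output shapes of a forward pass, using a tiny randomly initialized model so that nothing needs to be downloaded. The configuration values and random inputs are assumptions made for illustration, so the resulting scores carry no meaning. Real use would load pretrained weights with `from_pretrained` and prepare inputs with `XCLIPProcessor`.

```python
import torch
from transformers import XCLIPConfig, XCLIPModel

# Tiny illustrative configuration (not a released checkpoint).
# projection_dim is set equal to mit_hidden_size so that the projected frame
# features fit the multi-frame integration Transformer.
config = XCLIPConfig(
    text_config=dict(
        vocab_size=100, hidden_size=64, intermediate_size=128,
        num_hidden_layers=2, num_attention_heads=4, max_position_embeddings=16,
    ),
    vision_config=dict(
        hidden_size=64, intermediate_size=128, num_hidden_layers=2,
        num_attention_heads=4, mit_hidden_size=64, mit_intermediate_size=128,
        mit_num_hidden_layers=1, mit_num_attention_heads=4,
        image_size=32, patch_size=16, num_frames=4,
    ),
    projection_dim=64,
)
model = XCLIPModel(config).eval()  # random weights

num_videos, num_prompts, num_frames = 1, 3, 4
# Random token ids stand in for tokenized text prompts.
input_ids = torch.randint(3, 100, (num_prompts, 8))
# Videos are passed as (batch, frames, channels, height, width).
pixel_values = torch.randn(num_videos, num_frames, 3, 32, 32)

with torch.no_grad():
    outputs = model(input_ids=input_ids, pixel_values=pixel_values)

# One row per video, one column per text prompt.
probs = outputs.logits_per_video.softmax(dim=1)
print(probs.shape)
```

Each row of `probs` sums to one and ranks the text prompts for the corresponding video.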

XCLIPTextModel

[[autodoc]] XCLIPTextModel
    - forward

XCLIPVisionModel

[[autodoc]] XCLIPVisionModel
    - forward