[Enhance] Support the Training of ActionClip #2620
Merged
ActionCLIP Project
ActionCLIP: A New Paradigm for Video Action Recognition
Abstract
The canonical approach to video action recognition requires a neural model to perform a classic 1-of-N classification task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model the task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to perform zero-shot action recognition without any additional labeled data or parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. It then makes the action recognition task act more like the pre-training problems via prompt engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.
Usage
Setup Environment
Please refer to Installation to install MMAction2.

Run the following command to install `clip`.
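Assuming `clip` here refers to the OpenAI CLIP package (which is distributed from its GitHub repository rather than PyPI), a typical install is:

```shell
pip install git+https://github.com/openai/CLIP.git
```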
Assume that you are located at `$MMACTION2/projects/actionclip`.

Add the current folder to `PYTHONPATH`, so that Python can find your code. Run the following command in the current directory to add it.
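A minimal sketch, assuming a POSIX shell (the export only lasts for the current session, so re-run it in every new shell):

```shell
export PYTHONPATH=$(pwd):$PYTHONPATH
```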
Data Preparation
Prepare the Kinetics400 dataset according to the instruction.
Create a symbolic link from `$MMACTION2/data` to `./data` in the current directory, so that Python can locate your data. Run the following command in the current directory to create the symbolic link.
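Assuming the standard MMAction2 layout, where this project sits two levels below the repository root, the link can be created as:

```shell
ln -s ../../data ./data
```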
Training commands
To train with single GPU:
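For example, with the config file used in the slurm command below:

```shell
mim train mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py
```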
To train with multiple GPUs:
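A typical multi-GPU launch through MIM's PyTorch launcher (8 GPUs assumed for illustration):

```shell
mim train mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py \
    --launcher pytorch --gpus 8
```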
To train with multiple GPUs by slurm:
```shell
mim train mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py --launcher slurm \
    --gpus 8 --gpus-per-node 8 --partition $PARTITION
```
Testing commands
To test with single GPU:
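For example, with `$CHECKPOINT` as a placeholder for your trained weights:

```shell
mim test mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py \
    --checkpoint $CHECKPOINT
```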
To test with multiple GPUs:
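Analogous to multi-GPU training (8 GPUs assumed for illustration):

```shell
mim test mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py \
    --checkpoint $CHECKPOINT --launcher pytorch --gpus 8
```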
To test with multiple GPUs by slurm:
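Mirroring the slurm training command above (`$CHECKPOINT` and `$PARTITION` are placeholders):

```shell
mim test mmaction configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py \
    --checkpoint $CHECKPOINT --launcher slurm \
    --gpus 8 --gpus-per-node 8 --partition $PARTITION
```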
Results
Kinetics400
[1] The models are ported from the ActionCLIP repo and tested on our data. Currently, we only support testing these ported ActionCLIP models. Due to differences in the testing data, our reported test accuracy differs from that of the original repository (on average, it is one point lower). Please refer to this issue for more details.
Kinetics400 (Trained on Our K400 dataset)
Zero-Shot Prediction
We offer two methods for zero-shot prediction as follows. The `test.mp4` can be downloaded from here.

Using Naive PyTorch
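The project's exact snippet is not reproduced here; what follows is a hypothetical sketch of the underlying idea using the plain OpenAI `clip` package: encode sampled frames, pool them into a video feature, and match it against prompted label texts. The frame paths, label list, and prompt template are made-up illustrations, and ActionCLIP itself replaces the naive frame averaging below with its trained temporal modules.

```python
# Hypothetical sketch of zero-shot video-text matching with plain CLIP.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Frames sampled from test.mp4; these paths are placeholders
# (use any frame extractor, e.g. ffmpeg, to produce them).
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]
video = torch.stack([preprocess(f) for f in frames]).to(device)  # (T, 3, H, W)

# Prompted label texts, in the spirit of "pre-train, prompt and fine-tune";
# the labels and template are illustrative only.
labels = ["dancing", "playing guitar", "riding a bike"]
texts = clip.tokenize([f"a video of a person {c}" for c in labels]).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(video)              # (T, D)
    video_feat = frame_feats.mean(dim=0, keepdim=True)   # naive temporal pooling
    text_feats = model.encode_text(texts)                # (N, D)

    # Cosine similarity between the pooled video feature and each label text.
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * video_feat @ text_feats.T).softmax(dim=-1)

print({label: float(p) for label, p in zip(labels, probs[0])})
```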
Using MMAction2 APIs
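A minimal sketch using MMAction2's generic inference helpers. The config name is taken from the training commands above; the checkpoint path is a placeholder, not a released artifact, so adapt both to your actual files (and make sure `PYTHONPATH` is set up as described in Setup Environment so the project modules are importable):

```python
# Sketch: inference on the demo video through MMAction2's generic APIs.
from mmaction.apis import inference_recognizer, init_recognizer

config = 'configs/actionclip_vit-base-p32-res224-clip-pre_g8xb16_1x1x8_k400-rgb.py'
checkpoint = 'PATH/TO/CHECKPOINT.pth'  # placeholder path

model = init_recognizer(config, checkpoint, device='cuda:0')
result = inference_recognizer(model, 'test.mp4')  # the demo video mentioned above
print(result)  # a data sample carrying the predicted matching scores
```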
Citation