This is the anonymous repository for our PRCV 2024 submission.
```shell
pip install git+https://github.com/openai/CLIP.git
```
To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide to data preprocessing.
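After extraction, each clip is built by sampling a fixed number of frames uniformly over the video. The sketch below illustrates the usual segment-based sampling (split the video into equal segments and take each segment's center); the exact MVFNet preprocessing may differ in detail, and the function name here is our own.

```python
import numpy as np

def uniform_frame_indices(num_total, num_frames=8):
    """Pick num_frames indices evenly spaced over a video of num_total frames.

    Illustrative sketch of segment-based sampling: the video is split into
    num_frames equal segments and the center frame of each segment is taken.
    """
    # Segment boundaries over [0, num_total], then the midpoint of each segment.
    ticks = np.linspace(0, num_total, num_frames + 1)
    return ((ticks[:-1] + ticks[1:]) / 2).astype(int)

# e.g. an 8-frame clip from a 100-frame video:
print(uniform_frame_indices(100, 8))  # → [ 6 18 31 43 56 68 81 93]
```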
Coming Soon
Input = #frames x #temporal clips x #spatial crops x frame size (H x W)
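Under this layout, an entry such as 8x3x1x224x224 means 8 frames, 3 temporal clips, 1 spatial crop, at 224x224 resolution; at test time the per-view scores are averaged. A minimal sketch (shapes and class count are illustrative stand-ins, not the repo's actual inference code):

```python
import numpy as np

# Hypothetical shapes matching the "#frames x #clips x #crops x size" layout.
frames, clips, crops, H, W = 8, 3, 1, 224, 224
views = clips * crops  # 3 views of each video at test time

# Stand-in per-view logits for one video over 400 classes (e.g. Kinetics-400).
rng = np.random.default_rng(0)
logits = rng.standard_normal((views, 400))

# Multi-view evaluation: average the scores over all views, then take argmax.
pred = logits.mean(axis=0).argmax()
```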
Architecture | #Input | Top-1 Acc.(%) | config |
---|---|---|---|
ViT-B/16 | 8x1x1x224x224 | 82.0 | - |
ViT-B/16 | 8x3x1x224x224 | 82.6 | - |
ViT-L/14 | 8x1x1x224x224 | 85.8 | - |
ViT-L/14 | 8x3x1x224x224 | 86.6 | - |
Architecture | Task | #Input | Top-1 Acc.(%) | config |
---|---|---|---|---|
ViT-B/16 | All | 8x1x1x224x224 | 74.6 | - |
ViT-B/16 | 2-shot | 8x1x1x224x224 | 62.7 | - |
ViT-B/16 | zero-shot | 8x1x1x224x224 | 45.8 | - |
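For the zero-shot setting, classification follows the CLIP recipe: class names are encoded as text, the video is encoded once, and the class whose text embedding has the highest cosine similarity wins. A toy sketch with random stand-ins for the real CLIP encoders:

```python
import numpy as np

# Stand-in embeddings; in practice these come from CLIP's text/video encoders.
rng = np.random.default_rng(1)
dim, num_classes = 512, 51  # e.g. HMDB-51
text_emb = rng.standard_normal((num_classes, dim))
# Fake a video embedding close to class 7's text embedding.
video_emb = text_emb[7] + 0.1 * rng.standard_normal(dim)

# L2-normalize, then score every class by cosine similarity.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
video_emb /= np.linalg.norm(video_emb)
pred = (text_emb @ video_emb).argmax()  # → 7
```

No video-side fine-tuning on the target classes is needed, which is what makes the zero-shot row possible at all.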
```shell
# For Kinetics-400, use 8 frames and ViT-B/16.
bash scripts/run_train.sh configs/k400/k400_train_rgb_vitb-16-f8.yaml

# For HMDB-51, use 8 frames and ViT-B/16 (substitute the corresponding HMDB-51 config).
bash scripts/run_train.sh configs/k400/k400_train_rgb_vitb-16-f8.yaml
```
```shell
bash scripts/run_test.sh <PATH_TO_CONFIG> <PATH_TO_MODEL>
```
This repository is built on Text4Vis and CLIP (https://github.com/openai/CLIP). Sincere thanks for their wonderful work.