VideoMambaPro

Official Implementation of VideoMambaPro: A Leap Forward for Mamba in Video Understanding

We investigate the similarities and differences between self-attention and Mamba from the perspective of the latter, and reveal the limitations of Mamba on video understanding tasks. We propose VideoMambaPro, which uses VideoMamba as a backbone but significantly improves performance on video understanding tasks, narrowing the gap with transformers.

Installation

The required packages are listed in requirements.txt. Run the following commands to set up the environment:

conda create --name videomambapro python=3.8 -y
conda activate videomambapro

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 -c pytorch

pip install -r requirements.txt

Note:

The above commands are for reference only; please configure your environment according to your needs.
We recommend installing PyTorch >= 1.12.0, which may greatly reduce GPU memory usage.
We recommend installing timm == 0.4.12, because some of the APIs we use are deprecated in the latest version of timm.
Pre-training with PyTorch 2.0 is supported, but it has not been fully tested.
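
To check an existing environment against the recommendations in this note, something like the following sketch may help. It is not part of the repository: the helper names (`parse_version`, `installed_version`, `check_environment`) are hypothetical, and the version thresholds are simply the ones stated above.

```python
import re
from importlib import metadata  # standard library in Python 3.8+


def parse_version(text):
    """Turn a string such as '1.12.1+cu113' into a tuple such as (1, 12, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.12' compares equal to '1.12.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(versions):
    """Return a list of mismatches with the recommendations in the note."""
    problems = []
    torch_version = versions.get("torch")
    if torch_version is None:
        problems.append("torch is not installed")
    elif parse_version(torch_version) < (1, 12, 0):
        problems.append(f"torch {torch_version} is older than 1.12.0")
    timm_version = versions.get("timm")
    if timm_version is None:
        problems.append("timm is not installed")
    elif parse_version(timm_version) != (0, 4, 12):
        problems.append(f"timm {timm_version} is not 0.4.12")
    return problems


if __name__ == "__main__":
    found = {name: installed_version(name) for name in ("torch", "timm")}
    report = check_environment(found) or ["environment matches the recommendations"]
    for line in report:
        print(line)
```

A mismatch reported here is only a hint; as the note says, other versions may also work.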

Data Preparation

We read and process data in the same way as VideoMAE, but use a different convention for the format of the data list file.

Pre-train Dataset

We pre-train the model on the ImageNet-1K dataset. The model loads a data list file with the following format:

frame_folder_path total_frames label

Fine-tune Dataset

There are two implementations of our fine-tuning dataset, VideoClsDataset and RawFrameClsDataset, which support video data and rawframes data, respectively. SSV2 uses RawFrameClsDataset by default, and the other datasets use VideoClsDataset.

VideoClsDataset loads a data list file with the following format:

video_path label

while RawFrameClsDataset loads a data list file with the following format:

frame_folder_path total_frames label

For example, a video data list and a rawframes data list are shown below:

# The path prefix 'your_path' can be specified with `--data_root ${PATH_PREFIX}` in the scripts for training or inference.

# k400 video data validation list
your_path/k400/jf7RDuUTrsQ.mp4 325
your_path/k400/JTlatknwOrY.mp4 233
your_path/k400/NUG7kwJ-614.mp4 103
your_path/k400/y9r115bgfNk.mp4 320
your_path/k400/ZnIDviwA8CE.mp4 244
...

# ssv2 rawframes data validation list
your_path/SomethingV2/frames/74225 62 140
your_path/SomethingV2/frames/116154 51 127
your_path/SomethingV2/frames/198186 47 173
your_path/SomethingV2/frames/137878 29 99
your_path/SomethingV2/frames/151151 31 166
...
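
The two list formats above can be told apart by the number of whitespace-separated fields on each line. The following is a minimal sketch of such a parser, not the loading code used by VideoClsDataset or RawFrameClsDataset. The function names and the returned dictionary keys are hypothetical, and it assumes that paths contain no spaces.

```python
def parse_list_line(line):
    """Parse one entry of a data list file.

    Two fields   -> video entry:     video_path label
    Three fields -> rawframes entry: frame_folder_path total_frames label
    """
    fields = line.split()
    if len(fields) == 2:
        path, label = fields
        return {"path": path, "label": int(label)}
    if len(fields) == 3:
        path, total_frames, label = fields
        return {
            "path": path,
            "total_frames": int(total_frames),
            "label": int(label),
        }
    raise ValueError(f"expected 2 or 3 fields, got {len(fields)}: {line!r}")


def read_data_list(lines):
    """Parse all entries, skipping blank lines, '#' comments and '...' markers.

    Skipping comments is a convenience for the examples in this README;
    real list files are assumed to contain only entries.
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        entries.append(parse_list_line(line))
    return entries


if __name__ == "__main__":
    sample = [
        "# k400 video data validation list",
        "your_path/k400/jf7RDuUTrsQ.mp4 325",
        "# ssv2 rawframes data validation list",
        "your_path/SomethingV2/frames/74225 62 140",
    ]
    for entry in read_data_list(sample):
        print(entry)
```

Running a list file through a check like this before training can catch malformed lines early, for example a rawframes entry that is missing its total_frames field.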

Code details

Our project is based on VideoMamba for fair comparison. To address limitations 1 and 2 in our paper, we mainly change the Mamba pipeline by applying a diagonal mask during the backward SSM and adding a residual connection to the bidirectional SSM. The residual connection on A_b is realized by assigning a new matrix A in mamba/mamba_ssm/ops/selective_scan_interface.py:

A = deltaA[:, :, i] + deltaA[:, :, x.index]

The mask is applied by setting elements of A_b in mamba/mamba_ssm/modules/mamba_simple.py:

self.A_b_log = mask_diagnomal (A_b_log)
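
As a rough intuition for why a diagonal mask on the backward pass can be useful, consider the toy sketch below. It is not the repository's `mask_diagnomal` and does not touch any SSM parameters. It only uses 0/1 dependency patterns, under the assumption that a forward scan lets position i depend on positions j <= i and a backward scan lets it depend on positions j >= i. All function names are hypothetical.

```python
def forward_pattern(n):
    """1 where output i may depend on input j in a forward scan (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def backward_pattern(n):
    """1 where output i may depend on input j in a backward scan (j >= i)."""
    return [[1 if j >= i else 0 for j in range(n)] for i in range(n)]


def mask_diagonal(matrix):
    """Return a copy of a square matrix with its diagonal set to zero."""
    return [
        [0 if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


n = 4
# Without a mask, each position's dependence on itself is counted twice.
unmasked = add(forward_pattern(n), backward_pattern(n))
# Masking the backward diagonal leaves every (i, j) pair counted once.
masked = add(forward_pattern(n), mask_diagonal(backward_pattern(n)))

print("unmasked diagonal:", [unmasked[i][i] for i in range(n)])
print("masked diagonal:  ", [masked[i][i] for i in range(n)])
```

In this simplified picture the mask only removes the duplicated self-term; the off-diagonal entries are unchanged.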
