LocalMamba

LocalMamba: Visual State Space Model with Windowed Selective Scan

Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu

ArXiv Preprint (arXiv 2403.09338)

Updates
Overview
Main results
Getting started
Image Classification

Updates

06 May: We improved LocalVim with the middle class token proposed by Vim. The log and checkpoint of local_vim_tiny_middle_cls_token are uploaded. Thanks to @FanqingM's issue.
19 Apr: We released the segmentation and detection code for all the models. We released the ckpt and log of LocalVim-S.
01 Apr: We released the ckpt and log of LocalVMamba-S.
01 Apr: We spent some time debugging a performance collapse in the Triton code of local scan but could not find the cause, so we switched local scan and local reverse back to the original PyTorch versions. The speeds are similar.
21 Mar: We released the detection code of LocalVim.
20 Mar: We released the classification code of LocalVMamba (py). Since we rewrote the code related to Mamba operations, we need to retrain the models; the checkpoints and logs of the remaining models will be uploaded later. We are preparing the detection and segmentation code now.
19 Mar: We released the classification code of LocalVim (py). The checkpoint and training log of LocalVim-T are uploaded.
15 Mar: We are working hard on releasing the code; it will be public in several days.

Overview

Abstract

Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs.

Local Scan
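
The windowed scan described in the abstract can be illustrated with a small index-reordering sketch. This is not the repository's implementation: the function names `local_scan_order` and `invert_order` are hypothetical, the grid is assumed to divide evenly into square windows, and tokens are assumed to be visited row by row inside each window.

```python
def local_scan_order(height, width, window):
    """Return flat token indices visited window by window.

    Assumption for this sketch: the grid divides evenly into
    window x window tiles, and each tile is read in row-major order.
    """
    if height % window or width % window:
        raise ValueError("grid must divide evenly into windows in this sketch")
    order = []
    for win_row in range(0, height, window):        # top edge of each window
        for win_col in range(0, width, window):     # left edge of each window
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)     # row-major flat index
    return order


def invert_order(order):
    """Return the permutation that restores the original token order."""
    inverse = [0] * len(order)
    for position, index in enumerate(order):
        inverse[index] = position
    return inverse


order = local_scan_order(4, 4, 2)
print(order)  # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

On a 4x4 grid with 2x2 windows, the vertically adjacent tokens 0 and 4 end up two positions apart in the local order, instead of four positions apart in the plain row-major order.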

Architecture of LocalVim

Main Results

ImageNet classification

Model	Dataset	Resolution	ACC@1	#Params	FLOPs	ckpts/logs
Vim-Ti (mid_cls_token)	ImageNet-1K	224x224	76.1	7M	1.5G	-
LocalVim-T (mid_cls_token)	ImageNet-1K	224x224	77.8	8M	1.5G	ckpt/log

Vim-Ti	ImageNet-1K	224x224	73.1	7M	1.5G	-
Vim-S	ImageNet-1K	224x224	80.3	26M	5.1G	-
LocalVim-T	ImageNet-1K	224x224	76.2	8M	1.5G	ckpt/log
LocalVim-S	ImageNet-1K	224x224	81.1	28M	4.8G	ckpt/log

VMamba-T	ImageNet-1K	224x224	82.2	22M	5.6G	-
VMamba-S	ImageNet-1K	224x224	83.5	44M	11.2G	-
LocalVMamba-T	ImageNet-1K	224x224	82.7	26M	5.7G	retraining...
LocalVMamba-S	ImageNet-1K	224x224	83.7	50M	11.4G	ckpt/log

Object Detection & Instance Segmentation

See detection folder.

Getting Started

Installation

1. Clone the LocalMamba repository:

git clone https://github.com/hunto/LocalMamba.git

2. Environment setup:

We tested our code on torch==1.13.1 and torch==2.0.2.

Install Mamba kernels:

cd causal-conv1d && pip install .
cd ..
cd mamba-1p1p1 && pip install .

Other dependencies:

timm==0.9.12
fvcore==0.1.5.post20221221
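
To confirm that an environment matches the pinned dependencies, a small check along these lines may help. It is only a sketch: the helper name `find_mismatches` is hypothetical, and torch is left out because two versions are listed as tested.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINS = {"timm": "0.9.12", "fvcore": "0.1.5.post20221221"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed version or None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {installed}")
```

The script prints nothing when every pin is satisfied.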

Image Classification

Dataset

We use the ImageNet-1K dataset for training and validation. We recommend placing the dataset files in the ./data folder, so that the directory structure looks like this:

classification
├── lib
├── tools
├── configs
├── data
│   ├── imagenet
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── cifar
│   │   ├── cifar-10-batches-py
│   │   ├── cifar-100-python
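
Before launching a run, the ImageNet part of this layout can be verified with a short script. This is a sketch under assumptions: the helper name `missing_dirs` is hypothetical, and the script is assumed to be started from the repository root so that the data folder is `classification/data`.

```python
from pathlib import Path

# ImageNet sub-folders from the tree above, relative to the data root.
EXPECTED_DIRS = ("imagenet/meta", "imagenet/train", "imagenet/val")


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Assumed location when run from the repository root.
    missing = missing_dirs("classification/data")
    print("missing:", missing if missing else "none")
```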

Evaluation

sh tools/dist_run.sh tools/test.py ${NUM_GPUS} configs/strategies/local_vmamba/config.yaml timm_local_vim_tiny --drop-path-rate 0.1 --experiment lightvit_tiny_test --resume ${ckpt_file_path}

Train models with 8 GPUs

LocalVim-T

sh tools/dist_train.sh 8 configs/strategies/local_mamba/config.yaml timm_local_vim_tiny -b 128 --drop-path-rate 0.1 --experiment local_vim_tiny

Other training options:

--amp: enable torch Automatic Mixed Precision (AMP) training. It can speed up training of large models. We enable it for LocalVMamba models.
--clip-grad-norm: enable gradient clipping.
--clip-grad-max-norm 1: gradient clipping value.
--model-ema: enable model exponential moving average (EMA). It can improve the accuracy of large models.
--model-ema-decay 0.9999: decay rate of model EMA.
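
For readers unfamiliar with the last four options, the sketch below shows what gradient clipping by global norm and a weight EMA typically compute. It uses plain Python lists instead of tensors, the function names are hypothetical, and it is not taken from the training code in this repository.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: gradients already within the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


def ema_update(ema_weights, weights, decay):
    """One EMA step: ema = decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Gradient [3, 4] has norm 5, so with max_norm=1 it shrinks to about [0.6, 0.8].
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))

# With decay 0.9999 the averaged weights move very slowly toward the new ones.
print(ema_update([1.0], [0.0], decay=0.9999))
```

A decay close to 1, such as 0.9999, means each step changes the averaged weights by only 0.01% of the gap to the current weights.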

Search scan directions

1. Train the search space (supernet) `local_vim_tiny_search`:

sh tools/dist_train.sh 8 configs/strategies/local_mamba/config.yaml timm_local_vim_tiny_search -b 128 --drop-path-rate 0.1 --experiment local_vim_tiny --epochs 100

After training, run tools/vis_search_prob.py to get the searched directions.
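
This README does not describe how the searched directions are derived from the trained supernet. One common approach in differentiable architecture search, shown here purely as an assumption, is to keep one learnable logit per candidate direction in each layer, apply a softmax, and retain the top-k candidates. The direction names, logits, and the function `pick_directions` below are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_directions(layer_logits, names, k):
    """For each layer, return the k candidate names with the highest probability."""
    chosen = []
    for logits in layer_logits:
        probs = softmax(logits)
        ranked = sorted(range(len(names)), key=lambda i: probs[i], reverse=True)
        chosen.append([names[i] for i in ranked[:k]])
    return chosen


# Hypothetical candidates and made-up logits for a two-layer example.
names = ["horizontal", "vertical", "window_2", "window_7"]
layer_logits = [[0.1, 2.0, -1.0, 1.5], [3.0, 0.0, 0.5, -2.0]]
for layer, picked in enumerate(pick_directions(layer_logits, names, k=2)):
    print(f"layer {layer}: {picked}")
```

In this made-up example, layer 0 keeps `vertical` and `window_7`, while layer 1 keeps `horizontal` and `window_2`, which illustrates how different layers can end up with different scan choices.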

License

This project is released under the Apache 2.0 license.

Acknowledgements

This project is based on Mamba (paper, code), Vim (paper, code), and VMamba (paper, code). Thanks for their excellent work.

If our paper helps your research, please consider citing us:

@article{huang2024localmamba,
  title={LocalMamba: Visual State Space Model with Windowed Selective Scan},
  author={Huang, Tao and Pei, Xiaohuan and You, Shan and Wang, Fei and Qian, Chen and Xu, Chang},
  journal={arXiv preprint arXiv:2403.09338},
  year={2024}
}

