ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding

This repository provides the implementation of ParFormer.
| Model Name | Resolution | Params | GFLOPs | Top-1 (%) | Download |
|---|---|---|---|---|---|
| ParFormer-B1 | 224x224 | 11M | 1.5 | 80.5 | model |
| ParFormer-B2 | 224x224 | 23M | 3.4 | 82.1 | model |
| ParFormer-B3 | 224x224 | 34M | 6.5 | 83.1 | model |
A conda virtual environment is recommended:

```
conda install pytorch torchvision cudatoolkit=11.8 -c pytorch
pip install timm==0.6.13
pip install wandb
pip install fvcore
```
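After installation, you may want to confirm that the pinned dependencies are actually importable. A minimal stdlib-only sketch (the helper name `installed_version` is ours, not part of the repository):

```python
# Check that the packages installed above are visible to Python.
# Package names match the conda/pip commands in this README.
from importlib.metadata import version, PackageNotFoundError


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for pkg in ("torch", "torchvision", "timm", "wandb", "fvcore"):
        v = installed_version(pkg)
        print(f"{pkg}: {v if v else 'NOT INSTALLED'}")
```

For example, `timm` should report `0.6.13` if the pinned install succeeded.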
Download and extract ImageNet train and val images from http://image-net.org/. The training and validation data are expected to be in the `train` and `val` folders, respectively:

```
|-- /path/to/imagenet/
    |-- train
    |-- val
```
We provide an example training script `train_imnet.sh` using PyTorch distributed data parallel (DDP). To train ParFormer-B1 on a 2-GPU machine:

```
sh train_imnet.sh parformer_b1 2
```
Tip: specify your data path and experiment name in the script before launching.
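For reference, a DDP launch script of this shape typically wraps `torchrun`. The sketch below is hypothetical and not the repository's actual `train_imnet.sh`; the entry point `main.py`, its flags, and the data path are placeholders to adapt, while the two positional arguments match the usage shown above:

```shell
#!/bin/bash
# Hypothetical sketch of a DDP launch script (NOT the repo's train_imnet.sh).
# Usage: sh train_imnet.sh <model_name> <num_gpus>
MODEL=$1        # e.g. parformer_b1
NUM_GPUS=$2     # e.g. 2
DATA_PATH=/path/to/imagenet      # set your data path here
EXPERIMENT=my_experiment         # set your experiment name here

torchrun --nproc_per_node="$NUM_GPUS" main.py \
    --model "$MODEL" \
    --data-path "$DATA_PATH" \
    --experiment "$EXPERIMENT"
```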