This repo is the official implementation of "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios". This algorithm is proposed by ByteDance, Intelligent Creation, AutoML Team (字节跳动-智能创作 AutoML团队).
08/16/2022
- Pretrained models on large scale dataset follow [SSLD] are provided.
- Segmentation results with large scale dataset pretrained model are also presented.
Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far away from satisfactory. To end these, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency.
Figure 2. Comparison among Next-ViT and efficient Networks, in terms of accuracy-latency trade-off.First, clone the repository locally:
git clone https://github.com/bytedance/Next-ViT.git
Then, install torch=1.10.0
, mmcv-full==1.5.0
, timm==0.4.9
and etc.
pip3 install -r requirements.txt
Download and extract ImageNet train and val images from http://image-net.org/.
The directory structure is the standard layout for the torchvision datasets.ImageFolder
, and the training and validation data is expected to be in the train/
folder and val/
folder respectively:
/path/to/imagenet/
train/
class1/
img1.jpeg
class2/
img2.jpeg
val/
class1/
img3.jpeg
class/2
img4.jpeg
We provide a series of Next-ViT models pretrained on ILSVRC2012 ImageNet-1K dataset. More details can be seen in [paper].
Model | Dataset | Resolution | FLOPs(G) | Params (M) | TensorRT Latency(ms) |
CoreML Latency(ms) |
Acc@1 | ckpt | log |
---|---|---|---|---|---|---|---|---|---|
Next-ViT-S | ImageNet-1K | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 82.5 | ckpt | log |
Next-ViT-B | ImageNet-1K | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 83.2 | ckpt | log |
Next-ViT-L | ImageNet-1K | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 83.6 | ckpt | log |
Next-ViT-S | ImageNet-1K | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 83.6 | ckpt | log |
Next-ViT-B | ImageNet-1K | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 84.3 | ckpt | log |
Next-ViT-L | ImageNet-1K | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 84.7 | ckpt | log |
We also provide a series of Next-ViT models pretrained on large scale dataset follow [SSLD]. More details can be seen in [paper].
Model | Dataset | Resolution | FLOPs(G) | Params (M) | TensorRT Latency(ms) |
CoreML Latency(ms) |
Acc@1 | ckpt |
---|---|---|---|---|---|---|---|---|
Next-ViT-S | ImageNet-1K-6M | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 84.8 | ckpt |
Next-ViT-B | ImageNet-1K-6M | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 85.1 | ckpt |
Next-ViT-L | ImageNet-1K-6M | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 85.4 | ckpt |
Next-ViT-S | ImageNet-1K-6M | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 85.8 | ckpt |
Next-ViT-B | ImageNet-1K-6M | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 86.1 | ckpt |
Next-ViT-L | ImageNet-1K-6M | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 86.4 | ckpt |
To train Next-ViT-S on ImageNet using 8 gpus for 300 epochs, run:
cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path
Finetune Next-ViT-S with 384x384 input size for 30 epochs, run:
cd classification/
bash train.sh 8 --model nextvit_small --batch-size 128 --lr 5e-6 --warmup-epochs 0 --weight-decay 1e-8 --epochs 30 --sched step --decay-epochs 60 --input-size 384 --resume ../checkpoints/nextvit_small_in1k_224.pth --finetune --data-path your_imagenet_path
To evaluate the performance of Next-ViT-S on ImageNet using 8 gpus, run:
cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path --resume ../checkpoints/nextvit_small_in1k_224.pth --eval
Our code is based on mmdetection, please install mmdetection==2.23.0
. Next-ViT serve as the strong backbones for
Mask R-CNN. It's easy to apply Next-ViT in other detectors provided by mmdetection based on our examples. More details can be seen in [paper].
Backbone | Pretrained | Lr Schd | Param.(M) | FLOPs(G) | TensorRT Latency(ms) |
CoreML Latency(ms) |
bbox mAP | mask mAP | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|
Next-ViT-S | ImageNet-1K | 1x | 51.8 | 290 | 38.2 | 18.1 | 45.9 | 41.8 | ckpt | log |
Next-ViT-S | ImageNet-1K | 3x | 51.8 | 290 | 38.2 | 18.1 | 48.0 | 43.2 | ckpt | log |
Next-ViT-B | ImageNet-1K | 1x | 64.9 | 340 | 51.6 | 24.4 | 47.2 | 42.8 | ckpt | log |
Next-ViT-B | ImageNet-1K | 3x | 64.9 | 340 | 51.6 | 24.4 | 49.5 | 44.4 | ckpt | log |
Next-ViT-L | ImageNet-1K | 1x | 77.9 | 391 | 65.3 | 30.1 | 48.0 | 43.2 | ckpt | log |
Next-ViT-L | ImageNet-1K | 3x | 77.9 | 391 | 65.3 | 30.1 | 50.2 | 44.8 | ckpt | log |
To train Mask R-CNN with Next-ViT-S backbone using 8 gpus, run:
cd detection/
PORT=29501 bash dist_train.sh configs/mask_rcnn_nextvit_small_1x.py 8
To evaluate Mask R-CNN with Next-ViT-S backbone using 8 gpus, run:
cd detection/
PORT=29501 bash dist_test.sh configs/mask_rcnn_nextvit_small_1x.py ../checkpoints/mask_rcnn_1x_nextvit_small.pth 8 --eval bbox
Our code is based on mmsegmentation, please install mmsegmentation==0.23.0
. Next-ViT serve as the strong backbones for segmentation tasks on ADE20K dataset. It's easy to extend it to other datasets and segmentation methods. More details can be seen in [paper].
Backbone | Pretrained | FLOPs(G) | Params (M) | TensorRT Latency(ms) |
CoreML Latency(ms) |
mIoU | ckpt | log |
---|---|---|---|---|---|---|---|---|
Next-ViT-S | ImageNet-1K | 208 | 36.3 | 38.2 | 18.1 | 46.5 | ckpt | log |
Next-ViT-B | ImageNet-1K | 260 | 49.3 | 51.6 | 24.4 | 48.6 | ckpt | log |
Next-ViT-L | ImageNet-1K | 331 | 62.4 | 65.3 | 30.1 | 49.1 | ckpt | log |
Next-ViT-S | ImageNet-1K-6M | 208 | 36.3 | 38.2 | 18.1 | 48.8 | ckpt | log |
Next-ViT-B | ImageNet-1K-6M | 260 | 49.3 | 51.6 | 24.4 | 50.2 | ckpt | log |
Next-ViT-L | ImageNet-1K-6M | 331 | 62.4 | 65.3 | 30.1 | 50.5 | ckpt | log |
Backbone | Pretrained | FLOPs(G) | Params (M) | TensorRT Latency(ms) |
CoreML Latency(ms) |
mIoU(ss/ms) | ckpt | log |
---|---|---|---|---|---|---|---|---|
Next-ViT-S | ImageNet-1K | 968 | 66.3 | 38.2 | 18.1 | 48.1/49.0 | ckpt | log |
Next-ViT-B | ImageNet-1K | 1020 | 79.3 | 51.6 | 24.4 | 50.4/51.1 | ckpt | log |
Next-ViT-L | ImageNet-1K | 1072 | 92.4 | 65.3 | 30.1 | 50.1/50.8 | ckpt | log |
Next-ViT-S | ImageNet-1K-6M | 968 | 66.3 | 38.2 | 18.1 | 49.8/50.8 | ckpt | log |
Next-ViT-B | ImageNet-1K-6M | 1020 | 79.3 | 51.6 | 24.4 | 51.8/52.8 | ckpt | log |
Next-ViT-L | ImageNet-1K-6M | 1072 | 92.4 | 65.3 | 30.1 | 51.5/52.0 | ckpt | log |
To train Semantic FPN 80k with Next-ViT-S backbone using 8 gpus, run:
cd segmentation/
PORT=29501 bash dist_train.sh configs/fpn_512_nextvit_small_80k.py 8
To evaluate Semantic FPN 80k(single scale) with Next-ViT-S backbone using 8 gpus, run:
cd segmentation/
PORT=29501 bash dist_test.sh configs/fpn_512_nextvit_small_80k.py ../checkpoints/fpn_80k_nextvit_small.pth 8 --eval mIoU
we provide scripts to convert Next-ViT from pytorch model to CoreML model and TensorRT engine.
Convert Next-ViT-S to CoreML model with coremltools==5.2.0
, run:
cd deployment/
python3 export_coreml_model.py --model nextvit_small --batch-size 1 --image-size 224
Backbone | Resolution | FLOPs (G) | CoreML Latency(ms) |
CoreML Model |
---|---|---|---|---|
Next-ViT-S | 224 | 5.8 | 3.5 | mlmodel |
Next-ViT-B | 224 | 8.3 | 4.5 | mlmodel |
Next-ViT-L | 224 | 10.8 | 5.5 | mlmodel |
We uniformly benchmark CoreML Latency on an iPhone12 Pro Max(iOS 16.0) with Xcode 14.0. The performance report of CoreML model can be generated with Xcode 14.0 directly(new feature of Xcode 14.0).
Convert Next-ViT-S to TensorRT engine with tensorrt==8.0.3.4
, run:
cd deployment/
python3 export_tensorrt_engine.py --model nextvit_small --batch-size 8 --image-size 224 --datatype fp16 --profile True --trtexec-path /usr/bin/trtexec
If you find this project useful in your research, please consider cite:
@article{li2022next,
title={Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios},
author={Li, Jiashi and Xia, Xin and Li, Wei and Li, Huixia and Wang, Xing and Xiao, Xuefeng and Wang, Rui and Zheng, Min and Pan, Xin},
journal={arXiv preprint arXiv:2207.05501},
year={2022}
}
We heavily borrow the code from Twins.
This repository is released under the Apache 2.0 license as found in the LICENSE file.