VarifocalNet: An IoU-aware Dense Object Detector

This repo hosts the code for implementing the VarifocalNet, as presented in our CVPR 2021 oral paper, which is available at:

  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},


Accurately ranking the vast number of candidate detections is crucial for dense object detectors to achieve high performance. In this work, we propose to learn IoU-aware classification scores (IACS) that simultaneously represent the object presence confidence and localization accuracy, to produce a more accurate ranking of detections in dense object detectors. In particular, we design a new loss function, named Varifocal Loss (VFL), for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation (the features at nine yellow sampling points) for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, we build a new IoU-aware dense object detector based on the FCOS+ATSS architecture, which we call VarifocalNet, or VFNet for short. Extensive experiments on the MS COCO benchmark show that our VFNet consistently surpasses the strong baseline by ~2.0 AP with different backbones. Our best model, VFNet-X-1200 with Res2Net-101-DCN, reaches a single-model single-scale AP of 55.1 on COCO test-dev, achieving state-of-the-art performance among various object detectors.

Learning to Predict the IoU-aware Classification Score.
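
As a rough illustration of the IACS target and the Varifocal Loss described above, the sketch below follows the loss definition from the paper: for a positive sample the target q is the IoU between the predicted box and its ground-truth box, and the binary cross-entropy is weighted by q; for a negative sample (q = 0) the loss is down-weighted by alpha * p^gamma. This is a scalar, standard-library-only sketch and not the implementation used in this repo. The function names and the default values alpha=0.75 and gamma=2.0 are assumptions made for illustration.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Varifocal Loss for one predicted score p in (0, 1) and one target q.

    q > 0: positive sample, q is the IoU with the ground-truth box.
    q == 0: negative sample, down-weighted by alpha * p**gamma.
    alpha and gamma defaults are assumed values, not read from this repo.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    if q > 0:
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return -alpha * (p ** gamma) * math.log(1.0 - p)


# Hypothetical boxes: the IoU of the prediction with its ground truth is the target.
pred_box = (10.0, 10.0, 50.0, 50.0)
gt_box = (20.0, 20.0, 60.0, 60.0)
q = box_iou(pred_box, gt_box)
print("IACS target:", q)
print("loss as positive:", varifocal_loss(0.6, q))
print("loss as negative:", varifocal_loss(0.6, 0.0))
```

Because positives are weighted by q, well-localized predictions contribute more to the loss than poorly localized ones, which is what lets the classification score double as a ranking signal.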


  • 2021.03.05 Our VarifocalNet is accepted to CVPR 2021 as an oral presentation. Thanks to the reviewers and ACs.
  • 2021.03.04 Update to MMDetection v2.10.0, add more results and training scripts, and update the arXiv paper.
  • 2021.01.09 Add SWA training.
  • 2021.01.07 Update to MMDetection v2.8.0.
  • 2020.12.24 We release a new VFNet-X model that can achieve a single-model single-scale 55.1 AP on COCO test-dev at 4.2 FPS.
  • 2020.12.02 Update to MMDetection v2.7.0.
  • 2020.10.29 VarifocalNet has been merged into the official MMDetection repo. Many thanks to @yhcao6, @RyanXLi and @hellock!
  • 2020.10.29 This repo has been refactored so that users can pull the latest updates from the upstream official MMDetection repo. The previous one can be found in the old branch.


  • This VarifocalNet implementation is based on MMDetection. Therefore, the installation is the same as for the original MMDetection.

  • Please check for installation. Note that you should change the version of PyTorch and CUDA to yours when installing mmcv in step 3 and clone this repo instead of MMDetection in step 4.

  • If you run into problems with pycocotools, please install it by:

    pip install "git+"

A Quick Demo

Once the installation is done, you can follow the steps below to run a quick demo.

  • Download the model and put it into a folder under the root directory of this project, say, checkpoints/.

  • Go to the root directory of this project in a terminal and activate the corresponding virtual environment.

  • Run

    python demo/ demo/demo.jpg configs/vfnet/ checkpoints/vfnet_r50_1x_41.6.pth

    and you should see an image with detections.
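
The same demo can also be driven from Python. The sketch below assumes the MMDetection v2.x high-level API (init_detector, inference_detector and show_result_pyplot from mmdet.apis); the helper names parse_args and run_demo, the default device and the default score threshold are hypothetical choices for illustration. The config and checkpoint paths are passed in by the caller rather than hard-coded.

```python
import argparse


def parse_args(argv=None):
    """Parse the same three positional inputs the command-line demo takes."""
    parser = argparse.ArgumentParser(description="VFNet single-image demo (sketch)")
    parser.add_argument("img", help="path to the input image, e.g. demo/demo.jpg")
    parser.add_argument("config", help="path to a config file under configs/vfnet/")
    parser.add_argument("checkpoint", help="path to a downloaded checkpoint")
    parser.add_argument("--device", default="cuda:0", help="device used for inference")
    parser.add_argument("--score-thr", type=float, default=0.3,
                        help="minimum score for a detection to be drawn")
    return parser.parse_args(argv)


def run_demo(args):
    """Build the detector, run it on one image and display the detections."""
    # Imported lazily so that argument parsing works without MMDetection installed.
    from mmdet.apis import inference_detector, init_detector, show_result_pyplot

    model = init_detector(args.config, args.checkpoint, device=args.device)
    result = inference_detector(model, args.img)
    show_result_pyplot(model, args.img, result, score_thr=args.score_thr)


# Usage (requires MMDetection, a config file and a checkpoint):
# run_demo(parse_args(["demo/demo.jpg", "<config file>", "checkpoints/vfnet_r50_1x_41.6.pth"]))
```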

Usage of MMDetection

Please see for the basic usage of MMDetection. They also provide a Colab tutorial for beginners.

For troubleshooting, please refer to

Results and Models

For your convenience, we provide the following trained models. These models are trained with a mini-batch size of 16 images on 8 Nvidia V100 GPUs (2 images per GPU).

Backbone Style DCN MS train Lr schd Inf time (fps) box AP (val) box AP (test-dev) Download
R-50 pytorch N N 1x 19.4 41.6 41.6 model | log
R-50 pytorch N Y 2x 19.3 44.5 44.8 model | log
R-50 pytorch Y Y 2x 16.3 47.8 48.0 model | log
R-101 pytorch N N 1x 15.5 43.0 43.6 model | log
R-101 pytorch N N 2x 15.6 43.5 43.9 model | log
R-101 pytorch N Y 2x 15.6 46.2 46.7 model | log
R-101 pytorch Y Y 2x 12.6 49.0 49.2 model | log
X-101-32x4d pytorch N Y 2x 13.1 47.4 47.6 model | log
X-101-32x4d pytorch Y Y 2x 10.1 49.7 50.0 model | log
X-101-64x4d pytorch N Y 2x 9.2 48.2 48.5 model | log
X-101-64x4d pytorch Y Y 2x 6.7 50.4 50.8 model | log
R2-101 pytorch N Y 2x 13.0 49.2 49.3 model | log
R2-101 pytorch Y Y 2x 10.3 51.1 51.3 model | log


  • The MS-train maximum scale range is 1333x[480:960] (range mode) and the inference scale is kept at 1333x800.
  • The R2-101 backbone is Res2Net-101.
  • DCN means using DCNv2 in both backbone and head.
  • The inference speed is tested with an Nvidia V100 GPU on HPC (log file).
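
To make the 1333x[480:960] range mode concrete, the sketch below draws a short-edge limit uniformly from 480 to 960 with the long-edge limit fixed at 1333, and then rescales an image to fit inside that target while keeping its aspect ratio. This mirrors how range-mode multi-scale resizing is commonly understood in MMDetection v2.x; the function names are hypothetical and the code is not taken from this repo.

```python
import random


def sample_train_scale(rng, long_edge=1333, short_range=(480, 960)):
    """Return a (long, short) target with the short edge drawn uniformly (inclusive)."""
    return long_edge, rng.randint(short_range[0], short_range[1])


def rescale_keep_ratio(width, height, target):
    """Largest size that fits inside the target while keeping the aspect ratio."""
    max_long, max_short = max(target), min(target)
    factor = min(max_long / max(width, height), max_short / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


rng = random.Random(0)  # seeded only to make the sketch repeatable
train_target = sample_train_scale(rng)
print("sampled training target:", train_target)
print("640x480 image at training:", rescale_keep_ratio(640, 480, train_target))
print("640x480 image at inference:", rescale_keep_ratio(640, 480, (1333, 800)))
```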

We also provide the models of RetinaNet, FoveaBox, RepPoints and ATSS trained with the Focal Loss (FL) and our Varifocal Loss (VFL).

Method Backbone MS train Lr schd box AP (val) Download
RetinaNet + FL R-50 N 1x 36.5 model | log
RetinaNet + VFL R-50 N 1x 37.4 model | log
FoveaBox + FL R-50 N 1x 36.3 model | log
FoveaBox + VFL R-50 N 1x 37.2 model | log
RepPoints + FL R-50 N 1x 38.3 model | log
RepPoints + VFL R-50 N 1x 39.7 model | log
ATSS + FL R-50 N 1x 39.3 model | log
ATSS + VFL R-50 N 1x 40.2 model | log


  • We use 4 P100 GPUs for the training of these models (except ATSS, 8x2) with a mini-batch size of 16 images (4 images per GPU), as we found that 4x4 training yielded slightly better results than 8x2 training.
  • You can find corresponding config files in configs/vfnet.
  • The use_vfl flag in those config files controls whether the Varifocal Loss is used in training.


Backbone DCN MS train Training Inf scale Inf time (fps) box AP (val) box AP (test-dev) Download
R2-101 Y Y 41e + SWA 18e 1333x800 8.0 53.4 53.7 model | config
R2-101 Y Y 41e + SWA 18e 1800x1200 4.2 54.5 55.1


We implement some improvements to the original VFNet. This version of VFNet is called VFNet-X and these improvements include:

  • PAFPN. We replace the FPN with the PAFPNX (minor modifications are made to the original PAFPN), and apply the DCN and group normalization (GN) in it.

  • More and Wider Conv Layers. We stack 4 convolution layers in the detection head, instead of 3 layers in the original VFNet, and increase the original 256 feature channels to 384 channels.

  • RandomCrop and Cutout. We employ the random crop and cutout as additional data augmentation methods.

  • Wider MSTrain Scale Range and Longer Training. We adopt a wider MSTrain scale range, from 750x500 to 2100x1400, and initially train the VFNet-X for 41 epochs.

  • SWA. We apply the technique of Stochastic Weight Averaging (SWA) in training the VFNet-X (for another 18 epochs), which brings a 1.2 AP gain. Please see our work on SWA Object Detection for more details.

  • Soft-NMS. We apply soft-NMS in inference.
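
The core of the SWA step listed above is an element-wise average of model weights collected during the extra training epochs. The sketch below shows only that averaging, assuming one checkpoint is kept per SWA epoch; plain floats stand in for weight tensors, and the function name and parameter keys are hypothetical. It is not the SWA training code of this repo.

```python
def average_weights(checkpoints):
    """Element-wise mean of parameter dicts that all share the same keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint to average")
    count = len(checkpoints)
    return {
        name: sum(ckpt[name] for ckpt in checkpoints) / count
        for name in checkpoints[0]
    }


# Three stand-in checkpoints; in practice each value would be a weight tensor.
epoch_weights = [
    {"conv.weight": 1.0, "conv.bias": 0.5},
    {"conv.weight": 2.0, "conv.bias": 1.5},
    {"conv.weight": 3.0, "conv.bias": 1.0},
]
swa_weights = average_weights(epoch_weights)
print(swa_weights)
```

When the averaged model contains batch normalization layers, their running statistics generally need to be recomputed after averaging.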

For more detailed information, please see the VFNet-X config file.
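
For readers unfamiliar with soft-NMS, the sketch below shows the general idea: instead of discarding boxes that overlap a higher-scoring detection, their scores are decayed according to the overlap. The Gaussian decay, the sigma value and the score threshold used here are assumptions for illustration; the variant and settings actually used for VFNet-X are defined in its config file.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian soft-NMS; returns (index, decayed score) pairs in selection order."""
    scores = list(scores)  # work on a copy so the caller's scores are untouched
    remaining = [i for i in range(len(boxes)) if scores[i] >= score_thr]
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        # Decay the scores of the other boxes by their overlap with the selected one.
        for i in remaining:
            scores[i] *= math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        remaining = [i for i in remaining if scores[i] >= score_thr]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for index, score in soft_nms(boxes, scores):
    print(index, score)
```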


Assuming you have put the COCO dataset into data/coco/ and have downloaded the models into checkpoints/, you can now evaluate the models on the COCO val2017 split:

./tools/ configs/vfnet/ checkpoints/vfnet_r50_1x_41.6.pth 8 --eval bbox


  • If you have fewer than 8 GPUs available on your machine, please change 8 to the number of GPUs you have.
  • If you want to evaluate a different model, please change the config file (in configs/vfnet) and corresponding model weights file.
  • Test-time augmentation is supported for the VarifocalNet, including multi-scale testing and flip testing. If you are interested, please refer to an example config file. More information about test-time augmentation can be found in the official script.
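
For orientation, test-time augmentation in MMDetection v2.x configs is usually expressed through a MultiScaleFlipAug step in the test pipeline. The fragment below is a generic sketch of that pattern and not the example config mentioned above: the three test scales are hypothetical, and the normalization values are the common ImageNet defaults, assumed here.

```python
# Assumed normalization (common ImageNet defaults for pytorch-style backbones).
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        # Hypothetical scales; each image is tested at every scale listed.
        img_scale=[(1333, 640), (1333, 800), (1333, 960)],
        # With flip=True each scale is also tested on the flipped image.
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ]),
]
```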


The following command line will train vfnet_r50_fpn_1x_coco on 8 GPUs:

./tools/ configs/vfnet/ 8


  • The models will be saved into work_dirs/vfnet_r50_fpn_1x_coco.
  • To use fewer GPUs, please change 8 to the number of your GPUs. If you want to keep the mini-batch size at 16, you need to change samples_per_gpu and workers_per_gpu accordingly, so that samples_per_gpu x number_of_gpus = 16. In general, workers_per_gpu = samples_per_gpu.
  • If you use a different mini-batch size, please change the learning rate according to the Linear Scaling Rule, e.g., lr=0.01 for 8 GPUs x 2 img/gpu and lr=0.005 for 4 GPUs x 2 img/gpu.
  • To train the VarifocalNet with other backbones, please change the config file accordingly.
  • To train the VarifocalNet on your own dataset, please follow these instructions.
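
The batch-size and learning-rate notes above reduce to simple arithmetic. The sketch below computes the per-GPU settings that keep the mini-batch size at 16, and the learning rate given by the Linear Scaling Rule using the reference point stated above (lr=0.01 for 8 GPUs x 2 img/gpu). The helper names are hypothetical; the returned values still have to be written into the config file by hand.

```python
def per_gpu_settings(num_gpus, total_batch=16):
    """samples_per_gpu and workers_per_gpu that keep the total mini-batch size fixed."""
    if num_gpus <= 0 or total_batch % num_gpus != 0:
        raise ValueError("total_batch must be divisible by a positive num_gpus")
    samples_per_gpu = total_batch // num_gpus
    # Following the note above: in general, workers_per_gpu = samples_per_gpu.
    return dict(samples_per_gpu=samples_per_gpu, workers_per_gpu=samples_per_gpu)


def linear_scaled_lr(num_gpus, samples_per_gpu, base_lr=0.01, base_batch=16):
    """Linear Scaling Rule: the learning rate scales with the total mini-batch size."""
    return base_lr * (num_gpus * samples_per_gpu) / base_batch


print(per_gpu_settings(4))     # settings for 4 GPUs with a mini-batch of 16
print(linear_scaled_lr(8, 2))  # 8 GPUs x 2 img/gpu
print(linear_scaled_lr(4, 2))  # 4 GPUs x 2 img/gpu
```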


Any pull requests or issues are welcome.


Please consider citing our paper in your publications if the project helps your research. BibTeX reference is as follows:

  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},


We would like to thank the MMDetection team for producing this great object detection toolbox!


This project is released under the Apache 2.0 license.