# **YOLOv4: Optimal Speed and Accuracy of Object Detection**

**Authors: Alexey Bochkovskiy {alexeyab84@gmail.com}, Chien-Yao Wang {kinyiu@iis.sinica.edu.tw}, Hong-Yuan Mark Liao {liao@iis.sinica.edu.tw}**

**Official Github**: https://github.com/AlexeyAB/darknet

---

**Edited By Su Hyung Choi - [Computer Vision Paper Reviews]**

**[Github: @JonyChoi]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 7 2022

---

### Abstract

There are a huge number of features which are said to
improve Convolutional Neural Network (CNN) accuracy.
Practical testing of combinations of such features on large
datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively
and for certain problems exclusively, or only for small-scale
datasets; while some features, such as batch-normalization
and residual-connections, are applicable to the majority of
models, tasks, and datasets. We assume that such universal
features include Weighted-Residual-Connections (WRC),
Cross-Stage-Partial-connections (CSP), Cross mini-Batch
Normalization (CmBN), Self-adversarial-training (SAT)
and Mish-activation. We use new features: WRC, CSP,
CmBN, SAT, Mish activation, Mosaic data augmentation,
DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5%
AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at
https://github.com/AlexeyAB/darknet.


### 1. Introduction

The majority of CNN-based object detectors are largely
applicable only for recommendation systems. For example,
searching for free parking spaces via urban video cameras
is executed by slow accurate models, whereas car collision
warning is related to fast inaccurate models. Improving
the accuracy of real-time object detectors enables using them
not only for hint-generating recommendation systems, but
also for stand-alone process management and human input
reduction. Real-time object detector operation on conventional Graphics Processing Units (GPUs) allows their mass
usage at an affordable price. The most accurate modern
neural networks do not operate in real time and require a large
number of GPUs for training with a large mini-batch size.
We address such problems by creating a CNN that operates in real time on a conventional GPU, and for which
training requires only one conventional GPU.

<table>
  <tr>
      <td>
        <img src="./figure1.png" style="width: 350px">
      </td>
      <td valign="bottom">
        Figure 1: Comparison of the proposed YOLOv4 and other
        state-of-the-art object detectors. YOLOv4 runs twice as fast
        as EfficientDet with comparable performance, and improves
        YOLOv3’s AP and FPS by 10% and 12%, respectively.
    </td>
  </tr>
</table>

The main goal of this work is to design an object detector that runs fast in production systems and is optimized for parallel computation, rather than to minimize the theoretical low-computation-volume indicator (BFLOP). We hope that the designed detector can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high-quality, and convincing object detection results, as the YOLOv4 results in Figure 1 show. Our contributions are summarized as follows:

1. We develop an efficient and powerful object detection model. It enables anyone with a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.


<img src="./figure2.png" width="90%">

### 2. Related work
#### 2.1. Object detection models

A modern detector is usually composed of two parts,
a backbone which is pre-trained on ImageNet and a head
which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their
backbone could be VGG [68], ResNet [26], ResNeXt [86],
or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet
[28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part,
it is usually categorized into two kinds, i.e., one-stage object
detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series,
including Fast R-CNN [18], Faster R-CNN [64], R-FCN [9],
and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as
RepPoints [87]. As for one-stage object detectors, the most
representative models are YOLO [61, 62, 63], SSD [50],
and RetinaNet [45]. In recent years, anchor-free one-stage
object detectors have been developed. Detectors of this sort include
CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object
detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We
can call it the neck of an object detector. Usually, a neck
is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17].
In addition to the above models, some researchers put their
emphasis on directly building a new backbone (DetNet [43],
DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.

To sum up, an ordinary object detector is composed of several parts:

- Input: Image, Patches, Image Pyramid
- Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
- Neck:
  - Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  - Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
- Heads:
  - Dense Prediction (one-stage):
    - RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
    - CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
  - Sparse Prediction (two-stage):
    - Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
    - RepPoints [87] (anchor free)
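
The composition summarized above (input, backbone, neck, head) can be illustrated with a minimal Python sketch. It is not the Darknet implementation: feature maps are tracked only as `(channels, height, width)` shapes, and every name (`toy_backbone`, `toy_neck`, `toy_dense_head`, `Detector`) is hypothetical. The strides of 8, 16 and 32, the channel widths, the 3 anchors per cell, and the 80-class setting are assumptions chosen for illustration, not values taken from this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A feature map is represented only by its (channels, height, width) shape;
# no real tensors or convolutions are involved in this sketch.
Shape = Tuple[int, int, int]


def toy_backbone(image: Shape) -> List[Shape]:
    """Hypothetical backbone: emit three stages at assumed strides 8, 16, 32."""
    _, h, w = image
    return [(256, h // 8, w // 8), (512, h // 16, w // 16), (1024, h // 32, w // 32)]


def toy_neck(stages: List[Shape], out_channels: int = 256) -> List[Shape]:
    """Hypothetical neck: project every stage to a common channel width,
    standing in for the feature collection done by FPN- or PAN-style blocks."""
    return [(out_channels, h, w) for _, h, w in stages]


def toy_dense_head(features: List[Shape], num_classes: int, anchors: int = 3) -> List[Shape]:
    """Hypothetical anchor-based one-stage (dense) head: one prediction vector
    per anchor per cell, with 4 box values, 1 objectness score, class scores."""
    per_anchor = 4 + 1 + num_classes
    return [(anchors * per_anchor, h, w) for _, h, w in features]


@dataclass
class Detector:
    backbone: Callable[[Shape], List[Shape]]
    neck: Callable[[List[Shape]], List[Shape]]
    head: Callable[[List[Shape]], List[Shape]]

    def forward(self, image: Shape) -> List[Shape]:
        # Input -> backbone -> neck -> head, the order used in the summary.
        return self.head(self.neck(self.backbone(image)))


detector = Detector(
    backbone=toy_backbone,
    neck=toy_neck,
    head=lambda feats: toy_dense_head(feats, num_classes=80),  # assumed class count
)
outputs = detector.forward((3, 416, 416))  # assumed input resolution
print(outputs)
```

Because each part is just a callable, swapping one backbone, neck, or head for another leaves the rest of the pipeline untouched, which mirrors how the list groups interchangeable components by role.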
