# **YOLOv4: Optimal Speed and Accuracy of Object Detection**

**Authors: Alexey Bochkovskiy {alexeyab84@gmail.com}, Chien-Yao Wang {kinyiu@iis.sinica.edu.tw}, Hong-Yuan Mark Liao {liao@iis.sinica.edu.tw}**

**Official Github**: https://github.com/AlexeyAB/darknet

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these notes, please open a PR against the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 7 2022

---


### **::Abstract::**

There are a huge number of **features** that are said to improve **CNN accuracy**.

- Some features work only in **specific settings (e.g. certain models, certain problems, or only small-scale datasets).**

- Other features are applicable to the majority of models, tasks, and datasets **(e.g. batch-normalization and residual-connections).**

The authors assume that such **universal features** include the following:

- **WRC**: Weighted-Residual-Connections
- **CSP**: Cross-Stage-Partial-connections
- **CmBN**: Cross mini-Batch Normalization
- **SAT**: Self-adversarial-training
- **Mish-activation**
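
Of the features above, Mish is the easiest to write down. The sketch below assumes the standard definition from the Mish paper, `mish(x) = x * tanh(softplus(x))`, which this summary does not spell out. The function names are illustrative and are not taken from the Darknet code.

```python
import math


def softplus(x: float) -> float:
    """Softplus, log(1 + exp(x)), computed without overflowing for large x."""
    if x > 0:
        # Identity: log(1 + exp(x)) = x + log(1 + exp(-x)), so exp() never sees a large argument.
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    """Mish activation (assumed standard definition): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Small negative inputs give small negative outputs; large positive inputs pass through almost unchanged.
for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, this function is smooth and lets small negative values through instead of clamping them to zero.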

They also use the following new features (the abstract does not list these among the assumed universal features):

- Mosaic data augmentation
- DropBlock regularization
- CIoU loss
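
As a rough illustration of the last item, here is a minimal sketch of CIoU loss for a single pair of boxes. It assumes the formulation from the Distance-IoU/CIoU paper (IoU term, normalized center distance, and aspect-ratio consistency term), which this summary does not restate. Boxes are assumed to be `(x1, y1, x2, y2)` with positive area, and all names are illustrative.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def ciou_loss(pred: Box, target: Box) -> float:
    """CIoU loss (assumed form): 1 - IoU + rho^2 / c^2 + alpha * v."""
    overlap = iou(pred, target)

    # rho^2: squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    center_dist_sq = (pcx - tcx) ** 2 + (pcy - tcy) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(pred[2], target[2]) - min(pred[0], target[0])
    enc_h = max(pred[3], target[3]) - min(pred[1], target[1])
    diag_sq = enc_w ** 2 + enc_h ** 2

    # v: aspect-ratio consistency; alpha: its trade-off weight.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    denom = (1 - overlap) + v
    alpha = v / denom if denom > 0 else 0.0  # denom is 0 only for identical boxes

    return 1 - overlap + center_dist_sq / diag_sq + alpha * v


print(ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

The center-distance term keeps the loss informative even when the boxes do not overlap, where plain IoU loss would be constant.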

By using these new **features** and combining some of them, they achieved:

- **state-of-the-art results: 43.5% AP (65.7% AP50)** 
- for the MS COCO dataset
- at a **realtime speed of ∼65 FPS**
- on Tesla V100

---

### **::Introduction::**

#### **1. Trade-offs between Accuracy & Speed**
The majority of CNN-based object detectors are largely applicable **only** for recommendation systems.
- **Example of slow, accurate models**: searching for free parking spaces via urban video cameras

- **Example of fast, inaccurate models**: car collision warning

#### **2. Problems the Authors Address**

- The most accurate modern neural networks **do not operate in real time**.
- The most accurate modern neural networks **require a large number of GPUs** for training with a large mini-batch size.

#### **3. Solutions the Authors Propose**
- Creating a CNN that operates **in real time on a conventional GPU**.
- Creating a CNN whose **training requires only one conventional GPU**, such as a 1080 Ti, so that everyone can train it.

#### **4. Main Goal**

- **Design an object detector with a fast operating speed in production systems, optimized for parallel computations rather than for the low computation volume theoretical indicator (BFLOP), that can be easily trained and used.**

#### **5. Contributions**

1. **Develop an efficient and powerful object detection model that anyone can train with a single 1080 Ti or 2080 Ti GPU.**

2. Verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during detector training.

3. Modify state-of-the-art methods and make them more efficient and suitable for single-GPU training, including CBN [89], PAN [49], SAM [85], etc.

---
### **::Related Work::**

#### **1. Modern Detector**
A modern detector is usually composed of two parts:
- ***Backbone*** pre-trained on ImageNet.
- ***Head*** used to **predict classes** and **bounding boxes** of objects.

#### **2. Backbones**

Backbones can be classified by the platform they run on (GPU or CPU).

**Backbones on GPU Platform**

- VGG [68]
- ResNet [26]
- ResNeXt [86]
- DenseNet [30]

**Backbones on CPU Platform**

- SqueezeNet [31]
- MobileNet [28, 66, 27, 74]
- ShuffleNet [97, 53]

#### **3. Head**

Heads are usually categorized into two kinds: **one-stage** object detectors and **two-stage** object detectors.

Representative object detectors of each kind are listed below.


**Two-Stage-Detector**

- R-CNN [19] series: Fast R-CNN [18], Faster R-CNN [64], R-FCN [9]
- Libra R-CNN [58]

**Anchor-Free Two-Stage-Detector**
- RepPoints [87]


**One-Stage-Detector**

- YOLO
- SSD
- RetinaNet

**Anchor-Free One-Stage-Detector**
- CenterNet [13]
- CornerNet [37, 38]
- FCOS [78]
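
To make the anchor-based versus anchor-free distinction concrete, the sketch below decodes one raw prediction into a box in both styles. It assumes the YOLOv2/v3-style parameterization for the anchor-based case and the FCOS-style parameterization for the anchor-free case. Both come from those papers rather than from this summary, and all coordinates are in grid-cell units. Function and argument names are illustrative.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_anchor_based(
    raw: Tuple[float, float, float, float],
    cell: Tuple[int, int],
    anchor: Tuple[float, float],
) -> Box:
    """Anchor-based decoding (assumed YOLOv2/v3 style).

    The center is an offset inside the grid cell; the size rescales a prior (anchor) box.
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx      # center x, constrained to stay inside the cell
    by = sigmoid(ty) + cy      # center y, constrained to stay inside the cell
    bw = pw * math.exp(tw)     # width as a multiple of the anchor width
    bh = ph * math.exp(th)     # height as a multiple of the anchor height
    return (bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)


def decode_anchor_free(
    distances: Tuple[float, float, float, float],
    point: Tuple[float, float],
) -> Box:
    """Anchor-free decoding (assumed FCOS style).

    The network predicts distances from a location to the left, top, right, and bottom sides.
    """
    left, top, right, bottom = distances
    x, y = point
    return (x - left, y - top, x + right, y + bottom)


print(decode_anchor_based((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(2.0, 6.0)))
print(decode_anchor_free((1.0, 2.0, 3.0, 4.0), point=(10.0, 10.0)))
```

The anchor-based decoder needs a prior box shape for every prediction, while the anchor-free decoder needs only the location the prediction was made at.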

#### **4. Neck**

Object detectors developed in recent years often insert some layers between the backbone and the head.

These layers are usually used to collect feature maps from different stages.

Usually, a neck is composed of several **bottom-up paths** and several **top-down paths**.

Networks equipped with this mechanism are listed below.

- Feature Pyramid Network (FPN) [44]
- Path Aggregation Network (PAN) [49]
- BiFPN [77]
- NAS-FPN [17]
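
The top-down half of such a neck can be sketched in a few lines. The toy example below assumes an FPN-style fusion in which each coarser map is upsampled 2x and added to the next finer map. It leaves out the learned 1x1 and 3x3 convolutions a real FPN uses, as well as the extra bottom-up path that PAN adds. Feature maps are plain nested lists and all names are illustrative.

```python
from typing import List

FeatureMap = List[List[float]]  # toy single-channel feature map


def upsample2x(fm: FeatureMap) -> FeatureMap:
    """Nearest-neighbor 2x upsampling: every value becomes a 2x2 block."""
    out: FeatureMap = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two maps with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_neck(levels: List[FeatureMap]) -> List[FeatureMap]:
    """Fuse backbone outputs ordered fine -> coarse, each half the size of the previous one."""
    fused = [levels[-1]]  # the coarsest level passes through unchanged
    for lateral in reversed(levels[:-1]):
        # Push coarse, semantically strong features down into the finer level.
        fused.append(add_maps(lateral, upsample2x(fused[-1])))
    fused.reverse()  # restore fine -> coarse ordering
    return fused


backbone_levels = [
    [[1.0] * 4 for _ in range(4)],   # stride-8-like level (4x4)
    [[10.0] * 2 for _ in range(2)],  # stride-16-like level (2x2)
    [[100.0]],                       # stride-32-like level (1x1)
]
for level in top_down_neck(backbone_levels):
    print(level)
```

After fusion, every level carries information from all coarser levels, which is what lets the head detect objects at several scales from the same set of maps.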

#### **5. New Backbone**

In addition to the above models, some researchers focus on directly building a new backbone for object detection:
- DetNet [43]
- DetNAS [7] 

#### **6. Whole New Model**
- SpineNet [12]
- HitDetector [20]
---