# **YOLOv4: Optimal Speed and Accuracy of Object Detection**

**Authors: Alexey Bochkovskiy {alexeyab84@gmail.com}, Chien-Yao Wang {kinyiu@iis.sinica.edu.tw}, Hong-Yuan Mark Liao {liao@iis.sinica.edu.tw}**

**Official Github**: https://github.com/AlexeyAB/darknet

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you have any issues with these scripts, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 7 2022

---


### **Abstract**

So far, a huge number of **features** have been proposed to improve **CNN accuracy**.

- Some features apply only in certain **exclusive environments (e.g. certain models, certain problems, or small-scale datasets)**.

- Other features are applicable to the majority of models, tasks, and datasets **(e.g. batch-normalization, residual-connections)**.

The authors assume that such **universal features** include the following:

- **WRC**: Weighted-Residual-Connections
- **CSP**: Cross-Stage-Partial-connections
- **CmBN**: Cross mini-Batch Normalization
- **SAT**: Self-adversarial-training
- **Mish-activation**
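
Of the features listed above, Mish is the easiest to write down. The block below is a minimal sketch, assuming the commonly used definition `mish(x) = x * tanh(softplus(x))`. The helper names `softplus` and `mish` are illustrative and are not taken from the darknet code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    # For positive x, use the identity softplus(x) = x + ln(1 + e^-x)
    # so that e^x never overflows.
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    """Mish activation (assumed definition): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, this function is smooth everywhere and lets small negative values pass through, while behaving almost like the identity for large positive inputs.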

They also use the following new features (not listed among the assumed universal features in the abstract):

- Mosaic data augmentation
- DropBlock regularization
- CIoU loss
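
As an illustration of the last item, here is a rough sketch of a CIoU-style loss for a single pair of boxes. It assumes the commonly cited formulation `1 - IoU + rho^2 / c^2 + alpha * v` (center distance plus an aspect-ratio term). The function name `ciou_loss`, the corner box format, and the `eps` guard are assumptions made for this example, not the darknet implementation.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU-style loss for two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(x2, gx2) - min(x1, gx1)
    enclose_h = max(y2, gy2) - min(y1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes, close to 0
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes, above 1
```

The point of the extra terms is that the loss still gives a useful signal when the boxes do not overlap at all, where plain IoU is simply 0.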

By using these new **features** and combining some of them, they achieved

- **state-of-the-art results: 43.5% AP (65.7% AP50)** 
- for the MS COCO dataset
- at a **realtime speed of ∼65 FPS**
- on Tesla V100

---

### **Introduction**

#### **1. Trade-offs between Accuracy & Speed**
The majority of CNN-based object detectors are largely applicable **only** for recommendation systems.
- **Example of Slow Accurate Models**: Searching for free parking spaces

- **Example of Fast Inaccurate Models**: Car collision warning

#### **2. Problems the Authors Address**

- The most accurate modern neural networks **do not operate in real time**.
- The most accurate modern neural networks **require a large number of GPUs** for training with a large mini-batch-size.

#### **3. Solutions the Authors Propose**
- Creating a CNN that operates **in real-time on a conventional GPU**.
- Creating a CNN **whose training requires only one conventional GPU** (e.g. a 1080 Ti), so that everyone can train it.

#### **4. Main Goal**

- **Designing an object detector with a fast operating speed in production systems, optimized for parallel computations, that can be easily trained and used.**

#### **5. Contributions**

1. **Develop an efficient and powerful object detection model that everyone can train with a 1080 Ti or 2080 Ti GPU.**

2. Verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during detector training.

3. Modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.

---
### **Related Work**

#### **1. Modern Detector**
A modern detector is usually composed of two parts:
- ***Backbone*** pre-trained on ImageNet.
- ***Head*** used to **predict classes** and **bounding boxes** of objects.

#### **2. Backbones**

Backbones can be classified according to the platform they run on (GPU or CPU).

**Backbones on GPU Platform**

- VGG [68]
- ResNet [26]
- ResNeXt [86]
- DenseNet [30]

**Backbones on CPU Platform**

- SqueezeNet [31]
- MobileNet [28, 66, 27, 74]
- ShuffleNet [97, 53]

#### **3. Head**

The head is usually categorized into two kinds: **one-stage** object detectors and **two-stage** object detectors.

Representative object detectors are listed below.


**Two-Stage-Detector**

- R-CNN [19] series: fast R-CNN [18], faster R-CNN [64], R-FCN [9]
- Libra R-CNN [58]

**Anchor-free Two-Stage-Detector**
- RepPoints [87]


**One-Stage-Detector**

- YOLO
- SSD
- RetinaNet

**Anchor-free One-Stage-Detector**
- CenterNet [13]
- CornerNet [37, 38]
- FCOS [78]

#### **4. Neck**

Object detectors developed in recent years often insert some layers between the backbone and the head.

These layers are usually used to collect feature maps from different stages.

Usually, a neck is composed of several **bottom-up paths** and several **top-down paths**.

Networks equipped with this mechanism are below.

- Feature Pyramid Network (FPN) [44]
- Path Aggregation Network (PAN) [49]
- BiFPN [77]
- NAS-FPN [17]

#### **5. New Backbone**

In addition to the above models, some researchers put their emphasis on directly building a new backbone:
- DetNet [43]
- DetNAS [7] 

#### **6. Whole New Model**
- SpineNet [12]
- HitDetector [20]
---
### **Terminologies**

**1. Dense Prediction**

***== "Semantic image segmentation"***

It classifies every pixel in the picture into one of a predetermined number of classes. It is called dense prediction because it makes a prediction for every pixel in the image.
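
The per-pixel classification described above can be sketched as an argmax over class scores at every pixel. This is only a toy illustration in plain Python: the function name `dense_predict` and the `scores[class][y][x]` layout are assumptions made for the example.

```python
def dense_predict(scores):
    """Turn per-class score maps scores[c][y][x] into a label map labels[y][x].

    Every pixel receives exactly one class label: the class with the
    highest score at that pixel.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Two classes on a 2x2 image.
    scores = [
        [[0.9, 0.2], [0.4, 0.1]],  # class 0
        [[0.1, 0.8], [0.6, 0.9]],  # class 1
    ]
    print(dense_predict(scores))  # [[0, 1], [1, 1]]
```

The output has the same spatial size as the input, which is what makes the prediction "dense".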

<img src="https://images.deepai.org/converted-papers/1809.04184/x1.png" width="500">

**2. Sparse Prediction**

Sparse prediction seems to be named this way because the network either selects ***"sparse"*** proposal regions from ***a huge number of predicted boxes*** (Fast R-CNN etc.) or ***predicts only a sparse set of boxes from the start*** (Sparse R-CNN).

<img src="https://miro.medium.com/max/2000/0*vhzrGqMMGQIyxg66.jpeg" height="300">

[Figure: The road to Sparse R-CNN — key ideas and intuition](https://medium.com/responsibleml/the-road-to-sparse-r-cnn-key-ideas-and-intuition-feb184d0d4f3)

**3. Anchor Free**

<---> ***"Anchor based detector"***

The object detection problem has mainly been studied with anchor-based detectors. An anchor-based detector predicts categories and adjusts coordinates for numerous preset anchors. There are two-stage methods and one-stage methods.
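
To make "adjusting coordinates in preset anchors" concrete, here is a small sketch. It assumes an R-CNN-style offset parameterization (shift the center by a fraction of the anchor size, scale width and height exponentially). YOLO-family detectors use a different parameterization, and the names `make_anchors` and `decode` are hypothetical.

```python
import math


def make_anchors(grid_w, grid_h, stride, sizes):
    """Place one anchor per (w, h) size at the center of every grid cell.

    Anchors are returned as (cx, cy, w, h) tuples in image coordinates.
    """
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for w, h in sizes:
                anchors.append((cx, cy, float(w), float(h)))
    return anchors


def decode(anchor, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


if __name__ == "__main__":
    anchors = make_anchors(grid_w=2, grid_h=2, stride=16, sizes=[(32, 32), (64, 32)])
    print(len(anchors))  # 2 * 2 cells * 2 sizes = 8 anchors
    print(decode(anchors[0], (0.25, 0.0, 0.0, 0.0)))  # center shifted right by 8 px
```

The anchor sizes, aspect ratios, and count per cell are exactly the anchor-related hyperparameters that anchor-free detectors try to avoid.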

However, due to the recent emergence of FPN and Focal Loss, research on anchor-free detectors is being conducted. An anchor-free detector finds objects directly, without anchors. There are two methods: a keypoint-based method, which predicts the location of an object using keypoints, and a center-based method, which first predicts the center of the object and then, if the point is positive, predicts the distances from it to the object boundary. Anchor-free detectors are considered to have more potential in the field of object detection because they obtain performance similar to anchor-based detectors without using anchor-related hyperparameters.

The center-based detector is similar to the anchor-based detector, except that it uses points instead of anchor boxes.
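
A center-based prediction can be decoded into a box with a few subtractions and additions. The sketch below assumes an FCOS-style output, where each positive point predicts its distances to the left, top, right, and bottom edges of the object. The function name `decode_center_based` is made up for this example.

```python
def decode_center_based(point, distances):
    """Convert a point and its predicted edge distances into a corner box.

    point:     (x, y) location predicted to lie inside an object
    distances: (left, top, right, bottom) distances to the object boundary
    returns:   (x1, y1, x2, y2)
    """
    x, y = point
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)


if __name__ == "__main__":
    print(decode_center_based((50, 40), (10, 5, 20, 15)))  # (40, 35, 70, 55)
```

No anchor widths or heights appear anywhere: the box is described relative to a point only.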

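[Reference: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection](https://byeongjokim.github.io/posts/Bridging-the-Gap-Between-Anchor-based-and-Anchor-free-Detection-via-Adaptive-Training-Sample-Selection/)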

**4. Neck**

As the name suggests, it is a connection between the backbone and the head. The neck collects feature maps from different stages of the backbone. Examples include FPN, PANet, and BiFPN.

For example, the figure below shows ***FPN***, the Feature Pyramid Network. [About FPN(Feature Pyramid Network)](https://github.com/jonychoi/Computer-Vision-Paper-Reviews/tree/main/Feature%20Pyramid%20Networks%20for%20Object%20Detection)

<img src="https://media.vlpt.us/images/hewas1230/post/94c179ec-f5cb-4729-bbf0-89fcc2d9722a/image.png" />

The FPN **enhances the existing convolutional network** through a top-down pathway and side (lateral) connections. This allows the network to build a rich, multi-scale feature pyramid from a single-resolution input image.

<img src="https://media.vlpt.us/images/hewas1230/post/b6a6c808-9e73-4c08-b51a-3d08439df63f/image.png" />

Each side (lateral) connection creates a different pyramid level by merging feature maps from the bottom-up pathway into the top-down pathway. Before the feature maps are merged, the previous pyramid level is up-sampled by a factor of 2 so that both have the same spatial size.
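
The merge step described above can be sketched on toy data. The block assumes single-channel feature maps stored as nested lists, nearest-neighbor up-sampling, and element-wise addition. It deliberately leaves out the convolutions a real FPN applies around the merge, and the names `upsample2x` and `merge_level` are illustrative only.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x up-sampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def merge_level(top_down, lateral):
    """Up-sample the coarser top-down map by 2x and add the lateral map."""
    up = upsample2x(top_down)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly 2x the top-down map")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


if __name__ == "__main__":
    coarse = [[1]]                  # 1x1 map from the previous pyramid level
    fine = [[10, 20], [30, 40]]     # 2x2 map from the bottom-up pathway
    print(merge_level(coarse, fine))  # [[11, 21], [31, 41]]
```

Repeating this step from the coarsest level down to the finest one yields one merged map per pyramid level.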

Thanks to these pyramid levels, a classification or regression network (the head) can be applied to each level to detect objects of different sizes.

<img src="https://media.vlpt.us/images/hewas1230/post/9bf68005-46d5-4ccd-be37-91df562852f8/image.png" />

(a) shows how features are extracted from the backbone of a single shot detector architecture (SSD). 

(b) shows the FPN method (FPN uses ResNet).

(c) shows the STDN method, and (d) shows the YOLOv4 method.


In conclusion, the underlying idea is the same in all of them: use a pyramid of feature maps at various scales to detect objects of different sizes.

[Reference: Object-Detection Architecture](https://velog.io/@hewas1230/ObjectDetection-Architecture)

**4.1 Neck - Additional blocks**

**4.2 Neck - Path-aggregation blocks**


**5. Heads**

The head is the part that performs the actual 'detection': classification and regression of the bounding boxes. The output takes the form of four values (x, y, h, w) and the probabilities of k + 1 classes (the +1 is for the background).
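
The output layout described above can be sketched as follows. The example assumes the head emits one flat vector per box: four box values followed by k + 1 raw class scores that are turned into probabilities with a softmax. The function name `parse_head_output` and the choice of index 0 for the background class are assumptions made for this illustration.

```python
import math


def parse_head_output(raw, num_classes):
    """Split raw = [x, y, h, w, s_0, ..., s_k] into a box and class probabilities.

    num_classes is k. The k + 1 scores include one background class,
    assumed here to sit at index 0.
    """
    if len(raw) != 4 + num_classes + 1:
        raise ValueError("expected 4 box values and k + 1 class scores")
    x, y, h, w = raw[:4]
    scores = raw[4:]
    # Softmax with the maximum subtracted for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return (x, y, h, w), probs


if __name__ == "__main__":
    box, probs = parse_head_output([10, 20, 30, 40, 0.0, 2.0, 0.0], num_classes=2)
    print(box)                       # (10, 20, 30, 40)
    print(probs.index(max(probs)))   # 1, the first non-background class
```

Note that not every detector uses an explicit background class; some predict a separate objectness score instead.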

Object detectors fall into two groups, depending on whether feature extraction and classification run end to end, or run on a restricted (sparse) number of boxes. Accordingly, we can classify object detectors as **Dense Prediction (one-stage)** and **Sparse Prediction (two-stage)**.

---