# YOLO (You Only Look Once)

The main idea behind YOLO is to perform object detection in a single pass through the neural network, which involves predicting bounding boxes and class probabilities directly from the image pixels. This approach is in contrast to methods that first propose candidate regions and then classify each region separately.


### Main components of YOLOv4

**Backbone:** YOLOv4 uses CSPDarknet53 as its backbone, the network responsible for ***extracting features from the input image.*** CSP stands for Cross Stage Partial; CSP connections ***reduce*** the computation ***cost*** while maintaining ***accuracy.***

**Neck:** The neck is composed of a Path Aggregation Network (PANet) and a Spatial Pyramid Pooling (SPP) block. PANet aggregates features from different levels of the backbone, which helps detect objects at various scales. The SPP block increases the receptive field and helps separate out the most significant context features.

**Head:** The head of YOLOv4 performs the ***detection tasks.*** It consists of YOLO layers that ***predict bounding boxes***, objectness scores (the ***likelihood*** that a box contains an object), and class predictions. The head processes the feature maps at three different scales, which allows it to detect small, medium, and large objects.
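
To make the three-scale output concrete, here is a minimal sketch that computes the shape of each head output. The strides (8, 16, 32), the three anchors per scale, and the 416-pixel input are assumptions based on commonly used Darknet configurations, not values stated in these notes; `head_output_shapes` is a hypothetical helper name.

```python
def head_output_shapes(input_size, num_classes, strides=(8, 16, 32), anchors_per_scale=3):
    """Return (grid_h, grid_w, channels) for each detection scale.

    Assumed layout: every anchor predicts 4 box offsets, 1 objectness
    score, and one score per class.
    """
    channels = anchors_per_scale * (5 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


# Assumed example: 416x416 input, 80 classes.
shapes = head_output_shapes(416, 80)
for stride, shape in zip((8, 16, 32), shapes):
    print(f"stride {stride:>2}: {shape}")
```

Under these assumptions the finest grid (stride 8) has the most cells and is the one best suited to small objects, while the coarsest grid (stride 32) covers large objects.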

**Anchor Boxes:** YOLOv4 uses anchor boxes, which are **predefined bounding boxes** that the model uses as a starting point for predicting actual bounding boxes around objects.
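
The idea of an anchor as a "starting point" can be sketched with the YOLOv3-style decoding rule: the network outputs offsets, and the anchor's width and height are scaled by them. This is an illustration under assumptions, not the exact YOLOv4 code; some YOLOv4 implementations add a scaling factor to the sigmoid term, which is omitted here. `decode_box` and its parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Turn raw offsets into a box (center x, center y, width, height) in pixels.

    (cx, cy) is the index of the grid cell, the anchor size is given in
    pixels, and stride is the downsampling factor of this scale.
    """
    bx = (sigmoid(tx) + cx) * stride  # center stays inside its grid cell
    by = (sigmoid(ty) + cy) * stride
    bw = anchor_w * math.exp(tw)      # anchor width scaled by exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


# With all-zero offsets the box sits at the cell center and keeps the anchor size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, anchor_w=30.0, anchor_h=60.0, stride=32)
print(box)
```

Because the width and height are multiplied by an exponential, the model only has to learn how much to stretch or shrink a reasonable prior rather than regress absolute sizes from scratch.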


<img src="./img/comp_vision/YOLOV_architecture.png" alt="nearby_objects" width="900"/>

### YOLO Architecture

- **Residual Blocks:** These blocks make deep networks easier to train by giving gradients a direct path through the network, which helps keep them from **vanishing or exploding.** They usually consist of convolutional layers with **skip connections.**

- **Detection Layers:** These are the layers where the **actual prediction of bounding boxes** and class probabilities occurs.

- **Upsampling Layers:** These layers increase the **spatial resolution of the feature maps**, which allows the network to detect smaller objects.

- **Concatenation and Addition:** These operations are used to **combine feature maps**. Concatenation stacks feature maps along the channel dimension, preserving all features, while addition **sums feature maps element-wise**, as in the skip connection of a residual block.

- **Further Layers:** The dots in the diagram indicate that the network extends beyond what is shown, likely for additional processing or to handle different scales of detection.

- **Scale 1, Scale 2, Scale 3:** These are the multi-scale predictions that YOLO networks perform. Different scales allow the detection of objects of various sizes. The stride is the factor by which the input image is downsampled at that scale; a larger stride results in a smaller feature map.
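
The upsampling, concatenation, and addition operations listed above can be illustrated with a small pure-Python sketch on nested lists in `[channel][row][column]` layout. The helper names are hypothetical and nearest-neighbour upsampling is an assumption; a real network would use a tensor library and typically apply convolutions around these steps.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling: repeat every value `factor` times per axis."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]    # repeat columns
            rows.extend(list(wide) for _ in range(factor))    # repeat rows
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenation: stack channels, so the channel count grows."""
    return a + b


def add_elementwise(a, b):
    """Addition: sum matching positions, so the shape stays the same."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


deep = [[[1, 2], [3, 4]]]                      # 1 channel, 2x2 (coarse scale)
shallow = [[[10] * 4 for _ in range(4)]]       # 1 channel, 4x4 (finer scale)

up = upsample_nearest(deep)                    # 2x2 -> 4x4
merged = concat_channels(up, shallow)          # channels: 1 + 1 = 2
summed = add_elementwise(up, shallow)          # channels stay at 1

print(shape(up), shape(merged), shape(summed))
```

Concatenation keeps both sources intact for later layers to mix, at the cost of more channels; addition keeps the channel count fixed, which is why it suits residual connections.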

#### TODO: Deploy on embedded devices.