# YOLOv3 Object Detector

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#YOLOv3-Object-Detector" data-toc-modified-id="YOLOv3-Object-Detector-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>YOLOv3 Object Detector</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Model-Architecture" data-toc-modified-id="Model-Architecture-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Model Architecture</a></span></li><li><span><a href="#Implementation-in-arcgis.learn" data-toc-modified-id="Implementation-in-arcgis.learn-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Implementation in <code>arcgis.learn</code></a></span></li><li><span><a href="#Using-COCO-pretrained-weights" data-toc-modified-id="Using-COCO-pretrained-weights-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Using COCO pretrained weights</a></span></li><li><span><a href="#References" data-toc-modified-id="References-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>References</a></span></li></ul></li></ul></li></ul></div>

## Introduction

**YOLO (You Only Look Once)** is one of the most popular series of object detection models. Its advantage has been in providing real-time detections while approaching the accuracy of state-of-the-art object detection models.

In earlier work on object detection, models relied on either a sliding window technique or a region proposal network. A sliding window approach, as the name suggests, chooses a Region of Interest (RoI) by sliding a window across the image and then classifies the chosen RoI to detect an object. Region proposal networks work in two steps: first, they extract region proposals, and then they classify the proposed regions using CNN features. The sliding window method is not very precise, and although some region-based networks can be highly accurate, they tend to be slower.
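To make the cost of the sliding window approach concrete, here is a minimal sketch that only enumerates the candidate windows. The function name `sliding_windows` and the image, window, and stride sizes are illustrative assumptions, not part of `arcgis.learn`.

```python
from typing import Iterator, Tuple


def sliding_windows(
    image_w: int, image_h: int, win: int, stride: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x_min, y_min, x_max, y_max) for every window that fits inside the image."""
    for y in range(0, image_h - win + 1, stride):
        for x in range(0, image_w - win + 1, stride):
            yield (x, y, x + win, y + win)


# Hypothetical sizes: a 416x416 image, 104x104 windows, moved 52 pixels at a time.
windows = list(sliding_windows(image_w=416, image_h=416, win=104, stride=52))
print(len(windows))  # number of regions a classifier would have to evaluate
```

Even at a single window size, every yielded region needs its own classification pass, and a practical detector would repeat this for several window sizes and aspect ratios. That repeated work is what one-shot detectors avoid.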

Then came the one-shot object detectors such as [SSD](https://arxiv.org/abs/1512.02325), [YOLO](https://arxiv.org/pdf/1506.02640.pdf) and [RetinaNet](https://arxiv.org/abs/1708.02002). These models detect objects in a single pass over the image and are therefore considerably faster, while still being able to match the accuracy of region-based detectors. The [SSD guide](https://developers.arcgis.com/python/guide/how-ssd-works/) explains the essential components of a one-shot object detection model. You can also read the RetinaNet guide [here](https://developers.arcgis.com/python/guide/how-retinanet-works/). These models are already part of the ArcGIS API for Python, and the addition of [**YOLOv3**](https://arxiv.org/abs/1804.02767) provides another tool in our deep learning toolbox.

The biggest advantage of YOLOv3 in `arcgis.learn` is that it comes preloaded with weights pretrained on the [COCO dataset](https://cocodataset.org/). This makes it ready to use for the 80 common object classes (car, truck, person, etc.) that are part of the COCO dataset.

<center><img src="../../static/img/yolov3_demo.gif"/></center>
<center>Figure 1. Real-time Object detection using YOLOv3 [1]</center>

### Model Architecture

YOLOv3 uses **Darknet-53** as its backbone. This contrasts with the popular ResNet family of backbones used by other models such as SSD and RetinaNet. Darknet-53 is a deeper version of Darknet-19, which was used in [YOLOv2](https://arxiv.org/pdf/1612.08242.pdf), the prior version. As the name suggests, this backbone architecture has 53 convolutional layers. Adopting ResNet-style residual layers has improved its accuracy while maintaining the speed advantage. This feature extractor performs better than ResNet-101 and similarly to ResNet-152 while being about 1.5x and 2x faster, respectively [2].
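One common way to arrive at the 53 in the name is to tally the layers stage by stage. The sketch below assumes the stage layout given in the architecture table of the YOLOv3 paper [2] (residual block counts of 1, 2, 8, 8 and 4) and counts the final classification layer as the 53rd layer. The names `STAGES` and `count_darknet53_layers` are illustrative only.

```python
# (filters, number_of_residual_blocks) per stage; assumed from the
# architecture table in the YOLOv3 paper [2].
STAGES = [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]


def count_darknet53_layers(stages=STAGES):
    """Return (convolutional layers, convolutional layers plus classifier layer)."""
    conv_layers = 1  # initial 3x3 convolution with 32 filters
    for _filters, num_blocks in stages:
        conv_layers += 1               # stride-2 convolution that downsamples
        conv_layers += 2 * num_blocks  # each residual block: 1x1 conv, then 3x3 conv
    fully_connected = 1  # classification layer, dropped when used as a detection backbone
    return conv_layers, conv_layers + fully_connected


conv_layers, total_layers = count_darknet53_layers()
print(conv_layers, total_layers)
```

Each residual block adds its input back to the output of its two convolutions, which is the ResNet-style shortcut mentioned above.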

YOLOv3 has incremental improvements over its prior versions [2]. It upsamples deeper feature maps and concatenates them with earlier feature maps, which preserves fine-grained features. Another improvement is the use of three scales for detection, which makes the model good at detecting objects of varying sizes in an image. There are other improvements in anchor box selection, the loss function, etc. For a detailed analysis of the YOLOv3 architecture, please refer to this [blog](https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b).
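The three detection scales can be illustrated with a little arithmetic. The sketch below assumes a 416x416 input, strides of 32, 16 and 8, and 3 anchor boxes per scale. These are values commonly used with YOLOv3 and are assumptions here, not settings read from `arcgis.learn`. Each anchor predicts 4 box offsets, 1 objectness score and one score per class, as described in [2].

```python
def yolov3_output_shapes(input_size=416, num_classes=80,
                         anchors_per_scale=3, strides=(32, 16, 8)):
    """Return the (grid_h, grid_w, channels) shape of each detection scale."""
    shapes = []
    for stride in strides:
        grid = input_size // stride  # coarser stride gives a smaller grid
        channels = anchors_per_scale * (4 + 1 + num_classes)
        shapes.append((grid, grid, channels))
    return shapes


shapes = yolov3_output_shapes()
# Every grid cell predicts one box per anchor (3 per scale in this sketch).
total_boxes = sum(h * w * 3 for h, w, _ in shapes)
print(shapes)
print(total_boxes)
```

Under these assumptions, the coarsest grid covers large objects and the finest grid covers small ones, which is why the three-scale design helps with objects of varying sizes.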

<center><img src="../../static/img/yolov3.jpg"/></center>
<center>Figure 2. YOLOv3 architecture [3]</center>

### Implementation in `arcgis.learn`

You can create a YOLOv3 model in `arcgis.learn` using a single line of code.
```
model = YOLOv3(data)
```
where `data` is the databunch prepared for training using the `prepare_data` function in an earlier step.

For more information about the API, please go to the [API reference](https://developers.arcgis.com/python/api-reference/arcgis.learn.toc/#yolov3).

### Using COCO pretrained weights

To use the model out of the box with COCO pretrained weights, initialize the model as follows:

```
model = YOLOv3()
```

Note that the model must be initialized without passing any `data`. Because we are using the pretrained weights rather than training the model, no databunch is required. Any oriented image or video can be used for inference with the following commands, respectively:

```
model.predict(image_path)

model.predict_video(input_video_path, metadata_file)
```

The following 80 classes are available for object detection in the COCO dataset:
```
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass',
'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair',
'couch', 'potted plant', 'bed', 'dining table',
'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
```
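Since the pretrained model covers all 80 classes, a typical follow-up step is to keep only the classes of interest. The sketch below is plain Python and makes a loud assumption: detections are available as parallel lists of boxes, labels and scores. That layout, the helper `filter_detections`, and the sample values are hypothetical. Check the API reference for the actual return format of `predict`.

```python
# Subset of the COCO class names listed above.
VEHICLE_CLASSES = {"bicycle", "car", "motorcycle", "bus", "train", "truck"}


def filter_detections(boxes, labels, scores, keep_classes, min_score=0.5):
    """Keep detections whose label is in keep_classes and whose score is high enough."""
    kept = []
    for box, label, score in zip(boxes, labels, scores):
        if label in keep_classes and score >= min_score:
            kept.append((box, label, score))
    return kept


# Hypothetical detections, for illustration only.
boxes = [[10, 20, 50, 80], [100, 40, 60, 60], [5, 5, 30, 30]]
labels = ["car", "person", "truck"]
scores = [0.92, 0.88, 0.35]

vehicles = filter_detections(boxes, labels, scores, VEHICLE_CLASSES)
print(vehicles)
```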

### References

* [1] Sik-Ho Tsang, "Review: YOLOv3 — You Only Look Once (Object Detection)", https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1ba6.
* [2] Joseph Redmon, Ali Farhadi, "YOLOv3: An Incremental Improvement", 2018; [arXiv:1804.02767](https://arxiv.org/abs/1804.02767).
* [3] Ayoosh Kathuria, "What's new in YOLO v3?", https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b.