Paper

Summary

Question/Goal:

To develop an accurate and fast algorithm that can detect objects in real time.
The authors sought to improve upon the state-of-the-art object detection algorithms of the time, such as deformable part models (DPM) and R-CNN. These methods are computationally expensive, slow, and hard to optimize.

Methods:

DPM uses a sliding-window approach, running a classifier at evenly spaced locations across the whole image, while R-CNN first proposes candidate box regions, classifies each one with a CNN, and then applies post-processing steps. In You Only Look Once (YOLO), the authors reframe object detection as a single regression problem, going from image pixels to spatially separated bounding boxes and their associated class probabilities.

The input image is divided into an S X S grid. Each grid cell predicts B bounding boxes, each with a confidence score that reflects how likely the box is to contain an object and how accurate the box is. In addition, each cell predicts C conditional class probabilities, conditioned on the cell containing an object. Only one set of class probabilities is predicted per grid cell, regardless of the number of bounding boxes B. At test time, the class probabilities and the box confidences are multiplied to give class-specific confidence scores, as in the following equation:
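
Assuming the equation referred to here is the class-specific confidence score from the YOLO paper, where the box confidence is defined as Pr(Object) times the IOU between the predicted and ground-truth boxes, it reads:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
  = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
```

The resulting score reflects both the probability that class i appears in the box and how well the predicted box fits the object.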

Pictorially, the model is explained as such:

Predictions are encoded as an S X S X (B * 5 + C) tensor, where each of the B boxes contributes five values (x, y, w, h, and confidence).
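
The encoding can be illustrated with a small Python sketch. It uses the PASCAL VOC settings from the paper (S = 7, B = 2, C = 20). The function names and the ordering of values inside a cell (boxes first, then class probabilities) are assumptions made for illustration, not the authors' code.

```python
def prediction_shape(S, B, C):
    """Return the shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def split_cell(cell, B, C):
    """Split one grid cell's flat prediction vector into boxes and class probabilities.

    Assumed (hypothetical) layout: B groups of (x, y, w, h, confidence),
    followed by C conditional class probabilities.
    """
    if len(cell) != B * 5 + C:
        raise ValueError("cell must have B * 5 + C values")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


def class_scores(cell, B, C):
    """Class-specific confidence per box: class probability times box confidence."""
    boxes, class_probs = split_cell(cell, B, C)
    return [[p * box[4] for p in class_probs] for box in boxes]


S, B, C = 7, 2, 20  # PASCAL VOC settings reported in the paper
print(prediction_shape(S, B, C))  # (7, 7, 30)
```

Each cell shares a single set of C class probabilities across its B boxes, which is why the depth is B * 5 + C rather than B * (5 + C).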

Convolutional neural networks were used, and the models were evaluated on the PASCAL VOC dataset. The architecture was inspired by GoogLeNet, with 24 convolutional layers followed by 2 fully connected layers; alternating 1 X 1 convolutional layers reduce the feature space from the preceding layers. In addition, the authors trained Fast YOLO, a version with a shallower network, for faster object detection.

The final output of the network is a 7 X 7 X 30 tensor of predictions.
A linear activation function was used in the final layer, which predicts both the class probabilities and the bounding box coordinates. Leaky ReLU was used in all other layers.
A sum-squared error of the model’s output was used as the loss function.
The network was trained for 135 epochs using a batch size of 64, a momentum of 0.9, and a decay of 0.0005. For the learning rate schedule, the rate was raised slowly from 10^-3 to 10^-2 over the first epochs, then held at 10^-2 for 75 epochs, followed by 10^-3 for 30 epochs and finally 10^-4 for 30 epochs. Dropout and data augmentation were used to prevent overfitting.
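
The learning rate schedule can be sketched as a piecewise function of the epoch, with rates of 10^-2, 10^-3, and 10^-4 over 75, 30, and 30 epochs. This is a minimal sketch, not the authors' training code: the function name, the linear shape of the warm-up, and its length (`warmup_epochs=5`) are assumptions, since the summary only says "the first epochs". The warm-up is also assumed to count toward the 75 epochs at 10^-2, so that the total stays at 135.

```python
TOTAL_EPOCHS = 135  # 75 + 30 + 30


def learning_rate(epoch, warmup_epochs=5):
    """Return the learning rate for a 0-indexed epoch.

    warmup_epochs is a hypothetical value; the warm-up length is not
    specified in the summary.
    """
    if epoch < 0 or epoch >= TOTAL_EPOCHS:
        raise ValueError("epoch must be in [0, 135)")
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 toward 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4


for e in (0, 10, 80, 120):
    print(e, learning_rate(e))
```

Starting at a lower rate and ramping up is meant to avoid divergence from unstable gradients early in training.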

Experiments/Results:

YOLO was compared with other real-time detection systems on the PASCAL VOC 2007 dataset.

To break down the results, an error analysis was carried out on the VOC 2007 experiment.
YOLO struggled to localize objects correctly, whereas Fast R-CNN made many more background errors.

Because academic datasets are limited, generalizability was tested through a comparative analysis of person detection in artwork, using the Picasso Dataset and the People-Art Dataset. YOLO performed better than the other methods compared.

Finally, they discussed real-time detection in the wild and provided a demo link: http://pjreddie.com/yolo/

Conclusions: