# YOLO Object Detection using OpenCV

[Prashant Brahmbhatt](https://www.github.com/hashbanger)

____

## Why CNNs aren't good enough!

First of all, why do we need other detection algorithms if we already have **Convolutional Neural Networks**?  
As you can guess, it is to overcome the disadvantages of traditional CNNs. Some of them are:  
- High computational cost.
- Without a good GPU they are quite slow to train (for complex tasks).
- They need a lot of training data.  
- CNNs depend on good initial parameter tuning (a good starting point) to avoid local optima.

But we do have R-CNNs and Faster R-CNNs as well, don't we?  
Although they improve on vanilla CNNs by using a region proposal algorithm to localize candidate regions and then classify them with convolutional features, they are still quite slow, sadly!

CNNs are good at image classification, which requires a single class to be associated with an image. In real-life scenarios that's not good enough! We need to detect multiple objects in an image and also where they are located, termed **Object Detection** and **Object Localization**.

If you're confused about image classification, object detection and segmentation, have a look at the image below.  
![img1](img1.png)


## The YOLO Approach (You Only Look Once)

As the original paper states, the object detection problem is reframed as a regression problem. YOLO trains on full images and directly optimizes detection performance. It doesn't require a complex pipeline.  
Unlike the sliding window technique, it looks at the image only once, hence the name. It implicitly encodes contextual information about the classes and their appearance.  
YOLO sees the entire image at once, gets its full context, and so makes fewer background errors.  
YOLO is a highly generalizable approach: it is less prone to bad performance on unexpected inputs or unknown domains.  

### Working
- YOLO divides the image into an $S \times S$ grid. If the center of an object lies in a grid cell, then that cell becomes responsible for detecting that object. **(Image 1)**
- Each grid cell is responsible for predicting $B$ bounding boxes and a confidence score for each of those boxes, showing how sure the model is that the box contains an object. The score doesn't indicate what kind of object it is, only whether the box contains some object. If there is no object, the confidence should be zero (duh!).
- Each bounding box consists of 5 predictions: $x, y, w, h$ and a confidence. The $(x, y)$ are the coordinates of the center of the box relative to the bounds of the grid cell. The $w, h$ are the width and the height, predicted relative to the whole image.  
- When we visualise all of the predictions, we get a bunch of bounding boxes around each object, and the thickness of each box depends on its confidence score. **(Image 2)**
- Each grid cell also predicts class probabilities. These are conditional probabilities: given that the cell contains an object, the probability of each class.
- It predicts only one set of class probabilities per grid cell regardless of $B$. So if a cell predicts a *Dog*, that doesn't mean it contains a dog, but rather that if the cell contains an object then it is most probably a dog. **(Image 3)** At test time it multiplies the conditional class probabilities by the individual box confidence predictions.

![img3](img3.png)
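Where $IOU$ is the ***"Intersection over Union"***: the area of overlap between the predicted box and the ground-truth box divided by the area of their union.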

The resulting scores encode not only the probability of the class appearing in the box but also how well the box fits the object.
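
As a rough illustration of the two quantities above, the sketch below computes IoU for two axis-aligned boxes and multiplies a conditional class probability by a box confidence. The function names `iou` and `class_score` and the `(x1, y1, x2, y2)` corner format are assumptions made for this example; they do not come from the paper or from OpenCV.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_score(class_prob, box_confidence):
    """Class-specific score: conditional class probability times box confidence."""
    return class_prob * box_confidence


# Two 2x2 boxes that overlap in a 1x1 square: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(class_score(0.8, 0.5))
```

A box that matches the ground truth exactly has an IoU of 1, and two boxes that do not touch have an IoU of 0.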

- We then have a lot of predictions, which can include multiple predictions for the same object from different grid cells with different confidence values, so we use ***Non-Max Suppression***. NMS, in a nutshell, discards bounding boxes with a confidence score below a selected threshold; among the remaining ones it keeps the box with the highest score and discards the others that overlap it too much (measured by IoU), hence the name. **(Image 4)**


![img2](img2.png)
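
To make the suppression step more concrete, here is a minimal pure-Python sketch of greedy NMS. It is an illustration under assumptions, not the routine OpenCV runs internally: the names `iou_xywh` and `nms`, the `(x, y, w, h)` box format and the sample values are all made up for this example.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as (x, y, w, h) with (x, y) the top-left corner."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold, iou_threshold):
    """Return the indices of the boxes that survive greedy non-max suppression."""
    # Step 1: drop weak boxes and sort the rest by descending score.
    order = sorted(
        (i for i in range(len(boxes)) if scores[i] > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    while order:
        # Step 2: the best remaining box always survives.
        best = order.pop(0)
        keep.append(best)
        # Step 3: discard every box that overlaps the survivor too much.
        order = [i for i in order if iou_xywh(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [300, 300, 50, 50], [0, 0, 20, 20]]
scores = [0.9, 0.8, 0.7, 0.2]
print(nms(boxes, scores, 0.5, 0.3))  # prints [0, 2]
```

Box 1 is dropped because it overlaps the stronger box 0 almost completely, and box 3 is dropped because its score is below the score threshold.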

____

## Object Detection in Images (Not in Real Time)


Note: The code below is better run as a single .py script, since cell-by-cell execution is not possible when command line arguments are involved.

In [1]:
import numpy as np
import argparse
import time 
import cv2
import os

constructing the argument parser and parsing the arguments

In [2]:
ap = argparse.ArgumentParser()

In [None]:
ap.add_argument('-i','--image', required = True, help = 'the path to input image')
ap.add_argument('-y','--yolo', required = True, help = 'base path to the YOLO directory')
ap.add_argument('-c','--confidence', type = float , default = 0.5, help = 'min probability to filter the weak detections')
ap.add_argument('-t','--threshold', type = float , default = 0.3, help = 'threshold to apply in NMS')
args = vars(ap.parse_args())

The above command line arguments will be processed at runtime and they provide the flexibility of changing the inputs to our script from the terminal.   


**--image :** the path to the input image  
**--yolo :** base path to the YOLO directory  
**--confidence :** the minimum probability that will filter out the weaker detections  
**--threshold :** the value of the threshold that will be used during the Non-Max Suppression. 


Once we parse the arguments, the args variable becomes a dictionary with the key, value pairs for the command line arguments.
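
Because `parse_args()` reads `sys.argv` by default, it fails inside a notebook cell. One possible workaround, assuming you want to try the cells interactively, is to pass the arguments as an explicit list. The sketch below reuses the paths from the run command at the end of this notebook; adjust them to your own layout.

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help='the path to input image')
ap.add_argument('-y', '--yolo', required=True, help='base path to the YOLO directory')
ap.add_argument('-c', '--confidence', type=float, default=0.5,
                help='min probability to filter the weak detections')
ap.add_argument('-t', '--threshold', type=float, default=0.3,
                help='threshold to apply in NMS')

# Passing a list makes argparse ignore sys.argv, so this also runs in a notebook.
args = vars(ap.parse_args(['--image', 'images/image.jpg', '--yolo', 'yolo-coco']))

# The long option names become the dictionary keys; omitted options use defaults.
print(args['image'], args['yolo'], args['confidence'], args['threshold'])
```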

In the next step we load the class labels and then assign a random color to each class.

In [None]:
labels_path = os.path.sep.join([args['yolo'], "coco.names"])
LABELS = open(labels_path).read().strip().split('\n')

initializing a list of colors to represent each unique class

In [9]:
np.random.seed(45)
COLORS = np.random.randint(0, 255, size = (len(LABELS), 3), dtype = 'uint8') # one 3-channel color per class (OpenCV draws in BGR order)

Now we derive the paths to the yolo trained weights and the configuration files from our disk.

In [None]:
weightsPath = os.path.sep.join([args['yolo'], "yolov3.weights"])
configPath = os.path.sep.join([args['yolo'], "yolov3.cfg"])

loading our yolo detector trained on the coco dataset

For loading YOLO from the disk, we’ll take advantage of OpenCV’s DNN function called **cv2.dnn.readNetFromDarknet**.  
This function requires both the configPath and the weightsPath, which we derived above from the **--yolo** command line argument.

In [None]:
print('loading the model from disk...')
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)

Now we load the image and send it into the network.

getting the image dimensions

In [None]:
image = cv2.imread(args['image'])
(H, W) = image.shape[:2]

determining the output layer names that we need from the yolo

In [None]:
ln = net.getLayerNames()
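# indices are 1-based; flatten() handles both the nested and the flat index format
ln = [ln[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]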
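
The index arithmetic above can be easy to get wrong, so here is a small stand-alone sketch of it. It assumes that `getUnconnectedOutLayers()` returns 1-based indices, either as one-element sequences (older OpenCV builds) or as plain integers (newer builds). The helper name `output_layer_names` and the shortened layer list are hypothetical, and plain lists stand in for the arrays OpenCV returns.

```python
def output_layer_names(layer_names, out_indices):
    """Map 1-based output-layer indices to layer names for both index formats."""
    names = []
    for idx in out_indices:
        # Nested format: each entry is a one-element sequence such as [2].
        if hasattr(idx, '__len__'):
            idx = idx[0]
        # The indices start at 1, while Python lists start at 0.
        names.append(layer_names[int(idx) - 1])
    return names


# Hypothetical, shortened list of layer names for illustration only.
layer_names = ['conv_0', 'yolo_82', 'conv_83', 'yolo_94', 'yolo_106']

print(output_layer_names(layer_names, [[2], [4], [5]]))  # nested format
print(output_layer_names(layer_names, [2, 4, 5]))        # flat format
```

Both calls select the same three names, which is the behaviour we want regardless of the installed OpenCV version.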

We now construct a blob from the image and then perform a forward pass of the YOLO object detector giving us the bounding boxes and the associated probabilities.  
We pass the blob through our network and show the time taken.

In [None]:
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB = True, crop = False)
net.setInput(blob)
start = time.time()
layerOutputs = net.forward(ln)
end = time.time()  
print("YOLO took {:.6f} seconds".format(end - start))

We need to initialize some lists that we will require.  
**boxes** - our bounding boxes around the object  
**confidences** - our model's confidence values, showing how confident YOLO is in detecting an object.  
**classIDs** - the detected object's class label. 

In [16]:
boxes = []
confidences = []
classIDs = []

We populate these lists with our network outputs.  
Now we loop over each of the layer outputs, then over each detection in that output, and extract the classID and the confidence.  
We use the confidence to filter out weak detections.

After filtering out the unwanted detections, we:  
- Scale the bounding box coordinates so we can display them properly on our original image.  
- Extract the coordinates and dimensions of the bounding box. YOLO returns bounding box coordinates in the form (centerX, centerY, width, height).
- Use this information to derive the top-left (x, y) coordinates of the bounding box.
- Update the boxes, confidences, and classIDs lists.

In [None]:
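for output in layerOutputs:
    for detection in output:
        # each detection is [centerX, centerY, width, height, objectness, class scores...]
        scores = detection[5:]
        classID = np.argmax(scores)
        confidence = scores[classID]

        # keep only detections above the minimum confidence
        if confidence > args['confidence']:
            # scale the normalized box back to the original image size
            box = detection[0:4] * np.array([W, H, W, H])
            (centerX, centerY, width, height) = box.astype('int')

            # derive the top-left corner from the box center
            x = int(centerX - (width / 2))
            y = int(centerY - (height / 2))

            boxes.append([x, y, int(width), int(height)])
            confidences.append(float(confidence))
            classIDs.append(classID)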
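
To see what the loop does to a single detection, the sketch below decodes one made-up detection vector in plain Python. The helper name `decode_detection`, the three-class vector and the 640x480 image size are assumptions chosen for illustration; a real YOLOv3 detection trained on COCO has 80 class scores, and the rounding here differs slightly from the NumPy version above.

```python
def decode_detection(detection, img_w, img_h):
    """Turn one normalized detection into ([x, y, w, h], confidence, class_id)."""
    # Entries 0-3 are the box, entry 4 is objectness, the rest are class scores.
    scores = detection[5:]
    class_id = max(range(len(scores)), key=lambda i: scores[i])
    confidence = scores[class_id]

    # Scale the normalized center and size to pixel units.
    center_x, center_y = detection[0] * img_w, detection[1] * img_h
    width, height = detection[2] * img_w, detection[3] * img_h

    # Convert from the box center to the top-left corner.
    x = int(center_x - width / 2)
    y = int(center_y - height / 2)
    return [x, y, int(width), int(height)], confidence, class_id


# Hypothetical detection: centered box, a quarter as wide and half as tall as the image.
detection = [0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.7, 0.2]
box, confidence, class_id = decode_detection(detection, 640, 480)
print(box, confidence, class_id)  # prints [240, 120, 160, 240] 0.7 1
```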

YOLO doesn't apply NMS automatically, so to suppress weak, overlapping detections we apply NMS explicitly:

In [None]:
idxs = cv2.dnn.NMSBoxes(boxes, confidences, args['confidence'], args['threshold'])

Now we draw the boxes and the class text on the image.

In [None]:
if len(idxs) > 0:
    for i in idxs.flatten():
        (x, y) = (boxes[i][0], boxes[i][1])
        (w, h) = (boxes[i][2], boxes[i][3])
        
        color = [int(c) for c in COLORS[classIDs[i]]]
        cv2.rectangle(image, (x, y), (x+w, y+h), color ,2)
        text = "{}: {:.2f}".format(LABELS[classIDs[i]], confidences[i])
        cv2.putText(image, text, (x, y -5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color ,2)

        
cv2.imshow("Image", image)
cv2.waitKey(0)

To execute, go to the directory where the script is located and run a shell command like:  
**python yolo.py --image images/image.jpg --yolo yolo-coco**

references:    
https://arxiv.org/pdf/1506.02640v5.pdf (The original Paper)  
https://www.pjreddie.com  
https://www.stackoverflow.com  
https://www.medium.com  
https://www.pyimagesearch.com      


### de nada!