https://learnopencv.com/object-detection-using-yolov5-and-opencv-dnn-in-c-and-python/

Using PyTorchHub
The following script downloads a pre-trained model from PyTorch Hub and runs inference on an image. By default, yolov5s.pt is downloaded unless a different model name is given. The results can be printed to the console, saved to ./yolov5/runs/hub, displayed on screen (when running locally), and returned as tensors or pandas data frames. You can also experiment with the various inference attributes; the YOLOv5 PyTorch Hub documentation has the details.

In [None]:
import cv2
import torch
# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
# Image
img = cv2.imread(PATH_TO_IMAGE)[..., ::-1]  # OpenCV loads BGR; the hub model expects RGB
# Inference
results = model(img, size=640)  # includes NMS
# Results
results.print()  
results.save()

The following commands convert the YOLOv5s model to ONNX. The notebook linked below contains the code to convert and download the rest of the models.
https://colab.research.google.com/github/spmallick/learnopencv/blob/master/Object-Detection-using-YOLOv5-and-OpenCV-DNN-in-CPP-and-Python/Convert_PyTorch_models.ipynb

In [None]:
# Clone the repository.
!git clone https://github.com/ultralytics/yolov5

# Install dependencies.
%cd yolov5
!pip install -r requirements.txt
!pip install onnx

# Download .pt model into the models directory.
%cd models
!wget https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt

# Export to ONNX (run from the repository root, where export.py lives).
%cd ..
!python export.py --weights models/yolov5s.pt --include onnx

# Download the file.
from google.colab import files
files.download('models/yolov5s.onnx')

Import Libraries


In [None]:
import cv2
import numpy as np



The constants INPUT_WIDTH and INPUT_HEIGHT set the blob size. Blob stands for Binary Large Object; it holds the image data in the raw numeric layout that the network reads. The image has to be converted to a blob before the network can process it. In our case, the blob is a 4-D array with the shape (1, 3, 640, 640), that is, batch size, channels, height, and width.
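To make the blob shape concrete, here is a minimal NumPy sketch of the scaling and channel reordering that the blob conversion performs. It is an illustration under assumptions, not OpenCV's implementation: the random array stands in for an image that has already been resized to 640x640, and resizing itself is left out.

```python
import numpy as np

INPUT_WIDTH, INPUT_HEIGHT = 640, 640

# Stand-in for an image already resized to the network input size (H, W, BGR).
image = np.random.randint(0, 256, (INPUT_HEIGHT, INPUT_WIDTH, 3), dtype=np.uint8)

# Scale pixel values to [0, 1], swap BGR -> RGB, then reorder HWC -> CHW.
scaled = image.astype(np.float32) / 255.0
rgb = scaled[..., ::-1]
chw = np.transpose(rgb, (2, 0, 1))

# Add the leading batch dimension to get an NCHW array.
blob = np.ascontiguousarray(chw[np.newaxis, ...])
print(blob.shape)  # (1, 3, 640, 640)
```

In the notebook itself, cv2.dnn.blobFromImage() handles all of these steps, including the resize, in a single call.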

SCORE_THRESHOLD: Filters out low-probability class scores.
NMS_THRESHOLD: IoU threshold used to remove overlapping bounding boxes.
CONFIDENCE_THRESHOLD: Filters out low-probability detections.

In [None]:
# Constants.
INPUT_WIDTH = 640
INPUT_HEIGHT = 640
SCORE_THRESHOLD = 0.5
NMS_THRESHOLD = 0.45
CONFIDENCE_THRESHOLD = 0.45

# Text parameters.
FONT_FACE = cv2.FONT_HERSHEY_SIMPLEX
FONT_SCALE = 0.7
THICKNESS = 1

# Colors.
BLACK  = (0,0,0)
BLUE   = (255,178,50)
YELLOW = (0,255,255)

Draw Label
The function draw_label writes the class name at the top-left corner of the bounding box. The code is fairly simple. The label string is passed to the OpenCV function getTextSize(), which returns the size of the box that the text would occupy, along with its baseline. These dimensions are used to draw a black background rectangle, on which putText() then renders the label.


In [None]:
def draw_label(im, label, x, y):
    """Draw text onto image at location."""
    # Get text size.
    text_size = cv2.getTextSize(label, FONT_FACE, FONT_SCALE, THICKNESS)
    dim, baseline = text_size[0], text_size[1]
    # Use text size to create a BLACK rectangle.
    cv2.rectangle(im, (x,y), (x + dim[0], y + dim[1] + baseline), BLACK, cv2.FILLED)
    # Display text inside the rectangle.
    cv2.putText(im, label, (x, y + dim[1]), FONT_FACE, FONT_SCALE, YELLOW, THICKNESS, cv2.LINE_AA)

PRE-PROCESSING
The function pre_process takes the image and the network as arguments. First, the image is converted to a blob, which is then set as the input to the network. The function getUnconnectedOutLayersNames() provides the names of the output layers. The image is forward propagated through the network up to these layers to obtain the detections, which the function returns.

In [None]:
def pre_process(input_image, net):
      # Create a 4D blob from a frame.
      blob = cv2.dnn.blobFromImage(input_image, 1/255, (INPUT_WIDTH, INPUT_HEIGHT), [0,0,0], swapRB=True, crop=False)

      # Sets the input to the network.
      net.setInput(blob)

      # Run the forward pass to get output of the output layers.
      outputs = net.forward(net.getUnconnectedOutLayersNames())
      return outputs

POST-PROCESSING
In the previous function pre_process, we get the detection results as an object. It needs to be unwrapped for further processing. Before discussing the code any further, let us see the shape of this object and what it contains.
Filter Good Detections
While unwrapping, we need to be careful with the shape. With OpenCV-Python 4.5.5, the object is a tuple holding a 3-D array of size 1 x rows x columns, whereas we want a rows x columns array. Hence, the array is accessed at the zeroth index. This issue is not observed in C++.

Once the zeroth index is taken, the detections form a 2-D array whose size depends on the input size. For example, with the default input size of 640, we get a 2-D array of size 25200×85 (rows × columns). Each row is one detection, so every time the network runs it predicts 25200 bounding boxes. Every bounding box is described by a 1-D array of 85 entries that indicates the quality of the detection. This information is enough to filter out the desired detections.
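The figure 25200×85 can be sanity-checked with a little arithmetic. The sketch below assumes the standard YOLOv5 detection head, with three output scales at strides 8, 16, and 32 and three anchors per grid cell; those values come from the YOLOv5 architecture rather than from this notebook.

```python
INPUT_SIZE = 640
STRIDES = (8, 16, 32)    # assumed YOLOv5 output strides
ANCHORS_PER_CELL = 3     # assumed number of anchors per grid cell
NUM_CLASSES = 80         # COCO classes

# Grid cells per scale: 80x80, 40x40, and 20x20.
cells = [(INPUT_SIZE // s) ** 2 for s in STRIDES]

rows = ANCHORS_PER_CELL * sum(cells)
cols = 4 + 1 + NUM_CLASSES   # box (4) + confidence (1) + class scores
print(rows, cols)  # 25200 85
```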


The first two entries are the center coordinates of the detected bounding box, followed by its width and height; all four are expressed in the coordinate frame of the 640×640 network input. Index 4 holds the confidence score, the probability that the box contains an object. The remaining 80 entries are the class scores for the 80 classes of the COCO 2017 dataset, on which the model was trained.
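As an illustration of this layout, the following sketch unpacks a single row. The row is synthetic and its values are made up for the example; a real row would come from the network output.

```python
import numpy as np

# Synthetic row: cx, cy, w, h, confidence, then 80 class scores.
row = np.zeros(85, dtype=np.float32)
row[:5] = [320.0, 240.0, 100.0, 50.0, 0.9]
row[5 + 2] = 0.8   # pretend class index 2 has the highest score

cx, cy, w, h = row[:4]      # box center and size
confidence = row[4]         # objectness confidence
class_scores = row[5:]      # one score per COCO class
class_id = int(np.argmax(class_scores))

print(class_scores.shape, class_id)  # (80,) 2
```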

The network generates output coordinates relative to the input size of the blob, i.e. 640. Therefore, the coordinates must be multiplied by the resizing factors (original image size divided by blob size) to map them back onto the original image. The following steps are involved in unwrapping the detections.

Loop through detections.
Filter out good detections.
Get the index of the best class score.
Discard detections with class scores lower than the threshold value.
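The steps above can also be expressed without an explicit Python loop. The sketch below is an alternative NumPy formulation under the same thresholds, not the notebook's own implementation; the helper name filter_detections and the synthetic predictions are hypothetical and exist only for this example.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.45
SCORE_THRESHOLD = 0.5

def filter_detections(preds, x_factor, y_factor):
    """preds: (N, 85) array. Returns boxes [left, top, width, height],
    confidences, and class ids of the detections that pass both thresholds."""
    # Keep rows whose confidence passes the threshold.
    preds = preds[preds[:, 4] >= CONFIDENCE_THRESHOLD]
    # Index and value of the best class score in each remaining row.
    class_ids = np.argmax(preds[:, 5:], axis=1)
    best_scores = preds[np.arange(len(preds)), 5 + class_ids]
    keep = best_scores > SCORE_THRESHOLD
    preds, class_ids = preds[keep], class_ids[keep]
    # Convert center/size to left/top/width/height in original image pixels.
    cx, cy, w, h = preds[:, 0], preds[:, 1], preds[:, 2], preds[:, 3]
    boxes = np.stack([(cx - w / 2) * x_factor,
                      (cy - h / 2) * y_factor,
                      w * x_factor,
                      h * y_factor], axis=1).astype(int)
    return boxes, preds[:, 4], class_ids

# Three synthetic detections for a 1280x960 image (factors 2.0 and 1.5).
preds = np.zeros((3, 85), dtype=np.float32)
preds[0, :5] = [320, 240, 100, 50, 0.9]
preds[0, 5 + 2] = 0.8   # passes both thresholds
preds[1, :5] = [100, 100, 20, 20, 0.2]
preds[1, 5 + 0] = 0.9   # rejected: low confidence
preds[2, :5] = [200, 200, 40, 40, 0.7]
preds[2, 5 + 7] = 0.3   # rejected: low class score

boxes, confidences, class_ids = filter_detections(preds, 1280 / 640, 960 / 640)
print(boxes.tolist(), class_ids.tolist())  # [[540, 322, 200, 75]] [2]
```

With 25200 rows per image, a vectorized filter of this kind usually runs faster than the per-row loop, at the cost of being a little harder to read.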

In [None]:
def post_process(input_image, outputs):
      # Lists to hold respective values while unwrapping.
      class_ids = []
      confidences = []
      boxes = []
      # Rows.
      rows = outputs[0].shape[1]
      image_height, image_width = input_image.shape[:2]
      # Resizing factor.
      x_factor = image_width / INPUT_WIDTH
      y_factor =  image_height / INPUT_HEIGHT
      # Iterate through detections.
      for r in range(rows):
            row = outputs[0][0][r]
            confidence = row[4]
            # Discard bad detections and continue.
            if confidence >= CONFIDENCE_THRESHOLD:
                  classes_scores = row[5:]
                  # Get the index of max class score.
                  class_id = np.argmax(classes_scores)
                  #  Continue if the class score is above threshold.
                  if (classes_scores[class_id] > SCORE_THRESHOLD):
                        confidences.append(confidence)
                        class_ids.append(class_id)
                        cx, cy, w, h = row[0], row[1], row[2], row[3]
                        left = int((cx - w/2) * x_factor)
                        top = int((cy - h/2) * y_factor)
                        width = int(w * x_factor)
                        height = int(h * y_factor)
                        box = np.array([left, top, width, height])
                        boxes.append(box)

Remove Overlapping Boxes
After filtering good detections, we are left with the desired bounding boxes. However, several overlapping boxes may still cover the same object.
This is solved by performing Non-Maximum Suppression. The function NMSBoxes() takes a list of boxes, calculates the IoU (Intersection over Union) between them, and decides which boxes to keep depending on the NMS_THRESHOLD. Curious about how it works? Check out our previous article on NMS to know more.
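For intuition, here is a minimal pure-Python sketch of IoU and a greedy form of non-maximum suppression. It illustrates the general idea only and is not OpenCV's implementation of NMSBoxes(); the function names iou and greedy_nms and the sample boxes are assumptions made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as [left, top, width, height]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, visiting boxes from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Boxes 0 and 1 overlap heavily (IoU of about 0.68); box 2 is far away.
boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40]]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, 0.45))  # [0, 2]
```

The lower-scoring of the two overlapping boxes is suppressed because its IoU with the kept box exceeds the threshold of 0.45.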

In [None]:
      # Continuation of post_process: perform non-maximum suppression to eliminate redundant, overlapping boxes with lower confidences.
      indices = cv2.dnn.NMSBoxes(boxes, confidences, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
      for i in indices:
            box = boxes[i]
            left = box[0]
            top = box[1]
            width = box[2]
            height = box[3]             
            # Draw bounding box.             
            cv2.rectangle(input_image, (left, top), (left + width, top + height), BLUE, 3*THICKNESS)
            # Class label.                      
            label = "{}:{:.2f}".format(classes[class_ids[i]], confidences[i])             
            # Draw label.             
            draw_label(input_image, label, left, top)
      return input_image

Main Function
Finally, we load the class names, the image, and the model, run pre-processing and post-processing, and display the result together with the inference time.

In [None]:
if __name__ == '__main__':
      # Load class names.
      classesFile = "coco.names"
      classes = None
      with open(classesFile, 'rt') as f:
            classes = f.read().rstrip('\n').split('\n')
      # Load image.
      frame = cv2.imread('traffic.jpg')
      # Path to the exported ONNX weights used to load the network.
      modelWeights = "yolov5s.onnx"
      net = cv2.dnn.readNet(modelWeights)
      # Process image.
      detections = pre_process(frame, net)
      img = post_process(frame.copy(), detections)
      # Put efficiency information. The function getPerfProfile returns the
      # overall time for inference (t) and the timings for each of the
      # layers (in layersTimes).
      t, _ = net.getPerfProfile()
      label = 'Inference time: %.2f ms' % (t * 1000.0 /  cv2.getTickFrequency())
      print(label)
      cv2.putText(img, label, (20, 40), FONT_FACE, FONT_SCALE,  (0, 0, 255), THICKNESS, cv2.LINE_AA)
      cv2.imshow('Output', img)
      cv2.waitKey(0)