# Detect and Track Objects with OpenVINO™ for Self-Checkout

Automated self-checkout is a popular application that aims to improve the shopping experience by speeding up checkout. Shoppers can pick up an item, place it in a shopping cart or scan it at a self-checkout kiosk, and purchase it with minimal contact, which also increases operational efficiency.



In this article, you’ll learn how to use OpenVINO™ with Ultralytics’ YOLOv8 and Roboflow’s supervision libraries to create the fundamentals of an Automated Self-Checkout system. This application offers a short and easily modifiable implementation that detects and tracks objects in a zone and provides real-time analytics on whether an object was added to or removed from the zone, along with the IDs of the people who interact with the objects.

Because the zone definition is flexible, retailers can define custom zones depending on the type of self-checkout they would like to perform, such as:

-	Self-checkout counters with designated areas for placing, removing, and bagging items.
-	Zones on shelves to identify how many objects are removed, for theft detection.
-	Zones within shopping carts to identify how many items users add or remove.

This application reuses the YOLOv8 model and optimization process, and a similar approach to zone definition, described in the **Intelligent Queue Management Edge AI Reference Kit**.

> NOTE: This notebook performs object detection and tracking on a video clip so that the polygon zone can be defined accurately.

# Imports and Dependencies Installation

In [162]:
!pip install -r requirements.txt

from IPython import display
display.clear_output()

In [2]:
import supervision as sv
from ultralytics import YOLO
import cv2
import numpy as np
from pathlib import Path
from collections import Counter
import logging as log
import json

log.basicConfig(level=log.INFO)

# Loading our OpenVINO™ YOLO model

YOLOv8 provides an API for exporting models to different formats, including OpenVINO IR. `model.export` handles the conversion. We need to specify the format, and we could additionally preserve dynamic shapes in the model. However, that would limit us to CPU-only inference, so we don't. We also request half precision (FP16) for better performance.

Let's load our OpenVINO YOLOv8 FP16 model via the Ultralytics API for faster inference with a smaller model footprint.

In [27]:
#Specify our models path
models_dir = Path("./model")
models_dir.mkdir(exist_ok=True)

DET_MODEL_NAME = "yolov8m"
det_model = YOLO(models_dir / f'{DET_MODEL_NAME}.pt')
label_map = det_model.model.names

# Load our Yolov8 object detection model
ov_model_path = Path(f"model/{DET_MODEL_NAME}_openvino_model/{DET_MODEL_NAME}.xml")
if not ov_model_path.exists():
    # export model to OpenVINO format
    out_dir = det_model.export(format="openvino", dynamic=False, half=True)

model = YOLO("model/yolov8m_openvino_model/")

Ultralytics YOLOv8.0.117  Python-3.10.6 torch-1.13.1+cpu CPU
YOLOv8m summary (fused): 218 layers, 25886080 parameters, 0 gradients, 78.9 GFLOPs

PyTorch: starting from model\yolov8m.pt with input shape (1, 3, 640, 640) BCHW and output shape(s) (1, 84, 8400) (49.7 MB)

ONNX: starting export with onnx 1.14.0 opset 16...
ONNX: export success  3.2s, saved as model\yolov8m.onnx (99.0 MB)

OpenVINO: starting export with openvino 2023.0.0-10926-b4452d56304-releases/2023/0...
OpenVINO: export success  3.7s, saved as model\yolov8m_openvino_model\ (49.8 MB)

Export complete (9.1s)
Results saved to C:\Users\rcheruvu\Desktop\openvino_notebooks\recipes\automated_detection_tracking\model
Predict:         yolo predict task=detect model=model\yolov8m_openvino_model imgsz=640 
Validate:        yolo val task=detect model=model\yolov8m_openvino_model imgsz=640 data=coco.yaml 
Visualize:       https://netron.app


# Define and Load a Zone

In order to accurately define a zone for our input video clip, we can extract a single video frame from our clip using the Supervision library, and drag and drop it into [Roboflow's open-source Polygon Zone tool](https://roboflow.github.io/polygonzone/) to define the coordinates of our zone.

Let's start with loading in a sample video and extracting a single frame.

In [28]:
#Load in our sample video
VID_PATH = "data/example.mp4"
#Show the dimensions and additional information from the video
video_info = sv.VideoInfo.from_video_path(VID_PATH)
video_info

VideoInfo(width=3840, height=2160, fps=29, total_frames=640)

In [29]:
#Extract a single frame from the video
generator = sv.get_video_frames_generator(VID_PATH)
iterator = iter(generator)
frame = next(iterator)
#Save the frame
cv2.imwrite("frame.jpg", frame)

True

Next, we can navigate over to the Polygon Zone tool to extract the coordinates and incorporate them into the zones.json file.

We've already included two example configurations as part of the zones.json file that you can also readily leverage.

![Roboflow Tool snapshot](https://github.com/openvinotoolkit/openvino_notebooks/assets/22090501/51d8ef0f-ff7a-42c8-b755-5aaaae9a3a11)

Next, let's load our zone coordinates. The following function takes a path to a JSON file that defines zones and their boundaries.

In [30]:
def load_zones(json_path, zone_str):
    """
        Load a zone specified in an external json file
        Parameters:
            json_path: path to the json file with defined zones
            zone_str:  name of the zone in the json file
        Returns:
           zone: an int32 array with the zone's polygon points
    """
    # load json file
    with open(json_path) as f:
        zones_dict = json.load(f)
    # return the points that define the requested zone
    return np.array(zones_dict[zone_str]["points"], np.int32)

In [31]:
polygon = load_zones("config/zones.json", "test-example-1")
polygon

array([[ 776,  321],
       [3092,  305],
       [3112, 1965],
       [ 596, 2005],
       [ 768,  321]])
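
For reference, `load_zones()` expects each zone name to map to an object with a `points` list of `[x, y]` pairs. The sketch below writes a small file in that shape and reads it back with plain `json`. The file name `zones_demo.json` and the zone name `demo-zone` are made up for illustration, and the real `config/zones.json` may carry additional keys.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical zone definition: a zone name mapped to a dict with a "points" list.
demo_zones = {
    "demo-zone": {
        "points": [[776, 321], [3092, 305], [3112, 1965], [596, 2005]]
    }
}

# Write the definition to a temporary file (the file name is illustrative only).
demo_path = Path(tempfile.mkdtemp()) / "zones_demo.json"
demo_path.write_text(json.dumps(demo_zones, indent=2))

# Read it back the same way load_zones() does, minus the NumPy conversion.
with open(demo_path) as f:
    loaded_points = json.load(f)["demo-zone"]["points"]

print(loaded_points[0])
```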

We can now create PolygonZone, PolygonZoneAnnotator, and BoxAnnotator objects for each zone based on the polygon coordinates we determined.

In [32]:
zone = sv.PolygonZone(polygon=polygon, frame_resolution_wh=video_info.resolution_wh)
box_annotator = sv.BoxAnnotator(thickness=4, text_thickness=4, text_scale=2)
zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.white(), thickness=6, text_thickness=6, text_scale=4)
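
To build intuition for what a zone check does, here is a minimal pure-Python sketch of a ray-casting point-in-polygon test applied to the center of a bounding box. It is not how `sv.PolygonZone` is implemented, and the anchor point that supervision uses for a box may differ. `point_in_polygon` and `box_center` are hypothetical helpers, and the box coordinates are invented.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for every edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def box_center(xyxy):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = xyxy
    return (x1 + x2) / 2, (y1 + y2) / 2


# The four distinct corners of the example zone loaded above.
zone_points = [(776, 321), (3092, 305), (3112, 1965), (596, 2005)]

in_zone = point_in_polygon(*box_center((1700, 900, 1900, 1100)), zone_points)
out_of_zone = point_in_polygon(*box_center((0, 0, 200, 200)), zone_points)
print(in_zone, out_of_zone)
```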

# Define Helper Functions

In this section, we'll define a few helper functions that help us with the flow of our self-checkout pipeline.

The `draw_text()` function calculates the size of the text and, from it, the size of the filled rectangle that is drawn behind the text. It uses the `cv2.rectangle()` function to draw the rectangle and the `cv2.putText()` function to draw the text. We'll need this function to overlay text on our video stream.

In [33]:
def draw_text(image, text, point, color=(255, 255, 255)) -> None:
    """
    Draws text on a filled white rectangle

    Parameters:
        image: image to draw on
        text: text to draw
        point: (x, y) top-left corner of the background rectangle
        color: text color as a BGR tuple
    """
    
    text_size, _ = cv2.getTextSize(text, fontFace=cv2.FONT_HERSHEY_SIMPLEX, fontScale=2, thickness=2)

    rect_width = text_size[0] + 20
    rect_height = text_size[1] + 20
    rect_x, rect_y = point

    cv2.rectangle(image, pt1=(rect_x, rect_y), pt2=(rect_x + rect_width, rect_y + rect_height), color=(255, 255, 255), thickness=cv2.FILLED)

    text_x = (rect_x + (rect_width - text_size[0]) // 2) - 10
    text_y = (rect_y + (rect_height + text_size[1]) // 2) - 10
    
    cv2.putText(image, text=text, org=(text_x, text_y), fontFace=cv2.FONT_HERSHEY_SIMPLEX, fontScale=2, color=color, thickness=2, lineType=cv2.LINE_AA)
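
The layout arithmetic in `draw_text()` can be checked without OpenCV. The sketch below repeats the same integer math for an assumed text size of 300×40 pixels (in the notebook, `cv2.getTextSize` supplies this value). `text_layout` is a hypothetical helper written only for this illustration.

```python
def text_layout(text_size, point, padding=20):
    """Return the background rectangle (x1, y1, x2, y2) and the text origin."""
    text_w, text_h = text_size
    rect_x, rect_y = point
    rect_w = text_w + padding
    rect_h = text_h + padding
    # Same expressions as in draw_text()
    text_x = (rect_x + (rect_w - text_w) // 2) - 10
    text_y = (rect_y + (rect_h + text_h) // 2) - 10
    return (rect_x, rect_y, rect_x + rect_w, rect_y + rect_h), (text_x, text_y)


# Assumed text size; the point matches the first overlay position used later.
rect, origin = text_layout(text_size=(300, 40), point=(50, 50))
print(rect, origin)
```

With these numbers the text origin lands on the rectangle's left edge (x = 50) and 20 pixels above its bottom edge, since `cv2.putText` interprets `org` as the bottom-left corner of the text.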

The `get_iou()` function calculates the Intersection over Union (IoU) score from the `xyxy` coordinates of two bounding boxes that correspond to two detected objects. Here, we use `get_iou()` to identify whether the bounding box detected for a person intersects with that of a detected object.

In [48]:
def get_iou(person_det, object_det):
    #Obtain the Intersection 
    x_left = max(person_det[0], object_det[0])
    y_top = max(person_det[1], object_det[1])
    x_right = min(person_det[2], object_det[2])
    y_bottom = min(person_det[3], object_det[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0
    intersection_area = (x_right - x_left) * (y_bottom - y_top)

    person_area = (person_det[2] - person_det[0]) * (person_det[3] - person_det[1])
    obj_area = (object_det[2] - object_det[0]) * (object_det[3] - object_det[1])
    
    return intersection_area / float(person_area + obj_area - intersection_area)
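
As a quick sanity check of the formula, consider two 10×10 boxes offset by 5 pixels in each direction. The sketch below restates the same computation in a standalone function (named `iou_xyxy` here to avoid clashing with the notebook's `get_iou()`). The box coordinates are invented for illustration.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[2], box_b[2])
    y_bottom = min(box_a[3], box_b[3])
    # No overlap at all
    if x_right < x_left or y_bottom < y_top:
        return 0.0
    intersection = (x_right - x_left) * (y_bottom - y_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / float(area_a + area_b - intersection)


box_a = (0, 0, 10, 10)
box_b = (5, 5, 15, 15)

# Intersection is 5 * 5 = 25, union is 100 + 100 - 25 = 175
overlap = iou_xyxy(box_a, box_b)
# Boxes that do not touch give an IoU of zero
disjoint = iou_xyxy(box_a, (20, 20, 30, 30))
print(round(overlap, 4), disjoint)
```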

The `intersecting_bboxes()` function identifies if the bounding boxes for people and objects are intersecting leveraging the above function, and logs the appropriate interaction accordingly.

In [49]:
def intersecting_bboxes(bboxes, person_bbox, action_str):
    """
    Identify if person and object bounding boxes are intersecting using IOU.

    Returns action_str with the person's ID appended for the first
    intersection found, or None if no intersection is found.
    """
    for box in bboxes:
        if box.cls == 0:
            #If it is a person, remember the box and its tracker ID
            try:
                person_bbox.append([box.xyxy[0], box.id.numpy().astype(int)])
            except Exception:
                #Skip this box, e.g. when no tracker ID is available
                pass
        elif len(person_bbox) >= 1:
            #If it is not a person, check it against every recorded person
            for p_bbox in person_bbox:
                result_iou = get_iou(p_bbox[0], box.xyxy[0])
                if result_iou > 0:
                    try:
                        person_intersection_str = f"Person #{p_bbox[1][0]} interacted with object #{int(box.id[0])} {label_map[int(box.cls[0])]}"
                    except Exception:
                        person_intersection_str = f"Person {p_bbox[1][0]} interacted with object (ID unable to be assigned) {label_map[int(box.cls[0])]}"
                    #log.info(person_intersection_str)
                    person_action_str = action_str + f" by person {p_bbox[1][0]}"
                    return person_action_str
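
The pairing logic can also be illustrated without the Ultralytics `Boxes` objects. The sketch below uses plain tuples of the form `(tracker_id, class_id, xyxy)`, a layout chosen only for this example, and reports every person/object pair whose boxes overlap. It treats class `0` as a person, as the function above does. `overlaps` and `find_interactions` are hypothetical helpers, and the IDs and coordinates are invented.

```python
PERSON_CLASS_ID = 0  # class 0 is treated as "person", as in the function above


def overlaps(a, b):
    """True if two xyxy boxes share a non-empty area (equivalent to IoU > 0)."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def find_interactions(tracked):
    """Return (person_id, object_id) for every overlapping person/object pair."""
    people = [t for t in tracked if t[1] == PERSON_CLASS_ID]
    objects = [t for t in tracked if t[1] != PERSON_CLASS_ID]
    return [(p[0], o[0]) for p in people for o in objects if overlaps(p[2], o[2])]


tracked = [
    (1, 0, (100, 100, 400, 900)),     # person #1
    (7, 47, (350, 500, 450, 600)),    # object #7, overlapping person #1
    (9, 39, (1000, 100, 1100, 300)),  # object #9, far from everyone
]
pairs = find_interactions(tracked)
print(pairs)
```

Unlike this sketch, the notebook function returns as soon as it finds the first overlap, so it reports at most one person per call.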

# Run the Main Processing Loop

Run object detection and tracking on the specified video clip.

To customize the tracking algorithm, visit [https://docs.ultralytics.com/modes/track/#tracker-selection](https://docs.ultralytics.com/modes/track/#tracker-selection) to learn more about the default algorithm and additional options.
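
As a rough sketch, the tracker is chosen through the `tracker` argument of `model.track()`. Ultralytics ships `botsort.yaml` (the default) and `bytetrack.yaml`. The helper `tracking_kwargs` below is hypothetical and only assembles the arguments, so the call itself is left as a comment.

```python
def tracking_kwargs(source, tracker="bytetrack.yaml"):
    """Collect the keyword arguments for model.track() in one place."""
    return {
        "source": source,
        # "botsort.yaml", "bytetrack.yaml", or a path to a custom tracker YAML
        "tracker": tracker,
        "stream": True,
        "persist": True,
        "show": False,
    }


kwargs = tracking_kwargs("data/example.mp4")
print(kwargs["tracker"])

# With the model loaded earlier in the notebook, the loop would start as:
# for result in model.track(**kwargs):
#     ...
```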

Note that off-the-shelf object detection and tracking algorithms can make a few kinds of mistakes in this use case:

- The object detection algorithm sometimes does not detect an object immediately and can take a few frames to do so.
- The tracking algorithm can sometimes assign multiple IDs to the same object (and, in some cases, the same ID to multiple objects). Keep these potential mistakes in mind and, if they affect your use case, consider customizing the tracking algorithm using the details in the link above.

In [None]:
#Define empty lists to keep track of labels
original_labels = []
final_labels = []
person_bbox = []
p_items = []
purchased_items = set(p_items)
a_items = []
added_items = set(a_items)

#Save the result as new_det_tracking_result.mp4
with sv.VideoSink("new_det_tracking_result.mp4", video_info) as sink:
    #Iterate through model predictions and tracking results
    for index, result in enumerate(model.track(source=VID_PATH, show=False, stream=True, verbose=True, persist=True)):
      #Define variables to store interactions that are refreshed per frame
      interactions = []
      person_intersection_str = ""

      #Obtain predictions from yolov8 model
      frame = result.orig_img
      detections = sv.Detections.from_ultralytics(result)
      detections = detections[detections.class_id < 55]
      mask = zone.trigger(detections=detections)
      detections_filtered = detections[mask]
      bboxes = result.boxes
      if bboxes.id is not None:
          detections.tracker_id = bboxes.id.cpu().numpy().astype(int)
        
      labels = [
          f'#{tracker_id} {label_map[class_id]} {confidence:0.2f}'
          for _, _, confidence, class_id, tracker_id
          in detections
      ]

      #Annotate the frame with the zone and bounding boxes.
      frame = box_annotator.annotate(scene=frame, detections=detections_filtered, labels=labels)
      frame = zone_annotator.annotate(scene=frame)

      objects = [f'#{tracker_id} {label_map[class_id]}' for _, _, confidence, class_id, tracker_id in detections]

      #If this is the first time we run the application,
      #store the objects' labels as they are at the beginning
      if index == 0:
          original_labels = objects
          original_dets = len(detections_filtered)
      else:
          #To identify if an object has been added or removed
          #we'll use the original labels and identify any changes
          final_labels = objects
          new_dets = len(detections_filtered)
          #Identify if an object has been added or removed using Counters
          removed_objects = Counter(original_labels) - Counter(final_labels)
          added_objects = Counter(final_labels) - Counter(original_labels)

          #Create two variables we can increment for drawing text
          draw_txt_ir = 1
          draw_txt_ia = 1
          #Check for objects being added or removed
          if new_dets - original_dets != 0 and len(removed_objects) >= 1:
             #An object has been removed
              for k,v in removed_objects.items():
                 #For each of the objects, check the IOU between a designated object
                 #and a person.
                 if 'person' not in k:
                     removed_object_str = f"{v} {k} removed from zone"
                     removed_action_str = intersecting_bboxes(bboxes, person_bbox, removed_object_str)
                     if removed_action_str is not None:
                         #If we have determined an interaction with a person,
                         #log the interaction.
                         log.info(removed_action_str)
                         #Add the purchased items to a "receipt" of sorts
                         if f"{v} {k}" not in purchased_items:
                             purchased_items.add(f"{v} {k}")
                             p_items.append(f" - {v} {k}")
                     #Draw the result on the screen; fall back to the plain
                     #message when no interacting person was found
                     draw_text(frame, text=removed_action_str or removed_object_str, point=(50, 50 + draw_txt_ir), color=(0, 0, 255))
                     draw_txt_ir += 80
          
          if len(added_objects) >= 1:
              #An object has been added
              for k,v in added_objects.items():
                  #For each of the objects, check the IOU between a designated object
                  #and a person.
                  if 'person' not in k:
                      added_object_str = f"{v} {k} added to zone"
                      added_action_str = intersecting_bboxes(bboxes, person_bbox, added_object_str)
                      if added_action_str is not None:
                          #If we have determined an interaction with a person,
                          #log the interaction.
                          log.info(added_action_str)
                          if added_object_str not in added_items:
                            added_items.add(added_object_str)
                            a_items.append(added_object_str)
                      #Draw the result on the screen; fall back to the plain
                      #message when no interacting person was found
                      draw_text(frame, text=added_action_str or added_object_str, point=(50, 300 + draw_txt_ia), color=(0, 128, 0))
                      draw_txt_ia += 80
      
      draw_text(frame, "Receipt: " + str(purchased_items), point=(50, 800), color=(30, 144, 255))
      sink.write_frame(frame)

Sample output:

> video 1/1 (1/640) automated_detection_tracking\data\asset.mp4: 640x640 1 bottle, 1 banana, 1 apple, 949.0ms
> 
> video 1/1 (2/640) automated_detection_tracking\data\asset.mp4: 640x640 1 bottle, 1 banana, 1 apple, 930.7ms
> 
> video 1/1 (3/640) automated_detection_tracking\data\asset.mp4: 640x640 1 bottle, 1 banana, 1 apple, 1100.5ms

In [None]:
"Receipt: " + str(purchased_items)

'Receipt: set()'

In [68]:
added_objects

Counter({'#29 apple': 1, '#27 bottle': 1, '#25 banana': 1})
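
The `Counter` subtraction that drives the added/removed logic keeps only positive counts, which is why one pair of subtractions yields both results. A minimal sketch with made-up labels:

```python
from collections import Counter

# Labels in the first frame and in a later frame (invented for illustration)
first_frame = ["#1 bottle", "#2 banana", "#3 apple"]
current_frame = ["#1 bottle", "#3 apple", "#4 orange"]

# Present at the start but missing now
removed = Counter(first_frame) - Counter(current_frame)
# Present now but not at the start
added = Counter(current_frame) - Counter(first_frame)
print(removed, added)
```

Because each label includes the tracker ID, an object whose ID changes mid-video would show up as one removal plus one addition, which is one way the tracker ID switches mentioned earlier can surface in the results.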

## Benchmarking

For more information on how to benchmark the performance of OpenVINO YOLOv8 models, visit this [notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/recipes/recipes/intelligent_queue_management/docs/convert-and-optimize-the-model.ipynb).