<a href="https://colab.research.google.com/github/itberrios/CV_tracking/blob/main/setup_tutorials/tutorial_yolo_nas_and_ocsort.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **YOLO-NAS + OC-SORT**

This notebook contains a tutorial that shows how to incorporate YOLO-NAS with OC-SORT to perform real time visual tracking on a data stream. In this case, we will use a YouTube video to simulate a data stream of image frames.

## **YOLO-NAS**
[YOLO-NAS](https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md) is a powerful object detector with an optimal neural network architecture that has been selected using Neural Architecture Search (NAS), hence the name NAS. 

At the time of release, it outperforms all of the other single shot object detectos in terms of speed and accuracy. It also excels in an area that most single shot detector struggle with, small objects. Two-stage detectors typically perform better than single stage detectors on small objects, at the cost of increased detection time [source](https://arxiv.org/pdf/1907.09408.pdf). YOLO-NAS, howver seems to provide a good tradeoff between detection speed and accuracy on small objects that previous versions of YOLO have not been able to deliver.

## **OC-SORT**
[OC-SORT](https://arxiv.org/abs/2203.14360) is a robust visual obejct tracking algorithm that improves upon the already popular [SORT](https://arxiv.org/abs/1602.00763) algorithm.

SORT tends to loose track on obejcts when they are lost for extended periods of time or when non-linear motion occurs. Algorithms such as Deep SORT have effectoively improved SORT in these scenarios with a Deep Association metric that is computed with a [Siamese Neural Network](https://arxiv.org/pdf/1707.02131.pdf) over the image patches. Eventhough this is effective it comes with the cost of increased detection time due to the deep association and the Siamese network needs to be trained on in-domain data for this approach to be effective. OC-SORT on the otherhand is able to effectively increase tracking performance in a model free fashion with minimal impact to inference speed.

OC-SORT introduces
- Observation Centric Re-Update (ORU)
    - Reduces accumulated Kalman Filter error/uncertainty when a lost track is re-associated
- Observation Centric Momentum (OCM)
    - Uses previous observations to compute a low noise expected motion direction and incorporates it into the track association cost
- Observation Centric Recovery (OCR)
    - Uses the last known observation as a secondary association to help prevent lost tracks

For more details and a break down of each technique that OC-SORT introduces, please see this [article](https://medium.com/@itberrios6/introduction-to-ocsort-c1ea1c6adfa2).

#### Install Libraries

In [None]:
# sketchy fix for "https://stackoverflow.com/questions/73711994/importerror-cannot-import-name-is-directory-from-pil-util-usr-local-lib"
!pip install fastcore -U

In [None]:
# supergradients installs
! pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 &> /dev/null
! pip install super-gradients
! pip install pytorch-quantization==2.1.2 --extra-index-url https://pypi.ngc.nvidia.com &> /dev/null
! pip install matplotlib==3.1.3 &> /dev/null
! pip install --upgrade psutil==5.9.2 &> /dev/null
! pip install --upgrade pillow==7.1.2 &> /dev/null

In [None]:
!pip install filterpy
!pip install pytube
!pip install moviepy
!pip install ffmpeg

# bug fix for imageio-ffmpeg
!pip install imageio==2.4.1

### Get OCSORT code

In [None]:
!mkdir ocsort
%cd ocsort
!wget https://raw.githubusercontent.com/noahcao/OC_SORT/master/trackers/ocsort_tracker/ocsort.py 
!wget https://raw.githubusercontent.com/noahcao/OC_SORT/master/trackers/ocsort_tracker/kalmanfilter.py 
!wget https://raw.githubusercontent.com/noahcao/OC_SORT/master/trackers/ocsort_tracker/association.py 

%cd ..

### Import Libraries

In [1]:
import os
import time
from tqdm import tqdm

import numpy as np
import cv2
import filterpy

import torch
import super_gradients as sg
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams["figure.figsize"] = (20, 10)

The console stream is logged into /root/sg_logs/console.log


[2023-05-11 00:10:10] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it


## Download video from YouTube

In [2]:
from pytube import YouTube

url =r"https://www.youtube.com/watch?v=wIYD42DV3Ro" # horse racing
# url = r"https://www.youtube.com/watch?v=JteKbauGolo" # nascar
yt = YouTube(url)
print("Video Title: ", yt.title)

# download video
video_path = yt.streams \
  .filter(progressive=True, file_extension='mp4') \
  .order_by('resolution') \
  .desc() \
  .first() \
  .download()

Video Title:  Kentucky Derby 2022 (FULL RACE) | NBC Sports


In [4]:
# horce racing
video_savepath = 'derby.mp4'
video_waudio_savepath = 'derby_with_audio.mp4'

# nascar
# video_savepath = 'chastain_wall_ride.mp4'
# video_waudio_savepath = 'chastain_wall_ride_with_audio.mp4'

## Instantiate YOLO-NAS model

In [80]:
from super_gradients.training import models

model = models.get("yolo_nas_s", pretrained_weights="coco").cuda()
model.eval();

## Quantize model to increase inference speed
THIS SEEMS to degrade speed from 15 FPS to 4 FPS)

In [77]:
# from super_gradients.training.utils.quantization.selective_quantization_utils import SelectiveQuantizer

# q_util = SelectiveQuantizer(
#     default_quant_modules_calibrator_weights="max",
#     default_quant_modules_calibrator_inputs="histogram",
#     default_per_channel_quant_weights=True,
#     default_learn_amax=False,
#     verbose=True,
# )
# q_util.quantize_module(model)
# model.to('cuda');
# model.eval();

## Check Model Inference Speed

#### First get a test frame

In [29]:
cap = cv2.VideoCapture(video_path)
if (cap.isOpened() == False):
    print("Error opening video file")

while(cap.isOpened()):

  # read each video frame
  ret, frame = cap.read()
  frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

  break

cap.release()
del cap

## Test Pipeline speed
We will need to check the speed at which we can get the required inputs for the tracker.

In [81]:
# baseline speed test to get detections for tracker
image_in = frame.copy()
base_test_times = []

for i in range(100):
  # start clock and rpedict
  tic = time.perf_counter()
  preds_0 = model.predict(image_in, iou=0.25, conf=0.30)
  img_preds = list(preds_0._images_prediction_lst)[0]
  dets = np.hstack((img_preds.prediction.bboxes_xyxy, 
                    np.c_[img_preds.prediction.confidence]))
  base_test_times.append(time.perf_counter() - tic)

print(f"Average Prediction time: {np.mean(base_test_times)} "
      f"Average FPS: {1/np.mean(base_test_times)}")

In [82]:
np.round(np.mean(base_test_times), 4), np.round(1/np.mean(base_test_times), 4)

(0.0605, 16.5362)

Get new method of inference with reduced overhead. We can explore places in the API to leverage by finding where the predcit function is located. We can locate it with the inspect library.





In [83]:
import inspect

os.path.abspath(inspect.getfile(model.predict))

'/usr/local/lib/python3.10/dist-packages/super_gradients/training/models/detection_models/customizable_detector.py'

In [84]:
from super_gradients.training.models.detection_models.customizable_detector import CustomizableDetector
from super_gradients.training.pipelines.pipelines import DetectionPipeline

# make sure to set IOU and confidence in the pipeline constructor
pipeline = DetectionPipeline(
            model=model,
            image_processor=model._image_processor,
            post_prediction_callback=model.get_post_prediction_callback(iou=0.25, conf=0.30),
            class_names=model._class_names,
        )


def get_prediction(image_in, pipeline):
  ''' Obtains DetectionPrediction object from a single input RGB image
  '''
  # Preprocess
  preprocessed_image, processing_metadata = pipeline.image_processor.preprocess_image(image=image_in.copy())

  # Predict
  with torch.no_grad():
      torch_input = torch.Tensor(preprocessed_image).unsqueeze(0).to('cuda')
      model_output = model(torch_input)
      prediction = pipeline._decode_model_output(model_output, model_input=torch_input)

  # Postprocess
  return pipeline.image_processor.postprocess_predictions(predictions=prediction[0], metadata=processing_metadata)

In [85]:
image_in = frame.copy()
test_times = []

for i in range(100):
  # start clock and rpedict
  tic = time.perf_counter()
  pred = get_prediction(image_in, pipeline)
  xyxyc = np.hstack((pred.bboxes_xyxy, 
                     np.c_[pred.confidence]))
  test_times.append(time.perf_counter() - tic)

print(f"Average Prediction time: {np.mean(test_times)} "
      f"Average FPS: {1/np.mean(test_times)}")

Average Prediction time: 0.06047330899001281 Average FPS: 16.536220965933094


In [86]:
np.round(np.mean(test_times), 4), np.round(1/np.mean(test_times), 4)

(0.0659, 15.1718)

In [89]:
pred

DetectionPrediction(bboxes_xyxy=array([[1178.3569 ,  275.83493, 1218.6821 ,  383.23434],
       [1232.7158 ,  262.42215, 1264.4724 ,  370.75522],
       [1217.1113 ,  198.12184, 1253.5657 ,  253.76126],
       [ 982.8118 ,  220.69008, 1010.603  ,  248.22868]], dtype=float32), confidence=array([0.6609424 , 0.555541  , 0.40006214, 0.3380518 ], dtype=float32), labels=array([ 0.,  0., 74.,  0.], dtype=float32))

In [88]:
img_preds.prediction

DetectionPrediction(bboxes_xyxy=array([[1178.3569 ,  275.83493, 1218.6821 ,  383.23434],
       [1232.7158 ,  262.42215, 1264.4724 ,  370.75522],
       [1217.1113 ,  198.12184, 1253.5657 ,  253.76126],
       [ 982.8118 ,  220.69008, 1010.603  ,  248.22868]], dtype=float32), confidence=array([0.6609424 , 0.555541  , 0.40006214, 0.3380518 ], dtype=float32), labels=array([ 0.,  0., 74.,  0.], dtype=float32))

## Perform inference on Simulated Video Stream

#### First we will use MoviePy to get the frame rate and save the audio for later

In [90]:
from moviepy.editor import VideoFileClip

videoclip = VideoFileClip(video_path)
audioclip = videoclip.audio

video_fps = videoclip.fps
video_fps

29.97002997002997

## Instantiate tracker object

In [91]:
from ocsort import ocsort

tracker = ocsort.OCSort(det_thresh=0.25)

Helper function for bounding box colors

In [92]:
import colorsys    

def get_color(number):
    """ Converts an integer number to a color """
    # change these however you want to
    hue = number*30 % 180
    saturation = number*103 % 256
    value = number*50 % 256

    # expects normalized values
    color = colorsys.hsv_to_rgb(hue/179, saturation/255, value/255)

    return [int(c*255) for c in color]

## Now we can simualate the data stream using opencv

Make sure to reset the tracker each time you run the inference

In [97]:
# get frame info for tracker and video saving 
h, w = (720, 1280)
h2, w2 = h//2, w//2
# h2, w2 = 640, 640 # this degrades performance

# OCSORT automatically rescales bboxes if we inference with a diff img size
img_info = (h, w)
img_size = (h2, w2) 

In [None]:
# ensure tracker is reset
tracker = ocsort.OCSort(det_thresh=0.30, max_age=10, min_hits=2)

cap = cv2.VideoCapture(video_path)

if (cap.isOpened() == False):
    print("Error opening video file")

frames = []
i = 0
counter, fps, elapsed = 0, 0, 0
start_time = time.perf_counter()

while(cap.isOpened()):

  # read each video frame (read time is about 0.006 sec)
  ret, frame = cap.read()

  if ret == True:

    # read image and resize by half for inference
    og_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame = cv2.resize(og_frame, 
                       (w2, h2), interpolation=cv2.INTER_LINEAR)

    # perform inference on small frame and get (x1, y1, x2, y2, confidence)
    pred = get_prediction(frame, pipeline)
    xyxyc = np.hstack((pred.bboxes_xyxy, 
                      np.c_[pred.confidence]))

    # update tracker
    tracks = tracker.update(xyxyc, img_info, img_size)

    # draw tracks on frame
    for track in tracker.trackers:
      
      track_id = track.id
      hits = track.hits
      color = get_color(track_id*15)
      x1,y1,x2,y2 = np.round(track.get_state()).astype(int).squeeze()

      cv2.rectangle(og_frame, (x1,y1),(x2,y2), color, 2)
      cv2.putText(og_frame, 
                  f"{track_id}-{hits}", 
                  (x1+10,y1-5), 
                  cv2.FONT_HERSHEY_SIMPLEX, 
                  0.5,
                  color, 
                  1,
                  cv2.LINE_AA)
      
    # update FPS and place on frame
    current_time = time.perf_counter()
    elapsed = (current_time - start_time)
    counter += 1
    if elapsed > 1:
      fps = counter / elapsed;
      counter = 0;
      start_time = current_time;

    cv2.putText(og_frame, 
                f"FPS: {np.round(fps, 2)}", 
                (10,h - 10), 
                cv2.FONT_HERSHEY_SIMPLEX, 
                1,
                (255,255,255), 
                2,
                cv2.LINE_AA)

    # append to list
    frames.append(og_frame)

    # # TEMP for debug
    # if i == 10:
    #   break
    # else:
    #   i += 1

  # Break the loop
  else:
    break

# release video capture object
cap.release()
del cap

In [None]:
plt.imshow(frames[450])

## Now let's put this in a video

In [None]:
# save to mp4
out = cv2.VideoWriter(video_savepath,
                      cv2.VideoWriter_fourcc(*'MP4V'), 
                      video_fps,
                      (w, h))
 
for frame in frames:
    out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

out.release()
del out

### Now add back the audio

In [None]:
from moviepy.editor import CompositeAudioClip
detection_video = VideoFileClip(video_savepath)

# add sound and save
detection_video.audio = CompositeAudioClip([audioclip])
detection_video.write_videofile(video_waudio_savepath)