# SJSU MSDS 255 DL, Spring 2024
Homework 04: Transfer Learning, Bounding Boxes, and YOLOv8

Git: https://github.com/jrgosalvez/data255_DL

## Part 1

### Object Detection Inference from Video with YOLOv8 Transfer Learning

Sources:
- https://www.youtube.com/watch?app=desktop&v=o4Zd-IeMlSY
- https://pysource.com/2023/03/28/object-detection-with-yolo-v8-on-mac-m1/
- https://docs.opencv.org/4.x/d6/d6e/group__imgproc__draw.html#ga5126f47f883d730f633d74f07456c576

<b>NOTE:</b> YOLOv8 relies on PyTorch as its deep learning framework. NVIDIA NGC Catalog offers additional GPU optimized frameworks for alternative transfer learning models, frameworks, and containers (e.g. NeMO).

#### Step 1:
- Collect a source video. 
- YOLOv8 accepts the following video formats: .asf, .avi, .gif, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .ts, .wmv, .webm.
- The iPhone camera records .mov files; I recorded 5-10 sec street videos and saved them locally.
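
Before running inference, it can help to confirm that a clip's extension is on the list above. The following is a minimal sketch; the helper name `is_supported_video` and the constant `SUPPORTED_VIDEO_EXTS` are hypothetical and not part of ultralytics. It only compares file extensions and does not open the file.

```python
from pathlib import Path

# Video extensions listed in Step 1 above.
SUPPORTED_VIDEO_EXTS = {
    ".asf", ".avi", ".gif", ".m4v", ".mkv", ".mov",
    ".mp4", ".mpeg", ".mpg", ".ts", ".wmv", ".webm",
}

def is_supported_video(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the list above."""
    return Path(path).suffix.lower() in SUPPORTED_VIDEO_EXTS

print(is_supported_video("./video_in/rick_boston_01.MOV"))  # True
print(is_supported_video("notes.txt"))                      # False
```

The comparison is case-insensitive because phone cameras often write upper-case extensions such as `.MOV`.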

#### Steps 2 & 3: 
Conduct inference on video, frame by frame, drawing bounding boxes around detected objects (specifically vehicles) and output a video of the object detection results.
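
Since the goal is vehicles specifically, one way to narrow the detections is to keep only the class IDs for car, motorcycle, bus, and truck before drawing. The sketch below uses plain Python lists and made-up boxes for illustration; `keep_vehicles` and `VEHICLE_CLASS_IDS` are hypothetical names, and the same idea should apply to the `classes` and `bboxes` arrays built in the detection loop.

```python
# COCO class IDs for the vehicle labels used in this notebook.
VEHICLE_CLASS_IDS = {2: "car", 3: "motorcycle", 5: "bus", 7: "truck"}

def keep_vehicles(classes, bboxes):
    """Return (label, bbox) pairs for detections whose class ID is a vehicle."""
    return [
        (VEHICLE_CLASS_IDS[int(cls)], tuple(int(v) for v in bbox))
        for cls, bbox in zip(classes, bboxes)
        if int(cls) in VEHICLE_CLASS_IDS
    ]

# Made-up detections for illustration: a person, a car, and a bus.
classes = [0, 2, 5]
bboxes = [(10, 20, 50, 120), (100, 80, 300, 200), (320, 60, 600, 240)]
for label, box in keep_vehicles(classes, bboxes):
    print(label, box)
```

Filtering after inference keeps the model call unchanged; the person detection is simply dropped before any box is drawn.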

In [1]:
# check mac systems for GPU
import torch
print(torch.backends.mps.is_available())   # check for Mac M1 GPU

False
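
The check above returned `False`, so inference in this notebook runs on the CPU. As a rough sketch of how the device choice could be made automatic, the hypothetical helper `pick_device` below prefers the Apple GPU (`mps`), then CUDA, then the CPU. Falling back to `'cpu'` when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return 'mps', 'cuda', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: fall back to CPU when PyTorch is missing
    # older PyTorch builds may not have the mps backend at all
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
print(device)
```

The returned string could then be passed as the `device` argument in place of the hard-coded `'cpu'` used later in the detection loop.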


In [2]:
# import OpenCV, numpy, and ultralytics YOLOv8

import cv2
from ultralytics import YOLO
import numpy as np
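
The next cell clips the video by computing its duration from the frame count and frame rate. The arithmetic can be sketched on its own as follows; `clip_frame_range` is a hypothetical helper, and it works on frame indices rather than the millisecond timestamps the notebook compares against.

```python
def clip_frame_range(frame_count, fps, start_time=0.0, max_seconds=6.0):
    """Return (first_frame, last_frame_exclusive, duration_seconds) for a clip."""
    if fps <= 0 or frame_count <= 0:
        return 0, 0, 0.0  # metadata missing or unreadable
    duration = frame_count / fps
    end_time = min(max_seconds, duration)  # never run past the end of the video
    first = int(start_time * fps)
    last = min(frame_count, int(round(end_time * fps)))
    return first, max(first, last), duration

print(clip_frame_range(frame_count=300, fps=30))  # (0, 180, 10.0)
print(clip_frame_range(frame_count=90, fps=30))   # (0, 90, 3.0)
```

A 10-second clip at 30 fps is cut to its first 180 frames, while a 3-second clip is kept whole.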

In [3]:
%%time

# load 5-10 sec street intersection video
video = './video_in/rick_boston_01.MOV'

# open the video and read its frame size and frame rate
cap = cv2.VideoCapture(video)
frame_width  = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)   # keep as float; int() would truncate rates such as 29.97

# count frames and compute the duration so the video can be clipped to a time limit
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
duration = frame_count / fps if fps > 0 else 0   # guard against missing FPS metadata

# Set the clip start and end times in seconds (end at 6, 8, or 10; capped at the video duration)
start_time = 0
end_time = min(6, duration)

# Set path, name, and video output writer
output_path  = 'output_rick_boston_01.mp4'
out = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))

# load a pretrained model; YOLOv8 here, but any model that does object detection on an image frame would work
# use the small version ('s'): it is less precise, but faster; the larger variants (m, l, x) are more precise, but slower
# can also use the base model or the segmentation model by adding -seg before .pt
model = YOLO('yolov8s.pt') 

# create labels per the output of printing 'results' (see below, commented out)
names = {0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}

# object detection frame loop
while True:
    ret, frame = cap.read()
           
    if ret:
        # current time in seconds
        current_time = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000  
        if current_time >= start_time:
            print(f'current_time: {current_time:.3f} | start_time {start_time}')
            
            # Clip the video to 6, 8, 10 seconds
            if current_time <= end_time:
                print(f'current_time: {current_time:.3f} | end_time {end_time}')
                
                # Perform object detection on the frame; vehicle labels in names are 2: 'car', 3: 'motorcycle', 5: 'bus', 7: 'truck'
                results = model(frame, device='cpu')
                #print(results)
                
                # cast box coordinates to integers; OpenCV drawing functions expect integer pixel coordinates
                result = results[0]
                bboxes = np.array(result.boxes.xyxy.cpu(), dtype='int')  
                #print(bboxes)
                
                # get class labels
                classes = np.array(result.boxes.cls.cpu(), dtype='int')  

                for cls, bbox in zip(classes, bboxes):
                    (x, y, x2, y2) = bbox
                    #print('x', x, 'y', y)
                    
                    # (image, top left, bottom right, color in BGR, line thickness)
                    cv2.rectangle(frame, (x, y), (x2, y2), (0,0,225), 2)  
                    
                    # put class name inside the box, 5 px in and 20 px down
                    cv2.putText(frame, str(names[cls]), (x+5, y+20), cv2.FONT_HERSHEY_PLAIN, 2, (0,0,225), 3)

                # show and write each annotated frame once, after all boxes are drawn
                # (doing this inside the box loop wrote one copy per detection and skipped frames with none)
                cv2.imshow('YOLO8 Object Detection, SJSU MSDA 255 DL, Spring 2024', frame)
                out.write(frame)
                    
            else:
                break
        
        # Break the loop if 'q' key is pressed. waitKey(0) requires key pressing to advance; 1 auto advances image frames of video
        if cv2.waitKey(1) & 0xFF == ord('q'):     
            print('Break Loop')
            break
            
    else:
        break

print()
print('Out of Loop')
print()
        
# Release the video capture and close all windows. NOTE: works in CLI and .py script, but glitchy in Jupyter, requires manual kernel restart.
cap.release()
out.release()
cv2.destroyAllWindows()

current_time: 0.000 | start_time 0
current_time: 0.000 | end_time 6

0: 384x640 2 persons, 4 cars, 357.9ms
Speed: 20.3ms preprocess, 357.9ms inference, 1.7ms postprocess per image at shape (1, 3, 384, 640)
current_time: 0.033 | start_time 0
current_time: 0.033 | end_time 6

0: 384x640 2 persons, 4 cars, 404.2ms
Speed: 6.7ms preprocess, 404.2ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)
current_time: 0.067 | start_time 0
current_time: 0.067 | end_time 6

0: 384x640 2 persons, 4 cars, 277.6ms
Speed: 4.3ms preprocess, 277.6ms inference, 1.4ms postprocess per image at shape (1, 3, 384, 640)
current_time: 0.100 | start_time 0
current_time: 0.100 | end_time 6

0: 384x640 2 persons, 7 cars, 1 bus, 276.6ms
Speed: 6.5ms preprocess, 276.6ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)
current_time: 0.133 | start_time 0
current_time: 0.133 | end_time 6

0: 384x640 2 persons, 5 cars, 1 bus, 278.6ms
Speed: 5.7ms preprocess, 278.6ms inference, 1.0ms postproc