In [None]:
%pip install ultralytics opencv-python pyyaml

## Vision inferencing with a local model

In the first exercise, we test a basic [computer vision](https://www.microsoft.com/en-us/research/research-area/computer-vision/) inferencing task using a popular AI model called [YOLOv8](https://docs.ultralytics.com/models/yolov8/). YOLO (You Only Look Once) is a real-time object detection system that processes static images. It divides the image into a grid and predicts bounding boxes and probabilities for each grid cell, which lets it detect multiple objects in a single image efficiently.
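
To make the grid idea concrete, here is a minimal sketch that maps a bounding box to the grid cell containing its center. It only illustrates the concept described above and is not how the Ultralytics library is implemented; the function name `grid_cell_for_box` and the grid size of 8 are assumptions chosen for the example.

```python
def grid_cell_for_box(box, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    box        -- (x1, y1, x2, y2) in pixels
    image_size -- (width, height) in pixels
    grid_size  -- number of cells along each side (hypothetical value)
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    # Scale the center into grid coordinates and clamp to the last cell
    col = min(int(center_x / width * grid_size), grid_size - 1)
    row = min(int(center_y / height * grid_size), grid_size - 1)
    return row, col


# A box centered at (200, 300) in a 640x480 image with an 8x8 grid
print(grid_cell_for_box((100, 200, 300, 400), (640, 480), 8))  # (5, 2)
```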

To get started, we initialize the model via the Ultralytics Python library, which downloads the model automatically. Different sizes of the YOLOv8 model can be specified depending on the workload to balance accuracy against speed. Once the model is initialized, we can label the detected objects using [COCO dataset](https://cocodataset.org/#overview) class labels. The class labels can be viewed [here](../artifacts/coco.yaml), where you can see the types of objects that can potentially be identified.
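
As a small sketch of how the model size could be selected, the helper below builds the weights file name from a size suffix. The helper `weights_filename` and the constant `YOLOV8_SIZES` are hypothetical names introduced for this example; the suffixes follow the `yolov8n.pt`, `yolov8s.pt`, `yolov8m.pt` naming used in this notebook.

```python
# Size suffixes for YOLOv8 detection weights, smallest to largest
YOLOV8_SIZES = ("n", "s", "m", "l", "x")


def weights_filename(size="n"):
    """Return the weights file name for a given YOLOv8 size suffix."""
    if size not in YOLOV8_SIZES:
        raise ValueError(f"Unknown size {size!r}; expected one of {YOLOV8_SIZES}")
    return f"yolov8{size}.pt"


print(weights_filename("s"))  # yolov8s.pt
```

The result could then be passed to the model constructor, for example `YOLO(weights_filename("s"))`. Smaller models tend to run faster, while larger ones tend to be more accurate.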

Click on the Play icon to the left of the cell below to initialize the model.

In [None]:
import cv2, yaml
from ultralytics import YOLO
from pprint import pprint 

model = YOLO('yolov8n.pt')  # You can use 'yolov8s.pt', 'yolov8m.pt', etc. for different model sizes

# This code loads the class names from the COCO dataset yaml file. 
def load_class_names(yaml_file):
    with open(yaml_file, 'r') as f:
        data = yaml.safe_load(f)
    return data['names']

class_names = load_class_names('../artifacts/coco.yaml')  # Adjust the path to your coco.yaml file

pprint(class_names)


### Basic object detection on a static image

The next code block will load an image from disk using the Python [OpenCV](https://opencv.org/) library and send it to the model for basic object detection. Any detected objects will be annotated with a box drawn around them.
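
Before looking at the full cell, the labeling step can be sketched on its own with plain Python data. This example assumes detections are available as `(class_id, confidence, box)` tuples; the function `format_labels`, the `min_confidence` threshold, and the sample values are all hypothetical and are not part of the notebook's code.

```python
def format_labels(detections, names, min_confidence=0.5):
    """Build 'class confidence' label strings, skipping weak detections."""
    labels = []
    for class_id, confidence, box in detections:
        if confidence < min_confidence:
            continue  # Ignore detections below the chosen threshold
        labels.append((f"{names[class_id]} {confidence:.2f}", box))
    return labels


# Made-up sample data: a small excerpt of class names and three detections
sample_names = {0: "person", 2: "car"}
sample_detections = [
    (0, 0.91, (10, 20, 110, 220)),
    (2, 0.32, (200, 150, 320, 230)),
    (2, 0.78, (330, 140, 460, 240)),
]

for text, box in format_labels(sample_detections, sample_names):
    print(text, box)
```

The cell below annotates every detection without a threshold; filtering by confidence is one optional way to reduce clutter in the annotated image.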

In [None]:
# Load image
image_path = '../media/image/people_on_street.jpg'
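image = cv2.imread(image_path)
if image is None:  # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError(f'Could not read image: {image_path}')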

# Perform basic detection
results = model(image)

# Draw bounding boxes on the image and label objects by referencing the class names
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        confidence = box.conf[0]
        label = f'{class_names[class_id]} {confidence:.2f}'
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display the image with the bounding boxes until a key is pressed or the window is closed
cv2.imshow('Press any key to close', image)
while True:
    if cv2.waitKey(1) != -1:
        break
    if cv2.getWindowProperty('Press any key to close', cv2.WND_PROP_VISIBLE) < 1:
        break
cv2.destroyAllWindows()

### Object detection in a video file

With a few changes to the implementation, we can also detect objects in a video file with YOLO.

To use YOLO with a video file, we extract individual frames from the video and apply the model to each frame separately. This involves reading the video file, extracting frames at a specified frame rate, performing object detection on each frame, and optionally reassembling the processed frames into a video. This approach lets us leverage YOLO's real-time object detection capabilities on video streams.

![A diagram illustrating the video-to-frame concept](./img/video_to_frame_diagram_small.png)

Another concept to consider is the rate at which frames are extracted from the video and sent to the model for inferencing. This is measured in frames per second (FPS), also known as the frame rate. At 30 frames per second, we need to extract 30 individual images from the video stream every second.

![A diagram illustrating frames-per-second](./img/fps_diagram.png)

The frame rate can be adjusted to balance performance against cost. In our example we sample 3 frames per second, which writes a moderate number of frames to disk for the included sample video and reduces the resources needed to run inferencing against it.
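
The arithmetic behind this trade-off can be sketched in a few lines of plain Python. The helper names `sampling_interval` and `sampled_indices` and the 10-second, 30 FPS clip are assumptions made for illustration, and the sketch rounds the interval to the nearest whole number of frames.

```python
def sampling_interval(source_fps, target_fps):
    """Number of source frames to advance between saved frames (at least 1)."""
    return max(1, round(source_fps / target_fps))


def sampled_indices(total_frames, source_fps, target_fps):
    """Indices of the frames that would be kept at the target rate."""
    step = sampling_interval(source_fps, target_fps)
    return list(range(0, total_frames, step))


# A hypothetical 10-second clip recorded at 30 frames per second
total = 30 * 10
kept = sampled_indices(total, source_fps=30, target_fps=3)
print(f"Kept {len(kept)} of {total} frames; first indices: {kept[:4]}")
```

Under these assumptions, sampling at 3 frames per second keeps every tenth frame, so the model runs on a tenth of the frames it would process at the full frame rate.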

Let's perform this first step and extract frames from the included sample video file.

Run the next cell using the Play button to the left.

In [None]:
import os

video_path = '../media/video/sample.mp4'
#video_path = '../media/video/sample2.mp4'
video_filename = os.path.splitext(os.path.basename(video_path))[0]
output_folder='../video_frames/' + video_filename
os.makedirs(output_folder, exist_ok=True)

target_fps = 3 # Number of frames to save per second of video
cap = cv2.VideoCapture(video_path) # Open the video file

# Get the total number of frames in the video and calculate the interval between frames to capture
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
# Keep the interval at 1 or more so the modulo check below cannot divide by zero
frame_interval = max(1, int(cap.get(cv2.CAP_PROP_FPS) / target_fps))
frame_count = 0
saved_frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Save the frame if it is at the specified interval
    if frame_count % frame_interval == 0:
        frame_filename = os.path.join(output_folder, f'frame_{saved_frame_count:04d}.jpg')
        cv2.imwrite(frame_filename, frame)
        saved_frame_count += 1

    frame_count += 1
    print(f"Read frame {frame_count} of {total_frames} from {video_path}.")
    
cap.release()
print(f"Extracted {saved_frame_count} frames from the video.")

The next cell runs object detection on every frame of the sample video, displays the annotated frames as they are processed, and writes them to `output_video.avi`.

In [None]:
# Load the YOLOv8 model
model = YOLO('yolov8n.pt')

# Load video
video_path = '../media/video/sample.mp4'
cap = cv2.VideoCapture(video_path)

delay = 1

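# Initialize the video writer with the source frame size and frame rate (falling back to 20 FPS if unknown)
fourcc = cv2.VideoWriter_fourcc(*'XVID')
frame_size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
fps = cap.get(cv2.CAP_PROP_FPS) or 20.0
out = cv2.VideoWriter('output_video.avi', fourcc, fps, frame_size)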

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Perform detection
    results = model(frame)

    # Draw bounding boxes on the frame
    for result in results:
        for box in result.boxes:
            class_id = int(box.cls[0])
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            confidence = box.conf[0]
            label = f'{class_names[class_id]} {confidence:.2f}'
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)


    # Write the frame into the output video
    out.write(frame)

    # Display the frame; press q to stop early
    cv2.imshow('Detected objects (press Q to exit)', frame)
    if cv2.waitKey(delay) & 0xFF == ord('q'):
        break

cap.release()
out.release() 
cv2.destroyAllWindows()

## Vision inferencing with Azure OpenAI and GPT-4o

In [None]:
import os
import requests
import base64
from pprint import pprint

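IMAGE_PATH = "../media/image/people_on_street.jpg"
ENDPOINT = "https://jsextoai.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-08-01-preview"
API_KEY = "xxxxxx"  # Placeholder: replace with your own API key

# Base64-encode the image so it can be embedded in a request payload
with open(IMAGE_PATH, 'rb') as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode('ascii')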
headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}

# Payload for the request (system message only; the image is attached in the next cell)
payload = {
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an AI assistant that helps people find information."
        }
      ]
    }
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "max_tokens": 800
}


# Send request
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

# Handle the response as needed (e.g., print or process)
response_json = response.json()
pprint(response_json)

In [None]:
import base64
from pprint import pprint

def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


IMAGE_PATH = "./images/columns.png"
image_base64 = image_to_base64(IMAGE_PATH)
payload = {
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this picture:" },
        {
          "type": "image_url",
          "image_url": { "url": "data:image/png;base64," + image_base64 }
        }
      ]
    }
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "max_tokens": 800
}

response = requests.post(ENDPOINT, headers=headers, json=payload)
response.raise_for_status()  # Raise an HTTPError for unsuccessful status codes
response_json = response.json()

# Pretty print the JSON response
pprint(response_json)