# Setup

In [1]:
# !{sys.executable} -m pip install git+https://github.com/LAS-NCSU/openai-python

In [None]:
# !pip install pillow==10.3.0 torch==2.3.1 git+https://github.com/openai/CLIP.git langchain-community==0.2.5 langchain-openai==0.1.9
# !pip install git+https://github.com/LAS-NCSU/openai-python
# !pip install -q yt_dlp

In [None]:
# =========================== Add your Token here ===========================

%env LAS_API_TOKEN=
# ===========================================================================

In [2]:
import sys
import yt_dlp
import torch
import clip
import os
from openai import OpenAI

import base64
import requests
from IPython.display import Image
from PIL import Image
from pprint import pprint
import pandas as pd
import numpy as np


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)



# Connect to OpenAI

In [3]:

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("LAS_API_TOKEN"),
)

In [4]:
# Test that the client can reach the API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)

This is a test


# Video Clip (Image) Testing
## Single Image Evaluation

In [8]:


# get an image for testing from the image folder
clip_folder = "./dashcam/rgb-images/sample_images"
clip_files = []

for x in os.listdir(clip_folder):
    if x.endswith(".jpg"):
        # Keep only the .jpg files, with paths built from clip_folder
        # print(x)
        clip_files.append(os.path.join(clip_folder, x))

In [9]:
import base64
import requests
from IPython.display import Image


# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
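
The notebook builds `data:` URLs by hand from the base64 string. As a minimal sketch, the MIME type can be guessed from the file extension so that the URL prefix matches the file being sent. The helper name `to_data_url` and the `default_mime` fallback are assumptions for illustration; they are not part of the original notebook.

```python
import base64
import mimetypes


def to_data_url(image_path, default_mime="image/jpeg"):
    """Build a base64 data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # assumed fallback when the extension is unknown
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```

The returned string can be used as the `url` value of an `image_url` content part in place of the hand-written f-string.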


In [None]:
# Path to your image: check that the path works with a random image
image_path = clip_files[5]
print(image_path)
Image(filename=image_path)

In [14]:
# Getting the base64 string
base64_image = encode_image(image_path)
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Identify all the cars in the image and their pixel location, including visible vehicles in the background."},
            {"type": "image_url", "image_url": 
                 {"url": f"data:image/jpeg;base64,{base64_image}"}
            }
        ]}],
    model="gpt-4o",
)
print(chat_completion.choices[0].message.content)

Here is a list of the visible cars in the image along with their pixel locations:

1. **White car (center-left, moving in opposite direction)**
   - Approximate location: (100, 300)

2. **Yellow taxi (center-right, middle lane)**
   - Approximate location: (720, 410)

3. **Yellow taxi (center-right, right lane)**
   - Approximate location: (860, 520)

4. **Yellow taxi (middle of three forward lanes)**
   - Approximate location: (900, 400)

5. **White vehicle (ahead of the foremost yellow taxi in the middle lane)**
   - Approximate location: (950, 350)

6. **Silver car (distanced, behind the white vehicle and to its left)**
   - Approximate location: (1050, 320)

7. **Set of multiple vehicles in the background (mostly white/light-colored)** 
   - Approximate location: (1000, 200)

These locations are estimates and correspond to the approximate center of the respective vehicles in the image.


### Output review from initial pass on an image clip within the video

When prompted with *"Specify all the cars in the image and their pixel location."*, the model listed only 5 cars. A quick look at 00476.jpg shows that the image contains many more.

**Output**:

Sure, here's the list of cars visible in the image along with their approximate pixel locations:

1. **White Car (partially visible in the bottom-left corner)**
   - Approx. Pixel Location: (100, 800)

2. **Gray Car (left side in the rear lane)**
   - Approx. Pixel Location: (250, 600)

3. **Yellow Car (center lane)**
   - Approx. Pixel Location: (550, 380)

4. **White Van (right center lane)**
   - Approx. Pixel Location: (800, 320)

5. **Yellow Car (right side of the image)**
   - Approx. Pixel Location: (1000, 500)

These locations are approximate and the center of the given cars in the image.


A second prompt, *"Identify all the cars in the image and their pixel location, including visible vehicles in the background."*, did not do much better. It identified 7 cars and gave slightly more detail, but some of that detail looks wrong. For example, it labels 3 yellow cars as taxis, although it is not clear from the image that they actually are taxis. Ultimately, it still can't pick out the remaining cars in the noisier parts of the image.

**Output**

Here is a list of the visible cars in the image along with their pixel locations:

1. **White car (center-left, moving in opposite direction)**
   - Approximate location: (100, 300)

2. **Yellow taxi (center-right, middle lane)**
   - Approximate location: (720, 410)

3. **Yellow taxi (center-right, right lane)**
   - Approximate location: (860, 520)

4. **Yellow taxi (middle of three forward lanes)**
   - Approximate location: (900, 400)

5. **White vehicle (ahead of the foremost yellow taxi in the middle lane)**
   - Approximate location: (950, 350)

6. **Silver car (distanced, behind the white vehicle and to its left)**
   - Approximate location: (1050, 320)

7. **Set of multiple vehicles in the background (mostly white/light-colored)** 
   - Approximate location: (1000, 200)

These locations are estimates and correspond to the approximate center of the respective vehicles in the image.
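
To compare the reported locations across prompts, the coordinates can be pulled out of the response text. The sketch below assumes the response keeps the `(x, y)` integer format seen above; the function names and the 1280x720 frame size in the usage note are hypothetical, since the notebook does not state the image dimensions.

```python
import re

# Matches "(x, y)" integer pairs such as "(720, 410)".
COORD_PATTERN = re.compile(r"\((\d+),\s*(\d+)\)")


def extract_locations(response_text):
    """Return every (x, y) integer pair found in the response text, in order."""
    return [(int(x), int(y)) for x, y in COORD_PATTERN.findall(response_text)]


def out_of_bounds(locations, width, height):
    """Return the locations that fall outside a width x height image."""
    return [(x, y) for x, y in locations if not (0 <= x < width and 0 <= y < height)]


sample = """
1. **White car (center-left, moving in opposite direction)**
   - Approximate location: (100, 300)
2. **Yellow taxi (center-right, middle lane)**
   - Approximate location: (720, 410)
"""
locations = extract_locations(sample)
print(locations)
```

Calling `out_of_bounds(locations, 1280, 720)` with the real frame size would flag coordinates the model placed outside the image, which is one quick sanity check on the reported positions.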



### Cosine Similarity Threshold Implementation

Here we try a common computer vision step that cuts noise (redundancies, outliers, etc.) out of image/video data. Because every detail of the dashcam footage is important, this step might not be appropriate.

--

Cosine similarity filtering might not be very fruitful here: we want to capture every new car and every new car position, so a high threshold has to be used. Is there a point to using such a high threshold?
- Many object detection models use a score-based approach to strengthen important information and suppress non-important information by dynamic weighting (attention model). Many others apply a similarity evaluation threshold in an attempt to 'suppress' redundant bounding boxes [source](https://ieeexplore.ieee.org/document/9985205).

The generic threshold is set at .97, but I'm not fully convinced. This is a good topic to explore when writing the paper: "what is the best cosine similarity threshold for dashcam video?" Is this the best way to suppress the redundant images? Should we suppress the redundant images at all?
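
One way to explore the threshold question is to sweep several thresholds and count how many frames each one keeps. The sketch below is a pure-Python illustration on toy 2-D vectors, not the CLIP pipeline: `cosine_similarity`, `select_frames`, and `sweep` are hypothetical helpers, and the selection rule (keep a frame when its similarity to the last kept frame drops below the threshold) is assumed to match the intent of the notebook's filter.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_frames(features, threshold):
    """Keep frame 0, then every frame whose similarity to the last kept frame is below threshold."""
    kept = [0]
    for j in range(1, len(features)):
        if cosine_similarity(features[kept[-1]], features[j]) < threshold:
            kept.append(j)
    return kept


def sweep(features, thresholds):
    """Map each threshold to the number of frames it keeps."""
    return {t: len(select_frames(features, t)) for t in thresholds}


# Toy 2-D "embeddings" that drift slowly; real CLIP features are much longer vectors.
toy_features = [(1.0, 0.0), (1.0, 0.1), (1.0, 0.2), (1.0, 0.5), (1.0, 0.55), (0.0, 1.0)]
for t, count in sweep(toy_features, [0.90, 0.95, 0.99]).items():
    print(f"threshold={t}: kept {count} of {len(toy_features)} frames")
```

A higher threshold keeps more frames, so the sweep shows directly how the frame count (and therefore the request size) grows as the threshold approaches 1.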

Another question is how many video clips are enough to provide a good summary using the LMM. For example, there is not enough space and processing power to analyze ~120 video clips at once (at a .97 threshold), but the model did process 30 images at a .94 threshold. One potential deep dive would be to identify a sweet spot between the threshold and the number of images analyzed.
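
If a single request cannot hold all of the selected frames, one option is to split them into batches and summarize each batch separately. This is only a sketch: `chunk_frames` is a hypothetical helper, and the batch size of 30 is taken from the run that succeeded above, not from any documented limit.

```python
def chunk_frames(frame_indices, batch_size):
    """Split a list of frame indices into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [frame_indices[i:i + batch_size] for i in range(0, len(frame_indices), batch_size)]


# Hypothetical example: 120 selected frames, 30 frames per request.
selected = list(range(120))
batches = chunk_frames(selected, 30)
print(len(batches), [len(b) for b in batches])
```

Each batch could then be sent as its own request, with the per-batch summaries combined in a final prompt. Whether that preserves enough context across batches is an open question for this footage.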

In [18]:
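# Explicit alias: IPython.display.Image was imported above under the same name as PIL's Image
from PIL import Image as PILImage

def get_features(files):
    """Return one CLIP image embedding per file path."""
    ret = []
    for image in files:
        # print("Found image:", image)
        preprocess_image = preprocess(PILImage.open(image)).unsqueeze(0).to(device)

        with torch.no_grad():
            image_features = model.encode_image(preprocess_image)
            ret.append(image_features)
    return ret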

def get_different_frames(features, cos_threshold):
    """Return indices of frames that differ enough from the most recently kept frame."""
    ret = [0]  # always keep the first frame
    i = 0      # index of the current reference (most recently kept) frame
    while i < len(features) - 1:
        print("Index i", i)
        j = i + 1
        # Scan forward, including the last frame, until a frame is different enough
        while j < len(features):
            print("Index j", j)
            similarity_score = torch.nn.functional.cosine_similarity(features[i], features[j])
            # .cpu() is a no-op for tensors that are already on the CPU
            similarity_score = similarity_score.cpu().numpy()[0]
            print("similarity scores", i, j, similarity_score)
            # most of the scores are below 0.85 ... let's see if this threshold is too high
            if similarity_score < cos_threshold:
                ret.append(j)
                print("pushed!", j)
                break
            j = j + 1
        # Continue from the kept frame, or stop if the scan reached the end
        i = j
    return ret

# NOTE: image_files (the list of frame paths for the full video) is not defined in this section.
image_features = get_features(image_files)
# print(image_features, len(image_features))

ct = .958
video_frames = get_different_frames(image_features, ct)
for video in video_frames:
    print(image_files[video])
print("video_frames", video_frames)


percentage_kept = len(video_frames) / len(image_files) * 100
print(f"\n{round(percentage_kept, 2)}%, or {len(video_frames)} video frames were kept from the full {len(image_files)} frames provided at a threshold of {ct}")

Index i 0
Index j 1
similarity scores 0 1 0.99
Index j 2
similarity scores 0 2 0.9863
Index j 3
similarity scores 0 3 0.9863
Index j 4
similarity scores 0 4 0.983
Index j 5
similarity scores 0 5 0.984
Index j 6
similarity scores 0 6 0.981
Index j 7
similarity scores 0 7 0.9795
Index j 8
similarity scores 0 8 0.9775
Index j 9
similarity scores 0 9 0.97
Index j 10
similarity scores 0 10 0.964
Index j 11
similarity scores 0 11 0.9697
Index j 12
similarity scores 0 12 0.966
Index j 13
similarity scores 0 13 0.965
Index j 14
similarity scores 0 14 0.9614
Index j 15
similarity scores 0 15 0.964
Index j 16
similarity scores 0 16 0.966
Index j 17
similarity scores 0 17 0.972
Index j 18
similarity scores 0 18 0.971
Index j 19
similarity scores 0 19 0.9673
Index j 20
similarity scores 0 20 0.9697
Index j 21
similarity scores 0 21 0.9653
Index j 22
similarity scores 0 22 0.972
Index j 23
similarity scores 0 23 0.9785
Index j 24
similarity scores 0 24 0.976
Index j 25
similarity scores 0 25 0.9775

In [14]:
from openai import OpenAI 
import os
import base64
import pprint

## Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("LAS_API_TOKEN", "<your OpenAI API key if not set as an env var>"))

# completion = client.chat.completions.create(
#   model=MODEL,
#   messages=[
#     {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
#     {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
#   ]
# )
# print("Assistant: " + completion.choices[0].message.content)
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
    
def gpt_frame_summaries(video_frames, image_files):
    ret  = []
    for frame in video_frames:
        ret.append(
            {
                # The frames are .jpg files, so the data URL uses image/jpeg
                "type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_files[frame])}"
                }
            })
    return ret

# pprint.pprint(gpt_frame_summaries(video_frames, image_files))


# NOTE: there are too many video frames for a single request; they need to be reduced or split up.
# NOTE: transcript is not defined in this section; it is assumed to be a string defined elsewhere.

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are closely watching dashcam footage. The video is provided through its images clips. Please identify all cars in this video and provide a timestamp, their pixel location within the image frame, and If possible, provide details about the car."},
        {"role": "user", "content": gpt_frame_summaries(video_frames, image_files)},
        {"role": "system", "content": transcript},
        {"role": "system", "content": "Please provide a summary of the video."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)

Here is a summary of the cars identified in the provided frames:

1. **Timestamp: 11:14:11**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Multiple cars parked on the right side of the road.

2. **Timestamp: 11:14:13**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

3. **Timestamp: 11:14:14**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

4. **Timestamp: 11:14:15**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

5. **Timestamp: 11:14:17**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

6. **Timestamp: 11:14:18**
   - Yellow car in the center of the frame.
   - White van on the r

Using GPT-4o on the frames kept by the cosine similarity threshold, the model was able to report timestamps and multiple details about the cars in the video clips.

### First Prompt
With the prompts '*You are generating a video summary using images from frames. Please identify cars in this video and provide a timestamp and their pixel location within the image frame. If possible, provide details about the car.*' and '*Please provide a summary of this video*', it produced 8 timestamped entries with speeds and vehicle details, along with a short summary. The output was:


The video appears to be a dashcam recording of a drive through a busy urban area. Here is a summary of the key points observed in the frames:

1. **Timestamp 11:14:11 to 11:14:13**:
   - The vehicle is driving on a multi-lane road with moderate traffic.
   - Several yellow taxis are visible ahead and to the sides.
   - The speed is around 37 MPH.

2. **Timestamp 11:14:14 to 11:14:17**:
   - The vehicle continues to drive along the road.
   - More yellow taxis and white vans are visible.
   - The speed decreases from 35 MPH to 24 MPH.

3. **Timestamp 11:14:18 to 11:14:22**:
   - The vehicle is approaching a busier section with more traffic.
   - The speed further decreases to 26 MPH.
   - A mix of yellow taxis, white vans, and other vehicles are seen.

4. **Timestamp 11:14:24 to 11:14:29**:
   - The vehicle is now in heavier traffic, moving at a slower pace.
   - The speed drops to 19 MPH.
   - More yellow taxis and a few motorcycles are visible.

5. **Timestamp 11:14:30 to 11:14:36**:
   - The vehicle is almost at a standstill in heavy traffic.
   - The speed is around 9 MPH.
   - The road is congested with various vehicles, including yellow taxis, white vans, and motorcycles.

6. **Timestamp 11:14:41 to 11:14:48**:
   - The vehicle is moving very slowly in dense traffic.
   - The speed is around 3-4 MPH.
   - The road is filled with yellow taxis, white vans, and other vehicles.

7. **Timestamp 11:14:54 to 11:15:04**:
   - The vehicle is still in heavy traffic, moving at a very slow pace.
   - The speed is around 2-3 MPH.
   - A mosque with minarets is visible in the background.

8. **Timestamp 11:15:09 to 11:15:11**:
   - The vehicle is almost at a standstill in front of the mosque.
   - The speed is 0 MPH.
   - The road is congested with yellow taxis, white vans, and motorcycles.

"Throughout the video, the vehicle navigates through a busy urban area with significant traffic, including many yellow taxis and white vans. The speed varies from moderate to very slow as the vehicle approaches and passes through congested sections of the road. The mosque with minarets is a notable landmark visible towards the end of the video."

--- and again after multiple prompts ---

The video appears to be a dashcam recording of a drive through a busy urban area. Here is a summary of the key points observed in the frames:

1. **Timestamp: 11:14:11 - 11:14:13**
   - The vehicle is driving on a multi-lane road with a central divider.
   - Several cars are visible, including a yellow taxi directly ahead and various vehicles parked on the side of the road.

2. **Timestamp: 11:14:14 - 11:14:15**
   - The vehicle continues to drive along the road.
   - More cars are visible, including a white van and a yellow car parked on the right side.

3. **Timestamp: 11:14:17 - 11:14:18**
   - The vehicle is approaching a busier section with more parked cars and a few pedestrians visible on the sidewalk.

4. **Timestamp: 11:14:21 - 11:14:22**
   - The vehicle is moving at a slower speed, indicating possible traffic congestion.
   - More cars are visible, including a white van and several yellow taxis.

5. **Timestamp: 11:14:24 - 11:14:25**
   - The vehicle is driving past a commercial area with shops and buildings.
   - Several cars are visible, including a white van and a yellow taxi.

6. **Timestamp: 11:14:26 - 11:14:28**
   - The vehicle is approaching a traffic jam.
   - More cars are visible, including a silver car on the left and several yellow taxis.

7. **Timestamp: 11:14:29 - 11:14:30**
   - The vehicle is moving slowly due to traffic congestion.
   - Several cars are visible, including a white van and a yellow taxi.

8. **Timestamp: 11:14:31 - 11:14:32**
   - The vehicle is in heavy traffic.
   - More cars are visible, including a white van and several yellow taxis.

9. **Timestamp: 11:14:36 - 11:14:41**
   - The vehicle is in a traffic jam.
   - Several cars are visible, including a white van, a yellow taxi, and a silver car.

10. **Timestamp: 11:14:44 - 11:14:48**
    - The vehicle is moving very slowly due to heavy traffic.
    - Several cars are visible, including a white van, a yellow taxi, and a silver car.

11. **Timestamp: 11:14:54 - 11:14:56**
    - The vehicle is approaching an intersection with a mosque visible in the background.
    - Several cars are visible, including a white van, a yellow taxi, and a silver car.

12. **Timestamp: 11:14:57 - 11:15:03**
    - The vehicle is at a standstill due to traffic congestion.
    - Several cars are visible, including a white van, a yellow taxi, and a silver car.

13. **Timestamp: 11:15:04 - 11:15:11**
    - The vehicle is still in heavy traffic near the mosque.
    - Several cars are visible, including a white van, a yellow taxi, and a silver car.

Throughout the video, the vehicle encounters heavy traffic with numerous cars, including yellow taxis, white vans, and silver cars. The area appears to be a busy urban environment with commercial buildings, shops, and a mosque. The vehicle's speed varies, indicating traffic congestion at various points.



### Second Prompt
With the prompts '*You are generating a video summary using images from frames. Please identify vehicles in this video and provide a timestamp, their pixel location within the image frame, and If possible, provide details about the car.*' and '*Please provide a summary of this video*', it produced an overall review of the images. The output was:

The video appears to be a dashcam recording of a drive through a busy urban area. The timestamps on the images range from 11:14:11 to 11:15:11 on March 1, 2023. Here is a summary of the key observations:

1. **Traffic and Road Conditions:**
   - The road is a multi-lane street with a central divider.
   - Traffic is moderate to heavy, with vehicles moving at varying speeds.
   - The speed of the vehicle with the dashcam ranges from 37 MPH to 0 MPH, indicating periods of both movement and stopping.

2. **Vehicles:**
   - Numerous vehicles are visible, including:
     - Yellow taxis, which are prevalent throughout the video.
     - White vans and minibuses, likely used for public transportation.
     - Private cars of various colors and models.
     - A few motorcycles and a three-wheeler (tuk-tuk) are also seen.
   - A notable vehicle is a white truck with a wooden frame on its back, visible in the later frames.

3. **Surroundings:**
   - The area appears to be a commercial district with multiple shops and buildings on both sides of the road.
   - There are several signboards and advertisements in Arabic.
   - Palm trees and other greenery are visible, indicating a warm climate.

4. **Landmarks:**
   - A prominent mosque with two minarets and a blue dome is visible towards the end of the video, suggesting the vehicle is approaching a significant landmark.

5. **Pedestrians:**
   - A few pedestrians are seen walking along the sidewalks and crossing the street.

6. **Weather and Lighting:**
   - The weather is clear and sunny, with bright daylight illuminating the scene.

"Overall, the video captures a typical urban driving experience with various vehicles, commercial establishments, and a notable mosque, all under clear weather conditions."

### Third Prompt

Without the second prompt ("summarize the video"), the prompt below essentially produced a play-by-play of the video, repeating the same unnecessary details for each timestamp. The timestamps are present, but they aren't linear:

*"You are closely watching dashcam footage and need to identify the different cars in the video. The video is provided using images from frames. Please identify all cars in this video and provide a timestamp, their pixel location within the image frame, and If possible, provide details about the car."*

Here is a summary of the cars identified in the provided frames:

1. **Timestamp: 11:14:11**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Multiple cars parked on the right side of the road.

2. **Timestamp: 11:14:13**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

3. **Timestamp: 11:14:14**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

4. **Timestamp: 11:14:15**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

5. **Timestamp: 11:14:17**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

6. **Timestamp: 11:14:18**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

7. **Timestamp: 11:14:21**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

8. **Timestamp: 11:14:22**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

9. **Timestamp: 11:14:24**
   - Yellow car in the center of the frame.
   - White van on the right side of the frame.
   - Yellow car on the left side of the frame.

10. **Timestamp: 11:14:25**
    - Yellow car in the center of the frame.
    - White van on the right side of the frame.
    - Yellow car on the left side of the frame.

11. **Timestamp: 11:14:26**
    - Yellow car in the center of the frame.
    - White van on the right side of the frame.
    - Yellow car on the left side of the frame.


...
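
To check the timestamps in a response like the one above, they can be parsed from the text and the gaps between consecutive entries measured. This is a sketch that assumes the `HH:MM:SS` format shown in the output; `extract_timestamps` and `timestamp_gaps` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches HH:MM:SS timestamps such as "11:14:11".
TIMESTAMP_PATTERN = re.compile(r"\b(\d{2}:\d{2}:\d{2})\b")


def extract_timestamps(response_text):
    """Return every HH:MM:SS timestamp in the text, in order of appearance."""
    return [datetime.strptime(t, "%H:%M:%S") for t in TIMESTAMP_PATTERN.findall(response_text)]


def timestamp_gaps(timestamps):
    """Return the gap in seconds between each consecutive pair of timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(timestamps, timestamps[1:])]


sample = "Timestamp: 11:14:11 ... Timestamp: 11:14:13 ... Timestamp: 11:14:14 ... Timestamp: 11:14:17"
gaps = timestamp_gaps(extract_timestamps(sample))
print("gaps (s):", gaps)
print("strictly increasing:", all(g > 0 for g in gaps))
```

A negative or zero gap would indicate entries that are out of order or repeated, and uneven positive gaps show where the frame filter dropped more frames.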


86



