# Video QA GPT
* Uses OpenAI's vision models to understand a video and answer questions about it.

Idea for implementation:

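* Frames are resized to 150x150; this can be changed easily to adjust cost.
* The API accepts at most 250 frames per request: 250 frames / 30 fps = 8.3 seconds of video.
    * Those 250 frames don't have to be consecutive. Frames can be skipped, since models don't see video data the same way humans do.
    * Keeping every 2nd frame (2x) covers about 16.7 seconds; keeping every 4th frame (4x) covers about 33.3 seconds.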
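
The frame-budget arithmetic in the list above can be written out as a small helper. This is only a sketch: the name `seconds_covered` and its parameters are illustrative and do not appear elsewhere in this notebook.

```python
def seconds_covered(max_frames: int = 250, fps: float = 30.0, stride: int = 1) -> float:
    """Seconds of source video covered by one request of `max_frames` frames
    when only every `stride`-th frame is kept."""
    if fps <= 0 or stride < 1:
        raise ValueError("fps must be positive and stride must be >= 1")
    return max_frames * stride / fps

# Compare the 1x, 2x and 4x "frame speeds" for a 30 fps video
for stride in (1, 2, 4):
    print(f"{stride}x: {seconds_covered(stride=stride):.1f} seconds")
```

The same helper works for other frame rates, for example `seconds_covered(fps=24.0, stride=4)`.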

## Approach
I set it up so that for every 250 frames (about 8.3 seconds at 30 fps), GPT-4o mini is prompted to describe the video in detail and to recommend a frame, so that a single frame from the 250 can be included as context in the final generation (that might not be necessary). I also keep only every 4th frame for faster, cheaper processing, which stretches each set of 250 frames to about 33.3 seconds, so the video length isn't as limited.

## Cost Assessment

**GPT-4o**:
Using this video processing method is pretty expensive. Even at 150x150, one set of 250 frames costs $0.32 in input tokens. Adding the system message text and the expected output text (output tokens normally cost ~3x more than input tokens) brings that to $0.35-$0.50 for every 8.3 seconds of video, or $2.50-$3.50 for every minute. This only covers generating the context that is supplied to the conversational chatbot; answering questions about the video then adds more cost. In other words, this is all a video processing fee paid before the user can even try out the chatbot. That seems pretty unrealistic. I'd never pay $2.50+ to have a chatbot answer questions on a one minute video.

**GPT-4o-mini**:
GPT-4o mini is far more cost efficient: a little less than a penny for every 250 frames (8.3 seconds) of video. There is still an upload cost attached to this ($0.32 for almost 5 minutes of video), but that's more reasonable. Keeping only every 4th frame also helps with processing cost and time.
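
As a rough check on the figures above, the input cost of the frames alone can be computed per minute of video. This is a sketch under stated assumptions: `cost_per_minute` is an illustrative name, 255 is the token count the notebook computes for a 150x150 frame, and the per-1k prices are the ones hard-coded in the cost cell further down, which may not match current pricing.

```python
def cost_per_minute(tokens_per_frame: int, price_per_1k_input: float,
                    fps: float = 30.0, stride: int = 1) -> float:
    """Input-token cost (USD) of the frames sent for one minute of source video."""
    frames_sent = 60 * fps / stride
    return frames_sent * tokens_per_frame / 1000 * price_per_1k_input

# 255 tokens per 150x150 frame; prices are assumptions taken from this notebook
print(f"GPT-4o, every frame:          ${cost_per_minute(255, 0.005):.2f}")
print(f"GPT-4o mini, every frame:     ${cost_per_minute(255, 0.00015):.3f}")
print(f"GPT-4o mini, every 4th frame: ${cost_per_minute(255, 0.00015, stride=4):.3f}")
```

This counts frame tokens only. Prompt text and generated descriptions come on top, which is why the per-minute estimate for GPT-4o in the text is higher than the frame-only figure.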

## Time Assessment
One generation with 250 frames plus a text prompt can take up to 2-3 minutes. Making the description generations async, so that they run in parallel instead of in series, should cut video processing time by up to n-fold for a video with n chunks.
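
A minimal sketch of the parallel idea, using `asyncio.gather`. The coroutine `describe_chunk` is a hypothetical stand-in that only sleeps; a real version would await an async API request there (for example through the `AsyncOpenAI` client from the `openai` package). `describe_all` and the fake frame list are also illustrative names.

```python
import asyncio

async def describe_chunk(chunk_index: int, frames: list) -> dict:
    """Hypothetical stand-in for one description request."""
    await asyncio.sleep(0.1)  # simulates waiting on the network
    return {"chunk": chunk_index, "num_frames": len(frames)}

async def describe_all(frames: list, chunk_size: int = 250) -> list:
    # Split the frames into chunks of at most `chunk_size`
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    # gather() runs the requests concurrently and returns results in chunk order
    return await asyncio.gather(*(describe_chunk(n, c) for n, c in enumerate(chunks)))

results = asyncio.run(describe_all(["frame"] * 600))
print(results)
```

Inside a notebook an event loop is already running, so the last step would be `results = await describe_all(frames)` instead of `asyncio.run(...)`. All three simulated requests finish in roughly the time of one, which is the saving the paragraph above is after.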

### Work on:
* Make description generation function async for parallel processing
* Create an option to switch between 1x, 2x, and 4x "frame speed"
* Do I need the frame recommendation? The final generation will have a lot of text, which should cover the video well enough on its own (test this out)

### Install Libraries

In [None]:
%pip install openai python-dotenv opencv-python

### Create imports

In [1]:
from IPython.display import display, Image
import cv2  # OpenCV reads the video; install with: pip install opencv-python
import base64
import time
from openai import OpenAI
from dotenv import load_dotenv
import os, json, sys
import numpy as np


### Set API Connection

In [3]:
load_dotenv('.env')
# print(os.getenv('OPENAI_API_KEY'))
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# print(client.chat.completions.create(model='gpt-4o-mini', messages=[{'role': 'user', 'content': 'What is the capital of France?'}]))

### Get video path and initialize global variables

In [4]:
# Set your video path here
video_path = 'enter_your_video_path_here.mp4'

CONTEXT_LIMIT = 128000 # 128k-token context window for GPT-4V and GPT-4o (mini)
MAX_FRAMES = 250 # OpenAI API limit for video frames
SIZE_LIMIT_MB = 20 # OpenAI API limit of 20 MB for image size
GPT_4o_MINI_MAX_GEN_TOKENS = 16384 # 16k max output tokens for GPT-4o mini

# Preprocessing
### Resize and encode video
* We resize the video to 150x150 for cost effectiveness (the resize dimensions can be changed).
    * The bigger the video dimensions, the more tokens per frame
* We encode the video as base64
    * Each frame is read and encoded as a .jpg image
    * The JPEG bytes are then base64-encoded and decoded into a UTF-8 string for OpenAI

In [None]:
def resize_and_encode_video(input_path, width, height):
    # Open the input video
    input_video = cv2.VideoCapture(input_path)
    if not input_video.isOpened():
        print("Error opening video file")
        # Return an empty frame list and an FPS of 0 so callers can still unpack two values
        return [], 0.0

    base64Frames = []
    while True:
        ret, frame = input_video.read()
        if not ret:
            break

        # Resize the frame
        resized_frame = cv2.resize(frame, (width, height))

        # Encode the frame to JPEG
        _, buffer = cv2.imencode(".jpg", resized_frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    # Get FPS
    fps = input_video.get(cv2.CAP_PROP_FPS)
    
    # Release the video capture
    input_video.release()
    cv2.destroyAllWindows()
    
    print(len(base64Frames), "frames read.")
    return base64Frames, fps

# Example usage
base64Frames, fps = resize_and_encode_video(video_path, 150, 150)

### Check if the video got resized

In [None]:
def get_video_dimensions(frames_array):
    # Only inspect a frame if the list is non-empty
    if frames_array:
        # Decode base64 to bytes
        frame_data = base64.b64decode(frames_array[0])
        
        # Convert bytes data to numpy array
        nparr = np.frombuffer(frame_data, np.uint8)
        
        # Decode image
        img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
        
        # Get dimensions
        height, width = img.shape[:2]
        print("Video dimensions:", width, "x", height)
    else:
        print("The list of frames is empty.")
        height, width = 0, 0
    return height, width


height, width = get_video_dimensions(base64Frames)

## Take out excess frames
* Saves on processing time and cost

In [7]:
# Here we define how we are going to compress the video frames
# We can simply take every N-th frame, that way we compress the video while keeping the same length
def compress_video_frames(frames_array, n):
    return frames_array[::n]

base64Frames = compress_video_frames(base64Frames, 4)

### Display the video that will be shown to the model
* Remember the model is able to use each frame as context information
    * Different than how humans experience video

In [None]:
display_handle = display(None, display_id=True)
for img in base64Frames: 
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.1)

### Calculate the tokens for each frame (image)
* By this point the video has been resized to 150x150, so this function returns the same token count for every frame
    * Keeping it as a function makes the code modular: the token count updates automatically if you change the resize dimensions

In [None]:
from math import ceil

def calculate_image_tokens(width: int, height: int):
    if width > 2048 or height > 2048:
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width, height = 2048, int(2048 / aspect_ratio)
        else:
            width, height = int(2048 * aspect_ratio), 2048
            
    if width >= height and height > 768:
        width, height = int((768 / height) * width), 768
    elif height > width and width > 768:
        width, height = 768, int((768 / width) * height)

    tiles_width = ceil(width / 512)
    tiles_height = ceil(height / 512)
    total_tokens = 85 + 170 * (tiles_width * tiles_height)
    
    return total_tokens

tokens_per_frame = calculate_image_tokens(width, height)
print("Tokens per frame:", tokens_per_frame)
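
To see how frame size drives the token count, the sketch below restates the same tiling rule in compact form and also returns the number of 512-pixel tiles. `tiles_and_tokens` is an illustrative helper, not part of the notebook, and the example resolutions are arbitrary.

```python
from math import ceil

def tiles_and_tokens(width: int, height: int) -> tuple:
    """Same tiling rule as calculate_image_tokens, returning (tiles, tokens)."""
    # Step 1: fit the image inside a 2048 x 2048 square
    long_side = max(width, height)
    if long_side > 2048:
        width, height = int(width * 2048 / long_side), int(height * 2048 / long_side)
    # Step 2: shrink so the shorter side is at most 768 px
    short_side = min(width, height)
    if short_side > 768:
        width, height = int(width * 768 / short_side), int(height * 768 / short_side)
    # Step 3: count 512 px tiles; 85 base tokens plus 170 per tile
    tiles = ceil(width / 512) * ceil(height / 512)
    return tiles, 85 + 170 * tiles

for size in [(150, 150), (1920, 1080), (4096, 2048)]:
    tiles, tokens = tiles_and_tokens(*size)
    print(f"{size[0]}x{size[1]}: {tiles} tile(s), {tokens} tokens")
```

A 150x150 frame fits in a single tile, while a 1920x1080 frame is scaled to 1365x768 and needs six tiles, so resizing cuts the per-frame token count by more than a factor of four.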

## Calculate Cost

In [None]:
# Calculate the total number of tokens for a video with 250 frames (OpenAI API limit)
video_tokens = MAX_FRAMES * tokens_per_frame

# Initialize the cost variables for gpt-4o and gpt-4o-mini
input_token_cost_per1k = 0.0050
output_token_cost_per1k = 0.0150
mini_input_token_cost_per1k = 0.00015
mini_output_token_cost_per1k = 0.0006

# Calculate the cost for 250 frames
cost_frames_250 = (video_tokens/1000) * mini_input_token_cost_per1k

# Calculate the maximum tokens and cost for the natural language description output
# max_output_tokens = (CONTEXT_LIMIT - video_tokens) - 250 # padding for prompt and other tokens
max_output_cost = GPT_4o_MINI_MAX_GEN_TOKENS / 1000 * mini_output_token_cost_per1k

# Calculate how many generations are needed for the preprocessed video
v_length_frames = len(base64Frames)
if (v_length_frames % MAX_FRAMES) == 0:
    num_generations = v_length_frames // MAX_FRAMES
else:
    num_generations = v_length_frames // MAX_FRAMES + 1

# Calculate the cost of the uploaded video and the video description generations
input_video_cost = num_generations * cost_frames_250
max_desc_generation_cost = num_generations * max_output_cost

# Calculate the total cost of video processing
max_total_video_processing_cost = input_video_cost + max_desc_generation_cost

# Print the results
print("Total tokens for 250 frames:", video_tokens)
print(f"Total cost for 250 frames: ${cost_frames_250}")
print(f"Max output cost: ${max_output_cost}\n") # max_output_cost
print(f"Your video needs {num_generations} descriptions.")
print(f"Total cost of uploaded video: ${input_video_cost}")
print(f"Total cost of video description generations: ${max_desc_generation_cost}")
print(f"Total cost of video processing: ${max_total_video_processing_cost}")

## Generate Video Chunk Description
* The API call is wrapped in a try/except block because it can fail in various ways.
    * This allows the program to fill out the JSON even if one API call results in an error

In [None]:
def generate_video_description(frames_array):
    """
    This function generates a description of the video for every 250 frames.
    """
    total_frames = len(frames_array)
    descriptions = []
    counter = 1
    for i in range(0, total_frames, MAX_FRAMES):
        end_frame = min(i + MAX_FRAMES, total_frames)
        video_chunk = frames_array[i:end_frame]
        chunk_size = sum(sys.getsizeof(frame) for frame in video_chunk)
        chunk_size_mb = chunk_size / (1024 * 1024)
        print(f"Generating description of chunk {counter} from frame {i} to {end_frame}\nChunk size: {chunk_size_mb:.2f} MB")
        try:
            description = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful video description assistant. Please describe the video that the user inputs with as much detail as possible. Be specific about colors, numbers, and all the fine details of the scene. Please format the description as a paragraph (no bullet points or numbering). When you're done, please pick the frame number (1-250) that you think would best represent the video (ex. Frame Recommendation: 125)."
                    },
                    {
                        "role": "user",
                        "content": [
                            "Describe the video", *map(lambda x: {"image": x}, video_chunk)
                        ]
                    }
                ],
                temperature=0,
                max_tokens=GPT_4o_MINI_MAX_GEN_TOKENS,
            )
            generation_desc = description.choices[0].message.content
            print(generation_desc)
            chunk_dict = {}
            chunk_dict["start_frame"] = i
            chunk_dict["end_frame"] = end_frame
            chunk_dict["total_tokens"] = description.usage.total_tokens
            # Read the digits that follow the last "Frame Recommendation: " marker
            marker = "Frame Recommendation: "
            index = generation_desc.rfind(marker)
            digits = ""
            if index != -1:
                for ch in generation_desc[index + len(marker):].strip():
                    if not ch.isdigit():
                        break
                    digits += ch
            if digits:
                # Clamp to the chunk so a bad recommendation cannot index out of range
                frame_number = min(max(int(digits), 1), len(video_chunk))
                print("Frame Recommendation: ", frame_number, "\n\n")
            else:
                # Fall back to the middle frame (125 for a full 250-frame chunk)
                frame_number = (len(video_chunk) + 1) // 2
                print(f"No frame recommendation found. Defaulting to frame {frame_number}.\n\n")
            chunk_dict["frame_number"] = frame_number
            chunk_dict["frame_data"] = video_chunk[frame_number - 1]
            chunk_dict["description"] = generation_desc
        except Exception as e:
            print("Error generating description:", str(e))
            # Fall back to the middle frame of this chunk (125 for a full chunk)
            fallback_frame = (len(video_chunk) + 1) // 2
            chunk_dict = {}
            chunk_dict["start_frame"] = i
            chunk_dict["end_frame"] = end_frame
            chunk_dict["description"] = "Error generating description"
            chunk_dict["total_tokens"] = 0
            chunk_dict["frame_number"] = fallback_frame
            chunk_dict["frame_data"] = video_chunk[fallback_frame - 1]

        descriptions.append(chunk_dict)
        counter += 1
    return descriptions

video_descriptions = generate_video_description(base64Frames)
# Save the video descriptions to a file
with open("video_descriptions.json", "w") as f:
    json.dump(video_descriptions, f, indent=4)
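
The `SIZE_LIMIT_MB` global is defined above but never checked. The sketch below shows one way a chunk could be compared against it. It measures the length of the base64 text rather than using `sys.getsizeof`, which also counts Python object overhead. `chunk_payload_mb` and the fake frames are illustrative, and whether the 20 MB limit applies per image or per request is not stated in this notebook; checking the whole chunk is the stricter reading.

```python
import base64

SIZE_LIMIT_MB = 20  # same value as the notebook's global

def chunk_payload_mb(frames: list) -> float:
    """Approximate payload in MB. Base64 text is ASCII, so one character is one byte."""
    return sum(len(frame) for frame in frames) / (1024 * 1024)

# Stand-in data: 250 fake "frames", each the base64 encoding of 6,000 zero bytes
fake_frames = [base64.b64encode(bytes(6000)).decode("utf-8") for _ in range(250)]
size_mb = chunk_payload_mb(fake_frames)
print(f"{size_mb:.2f} MB, within limit: {size_mb <= SIZE_LIMIT_MB}")
```

A check like this could run before each API call so that an oversized chunk is skipped or subsampled further instead of failing at request time.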

## Take a peek at the description data

In [None]:
video_descriptions[:5]
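
Each entry also carries a full base64 frame under `frame_data`, so printing the list directly produces a lot of output. A lighter view can be built as below. This is a sketch: `summarize_chunks` and the `example` entry are illustrative, with the dictionary keys matching those written by `generate_video_description`.

```python
def summarize_chunks(chunks: list, preview_chars: int = 80) -> list:
    """Return copies of the chunk dicts with the bulky base64 frame removed
    and the description shortened, for display only."""
    summary = []
    for chunk in chunks:
        slim = {k: v for k, v in chunk.items() if k != "frame_data"}
        slim["description"] = slim.get("description", "")[:preview_chars]
        summary.append(slim)
    return summary

# Made-up entry with the same keys as the real chunk dictionaries
example = [{"start_frame": 0, "end_frame": 250, "total_tokens": 64000,
            "frame_number": 125, "frame_data": "A" * 10000,
            "description": "A person walks across a street. " * 10}]
print(summarize_chunks(example))
```

The original list is left untouched, so the full descriptions and frames remain available for the final prompt.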

## Create the API Call with video description context
* message_list as an argument for the conversational history

In [23]:
# Using video_descriptions as a video summary, answer questions about the video

def ask_informed_model(message_list):
    ''' 
    This function sends the conversation so far (including the video description) to GPT-4o mini,
    appends the assistant's reply to message_list, and returns the reply text.
    '''
    response = client.chat.completions.create(
        model = "gpt-4o-mini",
        messages = message_list,
        temperature=0,
    )
    # print(response.usage.total_tokens)
    response_text = response.choices[0].message.content
    message_list.append(
        {
            "role": "assistant",
            "content": response_text
        }
    )
    return response_text

# Conversational Loop

In [None]:
total_description = ""
# Join every chunk description into one block of context
# ("chunk" instead of "set" avoids shadowing the built-in set type)
for chunk in video_descriptions:
    total_description += chunk["description"] + " "
msgs = [
    {
        "role": "system",
        "content": f"You are a helpful video assistant. Please answer the following questions about the video, using this text description of it: {total_description}. If you aren't sure of the answer, you can say 'I am not sure. Here's what I have [unsure answer], and here is what I am basing it off of [knowledge]'."
    }
]

while True:
    prompt = input("Ask a question about the video (enter 'exit' to stop): ").strip()
    if prompt.lower() == "exit":
        print("Goodbye!")
        break
    else:
        print(f"User: {prompt}")
        msgs.append(
            {
                "role": "user",
                "content": prompt
            }
        )
    print(f"Assistant: {ask_informed_model(msgs)}") 