<img src="../DLSU-ALTDSI-logo.png" width="100%" style="margin-bottom:10px; margin-top:0px;"/>

**This notebook contains the context-aware video retrieval pipeline used in the study:**

## *Comparing Modality Representation Schemes in Video Retrieval for More Context-Aware Auto-Annotation of Trending Short-Form Videos*

**By the following researchers from the Andrew L. Tan Data Science Institute:**
1. Ong, Matthew Kristoffer Y. (matthew_kristoffer_ong@dlsu.edu.ph)
2. Presas, Shanette Giane G. (shanette_giane_presas@dlsu.edu.ph)
3. Sarreal, Sophia Althea R. (sophia_sarreal@dlsu.edu.ph)
4. To, Jersey Jaclyn K. (jers_to@dlsu.edu.ph)

---

Note to thesismates:
1. Navigate to the similarity pipeline folder first.
2. Activate the virtual environment for the terminal instance: `.venv\Scripts\activate`
3. You will also need the following:
    1. `class_labels_indices.csv`
    2. `Cnn14_mAP=0.431.pth` (the model weights to be used), from https://zenodo.org/records/3987831
    3. This specific CUDA 12.1 build of torch/torchvision/torchaudio: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`

## Dependencies

In [70]:
import os
from pathlib import Path

# audio
import numpy as np
import matplotlib.pyplot as plt
import ffmpeg
import torch
import librosa
from panns_inference import AudioTagging

# visuals
import torchvision.models as models
import torchvision.transforms as transforms
import cv2
import argparse
from tqdm import tqdm
from PIL import Image
import time

#text
import easyocr
import pandas as pd
import re
import json
from collections import defaultdict
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import warnings
import string
from ftfy import fix_text
from wordsegment import load as ws_load, segment as ws_segment
from spellchecker import SpellChecker
import wordninja
from wordfreq import zipf_frequency
import glob
import pathlib
import subprocess
import sys
from faster_whisper import WhisperModel
from openai import OpenAI
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer

#similarity
from numpy.linalg import norm

#gemini
import getpass
from typing import List, Dict, Any
import google.generativeai as genai

import faulthandler
faulthandler.enable()

# Make sure cuda (gpu) is active!
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

warnings.filterwarnings('ignore')

Using device: cpu


---
## **AUDIO MODALITY**
**Goal**: Produce embeddings representing the audio modality of a given set of videos.

**Preprocessing step:** extracts a mono, 16-bit PCM WAV file at 32 kHz from each input video.

In [34]:
def extract_audio_to_wavs(video_path: str, out32: str, overwrite: bool=True):
    extract_32k=(
        ffmpeg.input(video_path).output(out32, format='wav', acodec='pcm_s16le', ac=1, ar=32000)
    )
    if overwrite:
        extract_32k = extract_32k.overwrite_output()
    
    extract_32k.run(quiet=True)
    print("Wrote 32kHz", out32)

In [35]:
def process_video(video_path: str, out_dir: str = "proc_out"):
    """Extract one video's audio into the sibling folder '<out_dir>_32kHz'."""
    out_dir = Path(out_dir)
    audio_dir = out_dir.parent / (out_dir.name + "_32kHz")
    audio_dir.mkdir(parents=True, exist_ok=True) # 32kHz goes to audio_dir

    video = Path(video_path)
    out32 = audio_dir / (video.stem + "_32k.wav") # 32kHz output

    # Extract audio
    extract_audio_to_wavs(str(video), str(out32))

In [36]:
media_dir = Path("media")
videos = list(media_dir.glob("*.mp4"))
print(f"{len(videos)} videos found!")

for video in videos:
    print(f"\nProcessing: {video.name}")
    process_video(video)

10 videos found!

Processing: trend10vid5.mp4
Wrote 32kHz proc_out_32kHz\trend10vid5_32k.wav

Processing: trend10vid6.mp4
Wrote 32kHz proc_out_32kHz\trend10vid6_32k.wav

Processing: trend1vid1.mp4
Wrote 32kHz proc_out_32kHz\trend1vid1_32k.wav

Processing: trend1vid2.mp4
Wrote 32kHz proc_out_32kHz\trend1vid2_32k.wav

Processing: trend1vid3.mp4
Wrote 32kHz proc_out_32kHz\trend1vid3_32k.wav

Processing: trend1vid4.mp4
Wrote 32kHz proc_out_32kHz\trend1vid4_32k.wav

Processing: trend1vid5.mp4
Wrote 32kHz proc_out_32kHz\trend1vid5_32k.wav

Processing: trend3vid6.mp4
Wrote 32kHz proc_out_32kHz\trend3vid6_32k.wav

Processing: trend3vid7.mp4
Wrote 32kHz proc_out_32kHz\trend3vid7_32k.wav

Processing: trend6vid7.mp4
Wrote 32kHz proc_out_32kHz\trend6vid7_32k.wav
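
As a quick sanity check on the preprocessing output, the sketch below reads a WAV header with Python's standard `wave` module and confirms the file is 32 kHz, mono, 16-bit PCM, which is what the ffmpeg call above requests. The helper names `describe_wav` and `is_pann_ready` are hypothetical and are not part of the pipeline.

```python
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()
        duration = wf.getnframes() / float(rate)
    return rate, channels, width, duration

def is_pann_ready(path, expected_rate=32000):
    """True when the file is mono, 16-bit PCM at the expected sample rate."""
    rate, channels, width, _ = describe_wav(path)
    return rate == expected_rate and channels == 1 and width == 2
```

Files written by `extract_audio_to_wavs` should pass `is_pann_ready`, assuming ffmpeg honored the requested codec, channel count, and sample rate.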


**Feature extraction step:** produces one 2048-dimensional embedding per video with the pretrained PANNs CNN14 model, representing that video's audio.

In [37]:
proc_out_32kHz_dir = Path("proc_out_32kHz")
emb_out_dir = Path("embeddings_out/audio2048") # 2048-d vectors go here
emb_out_dir.mkdir(parents=True, exist_ok=True)

at_model = AudioTagging(checkpoint_path=None, device=device) #this is the pretrained CNN14

wav_files = sorted(proc_out_32kHz_dir.glob("*_32k.wav"))
print(f"{len(wav_files)} WAV files found!")

for wav_path in wav_files:
    print(f"\nProcessing: {wav_path.name}")
    wav, sr = librosa.load(str(wav_path), sr=32000, mono=True) # just to make sure wav is 32kHz
    audio_batch = np.expand_dims(wav, axis=0) # matches the expected shape of PANN

    _, embedding = at_model.inference(audio_batch) # gets the embedding as numpy array

    embedding_vec = embedding[0] # first element of embedding array

    # just removing the "_32k" for filename consistency
    stem = wav_path.stem
    if stem.endswith("_32k"):
        stem = stem[:-4]

    out_path = emb_out_dir / f"{stem}_emb-audio2048.npy"
    np.save(str(out_path), embedding_vec)
    print("Embedding saved: ", out_path)

    print(embedding_vec) # if you want to see the vector
    print(embedding_vec.shape)

Checkpoint path: C:\Users\Shanette/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
200 WAV files found!

Processing: airball_10_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_10_emb-audio2048.npy
[0.         0.         0.         ... 0.17802377 0.         0.        ]
(2048,)

Processing: airball_11_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_11_emb-audio2048.npy
[0.        0.        0.        ... 0.2917655 0.        0.       ]
(2048,)

Processing: airball_12_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_12_emb-audio2048.npy
[0.         0.         0.         ... 0.24517164 0.         0.        ]
(2048,)

Processing: airball_13_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_13_emb-audio2048.npy
[0.         0.         0.         ... 0.31056064 0.         0.        ]
(2048,)

Processing: airball_1_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_1_emb-audio2048.npy
[0.         0.         0.         ... 0.31978565 0.02038169 0.        
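
For later comparison steps it can be convenient to read the saved vectors back into a single matrix. The following is a minimal sketch under the assumption that every file in the folder follows the `<stem>_emb-audio2048.npy` naming used above; `load_embeddings` is a hypothetical helper, not a function from the notebook.

```python
from pathlib import Path
import numpy as np

def load_embeddings(emb_dir, suffix="_emb-audio2048.npy"):
    """Load every '<stem><suffix>' file in emb_dir into (names, matrix).

    names[i] is the video stem and matrix[i] is its embedding vector.
    """
    emb_dir = Path(emb_dir)
    names, vectors = [], []
    for path in sorted(emb_dir.glob(f"*{suffix}")):
        vec = np.load(path)
        if vec.ndim != 1:
            raise ValueError(f"Expected a 1-D embedding in {path.name}, got shape {vec.shape}")
        names.append(path.name[: -len(suffix)])  # strip the suffix to recover the stem
        vectors.append(vec)
    if not vectors:
        return names, np.empty((0, 0), dtype=np.float32)
    return names, np.stack(vectors)
```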

---
## **VISUAL MODALITY**
**Goal**: Produce embeddings representing the visual modality of a given set of videos.

In [38]:
# --- 1. SET YOUR LOCAL DIRECTORIES ---
INPUT_DIR = Path("media")
OUTPUT_DIR = Path("embeddings_out/video2048")

# --- 2. SET YOUR MODEL PARAMETERS ---
FRAME_SAMPLE_RATE = 30
BATCH_SIZE = 32

# --- 3. DEFINE VIDEO EXTENSIONS TO FIND ---
VIDEO_EXTENSIONS = [".mp4", ".mov", ".avi", ".mkv", ".webm"]

In [39]:
def get_resnet_model(device: str):
    """Loads the pre-trained ResNet-50 model and its associated transforms."""
    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights)
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.eval()
    model.to(device)
    preprocess = weights.transforms()
    return model, preprocess

model, preprocess = get_resnet_model(device)

In [40]:
def extract_resnet_embeddings(
    video_path: Path, 
    model, 
    preprocess, 
    device: str, 
    frame_sample_rate: int = 30, 
    batch_size: int = 32
) -> np.ndarray:
    if not video_path.exists():
        raise FileNotFoundError(f"Video file not found: {video_path}")

    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        raise IOError(f"Cannot open video file: {video_path}")

    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    all_features = []
    frame_batch = []
    frame_idx = 0
    
    pbar = tqdm(total=frame_count, desc=f"Frames for {video_path.name}", leave=True, disable=True)

    with torch.no_grad():
        while True:
            ret, frame = cap.read()
            if not ret: break
            pbar.update(1)
            
            if frame_idx % frame_sample_rate == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                pil_img = Image.fromarray(frame_rgb)
                frame_batch.append(pil_img)

                if len(frame_batch) == batch_size:
                    image_inputs = torch.stack(
                        [preprocess(img) for img in frame_batch]
                    ).to(device)
                    image_features = model(image_inputs)
                    all_features.append(image_features.flatten(start_dim=1).cpu().numpy())  # stays (N, 2048) even when N == 1
                    frame_batch = []
            frame_idx += 1
        
        if frame_batch:
            image_inputs = torch.stack(
                [preprocess(img) for img in frame_batch]
            ).to(device)
            image_features = model(image_inputs)
            all_features.append(image_features.flatten(start_dim=1).cpu().numpy())  # stays (N, 2048) even when N == 1

    cap.release()
    pbar.close()
    if not all_features:
        raise ValueError(f"No frames sampled for {video_path.name}")

    embeddings = np.vstack(all_features)
    mean_embedding = np.mean(embeddings, axis=0)
    return mean_embedding
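
The function above keeps every `frame_sample_rate`-th frame and averages the per-frame features into one vector per video. A minimal NumPy-only sketch of those two ideas is shown here; `sampled_frame_indices` and `mean_pool` are hypothetical names used for illustration, and the small feature dimension in any toy input stands in for ResNet-50's 2048.

```python
import numpy as np

def sampled_frame_indices(frame_count, frame_sample_rate=30):
    """Indices of the frames that pass the `frame_idx % frame_sample_rate == 0` test."""
    return list(range(0, frame_count, frame_sample_rate))

def mean_pool(frame_features):
    """Average a (num_frames, dim) feature matrix into a single (dim,) vector."""
    feats = np.asarray(frame_features, dtype=np.float32)
    if feats.ndim != 2 or feats.shape[0] == 0:
        raise ValueError("Expected a non-empty (num_frames, dim) array")
    return feats.mean(axis=0)
```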

In [41]:
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Reading videos from: {INPUT_DIR.resolve()}")
print(f"Saving embeddings to: {OUTPUT_DIR.resolve()}")

# Find all video files
video_files = []
for ext in VIDEO_EXTENSIONS:
    video_files.extend(INPUT_DIR.glob(f"*{ext}"))
print(f"Found {len(video_files)} videos.")

# Get list of files ALREADY in the output folder to skip them
# (the glob pattern must match the suffix used for output_filename below)
existing_embeddings = {f.name for f in OUTPUT_DIR.glob('*_emb-visual2048.npy')}
print(f"Found {len(existing_embeddings)} existing ResNet embeddings.")

for video_path in tqdm(video_files, desc="Processing Videos (ResNet)"):
    output_filename = f"{video_path.stem}_emb-visual2048.npy"

    # Skip if already processed
    if output_filename in existing_embeddings:
        continue
    
    output_path = OUTPUT_DIR / output_filename
    
    try:
        print(f"Processing {video_path.name}...")
        mean_embedding = extract_resnet_embeddings(
            video_path=video_path,
            model=model,
            preprocess=preprocess,
            device=device,
            frame_sample_rate=FRAME_SAMPLE_RATE,
            batch_size=BATCH_SIZE
        )
        np.save(output_path, mean_embedding)

    except Exception as e:
        print(f"\n[ERROR] Failed to process {video_path.name}: {e}")

print("\n--- Batch processing complete. ---")

Reading videos from: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media
Saving embeddings to: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\embeddings_out\video2048
Found 10 videos.
Found 0 existing ResNet embeddings.


Processing Videos (ResNet):   0%|          | 0/10 [00:00<?, ?it/s]

Processing trend10vid5.mp4...


Processing Videos (ResNet):  10%|█         | 1/10 [00:05<00:52,  5.80s/it]

Processing trend10vid6.mp4...


Processing Videos (ResNet):  20%|██        | 2/10 [00:11<00:46,  5.84s/it]

Processing trend1vid1.mp4...


Processing Videos (ResNet):  30%|███       | 3/10 [00:12<00:26,  3.73s/it]

Processing trend1vid2.mp4...


Processing Videos (ResNet):  40%|████      | 4/10 [00:24<00:41,  6.85s/it]

Processing trend1vid3.mp4...


Processing Videos (ResNet):  50%|█████     | 5/10 [00:26<00:25,  5.15s/it]

Processing trend1vid4.mp4...


Processing Videos (ResNet):  60%|██████    | 6/10 [00:27<00:15,  3.84s/it]

Processing trend1vid5.mp4...


Processing Videos (ResNet):  70%|███████   | 7/10 [00:29<00:08,  2.94s/it]

Processing trend3vid6.mp4...


Processing Videos (ResNet):  80%|████████  | 8/10 [00:29<00:04,  2.24s/it]

Processing trend3vid7.mp4...


Processing Videos (ResNet):  90%|█████████ | 9/10 [00:31<00:02,  2.15s/it]

Processing trend6vid7.mp4...


Processing Videos (ResNet): 100%|██████████| 10/10 [00:32<00:00,  3.26s/it]


--- Batch processing complete. ---
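
Skipping videos that were already processed only works when the suffix used to look up existing files is the same suffix used when saving. The sketch below illustrates that check in isolation; `pending_videos` is a hypothetical helper, and the default suffix is taken from the `output_filename` pattern in the cell above.

```python
from pathlib import Path

def pending_videos(video_paths, output_dir, suffix="_emb-visual2048.npy"):
    """Return the videos that do not yet have '<stem><suffix>' in output_dir."""
    output_dir = Path(output_dir)
    done = {p.name for p in output_dir.glob(f"*{suffix}")}
    return [Path(v) for v in video_paths if f"{Path(v).stem}{suffix}" not in done]
```

Deriving both the lookup and the expected filename from one `suffix` argument keeps the two from drifting apart.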





---
## **TEXT MODALITY**
**Goal**: Produce embeddings representing the text modality of a given set of videos.

In [None]:
OUTPUT_CSV = "video_text_outputs.csv"
WHISPER_MODEL = WhisperModel("base", device="cuda" if torch.cuda.is_available() else "cpu",
                    compute_type="int8_float16" if torch.cuda.is_available() else "int8")
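# Read the key from the environment instead of hardcoding it in the notebook.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")  # leave unset to test w/o consuming tokens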

In [43]:
def _list_local_videos(root_dir):
    exts = ('.mp4', '.mov', '.m4v', '.mkv', '.avi', '.webm')
    paths = []
    for ext in exts:
        paths.extend(glob.glob(os.path.join(root_dir, f"**/*{ext}"), recursive=True))
    return sorted(paths)

def _fetch_videos_from_folder(folder_path):
    if not folder_path or folder_path.strip() == "":
        raise ValueError("Folder path is empty.")

    if os.path.isdir(folder_path):
        return _list_local_videos(folder_path)

    raise ValueError(f"Not a valid local directory: {folder_path}")

**Automatic speech recognition (ASR) step:** transcribes each video's speech with faster-whisper (`base` model).

In [44]:
def extract_audio_with_whisper(video_path):
    try:
        segments, _ = WHISPER_MODEL.transcribe(video_path, beam_size=1)
        return " ".join(s.text for s in segments).strip()
    except Exception as e:
        print(f"ASR Error: {e}")
        return ""

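**Optical character recognition (OCR) step:** samples frames at about 1 fps, enhances each frame, and reads on-screen text with EasyOCR, keeping detections with confidence above 0.5.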

In [45]:
def is_valid_text(text):
    if not text or len(text.strip()) < 2:
        return False
    clean = re.sub(r'[^\w#@]', '', text)
    return len(clean) > 0

def preprocess_frame_for_ocr(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    denoised = cv2.GaussianBlur(enhanced, (3, 3), 0)
    kernel = np.array([[-1,-1,-1], [-1, 9,-1], [-1,-1,-1]])
    sharpened = cv2.filter2D(denoised, -1, kernel)
    return sharpened

def extract_ocr_from_video(video_path, sample_rate_fps=1):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: Cannot open video {video_path}")
        return {}

    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
    frame_interval = max(1, int(round(fps / max(0.1, sample_rate_fps))))

    reader = easyocr.Reader(['en'], gpu=torch.cuda.is_available())
    text_detections = defaultdict(lambda: {'count': 0, 'timestamps': [], 'positions': []})

    print("Processing video frames for OCR...")
    processed = 0

    for frame_idx in range(0, total_frames, frame_interval):
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
        ret, frame = cap.read()
        if not ret or frame is None:
            continue

        h, w = frame.shape[:2]
        max_w = 960
        if w > max_w:
            scale = max_w / float(w)
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)

        timestamp = frame_idx / fps

        try:
            frame_pp = preprocess_frame_for_ocr(frame)
            results = reader.readtext(frame_pp, detail=1, paragraph=False)
            for (bbox, text, confidence) in results:
                if confidence > 0.5 and is_valid_text(text):
                    xs = [p[0] for p in bbox]
                    ys = [p[1] for p in bbox]
                    x_left = float(min(xs))
                    y_top = float(min(ys))

                    text_detections[text]['count'] += 1
                    if len(text_detections[text]['timestamps']) < 5:
                        text_detections[text]['timestamps'].append(round(timestamp, 2))
                    if len(text_detections[text]['positions']) < 5:
                        text_detections[text]['positions'].append((round(y_top, 2), round(x_left, 2)))

        except Exception as e:
            print(f"OCR error at frame {frame_idx}: {e}")

        processed += 1
        if processed % 10 == 0:
            print(f"Processed {processed} sampled frames...")

    cap.release()
    print(f"OCR processing complete. Found {len(text_detections)} unique text phrases.")
    return dict(text_detections)
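
The frame interval in `extract_ocr_from_video` is derived from the video's native frame rate and the requested OCR sampling rate. The sketch below reproduces only that arithmetic so the sampled frame indices and their timestamps can be inspected without opening a video; `ocr_sample_plan` is a hypothetical name.

```python
def ocr_sample_plan(fps, total_frames, sample_rate_fps=1.0):
    """Return [(frame_index, timestamp_seconds), ...] for the frames OCR would read."""
    fps = fps or 30.0  # same fallback as the OCR function when the FPS is unknown
    frame_interval = max(1, int(round(fps / max(0.1, sample_rate_fps))))
    return [(idx, round(idx / fps, 2)) for idx in range(0, total_frames, frame_interval)]
```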

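**OCR cleaning step:** corrects obvious OCR errors phrase by phrase with an OpenAI chat model (`gpt-4o-mini` by default), falling back to the original phrase on failure.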

In [46]:
def clean_ocr_with_openai(ocr_phrases, api_key, model="gpt-4o-mini"):
    phrases = list(ocr_phrases.keys())
    if not phrases:
        return []

    print(f"Cleaning {len(phrases)} OCR phrases with {model}...")
    client = OpenAI(api_key=api_key)

    system_prompt = """You are an OCR error correction assistant. Fix only obvious OCR mistakes.

Common OCR errors:
- Character confusion: 'v' → 'y', 'rn' → 'm', '0' → 'O', 'i' → 'l', 'vv' → 'w', '@' → 'a', '@' → 'o'
- Missing spaces: 'helloworld' → 'hello world'
- Extra spaces: 'hel lo' → 'hello'

Rules:
1. ONLY fix clear OCR errors - do not rephrase or change meaning
2. Preserve hashtags (#) exactly
3. Keep original capitalization only for proper nouns, otherwise make everything lowercase.
4. Output ONLY the corrected text (no quotes, explanations, or extra words)
5. If a word has the letter 'v' in it and it looks misspelled, try swapping the 'v' with a 'y' to see if it makes more sense, and vice versa.
6. If a word other than "I" has the letter 'i' in it and it looks misspelled, try swapping the 'i' with a 'l' to see if it makes more sense, and vice versa.
7. Unless an acronym makes sense in the context, make it lowercase."""

    cleaned = []
    total_cost = 0.0

    for i, phrase in enumerate(phrases):
        if i % 10 == 0 and i > 0:
            print(f"  Cleaned {i}/{len(phrases)} (Cost: ${total_cost:.4f})")

        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": f"Fix OCR errors: {phrase}"}
                ],
                temperature=0,
                max_tokens=100
            )

            result = response.choices[0].message.content.strip()

            # Track cost
            if hasattr(response, 'usage') and response.usage:
                input_tok = response.usage.prompt_tokens or 0
                output_tok = response.usage.completion_tokens or 0
                total_cost += (input_tok * 0.15 + output_tok * 0.60) / 1_000_000

            # Remove quotes if added
            if (result.startswith('"') and result.endswith('"')) or \
               (result.startswith("'") and result.endswith("'")):
                result = result[1:-1]

            # Fallback if result is empty or way too different
            if not result or len(result) > len(phrase) * 3:
                result = phrase

            cleaned.append(result.strip())

        except Exception as e:
            print(f"  Error cleaning '{phrase}': {e}")
            cleaned.append(phrase)

    print(f"Cleaning complete! Total cost: ${total_cost:.4f}")
    return cleaned
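
The cleaning function guards against unusable model output by stripping wrapping quotes and falling back to the original phrase when the reply is empty or more than three times longer. A minimal standalone sketch of that guard follows; `guard_llm_correction` is a hypothetical helper, and unlike the function above it requires at least two characters before stripping quotes.

```python
def guard_llm_correction(result, original, max_growth=3):
    """Return a usable correction, or `original` when the model output looks wrong."""
    result = (result or "").strip()
    # Drop one pair of matching wrapping quotes, if present.
    if len(result) >= 2 and result[0] == result[-1] and result[0] in ('"', "'"):
        result = result[1:-1].strip()
    # Fall back when the reply is empty or suspiciously long.
    if not result or len(result) > len(original) * max_growth:
        return original
    return result
```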

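**Deduplication and merge step:** merges cleaned phrases whose normalized similarity exceeds 0.85 or where one phrase contains the other.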

In [47]:
def normalize_text(text):
    if not text:
        return ""
    text = text.lower()
    text = re.sub(r'[^\w\s#@]', '', text)  # Keep hashtags and mentions
    return re.sub(r'\s+', ' ', text.strip())

def text_similarity(s1, s2):
    return SequenceMatcher(None, normalize_text(s1), normalize_text(s2)).ratio()

def smart_deduplicate_and_merge(ocr_data, cleaned_phrases):
    # Create phrase objects with metadata
    phrases = []
    for orig, clean in zip(ocr_data.keys(), cleaned_phrases):
        data = ocr_data[orig]
        phrases.append({
            'original': orig,
            'clean': clean,
            'normalized': normalize_text(clean),
            'count': data['count'],
            'timestamps': data.get('timestamps', []),
            'positions': data.get('positions', [])
        })

    # Sort by frequency (most common first) and timestamp (earliest first)
    phrases.sort(key=lambda x: (-x['count'], min(x['timestamps']) if x['timestamps'] else float('inf')))

    merged = []
    skip_indices = set()

    for i, phrase1 in enumerate(phrases):
        if i in skip_indices:
            continue

        # Start with this phrase as the canonical version
        canonical = phrase1.copy()

        # Check against remaining phrases
        for j in range(i + 1, len(phrases)):
            if j in skip_indices:
                continue

            phrase2 = phrases[j]

            # Calculate similarity
            similarity = text_similarity(phrase1['clean'], phrase2['clean'])

            # Merge if very similar (likely same text with OCR errors)
            if similarity > 0.85:
                # Choose the better version (longer, more common, or earlier)
                if len(phrase2['clean']) > len(canonical['clean']):
                    canonical['clean'] = phrase2['clean']
                    canonical['normalized'] = phrase2['normalized']

                # Merge metadata
                canonical['count'] += phrase2['count']
                canonical['timestamps'].extend(phrase2['timestamps'])
                canonical['positions'].extend(phrase2['positions'])

                skip_indices.add(j)

            # Check if one is substring of another
            elif canonical['normalized'] in phrase2['normalized']:
                # phrase1 is substring of phrase2, keep phrase2's text
                canonical['clean'] = phrase2['clean']
                canonical['normalized'] = phrase2['normalized']
                canonical['count'] += phrase2['count']
                canonical['timestamps'].extend(phrase2['timestamps'])
                canonical['positions'].extend(phrase2['positions'])
                skip_indices.add(j)

            elif phrase2['normalized'] in canonical['normalized']:
                # phrase2 is substring of phrase1, keep canonical and merge counts
                canonical['count'] += phrase2['count']
                canonical['timestamps'].extend(phrase2['timestamps'])
                canonical['positions'].extend(phrase2['positions'])
                skip_indices.add(j)

        # Clean up merged data
        canonical['timestamps'] = sorted(set(canonical['timestamps']))[:10]
        canonical['positions'] = list(set(map(tuple, canonical['positions'])))[:10]

        merged.append(canonical)

    return merged
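
The merge decision combines a `SequenceMatcher` ratio over normalized text with a substring test. The sketch below isolates that decision for a single pair of phrases. The names `normalize`, `similarity`, and `should_merge` are hypothetical, and the empty-string guard is an addition for illustration that the function above does not have.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, drop punctuation except # and @, and collapse whitespace."""
    text = re.sub(r"[^\w\s#@]", "", (text or "").lower())
    return re.sub(r"\s+", " ", text).strip()

def similarity(a, b):
    """Similarity ratio in [0, 1] between the normalized forms of a and b."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def should_merge(a, b, threshold=0.85):
    """True when two phrases are near-duplicates or one contains the other."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:  # added guard: never merge on an empty phrase
        return False
    return similarity(a, b) > threshold or na in nb or nb in na
```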

**Final ASR + OCR string assembly and export to CSV step:**

In [48]:
def assemble_final_text(merged_phrases, api_key, model="gpt-4o-mini"):
    if not merged_phrases:
        return ""

    # Sort by timestamp (chronological order)
    sorted_phrases = sorted(merged_phrases,
                           key=lambda x: min(x['timestamps']) if x['timestamps'] else float('inf'))

    # Simple fallback assembly
    simple_assembly = " ".join(p['clean'] for p in sorted_phrases)

    # Try LLM assembly for better coherence
    try:
        client = OpenAI(api_key=api_key)

        phrases_list = [p['clean'] for p in sorted_phrases]

        system_prompt = """You are a Gen-Z person familiar with Tiktok trends assembling OCR text fragments into one coherent sentence or phrase.

Rules:
1. Arrange the fragments by timestamp; if it doesn't make sense, then you can rearrange it minimally.
2. Remove duplicate or very similar fragments
3. Add minimal punctuation ONLY where clearly needed
4. Do NOT add new words or rephrase or change existing words
5. Preserve all hashtags
6. Output ONE clean line of text"""

        user_prompt = f"""Assemble these OCR fragments in order into one coherent line:

{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(phrases_list))}

Assembled text:"""

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0,
            max_tokens=200
        )

        result = response.choices[0].message.content.strip()

        # Remove quotes if present
        if (result.startswith('"') and result.endswith('"')) or \
           (result.startswith("'") and result.endswith("'")):
            result = result[1:-1]

        # Validate result isn't too different from source material
        if result and len(result) > 10 and len(result) < len(simple_assembly) * 2:
            print("Using LLM-assembled text")
            return result
        else:
            print("LLM assembly invalid, using simple assembly")
            return simple_assembly

    except Exception as e:
        print(f"LLM assembly failed ({e}), using simple assembly")
        return simple_assembly
    
def create_output_csv(asr_text, merged_phrases, final_text, output_csv):
    rows = []

    # Add ASR
    if asr_text:
        rows.append({
            "source": "ASR",
            "text": asr_text,
            "count": 1,
            "timestamps": "[]",
            "original_text": ""
        })

    # Add individual OCR phrases
    for phrase in merged_phrases:
        rows.append({
            "source": "OCR_PHRASE",
            "text": phrase['clean'],
            "count": phrase['count'],
            "timestamps": json.dumps(phrase['timestamps'][:5]),
            "original_text": phrase['original']
        })

    # Add final assembled text
    if final_text:
        rows.append({
            "source": "OCR_FINAL",
            "text": final_text,
            "count": sum(p['count'] for p in merged_phrases),
            "timestamps": "[]",
            "original_text": ""
        })

    df = pd.DataFrame(rows)
    df.to_csv(output_csv, index=False)
    return df

In [49]:
def _final_cleaned_phrase_list(merged_phrases):
    # Order by earliest timestamp first
    ordered = sorted(
        merged_phrases,
        key=lambda x: min(x['timestamps']) if x.get('timestamps') else float('inf')
    )

    seen = set()
    out = []
    for p in ordered:
        key = normalize_text(p.get('clean', ''))
        if key and key not in seen:
            out.append(p['clean'])
            seen.add(key)
    return out

def _process_one_video(video_path):
    print(f"\n==============================")
    print(f"Processing video: {video_path}")
    print(f"==============================\n")

    # === STEP 1: ASR ===
    print("=== STEP 1: Audio Transcription ===")
    asr_text = extract_audio_with_whisper(video_path)
    print(f"ASR Result: {asr_text[:200]}{'...' if len(asr_text) > 200 else ''}\n")

    # === STEP 2: OCR ===
    print("=== STEP 2: OCR Extraction ===")
    ocr_data = extract_ocr_from_video(video_path, sample_rate_fps=1)
    print(f"Extracted {len(ocr_data)} unique text phrases\n")

    # === STEP 3: Clean OCR ===
    print("=== STEP 3: OCR Cleaning ===")
    cleaned_phrases = clean_ocr_with_openai(ocr_data, OPENAI_API_KEY)

    print("\nOCR Corrections (sample):")
    for orig, clean in list(zip(ocr_data.keys(), cleaned_phrases))[:10]:
        if orig != clean:
            print(f"  ✓ '{orig}' → '{clean}'")
    print()

    # === STEP 4: Dedup & Merge ===
    print("=== STEP 4: Deduplication & Merging ===")
    merged_phrases = smart_deduplicate_and_merge(ocr_data, cleaned_phrases)
    print(f"Consolidated to {len(merged_phrases)} unique phrases\n")

    # === STEP 5: Final Assembly ===
    print("=== STEP 5: Final Text Assembly ===")
    final_text = assemble_final_text(merged_phrases, OPENAI_API_KEY)
    print(f"Final Text: {final_text}\n")

    return asr_text, merged_phrases, final_text

def main():
    print("Discovering videos...\n")

    MEDIA_FOLDER = r"./media"

    video_paths = _fetch_videos_from_folder(MEDIA_FOLDER)
    if not video_paths:
        print("No videos found. Please check the media folder")
        return

    print(f"Found {len(video_paths)} video(s).")
    for v in video_paths:
        print(" -", v)
    print()

    rows = []
    for vp in video_paths:
        asr_text, merged_phrases, final_text = _process_one_video(vp)
        phrases_list = _final_cleaned_phrase_list(merged_phrases)  # simple list of final cleaned phrases

        rows.append({
            "video": os.path.basename(vp),
            "asr": asr_text,
            "ocr_final": final_text,
            "cleaned_phrases": json.dumps(phrases_list, ensure_ascii=False)
        })

    df = pd.DataFrame(rows, columns=["video", "asr", "ocr_final", "cleaned_phrases"])
    df.to_csv(OUTPUT_CSV, index=False)
    print(f"\n✓ Batch results saved to: {OUTPUT_CSV}")
    print(f"Total videos processed: {len(df)}")
    return df

if __name__ == "__main__":
    result_df = main()

Discovering videos...

Found 10 video(s).
 - ./media\trend10vid5.mp4
 - ./media\trend10vid6.mp4
 - ./media\trend1vid1.mp4
 - ./media\trend1vid2.mp4
 - ./media\trend1vid3.mp4
 - ./media\trend1vid4.mp4
 - ./media\trend1vid5.mp4
 - ./media\trend3vid6.mp4
 - ./media\trend3vid7.mp4
 - ./media\trend6vid7.mp4


Processing video: ./media\trend10vid5.mp4

=== STEP 1: Audio Transcription ===


Using CPU. Note: This module is much faster with a GPU.


ASR Result: It's changed challenge  Holiday edition  Favorite stat  Favorite grade  Something unusual  Something unique  What are you studying?  I'm studying the same thing  You know you know  You know you know  ...

=== STEP 2: OCR Extraction ===
Processing video frames for OCR...
Processed 10 sampled frames...


KeyboardInterrupt: 

**Loading CSV and metadata JSON step:**

In [50]:
CSV_PATH = "video_text_outputs.csv"
JSON_DIR = "meta"
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
SAVE_DIR = "embeddings_out/text768"
DEVICE = device
os.makedirs(SAVE_DIR, exist_ok=True)

In [52]:
# loads the csv
df = pd.read_csv(CSV_PATH).fillna("")
df["video_base"] = df["video"].apply(lambda x: os.path.splitext(os.path.basename(str(x)))[0])
print(f"{len(df)} video entries found in CSV!")

# loads the json
json_map = {}
for fname in os.listdir(JSON_DIR):
    if fname.lower().endswith(".json"):
        json_map[os.path.splitext(fname)[0]] = os.path.join(JSON_DIR, fname)

descs, hashtags_texts = [], []
for base in df["video_base"]:
    path = json_map.get(base)
    if not path:
        descs.append("")
        hashtags_texts.append("")
        continue
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        vm = data.get("video_metadata", data)
        descs.append(vm.get("description", "") or "")
        hashtags = vm.get("hashtags", [])
        if isinstance(hashtags, list):
            hashtags_texts.append(" ".join(f"#{h}" for h in hashtags))
        else:
            hashtags_texts.append(str(hashtags))
    except Exception:
        descs.append("")
        hashtags_texts.append("")

df["description"] = descs
df["hashtags_text"] = hashtags_texts

10 video entries found in CSV!
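
The loader above accepts metadata either nested under a `video_metadata` key or at the top level, and turns a hashtag list into one `#`-prefixed string. The sketch below shows that parsing on an in-memory dictionary. `read_video_meta` is a hypothetical helper, the sample values in the usage note are made up, and stripping an existing leading `#` is an added assumption that the cell above does not make.

```python
def read_video_meta(data):
    """Return (description, hashtags_text) from one parsed metadata JSON object."""
    vm = data.get("video_metadata", data)  # fall back to the top level
    description = vm.get("description", "") or ""
    hashtags = vm.get("hashtags", [])
    if isinstance(hashtags, list):
        # Assumption: avoid '##tag' if a tag already carries its own '#'.
        hashtags_text = " ".join(f"#{str(h).lstrip('#')}" for h in hashtags)
    else:
        hashtags_text = str(hashtags)
    return description, hashtags_text
```

For instance, `read_video_meta({"video_metadata": {"description": "demo", "hashtags": ["fyp", "foryou"]}})` returns `("demo", "#fyp #foryou")`.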


**Model loading and text field encoding step:**

In [53]:
model = SentenceTransformer(MODEL_NAME, device=DEVICE)
weights = np.array([0.4, 0.1, 0.3, 0.2])  # order: ocr_final, hashtags_text, asr, description
weights = weights / weights.sum()

print(f"Loaded model: {MODEL_NAME}")

modalities = ["ocr_final", "hashtags_text", "asr", "description"]
embs = {}

for m in modalities:
    print(f"Encoding {m}...")
    texts = df[m].fillna("").astype(str).tolist()
    embs[m] = model.encode(texts, batch_size=8, convert_to_numpy=True,
                           show_progress_bar=True, normalize_embeddings=True)
    
# concatenate all text
df["concatenated_text"] = df.apply(
    lambda row: " ".join([
        str(row.get("ocr_final", "")),
        str(row.get("hashtags_text", "")),
        str(row.get("asr", "")),
        str(row.get("description", ""))
    ]).strip(),
    axis=1
)

print("\n===== CONCATENATED TEXTS =====")
for i, text in enumerate(df["concatenated_text"].tolist(), start=1):
    print(f"[{i}] {text}\n")

Loaded model: sentence-transformers/all-mpnet-base-v2
Encoding ocr_final...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding hashtags_text...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding asr...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding description...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]


===== CONCATENATED TEXTS =====
[1] holiday edition exchange gift challenge at a time one favorite snacks favorite drinks something unusual something you need favorite color something reminds me of you something special nasan na something special mo? #christmas #japan #exchangegiftchallenge #couplegoals #coupleexchangegift #fyp #foryou #christmasexchangegift #coupletok #giftideas #christmas #newyear #holidayseason It's changed challenge  Holiday edition  Favorite stat  Favorite grade  Something unusual  Something unique  What are you studying?  I'm studying the same thing  You know you know  You know you know  What are you studying?  I love the thing, remind me of you  I love the thing, remind me of you  I love the thing, remind me of you  When nobody is coming to him  When nobody's coming to him  Your daddy lives well  Something special  I love the thing  When I'm working 儿  You know he's not the same as Exchange gift with my Lovelove ♡ #exchangegiftchallenge #couplegoals #coupleexcha

**Fuse modalities and save:**

In [54]:
combined_embs = (
    weights[0] * embs["ocr_final"] +
    weights[1] * embs["hashtags_text"] +
    weights[2] * embs["asr"] +
    weights[3] * embs["description"]
)

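# enumerate() yields positional indices, so each row lines up with combined_embs
# even if df does not have a default 0..n-1 index
for pos, video_name in enumerate(tqdm(df["video_base"], total=len(df), desc="Saving embeddings")):
    out_path = os.path.join(SAVE_DIR, f"{video_name}_emb-text768.npy")
    np.save(out_path, combined_embs[pos])
    print(f"✓ Saved: {out_path}")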

print(f"\nAll text embeddings saved in: {SAVE_DIR}")

Saving embeddings: 100%|██████████| 10/10 [00:00<00:00, 935.35it/s]

✓ Saved: embeddings_out/text768\trend10vid5_emb-text768.npy
✓ Saved: embeddings_out/text768\trend10vid6_emb-text768.npy
✓ Saved: embeddings_out/text768\trend1vid1_emb-text768.npy
✓ Saved: embeddings_out/text768\trend1vid2_emb-text768.npy
✓ Saved: embeddings_out/text768\trend1vid3_emb-text768.npy
✓ Saved: embeddings_out/text768\trend1vid4_emb-text768.npy
✓ Saved: embeddings_out/text768\trend1vid5_emb-text768.npy
✓ Saved: embeddings_out/text768\trend3vid6_emb-text768.npy
✓ Saved: embeddings_out/text768\trend3vid7_emb-text768.npy
✓ Saved: embeddings_out/text768\trend6vid7_emb-text768.npy

All text embeddings saved in: embeddings_out/text768
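The fused text vector is a weighted sum of unit-length vectors, so its own norm is generally below 1. This should not matter for retrieval, because cosine similarity ignores vector length. A minimal sketch with random stand-in vectors (not the actual embeddings; the `unit` and `cos` helpers are hypothetical names used only here):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Scale each row of v to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cos(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for the four per-field embeddings of two videos (2 x 768 each).
fields = [unit(rng.normal(size=(2, 768))) for _ in range(4)]
weights = np.array([0.4, 0.1, 0.3, 0.2])  # same order as in the cell above

# Weighted sum of unit vectors, as in the fusion cell.
fused = sum(w * f for w, f in zip(weights, fields))
fused_norms = np.linalg.norm(fused, axis=1)

# Cosine similarity is the same before and after re-normalizing the fused vectors.
sim_raw = cos(fused[0], fused[1])
sim_renormed = cos(unit(fused)[0], unit(fused)[1])
print(fused_norms, sim_raw, sim_renormed)
```

If a later step ever switches from cosine similarity to a plain dot product, the fused vectors would need to be re-normalized first.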





---
## **RETRIEVING SIMILAR VIDEOS**
**Goal**: Produce a list of most similar videos based on a weighted combination of modality-specific cosine similarity scores.

**Embedding loading step:** builds a dict that maps each video name to its per-modality embedding vectors, so retrieving the embeddings of any video is a single lookup. A modality with no saved file is stored as `None`.
$$
video\_name \;\rightarrow\; \{ audio,\; visual,\; text \}
$$

In [60]:
def load_all_embeddings(base_dir="embeddings_out"):
    base_dir = Path(base_dir)

    folders = {
        "audio":  base_dir / "audio2048",
        "visual": base_dir / "video2048",
        "text":   base_dir / "text768",
    }

    suffix_map = {
        "audio":  "audio2048",
        "visual": "visual2048",
        "text":   "text768",
    }

    modality_files = {} # collect keys per modality
    for modality, folder in folders.items():
        files = list(folder.glob(f"*emb-{suffix_map[modality]}.npy"))
        modality_files[modality] = {f.stem.split("_emb-")[0]: f for f in files}

    all_video_ids = set()
    for d in modality_files.values():
        all_video_ids.update(d.keys())

    embeddings = {}
    missing = []

    for vid in all_video_ids:
        embeddings[vid] = {}
        for modality in ["audio", "visual", "text"]:
            file = modality_files[modality].get(vid, None)
            if file is None:
                missing.append((vid, modality))
                embeddings[vid][modality] = None
            else:
                embeddings[vid][modality] = np.load(str(file))

    if missing:
        print("WARNING: Missing modality embeddings detected:") # just to be safe
        for vid, modality in missing:
            print(f"  - {vid} missing {modality}")

    return embeddings

embeddings = load_all_embeddings() # get embeddings with embeddings["video_name"]

# to check
for video, emb_vec in embeddings.items():
    print(video, emb_vec)


trend1vid2 {'audio': array([0.        , 0.        , 0.        , ..., 0.3142212 , 0.24341571,
       0.        ], shape=(2048,), dtype=float32), 'visual': array([0.03805558, 0.05196837, 0.02482746, ..., 0.06943745, 0.01536561,
       0.00614327], shape=(2048,), dtype=float32), 'text': array([-1.59083478e-02,  5.00531867e-02,  9.72347478e-03, -1.22285879e-02,
        1.34470744e-02,  4.25800551e-02, -3.28271045e-02,  8.49587172e-03,
        2.87535451e-03,  5.13269072e-03,  2.27625009e-02,  1.58448140e-03,
        1.52954398e-02,  3.08346902e-02,  2.58764983e-03,  1.57428544e-02,
       -1.93314908e-02,  2.53984896e-02, -1.67302251e-02,  7.31674973e-03,
        1.21933154e-03,  1.01124798e-02, -5.63533604e-03,  9.80435442e-03,
       -5.74925121e-02, -8.76363218e-03,  3.51267155e-02,  2.97832509e-02,
       -1.46522796e-02, -6.00763654e-02, -1.85377326e-02, -1.78352914e-02,
       -3.40070769e-02, -2.55339764e-02,  2.52843963e-06,  1.47461123e-02,
       -2.80552280e-02,  8.89153648e-03,

>**NOTE: Input the query video here :))**

If testing different queries with the same set of videos, just <u>run the notebook starting at this cell</u> to skip the preprocessing and loading of embeddings.

In [61]:
# please type the EXACT base name of the query video (no file extension)
QUERY = "trend1vid1"

**Cosine similarity computation step:** computes one cosine similarity score per modality between the query video and every other video, so each candidate video is represented by a vector of 3 similarity scores in the order (audio, visual, text).

In [62]:
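def cosine_similarity(vec1, vec2):
    # a missing embedding (None) yields NaN so the caller can detect it
    if vec1 is None or vec2 is None:
        return np.nan

    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    # guard against all-zero vectors, which would otherwise divide by zero
    return float(np.dot(vec1, vec2) / denom) if denom > 0 else np.nan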

In [63]:
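def compute_modality_similarities(query_video_name: str, embeddings_dir: Path):
    # NOTE: embeddings_dir is currently unused; this function reads the global
    # `embeddings` dict built by load_all_embeddings() above.
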
    if query_video_name not in embeddings:
        raise ValueError(f"Query video '{query_video_name}' not found in embeddings.")
    
    query_emb = embeddings[query_video_name]
    
    similarity_dict = {}
    
    for video_name, video_emb in embeddings.items(): 
        if video_name == query_video_name: # skips self
            continue 
        
        sims = []
        missing_modalities = []
        
        for modality in ["audio", "visual", "text"]:
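            # load_all_embeddings() stores missing modalities as None, so test the value, not the key
            if video_emb.get(modality) is None or query_emb.get(modality) is None: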
                missing_modalities.append(modality)
                sims.append(np.nan)  # for missing embeddings
            else:
                sims.append(cosine_similarity(video_emb[modality], query_emb[modality]))
        
        if missing_modalities:
            print(f"[WARNING] {video_name} missing embeddings for: {', '.join(missing_modalities)}")
        
        similarity_dict[video_name] = np.array(sims)
    
    return similarity_dict

similarities = compute_modality_similarities(QUERY, "embeddings_out") # get similarity vector with similarities["video_name"]

# to check
for video, sim_vec in similarities.items():
    print(video, sim_vec)

trend1vid2 [0.61004132 0.45218876 0.41709727]
trend1vid4 [0.71428573 0.56592977 0.34414892]
trend6vid7 [0.65801275 0.36166543 0.37380399]
trend1vid3 [0.85486114 0.62778491 0.73276067]
trend3vid6 [0.59579337 0.39125368 0.27075447]
trend3vid7 [0.70232642 0.46148127 0.23307862]
trend10vid5 [0.66130465 0.59437317 0.23907252]
trend10vid6 [0.59007102 0.52113569 0.16222057]
trend1vid5 [0.65241677 0.5497331  0.4879465 ]
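For larger video sets, the per-pair loop can be replaced by one matrix operation per modality. The following is a minimal sketch on toy 3-dimensional vectors, not the notebook's embeddings; `cosine_to_query` is a hypothetical helper name, and it assumes no embedding is missing or all-zero.

```python
import numpy as np

def cosine_to_query(query_vec, candidate_matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    return c @ q

query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [2.0, 0.0, 0.0],   # same direction as the query
    [0.0, 3.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])

sims = cosine_to_query(query, candidates)
print(sims)
```

The three toy rows give similarities of 1, 0, and about 0.7071, which matches the geometric meaning of cosine similarity.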


>**NOTE: Input the weights here :))**

If testing different weights with the same query video and set of videos, just <u>run the notebook starting at this cell</u> to skip computing the modality-specific cosine similarity scores.

In [None]:
# e.g., "visual", "audio", "visual_audio", "audio_text", etc.
# update this whenever you change the weights so the annotations are labeled correctly
CONDITION_NAME = "visual_audio_text"

WEIGHT_AUDIO = 1/3
WEIGHT_VIDEO = 1/3
WEIGHT_TEXT = 1/3

**Weighted-sum fusion step:** uses weighted linear combination to form a final similarity score for each video and a query video, where the weights can be modified according to the different test cases.
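Written out, with $s_a, s_v, s_t$ as the audio, visual, and text cosine similarities of a candidate video and $w_a, w_v, w_t$ as the corresponding weights (symbols introduced here only for illustration), the normalization and weighted sum in `weighted_sum_fusion` amount to:

$$
score = \frac{w_a\, s_a + w_v\, s_v + w_t\, s_t}{w_a + w_v + w_t}
$$

For example, with equal weights and the similarity vector printed earlier for trend1vid3:

$$
\frac{0.8549 + 0.6278 + 0.7328}{3} \approx 0.7385
$$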

In [65]:
def weighted_sum_fusion(similarity_dict, weight_audio, weight_video, weight_text):

    weights = np.array([weight_audio, weight_video, weight_text])
    weights = weights / weights.sum() # normalize so the weights sum to 1, whatever was entered above
    final_weighted_dict = {}
    
    for video, sim_vec in similarity_dict.items():
        if len(sim_vec) != 3:
            raise ValueError(f"Expected 3 modalities in similarity vector for {video}, got {len(sim_vec)}")
        
        sim_audio, sim_video, sim_text = sim_vec
        weighted_score = (sim_audio*weights[0] + sim_video*weights[1] + sim_text*weights[2])
        final_weighted_dict[video] = float(weighted_score)

    return final_weighted_dict

final_scores = weighted_sum_fusion(similarities, WEIGHT_AUDIO, WEIGHT_VIDEO, WEIGHT_TEXT)

# to check
for video, score in final_scores.items():
    print(video, score)

trend1vid2 0.49310911614692154
trend1vid4 0.5414548084420647
trend6vid7 0.46449405416873063
trend1vid3 0.7384689052613296
trend3vid6 0.419267172112229
trend3vid7 0.4656287705014119
trend10vid5 0.4982501146043166
trend10vid6 0.4244757616437078
trend1vid5 0.5633654552190861
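If any modality similarity is NaN (for instance, a missing embedding), the weighted sum above also becomes NaN, and NaN scores do not compare meaningfully when sorting. One possible way to handle this, shown only as a sketch and not as the method used in the study, is to renormalize the weights over the modalities that are available. `fuse_available` is a hypothetical helper name.

```python
import numpy as np

def fuse_available(sim_vec, weights):
    """Weighted average over the modalities whose similarity is not NaN."""
    sim_vec = np.asarray(sim_vec, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(sim_vec)
    # Nothing usable: no valid modality, or all valid modalities have zero weight.
    if not valid.any() or weights[valid].sum() == 0:
        return float("nan")
    w = weights[valid] / weights[valid].sum()
    return float(np.dot(sim_vec[valid], w))

equal = [1 / 3, 1 / 3, 1 / 3]
full = fuse_available([0.6, 0.4, 0.2], equal)        # all three modalities present
partial = fuse_available([0.6, np.nan, 0.2], equal)  # visual similarity missing
print(full, partial)
```

Note that this changes what a score means for videos with missing modalities, so such videos may need to be flagged rather than ranked alongside complete ones.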


**Ranking step:** uses the final scores from weighted sum fusion to rank all videos by their similarity score with the query video, printed in descending order.

***similar_videos_output*** is the final output which will be fed into the annotation generation section.

In [None]:
def rank_by_score(final_weighted_dict, top_k=None):
    ranked_videos = sorted(final_weighted_dict.items(), key=lambda x: x[1], reverse=True)
    
    if top_k is not None:
        ranked_videos = ranked_videos[:top_k]
    
    return ranked_videos

k = 5
most_similar_videos = rank_by_score(final_scores, top_k = k)

print(f"Top {k} most similar videos to {QUERY}")
for video, score in most_similar_videos:
    print(f"{video}: {score:.4f}")

similar_videos_output = [QUERY] + [video for video, _ in most_similar_videos]

print("\nGemini output array format:")
print(similar_videos_output)

Top 5 most similar videos to trend1vid1
trend1vid3: 0.7385
trend1vid5: 0.5634
trend1vid4: 0.5415
trend10vid5: 0.4983
trend1vid2: 0.4931

Gemini output array format:
['trend1vid1', 'trend1vid3', 'trend1vid5', 'trend1vid4', 'trend10vid5', 'trend1vid2']


---
---
## **GEMINI ANNOTATION GENERATOR**
**Goal**: Generate an annotation for the query video, using the ranked list from the pipeline above as context.


In [75]:
try:
    # getpass must be imported inside the try for the ImportError fallback to apply
    import getpass
    GEMINI_API_KEY = getpass.getpass("Enter your Gemini API Key: ")
except ImportError:
    GEMINI_API_KEY = "PASTE_YOUR_API_KEY_HERE"
    
GENAI_MODEL_NAME = "gemini-2.5-pro"

QUERY_VIDEO_PATH = rf"media\{QUERY}.mp4"
BASE_MEDIA_PATH = r"media" 

CONTEXT_VIDEO_TUPLES = most_similar_videos

TOP_K = 5

JSON_OUTPUT_PATH = f"{QUERY}_{CONDITION_NAME}.json"
CSV_OUTPUT_PATH = f"{QUERY}_{CONDITION_NAME}.csv"

def build_paths(similar_list: list, base_path: str, k: int) -> list:
    video_names = [video_name for video_name, _ in similar_list]
    top_k_names = video_names[:k]
    return [rf"{base_path}\{name}.mp4" for name in top_k_names]

if CONDITION_NAME == "baseline":
    CONTEXT_VIDEO_PATHS = []
else:
    CONTEXT_VIDEO_PATHS = build_paths(CONTEXT_VIDEO_TUPLES, BASE_MEDIA_PATH, TOP_K)

print(f"--- Preparing Annotation for: {QUERY} ---")
print(f"Condition: {CONDITION_NAME}")
print(f"Context Videos: {CONTEXT_VIDEO_PATHS}")

--- Preparing Annotation for: trend1vid1 ---
Condition: visual_audio_text
Context Videos: ['media\\trend1vid3.mp4', 'media\\trend1vid5.mp4', 'media\\trend1vid4.mp4', 'media\\trend10vid5.mp4', 'media\\trend1vid2.mp4']
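The paths above use a hard-coded backslash, which only resolves correctly on Windows. A minimal sketch of a portable variant using `pathlib` follows; `build_paths_portable` is a hypothetical name, and the scores in `ranked` are copied from the ranking output for illustration.

```python
from pathlib import Path

def build_paths_portable(similar_list, base_path, k):
    """Return paths to the top-k context videos using the OS path separator."""
    base = Path(base_path)
    return [str(base / f"{name}.mp4") for name, _ in similar_list[:k]]

ranked = [("trend1vid3", 0.7385), ("trend1vid5", 0.5634), ("trend1vid4", 0.5415)]
paths = build_paths_portable(ranked, "media", k=2)
print(paths)
```

On Windows this produces the same strings as the cell above, while on Linux or macOS the separator becomes a forward slash.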


In [76]:
BASELINE_SYSTEM = "You are an assistant tasked with generating a brief summary of a short video. Use only the information available in the video. Do not rely on any external knowledge or assumptions. Focus on describing what is happening in the video concisely."
CONTEXT_AWARE_SYSTEM = "You are an assistant tasked with generating a summary of a short video. You are provided with the main video and a few additional videos that are semantically related. Use all available information to generate a summary that best describes what is happening in the main video. Focus on enhancing your understanding using the related videos, but ensure the summary reflects the main video."
BASELINE_USER = "Please generate a 2–3 sentence summary of the following video based solely on its content."
CONTEXT_AWARE_USER = "Please summarize the main video using all the information provided. The first video is the main one, and the others are related videos that may provide helpful context. Your summary should describe what is happening in the main video in 2–3 sentences."

_uploaded_cache: Dict[str, Any] = {}
def _upload_video(path: str):
    global _uploaded_cache
    full = str(Path(path).resolve())
    if full not in _uploaded_cache:
        print(f"Uploading: {full}")
        try:
            file_obj = genai.upload_file(path=full)
            print("Uploaded, waiting for processing...")
            while True:
                file_obj = genai.get_file(file_obj.name)
                if file_obj.state.name == "ACTIVE":
                    print(f"File is ACTIVE: {file_obj.name}"); break
                elif file_obj.state.name == "FAILED":
                    raise RuntimeError(f"File {file_obj.name} failed to process.")
                time.sleep(2)
            _uploaded_cache[full] = file_obj
        except Exception as e:
            print(f"Error uploading {path}: {e}"); return None
    return _uploaded_cache.get(full)

def _make_model(system_instruction: str):
    return genai.GenerativeModel(
        model_name=GENAI_MODEL_NAME,
        system_instruction=system_instruction
    )
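The polling loop in `_upload_video` waits indefinitely if a file never leaves the processing state. A minimal sketch of a bounded wait is shown below. It is deliberately independent of the Gemini client: `wait_until_active` and its `get_state` callable are hypothetical, and `get_state` is assumed to return the state name as a string.

```python
import time

def wait_until_active(get_state, timeout_s=300.0, poll_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "ACTIVE"; fail on "FAILED" or timeout."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("file failed to process")
        if clock() >= deadline:
            raise TimeoutError(f"file still {state!r} after {timeout_s} s")
        sleep(poll_s)

# Demo with a fake state source standing in for the remote file status.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
result = wait_until_active(lambda: next(states), timeout_s=5, poll_s=0)
print(result)
```

In the notebook, `get_state` could wrap the existing status lookup, so a stuck upload surfaces as an error instead of hanging the cell.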

In [77]:
genai.configure(api_key=GEMINI_API_KEY)
print("--- Starting Annotation ---")

_uploaded_cache = {} 
annotation_text = ""
final_record = {}

query_file = _upload_video(QUERY_VIDEO_PATH)

if query_file is None:
    print(f"Aborting: Failed to upload main query video {QUERY_VIDEO_PATH}")
else:
    if CONDITION_NAME == "baseline":
        print("Running BASELINE annotation...")
        model = _make_model(BASELINE_SYSTEM)
        try:
            response = model.generate_content([query_file, BASELINE_USER])
            annotation_text = response.text.strip()
        except Exception as e:
            print(f"Error in baseline generation: {e}"); annotation_text = f"ERROR: {e}"
        
        final_record = {
            "query_id": QUERY, "condition_name": CONDITION_NAME,
            "context_video_paths": [], "annotation_text": annotation_text,
        }

    else:
        print(f"Running CONTEXT-AWARE annotation for: {CONDITION_NAME}...")
        model = _make_model(CONTEXT_AWARE_SYSTEM)
        
        ctx_files = []
        for p in CONTEXT_VIDEO_PATHS:
            f = _upload_video(p)
            if f: ctx_files.append(f)
        
        contents = [query_file] + ctx_files + [CONTEXT_AWARE_USER]
        
        try:
            response = model.generate_content(contents)
            annotation_text = response.text.strip()
        except Exception as e:
            print(f"Error in context generation: {e}"); annotation_text = f"ERROR: {e}"
        
        final_record = {
            "query_id": QUERY, "condition_name": CONDITION_NAME,
            "context_video_paths": CONTEXT_VIDEO_PATHS, "annotation_text": annotation_text,
        }

    print("\n--- Annotation Complete ---")
    print(f"Result: {annotation_text}")
    
    if final_record:
        pd.DataFrame([final_record]).to_csv(CSV_OUTPUT_PATH, index=False)
        with open(JSON_OUTPUT_PATH, "w", encoding="utf-8") as f:
            json.dump(final_record, f, indent=2, ensure_ascii=False)
        print(f"Saved results to:\n  {CSV_OUTPUT_PATH}\n  {JSON_OUTPUT_PATH}")
    else:
        print("No result to save.")

--- Starting Annotation ---
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media\trend1vid1.mp4
Uploaded, waiting for processing...
File is ACTIVE: files/l2m2ellcb6wh
Running CONTEXT-AWARE annotation for: visual_audio_text...
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media\trend1vid3.mp4
Uploaded, waiting for processing...
File is ACTIVE: files/n51uyxzwucl4
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media\trend1vid5.mp4
Uploaded, waiting for processing...
File is ACTIVE: files/e6cqoai9737a
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media\trend1vid4.mp4
Uploaded, waiting for processing...
File is ACTIVE: files/lalppqihbyqa
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-v