<img src="../DLSU-ALTDSI-logo.png" width="100%" style="margin-bottom:10px; margin-top:0px;"/>

**This notebook contains the context-aware video retrieval pipeline used in the study:**

## *Comparing Modality Representation Schemes in Video Retrieval for More Context-Aware Auto-Annotation of Trending Short-Form Videos*

**By the following researchers from the Andrew L. Tan Data Science Institute:**
1. Ong, Matthew Kristoffer Y. (matthew_kristoffer_ong@dlsu.edu.ph)
2. Presas, Shanette Giane G. (shanette_giane_presas@dlsu.edu.ph)
3. Sarreal, Sophia Althea R. (sophia_sarreal@dlsu.edu.ph)
4. To, Jersey Jaclyn K. (jers_to@dlsu.edu.ph)

---

Note to thesismates:
1. Run this to activate the venv for the terminal instance: `.venv\Scripts\activate`
2. NOTE: you will also need the following files:
    1. 'class_labels_indices.csv'
    2. 'Cnn14_mAP=0.431.pth' (these are the model weights to be used) from https://zenodo.org/records/3987831

## Dependencies

In [143]:
import os
from pathlib import Path

# audio
import numpy as np
import matplotlib.pyplot as plt
import ffmpeg
import torch
import librosa
from panns_inference import AudioTagging

# visuals
import torchvision.models as models
import torchvision.transforms as transforms
import cv2
import argparse
from tqdm import tqdm
from PIL import Image
import time

#text


#similarity
from numpy.linalg import norm

#annotation
import json
import pandas as pd
import google.generativeai as genai
import getpass
from typing import List, Dict, Any

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


Using device: cpu


---
## **AUDIO MODALITY**
**Goal**: Produce embeddings representing the audio modality of a given set of videos.

**Preprocessing step:** extracts a 32 kHz mono WAV file from each input video.

In [144]:
def extract_audio_to_wavs(video_path: str, out32: str, overwrite: bool=True):
    extract_32k=(
        ffmpeg.input(video_path).output(out32, format='wav', acodec='pcm_s16le', ac=1, ar=32000)
    )
    if overwrite:
        extract_32k = extract_32k.overwrite_output()
    
    extract_32k.run(quiet=True)
    print("Wrote 32kHz", out32)

In [145]:
def process_video(video_path: str, out_dir: str ="proc_out"):
    out_dir = Path(out_dir)
    audio_dir = out_dir.parent / (out_dir.name + "_32kHz")
    audio_dir.mkdir(parents=True, exist_ok=True) # 32kHz goes to audio_dir

    video = Path(video_path)
    out32 = audio_dir / (video.stem + "_32k.wav") # 32kHz output

    # Extract audio
    extract_audio_to_wavs(str(video), str(out32))

In [146]:
media_dir = Path("media")
videos = list(media_dir.glob("*.mp4"))
print(f"{len(videos)} videos found!")

for video in videos:
    print(f"\nProcessing: {video.name}")
    process_video(video)

100 videos found!

Processing: airball_1.mp4
Wrote 32kHz proc_out_32kHz\airball_1_32k.wav

Processing: airball_10.mp4
Wrote 32kHz proc_out_32kHz\airball_10_32k.wav

Processing: airball_11.mp4
Wrote 32kHz proc_out_32kHz\airball_11_32k.wav

Processing: airball_12.mp4
Wrote 32kHz proc_out_32kHz\airball_12_32k.wav

Processing: airball_13.mp4
Wrote 32kHz proc_out_32kHz\airball_13_32k.wav

Processing: airball_2.mp4
Wrote 32kHz proc_out_32kHz\airball_2_32k.wav

Processing: airball_3.mp4
Wrote 32kHz proc_out_32kHz\airball_3_32k.wav

Processing: airball_4.mp4
Wrote 32kHz proc_out_32kHz\airball_4_32k.wav

Processing: airball_5.mp4
Wrote 32kHz proc_out_32kHz\airball_5_32k.wav

Processing: airball_6.mp4
Wrote 32kHz proc_out_32kHz\airball_6_32k.wav

Processing: airball_7.mp4
Wrote 32kHz proc_out_32kHz\airball_7_32k.wav

Processing: airball_8.mp4
Wrote 32kHz proc_out_32kHz\airball_8_32k.wav

Processing: airball_9.mp4
Wrote 32kHz proc_out_32kHz\airball_9_32k.wav

Processing: box_1.mp4
Wrote 32kHz pro

**Feature extraction step:** produces embeddings in the form of a 2048-dimensional feature vector representing the audio of the videos.

In [147]:
proc_out_32kHz_dir = Path("proc_out_32kHz")
emb_out_dir = Path("embeddings_out/audio2048") # 2048-d vectors go here
emb_out_dir.mkdir(parents=True, exist_ok=True)

at_model = AudioTagging(checkpoint_path=None, device=device) #this is the pretrained CNN14

wav_files = sorted(proc_out_32kHz_dir.glob("*_32k.wav"))
print(f"{len(wav_files)} WAV files found!")

for wav_path in wav_files:
    print(f"\nProcessing: {wav_path.name}")
    wav, sr = librosa.load(str(wav_path), sr=32000, mono=True) # just to make sure wav is 32kHz
    audio_batch = np.expand_dims(wav, axis=0) # matches the expected shape of PANN

    _, embedding = at_model.inference(audio_batch) # gets the embedding as numpy array

    embedding_vec = embedding[0] # first element of embedding array

    # just removing the "_32k" for filename consistency
    stem = wav_path.stem
    if stem.endswith("_32k"):
        stem = stem[:-4]

    out_path = emb_out_dir / f"{stem}_emb-audio2048.npy"
    np.save(str(out_path), embedding_vec)
    print("Embedding saved: ", out_path)

    print(embedding_vec) # if you want to see the vector
    print(embedding_vec.shape)

Checkpoint path: C:\Users\Shanette/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
100 WAV files found!

Processing: airball_10_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_10_emb-audio2048.npy
[0.         0.         0.         ... 0.17802377 0.         0.        ]
(2048,)

Processing: airball_11_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_11_emb-audio2048.npy
[0.        0.        0.        ... 0.2917655 0.        0.       ]
(2048,)

Processing: airball_12_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_12_emb-audio2048.npy
[0.         0.         0.         ... 0.24517164 0.         0.        ]
(2048,)

Processing: airball_13_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_13_emb-audio2048.npy
[0.         0.         0.         ... 0.31056064 0.         0.        ]
(2048,)

Processing: airball_1_32k.wav
Embedding saved:  embeddings_out\audio2048\airball_1_emb-audio2048.npy
[0.         0.         0.         ... 0.31978565 0.02038169 0.        

---
## **VISUAL MODALITY**
**Goal**: Produce embeddings representing the visual modality of a given set of videos.

In [148]:
INPUT_DIR = Path("media")
OUTPUT_DIR = Path("embeddings_out/video2048")

FRAME_SAMPLE_RATE = 30
BATCH_SIZE = 32

VIDEO_EXTENSIONS = [".mp4", ".mov", ".avi", ".mkv", ".webm"]

In [149]:
def get_resnet_model(device: str):
    """Loads the pre-trained ResNet-50 model and its associated transforms."""
    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights)
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.eval()
    model.to(device)
    preprocess = weights.transforms()
    return model, preprocess

model, preprocess = get_resnet_model(device)

In [150]:
def extract_resnet_embeddings(
    video_path: Path, 
    model, 
    preprocess, 
    device: str, 
    frame_sample_rate: int = 30, 
    batch_size: int = 32
) -> np.ndarray:
    if not video_path.exists():
        raise FileNotFoundError(f"Video file not found: {video_path}")

    cap = cv2.VideoCapture(str(video_path))
    if not cap.isOpened():
        raise IOError(f"Cannot open video file: {video_path}")

    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    all_features = []
    frame_batch = []
    frame_idx = 0
    
    pbar = tqdm(total=frame_count, desc=f"Frames for {video_path.name}", leave=True, disable=True)

    with torch.no_grad():
        while True:
            ret, frame = cap.read()
            if not ret: break
            pbar.update(1)
            
            if frame_idx % frame_sample_rate == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                pil_img = Image.fromarray(frame_rgb)
                frame_batch.append(pil_img)

                if len(frame_batch) == batch_size:
                    image_inputs = torch.stack(
                        [preprocess(img) for img in frame_batch]
                    ).to(device)
                    image_features = model(image_inputs)
                    all_features.append(image_features.squeeze().cpu().numpy())
                    frame_batch = []
            frame_idx += 1
        
        if frame_batch:
            image_inputs = torch.stack(
                [preprocess(img) for img in frame_batch]
            ).to(device)
            image_features = model(image_inputs)
            all_features.append(image_features.squeeze().cpu().numpy())

    cap.release()
    pbar.close()
    if not all_features:
        raise ValueError(f"No frames sampled for {video_path.name}")

    embeddings = np.vstack(all_features)
    mean_embedding = np.mean(embeddings, axis=0)
    return mean_embedding
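
The function above keeps one frame out of every `frame_sample_rate` frames and averages the per-frame ResNet-50 features into a single vector. The sketch below isolates just those two ideas on stand-in data; the helper names (`sampled_frame_indices`, `mean_pool`), the 95-frame clip, and the random features are assumptions made for illustration and are not part of the pipeline.

```python
import numpy as np

def sampled_frame_indices(frame_count: int, frame_sample_rate: int = 30) -> list:
    """Indices of the frames kept when every `frame_sample_rate`-th frame is sampled."""
    return list(range(0, frame_count, frame_sample_rate))

def mean_pool(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame feature vectors of shape (n_frames, dim) into one (dim,) vector."""
    return frame_features.mean(axis=0)

# Hypothetical 95-frame clip sampled every 30 frames
kept = sampled_frame_indices(95, 30)

# Random stand-in for the (n_frames, 2048) ResNet-50 features
rng = np.random.default_rng(0)
fake_features = rng.random((len(kept), 2048), dtype=np.float32)

video_embedding = mean_pool(fake_features)
print(kept)                   # [0, 30, 60, 90]
print(video_embedding.shape)  # (2048,)
```

With `FRAME_SAMPLE_RATE = 30`, a clip recorded at 30 fps would contribute roughly one frame per second to the average.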

In [151]:
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Reading videos from: {INPUT_DIR.resolve()}")
print(f"Saving embeddings to: {OUTPUT_DIR.resolve()}")

video_files = []
for ext in VIDEO_EXTENSIONS:
    video_files.extend(INPUT_DIR.glob(f"*{ext}"))
print(f"Found {len(video_files)} videos.")

existing_embeddings = {f.name for f in OUTPUT_DIR.glob('*.npy')}
print(f"Found {len(existing_embeddings)} existing ResNet embeddings.")

for video_path in tqdm(video_files, desc="Processing Videos (ResNet)"):
    output_filename = f"{video_path.stem}_emb-visual2048.npy"

    if output_filename in existing_embeddings:
        continue
    
    output_path = OUTPUT_DIR / output_filename
    
    try:
        print(f"Processing {video_path.name}...")
        mean_embedding = extract_resnet_embeddings(
            video_path=video_path,
            model=model,
            preprocess=preprocess,
            device=device,
            frame_sample_rate=FRAME_SAMPLE_RATE,
            batch_size=BATCH_SIZE
        )
        np.save(output_path, mean_embedding)

    except Exception as e:
        print(f"\n[ERROR] Failed to process {video_path.name}: {e}")

print("\n--- Batch processing complete. ---")

Reading videos from: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media
Saving embeddings to: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\embeddings_out\video2048
Found 100 videos.
Found 100 existing ResNet embeddings.


Processing Videos (ResNet): 100%|██████████| 100/100 [00:00<00:00, 99556.23it/s]


--- Batch processing complete. ---





---
## **TEXT MODALITY**
**Goal**: Produce embeddings representing the text modality of a given set of videos.

---
## **RETRIEVING SIMILAR VIDEOS**
**Goal**: Produce a list of most similar videos based on a weighted combination of modality-specific cosine similarity scores.

**Embedding loading step:** builds a dict of embedding vectors in the format below, so that the embeddings of any video can be looked up by its name.
$$
video\_name \;\rightarrow\; \{ audio,\; visual,\; text \}
$$
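
As a rough illustration of how this mapping can be derived from the saved filenames, the sketch below groups embedding files by video name using the `_emb-` naming pattern from the cells above. The helper name `video_id_from_embedding_file` and the two example paths are assumptions; no files are read.

```python
from pathlib import Path

def video_id_from_embedding_file(path: str) -> str:
    """Strip the '_emb-<modality><dim>' suffix to recover the video name."""
    return Path(path).stem.split("_emb-")[0]

# Example paths following the naming used when the embeddings were saved
files = [
    "embeddings_out/audio2048/airball_1_emb-audio2048.npy",
    "embeddings_out/video2048/airball_1_emb-visual2048.npy",
]

index = {}
for f in files:
    # "audio2048" -> "audio", "visual2048" -> "visual"
    modality = Path(f).stem.split("_emb-")[1].rstrip("0123456789")
    index.setdefault(video_id_from_embedding_file(f), {})[modality] = f

print(index)
```

Here the values are file paths rather than loaded arrays, which keeps the example independent of any data on disk.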

In [152]:
def load_all_embeddings(base_dir="embeddings_out"):
    base_dir = Path(base_dir)

    folders = {
        "audio":  base_dir / "audio2048",
        "visual": base_dir / "video2048",
        "text":   base_dir / "text768",
    }

    suffix_map = {
        "audio":  "audio2048",
        "visual": "visual2048",
        "text":   "text768",
    }

    modality_files = {} # collect keys per modality
    for modality, folder in folders.items():
        files = list(folder.glob(f"*emb-{suffix_map[modality]}.npy"))
        modality_files[modality] = {f.stem.split("_emb-")[0]: f for f in files}

    all_video_ids = set()
    for d in modality_files.values():
        all_video_ids.update(d.keys())

    embeddings = {}
    missing = []

    for vid in all_video_ids:
        embeddings[vid] = {}
        for modality in ["audio", "visual", "text"]:
            file = modality_files[modality].get(vid, None)
            if file is None:
                missing.append((vid, modality))
                embeddings[vid][modality] = None
            else:
                embeddings[vid][modality] = np.load(str(file))

    if missing:
        print("WARNING: Missing modality embeddings detected:") # just to be safe
        for vid, modality in missing:
            print(f"  - {vid} missing {modality}")

    return embeddings

embeddings = load_all_embeddings() # get embeddings with embeddings["video_name"]

# to check
for video, emb_vec in embeddings.items():
    print(video, emb_vec)


  - kidnap_7 missing text
  - exchange_1 missing text
  - fit_2 missing text
  - kidnap_8 missing text
  - rps_3 missing text
  - airball_12 missing text
  - where_1 missing text
  - rps_4 missing text
  - box_5 missing text
  - fun_10 missing text
  - airball_1 missing text
  - kidnap_5 missing text
  - box_7 missing text
  - kidnap_9 missing text
  - where_3 missing text
  - freewill_8 missing text
  - exchange_9 missing text
  - exchange_2 missing text
  - box_8 missing text
  - freewill_3 missing text
  - rps_7 missing text
  - fit_10 missing text
  - box_1 missing text
  - kidnap_4 missing text
  - where_7 missing text
  - kidnap_10 missing text
  - where_8 missing text
  - freewill_7 missing text
  - freewill_16 missing text
  - where_5 missing text
  - fun_6 missing text
  - fit_5 missing text
  - rps_5 missing text
  - fit_6 missing text
  - kidnap_1 missing text
  - kidnap_3 missing text
  - airball_10 missing text
  - fit_3 missing text
  - fun_8 missing text
  - where_2 miss

>**NOTE: Input the query video here :))**

If testing different queries with the same set of videos, just <u>run the notebook starting at this cell</u> to skip the preprocessing and loading of embeddings.

In [153]:
# please type the EXACT filename of the query video
QUERY_ID = "airball_1"
QUERY_VIDEO_PATH = f"media/{QUERY_ID}.mp4"
BASE_MEDIA_PATH = r"media"

**Cosine similarity computation step:** computes modality-specific cosine similarity scores for each video and a query video, resulting in each video being represented as a vector of 3 similarity scores.

In [154]:
def cosine_similarity(vec1, vec2):
    if vec1 is None or vec2 is None:
        return np.nan  # missing embedding: NaN marks this modality as unavailable
    
    # Check for zero-norm vectors (which would cause division by zero)
    n1 = norm(vec1)
    n2 = norm(vec2)
    if n1 == 0 or n2 == 0:
        return 0.0 # Define similarity as 0 if either vector is all zeros
    
    return np.dot(vec1, vec2) / (n1 * n2)
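
A few toy inputs make the edge cases of this function easier to see. The sketch below is a standalone re-statement of the same logic (named `cosine_sim` here so it does not shadow the notebook's function); the 2-d vectors are made up for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity; NaN if an embedding is missing, 0.0 if either vector is all zeros."""
    if a is None or b is None:
        return float("nan")
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

print(cosine_sim([1, 0], [1, 0]))   # identical direction
print(cosine_sim([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_sim([1, 2], [2, 4]))   # parallel vectors of different length
print(cosine_sim([0, 0], [1, 1]))   # zero vector
print(cosine_sim([1, 0], None))     # missing embedding
```

Identical and parallel vectors score (approximately) 1, orthogonal vectors score 0, and a missing embedding yields NaN, which is what later shows up in the third column of the similarity vectors.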

In [155]:
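def compute_modality_similarities(query_video_name: str, embeddings_dir: Path):
    # NOTE: embeddings_dir is currently unused; similarities are computed from the
    # global `embeddings` dict built by load_all_embeddings() above.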
    if query_video_name not in embeddings:
        raise ValueError(f"Query video '{query_video_name}' not found in embeddings.")
    
    query_emb = embeddings[query_video_name]
    
    similarity_dict = {}
    
    for video_name, video_emb in embeddings.items(): 
        if video_name == query_video_name: # skips self
            continue 
        
        sims = []
        missing_modalities = []
        
        for modality in ["audio", "visual", "text"]:
            if modality not in video_emb or modality not in query_emb:
                missing_modalities.append(modality)
                sims.append(np.nan)  # for missing embeddings
            else:
                sims.append(cosine_similarity(video_emb[modality], query_emb[modality]))
        
        if missing_modalities:
            print(f"[WARNING] {video_name} missing embeddings for: {', '.join(missing_modalities)}")
        
        similarity_dict[video_name] = np.array(sims)
    
    return similarity_dict

similarities = compute_modality_similarities(QUERY_ID, "embeddings_out") # get similarity vector with similarities["video_name"]

# to check
for video, sim_vec in similarities.items():
    print(video, sim_vec)

kidnap_7 [0.56134838 0.32091469        nan]
exchange_1 [0.64918184 0.45566502        nan]
fit_2 [0.86118448 0.66660649        nan]
kidnap_8 [0.55460483 0.51111072        nan]
rps_3 [0.54650992 0.49365288        nan]
airball_12 [0.98227698 0.62172717        nan]
where_1 [0.49181014 0.6853745         nan]
rps_4 [0.55064255 0.58628416        nan]
box_5 [0.49744254 0.5607928         nan]
fun_10 [0.52101535 0.69809717        nan]
kidnap_5 [0.55799776 0.28628424        nan]
box_7 [0.54632092 0.50814146        nan]
kidnap_9 [0.55460483 0.4925746         nan]
where_3 [0.57611936 0.67777193        nan]
freewill_8 [0.79449123 0.58143461        nan]
exchange_9 [0.62143105 0.55445236        nan]
exchange_2 [0.50487858 0.62993276        nan]
box_8 [0.56646043 0.62201983        nan]
freewill_3 [0.62562209 0.53297114        nan]
rps_7 [0.54968208 0.59066129        nan]
fit_10 [0.84237576 0.50965542        nan]
box_1 [0.57380652 0.61256623        nan]
kidnap_4 [0.55009645 0.45111495        nan]
where_

>**NOTE: Input the weights here :))**

If testing different weights with the same query video and set of videos, just <u>run the notebook starting at this cell</u> to skip computing the modality-specific cosine similarity scores.

In [156]:
WEIGHT_AUDIO = 1/3
WEIGHT_VIDEO = 1/3
WEIGHT_TEXT = 1/3

**Weighted-sum fusion step:** uses weighted linear combination to form a final similarity score for each video and a query video, where the weights can be modified according to the different test cases.
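
Written out, with $s_a$, $s_v$, $s_t$ as the audio, visual, and text cosine similarities and $w_a$, $w_v$, $w_t$ as the corresponding weights (symbols introduced here for illustration), the fused score after weight normalization would be:
$$
s_{final} \;=\; \frac{w_a\, s_a + w_v\, s_v + w_t\, s_t}{w_a + w_v + w_t}
$$
For example, with equal weights and hypothetical similarities $s_a = 0.9$, $s_v = 0.6$, $s_t = 0.3$, the score is $(0.9 + 0.6 + 0.3)/3 = 0.6$.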

In [157]:
def weighted_sum_fusion(similarity_dict, weight_audio, weight_video, weight_text):

    weights = np.array([weight_audio, weight_video, weight_text])
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    final_weighted_dict = {}
    
    for video, sim_vec in similarity_dict.items():
        if len(sim_vec) != 3:
            raise ValueError(f"Expected 3 modalities in similarity vector for {video}, got {len(sim_vec)}")
        
        sim_audio, sim_video, sim_text = sim_vec
        weighted_score = (sim_audio*weights[0] + sim_video*weights[1] + sim_text*weights[2])
        final_weighted_dict[video] = float(weighted_score)

    return final_weighted_dict

final_scores = weighted_sum_fusion(similarities, WEIGHT_AUDIO, WEIGHT_VIDEO, WEIGHT_TEXT)

# to check
for video, score in final_scores.items():
    print(video, score)

kidnap_7 nan
exchange_1 nan
fit_2 nan
kidnap_8 nan
rps_3 nan
airball_12 nan
where_1 nan
rps_4 nan
box_5 nan
fun_10 nan
kidnap_5 nan
box_7 nan
kidnap_9 nan
where_3 nan
freewill_8 nan
exchange_9 nan
exchange_2 nan
box_8 nan
freewill_3 nan
rps_7 nan
fit_10 nan
box_1 nan
kidnap_4 nan
where_7 nan
kidnap_10 nan
where_8 nan
freewill_7 nan
freewill_16 nan
where_5 nan
fun_6 nan
fit_5 nan
rps_5 nan
fit_6 nan
kidnap_1 nan
kidnap_3 nan
airball_10 nan
fit_3 nan
fun_8 nan
where_2 nan
exchange_6 nan
where_11 nan
freewill_9 nan
airball_11 nan
where_6 nan
fit_8 nan
fun_5 nan
rps_6 nan
fun_2 nan
airball_7 nan
airball_2 nan
exchange_3 nan
exchange_5 nan
freewill_11 nan
freewill_13 nan
airball_4 nan
exchange_4 nan
box_11 nan
freewill_4 nan
fit_7 nan
airball_5 nan
box_9 nan
freewill_2 nan
fun_9 nan
fit_1 nan
airball_9 nan
freewill_6 nan
airball_6 nan
fit_4 nan
fun_3 nan
fun_4 nan
freewill_10 nan
freewill_1 nan
rps_1 nan
fun_1 nan
where_12 nan
exchange_10 nan
box_3 nan
airball_13 nan
rps_10 nan
airball_8 na
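
Every score above is `nan` because no text embeddings exist yet: the text similarity of each video is `nan`, and any sum that includes `nan` is itself `nan`. One possible workaround, in the same spirit as the commented-out `create_context_sets` cell below, is to average only over the modalities that are present and renormalize the weights per video. The function name `nan_aware_fusion` and the two example clips are assumptions made for this sketch.

```python
import numpy as np

def nan_aware_fusion(similarity_dict, weights):
    """Weighted average over the modalities that are present (non-NaN) for each video."""
    weights = np.asarray(weights, dtype=float)
    fused = {}
    for video, sim_vec in similarity_dict.items():
        sim_vec = np.asarray(sim_vec, dtype=float)
        present = ~np.isnan(sim_vec)            # True where a similarity exists
        weight_sum = weights[present].sum()
        if weight_sum == 0:
            fused[video] = float("nan")         # no usable modality for this video
        else:
            fused[video] = float(np.dot(sim_vec[present], weights[present]) / weight_sum)
    return fused

# Made-up similarity vectors in [audio, visual, text] order, text missing
example = {
    "clip_a": np.array([0.9, 0.5, np.nan]),
    "clip_b": np.array([0.4, 0.6, np.nan]),
}
print(nan_aware_fusion(example, [1/3, 1/3, 1/3]))
```

With equal weights, `clip_a` comes out at about 0.7 and `clip_b` at about 0.5, since the missing text modality is skipped rather than propagated. Whether skipping a modality is appropriate for a given test condition is a design decision for the study, not something this sketch settles.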

In [158]:
# import numpy as np

# def create_context_sets(similarity_dict):
#     """
#     Takes the similarities dict and generates 7 ranked lists of (video, score) tuples.
#     Handles missing modalities (np.nan) by ignoring them in the weighted sum.
#     """
    
#     weights_per_condition = {
#         "visual_only":       np.array([0, 1, 0]),
#         "audio_only":        np.array([1, 0, 0]),
#         "text_only":         np.array([0, 0, 1]),
#         "visual_plus_audio": np.array([0.5, 0.5, 0]),
#         "audio_plus_text":   np.array([0.5, 0, 0.5]),
#         "visual_plus_text":  np.array([0, 0.5, 0.5]),
#         "visual_audio_text": np.array([1/3, 1/3, 1/3]),
#     }
    
#     ranked_lists = {}

#     for cond_name, weights in weights_per_condition.items():
#         scores_for_this_condition = []
        
#         for video_name, sim_vec in similarity_dict.items():
            
#             weighted_score = np.nansum(sim_vec * weights)
            
#             non_nan_weights_sum = np.sum(weights[~np.isnan(sim_vec)])
            
#             if non_nan_weights_sum == 0:
#                 final_score = 0.0
#             else:
#                 final_score = weighted_score / non_nan_weights_sum
                
#             scores_for_this_condition.append((video_name, final_score))
        
#         scores_for_this_condition.sort(key=lambda x: x[1], reverse=True)
#         ranked_lists[cond_name] = scores_for_this_condition
        
#     return ranked_lists

# ranked_context_lists = create_context_sets(similarities)

# print("--- Top 3 videos for each condition ---")
# for cond_name, videos in ranked_context_lists.items():
#     print(f"[{cond_name}]:")
#     for video, score in videos[:3]:
#         print(f"  - {video} (Score: {score:.4f})")

**Ranking step:** uses the final scores from weighted sum fusion to rank all videos by their similarity score with the query video, printed in descending order.

In [159]:
def rank_by_score(final_weighted_dict, top_k=None):
    ranked_videos = sorted(final_weighted_dict.items(), key=lambda x: x[1], reverse=True)
    
    if top_k is not None:
        ranked_videos = ranked_videos[:top_k]
    
    return ranked_videos

k = 5 # no of similar videos to retrieve
most_similar_videos = rank_by_score(final_scores, top_k = k)

print(f"Top {k} most similar videos to {QUERY_ID}")
for video, score in most_similar_videos:
    print(f"{video}: {score:.4f}")

print("\nCOPY THIS LIST FOR THE NEXT STEP:")
print(most_similar_videos)

Top 5 most similar videos to airball_1
kidnap_7: nan
exchange_1: nan
fit_2: nan
kidnap_8: nan
rps_3: nan

COPY THIS LIST FOR THE NEXT STEP:
[('kidnap_7', nan), ('exchange_1', nan), ('fit_2', nan), ('kidnap_8', nan), ('rps_3', nan)]
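
Because `nan` compares as false against every number, `sorted()` has nothing to order when all scores are `nan`, and the list above simply matches the order in which the scores were printed earlier. A minimal sketch of a ranking helper that drops `nan` scores before sorting is shown below; `rank_valid_scores` and the example scores are hypothetical.

```python
import math

def rank_valid_scores(scores, top_k=None):
    """Sort (video, score) pairs by descending score, dropping NaN scores first."""
    valid = [(video, s) for video, s in scores.items() if not math.isnan(s)]
    ranked = sorted(valid, key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Made-up fused scores, one of them unusable
scores = {"clip_a": 0.7, "clip_b": float("nan"), "clip_c": 0.9}
print(rank_valid_scores(scores, top_k=2))  # [('clip_c', 0.9), ('clip_a', 0.7)]
```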


---
## **GEMINI ANNOTATION GENERATOR**
**Goal**: Generate an annotation for the query video, using the ranked list from the pipeline above as context.


In [160]:
# --- ⚠️ MANUAL INPUT 3: Configure this Annotation Run ---

# 1. Enter your Gemini API Key (or use getpass)
GEMINI_API_KEY = "PASTE_YOUR_API_KEY_HERE"
GENAI_MODEL_NAME = "gemini-1.5-pro"

# 2. Define your query video path and media folder
# (These should be correct from your pipeline)
# QUERY_ID is already set in the query-input cell above
QUERY_VIDEO_PATH = rf"media\{QUERY_ID}.mp4"
BASE_MEDIA_PATH = r"media" # Folder where all videos are

# 3. ⚠️ SET THE NAME FOR THIS RUN ⚠️
# (e.g., "baseline", "visual_only", "visual_plus_audio")
CONDITION_NAME = "visual_plus_audio" 

# 4. ⚠️ PASTE THE LIST FROM THE CELL ABOVE ⚠️
CONTEXT_VIDEO_TUPLES = [
    ('airball_8', 0.85), 
    ('airball_3', 0.79), 
    ('where_1', 0.75)
]

# 5. Set how many context videos to use
TOP_K = 3 

# 6. Define output filenames (will be automatic)
JSON_OUTPUT_PATH = f"{QUERY_ID}_{CONDITION_NAME}.json"
CSV_OUTPUT_PATH = f"{QUERY_ID}_{CONDITION_NAME}.csv"

# --- Helper function to build the final list ---
def build_paths(similar_list: List, base_path: str, k: int) -> List[str]:
    video_names = [video_name for video_name, _ in similar_list]
    top_k_names = video_names[:k]
    # Assumes .mp4, you may need to change this
    return [rf"{base_path}\{name}.mp4" for name in top_k_names]

CONTEXT_VIDEO_PATHS = build_paths(CONTEXT_VIDEO_TUPLES, BASE_MEDIA_PATH, TOP_K)
print(f"Running annotation for: {QUERY_ID} - {CONDITION_NAME}")
print(f"Using context videos: {CONTEXT_VIDEO_PATHS}")

Running annotation for: airball_1 - visual_plus_audio
Using context videos: ['media\\airball_8.mp4', 'media\\airball_3.mp4', 'media\\where_1.mp4']
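
Since the configuration cell mentions `getpass` as an alternative to pasting the key, a small helper along these lines could keep the key out of the saved notebook. The function name `load_api_key` and the environment variable name `GEMINI_API_KEY` are assumptions for this sketch, not something the Gemini client requires.

```python
import os
import getpass

def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the key from the environment, prompting interactively only if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed characters, so the key is not echoed into the cell output
        key = getpass.getpass(f"Enter {env_var}: ")
    return key
```

It could then be used as `GEMINI_API_KEY = load_api_key()` in place of the hard-coded placeholder.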


In [161]:
BASELINE_SYSTEM = "You are an assistant tasked with generating a brief summary of a short video. Use only the information available in the video. Do not rely on any external knowledge or assumptions. Focus on describing what is happening in the video concisely."
CONTEXT_AWARE_SYSTEM = "You are an assistant tasked with generating a summary of a short video. You are provided with the main video and a few additional videos that are semantically related. Use all available information to generate a summary that best describes what is happening in the main video. Focus on enhancing your understanding using the related videos, but ensure the summary reflects the main video."
BASELINE_USER = "Please generate a 2–3 sentence summary of the following video based solely on its content."
CONTEXT_AWARE_USER = "Please summarize the main video using all the information provided. The first video is the main one, and the others are related videos that may provide helpful context. Your summary should describe what is happening in the main video in 2–3 sentences."

_uploaded_cache: Dict[str, Any] = {}
def _upload_video(path: str):
    global _uploaded_cache
    full = str(Path(path).resolve())
    if full not in _uploaded_cache:
        print(f"Uploading: {full}")
        try:
            file_obj = genai.upload_file(path=full)
            print("Uploaded, waiting for processing...")
            while True:
                file_obj = genai.get_file(file_obj.name)
                if file_obj.state.name == "ACTIVE":
                    print(f"✅ File is ACTIVE: {file_obj.name}"); break
                elif file_obj.state.name == "FAILED":
                    raise RuntimeError(f"❌ File {file_obj.name} failed to process.")
                time.sleep(2)
            _uploaded_cache[full] = file_obj
        except Exception as e:
            print(f"❌ Error uploading {path}: {e}"); return None
    return _uploaded_cache.get(full)

def _make_model(system_instruction: str):
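    # system_instruction must be passed by keyword; the second positional parameter is safety_settings
    return genai.GenerativeModel(GENAI_MODEL_NAME, system_instruction=system_instruction)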

In [162]:
genai.configure(api_key=GEMINI_API_KEY)
print("--- Starting Annotation ---")

# Clear cache for each new run
_uploaded_cache = {} 
annotation_text = ""
final_record = {}

query_file = _upload_video(QUERY_VIDEO_PATH)

if query_file is None:
    print(f"❌ Aborting: Failed to upload main query video {QUERY_VIDEO_PATH}")
else:
    if CONDITION_NAME == "baseline":
        print("Running BASELINE annotation...")
        model = _make_model(BASELINE_SYSTEM)
        try:
            response = model.generate_content([query_file, BASELINE_USER])
            annotation_text = response.text.strip()
        except Exception as e:
            print(f"❌ Error in baseline generation: {e}"); annotation_text = f"ERROR: {e}"
        
        final_record = {
            "query_id": QUERY_ID, "condition_name": CONDITION_NAME,
            "context_video_paths": [], "annotation_text": annotation_text,
        }

    else:
        print(f"Running CONTEXT-AWARE annotation for: {CONDITION_NAME}...")
        model = _make_model(CONTEXT_AWARE_SYSTEM)
        
        ctx_files = []
        for p in CONTEXT_VIDEO_PATHS:
            f = _upload_video(p)
            if f: ctx_files.append(f)
        
        contents = [query_file] + ctx_files + [CONTEXT_AWARE_USER]
        
        try:
            response = model.generate_content(contents)
            annotation_text = response.text.strip()
        except Exception as e:
            print(f"❌ Error in context generation: {e}"); annotation_text = f"ERROR: {e}"
        
        final_record = {
            "query_id": QUERY_ID, "condition_name": CONDITION_NAME,
            "context_video_paths": CONTEXT_VIDEO_PATHS, "annotation_text": annotation_text,
        }

    print("\n--- Annotation Complete ---")
    print(f"Result: {annotation_text}")
    
    # Save the single result
    if final_record:
        pd.DataFrame([final_record]).to_csv(CSV_OUTPUT_PATH, index=False)
        with open(JSON_OUTPUT_PATH, "w", encoding="utf-8") as f:
            json.dump(final_record, f, indent=2, ensure_ascii=False)
        print(f"Saved results to:\n  {CSV_OUTPUT_PATH}\n  {JSON_OUTPUT_PATH}")
    else:
        print("No result to save.")

--- Starting Annotation ---
Uploading: C:\Users\Shanette\Downloads\COLLEGE\CSST Y4-T1\THS-ST2\context-aware-video-retrieval\similarity pipeline\media\airball_1.mp4
❌ Error uploading media\airball_1.mp4: <HttpError 400 when requesting https://generativelanguage.googleapis.com/$discovery/rest?version=v1beta&key=PASTE_YOUR_API_KEY_HERE returned "API key not valid. Please pass a valid API key.". Details: "[{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'API_KEY_INVALID', 'domain': 'googleapis.com', 'metadata': {'service': 'generativelanguage.googleapis.com'}}, {'@type': 'type.googleapis.com/google.rpc.LocalizedMessage', 'locale': 'en-US', 'message': 'API key not valid. Please pass a valid API key.'}]">
❌ Aborting: Failed to upload main query video media\airball_1.mp4
