# Video RAG Implementation - Phase 1: Video Processing

This notebook implements the first phase of the Video RAG pipeline based on the research paper. We will process the video to extract multi-modal information (visual, audio/text, scene text) and prepare it for retrieval.

### **Steps:**
1.  **Video Segmentation**: Extract frames from the video at a fixed sampling rate.
2.  **ASR (Automatic Speech Recognition)**: Extract audio, transcribe it using Whisper, and index the text using FAISS.
3.  **OCR (Optical Character Recognition)**: Detect text in frames using EasyOCR and index the text using FAISS.
4.  **Visual Feature Extraction**: Extract CLIP visual embeddings for each sampled frame, for later use in object-detection keyframe selection.

In [1]:
# Install necessary packages if not already installed
!pip install torch transformers accelerate openai-whisper easyocr sentence-transformers faiss-cpu opencv-python numpy pillow


In [2]:
import os
import cv2
import numpy as np
import torch
import faiss
import whisper
import easyocr
from PIL import Image
from transformers import AutoTokenizer, AutoModel, CLIPProcessor, CLIPModel
import pickle
from tqdm import tqdm

# Configuration
VIDEO_PATH = "test_video.mp4"
DB_DIR = "db/video_rag_db"
os.makedirs(DB_DIR, exist_ok=True)

# Prefer CUDA, then Apple MPS, then CPU
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
print(f"Using device: {DEVICE}")



Using device: mps


## 1. Video Segmentation

We extract frames from the video. The paper suggests uniformly sampling frames (e.g., 1 frame per second). This reduces redundancy while capturing sufficient temporal information.

In [3]:
def extract_frames(video_path, fps=1.0):
    frames = []
    timestamps = []
    
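    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    # Keep every N-th frame; clamp to 1 so the modulo below never divides by zero
    frame_interval = max(1, int(video_fps / fps))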
    
    count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        if count % frame_interval == 0:
            # Convert BGR to RGB
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame_rgb)
            timestamps.append(cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0) # timestamp in seconds
        
        count += 1
    
    cap.release()
    return frames, timestamps

print("Extracting frames...")
frames, timestamps = extract_frames(VIDEO_PATH, fps=1.0)
print(f"Extracted {len(frames)} frames from {VIDEO_PATH}")

Extracting frames...
Extracted 77 frames from test_video.mp4
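
The sampling interval is computed by integer truncation, so the effective spacing between sampled frames is not always exactly one second. Below is a minimal sketch of the arithmetic. It assumes the source video runs at the common NTSC rate of 30000/1001 (about 29.97 fps), which is inferred from the 0.9676 s spacing of the OCR timestamps printed in Section 3 rather than stated anywhere. `sampling_interval` is a hypothetical helper that mirrors the interval computation in `extract_frames`.

```python
def sampling_interval(video_fps: float, target_fps: float) -> int:
    """Number of source frames between two sampled frames (at least 1)."""
    return max(1, int(video_fps / target_fps))

video_fps = 30000 / 1001          # assumed NTSC frame rate, about 29.97 fps
interval = sampling_interval(video_fps, 1.0)
spacing_seconds = interval / video_fps

print(interval)                    # 29
print(round(spacing_seconds, 4))   # 0.9676
```

Rounding instead of truncating (`round(video_fps / target_fps)`) would give an interval of 30 and a spacing of about 1.001 s. Either choice works as long as the stored timestamps come from the video itself, as they do here.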


## 2. Automatic Speech Recognition (ASR)

We use **Whisper** to transcribe the audio track of the video. The transcripts provide context that the visuals alone can miss. The paper encodes these transcripts with **Contriever**; in Section 4 this notebook substitutes a `sentence-transformers` model and stores the embeddings in a FAISS index for retrieval.

In [4]:
# Load Whisper Model
asr_model = whisper.load_model("base", device=DEVICE)

print("Transcribing audio...")
result = asr_model.transcribe(VIDEO_PATH)
segments = result['segments']

# Prepare ASR documents
asr_docs = []
for seg in segments:
    text = seg['text'].strip()
    start = seg['start']
    end = seg['end']
    if text:
        asr_docs.append({
            "text": text,
            "start": start,
            "end": end,
            "type": "ASR"
        })

print(f"Generated {len(asr_docs)} ASR segments.")
# Example
if asr_docs:
    print(asr_docs[0])

Transcribing audio...
Generated 20 ASR segments.
{'text': "In Noet Six, we're looking at the seasonal displays that are popping up everywhere these", 'start': 0.0, 'end': 6.28, 'type': 'ASR'}
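
Because each ASR document keeps its `start` and `end` times, transcript text can later be aligned with the sampled frames. Below is a minimal sketch, assuming the `asr_docs` structure above. `segments_at` is a hypothetical helper, and the second entry in the sample data is made up for illustration.

```python
def segments_at(docs, t):
    """Return the ASR documents whose [start, end] interval contains time t (seconds)."""
    return [d for d in docs if d["start"] <= t <= d["end"]]

sample_docs = [
    # First entry: timing copied from the output above, text shortened
    {"text": "In Noet Six, we're looking at the seasonal displays", "start": 0.0, "end": 6.28, "type": "ASR"},
    # Second entry: hypothetical, for illustration only
    {"text": "(hypothetical second segment)", "start": 6.28, "end": 11.0, "type": "ASR"},
]

hits = segments_at(sample_docs, 2.9029)   # timestamp of sampled frame 3
print([d["start"] for d in hits])         # [0.0]
```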


## 3. Optical Character Recognition (OCR)

We use **EasyOCR** to extract text that appears in the video frames. Scene text of this kind is often relevant to a query. As with ASR, we index these text segments.

In [5]:
reader = easyocr.Reader(['en'], gpu=(DEVICE != 'cpu'))

ocr_docs = []

print("Running OCR on sampled frames...")
for i, frame in enumerate(tqdm(frames)):
    timestamp = timestamps[i]
    # EasyOCR expects numpy array (RGB or BGR, we have RGB)
    results = reader.readtext(frame)
    
    frame_text = " ".join([res[1] for res in results])
    
    if frame_text.strip():
        ocr_docs.append({
            "text": frame_text,
            "timestamp": timestamp,
            "frame_idx": i,
            "type": "OCR"
        })

print(f"Found text in {len(ocr_docs)} frames.")
if ocr_docs:
    print(ocr_docs[0])

Running OCR on sampled frames...


100%|██████████| 77/77 [00:26<00:00,  2.92it/s]

Found text in 58 frames.
{'text': '13newsnow-com NEW AT 6', 'timestamp': 0.0, 'frame_idx': 0, 'type': 'OCR'}





In [7]:
for i in ocr_docs:
    print(i)

{'text': '13newsnow-com NEW AT 6', 'timestamp': 0.0, 'frame_idx': 0, 'type': 'OCR'}
{'text': '13newsnow.com NEW AT 6', 'timestamp': 0.9676333333333333, 'frame_idx': 1, 'type': 'OCR'}
{'text': '13newsnow.com NEW AT 6', 'timestamp': 1.9352666666666667, 'frame_idx': 2, 'type': 'OCR'}
{'text': '13newsnow.com NEW AT 6', 'timestamp': 2.9029000000000003, 'frame_idx': 3, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 3.8705333333333334, 'frame_idx': 4, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 4.838166666666667, 'frame_idx': 5, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 5.8058000000000005, 'frame_idx': 6, 'type': 'OCR'}
{'text': '13newsnow: com', 'timestamp': 6.773433333333334, 'frame_idx': 7, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 7.741066666666667, 'frame_idx': 8, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 8.7087, 'frame_idx': 9, 'type': 'OCR'}
{'text': '13newsnow com', 'timestamp': 9.676333333333334, 'frame_idx': 10, 'type': 'OCR'}
...
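
The listing above shows that consecutive frames often yield identical text (for example `13newsnow com` on frames 4 to 6), so the OCR index will contain many near-duplicate entries. One possible way to reduce this is to merge runs of consecutive frames with the same text into a single document with a time span. This is sketched here as an optional step, not something the paper or this notebook prescribes. `merge_consecutive_ocr` and the `start`, `end`, and `frame_indices` fields are hypothetical names.

```python
def merge_consecutive_ocr(docs):
    """Collapse runs of adjacent OCR docs with identical text into one doc per run."""
    merged = []
    for d in docs:
        last = merged[-1] if merged else None
        if last and last["text"] == d["text"] and d["frame_idx"] == last["frame_indices"][-1] + 1:
            # Same text on the next frame: extend the current run
            last["end"] = d["timestamp"]
            last["frame_indices"].append(d["frame_idx"])
        else:
            # New text (or a gap in frames): start a new run
            merged.append({
                "text": d["text"],
                "start": d["timestamp"],
                "end": d["timestamp"],
                "frame_indices": [d["frame_idx"]],
                "type": "OCR",
            })
    return merged

# Rows copied from the listing above (frames 3 to 7), timestamps rounded
sample = [
    {"text": "13newsnow.com NEW AT 6", "timestamp": 2.9029, "frame_idx": 3, "type": "OCR"},
    {"text": "13newsnow com", "timestamp": 3.8705, "frame_idx": 4, "type": "OCR"},
    {"text": "13newsnow com", "timestamp": 4.8382, "frame_idx": 5, "type": "OCR"},
    {"text": "13newsnow com", "timestamp": 5.8058, "frame_idx": 6, "type": "OCR"},
    {"text": "13newsnow: com", "timestamp": 6.7734, "frame_idx": 7, "type": "OCR"},
]

for doc in merge_consecutive_ocr(sample):
    print(doc["text"], doc["frame_indices"])
```

Exact string matching leaves OCR variants such as `13newsnow: com` as separate entries; catching those would need some form of fuzzy matching, which this sketch does not attempt.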

## 4. Text Embedding & Indexing (FAISS)

The paper specifies **Contriever** as the text encoder. For easier setup, this notebook uses `sentence-transformers/all-MiniLM-L6-v2` as a practical stand-in. It plays the same role: embedding the paper's "visual-aligned auxiliary texts" (here, the ASR and OCR documents) so they can be retrieved by similarity.

In [8]:
from sentence_transformers import SentenceTransformer

# Load efficient retrieval model
retriever = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=DEVICE)

def build_faiss_index(docs, index_name):
    """Embed docs, build a cosine-similarity FAISS index, and save it with its metadata."""
    if not docs:
        print(f"No documents for {index_name}; skipping.")
        return None
    
    texts = [d['text'] for d in docs]
    # FAISS expects contiguous float32 arrays
    embeddings = np.ascontiguousarray(
        retriever.encode(texts, convert_to_numpy=True), dtype=np.float32
    )
    
    # L2-normalize so that inner product equals cosine similarity
    faiss.normalize_L2(embeddings)
    
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    
    # Save Index
    faiss.write_index(index, os.path.join(DB_DIR, f"{index_name}.index"))
    
    # Save Metadata (same order as the vectors in the index)
    with open(os.path.join(DB_DIR, f"{index_name}_meta.pkl"), "wb") as f:
        pickle.dump(docs, f)
    
    print(f"Saved {index_name} index with {len(docs)} documents.")
    return index

asr_index = build_faiss_index(asr_docs, "asr")
ocr_index = build_faiss_index(ocr_docs, "ocr")

Saved asr index with 20 documents.
Saved ocr index with 58 documents.
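
`IndexFlatIP` ranks by raw inner product, which is why the embeddings are L2-normalized first: for unit-length vectors the inner product equals cosine similarity. Below is a small plain-Python sketch of that identity. No FAISS is involved, the vectors are arbitrary illustrative values, and the helper names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [3.0, 4.0]
doc = [4.0, 3.0]

ip_after_norm = dot(l2_normalize(query), l2_normalize(doc))
print(round(ip_after_norm, 4))        # 0.96
print(round(cosine(query, doc), 4))   # 0.96
```

At query time the query embedding needs the same normalization before it is passed to `index.search`; otherwise the returned scores are scaled by the query norm and are no longer cosine similarities.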


## 5. Visual Feature Extraction (CLIP)

We extract CLIP visual embeddings for every sampled frame. These will be used in Phase 2 for **Object Detection Keyframe Selection** (filtering frames relevant to a query like "find the red car").

In [9]:
clip_model_name = "openai/clip-vit-large-patch14"
clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
clip_model = CLIPModel.from_pretrained(clip_model_name).to(DEVICE)

batch_size = 32
visual_embeddings = []

print("Extracting Visual Embeddings...")
with torch.no_grad():
    for i in range(0, len(frames), batch_size):
        # Convert the RGB numpy arrays to PIL images before handing them to the processor
        batch_frames = [Image.fromarray(f) for f in frames[i : i + batch_size]]
        inputs = clip_processor(images=batch_frames, return_tensors="pt").to(DEVICE)
        
        outputs = clip_model.get_image_features(**inputs)
        outputs = outputs / outputs.norm(p=2, dim=-1, keepdim=True) # Normalize
        visual_embeddings.append(outputs.cpu())

if visual_embeddings:
    visual_features = torch.cat(visual_embeddings)
    torch.save(visual_features, os.path.join(DB_DIR, "visual_embeddings.pt"))
    print(f"Saved visual embeddings with shape {visual_features.shape}")
else:
    print("No frames to process.")


Extracting Visual Embeddings...
Saved visual embeddings with shape torch.Size([77, 768])
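
Since the saved frame embeddings are unit-normalized, relevance to a text query in Phase 2 can be scored with a plain dot product against a normalized CLIP text embedding. The following is only a sketch of that idea in plain Python with toy 3-dimensional vectors. The `select_keyframes` helper, the threshold value, and the vectors are all hypothetical, and the paper's actual selection rule may differ.

```python
def select_keyframes(frame_embs, query_emb, threshold):
    """Return (frame_idx, score) for frames whose dot product with the query meets the threshold."""
    selected = []
    for idx, emb in enumerate(frame_embs):
        score = sum(f * q for f, q in zip(emb, query_emb))
        if score >= threshold:
            selected.append((idx, score))
    return selected

# Toy unit-length vectors standing in for real 768-dimensional CLIP embeddings
frame_embs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_emb = [0.0, 1.0, 0.0]

keyframes = select_keyframes(frame_embs, query_emb, threshold=0.5)
print([idx for idx, _ in keyframes])   # [1, 2]
```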


## Phase 1 Complete
We have successfully:
1. Segmented the video.
2. Built an ASR Retrieval Database.
3. Built an OCR Retrieval Database.
4. Saved Visual Embeddings for future Object Detection steps.

Next Step: Phase 2 - Query Processing & Retrieval.