[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/agentic_video_search.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/company/blog/technical/agentic-video-search/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)

# Building an Agentic Video Search System using Voyage AI and MongoDB

## Step 1: Install required packages

- **voyageai**: Voyage AI's Python SDK
- **pymongo**: MongoDB's Python driver
- **anthropic**: Anthropic's Python SDK
- **huggingface_hub**: Python library for interacting with the Hugging Face Hub
- **ffmpeg-python**: Python wrapper for `ffmpeg`
- **tqdm**: Python library to display progress bars for loops

In [1]:
!pip install -qU voyageai==0.3.7 pymongo==4.15.5 anthropic==0.75.0 huggingface-hub==1.2.3 ffmpeg-python==0.2.0 tqdm==4.67.1

You'll also need to install the `ffmpeg` binary itself. To do this, follow the steps for your operating system below and note the path to the `ffmpeg` installation:

#### macOS

```
brew install ffmpeg
```

#### Linux (Debian/Ubuntu)

```
sudo apt-get install ffmpeg
```

#### Windows
* Download the executable from [ffmpeg.org](https://ffmpeg.org/download.html#build-windows)
* Extract the downloaded zip file
* Note the path to the `bin` folder

## Step 2: Set up prerequisites

**Voyage AI**
- [Obtain a Voyage AI API key](https://dashboard.voyageai.com/organization/api-keys)

**MongoDB**
- Register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register)
- [Create a new database cluster](https://www.mongodb.com/docs/guides/atlas/cluster/)
- [Obtain the connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/) for your database cluster

**Anthropic**
- [Obtain an Anthropic API key](https://platform.claude.com/settings/keys)

In [2]:
import getpass
import os

import anthropic
import voyageai
from pymongo import MongoClient

In [171]:
# Set Voyage API key as an environment variable
os.environ["VOYAGE_API_KEY"] = getpass.getpass("Enter your Voyage API key:")
# Initialize the Voyage AI client
voyage_client = voyageai.Client()

Enter your Voyage API key: ········


In [4]:
# Set the MongoDB connection string
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")
# Initialize the MongoDB client
mongodb_client = MongoClient(
    MONGODB_URI, appname="devrel.showcase.agentic_video_search"
)
# Check MongoDB connection
mongodb_client.admin.command("ping")

Enter your MongoDB connection string: ········


{'ok': 1.0,
 '$clusterTime': {'clusterTime': Timestamp(1767387291, 1),
  'signature': {'hash': b'\xf8\xbcI\xcf\x81DR\xc1\xcdO\xcf\xa8\x1d\xc9\x1do\x14dH\xf2',
   'keyId': 7558184680432861186}},
 'operationTime': Timestamp(1767387291, 1)}

In [5]:
# Set Anthropic API key as an environment variable
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter your Anthropic API key:")
# Initialize the Anthropic client
anthropic_client = anthropic.Anthropic()

Enter your Anthropic API key: ········


In [17]:
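# Make ffmpeg accessible from the notebook
# Replace /path/to/ffmpeg with the directory that contains the ffmpeg binary
# (on Windows, the `bin` folder noted above)
# os.pathsep is ":" on macOS/Linux and ";" on Windows
os.environ["PATH"] = f"/path/to/ffmpeg{os.pathsep}{os.environ['PATH']}"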
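Before segmenting any videos, it can help to confirm that the notebook can actually find the `ffmpeg` binary. A minimal sketch using the standard library's `shutil.which`; the helper name `find_ffmpeg` is hypothetical and not part of the original notebook.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    # PATH does not contain a directory with an ffmpeg executable
    print("ffmpeg not found on PATH; check the directory added above")
else:
    print(f"Found ffmpeg at {ffmpeg_path}")
```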

## Step 3: Download the dataset

In [172]:
from huggingface_hub import snapshot_download

In [173]:
data_dir = snapshot_download(
    repo_id="MongoDB/cooking-videos-with-captions",
    repo_type="dataset",
    local_dir="./videos/",
)


## Step 4: Segment the videos using captions

`voyage-multimodal-3.5` has a 32k-token limit and a 20 MB file size limit for video inputs. When working with large videos, split them into smaller segments before embedding so that each one stays within these limits. Splitting videos at natural breaks in the captions or transcript keeps related frames together, which results in more focused embeddings.
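The code below cuts one segment per caption. If captions are very short, one option is to merge consecutive captions until a segment reaches a maximum duration, so cuts still land on caption boundaries. A minimal sketch, assuming captions carry the same `start`, `end`, and `text` fields as the dataset's JSON files and using duration as a rough stand-in for the token and file-size limits; `group_captions` and `max_duration` are hypothetical.

```python
def group_captions(captions: list[dict], max_duration: float) -> list[dict]:
    """Merge consecutive captions into segments spanning at most max_duration seconds.

    A single caption longer than max_duration is kept as its own segment.
    """
    segments = []
    current = None
    for cap in captions:
        if current is None:
            current = {"start": cap["start"], "end": cap["end"], "text": cap["text"]}
        elif cap["end"] - current["start"] <= max_duration:
            # Extend the current segment to the end of this caption
            current["end"] = cap["end"]
            current["text"] += " " + cap["text"]
        else:
            # Close the current segment and start a new one at this caption
            segments.append(current)
            current = {"start": cap["start"], "end": cap["end"], "text": cap["text"]}
    if current is not None:
        segments.append(current)
    return segments
```

Each merged segment could then be cut with the same `ffmpeg.input(..., ss=..., to=...)` call used below.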

In [98]:
import glob
import json

import ffmpeg

In [99]:
# Create directory to store video segments
segments_dir = "./videos/segments"
os.makedirs(segments_dir, exist_ok=True)

In [100]:
num_videos = len(glob.glob(os.path.join(data_dir, "video_*.mp4")))
num_videos

4
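Looping over `range(num_videos)` assumes the files are numbered contiguously from `video_000`. If that might not hold, a sketch that derives the IDs from the filenames themselves and skips any video without a matching captions file; `list_video_ids` is a hypothetical helper.

```python
from pathlib import Path


def list_video_ids(data_dir: str) -> list[str]:
    """Return sorted video IDs (file stems) that have both an .mp4 and a .json file."""
    root = Path(data_dir)
    ids = []
    # Non-recursive glob, so files in subdirectories such as segments/ are ignored
    for video_file in sorted(root.glob("video_*.mp4")):
        if video_file.with_suffix(".json").exists():
            ids.append(video_file.stem)
    return ids
```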

In [101]:
docs = []
for num in range(num_videos):
    video_id = f"video_{num:03d}"
    video_path = os.path.join(data_dir, f"{video_id}.mp4")
    captions_path = os.path.join(data_dir, f"{video_id}.json")

    # Load captions
    with open(captions_path) as f:
        data = json.load(f)

    captions = data["captions"]
    title = data["title"]

    # Segment the video based on captions
    for i, caption in enumerate(captions):
        segment_id = f"segment_{i:03d}"
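        # Cut the segment between the caption's start and end times.
        # c="copy" copies the streams without re-encoding, which is fast,
        # but cut points may snap to keyframes instead of exact timestamps.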
        output_file = os.path.join(segments_dir, f"{video_id}_{segment_id}.mp4")
        (
            ffmpeg.input(video_path, ss=caption["start"], to=caption["end"])
            .output(output_file, c="copy")
            .overwrite_output()
            .run(quiet=True)
        )
        # Create segment document to write to MongoDB
        doc = {
            "segment_id": segment_id,
            "video_id": video_id,
            "caption": caption["text"],
            "metadata": {
                "video_title": title,
                "start": caption["start"],
                "end": caption["end"],
            },
        }
        docs.append(doc)

In [102]:
# Preview a segment doc
docs[0]

{'segment_id': 'segment_000',
 'video_id': 'video_000',
 'caption': 'Chef Marguerite Dubois, wearing her signature striped apron, rolls out the laminated croissant dough using a wooden rolling pin on a granite countertop dusted with flour.',
 'metadata': {'video_title': 'Classic French Croissants with Chef Marguerite Dubois',
  'start': 0,
  'end': 7}}
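Step 4 mentions a 20 MB file size limit for video inputs. One way to catch oversized segments before embedding is to check each file's size on disk. A minimal sketch; the helper name is hypothetical, and reading "20 MB" as 20,000,000 bytes is a conservative assumption, so confirm the exact limit in the Voyage AI documentation.

```python
import os

# Assumption: conservative decimal reading of the 20 MB limit
MAX_VIDEO_BYTES = 20_000_000


def oversized_segments(paths: list[str], max_bytes: int = MAX_VIDEO_BYTES) -> list[str]:
    """Return the paths whose file size on disk exceeds max_bytes."""
    return [p for p in paths if os.path.getsize(p) > max_bytes]
```

For example, `oversized_segments(glob.glob(os.path.join(segments_dir, "*.mp4")))` would list any segment that needs to be split further.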

## Step 5: Embed the video segments

In [103]:
from tqdm import tqdm
from voyageai.video_utils import Video

In [104]:
MODEL_NAME = "voyage-multimodal-3.5"

In [189]:
def generate_embeddings(inputs: list[list], input_type: str) -> list[list]:
    """
    Generate embeddings using Voyage AI's multimodal embedding model (MODEL_NAME).

    Args:
        inputs (list[list]): Inputs as a list of lists; each inner list is one input that can mix text and video
        input_type (str): Type of input. Can be one of "document" or "query"

    Returns:
        list[list]: List of embeddings
    """
    embeddings = voyage_client.multimodal_embed(
        inputs=inputs, model=MODEL_NAME, input_type=input_type
    ).embeddings
    return embeddings

In [107]:
for doc in tqdm(docs):
    video_obj = Video.from_path(
        path=f"{segments_dir}/{doc['video_id']}_{doc['segment_id']}.mp4",
        model=MODEL_NAME,
    )
    # Embed the video segment and its caption together
    embeddings = generate_embeddings([[video_obj, doc["caption"]]], "document")
    # Add the embedding to the MongoDB document
    doc["embedding"] = embeddings[0]


100%|██████████| 17/17 [02:02<00:00,  7.18s/it]


In [109]:
# Ensure that embeddings were added to the MongoDB docs
docs[0].keys()

dict_keys(['segment_id', 'video_id', 'caption', 'metadata', 'embedding'])
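The embedding loop makes one `multimodal_embed` call per segment. Because the method takes a list of inputs, several segments could be sent per request instead. A sketch of a hypothetical `batched` helper (Python 3.12's `itertools.batched` is similar but yields tuples); the batch size of 4 in the commented usage is an arbitrary assumption, and with video inputs each request must still respect the API's per-request limits.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# Hypothetical usage with the notebook's objects:
# for batch in batched(docs, 4):
#     inputs = [
#         [Video.from_path(path=f"{segments_dir}/{d['video_id']}_{d['segment_id']}.mp4", model=MODEL_NAME), d["caption"]]
#         for d in batch
#     ]
#     for d, emb in zip(batch, generate_embeddings(inputs, "document")):
#         d["embedding"] = emb
```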

## Step 6: Ingest documents into MongoDB

In [110]:
db = mongodb_client["video_search"]

In [111]:
collection = db["segments"]

In [112]:
# Delete existing documents from collection
collection.delete_many({})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff0000000000000048'), 'opTime': {'ts': Timestamp(1767391621, 1), 't': 72}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1767391621, 1), 'signature': {'hash': b'\x01)\xa3v^\x13N\xb8\xc7Ny\x97\xf0\xa5\x885\x92?M\xcd', 'keyId': 7558184680432861186}}, 'operationTime': Timestamp(1767391621, 1)}, acknowledged=True)

In [113]:
# Insert `docs` into the collection
collection.insert_many(docs)

InsertManyResult([ObjectId('695841876d5b2abc43875acc'), ObjectId('695841876d5b2abc43875acd'), ObjectId('695841876d5b2abc43875ace'), ObjectId('695841876d5b2abc43875acf'), ObjectId('695841876d5b2abc43875ad0'), ObjectId('695841876d5b2abc43875ad1'), ObjectId('695841876d5b2abc43875ad2'), ObjectId('695841876d5b2abc43875ad3'), ObjectId('695841876d5b2abc43875ad4'), ObjectId('695841876d5b2abc43875ad5'), ObjectId('695841876d5b2abc43875ad6'), ObjectId('695841876d5b2abc43875ad7'), ObjectId('695841876d5b2abc43875ad8'), ObjectId('695841876d5b2abc43875ad9'), ObjectId('695841876d5b2abc43875ada'), ObjectId('695841876d5b2abc43875adb'), ObjectId('695841876d5b2abc43875adc')], acknowledged=True)

## Step 7: Create search indexes

In [114]:
from pymongo.operations import SearchIndexModel

In [115]:
# Full-text search index definition
fts_model = SearchIndexModel(
    name="fts-index",
    definition={
        "mappings": {"dynamic": False, "fields": {"caption": {"type": "string"}}}
    },
)

In [116]:
# Vector search index definition
vs_model = SearchIndexModel(
    name="vector-index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1024,
                "similarity": "cosine",
            }
        ]
    },
)
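The index definition declares 1024 dimensions and cosine similarity, so every stored `embedding` must have exactly 1024 values. A sketch that checks embedding lengths before ingestion and shows what cosine similarity computes; both helpers are hypothetical, and Atlas normalizes the `vectorSearchScore` it returns, so these raw values are not directly comparable to its scores.

```python
import math


def check_dimensions(docs: list[dict], expected: int = 1024) -> list[str]:
    """Return 'video_id/segment_id' for docs whose embedding length differs from expected."""
    return [
        f"{d['video_id']}/{d['segment_id']}"
        for d in docs
        if len(d.get("embedding", [])) != expected
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```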

In [117]:
collection.create_search_indexes([fts_model, vs_model])

['fts-index', 'vector-index']
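`create_search_indexes` returns once the index definitions are submitted; Atlas builds the indexes in the background, and queries may return no results until they are ready. A sketch that polls `Collection.list_search_indexes()` until every named index reports `queryable: True`; the function name, timeout, and polling interval are assumptions.

```python
import time
from typing import Iterable


def wait_for_search_indexes(
    collection, names: Iterable[str], timeout: float = 300, interval: float = 5
) -> bool:
    """Poll until all named search indexes are queryable; return False on timeout."""
    pending = set(names)
    deadline = time.monotonic() + timeout
    while True:
        ready = {
            idx["name"]
            for idx in collection.list_search_indexes()
            if idx.get("queryable")
        }
        if pending <= ready:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook this would be called as `wait_for_search_indexes(collection, ["fts-index", "vector-index"])` before running the searches in Step 8.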

## Step 8: Define search functions

In [162]:
def format_time(seconds: int) -> str:
    """
    Format a timestamp in seconds as min:sec.

    Args:
        seconds (int): Time in seconds

    Returns:
        str: Formatted timestamp
    """
    mins = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{mins}:{secs:02d}"

In [194]:
def vector_search(query: str) -> None:
    """
    Retrieve relevant video segments using vector search.

    Args:
        query (str): User query string
    """
    query_embedding = generate_embeddings([[query]], "query")[0]
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector-index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 200,
                "limit": 3,
            }
        },
        {
            "$project": {
                "_id": 0,
                "video_title": "$metadata.video_title",
                "start": "$metadata.start",
                "end": "$metadata.end",
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

    results = collection.aggregate(pipeline)
    for result in results:
        print(
            f"{result.get('video_title')} ({format_time(result.get('start'))} - {format_time(result.get('end'))})"
        )
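In `$vectorSearch`, `numCandidates` sets how many nearest-neighbor candidates are considered before the top `limit` results are returned, so it must be at least `limit`. To restrict results to one video, `$vectorSearch` also accepts a `filter` on fields indexed with type `filter`. A sketch of a hypothetical pipeline builder, assuming `metadata.video_title` is added to `vector-index` as `{"type": "filter", "path": "metadata.video_title"}`, which the index defined in Step 7 does not do.

```python
from typing import Optional


def build_vector_pipeline(
    query_embedding: list[float],
    limit: int = 3,
    num_candidates: int = 200,
    video_title: Optional[str] = None,
) -> list[dict]:
    """Build a $vectorSearch pipeline, optionally pre-filtered by video title."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least limit")
    stage = {
        "index": "vector-index",
        "queryVector": query_embedding,
        "path": "embedding",
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if video_title is not None:
        # Assumes metadata.video_title is indexed as a "filter" field
        stage["filter"] = {"metadata.video_title": {"$eq": video_title}}
    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "_id": 0,
                "video_title": "$metadata.video_title",
                "start": "$metadata.start",
                "end": "$metadata.end",
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```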

In [201]:
def hybrid_search(query: str) -> None:
    """
    Retrieve relevant video segments using hybrid search.

    Args:
        query (str): User query string
    """
    query_embedding = generate_embeddings([[query]], "query")[0]
    pipeline = [
        {
            "$rankFusion": {
                "input": {
                    "pipelines": {
                        "vector_pipeline": [
                            {
                                "$vectorSearch": {
                                    "index": "vector-index",
                                    "path": "embedding",
                                    "queryVector": query_embedding,
                                    "numCandidates": 200,
                                    "limit": 10,
                                }
                            }
                        ],
                        "fts_pipeline": [
                            {
                                "$search": {
                                    "index": "fts-index",
                                    "text": {"query": query, "path": "caption"},
                                }
                            },
                            {"$limit": 10},
                        ],
                    }
                },
                "combination": {
                    "weights": {"vector_pipeline": 0.5, "fts_pipeline": 0.5}
                },
                "scoreDetails": True,
            }
        },
        {
            "$project": {
                "_id": 0,
                "video_title": "$metadata.video_title",
                "start": "$metadata.start",
                "end": "$metadata.end",
                # Score details are exposed as metadata, not as a document field
                "scoreDetails": {"$meta": "scoreDetails"},
            }
        },
        {"$limit": 3},
    ]

    results = collection.aggregate(pipeline)
    for result in results:
        print(
            f"{result.get('video_title')} ({format_time(result.get('start'))} - {format_time(result.get('end'))})"
        )
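`$rankFusion` merges the two result lists with reciprocal rank fusion: each document is scored by its rank in each pipeline rather than by the raw vector or full-text scores, which sit on different scales. A plain-Python sketch of weighted RRF to illustrate the idea, assuming the common constant `k = 60` and 1-based ranks; MongoDB's internal implementation may differ in details, and `weighted_rrf` is hypothetical.

```python
from collections import defaultdict


def weighted_rrf(
    ranked_lists: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over pipelines p of weights[p] / (k + rank_p(d))."""
    scores: dict[str, float] = defaultdict(float)
    for name, ids in ranked_lists.items():
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += weights.get(name, 1.0) / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that ranks well in both pipelines, like `b` in `weighted_rrf({"vector": ["a", "b"], "fts": ["b"]}, {"vector": 0.5, "fts": 0.5})`, outranks one that appears in only a single list. In `hybrid_search`, each pipeline contributes up to 10 ranked segments and the final `$limit` keeps the top 3.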

In [196]:
vector_search("Rolling croissant dough")

Classic French Croissants with Chef Marguerite Dubois (0:24 - 0:37)
Classic French Croissants with Chef Marguerite Dubois (0:59 - 1:01)
Classic French Croissants with Chef Marguerite Dubois (0:00 - 0:07)


In [202]:
hybrid_search("Coil fold technique")

Artisan Sourdough Bread Folding Technique (0:10 - 0:18)
Artisan Sourdough Bread Folding Technique (0:19 - 0:20)
Classic French Croissants with Chef Marguerite Dubois (0:24 - 0:37)


## Step 9: Build the agentic search pipeline

In [125]:
# Define structured output schema
output_schema = {
    "type": "object",
    "properties": {"search": {"type": "string", "enum": ["vector", "hybrid"]}},
    "required": ["search"],
    "additionalProperties": False,
}

In [127]:
SYSTEM_PROMPT = """Given a query, choose the optimal search strategy to retrieve the most relevant video segments for it: 

vector
- Best for: Visual actions and details, methods, concepts or general descriptions.
- Examples: "How to chop onions", "Grilling vegetables"
- Uses: Multimodal embeddings that capture both video and caption meaning

hybrid
- Best for: Specific names and terms such as techniques, chef names, dietary restrictions etc.
- Examples: "Coil fold technique", "Egg wash ingredients"

Default to vector unless exact word matching is critical."""

In [182]:
def get_search_type(query: str) -> str:
    """
    Use an LLM to determine the search strategy based on the query.

    Args:
        query (str): User query string

    Returns:
        str: Search type. One of "vector" or "hybrid"
    """
    print("Determining search type...")
    response = anthropic_client.beta.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=50,
        temperature=0,
        betas=["structured-outputs-2025-11-13"],
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Query: {query}"}],
        output_format={"type": "json_schema", "schema": output_schema},
    )
    search_type = json.loads(response.content[0].text).get("search", "unknown")
    print(f"Using search type: {search_type}")
    return search_type
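With structured outputs, the model's reply should already match `output_schema`, but `get_search_type` returns `"unknown"` if the key is missing and `search` then prints an error. Since the system prompt already says to default to vector search, one alternative is to fall back to `"vector"` whenever the reply cannot be parsed or falls outside the enum. A sketch with a hypothetical `parse_search_type` helper:

```python
import json

VALID_SEARCH_TYPES = ("vector", "hybrid")


def parse_search_type(raw_text: str, default: str = "vector") -> str:
    """Parse the model's JSON reply, falling back to `default` on any invalid value."""
    try:
        value = json.loads(raw_text).get("search")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object (e.g. a list or string)
        return default
    return value if value in VALID_SEARCH_TYPES else default
```

Inside `get_search_type`, this would replace the `json.loads(...)` line as `search_type = parse_search_type(response.content[0].text)`.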

In [183]:
def search(query: str) -> None:
    """
    Given a query, determine the search type and execute the search.

    Args:
        query (str): User query string
    """
    search_type = get_search_type(query)
    if search_type == "vector":
        vector_search(query)
    elif search_type == "hybrid":
        hybrid_search(query)
    else:
        print(f"Not a supported search type: {search_type}")
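As more strategies are added, the `if`/`elif` chain in `search` grows with them. A sketch of the same routing with a dictionary that maps each search type to its function; `dispatch_search` is a hypothetical name, and the lambdas stand in for `vector_search` and `hybrid_search` so the example runs on its own.

```python
from typing import Callable


def dispatch_search(
    query: str, search_type: str, handlers: dict[str, Callable[[str], None]]
) -> bool:
    """Run the handler registered for search_type; return False if none is registered."""
    handler = handlers.get(search_type)
    if handler is None:
        print(f"Not a supported search type: {search_type}")
        return False
    handler(query)
    return True


# Stand-ins for vector_search / hybrid_search that record each call
calls = []
handlers = {
    "vector": lambda q: calls.append(("vector", q)),
    "hybrid": lambda q: calls.append(("hybrid", q)),
}
dispatch_search("Coil fold technique", "hybrid", handlers)
```

In the notebook, the handlers would be `{"vector": vector_search, "hybrid": hybrid_search}`, and `search(query)` reduces to `dispatch_search(query, get_search_type(query), handlers)`.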

In [184]:
search("Rolling croissant dough")

Determining search type...
Using search type: vector
Classic French Croissants with Chef Marguerite Dubois (0:24 - 0:37)
Classic French Croissants with Chef Marguerite Dubois (0:59 - 1:01)
Classic French Croissants with Chef Marguerite Dubois (0:00 - 0:07)


In [203]:
search("Coil fold technique")

Determining search type...
Using search type: hybrid
Artisan Sourdough Bread Folding Technique (0:10 - 0:18)
Artisan Sourdough Bread Folding Technique (0:19 - 0:20)
Classic French Croissants with Chef Marguerite Dubois (0:24 - 0:37)
