# Building Multimodal AI Applications with MongoDB and Voyage AI

In this codealong, you will learn how to build a multimodal AI agent from scratch using Voyage AI's multimodal embedding models, Google's multimodal LLMs, and MongoDB as a vector database and memory provider for the agent.

# Task 0: Install required libraries

- **pymongo**: Python driver for MongoDB
- **voyageai**: Python client for Voyage AI
- **google-genai**: Python library to access Google's embedding models and LLMs via Google AI Studio
- **PyMuPDF**: Python library for analyzing and manipulating PDFs
- **Pillow**: A Python imaging library
- **tqdm**: Show progress bars for loops in Python

In [3]:
!pip install -qU pymongo voyageai google-genai PyMuPDF Pillow tqdm

# Task 1: Setup prerequisites

- **MongoDB cluster setup**:
    - Register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register/?utm_campaign=devrel&utm_source=third-party-content&utm_medium=cta&utm_content=datacamp&utm_term=apoorva.joshi).
    - [Create a new database cluster](https://www.mongodb.com/docs/guides/atlas/cluster/?utm_campaign=devrel&utm_source=third-party-content&utm_medium=cta&utm_content=datacamp&utm_term=apoorva.joshi).
    - [Obtain the connection string for your database cluster](https://www.mongodb.com/docs/guides/atlas/connection-string/?utm_campaign=devrel&utm_source=third-party-content&utm_medium=cta&utm_content=datacamp&utm_term=apoorva.joshi).
- **Obtain a Voyage AI API key**: Follow the steps [here](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys) to get a Voyage AI API key.
- **Obtain a Gemini API key**: Follow the steps [here](https://ai.google.dev/gemini-api/docs/api-key) to get a Gemini API key via Google AI Studio.

In [4]:
import getpass
import os

In [5]:
# Set your MongoDB connection string
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string: ")

In [6]:
# Set Voyage AI API Key
os.environ["VOYAGE_API_KEY"] = getpass.getpass("Enter your Voyage AI API key: ")

In [7]:
# Set Gemini API Key
GEMINI_API_KEY = getpass.getpass("Enter your Gemini API key: ")
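
If you rerun the notebook often, typing each secret every time gets tedious. The sketch below checks an environment variable first and prompts only when it is missing. The helper name `get_secret` and the environment variable names in the usage note are hypothetical and not part of the original codealong.

```python
import getpass
import os


def get_secret(env_var: str, prompt: str) -> str:
    """Return the value of `env_var` if it is set, otherwise prompt for it without echoing."""
    value = os.environ.get(env_var)
    if not value:
        # Fall back to an interactive prompt, exactly like the cells above
        value = getpass.getpass(prompt)
    return value
```

For example, `MONGODB_URI = get_secret("MONGODB_URI", "Enter your MongoDB connection string: ")` would behave like the cell above when the variable is not set.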

In this codealong, we will build an AI agent that helps users make sense of large documents containing text interleaved with figures and tables. Real-world examples include financial reports, technical manuals, business proposals, product catalogs, and research papers. To represent this type of data, we will use the [DeepSeek-R1 paper](https://arxiv.org/pdf/2501.12948).

The goal of our agent will be two-fold:
- Answer questions about the paper
- Explain charts and diagrams found in the paper

Let's first preprocess the document in order to effectively retrieve information from it. Here's what the data processing workflow will look like:

![data_processing.png](data_processing.png)

- 1: Convert each document to a series of screenshots
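- 2: Save the screenshots to blob storage (AWS S3, GCS, etc.) and extract a unique identifier for each screenshot. To keep things simple, this codealong saves them to the local filesystem and uses the file name as the identifier.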
- 3-4: Pass the screenshots through Voyage AI's voyage-multimodal-3 model to generate embeddings
- 5: Insert documents consisting of the screenshot embeddings and metadata into MongoDB
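
For step 2, one possible way to derive a unique identifier is to combine a document name, the page number, and a short hash of the image bytes. This is only a sketch: the function `screenshot_key` and the key layout are assumptions made for illustration, and the codealong itself uses plain file names such as `1.png`.

```python
import hashlib


def screenshot_key(document_id: str, page_number: int, image_bytes: bytes) -> str:
    """Build a storage key that stays stable for identical page images (hypothetical layout)."""
    # A short content hash keeps keys unique even if page numbers are reused
    digest = hashlib.sha256(image_bytes).hexdigest()[:16]
    return f"{document_id}/page-{page_number:04d}-{digest}.png"


print(screenshot_key("deepseek-r1", 1, b"example image bytes"))
```

A content-based key also means re-uploading an unchanged page maps to the same object, which can help avoid duplicates in blob storage.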

# Task 2: Read PDF from URL

In [8]:
import pymupdf
import requests

In [9]:
# Download the DeepSeek paper
response = requests.get("https://arxiv.org/pdf/2501.12948")
if response.status_code != 200:
    raise ValueError(f"Failed to download PDF. Status code: {response.status_code}")
# Get the content of the response
pdf_stream = response.content
# Open the data in `pdf_stream` as a PDF document.
pdf = pymupdf.Document(stream=pdf_stream, filetype="pdf")

# Task 3: Store PDF images locally and extract metadata for MongoDB

In [10]:
from tqdm import tqdm

In [11]:
docs = []

In [12]:
zoom = 3.0
# Set image matrix dimensions
mat = pymupdf.Matrix(zoom, zoom)
# Iterate through the pages of the PDF
for n in tqdm(range(pdf.page_count)):
    temp = {}
    # Use the `get_pixmap` method to render the PDF page as a matrix of pixels as specified by the variable `mat`
    pix = pdf[n].get_pixmap(matrix=mat)
    # Store image locally
    key = f"{n+1}.png"
    pix.save(key)
    # Extract image metadata to be stored in MongoDB
    temp["key"] = key
    temp["width"] = pix.width
    temp["height"] = pix.height
    docs.append(temp)
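
To get a feel for what `zoom = 3.0` means: PDF page sizes are expressed in points (1/72 inch), so scaling both axes by the zoom factor multiplies the pixel dimensions and the effective resolution. The helper below is a back-of-the-envelope sketch (the name `rendered_size` is hypothetical, and the actual pixmap size may differ by a pixel due to rounding).

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple:
    """Estimate (width_px, height_px, dpi) for a page rendered with Matrix(zoom, zoom)."""
    dpi = POINTS_PER_INCH * zoom
    return round(width_pt * zoom), round(height_pt * zoom), dpi


# A US Letter page measures 612 x 792 points
print(rendered_size(612, 792, 3.0))  # prints (1836, 2376, 216.0)
```

Higher zoom values give sharper screenshots for the embedding model and the LLM, at the cost of larger files.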

In [13]:
# Ensure that all pages of the PDF were processed
len(docs)

In [14]:
# Preview a document 
docs[0]

# Task 4: Add embeddings to the MongoDB documents

In [15]:
from voyageai import Client
from PIL import Image

In [16]:
# Initialize the Voyage AI client
voyageai_client = Client()

In [17]:
# Helper function to generate embeddings using the voyage-multimodal-3 model
# input_type can be one of `document` or `query` depending on what you are embedding 
def get_embedding(data, input_type): 
    embedding = voyageai_client.multimodal_embed(
        inputs=[[data]], model="voyage-multimodal-3", input_type=input_type
    ).embeddings[0]
    return embedding

In [18]:
embedded_docs = []
for doc in tqdm(docs):
    # Open the image from file
    img = Image.open(doc["key"])
    # Add the embeddings to the document
    doc["embedding"] = get_embedding(img, "document")
    embedded_docs.append(doc)

In [19]:
# Preview a document
embedded_docs[0]
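
The loop above makes one API call per page. For longer documents, it may be worth grouping pages into batches. The helper below is a generic sketch; `batched` is a hypothetical name, and it assumes the embedding API accepts several inputs per request and that each batch stays within the provider's limits.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


print(list(batched(["1.png", "2.png", "3.png", "4.png", "5.png"], 2)))
```

Each batch could then be opened and sent in a single request, with the returned embeddings assigned back to the documents in the same order.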

# Task 5: Write documents into a MongoDB collection

In [20]:
from pymongo import MongoClient

In [21]:
# Create a MongoDB client
mongodb_client = MongoClient(
    MONGODB_URI, appname="devrel.datacamp_codealong"
)
# Check connection to the cluster
mongodb_client.admin.command("ping")

In [22]:
# Database name
DB_NAME = "datacamp"
# Name of the collection to insert documents into
COLLECTION_NAME = "multimodal_codealong"

In [23]:
# Connect to the MongoDB collection
collection = mongodb_client[DB_NAME][COLLECTION_NAME]

In [24]:
# Delete existing documents from the collection
collection.delete_many({})

In [25]:
# Insert the embedded documents into the collection
collection.insert_many(embedded_docs)

# Task 6: Create a vector search index

In [26]:
VS_INDEX_NAME = "vector_index"

In [27]:
# Create vector index definition specifying:
# path: Path to the embeddings field
# numDimensions: Number of embedding dimensions. This depends on the embedding model used
# similarity: Similarity metric. One of cosine, euclidean, dotProduct.
model = {
    "name": VS_INDEX_NAME,
    "type": "vectorSearch",
    "definition": {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1024,
                "similarity": "cosine",
            }
        ]
    },
}
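
To make the `cosine` setting concrete, here is a small pure-Python sketch of the metric. The `atlas_cosine_score` function reflects the score normalization described in the Atlas Vector Search documentation, `(1 + cosine) / 2`; treat it as an assumption and check the current documentation before relying on exact score values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def atlas_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed normalization)."""
    return (1 + cosine_similarity(a, b)) / 2


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # prints 0.0 (orthogonal vectors)
print(atlas_cosine_score([1.0, 0.0], [0.0, 1.0]))  # prints 0.5
```

Because cosine similarity ignores vector length, it compares the direction of two embeddings rather than their magnitude.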

In [27]:
# Create a vector search index with the above `model` for the `collection` collection
collection.create_search_index(model=model)

**NOTE**: Before proceeding further, navigate to the MongoDB Atlas UI and ensure that the vector search index is in READY status.
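
Instead of checking the UI, you could also poll for readiness from code. The sketch below assumes a PyMongo version that provides `Collection.list_search_indexes` and that each returned index document carries a `queryable` field. The function name, timeout, and polling interval are illustrative choices.

```python
import time


def wait_for_search_index(collection, index_name, timeout_s=300.0, poll_interval_s=5.0):
    """Poll until the named search index reports `queryable`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        # `list_search_indexes` returns a cursor over index description documents
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable"):
            return indexes[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Index {index_name!r} was not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)
```

In this notebook, the call would look like `wait_for_search_index(collection, VS_INDEX_NAME)`.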

Now that we have prepared our mixed-modality document for search and retrieval, let's build an AI agent that can retrieve information from text, images and tables in this document to help answer questions about the document. The workflow for the agent looks as follows:

![agent_workflow_1.png](agent_workflow_1.png)

![agent_workflow_2.png](agent_workflow_2.png)

- 1: User sends a query to the agent
- 2: Agent forwards the user query to an LLM
- 3: LLM decides whether to call a tool or not. If yes, the LLM also extracts the arguments for the tool call.
- 4: The agent calls the tool selected by the LLM using the arguments generated by the LLM
- 5: If the vector search tool is called, it returns the IDs of the screenshots
- 6-7: The agent obtains the relevant screenshots from blob storage using the IDs
- 8: The user query and the retrieved screenshots are passed to the LLM
- 9: The LLM generates an answer using the provided context
- 10: The answer is forwarded to the user 

# Task 8: Create agent tools

Tools for agents are simply functions written in a programming language of your choice!
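
Because tools are plain functions, an agent can keep them in a dictionary and look them up by the name the LLM returns. The sketch below illustrates that idea with a stand-in tool; `TOOLS`, `register_tool`, `call_tool`, and `echo_query` are hypothetical names and are not used elsewhere in this codealong.

```python
from typing import Any, Callable, Dict

# Registry mapping tool names to the functions that implement them
TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that registers a function under its own name."""
    TOOLS[func.__name__] = func
    return func


@register_tool
def echo_query(user_query: str) -> str:
    """Stand-in tool that returns the query unchanged."""
    return user_query


def call_tool(name: str, args: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with keyword arguments."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**args)


print(call_tool("echo_query", {"user_query": "hello"}))  # prints hello
```

With a single tool, a simple `if` on the tool name is enough; a registry like this mainly pays off as the number of tools grows.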

In [28]:
def get_information_for_question_answering(user_query: str):
    """
    Retrieve information using vector search to answer a user query.

    Args:
    user_query (str): The user's query string.
    """
    # Get query embedding using the `get_embedding` helper function
    query_embedding = get_embedding(user_query, "query")
    # Define the VS aggregation pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": VS_INDEX_NAME,
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,
                "limit": 2,
            }
        },
        {
            "$project": {
                "_id": 0,
                "key": 1,
                "width": 1,
                "height": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

    # Execute the aggregation `pipeline` against the `collection` collection and store the results in `results`
    results = collection.aggregate(pipeline)
    # Collect the keys (local file paths) of the matching screenshots
    keys = [result["key"] for result in results]
    print(f"Keys: {keys}")
    return keys

In [29]:
# Test the vector search tool with an example query
get_information_for_question_answering("What is the Pass@1 accuracy of Deepseek R1 on the MATH500 benchmark?")

In addition to defining the tool itself, you need to create function schemas to help the LLM identify what tools to use and the arguments for the tool calls.

In [30]:
# Define the function declaration for the `get_information_for_question_answering` function
get_information_for_question_answering_declaration = {
    "name": "get_information_for_question_answering",
    "description": "Retrieve information using vector search to answer a user query.",
    "parameters": {
        "type": "object",
        "properties": {
            "user_query": {
                "type": "string",
                "description": "Query string to use for vector search",
            }
        },
        "required": ["user_query"],
    },
}
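
Since the LLM generates the tool arguments, it can be useful to check them against the declaration before calling the tool. The following is a minimal sketch that covers only required fields and basic types; `validate_tool_args` and `JSON_TYPES` are hypothetical helpers, not part of the Gemini SDK.

```python
# Map JSON-schema type names to the Python types used for a basic check
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_tool_args(declaration: dict, args: dict) -> list:
    """Return a list of problems found in `args`; an empty list means the call looks valid."""
    parameters = declaration.get("parameters", {})
    properties = parameters.get("properties", {})
    problems = []
    for name in parameters.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = JSON_TYPES.get(properties[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"wrong type for argument: {name}")
    return problems


# Trimmed copy of the declaration above, repeated so the sketch runs on its own
example_declaration = {
    "name": "get_information_for_question_answering",
    "parameters": {
        "type": "object",
        "properties": {"user_query": {"type": "string"}},
        "required": ["user_query"],
    },
}

print(validate_tool_args(example_declaration, {"user_query": "What is the Pass@1 accuracy?"}))  # prints []
print(validate_tool_args(example_declaration, {"query": 42}))
```

The second call reports both a missing `user_query` and an unexpected `query` argument.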

# Task 9: Instantiate the Gemini LLM and client

In [31]:
from google import genai
from google.genai import types

In [32]:
# Gemini LLM to use
LLM = "gemini-2.5-flash"

In [33]:
# Instantiate the Gemini client
gemini_client = genai.Client(api_key=GEMINI_API_KEY)

# Task 10: Create generation config

For Gemini models, the generation config specifies parameters for the LLM generation.

This might look different for other LLM providers. Check the API documentation for the provider you are using to understand how to specify these parameters.

In [34]:
# Create a generation config with the `get_information_for_question_answering_declaration` function declaration and `temperature` set to 0.0
tools = types.Tool(
    function_declarations=[get_information_for_question_answering_declaration]
)
tools_config = types.GenerateContentConfig(tools=[tools], temperature=0.0)

# Task 11: Define a function for tool selection

Tool selection in the context of AI agents involves using LLMs to identify which tool to call and the arguments for the tool call. Note that the LLM doesn't execute the tool call. This needs to be implemented in the agent's code, as you will see in Task 12.

In [35]:
from google.genai.types import FunctionCall

In [36]:
# Function that uses an LLM to decide if a tool needs to be called
def select_tool(messages):
    system_prompt = [
        (
            "You're an AI assistant. Based on the given information, decide which tool to use. "
            "If the user is asking to explain an image, don't call any tools unless that would help you better explain the image. "
            "Here is the provided information:\n"
        )
    ]
    # Input to the LLM
    contents = system_prompt + messages
    # Use the `gemini_client`, `LLM`, `contents` and `tools_config` defined previously to generate a response using Gemini
    response = gemini_client.models.generate_content(
        model=LLM, contents=contents, config=tools_config
    )
    # Extract and return the function call from the response
    return response.candidates[0].content.parts[0].function_call
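
# Indexing `parts[0]` assumes the first part of the first candidate holds the function call.
# A more defensive variant could scan all parts and return the first function call it finds.
# This is only a sketch: `first_function_call` is a hypothetical helper, and the demo below
# uses stand-in objects rather than a real Gemini response.

```python
from types import SimpleNamespace


def first_function_call(response):
    """Return the first function call found in any part of any candidate, or None."""
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            call = getattr(part, "function_call", None)
            if call is not None:
                return call
    return None


# Stand-in response: a text-only part followed by a part carrying a function call
demo_call = SimpleNamespace(name="get_information_for_question_answering", args={"user_query": "hi"})
demo_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            content=SimpleNamespace(
                parts=[SimpleNamespace(function_call=None), SimpleNamespace(function_call=demo_call)]
            )
        )
    ]
)
print(first_function_call(demo_response).name)  # prints get_information_for_question_answering
```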

# Task 12: Define a function to execute tools and generate responses

In [37]:
def generate_answer(user_query, images):
    # Use the `select_tool` function above to get the tool call selected by the LLM, if any
    tool_call = select_tool([user_query])
    # If a tool call is found and the name is `get_information_for_question_answering`
    if (
        tool_call is not None
        and tool_call.name == "get_information_for_question_answering"
    ):
        print(f"Agent: Calling tool: {tool_call.name}")
        # Call the tool with the arguments extracted by the LLM
        tool_images = get_information_for_question_answering(**tool_call.args)
        # Add images returned by the tool to the list of input images if any
        images.extend(tool_images)

    system_prompt = "Answer the questions based on the provided context only. If the context is not sufficient, say I DON'T KNOW. DO NOT use any other information to answer the question."
    # Pass the system prompt, user query, and content retrieved using vector search (`images`) as input to the LLM
    contents = [system_prompt] + [user_query] + [Image.open(image) for image in images]

    # Get the response from the LLM
    response = gemini_client.models.generate_content(
        model=LLM,
        contents=contents,
        config=types.GenerateContentConfig(temperature=0.0),
    )
    answer = response.text
    return answer

# Task 13: Execute the agent

In [38]:
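def execute_agent(user_query, images=None):
    # Copy the input so that neither a shared default nor the caller's list is mutated
    # when `generate_answer` appends the images returned by the tool
    images = list(images) if images else []
    # Use the `generate_answer` function to generate responses
    response = generate_answer(user_query, images)
    print("Agent:", response)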

In [39]:
# Test the agent with a text input
execute_agent("What is the Pass@1 accuracy of Deepseek R1 on the MATH500 benchmark?")

In [40]:
# Test the agent with an image input
execute_agent("Explain the graph in this image:", ["test.png"])

# Task 14: Add memory to the agent

Memory is important for agents to learn from past interactions and maintain consistent, coherent dialogue across conversations. 

In this codealong, we will use MongoDB to persist and manage short-term memory of the agent. The memory management workflow of our agent looks as follows:

![memory_mgmt_1.png](memory_mgmt_1.png)

![memory_mgmt_2.png](memory_mgmt_2.png)

- 1: Get the session ID that the user question belongs to
- 2: Retrieve session history from MongoDB using the session ID
- 3: Pass the session history as additional context to the LLM
- 4: Add the current question to the session history
- 5: Add the LLM's answer to the session history
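
The steps above can be sketched without a database or an LLM. The example below uses an in-memory list in place of MongoDB and a stand-in callable in place of the model; `InMemoryHistory`, `answer_with_memory`, and `fake_llm` are hypothetical names used only for illustration.

```python
from datetime import datetime, timezone


class InMemoryHistory:
    """Toy stand-in for the MongoDB history collection."""

    def __init__(self):
        self._messages = []

    def store(self, session_id, role, content):
        self._messages.append(
            {
                "session_id": session_id,
                "role": role,
                "content": content,
                "timestamp": datetime.now(timezone.utc),
            }
        )

    def retrieve(self, session_id):
        session = [m for m in self._messages if m["session_id"] == session_id]
        # Oldest first, mirroring the ascending sort on `timestamp`
        return [m["content"] for m in sorted(session, key=lambda m: m["timestamp"])]


def answer_with_memory(history, session_id, user_query, llm):
    past = history.retrieve(session_id)            # Steps 1-2: load the session history
    answer = llm(past + [user_query])              # Step 3: pass the history as context
    history.store(session_id, "user", user_query)  # Step 4: save the question
    history.store(session_id, "agent", answer)     # Step 5: save the answer
    return answer


def fake_llm(contents):
    """Stand-in model that reports how many messages it received."""
    return f"seen {len(contents)} messages"


history = InMemoryHistory()
print(answer_with_memory(history, "1", "first question", fake_llm))   # prints seen 1 messages
print(answer_with_memory(history, "1", "second question", fake_llm))  # prints seen 3 messages
```

On the second turn, the stand-in model receives the first question, the first answer, and the new question, which is exactly the extra context that memory provides.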


In [41]:
from datetime import datetime

In [42]:
# Instantiate the history collection
history_collection = mongodb_client[DB_NAME]["history"]

In [43]:
# Create an index on `session_id` on the `history_collection` collection
history_collection.create_index("session_id")

In [44]:
def store_chat_message(session_id, role, type, content):
    # Create a message object with `session_id`, `role`, `type`, `content` and `timestamp` fields
    # `timestamp` should be set to the current timestamp
    message = {
        "session_id": session_id,
        "role": role,
        "type": type,
        "content": content,
        "timestamp": datetime.now(),
    }
    # Insert the `message` into the `history_collection` collection
    history_collection.insert_one(message)
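
One detail worth knowing about timestamps: `datetime.now()` returns a naive local time, while PyMongo treats naive datetimes as UTC when storing them. If you want stored timestamps to be unambiguous, a timezone-aware value is a safer choice. The helper name `utc_now` below is hypothetical.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


print(utc_now().isoformat())
```

Since all messages in a session are written by the same process, the ordering used for retrieval works either way; the difference matters mainly when the timestamps are read by other systems.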

In [45]:
def retrieve_session_history(session_id):
    # Query the `history_collection` collection for documents where the "session_id" field has the value of the input `session_id`
    # Sort the results in increasing order of the values in `timestamp` field
    cursor = history_collection.find({"session_id": session_id}).sort("timestamp", 1)
    messages = []
    if cursor:
        for msg in cursor:
            # If the message type is `text`, append the content as is
            if msg["type"] == "text":
                messages.append(msg["content"])
            # If message type is `image`, open the image
            elif msg["type"] == "image":
                messages.append(Image.open(msg["content"]))
    return messages

In [46]:
def generate_answer(session_id, user_query, images):
    # Retrieve past conversation history for the specified `session_id` using the `retrieve_session_history` method
    print("Retrieving chat history...")
    history = retrieve_session_history(session_id)
    # Determine if any additional tools need to be called
    tool_call = select_tool(history + [user_query])
    if (
        tool_call is not None
        and tool_call.name == "get_information_for_question_answering"
    ):
        print(f"Agent: Calling tool: {tool_call.name}")
        # Call the tool with the arguments extracted by the LLM
        tool_images = get_information_for_question_answering(**tool_call.args)
        # Add images returned by the tool to the list of input images if any
        images.extend(tool_images)

    # Pass the system prompt, conversation history, user query and retrieved context (`images`) to the LLM to generate an answer
    system_prompt = "Answer the questions based on the provided context only. If the context is not sufficient, say I DON'T KNOW. DO NOT use any other information to answer the question."
    contents = (
        [system_prompt]
        + history
        + [user_query]
        + [Image.open(image) for image in images]
    )
    # Get a response from the LLM
    response = gemini_client.models.generate_content(
        model=LLM,
        contents=contents,
        config=types.GenerateContentConfig(temperature=0.0),
    )
    answer = response.text
    # Write the current user query to memory using the `store_chat_message` function
    # The `role` for user queries is "user" and `type` is "text"
    print("Updating chat history...")
    store_chat_message(session_id, "user", "text", user_query)
    # Write the filepaths of input/retrieved images to memory using the `store_chat_message` function
    # The `role` for these is "user" and `type` is "image"
    for image in images:
        store_chat_message(session_id, "user", "image", image)
    # Write the LLM generated response to memory
    # The `role` for these is "agent" and `type` is "text"
    store_chat_message(session_id, "agent", "text", answer)
    return answer

In [47]:
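def execute_agent_with_memory(session_id, user_query, images=None):
    # Copy the input so that neither a shared default nor the caller's list is mutated
    # when `generate_answer` appends the images returned by the tool
    images = list(images) if images else []
    response = generate_answer(session_id, user_query, images)
    print("Agent:", response)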

In [48]:
execute_agent_with_memory(
    "1",
    "What is the Pass@1 accuracy of Deepseek R1 on the MATH500 benchmark?",
)

In [49]:
# Follow-up question to make sure chat history is being used.
execute_agent_with_memory(
    "1",
    "What question did I just ask you?",
)