# 🚀 Multimodal Agents Workshop - Simplified Version

This workshop demonstrates how to build **multimodal AI agents** using the `pymongo-voyageai-multimodal` library, which simplifies combining:

- **MongoDB Atlas Vector Search**
- **Voyage AI's `voyage-multimodal-3` embeddings**
- **Google Gemini 2.0 for AI agent capabilities**
- **AWS S3 for document storage (Optional)**

## 🎯 What's Different?

The `pymongo-voyageai-multimodal` library handles:
- ✅ Automatic PDF processing from URLs (no S3 required!)
- ✅ Embedding generation with Voyage AI
- ✅ Vector index management
- ✅ Simplified similarity search
- 🎁 **BONUS**: S3 integration for advanced use cases (optional)

## 📋 Prerequisites

### Required Dependencies:
```bash
pip install pymongo-voyageai-multimodal google-genai tqdm python-dotenv
```

### Required Environment Variables (create a `.env` file):
```
# Required for core functionality
MONGODB_ATLAS_CONNECTION_STRING=your_mongodb_uri
VOYAGEAI_API_KEY=your_voyage_api_key
GOOGLE_API_KEY=your_gemini_api_key
```

### Optional Environment Variables:
```
# Only needed for S3 features (completely optional!)
S3_BUCKET_NAME=your_s3_bucket
```
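
The code cells in this workshop call `show_info`, `show_success`, `show_warning`, and `show_error`, which are not defined in this section and are presumably provided by a setup cell. If you run the cells in isolation, a minimal stand-in could look like the sketch below. The text prefixes and the plain `print` output are assumptions, not the workshop's actual helpers.

```python
def _show(prefix: str, message: str) -> str:
    """Print a prefixed status line and return it (returning it eases testing)."""
    line = f"{prefix} {message}"
    print(line)
    return line


def show_info(message: str) -> str:
    return _show("[INFO]", message)


def show_success(message: str) -> str:
    return _show("[OK]", message)


def show_warning(message: str) -> str:
    return _show("[WARN]", message)


def show_error(message: str) -> str:
    return _show("[ERROR]", message)
```

Because each helper only takes a message string, these stand-ins can be swapped for richer HTML-based versions without changing any calling code.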

## 💡 Core Workshop vs Optional S3 Features

### ✅ Works Without S3:
- Direct PDF URL processing (e.g., arXiv papers)
- All vector search functionality
- AI agent creation
- Text document processing
- All learning objectives

### 🎁 S3 Bonus Features:
- Load PDFs from private S3 buckets
- Process documents stored in AWS
- Enterprise-grade document storage

## Step 1: Initialize the Client

In [None]:
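import os

from dotenv import load_dotenv
from pymongo_voyageai_multimodal import PyMongoVoyageAI

# Load variables from the .env file into the process environment
load_dotenv()

# Note: show_info, show_success, show_warning, and show_error are display
# helpers that must be defined before this cell runs.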

# Check required environment variables
required_vars = ["MONGODB_ATLAS_CONNECTION_STRING", "VOYAGEAI_API_KEY"]
optional_vars = ["S3_BUCKET_NAME"]  # S3 is optional!

missing_required = [var for var in required_vars if not os.getenv(var)]
missing_optional = [var for var in optional_vars if not os.getenv(var)]

if missing_required:
    show_error(f"Missing REQUIRED environment variables: {missing_required}")
    show_warning("Please set these in your .env file")
else:
    show_success("All required environment variables found!")
    
    if missing_optional:
        show_warning(f"Optional variables not set: {missing_optional}")
        show_info("S3 functionality will be disabled, but you can still use direct URL fetching!")
    
    # Initialize the client (S3 is optional)
    client = PyMongoVoyageAI.from_connection_string(
        connection_string=os.environ["MONGODB_ATLAS_CONNECTION_STRING"],
        database_name="multimodal_lab_simplified",
        collection_name="documents",
        s3_bucket_name=os.environ.get("S3_BUCKET_NAME"),  # Optional
        voyageai_api_key=os.environ["VOYAGEAI_API_KEY"],
    )
    
    show_success("PyMongoVoyageAI client initialized successfully!")
    show_info("Database: multimodal_lab_simplified")
    show_info("Collection: documents")
    
    if os.environ.get("S3_BUCKET_NAME"):
        show_info(f"S3 Bucket: {os.environ['S3_BUCKET_NAME']} (enabled)")
    else:
        show_info("S3 Support: Disabled (using direct URL fetching only)")

## Step 2: Process and Index Documents

The library automatically:
- Downloads PDFs from URLs
- Renders each PDF page as an image
- Generates multimodal embeddings
- Stores documents in MongoDB with vector indexes
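
For reference, the last step in the list above corresponds to an Atlas Vector Search index on the field that holds the embeddings. The sketch below builds such an index definition by hand. The field name `embedding` and the choice of cosine similarity are assumptions for illustration (the library may use different names and settings); 1024 is the output dimension of `voyage-multimodal-3`.

```python
from typing import Any, Dict

ALLOWED_SIMILARITIES = {"cosine", "dotProduct", "euclidean"}


def vector_index_definition(
    path: str = "embedding",       # hypothetical field name
    num_dimensions: int = 1024,    # voyage-multimodal-3 output size
    similarity: str = "cosine",    # assumed similarity function
) -> Dict[str, Any]:
    """Build an Atlas Vector Search index definition as a plain dict."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"Unsupported similarity: {similarity}")
    if num_dimensions < 1:
        raise ValueError("num_dimensions must be positive")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition()
print(definition)
```

With PyMongo, a dict like this would typically be wrapped in a `SearchIndexModel` of type `vectorSearch` and passed to `create_search_index`; the library in this workshop handles that step for you.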

In [None]:
# Process a research paper
pdf_url = "https://arxiv.org/pdf/2501.12948"  # DeepSeek R1 paper

show_info(f"Processing PDF from: {pdf_url}")
show_info("This will automatically:")
show_info("• Download the PDF")
show_info("• Extract pages as images")
show_info("• Generate embeddings with voyage-multimodal-3")
show_info("• Store in MongoDB with vector indexes")

try:
    # Process PDF and get document images
    images = client.url_to_images(pdf_url)
    
    show_success(f"Extracted {len(images)} pages from PDF")
    
    # Generate IDs for each page
    ids = [f"deepseek_page_{i+1}" for i in range(len(images))]
    
    # Add documents to MongoDB with embeddings
    client.add_documents(documents=images, ids=ids)
    
    show_success(f"Successfully indexed {len(images)} document pages!")
    
except Exception as e:
    show_error(f"Failed to process PDF: {e}")

## Step 3: Test Similarity Search

In [None]:
# Test search functionality
test_queries = [
    "What is the Pass@1 accuracy on MATH500?",
    "Show me performance benchmarks",
    "Architecture diagrams"
]

for query in test_queries:
    show_info(f"\n🔍 Searching for: {query}")
    
    try:
        # Perform similarity search
        results = client.similarity_search(query=query, k=2)
        
        show_success(f"Found {len(results)} relevant documents")
        
        for i, doc in enumerate(results):
            show_info(f"Result {i+1}: {doc.get('id', 'Unknown')} (Score: {doc.get('score', 0):.4f})")
            
    except Exception as e:
        show_error(f"Search failed: {e}")
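
`similarity_search` hides both the query embedding and the database query. As a rough picture of what an equivalent manual query against Atlas Vector Search might look like, the sketch below builds a `$vectorSearch` aggregation pipeline. The index name `vector_index`, the field `embedding`, and the candidate multiplier are hypothetical; how the library issues its query internally is not shown in this workshop.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    k: int = 2,
    index_name: str = "vector_index",   # hypothetical index name
    path: str = "embedding",            # hypothetical embedding field
    candidates_per_result: int = 10,    # assumed oversampling factor
) -> List[Dict[str, Any]]:
    """Return an aggregation pipeline that fetches the k nearest documents."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # Consider more candidates than results to improve recall
                "numCandidates": k * candidates_per_result,
                "limit": k,
            }
        },
        # Expose the similarity score alongside each document's _id
        {"$project": {"_id": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], k=2)
print(pipeline)
```

The pipeline is plain data, so it can be inspected or logged before being passed to `collection.aggregate(...)`.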

## Step 4: Build an AI Agent with Gemini

Now let's create an AI agent that uses the multimodal search to answer questions.

In [None]:
from google import genai
from google.genai import types
from google.genai.types import FunctionCall
from typing import List
from PIL import Image
import io
import base64

# Initialize Gemini
if not os.getenv("GOOGLE_API_KEY"):
    show_error("GOOGLE_API_KEY not found in environment")
else:
    gemini_client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    LLM = "gemini-2.0-flash"
    show_success(f"Gemini client initialized with model: {LLM}")

In [None]:
def multimodal_search_tool(query: str, num_results: int = 2) -> List[dict]:
    """
    Search for relevant documents using multimodal embeddings.
    
    Args:
        query: Natural language search query
        num_results: Number of results to return
        
    Returns:
        List of document dictionaries with content
    """
    try:
        show_info(f"🔍 Searching for: {query}")
        
        # Use the simplified search API
        results = client.similarity_search(query=query, k=num_results)
        
        show_success(f"Found {len(results)} relevant documents")
        
        # Extract document content
        documents = []
        for result in results:
            doc_info = {
                'id': result.get('id', 'Unknown'),
                'score': result.get('score', 0),
                'content': result.get('inputs', {})  # This contains the image data
            }
            documents.append(doc_info)
            
        return documents
        
    except Exception as e:
        show_error(f"Search failed: {e}")
        return []

# Define function declaration for Gemini
search_declaration = {
    "name": "multimodal_search_tool",
    "description": "Search for relevant documents using multimodal embeddings",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query"
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results to return",
                "default": 2
            }
        },
        "required": ["query"]
    }
}

show_success("Multimodal search tool defined!")
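
When the model requests a tool, it returns a function name and an argument mapping that your code must route to the matching Python function. The sketch below shows one way to do that with a small registry. `dispatch_tool_call` and `fake_search` are hypothetical names used for illustration, and the `int` coercion assumes that numeric arguments may arrive as floats such as `2.0`.

```python
from typing import Any, Callable, List, Mapping, Optional


def dispatch_tool_call(
    name: str,
    args: Optional[Mapping[str, Any]],
    registry: Mapping[str, Callable[..., Any]],
) -> Any:
    """Look up a tool by name and call it with the model-supplied arguments."""
    if name not in registry:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = dict(args or {})
    # Coerce the result count to int in case it was delivered as a float
    if "num_results" in kwargs:
        kwargs["num_results"] = int(kwargs["num_results"])
    return registry[name](**kwargs)


def fake_search(query: str, num_results: int = 2) -> List[str]:
    """Stand-in for multimodal_search_tool so the example runs offline."""
    return [f"{query}-{i}" for i in range(num_results)]


registry = {"multimodal_search_tool": fake_search}
print(dispatch_tool_call("multimodal_search_tool", {"query": "benchmarks", "num_results": 2.0}, registry))
```

A registry keeps the agent loop unchanged when more tools are added: each new tool only needs a declaration and a registry entry.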

In [None]:
# Create Gemini configuration with tools
tools = types.Tool(function_declarations=[search_declaration])
tools_config = types.GenerateContentConfig(tools=[tools], temperature=0.0)

def multimodal_agent(user_query: str) -> str:
    """
    AI agent that uses multimodal search to answer questions.
    
    Args:
        user_query: User's question
        
    Returns:
        Agent's response
    """
    try:
        show_info(f"🤖 Processing query: {user_query}")
        
        # Step 1: Decide if we need to search
        system_prompt = """You are an AI assistant with access to a multimodal document search tool.
        Decide if you need to search for information to answer the user's question.
        If yes, call the search tool with an appropriate query."""
        
        # Get tool decision from Gemini
        response = gemini_client.models.generate_content(
            model=LLM,
            contents=[system_prompt, user_query],
            config=tools_config
        )
        
        # Check if tool was called
        retrieved_docs = []
        if response.candidates and response.candidates[0].content.parts:
            for part in response.candidates[0].content.parts:
                if hasattr(part, 'function_call') and part.function_call:
                    if part.function_call.name == "multimodal_search_tool":
                        show_info("🛠️ Agent is searching for relevant documents...")
                        # Execute the search
                        search_results = multimodal_search_tool(**part.function_call.args)
                        retrieved_docs.extend(search_results)
        
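        # Step 2: Generate final answer
        # Note: only document IDs and relevance scores are placed in the prompt;
        # the page content in doc['content'] is not sent to the model here.
        if retrieved_docs:
            context = "Based on the following retrieved documents:\n"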
            for doc in retrieved_docs:
                context += f"\n- Document {doc['id']} (relevance: {doc['score']:.3f})"
            
            final_prompt = f"""{context}
            
            Answer this question: {user_query}
            
            Base your answer only on the retrieved documents. If the information isn't available, say so."""
        else:
            final_prompt = f"Answer this question based on your general knowledge: {user_query}"
        
        # Generate final response
        final_response = gemini_client.models.generate_content(
            model=LLM,
            contents=[final_prompt],
            config=types.GenerateContentConfig(temperature=0.0)
        )
        
        return final_response.text
        
    except Exception as e:
        show_error(f"Agent failed: {e}")
        return "I apologize, but I encountered an error while processing your question."

show_success("Multimodal agent created!")
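
Note that the final prompt in `multimodal_agent` lists only document IDs and scores, so the model never sees the retrieved pages themselves. One way to close that gap, sketched below, is to interleave each document's payload with a short label and append the question last. This assumes `doc['content']` holds one item or a list of items that the model client accepts directly in `contents` (for example strings or PIL images); the actual shape of the `inputs` field depends on the library and is not confirmed here. `build_answer_contents` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, List


def build_answer_contents(retrieved_docs: Iterable[Dict[str, Any]], user_query: str) -> List[Any]:
    """Assemble a multimodal prompt: labelled document payloads, then the question."""
    contents: List[Any] = []
    for doc in retrieved_docs:
        payload = doc.get("content")
        if not payload:
            # Skip documents with no usable payload (None, empty list, empty dict)
            continue
        contents.append(f"Document {doc.get('id', 'Unknown')}:")
        if isinstance(payload, (list, tuple)):
            contents.extend(payload)
        else:
            contents.append(payload)
    contents.append(
        "Answer this question using only the documents above. "
        f"If the information isn't available, say so. Question: {user_query}"
    )
    return contents


# Strings stand in for page images so the example runs without extra dependencies
docs = [{"id": "deepseek_page_1", "score": 0.91, "content": ["<page image 1>"]}]
print(build_answer_contents(docs, "What is the Pass@1 accuracy on MATH500?"))
```

The returned list could then be passed as `contents=` to `gemini_client.models.generate_content` in place of `[final_prompt]`.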

## Step 5: Test the Multimodal Agent

In [None]:
import html

from IPython.display import HTML, display

# Test the agent with various queries
test_questions = [
    "What is the Pass@1 accuracy of DeepSeek R1 on the MATH500 benchmark?",
    "What are the main architectural components?",
    "How does the model perform on coding tasks?"
]

for question in test_questions:
    show_info(f"\n❓ Question: {question}")
    
    answer = multimodal_agent(question)
    
    show_success("🤖 Agent Response:")
    # Escape the model output so stray markup in the answer cannot break the HTML
    safe_answer = html.escape(answer or "")
    display(HTML(f'<div style="background-color:#f5f5f5;padding:15px;border-radius:5px;margin:10px 0;">{safe_answer}</div>'))

## Step 6: Advanced - Add Documents from S3

The library also supports loading documents directly from S3.

In [None]:
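# Optional: load documents from S3 (only works if S3_BUCKET_NAME is set)
show_info("📦 S3 Integration (Optional Advanced Feature)")

if os.environ.get("S3_BUCKET_NAME"):
    show_success("S3 bucket configured! You can use S3 features.")

    # Uncomment and modify with your S3 paths
    # s3_pdf_path = f"s3://{os.environ['S3_BUCKET_NAME']}/path/to/document.pdf"
    #
    # try:
    #     # Process PDF from S3 (same call as for direct URLs)
    #     s3_images = client.url_to_images(s3_pdf_path)
    #
    #     # Add with custom IDs
    #     s3_ids = [f"s3_doc_page_{i+1}" for i in range(len(s3_images))]
    #     client.add_documents(documents=s3_images, ids=s3_ids)
    #
    #     show_success(f"Added {len(s3_images)} pages from S3!")
    # except Exception as e:
    #     show_error(f"S3 processing failed: {e}")
else:
    show_warning("S3_BUCKET_NAME not configured - S3 features disabled")
    show_info("The rest of the workshop works without S3: direct URL fetching handles any public PDF")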
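
Since `url_to_images` is called with both `https://` and `s3://` locations in this workshop, it can be handy to check which kind a given string is before deciding whether S3 configuration is needed. The helper below is a hypothetical illustration using only the standard library; it does not reflect how the library itself parses URLs.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for an s3:// URL, or None for any other scheme."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        return None
    # For s3://bucket/key, the bucket is the network location and the key is the path
    return parsed.netloc, parsed.path.lstrip("/")


for location in ("s3://your-bucket/path/to/document.pdf", "https://arxiv.org/pdf/2501.12948"):
    parts = split_s3_url(location)
    print(location, "->", "S3 object" if parts else "direct URL", parts)
```

A check like this lets a notebook warn early when an `s3://` path is supplied but `S3_BUCKET_NAME` is not configured.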

## Step 7: Adding Text Documents

The library also supports text documents alongside images.

In [None]:
from pymongo_voyageai_multimodal import TextDocument

# Create text documents
text_docs = [
    TextDocument(
        text="DeepSeek R1 is a state-of-the-art language model with strong reasoning capabilities.",
        metadata={"source": "summary", "type": "overview"}
    ),
    TextDocument(
        text="The model achieves 97.3% Pass@1 accuracy on MATH500 benchmark.",
        metadata={"source": "benchmark", "type": "performance"}
    )
]

# Add text documents
text_ids = ["text_summary_1", "text_benchmark_1"]

try:
    client.add_documents(documents=text_docs, ids=text_ids)
    show_success(f"Added {len(text_docs)} text documents!")
except Exception as e:
    show_error(f"Failed to add text documents: {e}")

## 🎯 Workshop Summary

### What We've Built

Using the `pymongo-voyageai-multimodal` library, we've created:

1. **Automatic Document Processing**: PDFs → Images → Embeddings → MongoDB
2. **Simplified Vector Search**: Natural language queries without manual embedding
3. **Multimodal AI Agent**: Gemini-powered agent with document retrieval
4. **S3 Integration**: Direct document loading from cloud storage
5. **Mixed Content**: Support for both images and text documents

### Key Advantages

- ✅ **Minimal Code**: Far less code than wiring up embedding, indexing, and search by hand
- ✅ **Automatic Indexing**: No manual vector index creation
- ✅ **Built-in S3**: Cloud storage integration out of the box
- ✅ **Type Safety**: Structured document types (TextDocument, ImageDocument)
- ✅ **Production Ready**: Handles errors, retries, and edge cases

### Limitations to Consider

- ❗ **Static Datasets**: Best for read-heavy workloads
- 🔗 **Voyage AI Only**: Currently limited to `voyage-multimodal-3`
- 🗄️ **AWS S3**: Cloud storage limited to S3 (for now)
- 🔄 **No Live Sync**: Re-indexing required for updates

In [None]:
# Cleanup (optional)
# client.close()

show_success("🎉 Workshop completed successfully!")
show_info("You've built a multimodal AI agent with minimal code using pymongo-voyageai-multimodal!")