<a href="https://colab.research.google.com/github/lbsocial/data-analysis-with-generative-ai/blob/main/multimodal_search_engine_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🐦 Tutorial: Build a Multimodal "Smart Search" with MongoDB & AI

In this tutorial, we will build a search engine that goes beyond simple keyword matching. Using **Vector Search** and **Multimodal AI**, we will create a system that allows users to search for tweets using **Text** or **Images**.

**What you will learn:**
1.  **Mock Data Generation:** How to create realistic social media data with Python.
2.  **Multimodal Embeddings:** How to use OpenAI's **CLIP** model to understand both text and images.
3.  **Vector Search:** How to store and search embeddings using **MongoDB Atlas**.
4.  **Cross-Modal Retrieval:** How to search for images using text (and vice versa).

**Prerequisites:**
* A MongoDB Atlas database.
* The connection string saved in Colab Secrets as `mongodb_connection`.

## ⚠️ Important: Enable GPU Runtime
To ensure the AI model loads and runs quickly, please enable the T4 GPU.

1.  Click **Runtime** in the top menu.
2.  Select **Change runtime type**.
3.  Under **Hardware accelerator**, choose **T4 GPU**.
4.  Click **Save**.

## ⚙️ Step 1: Environment Setup
We need to install the `sentence-transformers` library to load our AI model, `pymongo` to interact with the database, and `pillow` and `requests` to download and open images. We will also load the **CLIP** model, which is designed to map text and images into the same "vector space," enabling us to compare them mathematically.



In [None]:
!pip install sentence-transformers pymongo pillow requests -q

We use **OpenAI's CLIP model** via the `sentence-transformers` library.

**Why CLIP?**
It aligns text and images in the same "vector space," meaning the math for the word "dog" is similar to the math for a *picture* of a dog. This enables **Cross-Modal Search** (searching for images using text).

**Technical Specs:**
* **Model:** `clip-ViT-B-32` (a base-size Vision Transformer that splits each image into 32x32 pixel patches and outputs 512-dimensional vectors).
* **Library:** `sentence-transformers` (Handles the complex image processing automatically).



In [None]:
from sentence_transformers import SentenceTransformer
import time
from PIL import Image

print("⏳ Loading CLIP model... (this may take a moment)")
# We use CLIP because it understands both Text and Images in the same vector space
model = SentenceTransformer('clip-ViT-B-32')
# Disable Pillow's large-image safety limit so high-resolution photos can be opened
Image.MAX_IMAGE_PIXELS = None
print("✅ Model loaded!")

### 🖼️ Visualizing the Vector Space

The illustration below demonstrates the core magic of **Multimodal Embeddings**:

1.  **Dual Inputs:** We feed two completely different types of data, an **Image** (pixels of a cat) and **Text** ("A fluffy cat"), into the same AI model.
2.  **Translation:** The model converts both inputs into **Vectors** (lists of numbers).
3.  **The Shared Space:** Notice that the **Blue Dot** (Image) and the **Green Dot** (Text) land very close to each other because they represent the same concept.

<img src="https://raw.githubusercontent.com/lbsocial/data-analysis-with-generative-ai/main/image/Gemini_Generated_Image_rj1wcvrj1wcvrj1w.png" width="600" alt="Shared Vector Space">

*(Source: LBSocial)*
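
To make "close to each other" concrete, here is a minimal sketch of cosine similarity, the measure used later in this tutorial to compare vectors. The three-number vectors are made-up stand-ins for real 512-dimensional CLIP embeddings, and the helper name `cosine_similarity` is our own, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real CLIP vectors have 512 numbers)
cat_image = [0.9, 0.1, 0.2]   # stands in for the picture of a cat
cat_text = [0.8, 0.2, 0.1]    # stands in for the text "A fluffy cat"
pizza_text = [0.1, 0.9, 0.3]  # stands in for an unrelated text

print("cat image vs cat text:  ", round(cosine_similarity(cat_image, cat_text), 3))
print("cat image vs pizza text:", round(cosine_similarity(cat_image, pizza_text), 3))
```

The first pair scores much higher than the second, which is the property the search engine relies on.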

## Connect to MongoDB
Make sure your `mongodb_connection` secret is set up in the Colab Secrets side panel.

In [None]:
from google.colab import userdata
from pymongo import MongoClient

# Setup MongoDB Connection
mongo_uri = userdata.get('mongodb_connection')
mongo_client = MongoClient(mongo_uri)

# Connect to the specific collection
db = mongo_client['demo']
collection = db['tweet_collection']

print("✅ Connected to MongoDB collection: demo.tweet_collection")

## 🛠️ Step 2: Generate Synthetic Tweets


In a real-world application, you would connect this system to the Twitter API to index your own top tweets or timeline.

**Option A: Use Your Own Data**
If you have a collection of tweets (with image URLs), you can use them here!

**Option B: Generate Mock Data**
If you **don't have tweets**, simply run the code below. We will create tweets across distinct categories (Tech, Animals, Food) to test if our search engine can accurately distinguish between visual and semantic concepts. Each tweet will follow the standard Twitter data structure (JSON with `id`, `text`, and `entities`).
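
As a rough illustration of that structure, the sketch below builds one tweet with a photo and one without, then reads the image URL back out. The helper name `first_photo_url` and the `example.com` URL are placeholders of our own; only the `id`, `text`, and `entities.media` fields come from the structure described above.

```python
def first_photo_url(tweet):
    """Return the first photo URL found in a tweet's entities, or None."""
    media = tweet.get("entities", {}).get("media", [])
    for item in media:
        if item.get("type") == "photo" and item.get("media_url_https"):
            return item["media_url_https"]
    return None

# A tweet with an attached photo (placeholder URL)
sample_tweet = {
    "id": "1000000000000000001",
    "text": "Took the dog to the beach today.",
    "entities": {
        "media": [{
            "media_url_https": "https://example.com/dog.jpg",
            "type": "photo",
        }]
    },
}

# A text-only tweet: 'entities' is present but empty
text_only_tweet = {
    "id": "1000000000000000002",
    "text": "Git merge conflict: 1, Me: 0.",
    "entities": {},
}

print(first_photo_url(sample_tweet))
print(first_photo_url(text_only_tweet))
```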

In [None]:
# --- 1. SETUP & IMPORTS ---
import datetime
import random
from google.colab import userdata
from pymongo import MongoClient

# Connect to MongoDB
try:
    mongo_uri = userdata.get('mongodb_connection')
    client = MongoClient(mongo_uri)
    db = client['demo']
    collection = db['tweet_collection']
    print("✅ Connected to MongoDB collection: demo.tweet_collection")
except Exception as e:
    print(f"❌ Connection Error: {e}")

# --- 2. CONFIG: DIVERSE DATA BANKS ---

# A. Image URLs (Unsplash) - We cycle through these
image_bank = {
    "tech_setup": [
        "https://images.unsplash.com/photo-1595225476474-87563907a212", # Keyboard
        "https://images.unsplash.com/photo-1593640408182-31c70c8268f5", # PC Setup
        "https://images.unsplash.com/photo-1587202372775-e229f172b9d7", # Monitor
        "https://images.unsplash.com/photo-1550745165-9bc0b252726f",  # Retro
        "https://images.unsplash.com/photo-1527443224154-c4a3942d3acf"  # Mouse
    ],
    "animals": [
        "https://images.unsplash.com/photo-1552053831-71594a27632d", # Retriever
        "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba", # Cat
        "https://images.unsplash.com/photo-1583511655857-d19b40a7a54e", # Dog face
        "https://images.unsplash.com/photo-1573865526739-10659fec78a5", # Sleeping cat
        "https://images.unsplash.com/photo-1537151608828-ea2b11777ee8"  # Puppy
    ],
    "food": [
        "https://images.unsplash.com/photo-1511920170033-f8396924c348", # Latte
        "https://images.unsplash.com/photo-1579871494447-9811cf80d66c", # Sushi
        "https://images.unsplash.com/photo-1565299624946-b28f40a0ae38", # Pizza
        "https://images.unsplash.com/photo-1482049016688-2d3e1b311543", # Toast
        "https://images.unsplash.com/photo-1484723091739-30a097e8f929"  # Burger
    ]
}

# B. Text Content (Human-written variety for better semantic search)
text_bank = {
    "tech_setup": [
        "Just installed my new RTX 4090. The frame rates are buttery smooth!",
        "Why is cable management so hard? I spent 3 hours just hiding wires.",
        "My mechanical keyboard is way too loud for late night gaming sessions.",
        "Finally upgraded to a dual monitor setup. Productivity increased by 200%.",
        "Building a custom water-cooled PC is terrifying but worth it.",
        "Does anyone else hate Windows 11 updates? They break everything.",
        "Loving the RGB aesthetic on this new mousepad.",
        "My laptop is overheating again. Time to clean the fans.",
        "Testing out the new VR headset. Virtual reality is finally getting good.",
        "Is it worth buying a curved monitor for coding?",
        "My wifi speed is terrible today, I can't stream anything.",
        "Just bought a standing desk. My back feels so much better.",
        "Reviewing the latest tech gadgets on my blog tonight.",
        "The battery life on this new device is actually impressive.",
        "Nothing beats a clean, minimalist desk setup."
    ],
    "animals": [
        "My golden retriever is afraid of the vacuum cleaner. Poor guy!",
        "Woke up to my cat sleeping on my face. Best alarm clock ever.",
        "Took the dog to the beach today. He tried to eat the ocean.",
        "Adopting a rescue kitten was the best decision I made this year.",
        "Why do dogs chase their own tails? It's hilarious.",
        "My parrot learned to mimic the microwave beep. It's so confusing.",
        "Walking the dog in the rain is not my favorite activity.",
        "Look at those puppy eyes! I can't say no to him.",
        "My cat knocked a glass of water onto my laptop. Chaos ensues.",
        "Spending the weekend hiking with my furry best friend.",
        "Does your pet have a favorite toy they carry everywhere?",
        "Watching birds in the garden is surprisingly relaxing.",
        "My hamster ran on his wheel for 4 hours straight last night.",
        "Just got back from the vet. Clean bill of health for the pup!",
        "Cuddling with my cat after a long day is pure therapy."
    ],
    "food": [
        "This spicy ramen is clearing my sinuses instantly! So good.",
        "Nothing beats the smell of fresh coffee and croissants in the morning.",
        "Tried making sushi at home. It looks ugly but tastes amazing.",
        "Best burger in town is definitely at that new downtown spot.",
        "I could eat avocado toast for every meal of the day.",
        "Craving some authentic Italian pasta right now.",
        "This chocolate cake is way too rich, but I'm eating it anyway.",
        "Trying to eat healthy, so I made a giant salad. It needs more dressing.",
        "Ordering late-night pizza because I'm too lazy to cook.",
        "Freshly squeezed orange juice is a game changer.",
        "The seafood platter at this restaurant is massive!",
        "Baking cookies for the holiday party. hope they don't burn.",
        "I need a strong espresso to survive this Monday afternoon.",
        "Enjoying a glass of red wine with a cheese board.",
        "Why does pineapple on pizza cause so many arguments?"
    ],
    "coding_text": [
        "Spent 4 hours debugging a missing semicolon. I love programming.",
        "Git merge conflict: 1, Me: 0. I hate this.",
        "Deploying to production on a Friday. Living dangerously!",
        "Python is so much more readable than C++. Change my mind.",
        "Finally fixed that recursion error! I feel like a wizard.",
        "My SQL query is taking forever to run. Need to index these tables.",
        "Learning Rust is humbling. The compiler is so strict.",
        "Stack Overflow is down. Guess I can't do my job today.",
        "Refactoring legacy code is a nightmare. Who wrote this mess?",
        "Docker containers are failing to spin up. Send help.",
        "Just pushed my first open source contribution! So proud.",
        "Writing documentation is boring, but future me will be thankful.",
        "Why does this code work on localhost but fail on the server?",
        "Automating my boring tasks with a simple shell script.",
        "Unit tests are all passing. I am suspicious..."
    ]
}

# --- 3. GENERATION LOOP (With Shuffle) ---
print("🚀 Generating 60 Unique Mock Tweets...")

docs_to_insert = []
categories = ["tech_setup", "animals", "food", "coding_text"]

for category in categories:
    print(f"   Processing category: {category}...")

    # Get the text list and shuffle it so it's random
    texts = text_bank[category]
    random.shuffle(texts)

    # Get images (if applicable)
    images = image_bank.get(category, [])

    # Generate 15 tweets per category (matching the text bank size)
    for i in range(15):
        text = texts[i]

        # Assign Image URL (Cycle through available images)
        if category == "coding_text":
            img_url = None
            entities = {}
        else:
            img_url = images[i % len(images)]
            # Standard Twitter Media Structure
            entities = {
                "media": [{
                    "media_url_https": img_url,
                    "type": "photo",
                    "display_url": "pic.twitter.com/xyz"
                }]
            }

        # Generate Fake ID & Timestamp
        fake_id = str(random.randint(1000000000000000000, 1999999999999999999))
        created_at = datetime.datetime.now().isoformat()

        # Final Object
        tweet_doc = {
            "id": fake_id,
            "text": text,
            "created_at": created_at,
            "entities": entities,
            "category": category # Helper field for tutorial
        }

        docs_to_insert.append(tweet_doc)

# --- 4. INSERT ---
if docs_to_insert:
    collection.delete_many({})  # Clears any existing documents in the collection first
    collection.insert_many(docs_to_insert)
    print("-" * 40)
    print(f"🎉 SUCCESS! Stored {len(docs_to_insert)} high-quality tweets.")
    print("   Example Text: " + docs_to_insert[0]['text'])
    print("-" * 40)

**Verify the Dataset**

Before generating embeddings, let's peek into our MongoDB collection to ensure the data looks correct.

We will run a simple **Aggregation Query** to:
1.  **Group** the tweets by their category (e.g., Tech, Animals, Food).
2.  **Count** how many tweets are in each category (should be 15 each).
3.  **Preview** a few examples of the text and image URLs to make sure they were generated properly.
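
To make the grouping step concrete, here is a rough in-memory equivalent in plain Python. The helper name `summarize_by_category` and the three sample documents are made up for illustration; the notebook itself does this work inside MongoDB.

```python
from collections import defaultdict

def summarize_by_category(docs, max_samples=3):
    """Group documents by 'category', count them, and keep a few sample texts."""
    summary = defaultdict(lambda: {"count": 0, "sample_texts": []})
    for doc in docs:
        entry = summary[doc["category"]]
        entry["count"] += 1
        if len(entry["sample_texts"]) < max_samples:
            entry["sample_texts"].append(doc["text"])
    return dict(summary)

# Hypothetical mini-dataset
docs = [
    {"category": "food", "text": "Ordering late-night pizza."},
    {"category": "food", "text": "Fresh coffee and croissants."},
    {"category": "animals", "text": "My cat is asleep on my laptop."},
]

summary = summarize_by_category(docs)
for category, info in summary.items():
    print(category, info["count"], info["sample_texts"])
```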

In [None]:
# --- 1. SETUP ---
from google.colab import userdata
from pymongo import MongoClient

# Connect
try:
    mongo_uri = userdata.get('mongodb_connection')
    client = MongoClient(mongo_uri)
    db = client['demo']
    collection = db['tweet_collection']
    print("✅ Connected to MongoDB")
except Exception as e:
    print(f"❌ Connection Error: {e}")

# --- 2. SUMMARY QUERY (AGGREGATION) ---
print("\n📊 DATASET SUMMARY:")
print("=" * 60)

pipeline = [
    {
        "$group": {
            "_id": "$category",
            "count": { "$sum": 1 },
            # Collect every text in the group (trimmed to 3 when displayed)
            "sample_texts": { "$push": "$text" },
            # Collect each tweet's first image URL from the nested entities (trimmed when displayed)
            "sample_images": { "$push": { "$arrayElemAt": ["$entities.media.media_url_https", 0] } }
        }
    }
]

results = list(collection.aggregate(pipeline))

# --- 3. DISPLAY RESULTS ---
for cat_data in results:
    category = cat_data['_id']
    count = cat_data['count']
    texts = cat_data['sample_texts'][:3] # Show only top 3

    # Filter out None/Null images (e.g., for coding_text)
    images = [img for img in cat_data['sample_images'] if img][:2] # Show only top 2

    print(f"📂 CATEGORY: {category} ({count} docs)")

    print("   📝 Text Examples:")
    for t in texts:
        print(f"      - \"{t[:60]}...\"") # Truncate for cleaner view

    if images:
        print("   🖼️  Image Examples:")
        for img in images:
            print(f"      - {img}")
    else:
        print("   🖼️  No Images (Expected for this category)")

    print("-" * 60)

## 🧠 Step 3: The "Split Strategy" for Embeddings

This is the core logic of our Multimodal engine. We treat a tweet as two objects:

1.  **Text Vector:** Represents the meaning of the text.
2.  **Image Vector:** Represents the visual content of the image.

By storing these separately, we can match a user's query against *either* the text *or* the image.

<img src="https://github.com/lbsocial/data-analysis-with-generative-ai/blob/main/image/Gemini_Generated_Image_bctu0ubctu0ubctu.png?raw=true" width="600" alt="Split Strategy">

*(Source: LBSocial)*
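
The split can be sketched without any model at all. In the example below, `split_tweet` and the two `fake_encode_*` functions are hypothetical stand-ins of our own: the stubs return tiny made-up vectors so the shape of the output is easy to inspect, whereas the real notebook calls the CLIP model.

```python
def split_tweet(tweet_id, text, image_url, encode_text, encode_image):
    """Turn one tweet into one vector document (text) or two (text + image)."""
    base = {"original_id": tweet_id, "text": text, "image_url": image_url}
    docs = [{**base, "media_type": "text", "embedding": encode_text(text)}]
    if image_url:
        docs.append({**base, "media_type": "image", "embedding": encode_image(image_url)})
    return docs

# Stub encoders (assumption: placeholders for the real CLIP model)
def fake_encode_text(text):
    return [float(len(text)), 0.0]

def fake_encode_image(url):
    return [0.0, float(len(url))]

with_image = split_tweet("1", "Look at those puppy eyes!", "https://example.com/puppy.jpg",
                         fake_encode_text, fake_encode_image)
text_only = split_tweet("2", "Unit tests are all passing.", None,
                        fake_encode_text, fake_encode_image)

print([d["media_type"] for d in with_image])
print([d["media_type"] for d in text_only])
```

Both documents from the same tweet share `original_id`, so a match on either vector can be traced back to the source tweet.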

In [None]:
import requests
from PIL import Image
from io import BytesIO
from sentence_transformers import SentenceTransformer
from google.colab import userdata
from pymongo import MongoClient


# --- 1. FETCH RAW DATA ---
raw_tweets = list(collection.find({}))
print(f"📂 Found {len(raw_tweets)} raw tweets to process.")

# --- 2. EMBEDDING LOOP ---
vector_documents = []
print("🚀 Starting Embedding Process...")

for i, tweet in enumerate(raw_tweets):
    if i % 10 == 0 and i > 0: print(f"   ... processed {i} tweets")

    # Extract Fields (Safely)
    tweet_id = tweet.get('id')
    text_content = tweet.get('text')

    # Dig for Image URL in the 'entities' structure
    image_url = None
    entities = tweet.get('entities', {})
    if 'media' in entities and len(entities['media']) > 0:
        image_url = entities['media'][0]['media_url_https']

    # A. TEXT EMBEDDING (Always exists)
    try:
        text_emb = model.encode(text_content).tolist()

        vector_documents.append({
            "original_id": tweet_id,
            "category": tweet.get('category'), # Keep category for tutorial
            "media_type": "text",
            "text": text_content,
            "image_url": image_url,
            "embedding": text_emb
        })
    except Exception as e:
        print(f"   ⚠️ Text Error on ID {tweet_id}: {e}")

    # B. IMAGE EMBEDDING (If exists)
    if image_url:
        try:
            # Download
            response = requests.get(image_url, timeout=5)
            if response.status_code == 200:
                img = Image.open(BytesIO(response.content))

                # Generate Vector
                img_emb = model.encode(img).tolist()

                vector_documents.append({
                    "original_id": tweet_id,
                    "category": tweet.get('category'),
                    "media_type": "image",
                    "text": text_content,
                    "image_url": image_url,
                    "embedding": img_emb
                })
        except Exception as e:
            # If an image is corrupt or too big for Colab RAM, skip it
            print(f"   ⚠️ Image skipped for ID {tweet_id}: {e}")

# --- 3. SAVE RESULTS ---
# Note: this replaces the raw tweets in the collection with the vector documents
if vector_documents:
    collection.delete_many({})
    collection.insert_many(vector_documents)
    print("-" * 40)
    print(f"🎉 DONE! Saved {len(vector_documents)} vector documents to MongoDB.")
    print("-" * 40)
else:
    print("❌ No documents generated.")

## ⚡ Step 4: Create Vector Search Index
For MongoDB to perform fast vector searches, we must define an index.

We configure the index to use **512 dimensions** (matching the output of the `clip-ViT-B-32` model) and **Cosine Similarity**, a standard metric for comparing semantic vectors. We also declare `category` and `media_type` as filter fields so that searches can be restricted by those values.
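
Because the index expects exactly 512 numbers per vector, it can help to sanity-check embeddings before storing them. The helper below is a small sketch of our own (the name `is_valid_embedding` is hypothetical, not a MongoDB or Sentence-Transformers function).

```python
import math

EXPECTED_DIMS = 512  # must match "numDimensions" in the index definition

def is_valid_embedding(vector, expected_dims=EXPECTED_DIMS):
    """True if the vector has the expected length and only finite numbers."""
    if len(vector) != expected_dims:
        return False
    return all(isinstance(v, (int, float)) and math.isfinite(v) for v in vector)

print(is_valid_embedding([0.1] * 512))           # right size
print(is_valid_embedding([0.1] * 256))           # wrong size
print(is_valid_embedding([float("nan")] * 512))  # right size, bad values
```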

In [None]:
from pymongo.operations import SearchIndexModel
import time

index_name = "vector_index"

# 1. Define Index
index_definition = {
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 512,
      "similarity": "cosine"
    },
    {"type": "filter", "path": "category"},
    {"type": "filter", "path": "media_type"}
  ]
}

# 2. Create Index
print("⏳ Creating Vector Search Index...")
try:
    collection.create_search_index(
        model=SearchIndexModel(definition=index_definition, name=index_name, type="vectorSearch")
    )
    print("✅ Index creation command sent.")
except Exception as e:
    print(f"⚠️ Index might already exist: {e}")

# 3. Poll for Readiness (give up after 5 minutes instead of looping forever)
print("⏳ Waiting for index to be queryable...")
deadline = time.time() + 300
while time.time() < deadline:
    indices = list(collection.list_search_indexes(index_name))
    if indices and indices[0].get('queryable'):
        print("🎉 Index is READY!")
        break
    time.sleep(5)
else:
    print("❌ Index not queryable after 5 minutes; check the Atlas UI.")

## 🔍 Step 5: Define the "Double-Tap" Search Logic

To work around the **"Modality Gap"** (text vectors and image vectors tend to cluster separately, so a single search often returns results from only one modality), we use a **Double-Tap Strategy**:

1.  **Search A:** Force the database to find the best **Text** matches.
2.  **Search B:** Force the database to find the best **Image** matches.
3.  **Merge:** Combine both sets to guarantee a rich result.

<img src="https://github.com/lbsocial/data-analysis-with-generative-ai/blob/main/image/Gemini_Generated_Image_mqvyn0mqvyn0mqvy.png?raw=true" width="600" alt="Double Tap Strategy">

*(Source: LBSocial)*
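
The same idea can be sketched in memory with brute-force ranking instead of a database index. Everything here is illustrative: `double_tap`, `cosine`, and the two-number embeddings are made up, and a real deployment would let MongoDB do the ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def double_tap(query_vector, docs, k=1):
    """Rank text and image documents separately, then return the top k of each."""
    results = {}
    for media_type in ("text", "image"):
        candidates = [d for d in docs if d["media_type"] == media_type]
        ranked = sorted(candidates,
                        key=lambda d: cosine(query_vector, d["embedding"]),
                        reverse=True)
        results[media_type] = ranked[:k]
    return results

# Hypothetical toy documents: text vectors sit closer to the query than image vectors
docs = [
    {"media_type": "text", "label": "pizza tweet", "embedding": [0.9, 0.1]},
    {"media_type": "text", "label": "dog tweet", "embedding": [0.1, 0.9]},
    {"media_type": "image", "label": "pizza photo", "embedding": [0.6, 0.4]},
    {"media_type": "image", "label": "dog photo", "embedding": [0.2, 0.8]},
]

hits = double_tap([1.0, 0.0], docs)
print(hits["text"][0]["label"], "|", hits["image"][0]["label"])
```

A single top-1 search over all four documents would return only the pizza tweet; searching each modality separately also surfaces the pizza photo.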

In [None]:
import requests
from PIL import Image
from io import BytesIO

def mixed_search(query, num_results=1):
    # 1. DETECT INPUT & ENCODE
    if query.startswith("http"):
        print("🖼️  Query: [Image URL]")
        try:
            # Download the full body first; response.raw can hand PIL undecoded bytes
            response = requests.get(query, timeout=10)
            response.raise_for_status()
            img = Image.open(BytesIO(response.content))
            query_vector = model.encode(img).tolist()
        except Exception as e:
            print(f"❌ Error loading image: {e}")
            return
    else:
        print(f"📝 Query: '{query}'")
        query_vector = model.encode(query).tolist()

    # --- 2. RUN TWO SEPARATE SEARCHES ---

    # SEARCH A: Find only TEXT matches
    pipeline_text = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "limit": num_results,
                "numCandidates": 100,       # Nearest neighbors considered before picking the top results; must be >= limit
                "filter": { "media_type": "text" }
            }
        },
        { "$project": { "_id": 0, "text": 1, "category": 1, "score": { "$meta": "vectorSearchScore" } } }
    ]

    # SEARCH B: Find only IMAGE matches
    pipeline_image = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "limit": num_results,
                "numCandidates": 100,       # Nearest neighbors considered before picking the top results; must be >= limit
                "filter": { "media_type": "image" }
            }
        },
        { "$project": { "_id": 0, "image_url": 1, "category": 1, "score": { "$meta": "vectorSearchScore" } } }
    ]

    # Execute both
    text_results = list(collection.aggregate(pipeline_text))
    image_results = list(collection.aggregate(pipeline_image))

    # --- 3. DISPLAY RESULTS ---
    print(f"\n🔎 MIXED RESULTS (Guaranteed):")
    print("=" * 60)

    print(f"📄 BEST TEXT MATCHES:")
    if text_results:
        for r in text_results:
            print(f"   • {r['score']:.4f} | {r['category']} | \"{r['text'][:50]}...\"")
    else:
        print("   (No text matches found)")

    print("-" * 60)

    print(f"📷 BEST IMAGE MATCHES:")
    if image_results:
        for r in image_results:
            print(f"   • {r['score']:.4f} | {r['category']} | [Image Found]")
            if r.get('image_url'): print(f"     Target: {r['image_url']}")
    else:
        print("   (No image matches found)")

    print("=" * 60)

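## 🚀 Step 6: Test the Engine
Now it's time to verify our system. We will run two queries, each of which exercises both modalities:
1.  **Text query:** Searching "Pizza" to find tweets discussing food (text-to-text) and photos of food (text-to-image).
2.  **Image query:** Searching with a photo of a dog to find similar pet photos (image-to-image) and tweets about pets (image-to-text).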
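
When reading the scores that `mixed_search` prints, it may help to know how they relate to raw cosine similarity. To the best of our understanding of the Atlas documentation, cosine scores are normalized to the range 0 to 1 as `(1 + cosine) / 2`; treat that formula as an assumption to verify against the current docs. The helper names below are our own.

```python
def atlas_cosine_score(cosine_similarity):
    """Map a raw cosine similarity in [-1, 1] to a normalized score in [0, 1]."""
    return (1.0 + cosine_similarity) / 2.0

def cosine_from_score(score):
    """Invert the normalization to recover the raw cosine similarity."""
    return 2.0 * score - 1.0

for raw in (-1.0, 0.0, 0.3, 1.0):
    print(f"cosine {raw:+.2f} -> score {atlas_cosine_score(raw):.2f}")
```

Under this assumption, a score near 0.5 corresponds to nearly unrelated vectors rather than a "50% match", and cross-modal matches (text query against image vectors) will typically score lower than same-modality matches.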


In [None]:
# --- TEST 1: Text-to-Text & Text-to-Image ---
# Search: "Pizza"
# Expectation: Finds text discussing food AND photos of pizza (even if the file name isn't "pizza").
print(">>> TEST 1: Text Search ('Pizza')")
mixed_search("pizza")

In [None]:
# --- TEST 2: Image-to-Image & Image-to-Text ---
# Search: [Photo of a Dog]
# Expectation: Finds similar dog photos AND text tweets about "puppies" or "retrievers".
print("\n>>> TEST 2: Image Search (Using a URL of a Dog)")
dog_img_url = "https://images.unsplash.com/photo-1558788353-f76d92427f16"
mixed_search(dog_img_url)

## 🎓 Conclusion & References

Congratulations! You have successfully built a **Multimodal Search Engine**.

You have moved beyond simple keyword matching to creating a system that understands **concepts**. It knows that a picture of a keyboard is related to the text "fast computer," and it can bridge the gap between images and text using Vector Search.

### 📚 References & Resources
* **Tutorial Source:** [LBSocial](https://lbsocial.net)
* **The AI Model:** [OpenAI CLIP (Hugging Face)](https://huggingface.co/sentence-transformers/clip-ViT-B-32)
* **The Library:** [Sentence-Transformers Documentation](https://www.sbert.net/examples/applications/image-search/README.html)
* **The Database:** [MongoDB Atlas Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search)
* **Concept Paper:** [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)