<a href="https://colab.research.google.com/github/ndulam/AIMLReference/blob/main/MultiModelRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Simple Project Implementation: Multi-Modal RAG with OpenAI and Chroma DB
Let's create a basic Multi-Modal RAG system using OpenAI's GPT-4 Vision model and Chroma DB for vector storage. The system lets users ask questions about an image: it retrieves the closest matching description (and its stored image) from a database of image-text pairs and passes that context to the model along with the query image.


Step 1: Set up the environment
First, install the required libraries:

In [None]:
!pip install openai chromadb pillow
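The `openai` package changed its calling conventions between the 0.x and 1.x releases, so it can help to confirm which versions were actually installed. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of any of these packages:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what is present in the current environment.
for name in ("openai", "chromadb", "pillow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```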

Step 2: Initialize the project
Create a new Python file named multimodal_rag.py and add the following imports:

In [None]:
import os
import base64
import openai
import chromadb
from PIL import Image
from io import BytesIO

# Read the OpenAI API key from the environment rather than hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY", "your_openai_api_key_here")

Step 3: Create a function to encode images
Add a function to encode images to base64:

In [None]:
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
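To see what the encoder produces, the sketch below writes a few placeholder bytes to a temporary file, encodes them, and checks that decoding returns the original bytes. The file contents are an assumption made for illustration and are not a real image.

```python
import base64
import os
import tempfile


def encode_image(image_path):
    # Same helper as above, repeated so this cell runs on its own.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Write a few placeholder bytes to a temporary file (not a real image).
raw_bytes = b"\xff\xd8\xff\xe0 placeholder bytes"
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(raw_bytes)
    tmp_path = tmp.name

try:
    encoded = encode_image(tmp_path)
    # Decoding must give back exactly the bytes that were written.
    round_trip_ok = base64.b64decode(encoded) == raw_bytes
    # Base64 emits 4 characters for every 3 input bytes, rounded up.
    expected_length = 4 * ((len(raw_bytes) + 2) // 3)
finally:
    os.remove(tmp_path)

print(round_trip_ok, len(encoded) == expected_length)
```

The encoded string is roughly a third larger than the file itself, which is worth keeping in mind when images are stored or sent in this form.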

Step 4: Set up Chroma DB
Initialize Chroma DB and create a collection:

In [None]:
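# In-memory client: data is lost when the process exits
client = chromadb.Client()
# get_or_create_collection avoids an error when this cell is re-run
collection = client.get_or_create_collection("image_text_pairs")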

Step 5: Add sample data to Chroma DB
Add some sample image-text pairs to the database:

In [None]:
sample_data = [
    {"image_path": "path/to/eiffel_tower.jpg", "description": "The Eiffel Tower in Paris, France"},
    {"image_path": "path/to/statue_of_liberty.jpg", "description": "The Statue of Liberty in New York, USA"},
    # Add more samples as needed
]

for idx, item in enumerate(sample_data):
    encoded_image = encode_image(item["image_path"])
    collection.add(
        documents=[item["description"]],
        metadatas=[{"image": encoded_image}],
        ids=[f"img_{idx}"]
    )
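The loop above raises `FileNotFoundError` at the first missing image. One possible variation is to collect the records first, skipping paths that do not exist, and then add everything in one call. The helper name `build_batch` is hypothetical, and this sketch only prepares the lists; it does not touch Chroma DB.

```python
import base64
import os


def build_batch(items):
    """Return (documents, metadatas, ids, skipped) for image files that exist."""
    documents, metadatas, ids, skipped = [], [], [], []
    for idx, item in enumerate(items):
        path = item["image_path"]
        if not os.path.isfile(path):
            # Remember the missing path instead of aborting the whole load.
            skipped.append(path)
            continue
        with open(path, "rb") as image_file:
            encoded = base64.b64encode(image_file.read()).decode("utf-8")
        documents.append(item["description"])
        metadatas.append({"image": encoded})
        ids.append(f"img_{idx}")
    return documents, metadatas, ids, skipped


documents, metadatas, ids, skipped = build_batch([
    {"image_path": "path/to/eiffel_tower.jpg", "description": "The Eiffel Tower in Paris, France"},
])
print(f"{len(ids)} ready, {len(skipped)} skipped")
```

If `documents` is non-empty, the three lists can be passed to a single `collection.add(documents=documents, metadatas=metadatas, ids=ids)` call.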

Step 6: Implement the Multi-Modal RAG query function
Create a function to process user queries:

In [None]:
def multimodal_rag_query(query, image_path):
    # Encode the query image
    encoded_query_image = encode_image(image_path)

    # Retrieve relevant information from Chroma DB
    results = collection.query(query_texts=[query], n_results=1)

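    # Results are nested per query, so check the inner list for actual hits
    if results["documents"] and results["documents"][0]: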
        context = results["documents"][0][0]
        context_image = results["metadatas"][0][0]["image"]
    else:
        context = "No relevant information found."
        context_image = None

    # Prepare the messages for GPT-4 Vision
    messages = [
        {"role": "system", "content": "You are a helpful assistant that can see and analyze images."},
        {"role": "user", "content": [
            {"type": "text", "text": f"Query: {query}\n\nContext: {context}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_query_image}"}}
        ]}
    ]

    if context_image:
        messages[1]["content"].append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{context_image}"}})

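    # Generate a response using GPT-4 Vision (openai>=1.0 call style)
    # Note: gpt-4-vision-preview has been deprecated; swap in a current
    # vision-capable model such as gpt-4o if this name is rejected.
    response = openai.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=messages,
        max_tokens=300
    )

    return response.choices[0].message.content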
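The query function hard-codes `image/jpeg` in its data URLs, which mislabels PNG or WebP files. Below is a sketch of one way to derive the MIME type from the file name with the standard library's `mimetypes` module, and to build the message list separately from the API call so it can be inspected without spending tokens. `to_data_url` and `build_messages` are hypothetical helper names, and the JPEG fallback is an assumption.

```python
import mimetypes


def to_data_url(encoded_image, image_path):
    """Wrap a base64 string in a data URL, guessing the MIME type from the path."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"  # assumption: fall back to JPEG when unsure
    return f"data:{mime_type};base64,{encoded_image}"


def build_messages(query, context, query_image_url, context_image_url=None):
    """Assemble chat messages in the same shape as multimodal_rag_query."""
    content = [
        {"type": "text", "text": f"Query: {query}\n\nContext: {context}"},
        {"type": "image_url", "image_url": {"url": query_image_url}},
    ]
    if context_image_url:
        content.append({"type": "image_url", "image_url": {"url": context_image_url}})
    return [
        {"role": "system", "content": "You are a helpful assistant that can see and analyze images."},
        {"role": "user", "content": content},
    ]


messages = build_messages(
    "What can you tell me about this landmark?",
    "The Eiffel Tower in Paris, France",
    to_data_url("AAAA", "query.png"),  # "AAAA" is placeholder base64, not a real image
)
print(messages[1]["content"][1]["image_url"]["url"])
```

Keeping message construction in its own function also makes it easy to log or unit-test the prompt before any request is sent.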

Step 7: Test the Multi-Modal RAG system
Add a main section to test the system:

In [None]:
if __name__ == "__main__":
    query = "What can you tell me about this landmark?"
    image_path = "path/to/query_image.jpg"

    response = multimodal_rag_query(query, image_path)
    print(response)

This implementation is a simple Multi-Modal RAG system that answers questions about images using a database of image-text pairs. Retrieval is text-based: Chroma DB matches the query text against the stored descriptions, and OpenAI's GPT-4 Vision model then generates a response from the query image, the retrieved description, and the retrieved image.
To use this system, you need valid image paths and an OpenAI API key. You can expand the sample data and tune the retrieval step (for example, by returning more results) to improve the system's performance for your specific use case.
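One way to tune the retrieval step is to discard weak matches instead of always passing the top hit to the model. The sketch below assumes the nested-list result shape that `collection.query` returns for a single query text, including a `distances` list where smaller values mean closer matches. `pick_context` is a hypothetical helper, and the threshold values are arbitrary placeholders that would need calibrating on real data.

```python
def pick_context(results, max_distance=1.0):
    """Return (document, metadata) for the best hit, or (None, None) if too far.

    The max_distance default is an arbitrary placeholder, not a recommended value.
    """
    documents = (results.get("documents") or [[]])[0]
    metadatas = (results.get("metadatas") or [[]])[0]
    distances = (results.get("distances") or [[]])[0]
    if not documents or not distances:
        return None, None
    if distances[0] > max_distance:
        return None, None
    metadata = metadatas[0] if metadatas else None
    return documents[0], metadata


# Hand-written stand-in for a query result, so no database is needed here.
fake_results = {
    "documents": [["The Eiffel Tower in Paris, France"]],
    "metadatas": [[{"image": "AAAA"}]],
    "distances": [[0.42]],
}
print(pick_context(fake_results, max_distance=1.0))
print(pick_context(fake_results, max_distance=0.1))
```

When the helper returns `(None, None)`, the query function can fall back to its "No relevant information found." branch rather than feeding the model an unrelated description.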
Multi-Modal RAG lets an application ground a vision model's answers in both the user's image and retrieved reference data, which makes it a useful pattern for software engineers building applications that combine text and images.