<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Multimodal RAG
</h1>

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px;">
  <br>
  <hr>

  <p>This exercise shows how to implement a <b>multimodal retrieval-augmented generation (RAG)</b> system. In retrieval-augmented generation, the model generates its response from the input prompt together with information retrieved from an external source. In a multimodal setting, one of the most common use cases is to include <b>images</b> in the response generation process.</p>
  
  <p>This exercise implements the multimodal RAG system by using a <b>PDF file</b> that includes images, text, and tables. This PDF file is the external source of information mentioned earlier in the RAG definition. Once the system is set up, the model will be able to generate its responses considering the images, text, and tables from the PDF provided.</p>

  <p>Here is the list of topics covered in this exercise:</p>
  <ol>
    <li>Installing dependencies</li>
    <li>Process PDF</li>
    <li>Generate multimodal embeddings</li>
    <li>Create vector database</li>
    <li>Generate a RAG Response</li>
    <li>Test RAG Workflow</li>
  </ol>

  <hr>
</div>


### Installing dependencies

Installing the required libraries:

In [1]:
# %%capture
# !pip install -q -r ../requirements.txt

Importing the libraries used in this exercise:

In [2]:
import boto3
from botocore.exceptions import ClientError
import os
import json
import numpy as np
import base64
import pymupdf
import pandas as pd
from PIL import Image
import faiss
from tqdm import tqdm
from IPython import display

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.ERROR)

### Process PDF 

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>In this lab, we will read a sample PDF of the well-known paper <b>“Attention Is All You Need”</b> by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. This research paper laid the foundations for the <b>transformer models</b> that power many generative AI applications today. The paper is available <a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">here</a> and is <b>11 pages</b> long.</p>
</div>


In [3]:
filename = "attention_paper.pdf"
filepath = "data/" + filename
display.IFrame(filepath, width=600, height=600)

### Extract text and images from each page

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>The contents of the PDF need to be extracted and processed to be compatible with the <b>RAG application</b>.</p>
  
  <p>The following are the steps we will follow to process the data in this section:</p>
  <ol>
    <li>Extract the data (text and images) from the PDF using <b>pymupdf</b>.</li>
    <li>Go through each page. Create smaller text chunks from the text of the page.</li>
    <li>Convert each page of the PDF into an image.</li>
    <li>For each text chunk, extracted image, and page image, generate an embedding using <b>Amazon Titan Multimodal Embeddings</b>.</li>
    <li>Save the information of each page in a list to store in a <b>vector database</b>.</li>
  </ol>
</div>
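
Step 2 relies on overlapping character windows. As a minimal sketch of that arithmetic (the helper name `chunk_spans` is hypothetical, and the default sizes mirror the `chunk_size` and `overlap` values used in the cell below), consecutive chunks start `chunk_size - overlap` characters apart:

```python
def chunk_spans(text_length, chunk_size=700, overlap=200):
    """Return (start, end) character offsets of overlapping chunks.

    Hypothetical helper for illustration; it mirrors a sliding window
    whose stride is chunk_size - overlap.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, stride)
    ]


# A 1,600-character page yields windows that start every 500 characters
print(chunk_spans(1600))
```

Note that the final window can be a short tail that is already covered by the previous chunk, so a page may produce one more chunk than strictly needed.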


In [4]:
from utils.utils import pdf2imgs

doc = pymupdf.open(filepath)
num_pages = len(doc)

# Define the directories to store the extracted text, images and page images from each page
image_save_dir = "data/images"
text_save_dir = "data/text"
page_images_save_dir = "data/page_images"

# Chunk the text for effective retrieval
chunk_size = 700  # characters per chunk
overlap = 200     # characters shared by consecutive chunks


items = []
# Process all pages of the PDF
for page_num in tqdm(range(num_pages), desc="Processing PDF pages"):
    page = doc[page_num]
    text = page.get_text()
    
    # Process chunks with overlap
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size-overlap)]
    
    # Generate an item to add to items
    for i,chunk in enumerate(chunks):
        text_file_name = f"{text_save_dir}/{filename}_text_{page_num}_{i}.txt"
        # If the text folder doesn't exist, create one
        os.makedirs(text_save_dir, exist_ok=True)
        with open(text_file_name, 'w') as f:
            f.write(chunk)
        
        item={}
        item["page"] = page_num
        item["type"] = "text"
        item["text"] = chunk
        item["path"] = text_file_name
        items.append(item)
    
    
    # Get all the images in the current page
    images = page.get_images()
    for idx, image in enumerate(images):        
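        # Extract the image data
        xref = image[0]
        pix = pymupdf.Pixmap(doc, xref)
        # PNG cannot store CMYK data, so convert such pixmaps to RGB first
        if pix.n - pix.alpha > 3:
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)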
        # Create the image_name that includes the image path
        image_name = f"{image_save_dir}/{filename}_image_{page_num}_{idx}_{xref}.png"
        # If the image folder doesn't exist, create one
        os.makedirs(image_save_dir, exist_ok=True)
        # Save the image
        pix.save(image_name)
        
        # Produce base64 string
        with open(image_name, 'rb') as f:
            image = base64.b64encode(f.read()).decode('utf8')
        
        item={}
        item["page"] = page_num
        item["type"] = "image"
        item["path"] = image_name
        item["image"] = image
        items.append(item)

# Save pdf pages as images
page_images_save_dir = pdf2imgs(filepath, page_images_save_dir)

for page_num in range(num_pages):
    page_path = os.path.join(page_images_save_dir,  f"page_{page_num:03d}.png")
    
    # Produce base64 string from the rendered page image
    with open(page_path, 'rb') as f:
        page_image = base64.b64encode(f.read()).decode('utf8')
    
    item = {}
    item["page"] = page_num
    item["type"] = "page"
    item["path"] = page_path
    item["image"] = page_image
    items.append(item)

Processing PDF pages: 100%|██████████| 11/11 [00:00<00:00, 34.47it/s]


<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>In the cell above, we have used a simple <b>character-based chunking</b> solution. This can result in broken sentences and words, losing a lot of <b>semantic</b> and <b>syntactic meaning</b>. Try updating the chunking process to preserve the structure and meaning of the document. You can use modules offered by <b>LangChain</b> for this purpose.</p>
</div>
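
One possible direction, sketched here in plain Python rather than with LangChain, is to pack whole sentences into chunks. The helper name `chunk_by_sentences`, its parameters, and the regex-based sentence splitter are assumptions made for illustration; a production splitter would need to handle abbreviations, equations, and table text more carefully.

```python
import re


def chunk_by_sentences(text, max_chars=700, overlap_sentences=1):
    """Pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: sentences are split with a simple regex, and the
    last `overlap_sentences` sentences of a chunk are repeated at the start
    of the next one. A single sentence longer than max_chars is kept whole.
    """
    # Normalize whitespace, then split after ., ! or ? followed by a space
    normalized = " ".join(text.split())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]

    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry the trailing sentences over as overlap
            current = current[-overlap_sentences:] if overlap_sentences else []
            # Drop the overlap if it would push the next chunk over the limit
            if len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Attention maps a query to an output. "
    "Keys and values are vectors. "
    "The output is a weighted sum."
)
for chunk in chunk_by_sentences(sample, max_chars=70):
    print(chunk)
```

With this approach every chunk starts and ends on a sentence boundary, and the repeated sentence plays the role that the character overlap plays in the cell above.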


### Generate Multimodal Embeddings


<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>We will use the same function defined in <b>Lab 2</b> to generate embeddings from <b>text</b> or <b>image data</b>.</p>

  <p>The following function is used to generate <b>multimodal embeddings</b> using <b>Amazon's Titan Multimodal Embeddings model</b>. Embeddings can be generated with <b>text data</b>, <b>image data</b>, or <b>both</b>.</p>
</div>
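
To make the three input combinations concrete, the sketch below builds only the JSON request body, without calling the model. The helper name `build_embedding_request` and the `SUPPORTED_LENGTHS` tuple are hypothetical; the list of accepted output lengths is an assumption that should be checked against the current Amazon Bedrock documentation.

```python
import json

# Output sizes assumed to be accepted by Titan Multimodal Embeddings;
# verify against the current Amazon Bedrock documentation
SUPPORTED_LENGTHS = (256, 384, 1024)


def build_embedding_request(prompt=None, image_b64=None, output_embedding_length=384):
    """Build the JSON request body for a text, image, or combined embedding."""
    if not prompt and not image_b64:
        raise ValueError("Provide a text prompt, a base64 image, or both")
    if output_embedding_length not in SUPPORTED_LENGTHS:
        raise ValueError(f"output_embedding_length must be one of {SUPPORTED_LENGTHS}")

    body = {"embeddingConfig": {"outputEmbeddingLength": output_embedding_length}}
    if prompt:
        body["inputText"] = prompt
    if image_b64:
        body["inputImage"] = image_b64
    return json.dumps(body)


# Text-only request; pass image_b64 as well to embed text and image together
print(build_embedding_request(prompt="scaled dot-product attention"))
```

Validating the inputs before the network call surfaces mistakes such as an unsupported vector length without spending a request.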


In [5]:
def generate_multimodal_embeddings(prompt=None, image=None, output_embedding_length=384):
    """
    Invoke the Amazon Titan Multimodal Embeddings model using the Amazon Bedrock runtime.

    Args:
        prompt (str): The text to embed.
        image (str): Base64-encoded image data.
        output_embedding_length (int): Length of the returned embedding vector.
    Returns:
        list[float]: The embedding vector for the given input.

    Raises:
        ValueError: If neither a prompt nor an image is provided.
        ClientError: If the model invocation fails.
    """
    if not prompt and not image:
        raise ValueError("Please provide either a text prompt, base64 image or both as input")
    
    # Initialize the Amazon Bedrock runtime client
    client = boto3.client(service_name="bedrock-runtime")
    model_id = "amazon.titan-embed-image-v1"
    
    body = {"embeddingConfig": {"outputEmbeddingLength": output_embedding_length}}
    
    if prompt:
        body["inputText"] = prompt
    if image:
        body["inputImage"] = image

    try:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(body),
            accept = "application/json",
            contentType = "application/json"
        )

        # Process and return the response
        result = json.loads(response.get("body").read())
        return result.get("embedding")

    except ClientError as err:
        logger.error(
            "Couldn't invoke the %s model. Here's why: %s: %s",
            model_id,
            err.response["Error"]["Code"],
            err.response["Error"]["Message"],
        )
        raise

#### Let's use the `generate_multimodal_embeddings` function to generate embeddings of every item extracted from the PDF

In [6]:
embedding_vector_dimension = 384
for item in tqdm(items, "Generating embeddings"):
    if item['type'] == 'text':
        item['embedding'] = generate_multimodal_embeddings(prompt=item['text'], output_embedding_length=embedding_vector_dimension)
    else:
        item['embedding'] = generate_multimodal_embeddings(image=item['image'], output_embedding_length=embedding_vector_dimension)

Generating embeddings: 100%|██████████| 85/85 [00:09<00:00,  8.67it/s]


### Create vector database

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>In this section, we will create an index using <b>FAISS</b>, similar to <b>Lab 2</b>. We will use a flat index (<a href="https://www.pinecone.io/learn/series/faiss/faiss-tutorial/#IndexFlatL2"><b>IndexFlatL2</b></a>), which performs an exhaustive search: it computes the L2 (or <b>Euclidean</b>) distance between the query vector and every vector loaded into the index.</p>

  <div style="text-align: center;">
    <img src="data/vectordb.png" width="500"/>
  </div>
</div>
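
The exhaustive search performed by a flat L2 index can be sketched in a few lines of NumPy. This is an illustration of the idea, not FAISS's implementation; the function name `brute_force_l2_search` is hypothetical. It returns squared distances, which, to the best of our understanding, is also what `IndexFlatL2` reports.

```python
import numpy as np


def brute_force_l2_search(vectors, query, k=5):
    """Return (squared distances, indices) of the k vectors closest to query."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Squared Euclidean distance from the query to every stored vector
    squared_distances = np.sum((vectors - query) ** 2, axis=1)
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(squared_distances)[:k]
    return squared_distances[nearest], nearest


stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = brute_force_l2_search(stored, [2.0, 1.0], k=2)
print(indices, distances)
```

Because every stored vector is compared with the query, results are exact, but search time grows linearly with the number of items in the index.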


In [7]:
all_embeddings = np.array([item['embedding'] for item in items])

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>Now, we will use <code>IndexFlatL2</code> as the index type for the <b>vector database</b>. You may want to try a different index type and observe how the <b>speed</b> and the <b>quality of results</b> change.</p>
</div>


In [8]:
# Create FAISS Index
index = faiss.IndexFlatL2(embedding_vector_dimension)
index.reset() # Clear any pre-existing index
index.add(np.array(all_embeddings, dtype=np.float32))

### Generate a RAG Response

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>In this section, we will define the function <code>generate_rag_response</code> to generate a response with a <b>retrieval-augmented prompt</b>.</p>

  <p>First, let's define the <code>invoke_claude_3_multimodal</code> function that we used in <b>Lab 1</b> to generate a response to a <b>multimodal prompt</b>.</p>
</div>


In [9]:
def invoke_claude_3_multimodal(prompt, images, image_types):
    """
    Invoke the Claude-3 multimodal model from Anthropic using AWS Bedrock runtime.

    Args:
        prompt (str): The text prompt to provide to the model.
        images (list): A list of base64-encoded image data.
        image_types (list): A list of MIME types corresponding to the images.

    Returns:
        str: The model's response text.

    Raises:
        ClientError: If the model invocation fails.
    """
    # Initialize the Amazon Bedrock runtime client
    client = boto3.client(service_name="bedrock-runtime")
    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

    # Prepare the multimodal prompt message
    message_content = []

    # Add each image to the message content
    for image, img_type in zip(images, image_types):
        message_content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": img_type,
                "data": image,
            },
        })
    message_content.append({"type": "text", "text": prompt})

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature": 0.2,
        "top_p": 1.0,
        "top_k": 250,
        "messages": [
            {
                "role": "user",
                "content": message_content,
            }
        ],
    }

    try:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body),
        )

        # Process and return the response
        result = json.loads(response.get("body").read())
        return result['content'][0]['text']

    except ClientError as err:
        logger.error(
            "Couldn't invoke the %s model. Here's why: %s: %s",
            model_id,
            err.response["Error"]["Code"],
            err.response["Error"]["Message"],
        )
        raise

The following function, `generate_rag_response`, builds a prompt from the user query and the retrieved items, then invokes the LLM to generate a RAG response.

In [10]:
def generate_rag_response(prompt, matched_items):
    
    # Create context
    text_context = ""
    image_context = []
    
    for item in matched_items:
        if item['type'] == 'text':
            text_context += str(item["page"]) + ". " + item['text'] + "\n"
        else:
            image_context.append(item['image'])
    
    # Cap the number of images sent to the model at 5
    image_context = image_context[:5]
    
    final_prompt = f"""You are a helpful assistant for question answering.
    The text context is relevant information retrieved.
    The provided image(s) are relevant information retrieved.
    
    <context>
    {text_context}
    </context>
    
    Answer the following question using the relevant context and images.
    
    <question>
    {prompt}
    </question>
    
    Answer:"""
    
    return invoke_claude_3_multimodal(final_prompt, image_context, ['image/png' for _ in image_context])
    

### Test RAG Workflow

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>Now that we have our functions ready, let's test our <b>RAG application</b> using a few prompts.</p>

  <p>The steps we follow to generate a <b>RAG response</b> are:</p>
  <ol>
    <li>Generate an embedding of the user query. The embedding would represent the <b>text</b> and the <b>images</b> provided in the user query.</li>
    <li>Retrieve similar items from the vector database using a <b>nearest neighbor</b> search.</li>
    <li>Create a prompt using the user query as well as the retrieved items.</li>
    <li>Generate a response using the <b>retrieval-augmented prompt</b>.</li>
  </ol>
</div>
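
The four steps above can be sketched as a single function. The name `answer_query` and the injected `embed_fn` and `generate_fn` parameters are hypothetical; in this exercise they would be `generate_multimodal_embeddings` and `generate_rag_response`. `TinyIndex` is a stand-in for a FAISS index that exists only so the flow can be tried without AWS access.

```python
import numpy as np


def answer_query(query, items, index, embed_fn, generate_fn, k=5):
    """Run the four RAG steps for one query and return the model's answer."""
    # 1. Embed the user query
    query_vector = np.array(embed_fn(query), dtype=np.float32).reshape(1, -1)
    # 2. Retrieve the k nearest items from the vector index
    _, neighbor_ids = index.search(query_vector, k)
    # FAISS pads missing results with -1, so skip those ids
    matched_items = [items[i] for i in neighbor_ids.flatten() if i >= 0]
    # 3 and 4. Build the retrieval-augmented prompt and generate the response
    return generate_fn(query, matched_items)


class TinyIndex:
    """Stand-in for a FAISS index, for demonstration only."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, queries, k):
        # Squared L2 distance between every query and every stored vector
        d = ((self.vectors[None, :, :] - queries[:, None, :]) ** 2).sum(axis=2)
        ids = np.argsort(d, axis=1)[:, :k]
        return np.take_along_axis(d, ids, axis=1), ids


demo_items = [{"type": "text", "text": "alpha"}, {"type": "text", "text": "beta"}]
demo_index = TinyIndex([[0.0, 0.0], [1.0, 1.0]])
answer = answer_query(
    "which item is closest?",
    demo_items,
    demo_index,
    embed_fn=lambda q: [0.9, 0.9],                       # fake embedding
    generate_fn=lambda q, matched: matched[0]["text"],   # fake LLM call
    k=1,
)
print(answer)
```

Passing the embedding and generation functions as arguments keeps the retrieval logic easy to test and makes it simple to swap in a different model later.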


In [11]:
query = "How is the scaled dot-product attention calculated?"

query_embedding = generate_multimodal_embeddings(prompt=query,output_embedding_length=embedding_vector_dimension)
distances, result = index.search(np.array(query_embedding, dtype=np.float32).reshape(1,-1), k=5)

In [12]:
result.flatten()

array([18, 20, 21, 22, 55])

In [13]:
# Use a distinct loop variable so the FAISS `index` object is not shadowed
matched_items = [items[i] for i in result.flatten()]

In [14]:
response = generate_rag_response(query, matched_items)

In [15]:
display.Markdown(response)

According to the provided context, the scaled dot-product attention is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Where:

- Q is the matrix of queries 
- K is the matrix of keys
- V is the matrix of values
- d_k is the dimension of the keys

The key steps are:

1. Take the dot product of the query matrix Q with the transpose of the key matrix K^T. This gives the similarity scores between each query and key.

2. Scale the similarity scores by dividing by sqrt(d_k), where d_k is the dimension of the keys. This helps prevent extremely small gradients for large values of d_k.

3. Apply the softmax function to the scaled scores to obtain the attention weights.

4. Multiply the attention weights with the value matrix V to get the weighted sum of values, which are the attended outputs.

So in essence, it computes the dot product similarities between queries and keys, scales them, converts to a probability distribution via softmax, and uses that to compute a weighted sum of the values. The scaling factor sqrt(d_k) helps stabilize the softmax computation.

<div style="background-color:#f0f8ff; padding: 15px; border-radius: 10px; border-left: 6px solid #4682B4;">
  <p>We have now seen one example question and answer. Try asking more questions; some examples are given below:</p>
  <ul style="text-align: left;">
    <li>"How long were the base and big models trained?"</li>
    <li>"Which optimizer was used when training the models?"</li>
    <li>"What is the position-wise feed-forward network mentioned in the paper?"</li>
    <li>"What is the BLEU score of the model in English to French translation (EN-FR)?"</li>
    <li>"What is the BLEU score of the model in English to German translation (EN-DE)?"</li>
  </ul>
</div>
