# Multimodal RAG

***This notebook works well with the `Data Science 3.0 Python 3` kernel and `ml.t3.medium` instance type.***

***Run the [`0_data_prep.ipynb`](./0_data_prep.ipynb) notebook prior to running this notebook.***

1. We download a subset of data from the [Amazon Berkeley Objects](https://amazon-berkeley-objects.s3.amazonaws.com/index.html) dataset. The data includes Amazon products with metadata and catalog images. The metadata includes multiple tags that provide a short text description of the product in the image. To limit the size of the data, we keep only the images whose tags include a description in a given language (`enUS` in our example).

1. We convert all downloaded images into Base64 encoding.

1. An image and its associated text are converted into embeddings in a single `invoke_model` call to the `amazon.titan-embed-image-v1` model. We embed all images in our dataset in this way.

1. These embeddings are then ingested into an in-memory [FAISS](https://github.com/facebookresearch/faiss) index, which stores the embedding vectors and supports similarity search over them. In a real-world scenario, you will likely want to use a persistent data store such as the [vector engine for Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/serverless-vector-engine/) or the pgvector extension for PostgreSQL.

1. For retrieval, we consider the following scenario: a customer is looking for a product and has a text description of it and, optionally, an image. We convert the text description and the image into embeddings using the `Amazon Titan Embeddings G1 - Image` model, run an embeddings-based similarity search, and retrieve the most relevant results from the vector database.

1. We then refine these search results by creating a text prompt from the descriptions of the retrieved objects and asking the LLM (`Anthropic Claude V2`) to do the following:
    * Reason through the search results based on the customer's description of what they are looking for, and then either accept or reject each result.
    * Explain why each result was accepted or rejected.

1. The text response generated by the model, along with the accepted set of results (images and text), is returned to the customer.

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt


In [None]:
import os
import io
import sys
import json
import glob
import faiss
import boto3
import base64
import logging
import requests
import numpy as np
import pandas as pd
from PIL import Image
from globals import *
from typing import List
from botocore.auth import SigV4Auth
from faiss import write_index, read_index
from langchain_aws import BedrockLLM
from botocore.awsrequest import AWSRequest
from faiss import IndexFlatIP  # top-level import; the swigfaiss_avx2 submodule is not present in every faiss build

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)


In [None]:
# global constants
!pygmentize globals.py


In [None]:
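def display_image(fpath: str) -> Image.Image:
    # open the image file and return it so that the notebook can render it
    image = Image.open(fpath)
    return image

# read an image file and return its contents as a base64 encoded string
def encode_image_to_base64(image_file_path: str) -> str: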
    with open(image_file_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode('utf8')
    return b64_image
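
As a quick sanity check on the encoding step, the sketch below round-trips a few bytes through Base64 in memory. The `sample_bytes` value is a made-up stand-in for real image bytes, so nothing here touches the dataset or the file system.

```python
import base64

# Hypothetical stand-in for the raw bytes of an image file (here: the PNG signature).
sample_bytes = b"\x89PNG\r\n\x1a\n"

# Encode to a Base64 text string, the same transformation encode_image_to_base64 applies.
encoded = base64.b64encode(sample_bytes).decode("utf8")

# Decode back to bytes; this is the step needed before handing the data to PIL.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == sample_bytes)
```

Base64 output is plain ASCII text, which is why the encoded images can be stored as `.b64` files and embedded directly in a JSON request body.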


In [None]:
bedrock = boto3.client(
    service_name="bedrock-runtime", region_name="us-east-1", endpoint_url=FMC_URL
)


In [None]:
image_b64_file_list = glob.glob(os.path.join(B64_ENCODED_IMAGES_DIR, "*.b64"))
logger.info(f"there are {len(image_b64_file_list)} base64 encoded images in {B64_ENCODED_IMAGES_DIR}")
image_b64_file_list[:10]


In [None]:
image_dataset = pd.read_csv(IMAGE_DATASET_FNAME)
logger.info(f"there are {len(image_dataset)} rows in the {IMAGE_DATASET_FNAME} dataset")

# only keep the rows for which we have image files downloaded already
image_dataset['path_b64'] = image_dataset.path.map(lambda x: os.path.join(B64_ENCODED_IMAGES_DIR, f"{os.path.basename(x)}.b64"))
logger.info(image_dataset.head())

image_b64_list_dataframe = pd.DataFrame(image_b64_file_list)
image_b64_list_dataframe.columns = ["path_b64"]
image_dataset = pd.merge(left=image_dataset, right=image_b64_list_dataframe, how="inner", on="path_b64")
image_dataset


In [None]:
from typing import Optional

def get_embeddings(text: str, image: str) -> Optional[np.ndarray]:
    """Return a (1, d) float32 embedding for the text and image pair, or None on failure."""
    # text is the description and image is the base64 encoded image string;
    # you can specify either text or image or both, this notebook sends both
    body = json.dumps(
        {
            "inputText": text,
            "inputImage": image
        }
    )

    modelId = FMC_MODEL_ID
    accept = ACCEPT_ENCODING
    contentType = CONTENT_ENCODING

    try:
        response = bedrock.invoke_model(
            body=body, modelId=modelId, accept=accept, contentType=contentType
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        # str() guards the slice in case image is None
        logger.error(f"exception while encoding text={text}, image(truncated)={str(image)[:10]}, exception={e}")
        embeddings = None
    return embeddings
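
`get_embeddings` always puts both keys in the request body. For a text-only or image-only query, one option is to leave the missing key out instead of sending an empty value. The sketch below shows that idea; `build_embedding_request` is a hypothetical helper introduced for illustration (it is not part of this notebook or of boto3), and only the `inputText` and `inputImage` key names come from the cell above.

```python
import json
from typing import Optional


def build_embedding_request(text: Optional[str] = None, image_b64: Optional[str] = None) -> str:
    """Hypothetical helper: serialize only the modalities that were actually provided."""
    payload = {}
    if text:
        payload["inputText"] = text
    if image_b64:
        payload["inputImage"] = image_b64
    if not payload:
        # nothing to embed, so fail early instead of calling the model
        raise ValueError("provide text, an image, or both")
    return json.dumps(payload)


print(build_embedding_request(text="trail running shoe"))
```

The returned string could be passed as the `body` argument of `invoke_model`, assuming the model accepts single-modality requests as the comment in `get_embeddings` suggests.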

    


In [None]:
%%time

import faiss
import numpy as np

index = None
image_dataset_successful_embeddings_only = []
for _, row in image_dataset.iterrows():
    logger.info(f"encoding image={row['path_b64']}, description={row['description']}")
    path_b64 = row['path_b64']
    # MAX image size supported is 2048 * 2048 pixels
    with open(path_b64, "rb") as image_file:
        input_image_b64 = image_file.read().decode('utf-8')        
    # pd.isna is a reliable missing-value check; an identity test against np.nan is not
    input_text = "No description" if pd.isna(row['description']) else row['description']

    embeddings = get_embeddings(input_text, input_image_b64)
    if embeddings is None:
        logger.error(f"error creating embeddings for {row}")
        continue
    image_dataset_successful_embeddings_only.append(row)
    if index is None:
        vector_dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(vector_dimension)
    
    faiss.normalize_L2(embeddings)
    index.add(embeddings)
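
The ingestion loop normalizes each vector to unit length before adding it to an inner-product index. On unit vectors the inner product equals cosine similarity, so the index effectively ranks by cosine similarity. The sketch below reproduces that ranking in plain Python on made-up 2-D vectors; the helper names are hypothetical, and the real index does the same arithmetic far more efficiently.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[float, int]]:
    """Return the k best (score, position) pairs, highest score first."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(v)), i) for i, v in enumerate(vectors)]
    return sorted(scored, reverse=True)[:k]


# Toy data: vector 0 points the same way as the query, vector 2 is at 45 degrees.
toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = top_k([2.0, 0.0], toy_vectors, k=2)
print(results)
```

Vector length does not affect the ranking here: `[3.0, 3.0]` is the longest vector but scores lower than `[1.0, 0.0]` because only direction matters after normalization.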


In [None]:
logger.info(f"successfully ingested {len(image_dataset_successful_embeddings_only)} images and descriptions into the vector db index")
image_dataset_successful_embeddings_only_df = pd.DataFrame(image_dataset_successful_embeddings_only)
image_dataset_successful_embeddings_only_df.to_csv(IMAGE_DATA_W_SUCCESSFUL_EMBEDDINGS_FPATH, index=False)
image_dataset_successful_embeddings_only_df.head()


In [None]:
logger.info(f"going to save vectordb index with {index.ntotal} vectors to {VECTOR_DB_INDEX_FPATH}")
write_index(index, VECTOR_DB_INDEX_FPATH)


In [None]:
logger.info(f"going to load vectordb index from {VECTOR_DB_INDEX_FPATH}")
index = read_index(VECTOR_DB_INDEX_FPATH)
logger.info(f"there are {index.ntotal} elements in index loaded from {VECTOR_DB_INDEX_FPATH}, index type={type(index)}")


In [None]:
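def find_multimodal_match(search_text: str, search_image: str, index: faiss.IndexFlatIP, k: int) -> pd.DataFrame:
    logger.info(f"search_text={search_text}, search_image(truncated)={str(search_image)[:100]}, index={index}, k={k}")
    search_vector = get_embeddings(search_text, search_image)
    if search_vector is None:
        raise ValueError("could not create embeddings for the search query")
    # normalize the query the same way the indexed vectors were normalized
    faiss.normalize_L2(search_vector)

    # the inner-product index returns similarity scores (higher is better) and row positions
    distances, ann = index.search(search_vector, k=k)
    matches = {'distances': distances[0], 'ann': ann[0]}
    results = pd.DataFrame(matches).sort_values(by="distances", ascending=False)

    return results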


In [None]:
from botocore.exceptions import ClientError

def generate_image(prompt: str, negative_prompts: str) -> str:
    body = json.dumps(
        {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {
                "text": prompt,                    # Required
                "negativeText": negative_prompts   # Optional
            },
            "imageGenerationConfig": {
                "numberOfImages": 1,   # Range: 1 to 5 
                "quality": "standard",  # Options: standard or premium
                "height": 512,        # Supported height list in the docs see here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-image.html#w379aac17c27c15c15b7c21b5b7
                "width": 512,         # Supported width list in the docs see here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-image.html#w379aac17c27c15c15b7c21b5b7
                "cfgScale": 7.5,       # Range: 1.0 (exclusive) to 10.0
                "seed": 42             # Range: 0 to 2147483646
            }
        }
    )
    modelId = "amazon.titan-image-generator-v1"
    accept = "application/json"
    contentType = "application/json"

    try:
        response = bedrock.invoke_model(
            body=body, modelId=modelId, accept=accept, contentType=contentType
        )
        response_body = json.loads(response.get("body").read())
    except ClientError as error:
        if error.response['Error']['Code'] == 'AccessDeniedException':
            print(f"\x1b[41m{error.response['Error']['Message']}\
                    \nTo troubleshoot this issue please refer to the following resources.\
                    \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
                    \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")
        # re-raise in every case: there is no image to return after a failed call
        raise

    # the image is returned as a base64 encoded string; the caller decodes it for display
    base_64_img_str = response_body.get("images")[0]
    return base_64_img_str


In [None]:
search_text:str = "black or gray colored running shoe for men, trail running, comfortable, wide toebox. Only show PUMA or Nike."
image_prompt:str = "A zoomed in product image of a black or gray colored trail running shoe for men, comfortable with wide toebox."
negative_prompts: str = "in focus background, in focus athlete, front view"
search_image = generate_image(image_prompt, negative_prompts)
image = Image.open(io.BytesIO(base64.decodebytes(bytes(search_image, "utf-8"))))
display(image)


In [None]:
image_dataset_successful_embeddings_only_df = pd.read_csv(IMAGE_DATA_W_SUCCESSFUL_EMBEDDINGS_FPATH)
image_dataset_successful_embeddings_only_df.head()


In [None]:
matches = find_multimodal_match(search_text, search_image, index, K)
display(matches)
matches_from_dataset = image_dataset_successful_embeddings_only_df.iloc[list(matches.ann), :]
matches_from_dataset
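
One edge case worth guarding against: when an index holds fewer than `k` vectors, FAISS fills the unused result slots with the id `-1`, and `iloc` would silently read `-1` as "the last row". A minimal sketch of a filter that could run before the `iloc` lookup follows; `drop_missing_ids` is a hypothetical helper, and the sample ids and scores are invented.

```python
from typing import List, Sequence, Tuple


def drop_missing_ids(ids: Sequence[int], scores: Sequence[float]) -> Tuple[List[int], List[float]]:
    """Hypothetical helper: keep only (id, score) pairs whose id is a real row position."""
    kept = [(int(i), float(s)) for i, s in zip(ids, scores) if i >= 0]
    return [i for i, _ in kept], [s for _, s in kept]


# Invented search output: two real hits and one padded slot.
ids, scores = drop_missing_ids([4, 1, -1], [0.91, 0.80, -3.4e38])
print(ids, scores)
```

With the dataset used in this notebook the index is much larger than `K`, so the filter would normally be a no-op.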


In [None]:
logger.info(f"search text = \"{search_text}\"")

for i, row in matches_from_dataset.iterrows():
    logger.info(f"--------- Match {i} ----------")
    logger.info(row['description'])
    fpath = os.path.join(IMAGES_DIR, os.path.basename(row['path']))
    logger.info(f"image file={fpath}")
    image = display_image(fpath)
    display(image)


In [None]:
inference_modifier = {
    "max_tokens_to_sample": 4096,
    "temperature": 0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman"],
}
textgen_llm = BedrockLLM(
    model_id=CLAUDE_V2_MODEL_ID,
    model_kwargs=inference_modifier,
)


In [None]:
logger.info(f"search text = \"{search_text}\"")
prompt: str = """Human: You are a shopping assistant bot helping a customer find what they are looking for in an online product catalog.
The user query and search results from a product catalog are provided below. Some of the search results may not be relevant to the user query.
Pick the results that exactly match the user's criteria and return their indexes. If you do not find an exact match, then return the next most relevant
results, but explicitly say that these are the most relevant results and not exact matches to the user's criteria.

<user_query>
{}
</user_query>

<search_results>
{}
</search_results>

Now, based on the information provided above, provide the indexes of the most relevant results and say why you chose those results. For the results that
you did not choose, say why not.
Finally, as the last line of your response, write the most relevant indexes as a comma separated list in a line all by itself as:
<relevant_indices></relevant_indices>

Answer:
"""

search_results: str = ""
for i, row in matches_from_dataset.iterrows():
    search_results += f"Index: {i}, Description: {row['description']}\n\n"
prompt = prompt.format(search_text, search_results)
print(prompt)


In [None]:
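# invoke() is the current LangChain entry point; calling the LLM object directly is deprecated
response = textgen_llm.invoke(prompt)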
logger.info(response)


In [None]:
# strip first so that a trailing newline does not leave us with an empty last line
index_list_line = response.strip().split("\n")[-1]
logger.info(f"index_list_line={index_list_line}")


In [None]:
import re
found = re.findall(r'<relevant_indices>(.*?)</relevant_indices>', index_list_line)
logger.info(f"found={found}")
if len(found) > 0:
    # keep only the integer tokens so that an empty tag or stray text does not raise ValueError
    indices = [int(i) for i in re.findall(r"\d+", found[0])]
    best_matches = image_dataset_successful_embeddings_only_df.iloc[indices, :]
    for i, row in best_matches.iterrows():
        logger.info(f"--- Best match {i} -----")
        logger.info(f"Description = {row['description']}")
        fpath = os.path.join(IMAGES_DIR, os.path.basename(row['path']))
        logger.info(f"image file={fpath}")
        image = display_image(fpath)
        display(image)
else:
    logger.error(f"the model did not find any valid matches in the multimodal results")
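
The parsing above depends on the model putting the tag on the very last line. A somewhat more forgiving approach is to search the whole response and discard any index that falls outside the dataset. The sketch below illustrates this under the assumption that the model echoes the `<relevant_indices>` tag from the prompt; `parse_relevant_indices` and the sample response are hypothetical.

```python
import re
from typing import List


def parse_relevant_indices(response: str, n_rows: int) -> List[int]:
    """Hypothetical helper: extract in-range integer indices from the last tag in the response."""
    tags = re.findall(r"<relevant_indices>(.*?)</relevant_indices>", response, flags=re.DOTALL)
    if not tags:
        return []
    # pull out integer tokens only, then drop anything that is not a valid row position
    indices = [int(tok) for tok in re.findall(r"\d+", tags[-1])]
    return [i for i in indices if i < n_rows]


sample = "Result 3 matches.\n<relevant_indices>3, 12, 99</relevant_indices>\n"
print(parse_relevant_indices(sample, n_rows=50))
```

Here `n_rows` would be `len(image_dataset_successful_embeddings_only_df)`, so a hallucinated index cannot trigger an `IndexError` in the `iloc` lookup.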
