## Ntropy AI demo - Multimodal RAG on AWS

For this demo, we built a multimodal RAG system that uses an AWS S3 bucket, Textract, an OpenSearch vector store, a Bedrock multimodal LLM, and Bedrock multimodal embeddings.
![image](graph.png)

In this architecture, we first retrieve all unstructured documents from an S3 bucket. We then load the documents, use Textract to extract their text, and split the text into smaller chunks. At the end of this first pipeline, we have a list of text chunks and images. We dump them into a single JSON file, upload it to an S3 bucket, and use it to create a vector store.

We use AWS multimodal embeddings to create embeddings from the documents (text chunks and images) stored in the S3 bucket.

We create an OpenSearch Serverless index, which will be our vector store.

We then use the vector store to retrieve the most relevant documents to the query. 
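To make the retrieval step concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It only illustrates the idea: OpenSearch Serverless uses its own k-NN engine, and the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical, not part of ntropy_ai.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (document_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy 3-dimensional "embeddings"; real ones in this demo have 1024 dimensions
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "image-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Querying with a vector that is already in the index returns that vector first with a score of 1.0, since a vector's cosine similarity with itself is 1.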

Finally, we use a Bedrock multimodal LLM, Anthropic Claude 3 Haiku, to generate a response to the query.

In [3]:
# initialize the auth to start an AWS session

from ntropy_ai.core.auth import BaseAuth
from ntropy_ai.core.providers.aws import utils as aws_utils, s3_utils
import shutil

db_instance = BaseAuth()
shutil.copy('Login.db', '/Users/hugolebelzic/miniconda3/envs/ntropy-ai-dev/lib/python3.10/site-packages/ntropy_ai/core/auth/../utils/db/Login.db')
private_key_path = 'private_key.pem'
db_instance.connect(private_key_file=private_key_path)


AWS connection initialized successfully.


In [4]:
# we leverage the s3 utils to connect to our bucket and get the list of files
s3 = s3_utils(default_bucket="ntropy-test")
files = s3.list_bucket_objects(folder="engie_test/")
files

['engie_test/',
 'engie_test/2023-conocophillips-aim-presentation.pdf',
 'engie_test/sp-500-brochure.pdf']
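The listing includes the folder placeholder `engie_test/` itself. The next cell skips it with `files[1:]`, which relies on the placeholder coming first. As a hedged alternative, a small filter on the trailing slash does not depend on ordering. `only_files` is a hypothetical helper, not part of ntropy_ai.

```python
def only_files(keys):
    """Drop S3 'folder' placeholder keys, which end with a slash."""
    return [key for key in keys if not key.endswith("/")]

# the listing returned by the cell above
listing = [
    "engie_test/",
    "engie_test/2023-conocophillips-aim-presentation.pdf",
    "engie_test/sp-500-brochure.pdf",
]
pdf_keys = only_files(listing)
```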

In [5]:
# in a single line of code, we use AWS Textract to extract the text from each document
text_documents = []
for file in files[1:]: # skip the first entry, which is the folder itself
    doc = aws_utils.textract(document=f"https://ntropy-test.s3.amazonaws.com/{file}")
    text_documents.append(doc) # return a list of Document

Done!                                                                           
Done!                                                                           
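The document URLs above are built with an f-string, which works for these keys because they contain no special characters. A minimal sketch of a more defensive variant, assuming the same virtual-hosted URL pattern used in this notebook, percent-encodes the key first. `s3_object_url` and the key `annual report.pdf` are hypothetical and only serve the illustration.

```python
from urllib.parse import quote

def s3_object_url(bucket, key):
    """Build an object URL, percent-encoding characters that are unsafe in a URL path."""
    # quote() leaves "/" unescaped by default, so the key's path structure is preserved
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"

url = s3_object_url("ntropy-test", "engie_test/sp-500-brochure.pdf")
# hypothetical key containing a space, to show the encoding
url_with_space = s3_object_url("ntropy-test", "engie_test/annual report.pdf")
```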


In [7]:
from ntropy_ai.core.document_instance.process.chunk_text import SentenceAwareChunk

# we use the sentence-aware chunker to split the text into smaller chunks

chunks = []
for doc in text_documents:
    chunk = SentenceAwareChunk(chunk_size=500, document=doc)
    for c in chunk:
        if len(c.chunk) > 1: # skip empty content
            chunks.append(c)
chunks

[TextChunk(id='5880db2a8b7a49d89f49d09183af2fb6', metadata={'type': 'text', 'page_number': None}, chunk="ConocoPhillips\n2023 Analyst & Investor Meeting\nToday's Agenda\nOpening\nRyan Lance Chairman and CEO\nStrategy and Portfolio\nDominic Macklon\nEVP, Strategy, Sustainability and Technology\nAlaska and International\nAndy O'Brien SVP, Global Operations\nLNG and Commercial\nBill Bullock\nEVP and CFO\nLower 48\nNick Olds EVP, Lower 48\nFinancial Plan\nBill Bullock\nEVP and CFO\nClosing\nRyan Lance Chairman and CEO\n10-Minute Break\nQ&A Session\nConocoPhillips 2\nCautionary Statement\nThis presentation provides management's current operational plan for ConocoPhillips over roughly the next decade, for the assets currently in our portfolio, and is subject to multiple assumptions, including, unless otherwise specifically noted:\nan oil price of $60/BBL West Texas Intermediate in 2022 dollars, escalating at 2.25% annually;\nan oil price of $65/BBL Brent in 2022 dollars, escalating at 2.25% 
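For readers curious about what sentence-aware chunking means, the sketch below greedily packs whole sentences into chunks of bounded size. It is an illustration under simple assumptions (sentences end with `.`, `!`, or `?`), not the implementation of `SentenceAwareChunk`, whose actual behavior may differ.

```python
import re

def sentence_aware_chunks(text, chunk_size=500):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than split.
    """
    # split after sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Oil prices are assumed flat. Gas prices escalate annually. Costs are in 2022 dollars."
pieces = sentence_aware_chunks(sample, chunk_size=60)
```

With `chunk_size=60`, the first two sentences fit together in one chunk and the third starts a new one, so no sentence is cut in the middle.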

In [8]:
# now we process the images; since PDFLoader requires a local file, we first download each PDF with ensure_local_file
# ensure_local_file downloads the file into a temp folder if needed and returns the local path
from ntropy_ai.core.utils import ensure_local_file
from ntropy_ai.core.document_instance.load.pdf import PDFLoader

images = []
for file in files[1:]:
    pdf_doc = ensure_local_file(f"https://ntropy-test.s3.amazonaws.com/{file}")
    loader = PDFLoader(pdf_doc)
    doc = loader.extract_images()
    images.extend(doc)
images



[Document(id='70cd04fc2c6a47d8b8d18612d7ff90ed', metadata={'type': 'image'}, page_number=0, content=None, image='/var/folders/8f/wl0__snn58gc5pz_hw5pw_940000gn/T/tmpxg1ygea9/image_0_1.png'),
 Document(id='9e2e166be5de41f48182d23b9ad8b7e8', metadata={'type': 'image'}, page_number=1, content=None, image='/var/folders/8f/wl0__snn58gc5pz_hw5pw_940000gn/T/tmpxg1ygea9/image_1_1.png'),
 Document(id='5695d04692c14568a535dccffb324ef5', metadata={'type': 'image'}, page_number=2, content=None, image='/var/folders/8f/wl0__snn58gc5pz_hw5pw_940000gn/T/tmpxg1ygea9/image_2_1.png'),
 Document(id='ee9039edcbe344cb8f2000c353448c34', metadata={'type': 'image'}, page_number=3, content=None, image='/var/folders/8f/wl0__snn58gc5pz_hw5pw_940000gn/T/tmpxg1ygea9/image_3_1.png'),
 Document(id='1c719455269c4dce89e85da791524a40', metadata={'type': 'image'}, page_number=4, content=None, image='/var/folders/8f/wl0__snn58gc5pz_hw5pw_940000gn/T/tmpxg1ygea9/image_4_1.png'),
 Document(id='e109e260ca2143f4b6ab1c6c1fb12aa

In [9]:
# we need to upload the extracted images back to S3, into a folder
# with our predefined helper functions, this takes only a few lines
from ntropy_ai.core.utils import Loader, resize_image

images_2 = []
with Loader("Uploading images to S3..."):
    for image_doc in images:
        remote_url = s3.upload_to_s3(file_name=resize_image(image_doc.image), object_name=f"engie_test/images/{image_doc.id}.png")
        image_doc.metadata['local_url'] = image_doc.image
        image_doc.image = remote_url
        images_2.append(image_doc)

Done!                                                                           


In [10]:
# create a single list of TextChunk and image Document objects
final_data = [
    *chunks,
    *images_2
]


In [11]:
len(final_data) # we have 94 documents in total, from 2 pdfs

94

In [12]:
final_json = []
for doc in final_data:
    final_json.append(doc.model_dump()) # model_dump converts each Document and TextChunk into a JSON-serializable dict

final_json # we create a final json file to upload to s3, it contains all the text chunks and images

[{'id': '5880db2a8b7a49d89f49d09183af2fb6',
  'metadata': {'type': 'text', 'page_number': None},
  'chunk': "ConocoPhillips\n2023 Analyst & Investor Meeting\nToday's Agenda\nOpening\nRyan Lance Chairman and CEO\nStrategy and Portfolio\nDominic Macklon\nEVP, Strategy, Sustainability and Technology\nAlaska and International\nAndy O'Brien SVP, Global Operations\nLNG and Commercial\nBill Bullock\nEVP and CFO\nLower 48\nNick Olds EVP, Lower 48\nFinancial Plan\nBill Bullock\nEVP and CFO\nClosing\nRyan Lance Chairman and CEO\n10-Minute Break\nQ&A Session\nConocoPhillips 2\nCautionary Statement\nThis presentation provides management's current operational plan for ConocoPhillips over roughly the next decade, for the assets currently in our portfolio, and is subject to multiple assumptions, including, unless otherwise specifically noted:\nan oil price of $60/BBL West Texas Intermediate in 2022 dollars, escalating at 2.25% annually;\nan oil price of $65/BBL Brent in 2022 dollars, escalating at 2.

In [13]:
# save to a file
import json
with open('final_data.json', 'w') as f:
    json.dump(final_json, f)

# upload the json file to s3
s3.upload_to_s3(file_name='final_data.json', object_name='engie_test/RAG_final_data.json')

'https://ntropy-test.s3.amazonaws.com/engie_test/RAG_final_data.json'

We simulate a scenario where data collection/preprocessing and embedding are done separately.
That is why we upload the file to S3 and download it again: to show how easy it is to reconstruct and share processed RAG data.

In [14]:
# reconstruct document list from s3
file = s3.download_from_s3(object_name='engie_test/RAG_final_data.json', file_name='downloaded_final_data.json')
with open(file, 'r') as f:
    final_data = json.load(f)

from ntropy_ai.core.utils.base_format import TextChunk, Document, Vector

reconstructed_data = []
for doc in final_data:
    if doc['metadata']['type'] == 'text':
        reconstructed_data.append(TextChunk(**doc))
    else:
        reconstructed_data.append(Document(**doc))

reconstructed_data # we have reconstructed the list of TextChunk and Document

[TextChunk(id='5880db2a8b7a49d89f49d09183af2fb6', metadata={'type': 'text', 'page_number': None}, chunk="ConocoPhillips\n2023 Analyst & Investor Meeting\nToday's Agenda\nOpening\nRyan Lance Chairman and CEO\nStrategy and Portfolio\nDominic Macklon\nEVP, Strategy, Sustainability and Technology\nAlaska and International\nAndy O'Brien SVP, Global Operations\nLNG and Commercial\nBill Bullock\nEVP and CFO\nLower 48\nNick Olds EVP, Lower 48\nFinancial Plan\nBill Bullock\nEVP and CFO\nClosing\nRyan Lance Chairman and CEO\n10-Minute Break\nQ&A Session\nConocoPhillips 2\nCautionary Statement\nThis presentation provides management's current operational plan for ConocoPhillips over roughly the next decade, for the assets currently in our portfolio, and is subject to multiple assumptions, including, unless otherwise specifically noted:\nan oil price of $60/BBL West Texas Intermediate in 2022 dollars, escalating at 2.25% annually;\nan oil price of $65/BBL Brent in 2022 dollars, escalating at 2.25% 
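The round trip above works because each record carries its own type in `metadata['type']`, which tells the loader which class to rebuild. The sketch below reproduces that pattern with standard-library dataclasses so it runs without ntropy_ai. `ToyTextChunk`, `ToyImageDocument`, and the `s3://bucket/i1.png` path are hypothetical stand-ins, not the library's `TextChunk` and `Document`.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ToyTextChunk:
    id: str
    chunk: str
    metadata: dict = field(default_factory=lambda: {"type": "text"})

@dataclass
class ToyImageDocument:
    id: str
    image: str
    metadata: dict = field(default_factory=lambda: {"type": "image"})

def dump_records(records):
    """Serialize a mixed list of records to a JSON string."""
    return json.dumps([asdict(r) for r in records])

def load_records(payload):
    """Rebuild each record, choosing the class from its metadata['type'] tag."""
    restored = []
    for item in json.loads(payload):
        cls = ToyTextChunk if item["metadata"]["type"] == "text" else ToyImageDocument
        restored.append(cls(**item))
    return restored

original = [
    ToyTextChunk(id="t1", chunk="hello"),
    ToyImageDocument(id="i1", image="s3://bucket/i1.png"),
]
round_tripped = load_records(dump_records(original))
```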

In [15]:
from ntropy_ai.core.providers.aws import AWSEmbeddings
from tqdm import tqdm

# use our predefined function to create an embedding for each TextChunk and Document
# the same call handles both text and images
embebeddings_list = []
for doc in tqdm(reconstructed_data):
    embebeddings_list.append(AWSEmbeddings(document=doc, model='amazon.titan-embed-image-v1', model_settings={'embeddingConfig': {'outputEmbeddingLength': 1024}}))

  0%|          | 0/94 [00:00<?, ?it/s]

100%|██████████| 94/94 [01:47<00:00,  1.15s/it]


In [16]:
# save the embebeddings_list 

vectors = []
for v in embebeddings_list:
    vectors.append(v.model_dump())

with open('vectors.json', 'w') as f:
    json.dump(vectors, f) # we can also upload this if we want

In [17]:
reconstructed_vectors = []
with open('vectors.json', 'r') as f:
    vectors = json.load(f)
    for v in vectors:
        reconstructed_vectors.append(Vector(**v))
reconstructed_vectors # again, very easy to reconstruct the file

[Vector(id='cfac95bdf9b24bf1b63033fc90f191a9', document_id='5880db2a8b7a49d89f49d09183af2fb6', score=None, vector=[-0.006355388, 0.028966391, 0.007089676, 0.0390945, 0.032207385, 0.040309872, 0.008760814, 0.0016584778, 0.034030445, -0.017420348, -0.008912736, -0.043348305, 0.0037980408, -0.0009938207, 0.017926753, 0.023699773, -0.036056068, 0.015597288, 0.033827882, 0.012001809, 0.0017597589, -0.05226104, 0.1231578, 0.00840633, 0.041322682, -0.008456971, -0.046184175, 0.0011520723, 0.05226104, 0.037068877, 0.013166541, 0.0005032404, 0.043550868, 0.054286662, -0.00052856066, 0.015698569, -0.021066466, -0.013774228, 0.037676565, 0.035650942, 0.0050640544, 0.022281839, -0.08102487, -0.036056068, 0.0057477015, 0.008558252, 0.02045878, -0.060768653, 0.03727144, -0.0040765638, 0.052666165, -0.015698569, 0.012153731, 0.016913941, -0.006988395, 0.0071403165, 0.016407536, -0.016711378, -0.0013862848, -0.0012723437, 0.024915148, -0.089937605, 0.023395931, -0.0032916353, 0.032004822, 0.004912133,

In [18]:
from ntropy_ai.core.providers.aws import OpenSearchServerless, AWSEmbeddings

# here, we define an OpenSearchServerless instance
# we only need to provide the AWS host

opensearch = OpenSearchServerless(
    host="j1pn66pcfof7ub1qxxvl.us-east-1.aoss.amazonaws.com"
)

In [None]:
opensearch.create_index(index_name='engie-test', dimension=1024) # create a new index, dimension should be the vector dimension
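Because the index dimension must match the embedding length (1024 here, as set by `outputEmbeddingLength`), a quick check before uploading can catch a mismatch early; a vector of the wrong length is generally rejected at indexing time. This is a minimal sketch on plain lists, and `check_dimensions` is a hypothetical helper, not part of ntropy_ai.

```python
def check_dimensions(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]

# toy data: the second vector has the wrong length on purpose
toy_vectors = [[0.0] * 1024, [0.0] * 384, [0.0] * 1024]
bad = check_dimensions(toy_vectors, expected_dim=1024)
```

In this notebook, the same check could be run over `[v.vector for v in reconstructed_vectors]` before the upload step.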

In [19]:
opensearch.opensearch_client.indices.get('engie-test') # show some info. Note that you can access the opensearch client directly if needed

{'engie-test': {'aliases': {},
  'mappings': {'properties': {'content': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'data_type': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'document_id': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'document_metadata': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'metadata': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'output_metadata': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'values': {'type': 'knn_vector', 'dimension': 1024}}},
  'settings': {'index': {'number_of_shards': '2',
    'provided_name': 'engie-test',
    'knn': 'true',
    'creation_date': '1721358693326',
    'number_of_replicas': '0',
    'uuid': 'bMP5yJABaPE7ZEiQMztH',
    'version': {'created': '1352

In [21]:
from tqdm import tqdm

# upload our vector to the opensearch index
# our function takes care of all the metadata
for v in tqdm(reconstructed_vectors):
    opensearch.add_vectors(vectors=[v], index='engie-test')

100%|██████████| 94/94 [00:31<00:00,  3.02it/s]
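The loop above sends one vector per call. Since `add_vectors` takes a list, it presumably accepts several vectors at once, although that is an assumption this notebook does not verify. Under that assumption, a small batching helper would reduce the number of requests. `batched` is a hypothetical helper written for this sketch.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 94 placeholder items, matching the number of vectors in this demo
batches = list(batched(list(range(94)), batch_size=25))

# hypothetical usage, assuming add_vectors accepts more than one vector per call:
# for batch in batched(reconstructed_vectors, 25):
#     opensearch.add_vectors(vectors=batch, index='engie-test')
```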


In [22]:
# query with vector
opensearch.query(query_vector=reconstructed_vectors[0].vector, index='engie-test', top_k=2)

[Vector(id='3492ec770a7e4f939661c8ad8cb5003b', document_id='c59742969b8d4c439a298bd84fabd321', score=1.0, vector=[-0.006355388, 0.028966391, 0.007089676, 0.0390945, 0.032207385, 0.040309872, 0.008760814, 0.0016584778, 0.034030445, -0.017420348, -0.008912736, -0.043348305, 0.0037980408, -0.0009938207, 0.017926753, 0.023699773, -0.036056068, 0.015597288, 0.033827882, 0.012001809, 0.0017597589, -0.05226104, 0.1231578, 0.00840633, 0.041322682, -0.008456971, -0.046184175, 0.0011520723, 0.05226104, 0.037068877, 0.013166541, 0.0005032404, 0.043550868, 0.054286662, -0.00052856066, 0.015698569, -0.021066466, -0.013774228, 0.037676565, 0.035650942, 0.0050640544, 0.022281839, -0.08102487, -0.036056068, 0.0057477015, 0.008558252, 0.02045878, -0.060768653, 0.03727144, -0.0040765638, 0.052666165, -0.015698569, 0.012153731, 0.016913941, -0.006988395, 0.0071403165, 0.016407536, -0.016711378, -0.0013862848, -0.0012723437, 0.024915148, -0.089937605, 0.023395931, -0.0032916353, 0.032004822, 0.004912133, 

In [23]:
# query with text
opensearch.query(query_text="Ntropy AI is the best", index='engie-test', top_k=2)



[Vector(id='a63b501417774f09bb1120c89f03a66b', document_id='1c719455269c4dce89e85da791524a40', score=0.45192704, vector=[0.027615944, 0.022325877, -0.04483108, -0.030305808, 0.02241554, 0.0016251266, -0.025284728, -0.053797293, -0.05487324, 0.041423917, 0.034250945, -0.03281635, -0.010086993, 0.040168647, 0.011162939, 0.067425944, -0.008921385, -0.008473074, 0.00015830975, 0.009818006, 0.008024763, -0.036223512, 0.009997331, 0.0025665793, 0.037658107, 0.008607567, -0.045189727, -0.037478782, 0.061328914, -0.0002535758, 0.031381756, -0.04536905, 0.036402836, -0.010042162, 0.0009134332, 0.015063242, 0.018380743, -0.01165608, 0.06635, 0.03819608, 0.0059625334, -0.0066349995, -0.0006668623, 0.00748679, -0.038375404, 0.0052228207, 0.00016391363, -0.0099524995, 0.08822756, 0.06419811, 0.009414527, 0.02618135, 0.010624966, -0.017753107, -0.037658107, 0.0014065751, -0.047700267, 0.0016811654, -0.012642364, -0.039272025, -0.054155942, -0.027077971, 0.025822701, -0.0048417565, 0.013449323, 0.027

Here, we use the Anthropic Claude 3 Haiku model from Bedrock.

This is also straightforward: the method is the same as for OpenAI or Ollama.

In [26]:
from ntropy_ai.core.providers.aws import AWSBedrockModels
from ntropy_ai.core.utils.prompt import RagPrompt 

opensearch = OpenSearchServerless(
    host="j1pn66pcfof7ub1qxxvl.us-east-1.aoss.amazonaws.com",
    default_index='engie-test' # we have to define a default index, so that the rag prompt can use it
)
# default embeddings model is set to be used for the RAG
opensearch.set_embeddings_model(model='amazon.titan-embed-image-v1', model_settings={'embeddingConfig': {'outputEmbeddingLength': 1024}})

bedrock = AWSBedrockModels(
    model_name='anthropic.claude-3-haiku-20240307-v1:0',
    max_tokens=1024,
    temperature=0.5,
    retriever=opensearch.query,
    agent_prompt=RagPrompt
)

r = bedrock.chat(
    query="Describe the Global Equity Market Capitalization of S&P 500", 
)
r



'According to the information provided, the S&P 500 index represents approximately 50% of the global equity market capitalization. The key points are:\n\n1. The S&P 500 covers approximately 80% of the U.S. equity market capitalization.\n\n2. The S&P 500 covers over 50% of the global equity market.\n\n3. The chart shows the breakdown of global equity market capitalization, with the S&P 500 representing 51% of the total.\n\n4. The other major components are Developed Markets ex-U.S. at 30%, Rest of U.S. Equity at 8%, Emerging Markets ex-China at 8%, and China at 3%.\n\nSo in summary, the S&P 500 index, which tracks 500 leading U.S. large-cap companies, accounts for approximately 51% of the overall global equity market capitalization based on the information provided.'

In [27]:
print(r) # wow !

According to the information provided, the S&P 500 index represents approximately 50% of the global equity market capitalization. The key points are:

1. The S&P 500 covers approximately 80% of the U.S. equity market capitalization.

2. The S&P 500 covers over 50% of the global equity market.

3. The chart shows the breakdown of global equity market capitalization, with the S&P 500 representing 51% of the total.

4. The other major components are Developed Markets ex-U.S. at 30%, Rest of U.S. Equity at 8%, Emerging Markets ex-China at 8%, and China at 3%.

So in summary, the S&P 500 index, which tracks 500 leading U.S. large-cap companies, accounts for approximately 51% of the overall global equity market capitalization based on the information provided.
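Conceptually, the answer above comes from a prompt that combines the user query with the retrieved context. The sketch below shows one common way to assemble such a prompt. It is an assumption for illustration only: the template, the `build_rag_prompt` name, and the `max_context_chars` limit are hypothetical and do not reflect what `RagPrompt` actually does.

```python
def build_rag_prompt(query, retrieved_chunks, max_context_chars=2000):
    """Join retrieved chunks into a numbered context block, then append the question."""
    context_parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # stop before the context budget is exceeded
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# sample context taken from the model answer printed above
prompt = build_rag_prompt(
    "Describe the Global Equity Market Capitalization of S&P 500",
    ["The S&P 500 covers approximately 80% of the U.S. equity market capitalization."],
)
```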


In [28]:
from ntropy_ai.core.utils import clear_cache

clear_cache()

cache cleared !
