# FAISS RAG Q&A Pattern with Amazon Bedrock using LangChain

### Challenges

When trying to solve a Question Answering task over a larger document corpus with the help of LLMs we need to master the following challenges (amongst others):
- How to manage large document(s) that exceed the token limit
- How to find the document(s) relevant to the question being asked

### Infusing knowledge into LLM-powered systems

LLM-powered systems draw on two primary types of knowledge:
- **Parametric knowledge**: refers to everything the LLM learned during training and acts as a frozen snapshot of the world for the LLM. 
- **Source knowledge**: covers any information fed into the LLM via the input prompt. 

When infusing knowledge into a generative-AI-powered application, we need to choose which of these types to target. Fine-tuning, explored in other workshops, improves the parametric knowledge. Since fine-tuning is a resource-intensive operation, it is well suited for infusing static domain-specific information such as domain-specific language and writing styles (medical domain, science domain, ...) or for optimizing performance on a very specific task (classification, sentiment analysis, RLHF, instruction fine-tuning, ...).

In contrast, targeting the source knowledge is very well suited for all kinds of dynamic information, from knowledge bases in structured and unstructured form to information from live systems. This lab is about retrieval-augmented generation (RAG), a common design pattern for ingesting domain-specific information through the source knowledge. It is particularly well suited for unstructured text with semi-frequent update cycles.

In this notebook we explain how to use the RAG pattern, which originates from [this paper](https://arxiv.org/pdf/2005.11401.pdf) published by Lewis et al. in 2020. It is particularly useful for question answering: it finds the most useful excerpts in a larger document corpus and uses them to answer the user's questions.

#### Prepare documents
![Embeddings](./images/Embeddings_lang.png)

Before being able to answer the questions, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using Amazon Bedrock Titan Embeddings model
- Create an index using the chunks and the corresponding embeddings
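
The preparation steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the notebook's implementation: `toy_embed` is a hypothetical stand-in for the Titan Embeddings model (it only counts words into hash buckets), `split_into_chunks` and `build_index` are made-up helper names, and the "index" is a simple list of chunk and vector pairs.

```python
import hashlib


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hypothetical embedding: count each word into one of `dim` hash buckets."""
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % dim] += 1.0
    return vector


def build_index(documents: list[str], chunk_size: int = 40) -> list[tuple[str, list[float]]]:
    """Load -> split -> embed -> store, mirroring the four steps listed above."""
    index = []
    for document in documents:
        for chunk in split_into_chunks(document, chunk_size):
            index.append((chunk, toy_embed(chunk)))
    return index


documents = ["Rent growth slowed in 2023. Concessions are rising across the country."]
index = build_index(documents)
print(f"{len(index)} chunks indexed, vector size {len(index[0][1])}")
```

A real embeddings model places texts with similar meaning close together, which the hash-based stand-in does not; it only keeps the sketch runnable without any service call.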
#### Ask question
![Question](./images/Chatbot_lang.png)

Once the document index is prepared, you are ready to ask questions, and the relevant documents will be fetched based on the question being asked. The following steps will be executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
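
The comparison and fetch steps above can be illustrated with a small ranking routine. This is a minimal sketch that assumes cosine similarity as the comparison measure (the vector store used later in this notebook may rank by a different metric); the three-dimensional vectors are made up for illustration, and `top_n_chunks` is a hypothetical helper name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_chunks(query_vector: list[float], index: list[tuple[str, list[float]]], n: int = 2) -> list[str]:
    """Return the n chunks whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]


# Made-up vectors; a real index would hold embeddings produced by the embeddings model.
index = [
    ("Rent growth slowed in 2023.", [0.9, 0.1, 0.0]),
    ("New supply is concentrated in large metros.", [0.1, 0.9, 0.1]),
    ("Concessions include periods of free rent.", [0.7, 0.2, 0.6]),
]
query_vector = [1.0, 0.0, 0.1]

context_chunks = top_n_chunks(query_vector, index, n=2)
# The fetched chunks become the context part of the prompt sent to the model.
prompt = "\n\n".join(context_chunks) + "\n\nQuestion: How did rents develop in 2023?"
print(prompt)
```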

## Use case
#### Dataset
In this example, you will use a Fannie Mae publication, the October 2023 Multifamily Economic and Market Commentary, as the text corpus to perform Q&A on.

## Implementation
To follow the RAG approach, this notebook uses the LangChain framework, whose integrations with different services and tools allow patterns such as RAG to be built efficiently. We will be using the following tools:

- **LLM (Large Language Model)**: Anthropic Claude available through Amazon Bedrock

  This model will be used to understand the document chunks and provide an answer in a human-friendly manner.
- **Embeddings Model**: Amazon Titan Embeddings available through Amazon Bedrock

  This model will be used to generate a numerical representation of the textual documents.

- **Document Loader**: 
    - PDF Loader available through LangChain for PDFs

  Loaders read documents from a source. For the sake of this notebook we load the sample file from a local path; this could easily be replaced with a loader that reads documents from enterprise internal systems.

- **Vector Store**: FAISS available through LangChain

  In this notebook we use this in-memory vector store to hold both the embeddings and the documents. In an enterprise context it could be replaced with a persistent store such as Amazon OpenSearch Service, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.

- **Index**: VectorIndex

  The index compares the input embedding with the document embeddings to find relevant documents.

- **Wrapper**: wraps the index, vector store, embeddings model and LLM to abstract away the logic from the user.

### Python 3.10

⚠️⚠️⚠️ For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠️⚠️⚠️

### Setup
To run this notebook you need to install two more dependencies, [PyPDF](https://pypi.org/project/pypdf/) and the [FAISS vector store](https://github.com/facebookresearch/faiss).

Then begin with instantiating the LLM and the Embeddings model. Here we are using Anthropic Claude to demonstrate the use case.

Note: It is possible to choose other models available with Bedrock. You can replace the `model_id` as follows to change the model.

`llm = Bedrock(model_id="...")`

### Install Libraries

In [2]:
%pip install boto3

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install langchain
%pip install pypdf faiss-cpu

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install tiktoken

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install sqlalchemy

Note: you may need to restart the kernel to use updated packages.


### Import Libraries

In [9]:
import json
import os
import sys
import re
import pandas as pd

import boto3
import botocore

### Initialize Boto session

In [8]:
# module_path = ".."
# sys.path.append(os.path.abspath(module_path))

boto_session = boto3.Session()
aws_region = boto_session.region_name
print(aws_region)
br_client = boto_session.client("bedrock", region_name=aws_region)
br_runtime = boto_session.client("bedrock-runtime", region_name=aws_region)


us-east-1


### Data Preparation
Let's first download the Fannie Mae document (a PDF) to build our document store. Although the notebook's metadata labels it as the selling guide, the file fetched below is the October 2023 Multifamily Economic and Market Commentary.

In [14]:
%%sh

wget -O fannie-mf-commentary-oct-2023.pdf https://www.fanniemae.com/media/49331/display
# https://singlefamily.fanniemae.com/media/38401/display

--2024-03-14 12:16:20--  https://www.fanniemae.com/media/49331/display
Resolving www.fanniemae.com (www.fanniemae.com)... 104.18.27.25, 104.18.26.25, 2606:4700::6812:1b19, ...
Connecting to www.fanniemae.com (www.fanniemae.com)|104.18.27.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 249442 (244K) [application/pdf]
Saving to: ‘fannie-mf-commentary-oct-2023.pdf’

     0K .......... .......... .......... .......... .......... 20% 23.1M 0s
    50K .......... .......... .......... .......... .......... 41%  153M 0s
   100K .......... .......... .......... .......... .......... 61% 45.8M 0s
   150K .......... .......... .......... .......... .......... 82% 69.3M 0s
   200K .......... .......... .......... .......... ...       100% 35.1M=0.005s

2024-03-14 12:16:20 (43.9 MB/s) - ‘fannie-mf-commentary-oct-2023.pdf’ saved [249442/249442]



After downloading, we can load the document with the help of the [PyPDFLoader available under LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html) and split it into smaller chunks.

Note: The retrieved chunks should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt. The embeddings model also limits its input length to 4000 tokens, which roughly translates to ~16000 characters. For this use case we create chunks of roughly 5000 characters with an overlap of 200 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
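
To make the chunk size and overlap concrete, here is a minimal sketch of a plain sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks; `chunk_with_overlap` is a hypothetical helper, and the ratio of 4 characters per token is only the rule of thumb implied by the note above.

```python
CHARS_PER_TOKEN = 4                               # rule of thumb implied by the note above
MAX_EMBEDDING_CHARS = 4000 * CHARS_PER_TOKEN      # roughly 16000 characters


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 12000-character text, used only to show the resulting chunk sizes.
text = "".join(str(i % 10) for i in range(12000))
chunks = chunk_with_overlap(text, chunk_size=5000, chunk_overlap=200)

print([len(chunk) for chunk in chunks])
print("every chunk fits the embedding limit:", all(len(c) <= MAX_EMBEDDING_CHARS for c in chunks))
```

Each chunk repeats the last 200 characters of its predecessor, so a sentence cut at a chunk boundary still appears whole in at least one chunk.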

In [21]:
filenames = ['fannie-mf-commentary-oct-2023.pdf']
metadata = [dict(name='FNMA Selling Guide 2024', source='https://singlefamily.fanniemae.com/media/38401/display')]

print(metadata)

[{'name': 'FNMA Selling Guide 2024', 'source': 'https://singlefamily.fanniemae.com/media/38401/display'}]


In [25]:
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader('./' + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
        
    print(f'Total Page: {len(document)} \n\nFirst Page:\n\n {document[0]}\n')
    documents += document

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of up to 5000 characters, with 200 characters shared between neighbouring chunks.
    chunk_size = 5000,
    chunk_overlap  = 200,
)

docs = text_splitter.split_documents(documents)

Total Page: 5 

First Page:

 page_content='1Multifamily Economic and Market Commentary\nOCTOBER 2023\nRising Number of Multifamily Properties Offering Concessions\nMultifamily market fundamentals have softened in 2023 compared to the prior year, the result of mixed economic \ntrends including slowing -but-still -positive job growth, elevated single -family housing prices keeping many renters in \nplace, and continued favorable demographics. Rent growth was exceptional over the past two years, and, of course, \nunsustainable, thus 2023 has seen a substantial slowing of rent growth rates. There remains a robust pipeline of new \napartment rental projects that are underway in the nation’s largest metros, and with recessionary concerns there has \nbeen a rise in the number of properties across the country offering concessions.\nIn the multifamily apartment rental market, concessions are incentives with an economic value for renters, such as \nperiods of free rent, free utilities, or other

Before proceeding, let's look at some statistics about the document preprocessing we just performed:

In [26]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
print(f'Average length among {len(documents)} documents loaded is {avg_doc_length(documents)} characters.')
print(f'After the split we have {len(docs)} documents as opposed to the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_doc_length(docs)} characters.')

Average length among 5 documents loaded is 2007 characters.
After the split we have 5 documents as opposed to the original 5.
Average length among 5 documents (after split) is 2007 characters.


### Setup langchain

We create instances of the Bedrock classes for the LLM and the embeddings model. At the time of writing, Bedrock supported one embeddings model, so specifying a model ID was optional; the cell below passes it explicitly. To compare token consumption across the different RAG approaches shown in the workshop labs, LangChain callbacks can be used to count tokens.

In [28]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# - create the Anthropic Model
llm = Bedrock(model_id="anthropic.claude-v2", 
              client=br_runtime, 
              model_kwargs={
                  'max_tokens_to_sample': 200
              })

# - create the Titan Embeddings Model
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",
                                       client=br_runtime)

Now we can see what a sample embedding looks like for one of those chunks.

In [29]:
sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [ 0.113061   -0.12235427 -0.10330784 ... -0.37938827  0.21118337
 -0.20052083]
Size of the embedding:  (1536,)


Following the very same approach embeddings can be generated for the entire corpus and stored in a vector store.

This can be done easily using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store. Using the index wrapper we can abstract away most of the heavy lifting, such as creating the prompt, getting the embedding of the query, fetching the relevant documents and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.

**⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️**

In [30]:
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

### Question Answering

Now that we have our vector store in place, we can start asking questions.

In [42]:
query_1 = "How many new units expected to complete in 2023?"

The first step is to create an embedding of the query so that it can be compared with the documents.

In [43]:
# Embed the question defined above (query_1) with the same model used for the documents.
query_embedding = bedrock_embeddings.embed_query(query_1)
np.array(query_embedding)

array([-0.8125    , -0.48242188, -0.17285156, ...,  0.18164062,
        0.71875   , -0.6484375 ])

Now that our query is represented as an embedding, we can run a similarity search against our vector store to fetch the most relevant document chunks.

In [34]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: 1Multifamily Economic and Market Commentary
OCTOBER 2023
Rising Number of Multifamily Properties Offering Concessions
Multifamily market fundamentals have softened in 2023 compared to the prior year, the result of
mixed economic
trends including slowing -but-still -positive job growth, elevated single -family housing prices
keeping many renters in
place, and continued favorable demographics. Rent growth was exceptional over the past two years,
and, of course,
unsustainable, thus 2023 has seen a substantial slowing of rent growth rates. There remains a robust
pipeline of new
apartment rental projects that are underway in the nation’s largest metros, and with recessionary
concerns there has
been a rise in the number of properties across the country offering concessions.
In the multifamily apartment rental market, concessions are incentives with an economic value for
renters, such as
periods of free rent, free ut

Now that we have the relevant documents, it's time to use the LLM to generate an answer based on them.

We take our initial prompt together with the relevant documents retrieved by the similarity search and combine them into a single prompt that we send to the model. At this point the model should give us a well-grounded answer on how many new units are expected to be completed in 2023, as described in the commentary.

LangChain provides an abstraction of how this can be done easily.

### Quick way
You can use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input.
This wrapper performs the following steps behind the scenes:
- Take the question as input
- Create the question embedding
- Fetch the relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human-readable manner
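
The steps above can be sketched as one small orchestration function. This is a hedged illustration of the flow, not LangChain's actual implementation: `quick_query`, `stuff_prompt` and the three `fake_*` stand-ins are hypothetical names, and the stand-ins replace the embeddings model, the vector store and the LLM so that the sketch runs without any service call.

```python
from typing import Callable


def stuff_prompt(question: str, chunks: list[str]) -> str:
    """'Stuff' all retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(chunks)
    return f"Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}\nAnswer:"


def quick_query(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], list[str]],
    llm: Callable[[str], str],
) -> str:
    question_vector = embed(question)        # create the question embedding
    chunks = search(question_vector)         # fetch the relevant documents
    prompt = stuff_prompt(question, chunks)  # stuff documents and question into a prompt
    return llm(prompt)                       # invoke the model with the prompt


# Hypothetical stand-ins for the embeddings model, the vector store and the LLM.
seen_prompts: list[str] = []


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


def fake_search(vector: list[float]) -> list[str]:
    return ["chunk A", "chunk B"]


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)  # record the prompt so it can be inspected
    return "stub answer"


question = "Which metros are rapidly growing?"
answer = quick_query(question, fake_embed, fake_search, fake_llm)
print(answer)
print(seen_prompts[0])
```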

In [44]:
answer_1 = wrapper_store_faiss.query(question=query_1, llm=llm)
print(answer_1)

 Based on the information provided in the context, it is expected that about 500,000 new apartment units will be completed in 2023. Specifically, the text states:

"According to Dodge, as of the middle of 2023, nearly 730,000 units were underway with completion dates expected in 2023, though only 164,000 were completed year to date through April. That’s compared to full-year totals of 479,000 units in 2022 and 395,000 units in 2021. With supply chains operating more efficiently in 2023 than in 2022, substantially more units may be completed this year than last, which we believe might lead to about 500,000 units delivering this year."

So the estimate is that about 500,000 new apartment units will be completed in 2023.


Let's ask a different question:

In [45]:
query_2 = "Does the economy favors Multifamily markets?"

In [46]:
answer_2 = wrapper_store_faiss.query(question=query_2, llm=llm)
print(answer_2)

 Based on the context provided, it does not seem like the economy particularly favors multifamily markets right now. A few key points:

- The commentary notes that multifamily market fundamentals have softened in 2023 compared to 2022. Rent growth has slowed substantially.

- There is a rise in the number of properties offering concessions across the country. The percentage of units offering concessions is up significantly compared to a year ago. 

- There is a robust pipeline of new apartment rental projects underway in major metros. This increased supply could add more competition and pressure. 

- The outlook mentions concerns about an economic slowdown and recession in 2024, which could add further stress to the multifamily sector.

So in summary, the context indicates challenging conditions for multifamily markets currently and in the near-term outlook, rather than the economy strongly favoring the multifamily sector. The analysis seems cautious about multifamily prospects in ligh

In [38]:
query_3 = "Which metros are rapidly growing?"

In [39]:
answer_3 = wrapper_store_faiss.query(question=query_3, llm=llm)
print(answer_3)

 Based on the information provided, the passage states that new apartment supply has been concentrated in the nation's largest and rapidly growing metropolitan areas. It specifically names New York City, Dallas, Houston, and Atlanta as metros that are expecting a large number of new apartment units to be completed in the current year. So these major metros - New York City, Dallas, Houston, and Atlanta - are described as being rapidly growing.


### Customisable option
In the above scenario you explored the quick and easy way to get a context-aware answer to your question. Now let's look at a more customizable option with the help of [RetrievalQA](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html), where the `chain_type` parameter controls how the fetched documents are added to the prompt. To control how many relevant documents are retrieved, change the `k` parameter in the cell below and compare the outputs. In many scenarios you might also want to know which source documents the LLM used to generate the answer; setting `return_source_documents` returns the documents that were added to the context of the LLM prompt. `RetrievalQA` also allows you to provide a custom [prompt template](https://python.langchain.com/en/latest/modules/prompts/prompt_templates/getting_started.html), which can be specific to the model.

Note: In this example we are using Anthropic Claude as the LLM under Amazon Bedrock. This particular model performs best if the input is provided under `Human:` and the model is asked to generate its output after `Assistant:`. The cell below shows an example of how to control the prompt so that the LLM stays grounded and doesn't answer outside the context.
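
To show what these knobs do, here is a plain-Python sketch of the idea. It is an illustration only and not how `RetrievalQA` is implemented: `retrieval_qa_sketch`, the shortened template and the stub LLM are all hypothetical. It mirrors the `k` and `return_source_documents` settings and returns a dictionary with `result` and `source_documents` keys, the same keys read from the chain output in the cells below.

```python
from typing import Callable

# Shortened, hypothetical template in the Human:/Assistant: layout described above.
TEMPLATE = "\n\nHuman: Answer using only the context below.\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


def retrieval_qa_sketch(
    question: str,
    ranked_chunks: list[str],
    llm: Callable[[str], str],
    k: int = 3,
    return_source_documents: bool = False,
) -> dict:
    """Keep the k best-ranked chunks, fill the template, and call the model."""
    source_documents = ranked_chunks[:k]
    prompt = TEMPLATE.format(context="\n\n".join(source_documents), question=question)
    output = {"query": question, "result": llm(prompt)}
    if return_source_documents:
        output["source_documents"] = source_documents
    return output


def stub_llm(prompt: str) -> str:
    """Stand-in for the LLM: reports how many chunks it was shown."""
    return f"saw {prompt.count('chunk')} chunks"


ranked_chunks = ["chunk 1", "chunk 2", "chunk 3", "chunk 4"]  # assumed to be sorted by relevance
result = retrieval_qa_sketch(
    "Which metros are rapidly growing?",
    ranked_chunks,
    stub_llm,
    k=2,
    return_source_documents=True,
)
print(result["result"])
print(result["source_documents"])
```

Raising `k` gives the model more context at the cost of a longer prompt, which is the trade-off to keep in mind when experimenting with the cell below.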

In [40]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

Assistant:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

In [47]:
query = query_1
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 Based on the information provided in the multifamily economic commentary, it is expected that around 500,000 new apartment units will be completed in 2023. 

The key evidence supporting this is:

- As of mid-2023, nearly 730,000 units were underway with expected completion in 2023, though only 164,000 were completed through April.

- In 2022, 479,000 units were completed. In 2021, 395,000 units were completed. 

- Supply chains are operating more efficiently in 2023 than 2022, so substantially more units may be completed this year than last.

- The commentary states "we believe that might lead to about 500,000 units delivering this year."

So in summary, around 500,000 new apartment units are expected to be completed in 2023.

[Document(page_content='1Multifamily Economic and Market Commentary\nOCTOBER 2023\nRising Number of Multifamily Properties Offering Concessions\nMultifamily market fundamentals have softened in 2023 compared to the prior year, the result of mixed economic \ntren

In [48]:
query = query_2
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 Based on the provided context, it does not seem like the economy currently favors multifamily
markets. The key points are:

- After strong rent growth in 2021-2022, multifamily rent growth is expected to ease in 2023. Some
markets have already seen rent contractions.

- With an expected mild recession in early 2024, more concessions and stagnant rent growth are
expected as supply remains robust.

- The number of units offering concessions has notably increased compared to 2022, though the value
of concessions has slightly declined.

- A record number of new apartment units are expected to be completed in 2023, adding to supply.

So in summary, the slowing economy along with robust new supply appears to be putting pressure on
the multifamily market, with expectations of more concessions, lower rents, and stagnant rent growth
overall. The economy does not seem to currently favor multifamily markets.

[Document(page_content='Multifamily Economic and Market Commentary\n5Tim Komosa\nEconom

In [49]:
query = query_3
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 Based on the context provided, the passage states that new apartment supply has been concentrated in the nation's largest and rapidly growing metropolitan areas. It specifically mentions that New York City, Dallas, Houston, and Atlanta are expecting at least 50,000 new units to be completed this year. So these major metros - New York City, Dallas, Houston, and Atlanta - are described as rapidly growing.

[Document(page_content='2Multifamily Economic and Market Commentary\nConcessions Declining in Most Major Markets \nSource: RealPageMultifamily monthly concession rate by market – select metrosAs has been the case for several years, new apartment supply has been concentrated in the nation’s largest and rapidly \ngrowing metropolitan areas. New York City continues to be the most active metro in the country, with nearly 150,000 units \neither recently completed or underway, with Dallas next at 74,000 units. Houston and Atlanta are also each expecting at \nleast 50,000 units will be compl

## Conclusion
Congratulations on completing this module on retrieval-augmented generation! This is an important technique that combines the power of large language models with the precision of retrieval methods. By augmenting generation with relevant retrieved passages, the responses we received became more coherent, consistent and grounded. You should feel proud of learning this innovative approach. I'm sure the knowledge you've gained will be very useful for building creative and engaging language generation systems. Well done!

In the above implementation of RAG-based question answering we have explored the following concepts and how to implement them using Amazon Bedrock and its LangChain integration.

- Loading documents of different kinds and generating embeddings to create a vector store
- Retrieving documents relevant to the question
- Preparing a prompt which goes as input to the LLM
- Presenting an answer in a human-friendly manner

### Take-aways
- Experiment with different Vector Stores
- Leverage various models available under Amazon Bedrock to see alternate outputs
- Explore options such as persistent storage of embeddings and document chunks
- Integration with enterprise data stores

# Thank You