# Retrieval Augmented Generation (RAG) Application with Llama3-8B on SageMaker JumpStart using LangChain

RAG application use cases with Llama3-8B on SageMaker JumpStart

In this notebook, we demonstrate how to combine [Llama3-8B](https://huggingface.co/meta-llama/Llama-2-13b) text generation with the [BGE Large En v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) embedding model to build a Retrieval Augmented Generation (RAG) question-answering system on a SageMaker notebook. The notebook itself runs on an `ml.t3.medium` instance and deploys the models through [SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart/). SageMaker exposes each deployed model as an API endpoint, which we then call from [LangChain](https://www.langchain.com/) to build, experiment with, and compare advanced RAG techniques. We also show how the [FAISS](https://github.com/facebookresearch/faiss) vector store can be used to store and retrieve embeddings as part of your RAG workflow.

## Prerequisites

---
This Jupyter Notebook can be run on a t3.medium instance (ml.t3.medium). However, to deploy `Llama3-8B Text Generation` and `BGE Large En v1.5` models, you may need to request a quota increase. 

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.g5.12xlarge` for endpoint usage
   - `ml.g5.2xlarge` for endpoint usage
4. If needed, request a quota increase for these resources.

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

### Changing instance type
---
Models are supported on the following instance types:

 - Llama3-8B Text Generation: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, and `ml.p4d.24xlarge`
 - BGE Large En v1.5: `ml.g5.2xlarge`, `ml.c6i.xlarge`,`ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.p3.2xlarge`, and `ml.g4dn.2xlarge`

By default, the JumpStartModel class selects a default instance type available in your region. If you would like to use a different instance type, you can do so by specifying instance type in the JumpStartModel class.

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`
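
Before overriding the default, it can help to check the requested instance type against the supported lists above. The following is a minimal sketch; `SUPPORTED_INSTANCE_TYPES` and `validate_instance_type` are hypothetical helpers written for this notebook, not part of the SageMaker SDK, and the short model keys are made up for illustration.

```python
# Supported instance types, copied from the lists in this section.
SUPPORTED_INSTANCE_TYPES = {
    "llama3-8b": [
        "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
        "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "bge-large-en-v1.5": [
        "ml.g5.2xlarge", "ml.c6i.xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge",
        "ml.p3.2xlarge", "ml.g4dn.2xlarge",
    ],
}


def validate_instance_type(model_key: str, instance_type: str) -> str:
    """Return instance_type if it is listed for model_key, otherwise raise ValueError."""
    supported = SUPPORTED_INSTANCE_TYPES[model_key]
    if instance_type not in supported:
        raise ValueError(
            f"{instance_type} is not listed for {model_key}; choose one of {supported}"
        )
    return instance_type


instance_type = validate_instance_type("llama3-8b", "ml.g5.12xlarge")
print(instance_type)
```

The validated value can then be passed as the `instance_type` argument shown above.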

### Local setup (Optional):
---

For a local server, follow these steps to run this Jupyter notebook:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

By following these steps, you can set up a local Jupyter Notebook environment capable of deploying machine learning models on Amazon SageMaker using the appropriate IAM role for granting the necessary permissions.

## Contents
---

1. [Requirements](#Requirements)
1. [Model Deployment](#00.-Model-Deployment)
1. [Setup LangChain](#01.-Setup-LangChain)
1. [Data Preparation](#Data-Preparation)
1. [Question Answering](#Question-Answering)
1. [Regular Retriever Chain](#Regular-Retriever-Chain)
1. [Parent Document Retriever Chain](#Parent-Document-Retriever-Chain)
1. [Contextual Compression Chain](#Contextual-Compression-Chain)
1. [Conclusion](#Conclusion)
1. [Clean Up Resources](#Clean-Up-Resources)

## Requirements
---

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose ml.t3.medium.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook you would need to install the following dependencies:

In [110]:
%%writefile requirements.txt
langchain==0.1.14
pypdf==4.1.0
faiss-cpu==1.8.0
boto3==1.34.58
sqlalchemy==2.0.29

Overwriting requirements.txt


In [111]:
!pip install -U -r requirements.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-scheduler 2.5.1 requires sqlalchemy~=1.0, but you have sqlalchemy 2.0.29 which is incompatible.

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b>

Before proceeding, please verify that you have the correct version of the SQLAlchemy library installed. This notebook requires SQLAlchemy >= 2.0.0.

To check your installed SQLAlchemy version, you can run the following code:

```python
import sqlalchemy
print(sqlalchemy.__version__)
```

If the version displayed is less than 2.0.0, and you have already installed the correct version using `pip`, you may need to "<span style="color:green;">restart</span>" or "<span style="color:green;">shutdown</span>" the Jupyter Notebook kernel to load the updated library.

To restart the kernel, go to the "Kernel" menu and select "Restart Kernel". If that doesn't work, try shutting down the notebook completely and relaunching it.

Restarting or shutting down the kernel will resolve any dependency issues and ensure that the correct SQLAlchemy version is loaded.

If you haven't installed SQLAlchemy >= 2.0.0 yet, you can do so by running the following command in your terminal or command prompt:

```
pip install "sqlalchemy>=2.0.29"
```

Once the installation is complete, restart or shutdown the Jupyter Notebook kernel as described above.

</div>
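
The version check described in the note can also be done programmatically. This is a minimal sketch under simple assumptions: `version_tuple` and `meets_minimum` are hypothetical helpers, only the first three numeric components are compared, and pre-release suffixes such as `rc1` are ignored.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '2.0.29' into (2, 0, 29), padding with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, minimum: str = "2.0.0") -> bool:
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    import sqlalchemy

    # __version__ reflects the library loaded in the running kernel
    loaded = sqlalchemy.__version__
    if meets_minimum(loaded):
        print(f"SQLAlchemy {loaded} is new enough")
    else:
        print(f"SQLAlchemy {loaded} is too old; upgrade and restart the kernel")
except ImportError:
    print("SQLAlchemy is not installed")
```

Checking `sqlalchemy.__version__` rather than the package metadata on disk is deliberate here: it shows what the running kernel has actually loaded, which is what a restart changes.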

In [112]:
import sqlalchemy
print(sqlalchemy.__version__)

2.0.29


In [113]:
import langchain
print(langchain.__version__)

0.1.14


In [114]:
try:
    import sagemaker
except ImportError:
    !pip install sagemaker --quiet

## 00. Model Deployment
---

Deploy `Llama 3 8B Instruct` LLM model on Amazon SageMaker JumpStart:

In [115]:
# Import the JumpStartModel class from the SageMaker JumpStart library
from sagemaker.jumpstart.model import JumpStartModel

In [116]:
# Specify the JumpStart model ID for the Meta Llama 3 8B Instruct LLM
model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
# predictor = model.deploy(accept_eula=accept_eula)

Deploy `BGE Large En` embedding model on Amazon SageMaker JumpStart:

In [11]:
# Specify the model ID for the HuggingFace BGE Large EN Embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
# embedding_predictor = text_embedding_model.deploy()

Using model 'huggingface-sentencesimilarity-bge-large-en-v1-5' with wildcard version identifier '*'. You can pin to version '1.0.1' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


--------!

## 01. Setup LangChain
---

In [117]:
import json
import sagemaker

from langchain_core.prompts import PromptTemplate
from langchain_community.llms import SagemakerEndpoint
from langchain_community.embeddings import SagemakerEndpointEmbeddings
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler
from langchain_community.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

Get endpoint names from predictors.

In [118]:
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # public attribute holding the session's AWS region
llm_endpoint_name = "meta-textgeneration-llama-3-8b-instruct-2024-04-22-17-14-19-811"
embedding_endpoint_name = "hf-sentencesimilarity-bge-large-en-v1-5-2024-04-17-18-03-49-922"

Transform input and output data to process API calls for `Llama 3 8B Instruct` on Amazon SageMaker

In [58]:
from typing import Dict

class Llama38BContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Default generation parameters; values passed via model_kwargs override them
        parameters = {
            "max_new_tokens": 1000,
            "top_p": 0.9,
            "temperature": 0.6,
            "stop": ["<|eot_id|>"],
            **model_kwargs,
        }
        payload = {"inputs": prompt, "parameters": parameters}
        input_str = json.dumps(payload)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        #print(response_json)
        content = response_json["generated_text"].strip()
        return content
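
To see what the content handler sends and receives without calling an endpoint, the round trip can be simulated locally. This is a sketch only: `build_llama_payload` and `parse_llama_response` are hypothetical stand-ins that mirror the handler's two methods, and the response shape (a JSON object with a `generated_text` key) is assumed from the handler above.

```python
import io
import json


def build_llama_payload(prompt: str) -> bytes:
    """Mirror of transform_input: wrap the prompt and generation parameters as JSON bytes."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "top_p": 0.9,
            "temperature": 0.6,
            "stop": ["<|eot_id|>"],
        },
    }
    return json.dumps(payload).encode("utf-8")


def parse_llama_response(body) -> str:
    """Mirror of transform_output: read the stream, decode the JSON, return the text."""
    response_json = json.loads(body.read().decode("utf-8"))
    return response_json["generated_text"].strip()


request_bytes = build_llama_payload("What is RAG?")
print(json.loads(request_bytes)["inputs"])

# Simulated response body; a real endpoint returns a stream-like object with .read()
fake_body = io.BytesIO(
    json.dumps({"generated_text": "  Retrieval Augmented Generation.  "}).encode("utf-8")
)
print(parse_llama_response(fake_body))
```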

Instantiate the LLM with SageMaker and LangChain

In [59]:
# Instantiate the content handler for Llama3-8B
llama_content_handler = Llama38BContentHandler()

# Setup for using the Llama3-8B model with SageMaker Endpoint
llm = SagemakerEndpoint(
     endpoint_name=llm_endpoint_name,
     region_name=region, 
     model_kwargs={"max_new_tokens": 1024, "top_p": 0.9, "temperature": 0.7},
     content_handler=llama_content_handler
 )

Transform input and output data to process API calls for `BGE Large En` on Amazon SageMaker

In [61]:
from typing import List

class BGEContentHandlerV15(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, text_inputs: List[str], model_kwargs: dict) -> bytes:
        """
        Transforms the input into bytes that can be consumed by SageMaker endpoint.
        Args:
            text_inputs (list[str]): A list of input text strings to be processed.
            model_kwargs (Dict): Additional keyword arguments to be passed to the endpoint.
               Possible keys and their descriptions:
               - mode (str): Inference method. Valid modes are 'embedding', 'nn_corpus', and 'nn_train_data'.
               - corpus (str): Corpus for Nearest Neighbor. Required when mode is 'nn_corpus'.
               - top_k (int): Top K for Nearest Neighbor. Required when mode is 'nn_corpus'.
               - queries (list[str]): Queries for Nearest Neighbor. Required when mode is 'nn_corpus' or 'nn_train_data'.
        Returns:
            The transformed bytes input.
        """
        input_str = json.dumps(
            {
                "text_inputs": text_inputs,
                **model_kwargs
            }
        )
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        """
        Transforms the bytes output from the endpoint into a list of embeddings.
        Args:
            output: The bytes output from SageMaker endpoint.
        Returns:
            The transformed output - list of embeddings
        Note:
            The length of the outer list is the number of input strings.
            The length of the inner lists is the embedding dimension.
        """
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["embedding"]
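
The embedding handler follows the same pattern, with `model_kwargs` merged into the request body. A local sketch of that round trip, using hypothetical helper names and made-up 3-dimensional vectors in place of real embeddings; the `text_inputs` and `embedding` keys are taken from the handler above.

```python
import io
import json
from typing import List


def build_embedding_payload(text_inputs: List[str], model_kwargs: dict) -> bytes:
    """Mirror of transform_input: merge the texts and extra kwargs into one JSON body."""
    return json.dumps({"text_inputs": text_inputs, **model_kwargs}).encode("utf-8")


def parse_embedding_response(body) -> List[List[float]]:
    """Mirror of transform_output: one embedding (inner list) per input string."""
    return json.loads(body.read().decode("utf-8"))["embedding"]


request_bytes = build_embedding_payload(["first chunk", "second chunk"], {"mode": "embedding"})
print(json.loads(request_bytes))

# Simulated response with toy vectors; real embeddings are much longer
fake_body = io.BytesIO(
    json.dumps({"embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}).encode("utf-8")
)
vectors = parse_embedding_response(fake_body)
print(len(vectors), len(vectors[0]))
```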

Instantiate the embedding model with SageMaker and LangChain

In [62]:
bge_content_handler = BGEContentHandlerV15()
sagemaker_embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_endpoint_name,
    region_name=region,
    model_kwargs={"mode": "embedding"},
    content_handler=bge_content_handler,
)

## Data Preparation
---

Let's first download some of the files to build our document store.

In this example, you will use several years of Amazon's Letter to Shareholders as a text corpus to perform Q&A on.

In [119]:
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2019-Shareholder-Letter.pdf'
]

metadata = [
    dict(year=2022, source=filenames[0]),
    dict(year=2021, source=filenames[1]),
    dict(year=2020, source=filenames[2]),
    dict(year=2019, source=filenames[3])]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

As part of Amazon's culture, the CEO always includes a copy of the 1997 Letter to Shareholders with every new release. This will cause repetition, take longer to generate embeddings, and may skew your results. In the next section you will take the downloaded data, trim the 1997 letter (last 3 pages) and overwrite them as processed files.

In [120]:
from pypdf import PdfReader, PdfWriter
import glob

local_pdfs = glob.glob(data_root + '*.pdf')

for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    for pagenum in range(len(pdf_reader.pages)-3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    with open(local_pdf, 'wb') as new_file:
        new_file.seek(0)
        pdf_writer.write(new_file)
        new_file.truncate()

After downloading, we can load the documents with [PyPDFLoader, available under LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html), and split them into smaller chunks.

Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt. The embeddings model also limits its input to 512 tokens, which roughly translates to ~2000 characters. For this use case we create chunks of roughly 1000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
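
To make the roles of chunk size and overlap concrete, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines, which is why real chunks are often shorter than the limit); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample_text)
print([len(chunk) for chunk in chunks])
# The tail of one chunk is repeated at the head of the next one
print(chunks[0][-100:] == chunks[1][:100])
```

With a 1000-character limit, each chunk stays well below the ~2000 characters that the 512-token embedding limit roughly corresponds to.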

In [138]:
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# In our testing, character-based splitting worked better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of roughly 1000 characters with a 100-character overlap
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
print(docs[100])

page_content='Summarizing:\nShareholders $21B\nEmployees $91B\n3P Sellers $25B\nCustomers $164B\nTotal $301B\nIf each group had an income statement representing their interactions with Amazon, the numbers above\nwould be the “bottom lines” from those income statements. These numbers are part of the reason why people\nwork for us, why sellers sell through us, and why customers buy from us. We create value for them. And\nthis value creation is not a zero-sum game. It is not just moving money from one pocket to another. Draw\nthe box big around all of society, and you’ll find that invention is the root of all real value creation. And value\ncreated is best thought of as a metric for innovation.\nOf course, our relationship with these constituencies and the value we create isn’t exclusively dollars and\ncents. Money doesn’t tell the whole story. Our relationship with shareholders, for example, is relatively simple.\nThey invest and hold shares for a duration of their choosing. We provide d

Before proceeding, let's look at some statistics about the document preprocessing we just performed:

In [122]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)

print(f'Average length among {len(documents)} documents loaded is {avg_doc_length(documents)} characters.')
print(f'After the split we have {len(docs)} documents as opposed to the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_doc_length(docs)} characters.')

Average length among 25 documents loaded is 4131 characters.
After the split we have 151 documents as opposed to the original 25.
Average length among 151 documents (after split) is 699 characters.


We loaded four PDF files (25 pages in total), which have been split into 151 smaller chunks averaging roughly 700 characters each.

Now we can see what a sample embedding looks like for one of those chunks.

In [123]:
sample_embedding = np.array(sagemaker_embeddings.embed_query(docs[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [ 0.03436598  0.00078963 -0.0479355  ... -0.05567604 -0.01725159
 -0.00995983]
Size of the embedding:  (1024,)


Next, we store the chunk embeddings in a vector store. This can be done using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store. An index wrapper can abstract away most of the heavy lifting, such as creating the prompt, getting embeddings of the query, retrieving the relevant documents, and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.
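
Conceptually, a vector store search ranks the stored embeddings by their distance to the query embedding. The brute-force sketch below assumes Euclidean (L2) distance and toy 3-dimensional vectors in place of the 1024-dimensional BGE embeddings. `l2_distance`, `top_k`, and the sample data are hypothetical; FAISS relies on optimized index structures rather than a loop like this, and the metric a real index uses depends on how it is configured.

```python
import math


def l2_distance(a, b) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed, k: int = 3) -> list:
    """Return the texts of the k entries whose vectors are closest to query_vector.

    indexed is a list of (text, vector) pairs.
    """
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "vector store": each chunk is paired with a made-up embedding
indexed_chunks = [
    ("chunk about AWS", [1.0, 0.0, 0.0]),
    ("chunk about retail", [0.0, 1.0, 0.0]),
    ("chunk about devices", [0.0, 0.0, 1.0]),
]

print(top_k([0.9, 0.1, 0.0], indexed_chunks, k=2))
```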

In [124]:
from langchain_community.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    sagemaker_embeddings,
)

## Question Answering
---

We use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input. This wrapper performs the following steps behind the scenes:

- Take the question as input
- Create the question embedding
- Fetch the relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human-readable manner
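
The steps above can be sketched as a single function. This is an illustration of the flow, not LangChain's implementation: `answer_question` and the three stand-in callables (`fake_embed`, `fake_search`, `fake_llm`) are hypothetical and would be replaced by the embeddings endpoint, the vector store, and the LLM endpoint.

```python
def answer_question(question, embed, search, llm, k=3):
    """Run the retrieve-then-generate flow with pluggable components."""
    query_vector = embed(question)         # create the question embedding
    documents = search(query_vector, k)    # fetch the k most relevant documents
    context = "\n\n".join(documents)       # stuff the documents into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)                     # invoke the model with the prompt


# Stand-ins so the sketch runs without any deployed endpoint
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, k):
    corpus = ["AWS launched EC2 in 2006.", "AWS is an $85B run rate business."]
    return corpus[:k]


def fake_llm(prompt):
    return f"(model received a prompt of {len(prompt)} characters)"


print(answer_question("How did AWS evolve?", fake_embed, fake_search, fake_llm, k=2))
```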

*Note: In this example we are using `Llama 3 8B Instruct` as the LLM on Amazon SageMaker. This model performs best when the input follows its chat format: `<|begin_of_text|><|start_header_id|>system<|end_header_id|>`, then `{{system_message}}`, then `<|eot_id|><|start_header_id|>user<|end_header_id|>`, then `{{user_message}}`, with the model asked to generate its output after `<|eot_id|><|start_header_id|>assistant<|end_header_id|>`. The prompt template in the Regular Retriever Chain section below shows how to control the prompt so that the LLM stays grounded and doesn't answer outside the context.*
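
As a small illustration of that chat format, the special tokens can be assembled with a helper function. `build_llama3_prompt` is a hypothetical helper, and the blank line after each header is an assumption based on the commonly published Llama 3 Instruct template.

```python
def build_llama3_prompt(system_message: str, user_message: str) -> str:
    """Assemble a single-turn Llama 3 Instruct prompt from a system and a user message."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    "Answer only from the provided context.",
    "How did AWS evolve?",
)
print(prompt)
```

The prompt ends with the assistant header, so the model's generation becomes the assistant turn and stops at the next `<|eot_id|>` token.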

## Regular Retriever Chain
---
The wrapper described above is the quick and easy way to get a context-aware answer to your question. Now let's look at a more customizable option, [RetrievalQA](https://docs.smith.langchain.com/cookbook/hub-examples/retrieval-qa-chain), where you control how the fetched documents are added to the prompt using the `chain_type` parameter. To control how many relevant documents are retrieved, change the `k` parameter in the cell below and compare the outputs. In many scenarios you will also want to know which source documents the LLM used to generate the answer. Setting `return_source_documents` returns the documents that were added to the context of the LLM prompt. `RetrievalQA` also lets you provide a custom [prompt template](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/) that can be specific to the model.

In [125]:
from langchain.chains import RetrievalQA

prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

This is a conversation between an AI assistant and a Human.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
#### Context ####
{context}
#### End of Context ####

Question: {question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [126]:
query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, AWS evolved from being a cloud computing service that was initially questioned by some as a significant investment for Amazon, to becoming an $85 billion annual revenue run rate business with strong profitability, transforming how customers manage their technology infrastructure.

[Document(page_content='still required substantial capital investment. There were voicesinside and outside of the company questioning why Amazon (known mostly as an online retailer then) wouldbe investing so much in cloud computing. But, we knew we were inventing something special that couldcreate a lot of value for customers and Amazon in the future. We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of innovation. We made the long-term decision tocontinue investing in AWS. Fifteen years later, AWS is now an $85B annual revenue run rate business, withstrong profitability, that has transformed how customers from start-ups to multinational c

In [127]:
query = "Why is Amazon successful?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

Based on the provided context, Amazon's success can be attributed to its ability to adapt to changing market conditions and innovate in various areas, such as:

1. Expanding its product offerings beyond books to become a multi-category retailer.
2. Creating a vibrant third-party seller ecosystem, which accounts for 60% of its unit sales.
3. Developing a cloud-based technology infrastructure service, Amazon Web Services (AWS).
4. Launching innovative products like Kindle and Alexa, which disrupted traditional industries.

Additionally, Amazon's ability to operate in large, dynamic, global market segments with many capable and well-funded competitors has driven the company to constantly innovate and adapt to stay ahead of the competition.

[Document(page_content='Similarly high potential, Amazon’s Advertising business is uniquely effective for brands, which is part of why it\ncontinues to grow at a brisk clip . Akin to physical retailers’ advertising businesses selling shelf space, end-\

In [128]:
query = "What business challenges has Amazon experienced?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, Amazon has experienced the following business challenges:

1. Rising cost to serve in their Stores fulfillment network (i.e., the cost to get a product from Amazon to a customer).
2. Operating in large, dynamic, global market segments with many capable and well-funded competitors.
3. Unprecedented growth in the first half of the pandemic, which presented its own set of challenges.
4. Operating challenges in 2022, which was one of the harder macroeconomic years in recent memory.

[Document(page_content='Dear shareholders:\nAs I sit down to write my second annual shareholder letter as CEO, I find myself optimistic and energized\nby what lies ahead for Amazon. Despite 2022 being one of the harder macroeconomic years in recent memory,and with some of our own operating challenges to boot, we still found a way to grow demand (on top ofthe unprecedented growth we experienced in the first half of the pandemic). We innovated in our largestbusinesses to meaningfully imp

In [81]:
query = "How was Amazon impacted by COVID-19?"

result = qa({"query": query})

print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, Amazon was impacted by COVID-19 in the following ways:

* The company's teams were working around the clock to get necessary supplies delivered to customers who needed them.
* The demand for essential products was high, creating major challenges for suppliers and the delivery network.
* Amazon prioritized the stocking and delivery of essential household staples, medical supplies, and other critical products.
* Whole Foods Market stores remained open, providing fresh food and other vital goods for customers.
* Amazon partnered with organizations such as Feeding America, the American Red Cross, and Save the Children to support those affected by the crisis.
* The company donated 8,200 laptops to Seattle Public Schools to help students access virtual classes.
* Amazon's founder, Jeff Bezos, is focused on COVID-19 and how the company can help during the crisis.



## Parent Document Retriever Chain
---

In this scenario, let's look at a more advanced RAG option, the [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever). When working with document retrieval, you may encounter a trade-off between storing small chunks of a document for accurate embeddings and larger documents to preserve more context. The `ParentDocumentRetriever` strikes that balance by indexing small chunks for search while returning the larger chunks they came from.

First, a `parent_splitter` is used to divide the original documents into larger chunks called `parent documents`. These parent documents preserve a reasonable amount of context for the LLM.

Next, a `child_splitter` is applied to create smaller `child documents` from the parent documents. These child documents allow the embeddings to reflect their meaning more accurately.

The child documents are then indexed in a vectorstore using embeddings. This enables efficient retrieval of relevant child documents based on similarity.

To retrieve relevant information, the `ParentDocumentRetriever` first fetches the child documents from the vectorstore. It then looks up the parent IDs for those child documents and returns the corresponding larger parent documents.

The `ParentDocumentRetriever` uses an [InMemoryStore](https://api.python.langchain.com/en/v0.1.4/storage/langchain.storage.in_memory.InMemoryBaseStore.html) to store and manage the parent documents. By working with both parent and child documents, this approach aims to balance accurate embeddings with contextual information, providing more meaningful and relevant retrieval results.
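
The parent and child bookkeeping can be illustrated without any vector store. In this sketch, fixed-width slicing stands in for the text splitters and word overlap stands in for embedding similarity; all function names are hypothetical, and this is not how LangChain implements the retriever internally.

```python
def split(text: str, size: int) -> list:
    """Cut text into consecutive pieces of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents, parent_size=2000, child_size=400):
    """Create a parent store and a list of (child text, parent id) pairs."""
    parents = {}   # parent id -> parent text (plays the role of the docstore)
    children = []  # searchable child chunks (plays the role of the vector store)
    for document in documents:
        for parent in split(document, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent
            for child in split(parent, child_size):
                children.append((child, parent_id))
    return parents, children


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, k=2):
    """Search the children, then return the distinct parents they belong to."""
    ranked = sorted(children, key=lambda c: word_overlap(query, c[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


docs_text = ["alpha beta gamma delta " * 5, "covid impact on delivery network " * 5]
parents, children = build_index(docs_text, parent_size=60, child_size=20)
for parent in retrieve_parents("covid delivery", parents, children):
    print(len(parent), repr(parent))
```

The search runs over the small chunks, but the caller receives the larger parent text, which is the behavior the following cells demonstrate with the real retriever.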

In [82]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

Sometimes, the full documents can be too big to retrieve as is. In that case, we first split the raw documents into larger chunks, and then split those into smaller chunks. We then index the smaller chunks, but on retrieval we return the larger chunks (still not the full documents).

In [83]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore_faiss = FAISS.from_documents(
    child_splitter.split_documents(documents),
    sagemaker_embeddings,
)


In [84]:
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore_faiss,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [85]:
retriever.add_documents(documents, ids=None)

Let’s now call the vector store search functionality - we should see that it returns small chunks (since we’re storing the small chunks).

In [86]:
sub_docs = vectorstore_faiss.similarity_search("How was Amazon impacted by COVID-19?")

In [87]:
len(sub_docs[0].page_content)

370

In [88]:
print(sub_docs[0].page_content)

For now, my own time and thinking continues to be focused on COVID-19 and how Amazon can help while
we’re in the middle of it. I am extremely grateful to my fellow Amazonians for all the grit and ingenuity they areshowing as we move through this. You can count on all of us to look beyond the immediate crisis for insights andlessons and how to apply them going forward.


Let’s now retrieve from the overall retriever. This should return large documents - since it returns the documents where the smaller chunks are located.

In [89]:
retrieved_docs = retriever.get_relevant_documents("How was Amazon impacted by COVID-19?")

In [90]:
len(retrieved_docs[0].page_content)

811

In [91]:
print(retrieved_docs[0].page_content)

For now, my own time and thinking continues to be focused on COVID-19 and how Amazon can help while
we’re in the middle of it. I am extremely grateful to my fellow Amazonians for all the grit and ingenuity they areshowing as we move through this. You can count on all of us to look beyond the immediate crisis for insights andlessons and how to apply them going forward.
Reflect on this from Theodor Seuss Geisel:
“When something bad happens you have three choices. You can either let it define you, let it
destroy you, or you can let it strengthen you.”
I am very optimistic about which of these civilization is going to choose.Even in these circumstances, it remains Day 1. As always, I attach a copy of our original 1997 letter.
Sincerely,
Jeffrey P. Bezos
Founder and Chief Executive OfficerAmazon.com, Inc.


Now, let's initialize the chain using the `ParentDocumentRetriever`. We will pass the prompt in via the chain_type_kwargs argument.

In [92]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
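
To make the `chain_type="stuff"` setting more concrete, here is a rough sketch of what a "stuff"-style chain does conceptually: it places all retrieved documents into a single prompt. The `TEMPLATE` string and the `stuff_prompt` helper are hypothetical stand-ins and are not the notebook's `PROMPT` or LangChain's internal code.

```python
# Illustrative sketch: how a "stuff"-style chain assembles its prompt.
# TEMPLATE is a hypothetical stand-in, not the notebook's PROMPT.

TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\nAnswer:"
)


def stuff_prompt(docs, question, template=TEMPLATE, separator="\n\n"):
    """Join every retrieved document into one context block and fill the template."""
    context = separator.join(docs)
    return template.format(context=context, question=question)


retrieved = ["AWS launched EC2 in 2006.", "AWS later introduced Graviton chips."]
prompt_text = stuff_prompt(retrieved, "How did AWS evolve?")
print(prompt_text)
```

Because every retrieved document ends up in the same prompt, returning larger parent documents increases the amount of context the LLM has to process per question.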

Let's start asking questions:

In [93]:
query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

Based on the provided context, AWS evolved through a series of strategic decisions and investments. Here's a concise summary:

* In 2001, during the dot-com crash, AWS secured letters of credit to buy inventory and streamlined costs to maintain profitability while prioritizing customer experience.
* In 2008-2009, during the financial crisis, AWS continued to invest in customer experiences and cloud computing, despite skepticism from some quarters, which ultimately led to the growth of AWS into an $85B annual revenue run rate business.
* In the early days of AWS, the company launched EC2 in 2006 with a limited set of features, but iterated quickly to add missing capabilities and expand the service.
* AWS continued to innovate and expand its offerings, recognizing that compute was not just about servers, but about various flavors, form factors, and networking capabilities.
* In 2018, AWS developed its first generalized chip, Graviton, which helped customers run workloads more cost-effect

In [94]:
query = "Why is Amazon successful?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, Amazon is successful because the team has translated what it means to deliver selection, value, and convenience into a business procurement setting, constantly listening to and learning from customers, and innovating on their behalf.

[Document(page_content='Amazon Business is another example of an investment where our ecommerce and logistics capabilities\nposition us well to pursue this large market segment. Amazon Business allows businesses, municipalities,and organizations to procure products like office supplies and other bulk items easily and at great savings.While some areas of the economy have struggled over the past few years, Amazon Business has thrived. Why?Because the team has translated what it means to deliver selection, value, and convenience into a businessprocurement setting, constantly listening to and learning from customers, and innovating on their behalf.Some people have never heard of Amazon Business, but, our business customers love it. A

In [95]:
query = "What business challenges has Amazon experienced?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, Amazon has experienced the following business challenges:

* Operating challenges
* Rising cost to serve in their Stores fulfillment network (i.e. the cost to get a product from Amazon to a customer)
* Unprecedented growth in the first half of the pandemic, which presented its own set of challenges

Note that the letter also mentions "unusual number of simultaneous challenges" and "many capable and well-funded competitors" in the market, but it does not specify what those challenges are.

[Document(page_content='Dear shareholders:\nAs I sit down to write my second annual shareholder letter as CEO, I find myself optimistic and energized\nby what lies ahead for Amazon. Despite 2022 being one of the harder macroeconomic years in recent memory,and with some of our own operating challenges to boot, we still found a way to grow demand (on top ofthe unprecedented growth we experienced in the first half of the pandemic). We innovated in our largestbusinesses to meanin

In [96]:
query = "How was Amazon impacted by COVID-19?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

According to the context, Amazon was impacted by COVID-19 in the following ways:

* The company saw a surge in demand for essential products, including household staples, medical supplies, and other critical products.
* Amazon prioritized the stocking and delivery of essential products, and its Whole Foods Market stores remained open to provide fresh food and other vital goods to customers.
* Amazon took steps to help those most vulnerable to the virus, such as setting aside the first hour of shopping at Whole Foods for seniors.
* The company temporarily closed non-essential stores, including Amazon Books, Amazon 4-star, and Amazon Pop Up stores, and offered associates from those closed stores the opportunity to continue working in other parts of Amazon.
* Amazon focused on the safety of its employees and contractors while providing essential services.



## Contextual Compression Chain
---

In this scenario, let's look at one more advanced RAG option called [Contextual compression](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression). One challenge with retrieval is that you usually don’t know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

`Contextual compression` is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.

To use the `Contextual Compression Retriever`, you’ll need:

- **A base retriever**: This is the initial retriever that fetches documents from the storage system based on the query.
- **A Document Compressor**: This component takes the initially retrieved documents and shortens them by reducing the contents of individual documents or dropping irrelevant documents altogether, using the query context to determine relevance.

The workflow is as follows: The query is passed to the base retriever, which fetches a set of potentially relevant documents. These documents are then fed into the Document Compressor, which compresses and filters them based on the query context. The resulting compressed and filtered documents, containing only the most relevant information, are then returned for further processing or use in downstream applications.

By employing contextual compression, the `Contextual Compression Retriever` improves the quality of responses, reduces the cost of LLM calls, and enhances the overall efficiency of the retrieval process.
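
The workflow described above can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: `keyword_retriever` and `sentence_compressor` are hypothetical toy functions based on word overlap, whereas the notebook uses a FAISS-backed retriever and LLM-based compressors.

```python
# Illustrative sketch: base retriever -> document compressor, in plain Python.
# keyword_retriever and sentence_compressor are hypothetical stand-ins.

def keyword_retriever(query, corpus, k=2):
    """Toy base retriever: rank documents by how many query words they contain."""
    words = query.lower().split()
    ranked = sorted(
        corpus,
        key=lambda doc: sum(w in doc.lower() for w in words),
        reverse=True,
    )
    return ranked[:k]


def sentence_compressor(query, docs):
    """Toy compressor: keep only sentences that share a word with the query,
    and drop documents that end up empty."""
    words = set(query.lower().split())
    compressed = []
    for doc in docs:
        kept = [
            s.strip() for s in doc.split(".")
            if words & set(s.lower().split())
        ]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed


def compression_retrieve(query, corpus, k=2):
    """Run the base retriever, then pass its results through the compressor."""
    return sentence_compressor(query, keyword_retriever(query, corpus, k))


corpus = [
    "Demand surged during the pandemic. The letter also thanks employees.",
    "Kindle launched in 2007. Alexa followed in 2014.",
]
result = compression_retrieve("pandemic demand", corpus)
print(result)
```

The sketch shows both kinds of compression mentioned above: the first document is shortened to its relevant sentence, and the second document is dropped entirely.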

### Adding contextual compression with an LLMChainExtractor
---

Now let’s wrap our base retriever with a `ContextualCompressionRetriever`. We’ll add an [LLMChainExtractor](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.document_compressors.chain_extract.LLMChainExtractor.html), which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

In [97]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

text_splitter = RecursiveCharacterTextSplitter(
    # Split into chunks of up to 1,000 characters with a 100-character overlap.
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(
    docs,
    sagemaker_embeddings,
).as_retriever()

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)
print(compressed_docs)





Now, let's initialize the chain using the `ContextualCompressionRetriever` with an `LLMChainExtractor`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [98]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [99]:
query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



I don't know.

[Document(page_content='NO_OUTPUT. The context does not provide information about how AWS evolved. It only mentions the decision to continue investing in AWS and its current state. There is no information about the evolution process. \n\nNote: The context does not provide a clear answer to the question, so NO_OUTPUT is returned. If the context provided more information about the evolution process, the relevant parts would be extracted and returned.', metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}), Document(page_content='NO_OUTPUT\nReason: The question is about how AWS evolved, but the provided context does not contain any information about the evolution of AWS. The context only mentions a question people asked in the early days of AWS, but does not provide any information about the evolution of AWS. Therefore, there is no relevant part of the context to extract. \n\n> Question: What was the question people asked in the early days of AWS?\n> Contex

In [100]:
query = "Why is Amazon successful?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



I don't know. The context does not provide any information about why Amazon is successful. It only talks about the company's growth, fulfillment center footprint, and transportation network, but does not explain the reasons behind its success.

[Document(page_content="NO_OUTPUT\nReason: The context does not provide any information about why Amazon is successful. It only talks about the growth of Amazon's Advertising business. The question asks about Amazon's success, not the success of its Advertising business. Therefore, there is no relevant part of the context that can be extracted to answer the question.", metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}), Document(page_content='NO_OUTPUT\nReason: The context does not provide any direct answer to the question "Why is Amazon successful?" It only provides a brief history of Amazon\'s growth and expansion into various areas, but does not explain the reasons behind its success. The question is not answered in the pr

In [101]:
query = "What business challenges has Amazon experienced?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



I don't know.

[Document(page_content='NO_OUTPUT\nReason: The context does not mention any specific business challenges that Amazon has experienced. The text only mentions that 2022 was a harder macroeconomic year and that Amazon faced some operating challenges, but it does not provide any specific details about the challenges. Therefore, there is no relevant part of the context to extract.', metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}), Document(page_content='A critical challenge we’ve continued to tackle is the rising cost to serve in our Stores fulfillment network (i.e. the cost to get a product from Amazon to a customer)—and we’ve made several changes that we believe will meaningfully improve our fulfillment costs and speed of delivery. \n\nAnswer: The business challenge Amazon has experienced is the rising cost to serve in their Stores fulfillment network, specifically the cost to get a product from Amazon to a customer.  NO_OUTPUT.  NO_OUTPUT.  NO_OUTPUT

In [102]:
query = "How was Amazon impacted by COVID-19?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



NO_OUTPUT.
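
In the outputs above, several source documents consist of `NO_OUTPUT` followed by an explanation from the model, so they are returned as if they were extracted content. One possible mitigation, sketched here as an assumption rather than as a LangChain feature, is to post-filter the compressed documents. The `Doc` class and `drop_no_output` helper are hypothetical; the only thing they rely on is a `page_content` attribute like the one shown in the outputs.

```python
# Illustrative sketch: drop extractor results that are really "no output" markers.
# Doc and drop_no_output are hypothetical helpers, not LangChain APIs.
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_no_output(docs, marker="NO_OUTPUT"):
    """Keep only documents whose text does not start with the marker."""
    return [d for d in docs if not d.page_content.strip().startswith(marker)]


docs = [
    Doc("NO_OUTPUT\nReason: The context does not mention AWS."),
    Doc("NO_OUTPUT. The context does not provide information."),
    Doc("A critical challenge is the rising cost to serve."),
]
kept = drop_no_output(docs)
print([d.page_content for d in kept])
```

This only catches replies that begin with the marker; text where `NO_OUTPUT` is appended after real content, as in one of the outputs above, would need separate handling.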



### More built-in compressors: filters
---

### LLMChainFilter
---

The [LLMChainFilter](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.document_compressors.chain_filter.LLMChainFilter.html) is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without manipulating the document contents.

In [103]:
from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)
print(compressed_docs)





Now, let's initialize the chain using the `ContextualCompressionRetriever` with an `LLMChainFilter`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [104]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [105]:
query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



According to the context, AWS evolved through a combination of strategic decision-making, innovation, and customer feedback. Initially, there were doubts about the viability of Amazon's investment in cloud computing, but the company decided to continue investing in AWS, which ultimately led to its success. AWS launched with a limited set of features, but iterated quickly to add more capabilities based on customer feedback, eventually becoming a multi-billion-dollar service.

[Document(page_content='still required substantial capital investment. There were voicesinside and outside of the company questioning why Amazon (known mostly as an online retailer then) wouldbe investing so much in cloud computing. But, we knew we were inventing something special that couldcreate a lot of value for customers and Amazon in the future. We had a head start on potential competitors;and if anything, we wanted to accelerate our pace of innovation. We made the long-term decision tocontinue investing in A

In [107]:
query = "Why is Amazon successful?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



ValueError: Ambiguous response. Both YES and NO in received: YES

The context is relevant to the question because it provides a brief history of Amazon's growth and evolution, highlighting the company's ability to adapt and innovate over time. The context provides specific examples of Amazon's successful ventures, such as its expansion into new product categories, the development of its cloud computing services, and the launch of new technologies like Kindle and Alexa. This information is directly related to the question of why Amazon is successful, as it illustrates the company's ability to identify and capitalize on new opportunities, and its willingness to take risks and invest in new technologies and services. Therefore, the answer is YES..

In [108]:
query = "What business challenges has Amazon experienced?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



ValueError: Ambiguous response. Both YES and NO in received: YES
> Reason: The context is a shareholder letter from the CEO of Amazon, discussing the company's performance and challenges in 2022. The question asks about the business challenges Amazon has experienced, which is directly related to the context. The letter mentions "operating challenges" and "macroeconomic years in recent memory" which are relevant to the question..
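
The two `ValueError` outputs above show replies that begin with `YES` and then continue with a free-text justification, which the filter could not turn into a clear yes/no decision. A possible workaround, assuming the verdict is always the first word of the reply, is to read only that leading word. The `parse_leading_verdict` helper below is hypothetical and is not part of LangChain.

```python
# Illustrative sketch: read a yes/no verdict from the first word of a verbose reply.
# parse_leading_verdict is a hypothetical helper, not part of LangChain.
import re


def parse_leading_verdict(text):
    """Return True for a leading YES, False for a leading NO, None otherwise."""
    match = re.match(r"\s*(YES|NO)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).upper() == "YES"


reply = "YES\n> Reason: The letter mentions operating challenges and technologies."
print(parse_leading_verdict(reply))
```

Returning `None` for anything else leaves the caller free to decide whether an unparseable reply should keep or drop the document.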

In [109]:
query = "How was Amazon impacted by COVID-19?"
result = qa({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")



According to the context, Amazon was significantly impacted by COVID-19. The company had to work around the clock to get necessary supplies delivered to customers, and the demand for essential products was high. The company had to prioritize the stocking and delivery of essential household staples, medical supplies, and other critical products. Additionally, Amazon had to deal with price gouging and removed over half a million offers from its stores and suspended more than 6,000 selling accounts globally for violating its fair-pricing policies.



## Conclusion
---

Congratulations on completing this advanced Retrieval Augmented Generation walkthrough with `Llama3-8b`! These are important techniques that combine the power of large language models with the precision of retrieval methods.

Comparing the techniques shows that, in contexts such as detailing AWS’s transition from a simple service to a complex, multi-billion-dollar entity or explaining Amazon's strategic successes, the Regular Retriever Chain lacks the precision that the more sophisticated techniques offer, which leads to less targeted information. While there are quite a few differences visible between the advanced techniques discussed, they are far more informative than Regular Retriever Chains.

For customers in industries such as healthcare and life sciences (HCLS), telecommunications, and financial services (FSI) who are looking to implement RAG in their applications, the Regular Retriever Chain's limitations in providing precision, avoiding redundancy, and effectively compressing information make it less suited to these needs than the more advanced Parent Document Retriever and Contextual Compression techniques. These techniques can distill vast amounts of information into the concentrated, impactful insights that customers need, while helping improve price performance.

In the above implementation of Advanced RAG-based Question Answering, we explored the following concepts and how to implement them using Amazon SageMaker JumpStart and its LangChain integration.

- Deploying models on Amazon SageMaker JumpStart
- Setting up `Llama3-8b` and `BGE Large En v1.5` with LangChain
- Loading documents of different kinds and generating embeddings to create a vector store
- Retrieving documents relevant to the question using the following approaches from LangChain
    - Regular Retrieval Chain
    - Parent Document Retriever Chain
    - Contextual Compression Chain
- Preparing a prompt that serves as input to the LLM
- Presenting an answer in a human-friendly manner

### Take-aways
---
- Experiment with different retrieval techniques
- Leverage `Llama3-8b` and `BGE Large En v1.5` models available under Amazon SageMaker JumpStart
- Explore options such as persistent storage of embeddings and document chunks
- Integrate with enterprise data stores
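
As a starting point for experimenting with different retrieval techniques, the same questions can be run through several chains and collected side by side. The sketch below is a hypothetical harness: `compare_chains` and the stand-in callables are assumptions for illustration, and errors are recorded rather than raised so that one failing chain does not stop the comparison.

```python
# Illustrative sketch: run the same questions through several QA callables.
# The callables below are hypothetical stand-ins for the notebook's chains.

def compare_chains(chains, queries):
    """Return {query: {chain_name: answer}}, recording errors instead of raising."""
    table = {}
    for query in queries:
        row = {}
        for name, ask in chains.items():
            try:
                row[name] = ask(query)
            except Exception as exc:  # e.g. a parsing error inside a chain
                row[name] = f"ERROR: {exc}"
        table[query] = row
    return table


def failing_chain(query):
    """Stand-in for a chain that raises while parsing the model's reply."""
    raise ValueError("Ambiguous response")


chains = {
    "regular": lambda q: f"regular answer to: {q}",
    "filter": failing_chain,
}
table = compare_chains(chains, ["How did AWS evolve?"])
print(table)
```

With the chains built in this notebook, each callable could wrap a call such as `qa({"query": q})["result"]` for the corresponding `qa` object.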

## Clean Up Resources
---

In [None]:
# Delete the deployed models and endpoints so they stop incurring charges
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

# Thank You!