# Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced [LangChain](https://python.langchain.com/en/latest/index.html) Library


---

This notebook has been tested in us-east-1 with the **Data Science 3.0** kernel on an **ml.m5.2xlarge** instance.

---


Many use cases such as building a chatbot require text (text2text) generation models like **[BloomZ 7B1](https://huggingface.co/bigscience/bloomz-7b1)**, **[Flan T5 XXL](https://huggingface.co/google/flan-t5-xxl)**, and **[Flan T5 UL2](https://huggingface.co/google/flan-ul2)** to respond to user questions with insightful answers. The **BloomZ 7B1**, **Flan T5 XXL**, and **Flan T5 UL2** models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.

In this notebook we will demonstrate how to use **Flan T5 XXL** to answer questions using a library of documents as a reference, by using document embeddings and retrieval. The embeddings are generated by the **GPT-J-6B** embedding model.

## Step 1. Deploy large language model (LLM) and embedding model in SageMaker JumpStart

To better illustrate the idea, let's first deploy all the models that are required to perform the demo. This was done in the previous lab.

In [2]:
#load stored variables from previous notebook
%store -r

Unable to restore variable 'qa', ignoring (use %store -d to forget!)
The error was: <class 'KeyError'>


In [3]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain==0.0.148 --quiet
!pip install faiss-cpu --quiet
!pip install unstructured --quiet
!pip install pdf2image --quiet
!pip install pypdf --quiet
!pip install google-search-results --quiet
!pip install wikipedia --quiet
!pip install huggingface_hub --quiet


In [4]:
import time
import sagemaker, boto3, json
import glob
import os
import pandas as pd
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sm_client = boto3.client("sagemaker", aws_region)
sess = sagemaker.Session()
model_version = "*"

In [5]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response

#method used to parse the inference model's response. we pass it as part of the model's config
def parse_response_model(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    generated_text = model_predictions["generated_texts"]
    return generated_text

Deploy SageMaker endpoints for the large language models and the GPT-J 6B embedding model. Uncomment the corresponding entries if you want to deploy multiple LLMs and compare their performance.

## Step 2. Ask a question to LLM without providing the context

To illustrate why we need a retrieval-augmented generation (RAG) approach to solve the question answering problem, let's ask the model a question directly and see how it responds.

In [6]:
question = "Which instances can I use with Managed Spot Training in Amazon SageMaker?"

In [7]:
#more info on top_k and top_p here: https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p
payload = {
    "text_inputs": question,
    "max_length": 200,
    "num_return_sequences": 1,
    "top_k": 20,
    "top_p": 0.70,
    "do_sample": True,
    "temperature": 0.5
}

#TO REMOVE manually setting up endpoint for test
#_MODEL_CONFIG_[inference_model]["endpoint_name"] = "jumpstart-dft-hf-text2text-flan-t5-xxl"

endpoint_name = _MODEL_CONFIG_[inference_model]["endpoint_name"]

query_response = query_endpoint_with_json_payload(
    json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
)
generated_texts = parse_response_model(query_response)
print(f"For model: {endpoint_name}, \nthe generated output is: {generated_texts[0]}\n")

For model: raglc-huggingface-text2text-flan-t5-xxl-2023-06-13-12-28-50-996, 
the generated output is: AWS Lambda instances



You can see the generated answer is wrong or doesn't make much sense. 
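
The `top_k`, `top_p`, and `temperature` values in the payload control how the next token is sampled. The sketch below is a simplified, pure-Python illustration of that idea on a toy list of logits. It is not the endpoint's actual implementation, and the names `filter_top_k_top_p`, `sample_token`, and `toy_logits` are made up for this example.

```python
import math
import random


def filter_top_k_top_p(logits, top_k=20, top_p=0.70, temperature=0.5):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # top_k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


def sample_token(logits, rng=None, **kwargs):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_top_k_top_p(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0]  # invented scores for five candidate tokens
candidates = filter_top_k_top_p(toy_logits, top_k=3, top_p=0.70, temperature=0.5)
print(candidates)
```

With a low temperature and a moderate `top_p`, very few candidates survive, which makes the output more deterministic. Raising either value widens the pool the model samples from.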

## Step 3. Improve the answer to the same question using **prompt engineering** with insightful context


To answer the question better, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [8]:
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

In [9]:
parameters = {
    "max_length": 200,
    "num_return_sequences": 1,
    "top_k": 20,
    "top_p": 0.70,
    "do_sample": True,
    "temperature": 0.5
}

endpoint_name = _MODEL_CONFIG_[inference_model]["endpoint_name"]

prompt = """Answer based on context:\n\n{context}\n\n{question}"""

text_input = prompt.replace("{context}", context)
text_input = text_input.replace("{question}", question)

payload = {"text_inputs": text_input, **parameters}

query_response = query_endpoint_with_json_payload(
    json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
)
generated_texts = parse_response_model(query_response)

print(f"For model: {endpoint_name}, \nthe generated output is: {generated_texts[0]}")

For model: raglc-huggingface-text2text-flan-t5-xxl-2023-06-13-12-28-50-996, 
the generated output is: all instances supported in Amazon SageMaker


The output from step 3 shows that the chance of getting a correct response depends heavily on the context you send to the LLM.
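
The prompt assembly from step 3 can be wrapped in a small helper that accepts several context passages, which is the shape we need once passages are retrieved automatically. This is a minimal sketch: `build_prompt` and `PROMPT_TEMPLATE` are names introduced here for illustration, and the template text is the one used in the cell above.

```python
from typing import List

PROMPT_TEMPLATE = "Answer based on context:\n\n{context}\n\n{question}"


def build_prompt(contexts: List[str], question: str, template: str = PROMPT_TEMPLATE) -> str:
    """Join the context passages and substitute them into the template."""
    context = "\n\n".join(contexts)
    # str.replace is used instead of str.format so that other braces
    # appearing inside a passage are left untouched.
    return template.replace("{context}", context).replace("{question}", question)


example_prompt = build_prompt(
    ["Managed Spot Training can be used with all instances supported in Amazon SageMaker."],
    "Which instances can I use with Managed Spot Training in Amazon SageMaker?",
)
print(example_prompt)
```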

**<span style="color:red">Now the question becomes: where can we find relevant context for a given user query? The answer is to use a pre-stored knowledge base with retrieval-augmented generation, as shown in step 4 below</span>.**

## Step 4. Use RAG based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question and answering application.


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do the following.

1. **Generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the indexes of the top K most relevant documents in the embedding space using in-memory Faiss search.**
    - 2.3 **Use the indexes to retrieve the corresponding documents.**
3. **Combine the retrieved documents with prompt and question and send them into SageMaker LLM.**
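
The steps above can be sketched end to end in plain Python. Everything here is a stand-in chosen so the example runs without any endpoint: `toy_embed` (a bag-of-words counter) replaces the GPT-J-6B embedding model, a brute-force cosine ranking replaces the Faiss search, and the second and third documents are invented for the example.

```python
import math
import re
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for the embedding endpoint: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_documents(query: str, documents: List[str], k: int = 2) -> List[str]:
    doc_vectors = [toy_embed(d) for d in documents]  # step 1: embed every document
    query_vector = toy_embed(query)                  # step 2.1: embed the query
    ranked = sorted(                                 # step 2.2: rank document indexes
        range(len(documents)),
        key=lambda i: cosine(query_vector, doc_vectors[i]),
        reverse=True,
    )
    return [documents[i] for i in ranked[:k]]        # step 2.3: look up the documents


documents = [
    "Managed Spot Training can be used with all instances supported in Amazon SageMaker.",
    "Toy document: notebooks can be shared with teammates.",
    "Toy document: datasets are described by a schema.",
]
query = "Which instances can I use with Managed Spot Training in Amazon SageMaker?"

retrieved = top_k_documents(query, documents, k=2)
# Step 3: combine the retrieved documents with the prompt and the question.
prompt = "Answer based on context:\n\n" + "\n\n".join(retrieved) + "\n\n" + query
print(prompt)
```

A real embedding model captures meaning rather than word overlap, so it can match a query to a document that uses different wording. The control flow stays the same.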



Note: Each retrieved document/text should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt, which has a maximum sequence length of 1024 tokens.
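
One way to respect that limit is to add retrieved chunks in rank order until the budget is used up. The sketch below assumes a crude token estimate (whitespace-separated words); a real tokenizer usually produces more tokens than words, so treat the numbers as illustrative. `approx_token_count` and `pack_context` are hypothetical helpers, not part of LangChain or SageMaker.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Rough proxy for the token count: number of whitespace-separated words."""
    return len(text.split())


def pack_context(
    ranked_chunks: List[str],
    question: str,
    max_tokens: int = 1024,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep the best-ranked chunks that fit in the prompt next to the question."""
    budget = max_tokens - count_tokens(question)
    selected = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        budget -= cost
    return selected


chunks = ["alpha " * 400, "beta " * 400, "gamma " * 400]  # three 400-word toy chunks
selected = pack_context(chunks, "Which instances can I use?")
print(len(selected))
```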

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. This requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.
2. Prepare the dataset to build the knowledge base.

---

First, we wrap our SageMaker endpoint for the embedding model in `langchain.embeddings.SagemakerEndpointEmbeddings`, overriding `embed_documents` so that it works with the SageMaker embedding model.

In [10]:
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.embeddings import SagemakerEndpointEmbeddings


class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: How many input texts are grouped together into
                one request to the endpoint. Defaults to 5.

        Returns:
            List of embeddings, one for each text.
        """
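        results = []
        if not texts:
            # Nothing to embed; this also avoids a zero step in range() below.
            return results
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

        # Send the texts to the endpoint in batches of _chunk_size.
        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results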


class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

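    def transform_input(self, prompt: List[str], model_kwargs={}) -> bytes:
        # The embedding endpoint receives a batch of texts under "text_inputs".
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output) -> List[List[float]]:
        # `output` is the response body stream; the result is extended into
        # the list of embeddings, one vector per input text.
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        return embeddings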


content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=_MODEL_CONFIG_[embedding_model]["endpoint_name"],
    region_name=aws_region,
    content_handler=content_handler,
)
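
The batching done by `embed_documents` can be tried without an endpoint. In this sketch, `embed_in_batches` mirrors the loop above as a standalone function, and `fake_embed_batch` is a hypothetical stand-in for the endpoint call that records how many texts arrive in each request.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    chunk_size: int = 5,
) -> List[List[float]]:
    """Send texts to embed_batch in groups of at most chunk_size."""
    results: List[List[float]] = []
    for i in range(0, len(texts), chunk_size):
        # Slicing past the end is safe, so the last batch may simply be shorter.
        results.extend(embed_batch(texts[i : i + chunk_size]))
    return results


batch_sizes = []  # filled in by the fake endpoint so we can inspect the requests


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for the endpoint: returns a one-dimensional 'embedding' per text."""
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches([f"doc {n}" for n in range(12)], fake_embed_batch, chunk_size=5)
print(batch_sizes, len(vectors))
```

Grouping several texts per request reduces the number of round trips to the endpoint, at the cost of larger payloads per call.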

Next, we wrap our SageMaker endpoint for the LLM in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [11]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
#from langchain.llms.sagemaker_endpoint import ContentHandlerBase

parameters = {
    "max_length": 300,
    "num_return_sequences": 1,
    "top_k": 30,
    "top_p": 0.50,
    "do_sample": True,
    "temperature": 0.8
}


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_texts"][0]


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=_MODEL_CONFIG_[inference_model]["endpoint_name"],
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)
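
To see what the content handler sends and receives, the two transforms can be exercised on their own. The functions below are standalone copies of the handler logic above, and the response body is a hand-written fake in the `generated_texts` format the handler expects, so no endpoint is called.

```python
import io
import json


def transform_input(prompt: str, model_kwargs: dict) -> bytes:
    """Build the JSON request body: the prompt plus the generation parameters."""
    return json.dumps({"text_inputs": prompt, **model_kwargs}).encode("utf-8")


def transform_output(output) -> str:
    """Read the response stream and return the first generated text."""
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json["generated_texts"][0]


request_body = transform_input("What is Amazon SageMaker?", {"max_length": 300, "temperature": 0.8})
print(request_body)

# A fake response stream, standing in for the body returned by the endpoint.
fake_response = io.BytesIO(json.dumps({"generated_texts": ["a hypothetical answer"]}).encode("utf-8"))
answer = transform_output(fake_response)
print(answer)
```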

## Question and Answering with RAG

## Data sources

In [12]:
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.document_loaders import UnstructuredURLLoader
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

### CSV

Now, let's download the example data and prepare it for the demonstration. We will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data is stored in a CSV file with two columns, Question and Answer. We use the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

**For your own purposes, you can replace the example dataset with your own to build a custom question answering application.**

In [13]:
tmp_folder = "rag_data"

sagemaker_faq = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/"

!mkdir -p $tmp_folder
!aws s3 cp --recursive $sagemaker_faq rag_data

download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to rag_data/Amazon_SageMaker_FAQs.csv


If your data is saved in multiple subsets, the following code reads all files that end with `.csv` and concatenates them. Please ensure each `csv` file has the same format.

In [14]:
all_files = glob.glob(os.path.join("rag_data/", "*.csv"))

df_knowledge = pd.concat(
    (pd.read_csv(f, header=None, names=["Question", "Answer"]) for f in all_files),
    axis=0,
    ignore_index=True,
)

In [15]:
#drop the question column as we're not using it for the exercise.
df_knowledge.drop(["Question"], axis=1, inplace=True)

#saving the modified df 
df_knowledge.to_csv("rag_data/processed_data.csv", header=False, index=False)

df_knowledge.head(5)

| | Answer |
|---|---|
| 0 | Amazon SageMaker is a fully managed service to... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker stores code in ML storage vol... |
| 4 | Amazon SageMaker ensures that ML model artifac... |


Use LangChain to read the `csv` data. LangChain has multiple built-in loaders for reading different file formats such as `txt`, `html`, and `pdf`. For details, see [LangChain document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html).

In [16]:
csv_loader = CSVLoader(file_path="rag_data/processed_data.csv")

### PDF source

Let's also retrieve the PDF version of the Amazon Personalize documentation.

In [17]:
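import requests

personalize_pdf_url = "https://docs.aws.amazon.com/personalize/latest/dg/personalize-dg.pdf"
response = requests.get(personalize_pdf_url)
response.raise_for_status()  # stop here if the download failed
# the context manager closes the file even if the write fails
with open(f"./{tmp_folder}/documentation.pdf", "wb") as file:
    file.write(response.content)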

In [18]:
#this can take a few minutes as the pdf is 122 MB
pdf_loader = PyPDFLoader(f"./{tmp_folder}/documentation.pdf")
pdf_pages = pdf_loader.load_and_split()

### URLs as source

In [19]:
urls = ["https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html", 
        "https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html"]
url_loader = UnstructuredURLLoader(urls=urls)
url_data = url_loader.load()

### Exercise: add a new data source

Check for other loaders you can use and add: https://python.langchain.com/en/latest/modules/indexes/document_loaders.html

### Create the vectorstore index

In [20]:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0, separators=[" ", ",", ".", "\n"])
)

We create the index from the loaders. This will take 5-10 minutes.
Behind the scenes, the `index_creator` splits the documents into chunks of size 300 (where possible), creates embeddings of those chunks, and exposes a retriever.
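
To make the "chunks of size 300 (where possible)" idea concrete, here is a simplified greedy splitter that breaks text on spaces. It is only a sketch of the concept and does not reproduce LangChain's `RecursiveCharacterTextSplitter`, which tries several separators in turn. The function name `split_into_chunks` is made up for this example.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 300) -> List[str]:
    """Greedily pack space-separated words into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            # The word still fits, or it is a single word longer than the limit:
            # such a word becomes its own oversized chunk ("where possible").
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample_text = " ".join(["word"] * 200)  # 999 characters of toy text
sample_chunks = split_into_chunks(sample_text, chunk_size=300)
print([len(chunk) for chunk in sample_chunks])
```

Smaller chunks make retrieval more precise but carry less context each, so the chunk size is a trade-off against the prompt length limit of the LLM.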

Take a break and have a coffee!


In [21]:
index = index_creator.from_loaders([pdf_loader])

Now let's query the vectorstore index

In [22]:
query="what are the mandatory fields for the Interactions dataset in Amazon Personalize?"
#query="How can you import datasets in Amazon Personalize?"
#query="What are the autoML capabilities in Amazon Sagemaker"
    
index.query_with_sources(question=query, llm=sm_llm)

{'question': 'what are the mandatory fields for the Interactions dataset in Amazon Personalize?',
 'answer': 'The following is an example of a minimal Interactions dataset schema. For more examples, see Datasets and schemas (p. 76).  "type": "record", "name":',
 'sources': ''}

In [27]:
'''
retriever = index.vectorstore.as_retriever()
qa = RetrievalQA.from_chain_type(llm=sm_llm, chain_type="stuff", retriever=retriever)
qa.run(query) '''

'\nretriever = index.vectorstore.as_retriever()\nqa = RetrievalQA.from_chain_type(llm=sm_llm, chain_type="stuff", retriever=retriever)\nqa.run(query) '