# Retrieval-Augmented Generation: Question Answering based on Custom Dataset

Many use cases such as building a chatbot require text (text2text) generation models like **[BloomZ 7B1](https://huggingface.co/bigscience/bloomz-7b1)**, **[Flan T5 XXL](https://huggingface.co/google/flan-t5-xxl)**, and **[Flan T5 UL2](https://huggingface.co/google/flan-ul2)** to respond to user questions with insightful answers. The **BloomZ 7B1**, **Flan T5 XXL**, and **Flan T5 UL2** models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.

In this notebook we will demonstrate how to use **AI21 Contextual Answer** to answer questions using a library of documents as a reference, by using document embeddings and retrieval. The embeddings are generated from **GPT-J-6B** embedding model. 

**This notebook serves a template such that you can easily replace the example dataset by your own to build a custom question and asnwering application.**

## Step 1. Deploy large language model (LLM) in SageMaker JumpStart

To better illustrate the idea, let's first deploy all the models that are required to perform the demo. You can choose either deploying all three Flan T5 XXL, BloomZ 7B1, and Flan UL2 models as the large language model (LLM) to compare their model performances, or select **subset** of the models based on your preference. To do that, you need modify the `_MODEL_CONFIG_` python dictionary defined as below.

In [225]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install --upgrade pymongo

[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [226]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
import numpy as np
import requests

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"

url = "https://ep928kbfdd.execute-api.us-east-1.amazonaws.com/test/"

In [227]:
def query_endpoint_with_json_payload(url, payload):
    response = requests.post(
        url,
        json=payload,
    )
    return response

def parse_response_model_ai21(query_response):
    model_predictions = query_response.json()
    generated_text = model_predictions["answer"]
    return generated_text

Please uncomment the entries as below if you want to deploy multiple LLM models to compare their performance.

In [228]:
_MODEL_CONFIG_ = {
    # pre-deploy via JS or API Gateway
    "AI21-Contextual-Answers" : {
        "url": url,
        "parse_function": parse_response_model_ai21,
    }
}

## Step 2. Ask a question to LLM without providing the context

To better illustrate why we need retrieval-augmented generation (RAG) based approach to solve the question and anwering problem. Let's directly ask the model a question and see how they respond.

In [229]:
sample_index = 0

sample = {
    0: {"question": "How to scale down SageMaker Asynchronous endpoint to zero?", 
        "context": """You can scale down the Amazon SageMaker Asynchronous Inference endpoint\
instance count to zero in order to save on costs when you are not actively processing\
requests. You need to define a scaling policy that scales on the "ApproximateBacklogPerInstance"\
custom metric and set the "MinCapacity" value to zero. For step-by-step instructions,\
please visit the `autoscale an asynchronous endpoint` section of the developer guide."""},
    
    1: {"question": "How can I be sure SageMaker protects my data security and privacy?",
        "context": """Amazon SageMaker does not use or share customer models, training data, or algorithms.\
We know that customers care deeply about privacy and data security. That's why AWS gives\
you ownership and control over your content through simple, powerful tools that allow you\
to determine where your content will be stored, secure your content in transit and at rest,\
and manage your access to AWS services and resources for your users. We also implement\
responsible and sophisticated technical and physical controls that are designed to \
prevent unauthorized access to or disclosure of your content. As a customer, you maintain\
ownership of your content, and you select which AWS services can process, store, \
and host your content. We do not access your content for any purpose without your consent
"""
       },
    
    2: {"question": "Which new GAI features will be launched on SageMaker at 2023?",
        "context": ""
    },
    3: {
    "question": "What is Amazon SageMaker?",
    "context": ""
    }
}

question, context = sample[sample_index].values()

In [230]:
payload = {
    "context": "",
    "question": question
}


for model_id in _MODEL_CONFIG_:
    query_response = query_endpoint_with_json_payload(
        url, payload
    )
    
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(f"For model: {model_id}, the generated output is:\n{generated_texts}\n")

For model: AI21-Contextual-Answers, the generated output is:
Answer not in document



## Step 3. Improve the answer to the same question with insightful context


To better answer the question well, we provide extra contextual information, combine it with a prompt, and send it to model together with the question. Below is an example.

In [231]:
payload = {
    "context": context,
    "question": question
}

for model_id in _MODEL_CONFIG_:

    query_response = query_endpoint_with_json_payload(
        url, payload
    )
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(
        f"For model: {model_id}, the generated output is:\n{generated_texts}"
    )


For model: AI21-Contextual-Answers, the generated output is:
In order to scale down the Amazon SageMaker Asynchronous Inference endpoint instance count to zero, you need to implement a scaling policy that scales on the "ApproximateBacklogPerInstance" custom metric and set the "MinCapacity" value to zero.


## Step 4. Use RAG based approach to identify the correct documents, and use them and question to query LLM


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do following.

* **Generate embedings for each of document in the knowledge library with the GPT-J-6B embedding model.**
* **Identify top K most relevant documents based on user query.**
    * **For a query of your interest, generate the embedding of the query using the same embedding model.**
    * **Search the indexes of top K most relevant documents in the embedding space using the SageMaker KNN algorithm.**
    * **Use the indexes to retrieve the corresponded documents.**
* **Combine the retrieved documents with prompt and question and send them into LLM.**



Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens. 

### 4.1 Deploying the model endpoint for All MiniLM L6 v2 embedding model

In this section, we will deploy the All MiniLM L6 v2 embedding model from the Jumpstart UI.


On the left-hand-side navigation pane, got to **Home**, under **SageMaker JumpStart**, choose **Model, notebooks, solutions**. You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case. If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs. To start exploring the Stable Diffusion models, complete the following steps:

1. Go to the **Foundation Models** section. In the search bar, search for the **MiniLM L6 v2** model and select the **All MiniLM L6 v2**.
<div>
    <img src="./img/embedding_model.jpg" alt="Image jumpstart" width="800" style="display:inline-block">
</div>
<br>

2. A new tab is opened with the options to train, deploy and view model details as shown below. In the Deploy Model section, expand Deployment Configuration. For SageMaker hosting instance, choose the hosting instance (for this lab, we use ml.g5.2xlarge). You can also change the Endpoint name as needed. Then click the Deploy button.

<div>
    <img src="./img/embedding_deploy.jpg" alt="Image deploy" width="600" style="display:inline-block">
</div>
<br>

3. The deploy action will start a new tab showing the model creation status and the model deployment status. Wait until the endpoint status shows **In Service**. This will take a few minutes.
<div>
    <img src="./img/ready.jpg" alt="Image ready" width="600" style="display:inline-block">
</div>
<br>

In [232]:
endpoint_name_embed = "jumpstart-dft-hf-textembedding-all-minilm-l6-v2" # change the endpoint name as needed

In [233]:
from tqdm import tqdm
def query_endpoint_with_json_payload_endpoint(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response

def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    embeddings = model_predictions["embedding"]
    return embeddings


def build_embed_table(df_knowledge, endpoint_name_embed, col_name_4_embed, batch_size=10):
    res_embed = []
    N = df_knowledge.shape[0]
    for idx in tqdm(range(0, N, batch_size)):
        content = df_knowledge.loc[idx : (idx + batch_size - 1)][
            col_name_4_embed
        ].tolist()  ## minus -1 as pandas loc slicing is end-inclusive
        payload = {"text_inputs": content}
        query_response = query_endpoint_with_json_payload_endpoint(
            json.dumps(payload).encode("utf-8"), endpoint_name_embed
        )
        generated_embed = parse_response_multiple_texts(query_response)
        res_embed.extend(generated_embed)
    res_embed_df = pd.DataFrame(res_embed)
    return res_embed_df

### 4.2. Generate embedings for each of document in the knowledge library with the All MiniLM L6 v2
 embedding model.

For the purpose of the demo we will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as knowledge library. The data are formatted in a CSV file with two columns Question and Answer. We use **only** the Answer column as the documents of knowledge library, from which relevant documents are retrieved based on a query. 

**Each row in the CSV format dataset corresponds to a textual document. 
We will iterate each document to get its embedding vector via the All MiniLM L6 v2 embedding models. 
For your purpose, you can replace the example dataset of your own to build a custom question and answering application.**


First, we download the dataset from our S3 bucket to the local.

In [234]:
s3_path = f"s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv"

In [235]:
# Downloading the Database
!aws s3 cp $s3_path Amazon_SageMaker_FAQs.csv

download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to ./Amazon_SageMaker_FAQs.csv


In [236]:
import pandas as pd

df_knowledge = pd.read_csv("Amazon_SageMaker_FAQs.csv", header=None, usecols=[1], names=["Answer"])
df_knowledge.head(6)

Unnamed: 0,Answer
0,Amazon SageMaker is a fully managed service to...
1,For a list of the supported Amazon SageMaker A...
2,Amazon SageMaker is designed for high availabi...
3,Amazon SageMaker stores code in ML storage vol...
4,Amazon SageMaker ensures that ML model artifac...
5,Amazon SageMaker does not use or share custome...


In [237]:
df_knowledge_embed = build_embed_table(
    df_knowledge, endpoint_name_embed, col_name_4_embed='Answer', batch_size=20
)
df_knowledge_embed.head(5)

100%|██████████| 8/8 [00:16<00:00,  2.09s/it]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.05132,-0.086265,-0.03127,0.03322,0.077008,-0.035504,-0.088567,-0.019618,-0.071139,0.043609,...,-0.018304,0.028802,0.042111,-0.127226,0.023232,0.040018,0.014618,-0.035364,0.029234,0.015711
1,0.062383,-0.073353,-0.024781,-0.01937,0.050714,0.028094,-0.128734,-0.038077,-0.075275,0.017887,...,0.005776,0.025631,0.005679,-0.131589,0.012769,0.097371,0.093634,-0.007152,-0.04735,0.036941
2,-0.077738,-0.078382,-0.029804,0.023051,0.043494,-0.042205,-0.109057,0.015371,-0.021639,0.049813,...,-0.028909,-0.002021,0.020902,-0.088226,0.029838,0.029555,0.051726,-0.017517,-0.015522,0.017543
3,-0.055477,-0.058904,-0.145159,0.036093,0.061442,0.017926,-0.094529,-0.001391,0.041213,0.031146,...,-0.01829,0.014639,-0.031307,-0.085898,0.072894,0.011447,0.011394,0.015231,0.036745,-0.059684
4,-0.056408,-0.062517,-0.045154,0.006222,0.074342,0.023912,-0.059651,0.00828,0.029144,0.01536,...,-0.010968,0.053354,0.006876,-0.043502,0.035912,0.05948,0.056548,-0.001481,0.032812,-0.028805


In [238]:
assert df_knowledge_embed.shape[0] == df_knowledge.shape[0]

Save the embedding data for further usage.

In [239]:
df_knowledge_embed.to_csv("Amazon_SageMaker_FAQs_embedding.csv", header=None, index=False)

### 4.3. Index the embedding knowledge library using the Mongo Atlas Search Index

You can choose to use the Mongo Atlas Search Index, which will conduct following.

1. Create a Mongo Atlas database
2. Convert document content to embedding
3. Insert documents together with its vector to Mongo Atlas collection
4. Create vector search indexing from Mongo Atlas collection


**Note.** 

Vector search is a capability that allows you to do semantic search where you are searching data based on meaning. This technique employs machine learning models, often called encoders, to transform text, audio, images, or other types of data into high-dimensional vectors. These vectors capture the semantic meaning of the data, which can then be searched through to find similar content based on vectors being “near” one another in a high-dimensional space

You also have other options to store your vectors in a database. [Amazon Opensearch](https://aws.amazon.com/opensearch-service/) is a popular choice for vector DB. You may refer to this [blog](https://aws.amazon.com/blogs/machine-learning/build-a-powerful-question-answering-bot-with-amazon-sagemaker-amazon-opensearch-service-streamlit-and-langchain/) and follow the instructions to build your own RAG solution with Opensearch! Alternatively, you may also use [Amazon Kendra](https://aws.amazon.com/kendra/) for its built-in NLP capabilities and pre-trained domain knowledge. Refer to this [blog](https://aws.amazon.com/blogs/machine-learning/quickly-build-high-accuracy-generative-ai-applications-on-enterprise-data-using-amazon-kendra-langchain-and-large-language-models/) for guided instructions on how to set it up.

### Connect to Mongo Atlas database with pymongo client:

In [240]:
# REPLACE ME
uri = "mongodb+srv://admin:csAwr7ZU6Dd3GLLG@awsome-builder.pvlboil.mongodb.net/?retryWrites=true&w=majority" # REPLACE ME
DB_NAME = 'awsome-builder'
COLLECTION = "qna"

In [241]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


### Insert documents together with its vector to Mongo Atlas collection

In [242]:
collection = client[DB_NAME][COLLECTION]


for index, embedding in enumerate(np.array(df_knowledge_embed)):    
    record = { 
        "docs": np.array(df_knowledge)[index][0],
        "plot_embedding": embedding.tolist() 
    }
    collection.insert_one(record)
    
print("Completed import documents to the Mongo Atlas Collection")

Completed import documents to the Mongo Atlas Collection


### Create search indexing from Mongo Atlas collection

Atlas Vector Search allows you to search any unstructured data. You can create vector embeddings using the machine learning model of your choice (OpenAI, Hugging Face, and more) and store them in Atlas. It powers use cases such as similarity search, recommendation engines, Q&A systems, dynamic personalization and long-term memory for LLMs.

For this workshop, we use Mongo Atlas Search with built-in KNN vector search to get the similarity documents. KNN stands for "K Nearest Neighbors," which is the algorithm frequently used to find vectors near one another.

1. Create a search index from Mongo Atlas

<div>
    <img src="./img/mongo_search_index.jpg" alt="Image ready" width="600" style="display:inline-block">
</div>
<br>

Mongo Atlas search index json settings

```json
{
    "mappings": {
    "dynamic": true,
    "fields": {
      "plot_embedding": {
        "dimensions": 384,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}
```

2. Complete search indexing

<div>
    <img src="./img/mongo_search_ready.png" alt="Image ready" width="600" style="display:inline-block">
</div>
<br>

## Convert Question to Embedding

In [243]:
query_response = query_endpoint_with_json_payload_endpoint(
    question, endpoint_name_embed, content_type="application/x-text"
)
question_embedding = parse_response_multiple_texts(query_response)
print(question, np.array(question_embedding), sep="\n")

How to scale down SageMaker Asynchronous endpoint to zero?
[[-1.25770820e-02 -2.24030968e-02 -9.01289880e-02  7.80547336e-02
  -2.11871639e-02 -3.50977667e-02 -1.00249007e-01  5.81515282e-02
  -3.54115926e-02 -3.41335796e-02  3.46331932e-02 -3.09880823e-02
  -6.05630726e-02 -5.65665308e-03  1.86302271e-02  3.16554196e-02
   1.50426943e-02 -4.70327586e-02 -1.77897848e-02  1.24955783e-02
  -4.21567336e-02 -2.70334687e-02  2.69602127e-02  6.11171462e-02
   4.77826484e-02 -8.59977305e-02 -5.77164777e-02 -1.69550460e-02
   2.61355340e-02 -4.29883935e-02  1.08105034e-01 -4.23644949e-03
  -3.97508144e-02 -4.10162518e-03  2.23547332e-02  4.66536209e-02
  -3.62018496e-02 -2.69552059e-02  2.05249962e-04  7.39176525e-03
   1.28942356e-01  2.86590513e-02 -6.53640553e-02  7.18173897e-03
   7.03283474e-02 -2.93456037e-02  6.89646089e-03  2.25637667e-03
   2.86780763e-02 -4.96147461e-02  1.62839070e-02  5.43807168e-03
  -5.66927567e-02  1.90991759e-02 -5.72254173e-02  5.60955182e-02
   8.57988521e-02

### 4.4 Retrieve the most relevant documents

Given the embedding of a query, we will query the endpoint to get the indexes of top K most relevant documents and use the indexes to retrieve the corresponded textual documents.

In [244]:
K = 1

# Query for similar documents.
documents = collection.aggregate([
    {
        "$search": {
            "index": "default",
            "knnBeta": {
                "vector": question_embedding[0],
                "path": "plot_embedding",
                "k": K
                }
            }
        }
    ])

documents = list(documents)

context_doc = ''
if len(documents) > 0:
    context_doc = documents[0]['docs']
    
print("The result of relevant document:\n{}".format(context_doc))

The result of relevant document:
You can scale down the Amazon SageMaker Asynchronous Inference endpoint instance count to zero in order to save on costs when you are not actively processing requests. You need to define a scaling policy that scales on the "ApproximateBacklogPerInstance" custom metric and set the "MinCapacity" value to zero. For step-by-step instructions, please visit the autoscale an asynchronous endpoint section of the developer guide. 


## Combine the retrieved documents, prompt, and question to query the LLM

In [245]:
print('context: {}'.format(context_doc))

for model_id in _MODEL_CONFIG_:
    
    if len(context_doc) == 0:
        print(f"Sorry, I don't have information to answer this question:\n{question}.")
        break
    
    payload = {"context": context_doc, "question": question}

    query_response = query_endpoint_with_json_payload(
        url, payload
    )
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(f"\nAnswer from model: {model_id}:\n{generated_texts}\n")

context: You can scale down the Amazon SageMaker Asynchronous Inference endpoint instance count to zero in order to save on costs when you are not actively processing requests. You need to define a scaling policy that scales on the "ApproximateBacklogPerInstance" custom metric and set the "MinCapacity" value to zero. For step-by-step instructions, please visit the autoscale an asynchronous endpoint section of the developer guide. 

Answer from model: AI21-Contextual-Answers:
In order to scale down the Amazon SageMaker Asynchronous Inference endpoint instance count to zero, you need to define a scaling policy that scales on the "ApproximateBacklogPerInstance" custom metric and set the "MinCapacity" value to zero.



### 4.6 Clean up

Uncomment below cell to delete the endpoint after testing. 

In [None]:
# delete the endpoints hosting the embedding model
sagemaker_session.delete_endpoint(endpoint_name_embed)

## Bonus

### Interact with the model

**AI21 Studio Contextual Answers model** allows you to access our high-quality question answering technology. It was designed to answer questions based on a specific document context provided by the customer. This avoids any factual issues that language models may have and makes sure the answers it provides are grounded in that context document.

This model receives document text, serving as a context, and a question and returns an answer based entirely on this context. This means that if the answer to your question is not in the document, the model will indicate it (instead of providing a false answer).

To get a sense of the model's behavior, let's use this toy example of asking what is the Eiffel tower height. Most language models will simply answer according to their training data.

This model, however, bases its answer solely on the context you provide. Let's use the following [Wikipedia paragraph](https://en.wikipedia.org/wiki/Eiffel_Tower#:~:text=The%20Eiffel%20Tower%20(%2F%CB%88a%C9%AA,from%20the%20Champ%20de%20Mars) as context, with small modifications:

In [None]:
# Actual paragraph
context = "The tower is 330 metres (1,083 ft) tall,[6] about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest human-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure in the world to surpass both the 200-metre and 300-metre mark in height. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

# The paragraph with manual changes of the height
false_context = "The tower is 3 metres (10 ft) tall,[6] about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest human-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure in the world to surpass both the 200-metre and 300-metre mark in height. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

# The paragraph with the height omitted
partial_context = "Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest human-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure in the world to surpass both the 200-metre and 300-metre mark in height. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

In [None]:
# True context
payload = {
    "context": context,
    "question": "What is the height of the Eiffel tower?"
}
query_endpoint_with_json_payload(url, payload).json()

In [None]:
# True context
payload = {
    "context": false_context,
    "question": "What is the height of the Eiffel tower?"
}
query_endpoint_with_json_payload(url, payload).json()

In [None]:
# True context
payload = {
    "context": partial_context,
    "question": "What is the height of the Eiffel tower?"
}
query_endpoint_with_json_payload(url, payload).json()

### Ask about financial reports

The document context should be **no more than 10,000 characters**, and the question can be up to 160 characters.

Imagine you are performing research and rely on financial reports to base your findings. Let's take the following part from [JPMorgan Chase & Co. 2021 annual report](https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/annualreport-2021.pdf):

In [None]:
financial_context = """In 2020 and 2021, enormous QE — approximately $4.4 trillion, or 18%, of 2021 gross domestic product (GDP) — and enormous fiscal stimulus (which has been and always will be inflationary) — approximately $5 trillion, or 21%, of 2021 GDP — stabilized markets and allowed companies to raise enormous amounts of capital. In addition, this infusion of capital saved many small businesses and put more than $2.5 trillion in the hands of consumers and almost $1 trillion into state and local coffers. These actions led to a rapid decline in unemployment, dropping from 15% to under 4% in 20 months — the magnitude and speed of which were both unprecedented. Additionally, the economy grew 7% in 2021 despite the arrival of the Delta and Omicron variants and the global supply chain shortages, which were largely fueled by the dramatic upswing in consumer spending and the shift in that spend from services to goods. Fortunately, during these two years, vaccines for COVID-19 were also rapidly developed and distributed.
In today's economy, the consumer is in excellent financial shape (on average), with leverage among the lowest on record, excellent mortgage underwriting (even though we've had home price appreciation), plentiful jobs with wage increases and more than $2 trillion in excess savings, mostly due to government stimulus. Most consumers and companies (and states) are still flush with the money generated in 2020 and 2021, with consumer spending over the last several months 12% above pre-COVID-19 levels. (But we must recognize that the account balances in lower-income households, smaller to begin with, are going down faster and that income for those households is not keeping pace with rising inflation.)
Today's economic landscape is completely different from the 2008 financial crisis when the consumer was extraordinarily overleveraged, as was the financial system as a whole — from banks and investment banks to shadow banks, hedge funds, private equity, Fannie Mae and many other entities. In addition, home price appreciation, fed by bad underwriting and leverage in the mortgage system, led to excessive speculation, which was missed by virtually everyone — eventually leading to nearly $1 trillion in actual losses.
"""

Rather than reading the entire report, just ask what you want to know. The model will answer based on the provided report:

In [None]:
payload = {
    "context": financial_context,
    "question": "Did the economy shrink after the Omicron variant arrived?"
}
query_endpoint_with_json_payload(url, payload).json()

In addition, you can ask more complex questions, where the answer requires deductions rather than just extracting the correct sentence from the document context. This will result in abstractive, rather than extractive, answers that draw on several different parts of the document. For example, look at the following question:

In [None]:
payload = {
    "context": financial_context,
    "question":  "Did COVID-19 eventually help the economy?"
}
query_endpoint_with_json_payload(url, payload).json()

We now present the model with the following question. You may be confused to answer something based on the last paragraph without delving into the text. However, if you read the provided document context properly, you will discover that the answer does not appear there. The model handles this as expected:

In [None]:
payload = {
    "context": financial_context,
    "question":  "How did COVID-19 affect the financial crisis of 2008?"
}
query_endpoint_with_json_payload(url, payload).json()