[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/aws/sagemaker/sagemaker-pinecone-rag.ipynb)

# Retrieval-Augmented Generation: Question Answering using Llama-2, Pinecone & Custom Dataset


In this notebook we will demonstrate how to use [**Llama-2-7b**](https://ai.meta.com/llama/) to answer questions using a library of documents as a reference, by using document embeddings and retrieval. The embeddings are generated by the **MiniLM** embedding model and retrieved from the [**Pinecone Vector Database**](https://www.pinecone.io/).
Access to a Pinecone environment is a prerequisite to run this notebook fully. 

**You can start by using the [Free Tier on Pinecone](https://www.pinecone.io/pricing/). This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom question answering application.**

To perform inference on the [Llama models](https://ai.meta.com/llama/), you need to pass `custom_attributes='accept_eula=true'` as part of the request header. Doing so states that you have read and accepted the model's end-user license agreement (EULA), which can be found in the model card description or on this [webpage](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). The inference cells in this notebook pass `custom_attributes='accept_eula=true'`; with `custom_attributes='accept_eula=false'`, all inference requests fail.

Note: The `custom_attributes` used to pass the EULA acceptance are key/value pairs. The key and value are separated by `=` and pairs are separated by `;`. If the same key is passed more than once, the last value is kept and passed to the script handler (where, in this case, it is used for conditional logic). For example, if `accept_eula=false; accept_eula=true` is passed to the server, then `accept_eula=true` is kept and passed to the script handler.
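
The parsing rule in the note above can be illustrated with a small sketch. This is not the actual script handler used by the endpoint; the function name `parse_custom_attributes` is hypothetical and only mirrors the rule as described (pairs split on `;`, key and value split on `=`, last value wins).

```python
def parse_custom_attributes(raw: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        # later assignments overwrite earlier ones, so the last value is kept
        attributes[key.strip()] = value.strip()
    return attributes


parsed = parse_custom_attributes("accept_eula=false; accept_eula=true")
print(parsed)  # {'accept_eula': 'true'}
```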

## Step 1. Deploy Llama-2 7 Billion Chat Model in SageMaker JumpStart

In [2]:
!pip install -qU \
    sagemaker \
    pinecone-client==2.2.1 \
    ipywidgets==7.0.0

To begin, we will initialize all of the SageMaker session variables we'll need to use throughout the walkthrough.

In [None]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")



We will use a `ml.g5.4xlarge` instance to deploy our Llama-2-7 billion model. We can find pricing for all instances [here](https://aws.amazon.com/sagemaker/pricing/).

In [4]:
predictor = my_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge", endpoint_name="llama-2-generator")

----------------!

## Step 2. Ask a question to LLM without providing the context

To illustrate why we need a retrieval-augmented generation (RAG) approach to question answering, let's ask the model a question directly and see how it responds.

In [5]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

In [6]:
# https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/

prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

ANSWER:

"""


payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": prompt},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
out[0]['generation']['content']

' Based on the context provided, Managed Spot Training is a feature in SageMaker that allows you to train machine learning models on the spot instances in Amazon EC2.\n\nAccording to the context, you can use Managed Spot Training with the following instances:\n\n* m5 instances\n'

As you can see, the generated answer is wrong: without any real context, the model invents a restriction to m5 instances.

## Step 3. Improve the answer to the same question using **prompt engineering** with insightful context


To answer the question better, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [7]:
context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""

In [8]:
prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}


ANSWER:
"""

# the template has no {question} placeholder; the question is sent as the user message
text_input = prompt_template.replace("{context}", context)

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": question},
        ]   
      ],
   "parameters":{"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
}

out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]:  Based on the context provided, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answer is:

All instances.


Let's see if our LLM is capable of following our instructions...

In [9]:
unanswerable_question = "What color is my desk?"

# the template has no {question} placeholder; the question is sent as the user message
text_input = prompt_template.replace("{context}", context)

payload = {
    "inputs":  
      [
        [
         {"role": "system", "content": text_input},
         {"role": "user", "content": unanswerable_question},
        ]   
      ],
   "parameters":{"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}


out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {unanswerable_question}\n[Output]: {generated_text}")

[Input]: What color is my desk?
[Output]:  I'm afraid I can't answer your question about the color of your desk as it is not related to the context provided. Managed Spot Training is a feature of Amazon SageMaker that allows you to train machine learning models using spot instances, and it is not related to the color of a desk. Therefore, I don't know the answer to your question.


Looks great! The LLM is following instructions and we've also demonstrated how contexts can help our LLM answer questions accurately. However, we're unlikely to be inserting a context directly into a prompt like this unless we already know the answer — and if we already know the answer why would we be asking the question at all?

We need a way of extracting _relevant contexts_ from huge bases of information. For that we need **R**etrieval **A**ugmented **G**eneration (RAG).

## Step 4. Use RAG based approach to identify the correct documents, and use them along with prompt and question to query LLM


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following:

* Generate embeddings for each document in the knowledge library with the MiniLM embedding model.
* Identify the top K most relevant documents based on the user query.
    * For a query of your interest, generate the embedding of the query using the same embedding model.
    * Search for the indexes of the top K most relevant documents in the embedding space using Pinecone.
    * Use the indexes to retrieve the corresponding documents.
* Combine the retrieved documents with the prompt and question and send them to the LLM.
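
To make the "top K most relevant documents" step concrete, here is a minimal in-memory sketch of a cosine-similarity search using NumPy. It is only an illustration of the idea, not how Pinecone implements search; the function name `top_k_cosine` and the toy 2-dimensional vectors are assumptions made for the example.

```python
import numpy as np


def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return the indices and scores of the k documents most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # normalise so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    # sort by descending score and keep the first k indices
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()


# toy 2-dimensional "embeddings" standing in for real 384-dimensional ones
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_cosine([1.0, 0.1], toy_docs, k=2)
print(indices)  # [0, 2]
```

A vector database plays the same role at scale: it stores the document vectors and returns the nearest ones for a query vector without scanning every document in Python.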



Note: The retrieved text should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt, which has a maximum sequence length of 1024 tokens.

### 4.1 Deploying the model endpoint for Sentence Transformer embedding model

In [11]:
hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",  # model_id from hf.co/models
    "HF_TASK": "feature-extraction",
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",  # pytorch version used
    py_version="py36",  # python version of the DLC
)

Then we deploy the model as we did earlier for our generative LLM:

In [12]:
encoder = huggingface_model.deploy(
    initial_instance_count=1, instance_type="ml.t2.large", endpoint_name="minilm-embedding"
)

-----!

We can then create the embeddings like so:

In [13]:
out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

We will see that we have two outputs (one for each of our input sentences):

In [14]:
len(out)

2

But if we look at each of these outputs we see something strange...

In [15]:
len(out[0]), len(out[1])

(8, 8)

We would expect the embeddings to be of dimensionality *384*, but we're seeing two lists containing _eight_ items each? What is happening here?

When we output feature embeddings from the MiniLM model we're actually outputting a single 384-dimensional vector for every _token_ contained in the inputs we provided. Our second text `"some more text goes here too"` contains _eight_ tokens, and so this is where the value `8` is coming from.

So, if we were to take a look at one of these vectors we should find the dimensionality of `384`:

In [16]:
len(out[0][0])

384

Perfect! There's just one problem: how do we transform these eight vector embeddings into a single _sentence embedding_? For this, we simply take the mean value across each vector dimension, like so:

In [17]:
import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape

(2, 384)
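
The pooling step can be checked on a tiny synthetic example. The sketch below uses made-up token embeddings (2 texts, 4 token positions, 3 dimensions) and also shows a masked variant that ignores padding positions. The masked variant is an assumption-based alternative, not what this notebook does: it requires knowing which positions are padding, which the endpoint output above does not provide. It may matter if shorter inputs are padded to the batch length, as the equal lengths of 8 above suggest.

```python
import numpy as np

# hypothetical token embeddings: 2 texts, 4 token positions, 3 dimensions
token_embeddings = np.array([
    [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[2.0, 0.0, 2.0], [2.0, 0.0, 2.0], [2.0, 4.0, 2.0], [2.0, 4.0, 2.0]],
])
# 1 marks a real token, 0 marks a padding position (assumed to be known)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])

# plain mean over the token axis, as in the notebook
plain_mean = token_embeddings.mean(axis=1)

# masked mean: sum only the real tokens and divide by their count
masked_sum = (token_embeddings * mask[:, :, None]).sum(axis=1)
masked_mean = masked_sum / mask.sum(axis=1, keepdims=True)

print(plain_mean.shape)   # (2, 3)
print(plain_mean[0])      # [1. 1. 1.]
print(masked_mean[0])     # [2. 2. 2.]
```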

Now we have two 384-dimensional vector embeddings, one for each of our input texts. To make our lives easier later, we will wrap this encoding process into a single function:

In [18]:
from typing import List


def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({"inputs": docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

### 4.2. Generate embeddings for each document in the knowledge library with the Sentence Transformer model

For the purpose of the demo we will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data are formatted as a CSV file with two columns, Question and Answer. We use **only** the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

**Each row in the CSV dataset corresponds to a textual document.
We will iterate over the documents to get an embedding vector for each one via the MiniLM embedding model.
For your own purposes, you can replace the example dataset with your own to build a custom question answering application.**


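First, we download the dataset from our S3 bucket to the local machine. In this run the file is a PDF (`appsec_review_test.pdf`) rather than the FAQ CSV described above.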

In [31]:
s3_path = "s3://appsectestenv/appsec_review_test.pdf"

In [32]:
# Downloading the Database
!aws s3 cp $s3_path appsec_review_test.pdf

download: s3://appsectestenv/appsec_review_test.pdf to ./appsec_review_test.pdf


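Open the dataset with Pandas (these cells are commented out in this run, because the downloaded file is a PDF and cannot be read as a CSV):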

In [33]:
# import pandas as pd

# df_knowledge = pd.read_csv("appsec_review_test.pdf", header=None, names=["Question", "Answer"])
# df_knowledge.head()

Drop the `Question` column since it is not used in this notebook.

In [34]:
# df_knowledge.drop(["Question"], axis=1, inplace=True)
# df_knowledge.head()

Next we can initialize our connection to **Pinecone**. To do this we need a [free API key](https://app.pinecone.io).

In [40]:
!pip install pinecone-client

In [41]:
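import os
import pinecone

# Read credentials from the environment instead of hard-coding them in the notebook.
# The variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT are an assumed convention.
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ.get("PINECONE_ENVIRONMENT", "gcp-starter"),
)
index = pinecone.Index('llama')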

List all indexes associated with your key. This should be empty on the first run:

In [43]:
print(pinecone.list_indexes())


['llama']


Now we create a new index called `llama`, deleting any existing index with that name first. It's important that we align the index `dimension` and `metric` parameters with those required by the MiniLM model.

In [45]:
import time

index_name = "llama"

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(name=index_name, dimension=embeddings.shape[1], metric="cosine")
# wait for index to finish initialization
while not pinecone.describe_index(index_name).status["ready"]:
    time.sleep(1)

In [46]:
pinecone.list_indexes()

['llama']

Next we prepare the data for upserting, which we will do in batches of `128`. We start by extracting the text of the PDF stored in S3.

In [58]:
# `io` is part of the Python standard library, so it does not need to be installed with pip

In [79]:
import PyPDF2
import boto3
import io

def extract_text_from_s3_pdf(pdf_s3_path):
    text = ""

    # Extract the S3 bucket and key from the S3 path
    s3_bucket, s3_key = pdf_s3_path.replace("s3://", "").split("/", 1)

    # Initialize the S3 client without providing access keys
    s3 = boto3.client('s3')

    # Download the PDF file from S3
    pdf_file_bytes = s3.get_object(Bucket=s3_bucket, Key=s3_key)['Body'].read()

    # Use PdfReader to read the PDF
    pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file_bytes))

    # Extract text from every page of the PDF
    for page in pdf_reader.pages:
        text += page.extract_text()

    return text

# Specify the S3 path to the PDF file
pdf_s3_path = 's3://appsectestenv/appsec_review_test.pdf'

# Call the function to extract text from the PDF
extracted_text = extract_text_from_s3_pdf(pdf_s3_path)

print(extracted_text)


#1 WHEN DO I NEED AN APPSEC REVIEW?
A security review is required any time you're making a change (or releasing a new system or service) that could impact the
security of customers, AWS, or Amazon. More specifically...
●
●All launches (alpha, beta, gamma, demo, GA, public, or private) require a security review.
Any security-impacting change to a production environment, or a test environment that uses or has access to production
data such as customer content/workloads needs a security review. If you don't know what a "security-impacting change"
means, ask yourself "if this change is implemented, is there a likelihood that we negatively impact customers from a
security perspective?"
If a service has been AppSec approved, expanding to other regions within the same  partition  doesn't need a separate review.
For example, expanding from IAD to PDX with same infrastructure, code, permission etc doesn't require another review.
However, expanding from IAD (AWS - Classic Partition) to BJS (AWS-

In [85]:
!pip install gensim
!pip install nltk

In [97]:
# check number of records in the index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
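
The stats above report a `total_vector_count` of 0, which means no vectors have been upserted into the index yet. The following is a minimal sketch of how the extracted text could be split into chunks, embedded, and upserted in batches. The helper names (`chunk_text`, `batched`, `upsert_chunks`), the character-window chunking, and the chunk sizes are all assumptions made for illustration; the embedding and upsert steps are passed in as functions so the sketch runs without an endpoint or an index.

```python
from typing import Callable, Iterator, List, Tuple


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (a simple, assumed strategy)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [window for window in windows if window.strip()]


def batched(items: List[str], batch_size: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_offset, batch) pairs of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def upsert_chunks(chunks: List[str], embed_fn: Callable, upsert_fn: Callable,
                  batch_size: int = 128) -> int:
    """Embed chunks batch by batch and pass (id, vector, metadata) tuples to upsert_fn."""
    total = 0
    for start, batch in batched(chunks, batch_size):
        vectors = embed_fn(batch)
        records = [
            (str(start + offset), vector, {"text": text})
            for offset, (vector, text) in enumerate(zip(vectors, batch))
        ]
        upsert_fn(records)
        total += len(records)
    return total


# stand-ins so the sketch runs without any endpoint or index
fake_embed = lambda docs: [[float(len(d))] for d in docs]
stored = []
count = upsert_chunks(chunk_text("a" * 1000), fake_embed, stored.extend, batch_size=2)
print(count)  # 3
```

In this notebook, one would presumably pass `embed_docs` as `embed_fn`, something like `lambda records: index.upsert(vectors=records)` as `upsert_fn`, and `chunk_text(extracted_text)` as the chunks. The `text` metadata key matches what the query code later reads via `match.metadata["text"]`.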

### 4.3 Combine the retrieved documents, prompt, and question to query the LLM

Now we're ready to begin querying our LLM with a **R**etrieval **A**ugmented **G**eneration (RAG) pipeline. Let's see how this will work step-by-step first.

First we create our _query embedding_ and use it to query Pinecone:

In [98]:
# extract embeddings for the questions
query_vec = embed_docs([question])[0]  # embed_docs expects a list of strings

# query pinecone
res = index.query(query_vec, top_k=1, include_metadata=True)

# show the results
res

{'matches': [], 'namespace': ''}

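In this run the index is empty, so the query returns no matches. Once the index is populated, the query returns the most relevant contexts, which we can use to construct a single `context` to feed into our LLM prompt.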

In [99]:
contexts = [match.metadata["text"] for match in res.matches]

In [100]:
max_section_len = 1000
separator = "\n"


def construct_context(contexts: List[str]) -> str:
    chosen_sections = []
    chosen_sections_len = 0

    for text in contexts:
        text = text.strip()
        # Add contexts until we run out of space.
        chosen_sections_len += len(text) + 2
        if chosen_sections_len > max_section_len:
            break
        chosen_sections.append(text)
    concatenated_doc = separator.join(chosen_sections)
    print(
        f"With maximum sequence length {max_section_len}, selected top {len(chosen_sections)} document sections: \n{concatenated_doc}"
    )
    return concatenated_doc

In [101]:
context_str = construct_context(contexts=contexts)

With maximum sequence length 1000, selected top 0 document sections: 



We would then feed this `context_str` into our Llama-2 prompt:

In [102]:
def create_payload(question, context_str) -> dict:
    prompt_template = """Answer the following QUESTION based on the CONTEXT
    given. If you do not know the answer and the CONTEXT doesn't
    contain the answer truthfully say "I don't know".

    CONTEXT:
    {context}


    ANSWER:
    """

    # the template has no {question} placeholder; the question is sent as the user message
    text_input = prompt_template.replace("{context}", context_str)

    payload = {
        "inputs":  
          [
            [
             {"role": "system", "content": text_input},
             {"role": "user", "content": question},
            ]   
          ],
       "parameters":{"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6, "return_full_text": False}
    }
    return payload

In [104]:
payload = create_payload(question, context_str)
out = predictor.predict(payload, custom_attributes='accept_eula=true')
generated_text = out[0]['generation']['content']
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]:  Based on the context provided, Managed Spot Training is a feature of Amazon SageMaker that allows you to train machine learning models using spare AWS instances.

According to the Amazon SageMaker documentation, you can use Managed Spot Training with the following instances:

* m5.xlarge
* m5.2xlarge
* m5.4xlarge
* m5.8xlarge
* m5.16xlarge

These instances are designed for large-scale machine learning workloads and offer a balance of compute, memory, and storage resources.

Therefore, the answer to your question is:

You can use m5 instances with Managed Spot Training in Amazon SageMaker.


Let's place all of this logic into a single RAG query function:

In [105]:
def rag_query(question: str) -> str:
    # create query vec
    query_vec = embed_docs([question])[0]  # embed_docs expects a list of strings
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata["text"] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    payload = create_payload(question, context_str)
    # make prediction
    out = predictor.predict(payload, custom_attributes='accept_eula=true')
    return out[0]["generation"]["content"]
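
Note that `rag_query` calls the LLM even when the Pinecone query returns no matches, in which case the context string is empty and the model can only answer from its own parameters. One possible safeguard is to skip generation when nothing was retrieved. The sketch below is an assumption, not part of the original pipeline; `answer_with_guard` and `echo_generator` are hypothetical names, and the generator is a stand-in so the example runs without a deployed endpoint.

```python
from typing import Callable, List


def answer_with_guard(
    question: str,
    contexts: List[str],
    generate: Callable[[str, str], str],
    fallback: str = "I don't know.",
) -> str:
    """Call the generator only when at least one non-empty context was retrieved."""
    usable = [c.strip() for c in contexts if c and c.strip()]
    if not usable:
        # nothing was retrieved, so there is nothing to ground an answer in
        return fallback
    return generate(question, "\n".join(usable))


# stand-in generator so the sketch runs without a deployed endpoint
def echo_generator(question: str, context: str) -> str:
    return f"Q: {question} | context chars: {len(context)}"


print(answer_with_guard("What color is my desk?", [], echo_generator))
print(answer_with_guard("When do I need a review?", ["All launches require a review."], echo_generator))
```

The trade-off is that the pipeline abstains whenever retrieval fails, even for questions the model might have answered correctly on its own.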

We can now ask the question:

In [106]:
rag_query("When do I need a security review?")

With maximum sequence length 1000, selected top 0 document sections: 



" Based on the context provided, a security review is needed when:\n\n* You are implementing a new system or application that will handle sensitive data.\n* You are making significant changes to an existing system or application that handles sensitive data.\n* You are introducing new users, devices, or networks into your system that will have access to sensitive data.\n* You are experiencing security incidents, such as unauthorized access or data breaches, and need to assess and improve your security controls.\n\nI don't know of any specific scenario where a security review is not needed. It is important to regularly conduct security reviews to ensure that your systems and applications are secure and protecting sensitive data."

We can also ask questions about things that are out of context (not contained within our dataset). From this we expect the model to *not* hallucinate and honestly tell us that it does not know the answer:

In [108]:
rag_query("What are baseline security controls?")

With maximum sequence length 1000, selected top 0 document sections: 



" Sure, I'd be happy to help! Based on the context you provided, baseline security controls refer to a set of security measures that are implemented by an organization to protect its information systems and data from unauthorized access, use, disclosure, disruption, modification, or destruction. These controls are considered to be the minimum requirements for securing an organization's information systems and data, and are often used as a starting point for more comprehensive security controls.\n\nSome common examples of baseline security controls include:\n\n1. Access control: Limiting access to sensitive information and systems to only those individuals who have a legitimate need to access them.\n2. Firewalls: Implementing firewalls to control incoming and outgoing network traffic and protect against unauthorized access to the organization's network.\n3. Encryption: Protecting sensitive information and data at rest and in transit using encryption.\n4. Intrusion detection and preventi

---