# Retrieval-Augmented Generation with Pinecone
## Question Answering based on Custom Dataset

This notebook shows how to use the **[BloomZ 3B](https://huggingface.co/bigscience/bloomz-3b)** and **[Flan T5 Large](https://huggingface.co/google/flan-t5-large)** models for question answering over a library of reference documents. Relevant documents are retrieved by embedding similarity, with the embeddings generated by the all-MiniLM-L6-v2 embedding model.
<br><br>
While the BloomZ 3B and Flan T5 Large models have acquired significant general knowledge during training, many applications also need them to draw on a large library of more specific information.


## Installing dependencies

In [9]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [1]:
!pip install transformers==4.30.2 accelerate==0.20.3 -qU
!pip install sentence-transformers==2.2.2 -qU
!pip install sentencepiece==0.1.99 -qU
!pip install pinecone-client==2.2.1 -qU
!pip install bitsandbytes==0.39.1 -qU
!pip install kaggle==1.5.15 -qU


### Step 1: Defining the LLMs

We will use the MODEL_CONFIG dictionary to define the two models and to store additional information about them later in the notebook.

In [2]:
MODEL_CONFIG = {
    "bigscience/bloomz-3b": {
        "prompt": """question: \"{question}"\\n\nContext: \"{context}"\\n\nAnswer:"""
    },
    "google/flan-t5-large": {
        "prompt": """Answer based on context:\n\n{context}\n\n{question}"""
    }
}

We can set a quantization configuration to load the large models with less GPU memory.
This requires the `bitsandbytes` library.

In [3]:
from torch import cuda, bfloat16
import transformers

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
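As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 3 billion is a nominal figure taken from the "3B" in the model name, and the estimate ignores activations and the small overhead of quantization constants, so the numbers are ballpark values only.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Assumption: roughly 3e9 parameters (nominal, from the "3B" in the model name).
NUM_PARAMS = 3e9
BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "nf4 (4-bit)": 4}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


estimates = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in BITS_PER_PARAM.items()
}
for name, gigabytes in estimates.items():
    print(f"{name:>12}: ~{gigabytes:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the 16-bit weights.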

Load the **BloomZ 3B** model from the Hugging Face `transformers` library.

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-3b"

# The tokenizer holds no model weights, so it needs no quantization or device arguments
MODEL_CONFIG[model_name]["tokenizer"] = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)
# Load the model weights in 4-bit and place them automatically on the available device(s)
MODEL_CONFIG[model_name]["model"] = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)



Load the **Flan T5 Large** model from the Hugging Face `transformers` library.

In [5]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-large"

# The tokenizer holds no model weights, so it needs no quantization or device arguments
MODEL_CONFIG[model_name]["tokenizer"] = T5Tokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)
# Load the model weights in 4-bit and place them automatically on the available device(s)
MODEL_CONFIG[model_name]["model"] = T5ForConditionalGeneration.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.



### Step 2. Ask the LLMs a question without providing context
To illustrate why a retrieval-augmented generation (RAG) approach is needed for question answering, let's ask the models a question directly and see how they respond.

In [6]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

In [7]:
def answer_based_on_a_question(model_name, question, prompt, tokenizer, model):
  print(f"\nModel name: \n{model_name}\n")
  # Fill in the question and leave the context empty
  prompt = prompt.replace("{question}", question).replace("{context}", "")
  inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
  # generate() is called with its default length limit, which is what
  # triggers the `max_length` warning in the output
  outputs = model.generate(inputs)
  print("Model output:")
  print(tokenizer.decode(outputs[0]))

In [8]:
for model in MODEL_CONFIG:
  answer_based_on_a_question(
      model,
      question,
      MODEL_CONFIG[model]["prompt"],
      MODEL_CONFIG[model]["tokenizer"],
      MODEL_CONFIG[model]["model"]
  )

Input length of input_ids is 27, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



Model name: 
bigscience/bloomz-3b

Model output:
question: "Which instances can I use with Managed Spot Training in SageMaker?"\n
Context: ""\n
Answer: Man

Model name: 
google/flan-t5-large

Model output:
<pad> SageMaker Online</s>


Neither model gives a useful answer: BloomZ stops after a single word, and Flan T5 returns an unrelated phrase.
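The warning in the output above reports that `max_length` is 20 while the prompt alone is 27 tokens long, which leaves almost no room for an answer. A minimal sketch of a helper that sets an explicit generation budget follows. `generate_answer` and its default of 64 new tokens are illustrative choices rather than part of the original notebook; `max_new_tokens` and `skip_special_tokens` are standard `transformers` arguments.

```python
def generate_answer(prompt, tokenizer, model, device="cpu", max_new_tokens=64):
    """Generate a completion with an explicit budget of new tokens.

    `tokenizer` and `model` are expected to follow the Hugging Face
    `transformers` interface used elsewhere in this notebook.
    """
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
    # max_new_tokens counts only the generated tokens, so a long prompt
    # does not eat into the budget the way max_length does.
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # skip_special_tokens drops markers such as <pad> and </s>.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call such as `generate_answer(prompt, MODEL_CONFIG[name]["tokenizer"], MODEL_CONFIG[name]["model"], device)` could stand in for the encode, generate, and decode lines of the function above. Longer outputs are not guaranteed to be more accurate; without context the models still have to guess.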

### Step 3. Improve the answer to the same question using prompt engineering with insightful context
To get a better answer, we supply extra contextual information, combine it with the prompt, and send it to the model together with the question. Below is an example.



In [10]:
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker.
Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

In [11]:
def answer_based_on_context_and_question(model_name, context, question, prompt, tokenizer, model):
  print(f"\nModel name: \n{model_name}\n")
  # Fill both the question and the supplied context into the prompt template
  prompt = prompt.replace("{question}", question).replace("{context}", context)
  inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs)
  print("Model output:")
  print(tokenizer.decode(outputs[0]))
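To make the prompt assembly concrete, the sketch below fills a template in isolation, without loading any model. `build_prompt` is a hypothetical helper that mirrors the `str.replace` calls in the function above. One reason to prefer `replace` over `str.format` here is that retrieved text may itself contain curly braces, which `format` would try to interpret.

```python
# Same template text as the Flan T5 entry in MODEL_CONFIG.
FLAN_T5_TEMPLATE = "Answer based on context:\n\n{context}\n\n{question}"


def build_prompt(template: str, question: str, context: str = "") -> str:
    """Substitute the question and context into a prompt template."""
    return template.replace("{question}", question).replace("{context}", context)


prompt = build_prompt(
    FLAN_T5_TEMPLATE,
    question="Which instances can I use with Managed Spot Training in SageMaker?",
    context="Managed Spot Training can be used with all instances supported in Amazon SageMaker.",
)
print(prompt)
```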

In [12]:
for model in MODEL_CONFIG:
  answer_based_on_context_and_question(
      model,
      context,
      question,
      MODEL_CONFIG[model]["prompt"],
      MODEL_CONFIG[model]["tokenizer"],
      MODEL_CONFIG[model]["model"]
  )

Input length of input_ids is 62, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



Model name: 
bigscience/bloomz-3b

Model output:
question: "Which instances can I use with Managed Spot Training in SageMaker?"\n
Context: "Managed Spot Training can be used with all instances supported in Amazon SageMaker.
Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."\n
Answer: all

Model name: 
google/flan-t5-large

Model output:
<pad> all instances supported in Amazon SageMaker</s>


We can observe that the models generate more accurate answers when provided with some context.
<br>
This can be achieved by retrieving the context from a vector database, as demonstrated in the next step.

### Step 4. Use a RAG-based approach to identify the correct documents, and use them along with the prompt and question to query the LLM

We will use document embeddings to fetch the most relevant documents from our knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following.

- Generate embeddings for each document in the knowledge library with the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model.
- Identify the top K most relevant documents for a user query.
    - Generate the embedding of the query using the same embedding model.
    - Search the Pinecone index (the vector database) for the documents closest to the query in the embedding space.
- Combine the retrieved documents with the prompt and question, and send them to the LLM.
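The retrieval step in the list above can be sketched without any external service. The snippet below is a toy in-memory stand-in for the vector database: it ranks stored vectors by cosine similarity to a query vector and returns the top K. The three-dimensional vectors and document IDs are made up for illustration; Pinecone performs the equivalent search over the real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": document ID -> embedding (made-up values).
toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}

results = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Here `doc-a` ranks first and `doc-c` second, because the query points almost entirely along the first axis.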

#### 4.1 Preparing the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model

To create our embeddings we will use the `MiniLM-L6` sentence transformer model. This is a very efficient semantic similarity embedding model from the `sentence-transformers` library. We initialize it like so:

In [13]:
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [14]:
query = "An example sentence to obtain the embedding dimension."

xq = model.encode(query)
xq.shape

(384,)

Encoding this single sentence leaves us with a `384` dimensional sentence embedding.
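The module list printed above shows that the model turns per-token vectors into one sentence vector by mean pooling followed by normalization. A minimal sketch of those two steps, using made-up token vectors and an attention mask that marks padding, might look like this. It illustrates the idea only and is not the library's implementation.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Three made-up 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 3.0], [5.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = l2_normalize(mean_pool(tokens, mask))
print(sentence_vector)
```

Because the result has unit length, the cosine similarity between two such vectors reduces to their dot product.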

In the next step we upsert our data to Pinecone in batches, so `vectors` will be a list of `(id, embedding, metadata)` tuples.
For a single record, preparing this for `upsert` looks like this:

In [15]:
_id = '0'
metadata = {'text': query}

# Convert the NumPy embedding to a plain list, as is done for the batch upsert later on
vectors = [(_id, xq.tolist(), metadata)]

#### 4.2. Generate embeddings for each document in the knowledge library with the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model.

For this demo we use the Amazon SageMaker FAQs as the knowledge library. Once loaded, the data has three columns: `question`, `answer`, and `found_duplicate`. We use only the `answer` column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

Let's prepare the dataset for upserting.

In [16]:
try:
    import kaggle
except OSError as e:
    print(e)

Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


Find your [Kaggle credentials](https://www.kaggle.com/settings) and replace them in the following cell.

In [17]:
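import json
import os

KAGGLE_USERNAME = "YOUR_KAGGLE_USERNAME"
KAGGLE_KEY = "YOUR_KAGGLE_KEY"

# Make sure the credentials directory exists before writing the file
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'w') as fp:
    json.dump({"username": KAGGLE_USERNAME, "key": KAGGLE_KEY}, fp)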
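The error message above also mentions an environment method. As an alternative sketch, the Kaggle client can read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which avoids writing the key to disk. The placeholder values below are assumptions to be replaced with your own credentials, and the variables need to be set before `kaggle` is imported.

```python
import os

# Placeholders (assumptions): substitute your own Kaggle credentials.
# setdefault keeps any values that are already present in the environment.
os.environ.setdefault("KAGGLE_USERNAME", "YOUR_KAGGLE_USERNAME")
os.environ.setdefault("KAGGLE_KEY", "YOUR_KAGGLE_KEY")

# The variables must be in place before the client is imported, e.g.:
# import kaggle
```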

In [18]:
!kaggle datasets download -d abbbhishekkk/faq-datasets-for-chatbot-training

Downloading faq-datasets-for-chatbot-training.zip to /content
100% 264k/264k [00:00<00:00, 440kB/s]


In [19]:
import zipfile

with zipfile.ZipFile("/content/faq-datasets-for-chatbot-training.zip", 'r') as zip_ref:
    zip_ref.extractall('./')

In [20]:
import pandas as pd

df_knowledge = pd.read_json("/content/Amazon_sagemaker_Faq.txt")

In [21]:
df_knowledge.head()

|   | question | answer | found_duplicate |
|---|---|---|---|
| 0 | What is Amazon SageMaker? | Amazon SageMaker is a fully managed service th... | False |
| 1 | In which regions is Amazon SageMaker available? | For a list of the supported Amazon SageMaker A... | False |
| 2 | What is the service availability of Amazon Sag... | Amazon SageMaker is designed for high availabi... | False |
| 3 | What security measures does Amazon SageMaker h... | Amazon SageMaker ensures that ML model artifac... | False |
| 4 | How does Amazon SageMaker secure my code? | Amazon SageMaker stores code in ML storage vol... | False |


In [22]:
df_knowledge.drop(["question", "found_duplicate"], axis=1, inplace=True)
df_knowledge.head()

|   | answer |
|---|---|
| 0 | Amazon SageMaker is a fully managed service th... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker ensures that ML model artifac... |
| 4 | Amazon SageMaker stores code in ML storage vol... |


In [23]:
df_knowledge.shape

(67, 1)

Next we can initialize our connection to **Pinecone**. To do this we need a [free API key](https://app.pinecone.io).

In [24]:
import pinecone
import os

# Load Pinecone API key
api_key = os.getenv('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY'
# Set Pinecone environment. Find next to API key in console
env = os.getenv('PINECONE_ENVIRONMENT') or 'YOUR_PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=api_key,
    environment=env
)

List the indexes associated with your key. The list should be empty on the first run.


In [25]:
pinecone.list_indexes()

[]

Now we create a new index called `retrieval-augmentation-aws`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [26]:
index_name = 'retrieval-augmentation-aws'

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

In [27]:
pinecone.create_index(
    name=index_name,
    dimension=model.get_sentence_embedding_dimension(),
    metric='cosine'
)

In [28]:
index = pinecone.Index(index_name)

Now we upsert the data in batches of `128`.

In [29]:
from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

answers = df_knowledge[:vector_limit]

for i in tqdm(range(0, len(answers), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(answers))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # take the batch as a plain list so it is indexed by position, not by DataFrame label
    texts = answers["answer"][i:i_end].tolist()
    # create metadata batch
    metadatas = [{'text': text} for text in texts]
    # create embeddings
    xc = model.encode(texts).tolist()
    # create records list for upsert
    records = list(zip(ids, xc, metadatas))
    # upsert to Pinecone
    index.upsert(vectors=records)

  0%|          | 0/1 [00:00<?, ?it/s]
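The batching logic in the loop above can be checked on its own. The sketch below splits a list of texts into batches and builds `(id, embedding, metadata)` records for each one. `batch_ranges` is a hypothetical helper, and `fake_embed` is a stand-in for `model.encode`, so the example runs without the embedding model or a Pinecone index.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def fake_embed(texts):
    """Stand-in for model.encode: one tiny made-up vector per text."""
    return [[float(len(text)), 0.0] for text in texts]


texts = ["first answer", "second answer", "third answer", "fourth", "fifth"]

batches = []
for start, end in batch_ranges(len(texts), batch_size=2):
    chunk = texts[start:end]
    ids = [str(i) for i in range(start, end)]
    metadatas = [{"text": text} for text in chunk]
    # Each record is an (id, embedding, metadata) tuple
    records = list(zip(ids, fake_embed(chunk), metadatas))
    batches.append(records)

print([len(batch) for batch in batches])
```

With the 67 answers used here and a batch size of 128, `batch_ranges(67, 128)` yields a single batch, which matches the total of 1 shown in the progress bar above.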

In [32]:
# check number of records in the index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00067,
 'namespaces': {'': {'vector_count': 67}},
 'total_vector_count': 67}

#### 4.3 Retrieve the most relevant documents
Given the vector embedding of a query, we will query the Pinecone index to get the most relevant documents.



In [34]:
question

'Which instances can I use with Managed Spot Training in SageMaker?'

In [35]:
# extract embeddings for the questions
query_vector = model.encode(question).tolist()

# query pinecone
query_result = index.query(query_vector, top_k=5)

# show the results
print("\n\n\n Original question : " + str(question))
print("\n Most similar answers based on pinecone vector search: \n")

ids = [match.id for match in query_result.matches]
scores = [match.score for match in query_result.matches]
df_result = pd.DataFrame(
    {
        "id": ids,
        "answer": [
            df_knowledge["answer"][int(_id)] for _id in ids
        ],
        "score": scores,
    }
)
df_result




 Original question : Which instances can I use with Managed Spot Training in SageMaker?

 Most similar answers based on pinecone vector search: 



|   | id | answer | score |
|---|---|---|---|
| 0 | 28 | Managed Spot Training can be used with all ins... | 0.8974 |
| 1 | 22 | Managed Spot Training with Amazon SageMaker le... | 0.81478 |
| 2 | 29 | Managed Spot Training is supported on all AWS ... | 0.79689 |
| 3 | 25 | Managed Spot Training uses Amazon EC2 Spot ins... | 0.737362 |
| 4 | 23 | You enable the Managed Spot Training option wh... | 0.732323 |


#### 4.4 Combine the retrieved documents, prompt, and question to query the LLM

Now we construct the context from the documents most similar to our question. Documents are added in order of similarity until their combined length would exceed `MAX_SECTION_LEN` characters, so increasing that value lets more documents into the context.

In [36]:
MAX_SECTION_LEN = 200
SEPARATOR = "\n"


def construct_context(df_result, df_knowledge) -> str:
    """Concatenate the top-ranked documents until the character budget is reached.

    `df_knowledge` is the Series of answer texts, indexed by document ID.
    """
    chosen_sections = []
    chosen_sections_len = 0

    for doc_id in df_result["id"]:
        # Add contexts until we run out of space.
        document_section = df_knowledge.loc[int(doc_id)]
        chosen_sections_len += len(document_section) + 2
        if chosen_sections_len > MAX_SECTION_LEN:
            break

        chosen_sections.append(SEPARATOR + document_section)
    concatenated_doc = "".join(chosen_sections)
    print(
        f"With maximum sequence length {MAX_SECTION_LEN}, selected top {len(chosen_sections)} document sections: \n{concatenated_doc}"
    )

    return concatenated_doc

In [37]:
enriched_context = construct_context(df_result, df_knowledge["answer"])

With maximum sequence length 200, selected top 1 document sections: 

Managed Spot Training can be used with all instances supported in Amazon SageMaker.
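To see how the character budget affects what is kept, the sketch below applies the same selection rule to a few made-up documents. `select_sections` is a simplified, pandas-free version of `construct_context` written for illustration, and the document texts are invented.

```python
def select_sections(documents, max_len, separator="\n"):
    """Keep documents, in ranked order, until the character budget is exceeded."""
    chosen = []
    used = 0
    for doc in documents:
        used += len(doc) + 2  # same per-document overhead as construct_context
        if used > max_len:
            break
        chosen.append(separator + doc)
    return "".join(chosen)


# Made-up texts of 50, 60, and 70 characters, best match first.
ranked_docs = ["a" * 50, "b" * 60, "c" * 70]

small = select_sections(ranked_docs, max_len=100)
large = select_sections(ranked_docs, max_len=200)
print(small.count("\n"), large.count("\n"))
```

With a budget of 100 only the first document fits, while a budget of 200 admits all three.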


In [38]:
for model in MODEL_CONFIG:
  answer_based_on_context_and_question(
      model,
      enriched_context,
      question,
      MODEL_CONFIG[model]["prompt"],
      MODEL_CONFIG[model]["tokenizer"],
      MODEL_CONFIG[model]["model"]
  )

Input length of input_ids is 44, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



Model name: 
bigscience/bloomz-3b

Model output:
question: "Which instances can I use with Managed Spot Training in SageMaker?"\n
Context: "
Managed Spot Training can be used with all instances supported in Amazon SageMaker."\n
Answer: all

Model name: 
google/flan-t5-large

Model output:
<pad> all</s>


After retrieving the most similar document and building the context from it, both models answer "all", which agrees with the retrieved text.

In [39]:
pinecone.delete_index(index_name)