# Retrieval-Augmented Generation with Pinecone
## Question Answering based on Custom Dataset with Open-sourced [LangChain](https://python.langchain.com/docs/get_started/introduction.html) Library

This notebook will showcase the utilization of **[BloomZ 3B](https://huggingface.co/bigscience/bloomz-3b)** and **[Flan T5 Large](https://huggingface.co/google/flan-t5-large)** models for question-answering tasks using a library of documents as a reference, by using document embeddings and retrieval, with the embeddings generated from the all-MiniLM-L6-v2 embedding model.
<br><br>
While the BloomZ 3B and Flan T5 Large models have acquired significant general knowledge during training, there is often a requirement to process and utilize a vast library of more specific information.


## Installing dependencies

In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [2]:
!pip install transformers==4.30.2 accelerate==0.20.3 -qU
!pip install sentence-transformers==2.2.2 -qU
!pip install sentencepiece==0.1.99 -qU
!pip install bitsandbytes==0.39.1 -qU
!pip install pinecone-client==2.2.1 -qU
!pip install langchain==0.0.162 -qU
!pip install kaggle==1.5.15 -qU

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m65.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.6/227.6 kB[0m [31m23.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m34.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m74.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m72.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m30.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━

### Step 1: Defining the LLMs

We will use the MODEL_CONFIG dictionary to define the two models and to store additional information about them later in the notebook.

In [4]:
MODEL_CONFIG = {
    "bigscience/bloomz-3b": {
        "prompt": """question: \"{question}"\\n\nContext: \"{context}"\\n\nAnswer:"""
    },
    "google/flan-t5-large": {
        "prompt": """Answer based on context:\n\n{context}\n\n{question}"""
    }
}

We can set quantization configuration to load large model with less GPU memory.
This requires the `bitsandbytes` library

In [5]:
from torch import cuda, bfloat16, set_default_tensor_type
import transformers

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

Loading **BloomZ 3B** model from HuggingFace `transformers` library.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-3b"

MODEL_CONFIG[model_name]["tokenizer"] = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
MODEL_CONFIG[model_name]["model"] = AutoModelForCausalLM.from_pretrained(model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)

Downloading (…)okenizer_config.json:   0%|          | 0.00/199 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/6.01G [00:00<?, ?B/s]


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading **Flan T5 Large** model from HuggingFace `transformers` library.

In [7]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-large"

MODEL_CONFIG[model_name]["tokenizer"] = T5Tokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
MODEL_CONFIG[model_name]["model"] = T5ForConditionalGeneration.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


Downloading (…)lve/main/config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### Step 2. Ask a question to LLM without providing the context
To better illustrate why we need retrieval-augmented generation (RAG) based approach to solve the question and anwering problem. Let's directly ask the model a question and see how they respond.

In [8]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

In [9]:
def answer_based_on_a_question(model_name, question, prompt, tokenizer, model):
  print(f"\nModel name: \n{model_name}\n")
  prompt = prompt.replace("{question}", question).replace("{context}", "")
  inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs)
  print(f"Model output:")
  print(tokenizer.decode(outputs[0]))

In [10]:
for model in MODEL_CONFIG:
  answer_based_on_a_question(
      model,
      question,
      MODEL_CONFIG[model]["prompt"],
      MODEL_CONFIG[model]["tokenizer"],
      MODEL_CONFIG[model]["model"]
  )

Input length of input_ids is 27, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



Model name: 
bigscience/bloomz-3b

Model output:
question: "Which instances can I use with Managed Spot Training in SageMaker?"\n
Context: ""\n
Answer: Man

Model name: 
google/flan-t5-large

Model output:
<pad> SageMaker Online</s>


You can see the generated answer is wrong or doesn't make much sense.

### Step 3. Improve the answer to the same question using prompt engineering with insightful context
To better answer the question, we provide extra contextual information, combine it with a prompt, and send it to model together with the question. Below is an example.



In [11]:
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker.
Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

In [12]:
def answer_based_on_context_and_question(model_name, context, question, prompt, tokenizer, model):
  print(f"\nModel name: \n{model_name}\n")
  prompt = prompt.replace("{question}", question).replace("{context}", context)
  inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
  outputs = model.generate(inputs)
  print(f"Model output:")
  print(tokenizer.decode(outputs[0]))

In [13]:
for model in MODEL_CONFIG:
  answer_based_on_context_and_question(
      model_name,
      context,
      question,
      MODEL_CONFIG[model]["prompt"],
      MODEL_CONFIG[model]["tokenizer"],
      MODEL_CONFIG[model]["model"]
  )

Input length of input_ids is 62, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.



Model name: 
google/flan-t5-large

Model output:
question: "Which instances can I use with Managed Spot Training in SageMaker?"\n
Context: "Managed Spot Training can be used with all instances supported in Amazon SageMaker.
Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."\n
Answer: all

Model name: 
google/flan-t5-large

Model output:
<pad> all instances supported in Amazon SageMaker</s>


We can observe that the models generate more accurate answers when provided with some context.
<br>
This can be achieved by retrieving the context from a vector database, as demonstrated in the next step.

### Step 4. Use RAG based approach with LangChain and Pinecone to build a simplified question and answering application

We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do following.

- Generate embedings for each of document in the knowledge library with the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model.
- Identify top K most relevant documents based on user query.
    - For a query of your interest, generate the embedding of the query using the same embedding model.
    - Search the Pinecone index to get the most relevant documents in the embedding space (vector database).
- Combine the retrieved documents with prompt and question and send them into LLM.

#### 4.1 Preparing the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model

To create our embeddings we will use the `MiniLM-L6` sentence transformer model using the LangChain library. We initialize it like so:

In [14]:
from langchain.embeddings import HuggingFaceEmbeddings

embed = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)

Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [15]:
query = "An example sentence to obtain the embedding dimension."

xq = embed.embed_query(query)
len(xq)

384

Encoding this single sentence leaves us with a `384` dimensional sentence embedding.

#### 4.2. Generate embeddings for each document in the knowledge library with the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model and add it to Pinecone.

For the purpose of the demo we will use [Amazon SageMaker FAQs](https://www.kaggle.com/datasets/abbbhishekkk/faq-datasets-for-chatbot-training?select=Amazon_sagemaker_Faq.txt) as knowledge library. The data is formatted in a CSV file with three columns `question`, `answer` and `found_duplicate`. We use only the `answer` column as the documents of knowledge library, from which relevant documents are retrieved based on a query.

Let's prepare the dataset for upserting.

In [23]:
try:
    import kaggle
except OSError as e:
    print(e)

Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


Find your [Kaggle credentials](https://www.kaggle.com/settings) and replace them in the following cell.

In [24]:
import json

KAGGLE_USERNAME = "YOUR_KAGGLE_USERNAME"
KAGGLE_KEY = "YOUR_KAGGLE_KEY"

with open('/root/.kaggle/kaggle.json', 'w') as fp:
    fp.write(json.dumps({"username": KAGGLE_USERNAME,"key": KAGGLE_KEY}))

In [25]:
!kaggle datasets download -d abbbhishekkk/faq-datasets-for-chatbot-training

Downloading faq-datasets-for-chatbot-training.zip to /content
100% 264k/264k [00:00<00:00, 451kB/s]
100% 264k/264k [00:00<00:00, 451kB/s]


In [26]:
import zipfile

with zipfile.ZipFile("/content/faq-datasets-for-chatbot-training.zip", 'r') as zip_ref:
        zip_ref.extractall('./')

In [27]:
import pandas as pd

df_knowledge = pd.read_json("/content/Amazon_sagemaker_Faq.txt")

In [28]:
df_knowledge.head()

Unnamed: 0,question,answer,found_duplicate
0,What is Amazon SageMaker?,Amazon SageMaker is a fully managed service th...,False
1,In which regions is Amazon SageMaker available?,For a list of the supported Amazon SageMaker A...,False
2,What is the service availability of Amazon Sag...,Amazon SageMaker is designed for high availabi...,False
3,What security measures does Amazon SageMaker h...,Amazon SageMaker ensures that ML model artifac...,False
4,How does Amazon SageMaker secure my code?,Amazon SageMaker stores code in ML storage vol...,False


In [29]:
df_knowledge.drop(["question", "found_duplicate"], axis=1, inplace=True)
df_knowledge.head()

Unnamed: 0,answer
0,Amazon SageMaker is a fully managed service th...
1,For a list of the supported Amazon SageMaker A...
2,Amazon SageMaker is designed for high availabi...
3,Amazon SageMaker ensures that ML model artifac...
4,Amazon SageMaker stores code in ML storage vol...


In [30]:
df_knowledge.shape

(67, 1)

Next we can initialize our connection to **Pinecone**. To do this we need a [free API key](https://app.pinecone.io).

In [17]:
import pinecone
import os

# Load Pinecone API key
api_key = os.getenv('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY'
# Set Pinecone environment. Find next to API key in console
env = os.getenv('PINECONE_ENVIRONMENT') or 'YOUR_PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=api_key,
    environment=env
)

List all present indexes associated with your key, should be empty on the first run


In [20]:
pinecone.list_indexes()

[]

Now we create a new index called `retrieval-augmentation-langchain-aws`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [19]:
index_name = 'retrieval-augmentation-langchain-aws'

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

In [21]:
pinecone.create_index(
    name=index_name,
    dimension=384,
    metric='cosine'
)

In [22]:
index = pinecone.Index(index_name)

Now we can upsert the data, we will do this in batches of `128`.

In [31]:
from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

answers = df_knowledge[:vector_limit]

for i in tqdm(range(0, len(answers), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(answers))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in answers["answer"][i:i_end]]
    documents = answers["answer"][i:i_end]
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # create records list for upsert
    records = zip(ids, embeds, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

  0%|          | 0/1 [00:00<?, ?it/s]

In [33]:
# check number of records in the index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00067,
 'namespaces': {'': {'vector_count': 67}},
 'total_vector_count': 67}

#### 4.3 Generative Question-Answering with Langchain

In GQA we take the query as a question that is to be answered by a LLM, but the LLM must answer the question based on the information it is seeing being returned from the `vectorstore`.

In [34]:
from langchain.vectorstores import Pinecone

text_field = "text"

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [35]:
question

'Which instances can I use with Managed Spot Training in SageMaker?'

In [36]:
docs = vectorstore.similarity_search(
    question,  # our search query
    k=1  # return the most relevant document
)
docs

[Document(page_content='Managed Spot Training can be used with all instances supported in Amazon SageMaker.', metadata={})]

In [37]:
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

HUGGINGFACE_API_TOKEN=os.getenv('HUGGINGFACE_API_TOKEN') or 'YOUR_HUGGINGFACE_API_TOKEN'

for model in MODEL_CONFIG:
    MODEL_CONFIG[model]["model"] = HuggingFaceHub(
        repo_id=model,
        huggingfacehub_api_token=HUGGINGFACE_API_TOKEN
    )
    MODEL_CONFIG[model]["prompt_template"] = PromptTemplate(
        template=MODEL_CONFIG[model]["prompt"],
        input_variables=["context", "question"]
    )
    MODEL_CONFIG[model]["chain"] = LLMChain(
        prompt=MODEL_CONFIG[model]["prompt_template"],
        llm=MODEL_CONFIG[model]["model"]
    )

In [38]:
from langchain.chains.question_answering import load_qa_chain

print(f"Question: {question}")

for model in MODEL_CONFIG:
  chain = load_qa_chain(MODEL_CONFIG[model]["model"], chain_type="refine")
  result = chain({"input_documents": docs, "question": question})
  print(f"\nModel name:\n{model}")
  print(f"\nModel output:")
  print(result["output_text"])

Question: Which instances can I use with Managed Spot Training in SageMaker?

Model name:
bigscience/bloomz-3b

Model output:
 all

Model name:
google/flan-t5-large

Model output:
all


After retrieving the most similar document(s) and creating our context from it, we can observe that we have sufficient context for our model to function effectively.

In [39]:
pinecone.delete_index(index_name)