# Retrieval Augmented Generation (RAG) Using our Vector DB

In Section I, we built a Vector DB to allow retrieval of similar documents. This direct follow-up shows how to use the Vector DB to enhance our prompts with additional context before we pass them to a Large Language Model (LLM).

The notebook is organized as follows:

1. RAG Conceptually
   - Question-Answering using Large Language Models
   - Retrieval of Relevant Documents for a Query
   - Question-Answering using RAG for Document Context
2. Using built-in LangChain RAG chains

## 1. RAG Conceptually

Large Language Models have proven to be very good at general question-answering tasks. However, a main limitation of many LLMs is that they are constrained to the data they were trained on. Without access to an external data source, an LLM cannot bring in new information, whether that is proprietary domain-specific knowledge or just an update to an existing knowledge base. Given that, how can we give LLMs new information while still leveraging their powerful language capabilities?

One solution to this is Retrieval Augmented Generation (RAG). In RAG, we leverage the fact that LLMs can be prompted with additional data: we add relevant context to a given query before we pass it into the model. The old pipeline would be:

```
Query ------> LLM
```

With RAG, the pipeline becomes:

```
Query ------> Retrieve Relevant Documents ------> Augmented Query ------> LLM
```

We will retrieve relevant documents using the knowledge base we built with the Vector DB.
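
To make the data flow concrete, here is a minimal sketch of the pipeline in plain Python. Everything in it is a stand-in: `DOCUMENTS` is a hypothetical two-entry knowledge base, `retrieve` ranks by simple word overlap instead of vector similarity, and `fake_llm` is a placeholder for the real model call. It only illustrates the three stages, not how the rest of this notebook implements them.

```python
import re
from collections import Counter

# Hypothetical toy knowledge base (the notebook itself uses the Vector DB)
DOCUMENTS = [
    "Cafeteria menu: soup on Mondays, pasta on Tuesdays.",
    "NeuroGlyde notes: 40% reduction in annualized relapse rates.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, documents, k=1):
    """Stage 1: rank documents by word overlap with the query (stand-in for vector search)."""
    query_words = Counter(tokenize(query))

    def overlap(doc):
        return sum((Counter(tokenize(doc)) & query_words).values())

    return sorted(documents, key=overlap, reverse=True)[:k]

def augment(query, context_docs):
    """Stage 2: build the augmented query by placing the context before the question."""
    context = "\n\n".join(context_docs)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

def fake_llm(prompt):
    """Stage 3: placeholder for the real model call."""
    return f"<model output for a {len(prompt)}-character prompt>"

query = "What's the efficacy of NeuroGlyde?"
augmented_query = augment(query, retrieve(query, DOCUMENTS))
response = fake_llm(augmented_query)
print(augmented_query)
```

Swapping `retrieve` for a vector search and `fake_llm` for a real model leaves the overall shape unchanged.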

### Question-Answering using Large Language Models

We start by looking at a question-answering system that simply asks the LLM a question. In this case, if the model doesn't already know the answer, there is no way to inject that knowledge into the model. Some models may immediately identify that there's not enough context, while others may go off the rails and hallucinate.


In [1]:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
```



In [3]:
```python
# THE FIRST TIME YOU RUN THIS, IT MIGHT TAKE A WHILE
from transformers import BitsAndBytesConfig

model_path_or_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path_or_id)

# 4-bit quantization settings, passed via `quantization_config`
# (passing `load_in_4bit` directly to `from_pretrained` is deprecated)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path_or_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

def generate(prompt):
    """Convenience function for generating model output"""
    # Tokenize the input and move it to the GPU
    input_ids = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True).input_ids.cuda()

    # Generate new tokens based on the prompt, up to max_new_tokens
    # Sample according to the top_p and temperature parameters
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            do_sample=True,
            top_p=0.9,
            temperature=0.9,
            use_cache=True
        )
    # Decode the output and drop the leading prompt text so only the new text is returned
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]
```


Let's ask it a very general question. The model has been trained on a huge amount of data, so providing specifics in the question would likely lead it to a correct answer. Here, the model can't possibly ground itself because it doesn't know the context, yet it will still answer with whatever it has.

In [4]:
```python
# Prepare the input for tokenization, attaching any prompt text that is needed
PROMPT_TEMPLATE = """
    Question: {query}

    Answer: 
"""

query = "What's the efficacy of NeuroGlyde?"
prompt = PROMPT_TEMPLATE.format(query=query)

res = generate(prompt)

print(f"Prompt:\n{prompt}\n")
print(f"Generated Response:\n{res}")
```

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Prompt:

    Question: What's the efficacy of NeuroGlyde?

    Answer: 


Generated Response:
    NeuroGlyde is a dietary supplement that claims to improve memory and cognitive function by providing the body with the nutrients it needs to produce neurotransmitters, such as dopamine, serotonin, and acetylcholine. These neurotransmitters are responsible for the proper functioning of the brain and the communication between neurons. 

    While NeuroGlyde is not a magic cure-all for memory and cognitive function, there is some


It doesn't know the context, so let's provide it. Which context should we provide? The context will be retrieved from our vector database.

We will retrieve the documents relevant to this question, inject them into the prompt, and send that to the model instead.

### Retrieval of Relevant Documents for a Query

We'll briefly revisit our code to retrieve documents from our previous example.  This Vector DB has already been populated with a set of documents.

In [5]:
```python
from typing import List, Dict
from langchain.vectorstores.pgvector import PGVector

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
```

In [6]:
```python
# The connection to the database
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver="psycopg2",
    host="localhost",
    port="5432",
    database="postgres",
    user="username",
    password="password"
)

# The embedding function that will be used to store into the database
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True}
)

# Creates the database connection to our existing DB
db = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name="embeddings",
    embedding_function=embedding_function
)
```



In [7]:
```python
# Query the database. Note that the score here is a distance metric (lower is more related)
query = "What's the efficacy of NeuroGlyde?"
docs_with_scores = db.similarity_search_with_score(query, k=1)

# Print the results
for doc, score in docs_with_scores:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
```

--------------------------------------------------------------------------------
Score:  0.26277288642252317
Subject:  Medical Science Liaison (MSL) Notes - In-Depth Discussion on NeuroGlyde  
Date:  April 10, 2023  
Provider:  Dr. James Harper  
Title:  Neurologist  
Institution:  City Neurology Clinic  
Summary of Key Discussion Points:  
1. Introduction:  
• Introduced NeuroGlyde, a novel neuroprotective agent, emphasizing its potential in 
slowing disease progression.  
• Discussed ongoing clinical trials and positive early -phase results.  
2. Provider's Current Patient C ases:  
• Explored Dr. Harper's experience with NeuroGlyde in treating neurodegenerative 
disorders.  
• Discussed improvements in cognitive function observed in Alzheimer's patients.  
3. Efficacy and Clinical Data:  
• Presented data demonstrating a 40% reduction in annualized relapse rates in multiple 
sclerosis patients.  
• Highlighted significant improvements in quality of life measures.  
4. Safety Profile

When we query, we get the most relevant document for this query.  Let's create a new prompt that can take this new context. 
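
A note on the score printed above: it is a distance, so lower means more related. The sketch below assumes the store is using cosine distance (to our knowledge the default strategy for LangChain's PGVector, though this notebook does not set it explicitly). Because the embeddings were created with `normalize_embeddings=True`, cosine distance then reduces to one minus the dot product. The two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """1 - cosine similarity; for unit vectors this is 1 - dot(a, b)."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Hypothetical toy embeddings (real ones have many more dimensions)
query_vec = [1.0, 0.0]
close_doc = [0.9, 0.1]   # points in nearly the same direction as the query
far_doc = [0.0, 1.0]     # orthogonal to the query

print("close:", cosine_distance(query_vec, close_doc))
print("far:  ", cosine_distance(query_vec, far_doc))
```

The document pointing in nearly the same direction as the query gets a distance near 0, while the orthogonal one gets a distance of 1.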

### Question-Answering using RAG for Document Context

In [8]:
```python
# Prepare the input for tokenization, attaching any prompt text that is needed
RAG_PROMPT_TEMPLATE = """
Answer the question using only this context:

Context: {context}

Question: {query}

Answer: 
"""

query = "What's the efficacy of NeuroGlyde?"
docs_with_scores = db.similarity_search_with_score(query, k=1)
# Each result is a (document, score) pair; use the text of the closest document as context
context_prompt = RAG_PROMPT_TEMPLATE.format(
    context=docs_with_scores[0][0].page_content,
    query=query
)

res = generate(context_prompt)

print(f"Prompt:\n{context_prompt}\n")
print(f"Generated Response:\n{res}")
```



Prompt:

Answer the question using only this context:

Context: Subject:  Medical Science Liaison (MSL) Notes - In-Depth Discussion on NeuroGlyde  
Date:  April 10, 2023  
Provider:  Dr. James Harper  
Title:  Neurologist  
Institution:  City Neurology Clinic  
Summary of Key Discussion Points:  
1. Introduction:  
• Introduced NeuroGlyde, a novel neuroprotective agent, emphasizing its potential in 
slowing disease progression.  
• Discussed ongoing clinical trials and positive early -phase results.  
2. Provider's Current Patient C ases:  
• Explored Dr. Harper's experience with NeuroGlyde in treating neurodegenerative 
disorders.  
• Discussed improvements in cognitive function observed in Alzheimer's patients.  
3. Efficacy and Clinical Data:  
• Presented data demonstrating a 40% reduction in annualized relapse rates in multiple 
sclerosis patients.  
• Highlighted significant improvements in quality of life measures.  
4. Safety Profile:  
• Discussed the favorable safety profile 

That's it! That's the general concept of Retrieval Augmented Generation.
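
The example above injects only the single closest document. As a minimal sketch of one possible extension, the context could be built from several retrieved documents while skipping any whose distance is too large. The `Doc` tuple is a stand-in for LangChain's document objects so the snippet runs without a database, and both the `max_distance` cutoff of 0.5 and the sample scores are hypothetical values, not tuned recommendations.

```python
from collections import namedtuple

# Stand-in for a retrieved document; only `page_content` is needed here
Doc = namedtuple("Doc", ["page_content"])

def build_context(docs_with_scores, max_distance=0.5):
    """Join the text of every document whose distance is within the cutoff."""
    kept = [
        doc.page_content
        for doc, score in docs_with_scores
        if score <= max_distance
    ]
    return "\n\n".join(kept)

# Hypothetical results in the same (document, score) shape used above
docs_with_scores = [
    (Doc("NeuroGlyde efficacy notes"), 0.26),
    (Doc("Unrelated document"), 0.71),
]

context = build_context(docs_with_scores)
print(context)
```

The returned string can be passed as `context` to the same prompt template.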

## 2. Using built-in LangChain RAG chains

LangChain contains many built-in methods that connect to Vector Databases and LLMs. In the example above, we built a custom prompt template, manually retrieved the document, and then passed the result to the model. While that is pretty simple, LangChain lets us pipeline all of this together and do more, such as retrieving metadata and sources.

In [9]:
```python
from operator import itemgetter
from langchain.schema import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.runnable import RunnableParallel
from langchain.llms.huggingface_pipeline import HuggingFacePipeline

# Turn our db into a retriever
retriever = db.as_retriever(search_kwargs={'k': 2})

# Turn our model into an LLM
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=pipe)

prompt_template = PromptTemplate.from_template("""
Answer the question using only this context:

Context: {context}

Question: {question}

Answer: 
""")
```

In [11]:
```python
def format_docs(docs):
    """Join the text of the retrieved documents into one context string"""
    return "\n\n".join(doc.page_content for doc in docs)

# Build a chain with multiple documents for RAG
rag_chain_from_docs = (
    {
        "context": lambda x: format_docs(x["documents"]),
        "question": itemgetter("question"),
    }
    | prompt_template
    | llm
    | StrOutputParser()
)

# 2-step chain: first retrieve the documents,
# then store the relevant information from them in `sources`
# and pass the documents and the question into the document chain for `answer`
rag_chain_with_source = RunnableParallel({
    "documents": retriever,
    "question": RunnablePassthrough(),
}) | {
    "sources": lambda x: [(doc.page_content, doc.metadata) for doc in x["documents"]],
    "answer": rag_chain_from_docs,
}
```
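
To see what this chain computes, it can help to write the same data flow as ordinary functions. The sketch below is only an approximation of the chain's behaviour in plain Python: `fake_retriever` and `fake_llm` are hypothetical stand-ins for `retriever` and `llm`, the documents are plain dicts rather than LangChain objects, and the file names are made up.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns two document records for any question."""
    return [
        {"page_content": "Doc A text", "metadata": {"source": "a.pdf"}},
        {"page_content": "Doc B text", "metadata": {"source": "b.pdf"}},
    ]

def fake_llm(prompt):
    """Stand-in for `llm`: reports the prompt size instead of generating text."""
    return f"<answer based on {len(prompt)} prompt characters>"

def rag_with_sources(question):
    # Step 1: like the RunnableParallel, build a dict in which every value
    # is computed from the same input question
    step1 = {"documents": fake_retriever(question), "question": question}

    # Step 2: every key of the second dict is computed from step 1's output
    context = "\n\n".join(d["page_content"] for d in step1["documents"])
    prompt = f"Context: {context}\n\nQuestion: {step1['question']}\n\nAnswer:"
    return {
        "sources": [(d["page_content"], d["metadata"]) for d in step1["documents"]],
        "answer": fake_llm(prompt),
    }

result = rag_with_sources("What's the efficacy of NeuroGlyde?")
print(result["sources"])
print(result["answer"])
```

The result has the same two keys, `sources` and `answer`, that the LangChain version returns.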

In [12]:
```python
res = rag_chain_with_source.invoke("What's the efficacy of Pentatryponal?")
```



In [13]:
```python
print(res['answer'])
```


Answer the question using only this context:

Context: Subject:  Medical Science Liaison (MSL) Notes - Skepticism and Concerns on NeuroSolvix  
Date: April 30, 202 3 
Provider:  Dr. Cynthia Rodriguez  
Title:  Pain Management Specialist  
Institution:  NerveEase Pain Clinic  
Summary of Key Discussion Points:  
1. Introduction:  
• Introduced NeuroSolvix as a potential therapy for neuropathic pain management.  
• Dr. Rodriguez expressed skepticism, questioning the need for another medication in an 
already crowded pain management landscape.  
2. Provider's Current Patient Cases:  
• Dr. Rodriguez shared concerns about introducing new medications without clear 
superiority ove r existing neuropathic pain treatments.  
• Discussed specific cases where current therapies have shown established efficacy in pain 
relief.  
3. Efficacy and Clinical Data:  
• Presented recent clinical data showcasing NeuroSolvix's reduction in neuropathic pain 
score s and improved quality of life.  
• Dr. Ro

In [14]:
```python
res['sources']
```

[("Subject:  Medical Science Liaison (MSL) Notes - Skepticism and Concerns on NeuroSolvix  \nDate: April 30, 202 3 \nProvider:  Dr. Cynthia Rodriguez  \nTitle:  Pain Management Specialist  \nInstitution:  NerveEase Pain Clinic  \nSummary of Key Discussion Points:  \n1. Introduction:  \n• Introduced NeuroSolvix as a potential therapy for neuropathic pain management.  \n• Dr. Rodriguez expressed skepticism, questioning the need for another medication in an \nalready crowded pain management landscape.  \n2. Provider's Current Patient Cases:  \n• Dr. Rodriguez shared concerns about introducing new medications without clear \nsuperiority ove r existing neuropathic pain treatments.  \n• Discussed specific cases where current therapies have shown established efficacy in pain \nrelief.  \n3. Efficacy and Clinical Data:  \n• Presented recent clinical data showcasing NeuroSolvix's reduction in neuropathic pain \nscore s and improved quality of life.  \n• Dr. Rodriguez questioned the clinical rel