# Retrieval Augmented Generation

## RAG Explained

**Retrieval Augmented Generation (RAG)** is a technique that combines the strengths of retrieval-based and generative models to improve text generation. It is commonly used to improve response quality in question-answering scenarios. Before the generative model is prompted to answer a question, the user's input *(1)* is encoded as an embedding *(2)*, which is used to retrieve relevant information from a database *(3)* of documents. The retrieved passages are included in the prompt *(4)* so that they inform and improve the response produced by the generative model *(5)*. This approach is especially useful when fine-tuning is not feasible, for example because of limited training data or computational resources. For instance, incorporating passages from academic lecture notes directly into the input sequence helps the model give more detailed and relevant answers. Here are some resources with more information on the topic:

* [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
* [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426)
* [Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997)
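
The five numbered steps above can be sketched end to end without any model. This is only a toy illustration: the bag-of-words `embed` function, the three-sentence `database`, and the `retrieve` and `build_prompt` helpers are hypothetical stand-ins, not the Instructor embeddings and Chroma vector store used later in this notebook.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# (3) A tiny hypothetical document database.
database = [
    "The nursing process is a systematic method for planning and providing care.",
    "A nasogastric tube is inserted through the nose into the stomach.",
    "Hand washing is the most effective way to prevent infection.",
]

def retrieve(question, k=1):
    """(2) + (3): embed the question and return the k most similar documents."""
    q = embed(question)
    return sorted(database, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, k=1):
    """(4): place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# (1) The user input; the resulting prompt would be passed to a generative model (5).
prompt = build_prompt("What is the nursing process?")
print(prompt)
```

A real system replaces `embed` with a learned embedding model and the list scan with a vector index, but the flow of data stays the same.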




In [1]:
!pip uninstall -qy dill pyarrow jupyter-lsp jupyterlab jupyterlab-lsp google-cloud-storage tensorflowjs
!pip install -q "dill<0.3.2,>=0.3.1.1" "pyarrow<10.0.0,>=3.0.0" "google-cloud-storage<3,>=2.2.1"
!pip -q install langchain tiktoken chromadb pypdf transformers InstructorEmbedding sentence-transformers
!pip install -q --upgrade "overrides>=7.3.1" "kubernetes>=28.1.0"


In [2]:
from IPython.display import display, HTML

def displayAnswer(question="", answer="", sources=None):
    """Render a question, its answer, and optional source snippets as an HTML box."""
    msg = "<div style='position:relative;padding:0.75rem 1.25rem;\
            margin-bottom:1rem;border:1px solid transparent;\
            border-radius:.25rem;background-color:#fdf7e2;\
            border-color:#D6B656;color:#3c4046'>"
    if question: 
        msg += "<b>Question:</b><br>" + question + "<br><br>"
    if answer:
        msg += "<b>Answer:</b><br>" + answer + "<br><br>"
    if sources:
        msg += "<hr><b>Sources:</b><ul>"
        for source in sources:
            msg += "<li>" + source + "</li>"
        msg += "</ul>"
            
    msg += "</div>"
    display(HTML(msg))

In [3]:
#displayAnswer("Who was Albert Einstein?", "A physicist from Germany", ["Source 1", "Source 2"])

## Save Documents in Vector Store

In [4]:
from langchain.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from InstructorEmbedding import INSTRUCTOR
import os



In [5]:
# Load and process the pdf files
path = './data'
loader = DirectoryLoader(path, glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [6]:
# Number of imported pages
print("Number of imported pages:", len(documents))

Number of imported pages: 384


In [7]:
# Example page
print("Example page: \n", documents[0])

Example page: 
 page_content='LECTURE NOTES \n \nFor Nursing Students \n \n \n \n \n \nBasic Clinical Nursing Skills \n \n \n \n \n \n \n \n  \n \nAbraham Alano, B.Sc., M.P.H. \n \nHawassa University \n \n \nIn collaboration with the Ethiopia Public Heal th Training Initiative, The Carter Center, \nthe Ethiopia Ministry of Health, and the Ethiopia Ministry of Education \n \nNovermber 2002 ' metadata={'source': 'data/ln_clin_nursing_final.pdf', 'page': 0}


In [8]:
from nltk.tokenize import word_tokenize

def get_word_counts(document_collection):
    """Calculate the word count for each document in a collection."""
    word_counts = []

    for document in document_collection:
        text_content = document.page_content  # Extract text from the document
        words = word_tokenize(text_content)  # Tokenize the text into words
        word_count = len(words)  # Count the words
        word_counts.append(word_count)  # Append the count to the list

    return word_counts

In [9]:
# Count words per page BEFORE chunking
word_counts_before = get_word_counts(documents)
print(word_counts_before[0:10])

[47, 201, 250, 99, 222, 99, 110, 155, 172, 140]


In [10]:
# Splitting the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=128)
texts = text_splitter.split_documents(documents)

In [11]:
print("Number of chunks:", len(texts))

Number of chunks: 2766
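
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that uses a plain sliding character window. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks), and the `sliding_chunks` helper is hypothetical; it only illustrates the relationship between size and overlap.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=8, chunk_overlap=4)
print(chunks)
```

With an overlap of half the chunk size, as in the splitter configuration above (256 and 128), most of the text appears in two neighbouring chunks. This raises the chunk count but reduces the risk that a relevant passage is cut in half at a chunk boundary.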


In [12]:
print(texts[3])

page_content='Center, the Ethiopia Ministry of Health, and the Ethiopia Ministry of Education. \n \n  \n \n   \n \nImportant Guidelines for Printing and Photocopying Limited permission is granted free of charge  to print or photocopy all pages of this' metadata={'source': 'data/ln_clin_nursing_final.pdf', 'page': 1}


In [13]:
# Count words per chunk AFTER chunking
word_counts_after = get_word_counts(texts)
print(word_counts_after[0:10])

[33, 31, 38, 36, 28, 30, 37, 44, 24, 31]


In [14]:
# Plot the word count distribution BEFORE and AFTER chunking

import plotly.graph_objs as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, subplot_titles=("Before Chunking", "After Chunking"))

trace1 = go.Histogram(x=word_counts_before, name='Before')
fig.add_trace(trace1, row=1, col=1)
trace2 = go.Histogram(x=word_counts_after, name='After')
fig.add_trace(trace2, row=1, col=2)

fig.update_layout(height=600, width=1200, title_text="Word Count Distribution: Before and After Chunking")
fig.show()

In [15]:
# Load model for generating embeddings
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-base",
                                                      model_kwargs={"device": "mps"})


load INSTRUCTOR_Transformer
max_seq_length  512


In [16]:
# Create vector store and embed document chunks
persist_directory = 'db'
embedding = instructor_embeddings
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [19]:
# OPTIONALLY: Save vector store to disk
# vectordb.persist()
# vectordb = None

In [20]:
# # OPTIONALLY: Load vector store from disk
# vectordb = Chroma(persist_directory=persist_directory,
#                   embedding_function=embedding)

In [21]:
# List of questions; Feel free to add your own
questions = ["What is the purpose of the nursing process?"]

In [22]:
retriever = vectordb.as_retriever()

In [23]:
docs = retriever.get_relevant_documents("What is scaled dot-product attention?")


The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.



In [24]:
print("Number of retrieved document chunks:", len(docs))

Number of retrieved document chunks: 4


In [25]:
print("Example chunk:", docs[0])

Example chunk: page_content='22. Document relevant information, means by which correct \nplacement was determined and client responses. \n23. Establish a plan for providi ng daily nasogastric tube care \n- Inspect the nostril for discharge and irritation' metadata={'page': 234, 'source': 'data/ln_clin_nursing_final.pdf'}


In [26]:
# Retrieve documents for each question
for question in questions:
    docs = retriever.get_relevant_documents(question)
    displayAnswer(question = question, sources = [doc.page_content for doc in docs])

In [27]:
# Load Generative Language Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import transformers

tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")
model = AutoModelForSeq2SeqLM.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Text Generation Parameters

Each parameter influences the text generation in a specific way. Below are the parameters along with a brief explanation:

**`max_length`**:
* Sets the maximum number of tokens in the generated text (library default: 20, unless the model's generation config overrides it).
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts, but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (library default: 0).
* The EOS token is suppressed until this minimum length is reached, so generation cannot end earlier.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (library default: 1, which means no beam search).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of sequences to return for each input (library default: 1).
* When using sampling, multiple different sequences are generated independently from each other.

**`early_stopping`**:
* Controls when beam search stops (library default: False). When set to True, generation ends as soon as `num_beams` complete candidate sequences have been found.
* Only takes effect with beam search (`num_beams` > 1). Producing the EOS (End Of Sequence) token, often represented as `</s>`, ends a sequence regardless of this setting.

**`do_sample`**:
* If enabled, tokens are selected probabilistically based on their likelihood scores; if disabled, greedy or beam search decoding is used instead (library default: False).
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the `temperature` parameter.

**`temperature`**:
* Adjusts the probability distribution used for sampling the next token (library default: 1.0, which leaves the distribution unchanged).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (default is 50).
* Makes the output more focused by excluding unlikely tokens from sampling.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (library default: 1.0, which disables the filter).
* Tokens are sampled only from the smallest set whose cumulative probability exceeds this threshold.
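
The effect of `temperature`, `top_k`, and `top_p` on the next-token distribution can be reproduced in a few lines of plain Python. This is a minimal sketch on a hypothetical four-token vocabulary; the helper names `softmax`, `top_k_filter`, and `top_p_filter` are illustrative and not part of the `transformers` API.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
cold = softmax(logits, temperature=0.5)
warm = softmax(logits, temperature=1.5)
print("T=0.5:", [round(x, 3) for x in cold])
print("T=1.5:", [round(x, 3) for x in warm])
print("top_k=2:", top_k_filter(softmax(logits), k=2))
print("top_p=0.8:", top_p_filter(softmax(logits), p=0.8))
```

Sampling then draws the next token from the filtered, renormalized distribution. In this example both filters happen to keep the same two tokens, but in general `top_p` adapts the number of candidates to how peaked the distribution is, while `top_k` always keeps a fixed number.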

**`repetition_penalty`**:
* Discourages the model from repeating the same token by modifying the token's score (library default: 1.0, which applies no penalty).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.
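
A minimal sketch of how such a penalty can be applied to the raw scores, assuming the commonly used rule of dividing positive scores and multiplying negative scores by the penalty. The function name `apply_repetition_penalty` and the example values are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Penalize tokens that were already generated.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so a penalty > 1.0 always lowers the score.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, 1.0, -0.5, 0.3]  # hypothetical scores for a 4-token vocabulary
generated = [0, 2, 0]           # token ids produced so far
adjusted = apply_repetition_penalty(logits, generated, penalty=2.0)
print(adjusted)  # [1.0, 1.0, -1.0, 0.3]
```

Tokens 0 and 2 have already been generated, so their scores drop, while tokens 1 and 3 are left untouched. Note that the penalty is applied once per distinct token, not once per occurrence.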

Here are additional resources on the topic:
    
* https://huggingface.co/docs/transformers/main_classes/text_generation
* https://huggingface.co/docs/api-inference/detailed_parameters
* https://huggingface.co/blog/how-to-generate

In [28]:
# Create a 'text2text-generation' pipeline

from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256,
    min_length=25,
    do_sample = True,
    temperature = 0.6,
)

local_llm = HuggingFacePipeline(pipeline=pipe)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

The class `HuggingFacePipeline` was deprecated in LangChain 0.0.37 and will be removed in 0.3. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFacePipeline`.



In [31]:
# Test the pipeline with a simple question
question = "When was Albert Einstein born?"
#displayAnswer(question = question, answer = local_llm(question))
print(local_llm(question))

<pad> i n t e r e i n v i d e d o n t e i n a n a l b e r t y o u i n t e i n a l s t e i n b o r n .


In [30]:
# Test the pipeline with the lecture-related questions
for question in questions:
    displayAnswer(question = question, answer = local_llm(question))

In [None]:
# Create the RetrievalQA chain using LangChain
qa_chain = RetrievalQA.from_chain_type(llm=local_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [None]:
print("Prompt template:", qa_chain.combine_documents_chain.llm_chain.prompt.template)

Prompt template: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
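
With `chain_type="stuff"`, the retrieved chunks are concatenated and substituted for `{context}` in the template above. The substitution can be sketched with plain string formatting. The chunk texts below are hypothetical, and joining them with a blank line is an assumption about the chain's separator, not something shown in this notebook.

```python
template = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

retrieved_chunks = [  # hypothetical retrieved chunks
    "The nursing process is a systematic approach to care.",
    "It consists of assessment, diagnosis, planning, implementation and evaluation.",
]

prompt = template.format(
    context="\n\n".join(retrieved_chunks),  # assumed separator
    question="What is the purpose of the nursing process?",
)
print(prompt)
```

Because every retrieved chunk is placed in a single prompt, the number and size of the chunks must fit within the model's input length.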


In [None]:
# Test the RetrievalQA chain with the lecture-related questions
for question in questions:
    response = qa_chain(question)
    displayAnswer(question, response['result'], 
                  [source.metadata['source'] for source in response["source_documents"]])
    


The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.

