# Basic RAG

Retrieval-augmented generation (RAG) is an AI framework that combines the capabilities of LLMs with information retrieval systems. It is useful for answering questions or generating content that draws on external knowledge.

There are two main steps in RAG:
1. Retrieval: retrieve relevant information from a knowledge base, using text embeddings stored in a vector store.
2. Generation: insert the retrieved information into the prompt so that the LLM can generate a response grounded in it.
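
The two steps above can be sketched end to end with toy data. In this minimal, dependency-free illustration, the three-dimensional vectors stand in for real text embeddings, and `retrieve` and `build_prompt` are hypothetical helper names rather than part of any library; the LLM call itself is left out.

```python
import math

# Toy knowledge base: each entry pairs a text chunk with a made-up embedding.
# Real embeddings come from an embedding model and have many more dimensions.
knowledge_base = [
    ("Amblyopia is reduced vision in one eye.", [0.9, 0.1, 0.0]),
    ("Faiss is a library for similarity search.", [0.0, 0.2, 0.9]),
    ("Strabismus is a misalignment of the eyes.", [0.8, 0.3, 0.1]),
]

def retrieve(query_vector, k=2):
    """Step 1 (retrieval): return the k chunks closest to the query vector."""
    ranked = sorted(knowledge_base, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Step 2 (generation input): place the retrieved chunks in the prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

query_vector = [1.0, 0.1, 0.0]  # pretend embedding of the question
top_chunks = retrieve(query_vector)
prompt = build_prompt("What is amblyopia?", top_chunks)
print(prompt)
```

The prompt printed here is what would be sent to the LLM in the generation step.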

In this guide, we will walk through a very basic example of RAG with two implementations:

- RAG from scratch with Llama 3.2 (using the free Groq API) and Faiss
- RAG with Llama 3.2 and LangChain


## RAG from scratch

This section aims to guide you through the process of building a basic RAG from scratch.

### Setup and Installation

Install the required libraries.

In [1]:
!pip install numpy==1.26.4 faiss-cpu==1.8.0 openai==1.14.3 sentence-transformers pandas==1.5.3



Download the PubMedQA labeled dataset for demonstration purposes.

In [2]:
!wget https://raw.githubusercontent.com/pubmedqa/pubmedqa/refs/heads/master/data/ori_pqal.json -O pqa_labelled.json

--2024-10-17 01:28:28--  https://raw.githubusercontent.com/pubmedqa/pubmedqa/refs/heads/master/data/ori_pqal.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2584787 (2.5M) [text/plain]
Saving to: ‘pqa_labelled.json’


2024-10-17 01:28:30 (4.96 MB/s) - ‘pqa_labelled.json’ saved [2584787/2584787]



### Import Libraries

In [3]:
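import json
import os

import faiss
import numpy as np
import pandas as pd
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Groq exposes an OpenAI-compatible endpoint, so the OpenAI client can be used.
# Read the API key from the environment instead of hard-coding it
# (GROQ_API_KEY is an assumed variable name; use whichever one you exported).
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url='https://api.groq.com/openai/v1',
)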



### Load and view the dataset

In [4]:
with open('pqa_labelled.json', 'r') as f:
    data = json.load(f)

transformed_data = {
    "questions": [],
    "contexts": [],
    "answers": []
}

for item in data.values():
    transformed_data['questions'].append(item['QUESTION'])
    transformed_data['contexts'].append(item['CONTEXTS'])
    transformed_data['answers'].append(item['LONG_ANSWER'])

df = pd.DataFrame(transformed_data)
df.head()

| | questions | contexts | answers |
|---|---|---|---|
| 0 | Do mitochondria play a role in remodelling lac... | [Programmed cell death (PCD) is the regulated ... | Results depicted mitochondrial dynamics in viv... |
| 1 | Landolt C and snellen e acuity: differences in... | [Assessment of visual acuity depends on the op... | Using the charts described, there was only a s... |
| 2 | Syncope during bathing in infants, a pediatric... | [Apparent life-threatening events in infants a... | "Aquagenic maladies" could be a pediatric form... |
| 3 | Are the long-term results of the transanal pul... | [The transanal endorectal pull-through (TERPT)... | Our long-term study showed significantly bette... |
| 4 | Can tailored interventions increase mammograph... | [Telephone counseling and tailored print commu... | The effects of the intervention were most pron... |


### Split documents into chunks

In a RAG system, it is crucial to split documents into smaller chunks so that the retrieval step can identify and return the most relevant information more effectively. In this example, we simply split the text by character, putting up to 2048 characters into each chunk.

In [5]:
chunk_size = 2048
chunks = []

for text in df['contexts']:
    text = "\n".join(text)
    chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size) if len(text[i:i + chunk_size]) > 0]

print(f"Total chunks: {len(chunks)}")

chunks = chunks[:200]
print(f"For testing, we will use only {len(chunks)} chunks")

Total chunks: 1033
For testing, we will use only 200 chunks
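
The slicing used above can be wrapped in a small helper. The sketch below is one possible variation, not part of the original pipeline: `split_by_characters` is a hypothetical name, and the optional `overlap` parameter (an assumption added here) repeats the tail of each chunk at the start of the next, so that text cut at a chunk boundary is more likely to appear intact in a neighbouring chunk.

```python
def split_by_characters(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters; overlap=0 reproduces
    plain fixed-size slicing.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
print(split_by_characters(sample, chunk_size=4))             # ['abcd', 'efgh', 'ij']
print(split_by_characters(sample, chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_size=2048` and `overlap=0`, the helper produces the same chunks as the loop in this section.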


In [6]:
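# Load the embedding model once and reuse it; reloading it on every call is slow.
embedding_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def get_text_embedding(sentences):
    """Return one embedding vector per input sentence as a NumPy array."""
    return embedding_model.encode(sentences)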

In [7]:
text_embeddings = get_text_embedding(chunks)

In [8]:
text_embeddings.shape

(200, 1024)

In [9]:
text_embeddings

array([[-0.02067855, -0.0206868 , -0.00386641, ..., -0.0246049 ,
        -0.01016385, -0.01389029],
       [ 0.02316776,  0.00813704,  0.01444905, ...,  0.00461098,
        -0.00860879, -0.0261598 ],
       [-0.01896981,  0.03469644,  0.01957831, ..., -0.03605654,
         0.01847292, -0.03471107],
       ...,
       [ 0.01794667,  0.00674211, -0.01692702, ..., -0.00258185,
        -0.05088462, -0.01757879],
       [ 0.01564134,  0.02095531,  0.02021172, ..., -0.02712115,
         0.00242129, -0.0316244 ],
       [ 0.00582405, -0.00561285,  0.00256976, ..., -0.00079749,
        -0.00475963,  0.02209369]], dtype=float32)

### Load into a vector database
Once we have the text embeddings, a common practice is to store them in a vector database for efficient processing and retrieval. There are several vector databases to choose from. In our simple example, we use Faiss, an open-source library for efficient similarity search.

With Faiss, we instantiate an index, which defines the indexing structure of the vector store, and then add the text embeddings to it. Here we use `IndexFlatL2`, which performs an exact (brute-force) search based on L2 distance.


In [10]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)
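
To make the index less of a black box: `IndexFlatL2` performs an exact search by comparing the query against every stored vector, and the distances it reports are squared Euclidean distances. The dependency-free sketch below mimics that behaviour on toy vectors; `flat_l2_search` is a hypothetical helper written for illustration, not a Faiss function.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(stored_vectors, query, k):
    """Exhaustively rank stored vectors by squared L2 distance to the query.

    Returns (distances, indices) for the k nearest vectors, closest first.
    """
    scored = sorted(
        (squared_l2(vector, query), position)
        for position, vector in enumerate(stored_vectors)
    )
    top = scored[:k]
    return [dist for dist, _ in top], [pos for _, pos in top]

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, query=[0.5, 0.0], k=2)
print(indices)    # [0, 1]
print(distances)  # [0.25, 1.25]
```

The cost of this exhaustive comparison grows linearly with the number of stored vectors, which is fine for the 200 chunks used here.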

### Create embeddings for a question
Whenever a user asks a question, we also need to create an embedding for it, using the same embedding model as before.


In [11]:
question = "Explain what is Amblyopia?"
question_embeddings = get_text_embedding([question])
question_embeddings.shape

(1, 1024)

In [12]:
question_embeddings

array([[ 0.02755282, -0.02487555, -0.04647454, ...,  0.00898436,
        -0.00786413, -0.02185952]], dtype=float32)

### Retrieve similar chunks from the vector database
We can perform a search on the vector database with `index.search`, which takes two arguments: the first is the vector of the question embeddings, and the second is the number of similar vectors to retrieve. This function returns the distances and the indices of the most similar vectors to the question vector in the vector database. Then based on the returned indices, we can retrieve the actual relevant text chunks that correspond to those indices.


In [13]:
D, I = index.search(question_embeddings, k=3)
print(I)

[[ 26   1 106]]
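
`index.search` returns two arrays with one row per query: `D` holds the distances and `I` the positions of the matching vectors, and Faiss fills a slot with `-1` when fewer than `k` results are available. As a rough sketch, the hypothetical helper below (not part of Faiss) pairs one row of results with its text chunks and skips such empty slots; plain lists and made-up values stand in for the NumPy arrays.

```python
def collect_hits(distance_row, index_row, chunks):
    """Pair each returned index with its chunk and distance, skipping -1 slots."""
    hits = []
    for distance, position in zip(distance_row, index_row):
        if position < 0:  # Faiss uses -1 to mark a missing result
            continue
        hits.append({"index": position, "distance": distance, "text": chunks[position]})
    return hits

# Toy stand-ins for the chunk list and for one row of D and I.
# float("inf") is only a placeholder distance for the empty slot.
toy_chunks = ["chunk A", "chunk B", "chunk C"]
toy_D = [[0.12, 0.57, float("inf")]]
toy_I = [[2, 0, -1]]

hits = collect_hits(toy_D[0], toy_I[0], toy_chunks)
for hit in hits:
    print(hit["index"], hit["distance"], hit["text"])
```

With the real search results, the equivalent call would be `collect_hits(D[0], I[0], chunks)`.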


In [14]:
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

['The records of 465 patients with an established diagnosis of age related macular degeneration who had attended a specialist macular clinic between 1990 and 1998 were scrutinised. A full clinical examination and standardised refraction had been carried out in 189 of these cases on a minimum of two occasions. Cases were looked for where an improvement of one or more lines of either distance or near acuity was recorded in the eye unaffected by macular disease. In each one of these cases the improvement in visual acuity could not be attributed to treatment of other existing pathology.\n12 such cases were detected. In nine of these the eye showing improvement of acuity had a history of amblyopia. The mean improvement in distance and near acuity in amblyopic eyes by 12 months was 3.3 and 1.9 lines logMAR respectively. The improvement in acuity generally occurred between 1 and 12 months from baseline and remained stable over the period of follow up.', 'Assessment of visual acuity depends on

### Combine context and question in a prompt and generate response

Finally, we can offer the retrieved text chunks as the context information within the prompt. Here is a prompt template where we can include both the retrieved text and user question in the prompt.



In [15]:
def prompt_template(question, retrieved_chunk):
    prompt = f"""
    Answer the following question based only on the provided context.

    <context>
    {retrieved_chunk}
    </context>

    Question: {question}
    """
    return prompt
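
Because `retrieved_chunk` is a Python list, the f-string above inserts its list representation (brackets, quotes, and escaped newlines) into the prompt. One possible alternative, sketched below under the assumption that cleaner context is preferable, joins the chunks into numbered passages first; `format_context` and `build_prompt` are hypothetical names.

```python
def format_context(chunks):
    """Join chunks into numbered passages separated by blank lines."""
    return "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))

def build_prompt(question, chunks):
    """Build a prompt whose context block contains plain text, not a list repr."""
    return (
        "Answer the following question based only on the provided context.\n\n"
        f"<context>\n{format_context(chunks)}\n</context>\n\n"
        f"Question: {question}"
    )

example = build_prompt(
    "Explain what is Amblyopia?",
    ["First retrieved passage.", "Second retrieved passage.\n"],
)
print(example)
```

Numbering the passages also gives the model a simple way to refer to a specific source in its answer.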

In [16]:
def run_llm(user_message, model="llama-3.2-3b-preview"):
    system_message = "You are a helpful assistant. You are given a question and a context. You need to answer the question based on the context."
    messages = [{"role": "system", "content": system_message}]
    messages += [{"role": "user", "content": user_message}]
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    return completion.choices[0].message.content

In [17]:
prompt = prompt_template(question, retrieved_chunk)

run_llm(prompt)



"Based on the provided context, Amblyopia is a condition where there is a reduced visual acuity due to an early disturbance in the brain's use of visual information, usually occurring before the age of 7. In the context of the study, amblyopia was detected in 39 patients with strabismus and 13 healthy volunteers."

### Test on the dataset

In [18]:
data_id = 1

question, answers = df['questions'][data_id], df['answers'][data_id]

question_embeddings = get_text_embedding([question])
D, I = index.search(question_embeddings, k=3)
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
prompt = prompt_template(question, retrieved_chunk)

response = run_llm(prompt)
print("Question:", question)
print("RAG Response:", response)
print("Ground Truth:", answers)

Question: Landolt C and snellen e acuity: differences in strabismus amblyopia?
RAG Response: According to the provided context, in the study regarding strabismus amblyopia (39 patients with various eye disorders), the mean decimal values for Landolt C (LR) and Snellen E (SE) acuity were 0.14 and 0.16 respectively. The mean difference between LR and SE was 0.55 lines for the eyes with strabismus amblyopia.
Ground Truth: Using the charts described, there was only a slight overestimation of visual acuity by the Snellen E compared to the Landolt C, even in strabismus amblyopia. Small differences in the lower visual acuity range have to be considered.
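
Reading a single response against its reference is subjective. As a rough, automatic sanity check (an addition for illustration, not part of the original evaluation), one could compute a token-overlap F1 score between the RAG response and the ground truth. The `token_f1` helper below is hypothetical, and a lexical score like this ignores paraphrases, so treat it as a crude signal only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat slept"))
```

In this notebook, the call would be `token_f1(response, answers)`; averaging it over several `data_id` values gives a coarse picture of answer quality.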


## LangChain

In [19]:
!pip install langchain==0.1.13 langchain-community==0.0.29 langchain-openai==0.1.1 langchain-huggingface==0.0.3





In [22]:
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain


In [23]:
# Load data
docs = [Document(page_content="\n".join(doc)) for doc in df['contexts']]
docs = docs[:200]


In [24]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)


In [25]:
# Define the embedding model
embeddings = SentenceTransformerEmbeddings(model_name="BAAI/bge-large-en-v1.5")
# Create the vector store
vector = FAISS.from_documents(documents, embeddings)
# Define a retriever interface
retriever = vector.as_retriever()


In [26]:
import os

# Define the LLM. The API key is read from the environment instead of being
# hard-coded (GROQ_API_KEY is an assumed variable name).
model = ChatOpenAI(
    model_name='llama-3.2-3b-preview',
    openai_api_key=os.environ["GROQ_API_KEY"],
    openai_api_base="https://api.groq.com/openai/v1"
)





In [27]:
# Define prompt template
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}
""")

In [28]:
# Create a retrieval chain to answer questions
document_chain = create_stuff_documents_chain(model, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
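
Conceptually, the retrieval chain fetches documents for the input, and the "stuff" documents chain places their text into the `{context}` slot of the prompt before calling the model. The sketch below imitates that flow in plain Python to show the idea; it is not LangChain's implementation, the blank-line separator is an assumption, and `toy_retriever`, `stuff_documents`, and `toy_llm` are hypothetical stand-ins.

```python
PROMPT_TEMPLATE = (
    "Answer the following question based only on the provided context:\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {input}"
)

def toy_retriever(query):
    """Stand-in retriever: returns fixed passages regardless of the query."""
    return ["Passage one.", "Passage two."]

def stuff_documents(passages, question):
    """Put all passages into the context slot of a single prompt."""
    return PROMPT_TEMPLATE.format(context="\n\n".join(passages), input=question)

def toy_llm(prompt):
    """Stand-in model: reports the prompt size instead of generating text."""
    return f"(model reply to a prompt of {len(prompt)} characters)"

def toy_retrieval_chain(question):
    """Retrieve, stuff, then generate, returning input, context, and answer."""
    passages = toy_retriever(question)
    prompt = stuff_documents(passages, question)
    return {"input": question, "context": passages, "answer": toy_llm(prompt)}

result = toy_retrieval_chain("Explain what is Amblyopia?")
print(result["answer"])
```

The returned dictionary uses the same `input`, `context`, and `answer` keys that the real chain's output exposes, which is why the cells in this section read `response["answer"]`.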

In [29]:
response = retrieval_chain.invoke({"input": "Explain what is Amblyopia?"})
print(response["answer"])

Amblyopia is a condition where there is a reduced visual acuity in one or both eyes, often caused by a lack of clear vision during childhood, usually before the age of 7. This can occur due to strabismus (crossed eyes), nearsightedness, farsightedness, or other eye problems that prevent the brain from developing proper vision in one or both eyes. As a result, the brain favors the eye(s) with normal vision, leaving the affected eye(s) with weaker vision.


In [31]:
data_id = 1

question, answers = df['questions'][data_id], df['answers'][data_id]


response = retrieval_chain.invoke({"input": question})
print("Question:", question)
print("RAG Response:", response["answer"])
print("Ground Truth:", answers)

Question: Landolt C and snellen e acuity: differences in strabismus amblyopia?
RAG Response: The mean difference between Landolt C acuity (LR) and Snellen E acuity (SE) for eyes with strabismus amblyopia was 0.55 lines.
Ground Truth: Using the charts described, there was only a slight overestimation of visual acuity by the Snellen E compared to the Landolt C, even in strabismus amblyopia. Small differences in the lower visual acuity range have to be considered.
