## RAG from scratch

This section walks through building a basic RAG pipeline from scratch.

# Basic RAG

Retrieval-augmented generation (RAG) is an AI framework that combines the capabilities of LLMs with information retrieval systems. It is useful for answering questions or generating content grounded in external knowledge.

There are two main steps in RAG:
1. retrieval: retrieve relevant information from a knowledge base whose text embeddings are stored in a vector store;
2. generation: insert the retrieved information into the prompt so that the LLM can generate a grounded response.
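
The two steps can be sketched end to end without any external library. This is only a minimal illustration, not the implementation used in this guide: it replaces learned embeddings with word-count vectors and stops at building the prompt. The names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, knowledge_base, k=1):
    """Step 1: return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Step 2: place the retrieved text in the prompt that is sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Placeholder documents, not taken from the dataset used later.
knowledge_base = [
    "notes on amblyopia and visual acuity",
    "notes on mammography screening interventions",
]
top = retrieve("What is amblyopia?", knowledge_base, k=1)
prompt = build_prompt("What is amblyopia?", top)
print(prompt)
```

In the rest of this guide, a sentence-embedding model takes the place of `embed`, and a Faiss index takes the place of the sorted scan.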

In this guide, we will walk through a very basic example of RAG with two implementations:

- RAG from scratch with Llama 3.2 (using the free Groq API) and Faiss
- RAG with Llama 3.2 and LangChain


### Setup and Installation

Install the required libraries

In [1]:
!pip install numpy==1.26.4 faiss-cpu==1.8.0 openai==1.14.3 sentence-transformers pandas==1.5.3


Download the labeled subset of the PubMedQA dataset for demonstration purposes.

In [2]:
!wget https://raw.githubusercontent.com/pubmedqa/pubmedqa/refs/heads/master/data/ori_pqal.json -O pqa_labelled.json

--2024-10-17 04:49:50--  https://raw.githubusercontent.com/pubmedqa/pubmedqa/refs/heads/master/data/ori_pqal.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2584787 (2.5M) [text/plain]
Saving to: ‘pqa_labelled.json’


2024-10-17 04:49:50 (29.9 MB/s) - ‘pqa_labelled.json’ saved [2584787/2584787]



### Import Libraries

In [3]:
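import json
import os

import faiss
import numpy as np
import pandas as pd
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# The Groq endpoint is OpenAI-compatible, so the OpenAI client is reused with a different base URL.
# Read the API key from the environment instead of hard-coding it in the notebook.
# The variable name GROQ_API_KEY is an assumption; use whatever name you exported.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)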



### Load and view the dataset

In [4]:
with open('pqa_labelled.json', 'r') as f:
    data = json.load(f)

transformed_data = {
    "questions": [],
    "contexts": [],
    "answers": []
}

for item in data.values():
    transformed_data['questions'].append(item['QUESTION'])
    transformed_data['contexts'].append(item['CONTEXTS'])
    transformed_data['answers'].append(item['LONG_ANSWER'])

df = pd.DataFrame(transformed_data)
df.head()

| | questions | contexts | answers |
|---|---|---|---|
| 0 | Do mitochondria play a role in remodelling lac... | [Programmed cell death (PCD) is the regulated ... | Results depicted mitochondrial dynamics in viv... |
| 1 | Landolt C and snellen e acuity: differences in... | [Assessment of visual acuity depends on the op... | Using the charts described, there was only a s... |
| 2 | Syncope during bathing in infants, a pediatric... | [Apparent life-threatening events in infants a... | "Aquagenic maladies" could be a pediatric form... |
| 3 | Are the long-term results of the transanal pul... | [The transanal endorectal pull-through (TERPT)... | Our long-term study showed significantly bette... |
| 4 | Can tailored interventions increase mammograph... | [Telephone counseling and tailored print commu... | The effects of the intervention were most pron... |


### Split document into chunks

In a RAG system, documents are split into smaller chunks so that the retrieval step can identify and return only the most relevant passages. In this example, we simply split the text by character, putting up to 2048 characters into each chunk.

In [5]:
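chunk_size = 2048  # maximum number of characters per chunk
chunks = []

for text in df['contexts']:
    # Each entry is a list of context paragraphs; join them into one string first.
    text = "\n".join(text)
    # Slice the string into consecutive, non-overlapping windows of chunk_size characters.
    chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]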

print(f"Total chunks: {len(chunks)}")

chunks = chunks[:200]
print(f"For testing, we will use only {len(chunks)} chunks")

Total chunks: 1033
For testing, we will use only 200 chunks
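
Fixed-size splitting can cut a sentence in half at a chunk boundary. One common variation, shown here as a sketch rather than as part of this notebook's pipeline, lets consecutive chunks overlap by a few characters. The function name `chunk_text` and its `overlap` parameter are hypothetical.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so text near a boundary
    appears in two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, 4))             # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, 4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `overlap=0` this reduces to the splitting used above. Overlap increases the number of chunks, and therefore the number of embeddings to compute and store.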


In [6]:
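# Load the embedding model once; reloading it on every call is slow.
embedding_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

def get_text_embedding(sentences):
    """Return one embedding vector per input string as a NumPy array."""
    return embedding_model.encode(sentences)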

In [7]:
text_embeddings = get_text_embedding(chunks)

In [8]:
text_embeddings.shape

(200, 1024)

In [9]:
text_embeddings

array([[-0.0206785 , -0.02068686, -0.00386645, ..., -0.02460489,
        -0.01016388, -0.01389027],
       [ 0.02316779,  0.00813706,  0.01444898, ...,  0.00461098,
        -0.0086088 , -0.02615978],
       [-0.01896976,  0.0346964 ,  0.01957824, ..., -0.03605655,
         0.01847291, -0.03471109],
       ...,
       [ 0.0179466 ,  0.00674204, -0.01692699, ..., -0.00258184,
        -0.05088448, -0.01757879],
       [ 0.01564134,  0.02095525,  0.02021168, ..., -0.02712116,
         0.00242136, -0.03162438],
       [ 0.00582403, -0.00561286,  0.00256976, ..., -0.00079752,
        -0.00475964,  0.02209357]], dtype=float32)

### Load into a vector database
Once we have the text embeddings, a common practice is to store them in a vector database for efficient retrieval. There are several vector databases to choose from. In this simple example, we use Faiss, an open-source library for efficient similarity search, as the vector store.

With Faiss, we create an index object, which defines the indexing structure of the vector store (here `IndexFlatL2`, sized to the embedding dimension `d`). We then add the text embeddings to this index.


In [10]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)
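
Conceptually, a flat L2 index stores the vectors as they are and compares a query against every one of them. The sketch below mimics the `add`/`search` interface in pure Python to show the idea. `TinyFlatL2` is a hypothetical class for illustration, not part of Faiss, and it has none of the optimizations that make Faiss fast.

```python
class TinyFlatL2:
    """Brute-force nearest-neighbour search using squared L2 distance."""

    def __init__(self, d):
        self.d = d          # dimensionality every stored vector must have
        self.vectors = []   # stored vectors, in insertion order

    def add(self, vectors):
        for v in vectors:
            if len(v) != self.d:
                raise ValueError(f"expected dimension {self.d}, got {len(v)}")
            self.vectors.append(list(v))

    def search(self, queries, k):
        """Return (distances, indices) with one row of length <= k per query."""
        all_distances, all_indices = [], []
        for q in queries:
            scored = sorted(
                (sum((a - b) ** 2 for a, b in zip(q, v)), i)
                for i, v in enumerate(self.vectors)
            )[:k]
            all_distances.append([dist for dist, _ in scored])
            all_indices.append([i for _, i in scored])
        return all_distances, all_indices


toy_index = TinyFlatL2(2)
toy_index.add([[0, 0], [1, 0], [3, 4]])
distances, indices = toy_index.search([[1, 1]], k=2)
print(indices)    # [[1, 0]]
print(distances)  # [[1, 2]]
```

The returned indices refer to insertion order, which is why the original list of chunks must be kept alongside the index.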

### Create embeddings for a question
Whenever a user asks a question, we also need to create an embedding for the question with the same embedding model as before, so that the question and the chunks live in the same vector space.


In [11]:
question = "Explain what is Amblyopia?"
question_embeddings = get_text_embedding([question])
question_embeddings.shape

(1, 1024)

In [12]:
question_embeddings

array([[ 0.02755284, -0.02487543, -0.04647457, ...,  0.00898422,
        -0.00786411, -0.02185954]], dtype=float32)

### Retrieve similar chunks from the vector database
We can search the vector database with `index.search`, which takes two arguments: the question embeddings and the number `k` of similar vectors to retrieve. It returns the distances and the indices of the `k` vectors in the database that are closest to the question vector. Based on the returned indices, we can then look up the text chunks that correspond to them.


In [13]:
D, I = index.search(question_embeddings, k=3)
print(I)

[[ 26   1 106]]
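
The distances in `D` are squared L2 distances for `IndexFlatL2`. If the embeddings are unit-length, squared L2 distance and cosine similarity produce the same ranking, because ‖a − b‖² = 2 − 2·(a · b) for unit vectors. Whether a given embedding model returns normalized vectors is an assumption worth checking. The sketch below verifies the identity numerically on made-up vectors.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors, normalized to unit length.
query = normalize([1.0, 2.0, 2.0])
docs = [normalize(v) for v in ([2.0, 4.0, 4.1], [-1.0, 0.0, 3.0], [0.0, 1.0, 0.0])]

# Smallest distance first versus largest cosine similarity first.
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))

for doc in docs:
    # For unit vectors: squared distance = 2 - 2 * cosine similarity.
    assert abs(squared_l2(query, doc) - (2 - 2 * dot(query, doc))) < 1e-9

print(by_l2 == by_cosine)  # True
```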


In [14]:
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

['The records of 465 patients with an established diagnosis of age related macular degeneration who had attended a specialist macular clinic between 1990 and 1998 were scrutinised. A full clinical examination and standardised refraction had been carried out in 189 of these cases on a minimum of two occasions. Cases were looked for where an improvement of one or more lines of either distance or near acuity was recorded in the eye unaffected by macular disease. In each one of these cases the improvement in visual acuity could not be attributed to treatment of other existing pathology.\n12 such cases were detected. In nine of these the eye showing improvement of acuity had a history of amblyopia. The mean improvement in distance and near acuity in amblyopic eyes by 12 months was 3.3 and 1.9 lines logMAR respectively. The improvement in acuity generally occurred between 1 and 12 months from baseline and remained stable over the period of follow up.', 'Assessment of visual acuity depends on

### Combine context and question in a prompt and generate response

Finally, we pass the retrieved text chunks to the LLM as context within the prompt. The prompt template below includes both the retrieved text and the user's question.



In [None]:
def prompt_template(question, retrieved_chunk):
    prompt = f"""
    Answer the following question based only on the provided context.

    <context>
    {retrieved_chunk}
    </context>

    Question: {question}
    """
    return prompt
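
Note that `retrieved_chunk` is a Python list, so the f-string in `prompt_template` inserts the list's `repr` (brackets, quotes, and escaped `\n` sequences) into the prompt. One possible alternative, sketched here with hypothetical names `format_context` and `prompt_template_joined`, joins the chunks into plain text first. The separator is an arbitrary choice.

```python
def format_context(chunks, separator="\n\n---\n\n"):
    """Join retrieved chunks into one plain-text block."""
    return separator.join(chunk.strip() for chunk in chunks)


def prompt_template_joined(question, retrieved_chunks):
    """Build the same kind of prompt, with chunks inserted as text, not as a list repr."""
    context = format_context(retrieved_chunks)
    return (
        "Answer the following question based only on the provided context.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}\n"
    )


example_prompt = prompt_template_joined(
    "Explain what is Amblyopia?",
    ["first chunk\nwith two lines", "second chunk"],
)
print(example_prompt)
```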

In [None]:
def run_llm(user_message, model="llama-3.2-3b-preview"):
    system_message = "You are a helpful assistant. You are given a question and a context. You need to answer the question based on the context."
    messages = [{"role": "system", "content": system_message}]
    messages += [{"role": "user", "content": user_message}]
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    return completion.choices[0].message.content

In [None]:
prompt = prompt_template(question, retrieved_chunk)

run_llm(prompt)

'Based on the provided context, Amblyopia is a condition that can cause visual impairments in the eyes of individuals, particularly those with a history of strabismus (crossed eyes). In the context of the study, Amblyopic eyes showed improvement in visual acuity, with a mean improvement of 3.3 lines in distance acuity and 1.9 lines in near acuity over 12 months. It is also mentioned that in cases where the eye showing improvement was also amblyopic, the improvement occurred between 1 and 12 months from baseline and remained stable over the study period.'

### Test on an example from the dataset

In [None]:
data_id = 1

question, answers = df['questions'][data_id], df['answers'][data_id]

question_embeddings = get_text_embedding([question])
D, I = index.search(question_embeddings, k=3)
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
prompt = prompt_template(question, retrieved_chunk)

response = run_llm(prompt)
print("Question:", question)
print("RAG Response:", response)
print("Ground Truth:", answers)

Question: Landolt C and snellen e acuity: differences in strabismus amblyopia?
RAG Response: Based on the given context, the text mentions the following points regarding the differences in Landolt C and Snellen E acuity in the group of patients with strabismus amblyopia:

- The mean decimal values for Landolt C acuity (LR) and Snellen E acuity (SE) in this group were 0.14 and 0.16, respectively.
- The mean difference between LR and SE was 0.55 lines in this group.

So, in strabismus amblyopia, the Landolt C acuity was slightly worse compared to the Snellen E acuity, resulting in a 0.55 line difference between the two.
Ground Truth: Using the charts described, there was only a slight overestimation of visual acuity by the Snellen E compared to the Landolt C, even in strabismus amblyopia. Small differences in the lower visual acuity range have to be considered.
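
Comparing a response with the ground truth by eye does not scale past a few examples. As a rough first check, one could compute a token-overlap F1 score between the two strings. This is a crude lexical measure, not an official PubMedQA metric, and the helper names `tokenize` and `token_f1` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("a b c d", "a b"))  # precision 0.5, recall 1.0
```

A score like this only measures shared wording; a response can be correct with a low score, or wrong with a high one.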


## LangChain

In [None]:
!pip install langchain==0.1.13 langchain-community==0.0.29 langchain-openai==0.1.1 langchain-huggingface==0.0.3


In [None]:
from langchain_core.documents import Document
from langchain_community.chat_models.openai import ChatOpenAI
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain


In [None]:
# Load data
docs = [Document(page_content="\n".join(doc)) for doc in df['contexts']]
docs = docs[:200]


In [None]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
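
The idea behind a recursive character splitter is to try coarse separators first (paragraphs), and to fall back to finer ones (lines, words, single characters) only for pieces that are still too long. The sketch below is a simplified illustration of that idea, not LangChain's implementation: it has no chunk overlap, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate              # keep merging small pieces
            continue
        if current:
            chunks.append(current)           # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
print(recursive_split(sample, chunk_size=9))
# ['aaaa bbbb', 'cccc', 'dddd eeee']
```

Compared with the fixed-width slicing used in the from-scratch version, this keeps words and paragraphs intact whenever they fit within the size limit.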


In [None]:
# Define the embedding model
embeddings = SentenceTransformerEmbeddings(model_name="BAAI/bge-large-en-v1.5")
# Create the vector store
vector = FAISS.from_documents(documents, embeddings)
# Define a retriever interface
retriever = vector.as_retriever()


In [None]:
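# Define LLM
import os

# Read the Groq API key from the environment instead of hard-coding it.
# The variable name GROQ_API_KEY is an assumption; use whatever name you exported.
model = ChatOpenAI(
    model_name='llama-3.2-3b-preview',
    openai_api_key=os.environ["GROQ_API_KEY"],
    openai_api_base="https://api.groq.com/openai/v1"
)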



In [None]:
# Define prompt template
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}
""")

In [None]:
# Create a retrieval chain to answer questions
document_chain = create_stuff_documents_chain(model, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
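
Conceptually, the two chain constructors wire together the same steps that were written by hand in the first half of this guide: retrieve documents, fill them into the prompt, and call the model. The sketch below reproduces that flow with plain functions and stand-in components. It is an illustration under simplifying assumptions, not LangChain's implementation; `make_retrieval_chain`, `fake_retrieve`, and `fake_llm` are hypothetical.

```python
def make_retrieval_chain(retrieve, format_prompt, llm):
    """Compose retrieval, prompt filling, and generation into one callable."""
    def invoke(inputs):
        question = inputs["input"]
        docs = retrieve(question)                       # retrieval step
        prompt = format_prompt(                         # fill all documents into the prompt
            context="\n\n".join(docs), input=question
        )
        return {"input": question, "context": docs, "answer": llm(prompt)}
    return invoke


# Stand-ins so the sketch runs without a vector store or an API key.
def fake_retrieve(question):
    return ["chunk A", "chunk B"]


def fake_llm(prompt):
    return f"(model output for a {len(prompt)}-character prompt)"


template = "<context>\n{context}\n</context>\n\nQuestion: {input}"
chain = make_retrieval_chain(fake_retrieve, template.format, fake_llm)
result = chain({"input": "Explain what is Amblyopia?"})
print(sorted(result))  # ['answer', 'context', 'input']
```

The result is a dictionary that carries the retrieved context alongside the answer, which makes it possible to inspect what the model was shown.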

In [None]:
response = retrieval_chain.invoke({"input": "Explain what is Amblyopia?"})
print(response["answer"])

In [None]:
data_id = 1

question, answers = df['questions'][data_id], df['answers'][data_id]


response = retrieval_chain.invoke({"input": question})
print("Question:", question)
print("RAG Response:", response["answer"])
print("Ground Truth:", answers)