# Retrieval-augmented generation (RAG)

RAG is a very popular method for question answering with LLMs. Today we will go through a simple implementation.

![RAG Diagram](https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/jumpstart/jumpstart-fm-rag.jpg)

1. Build your source documents
    * Collect your documents
    * "Chunk" your documents
    * "Encode" your chunks into vectors with an embedding model
2. Query
    * Encode a question
3. Retrieval
    * Calculate similarity between the question and the sources
    * Return the best matching text
4. Build Prompt
    * Combine the query and context together into one prompt
5. Generation
    * Send to an LLM for an answer
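
The five steps above can be sketched end to end with a toy stand-in for the embedding model. This is only an illustration: `toy_embed`, `cosine`, `retrieve`, and the three example sentences are hypothetical, and a bag-of-words count vector replaces the real embedding model used later in this notebook.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]

# 1. Source documents, already chunked into sentences (made-up examples)
chunks = [
    "Madrid is the capital of Spain.",
    "Paella is a rice dish from Valencia.",
    "April is the fourth month of the year.",
]
# 2. Query
query = "What is the capital of Spain?"
# 3. Retrieval
context = retrieve(query, chunks, k=1)
# 4. Build prompt
prompt = f"Based on the following information: {' '.join(context)}\nAnswer the question: {query}"
# 5. Generation: `prompt` would now be sent to an LLM
print(prompt)
```

Word counts only match on shared words, which is why a real embedding model is used in the rest of the notebook.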


# Install and update some packages

In [1]:
!pip install nltk
!pip install sentence-transformers
!pip install -U torch
!pip install -U torchvision
!pip install -U transformers
import nltk
nltk.download('punkt_tab')

Defaulting to user installation because normal site-packages is not writeable


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/msuresh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [2]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
import torch
from tqdm import tqdm
import os
from sentence_transformers import SentenceTransformer

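os.environ['TOKENIZERS_PARALLELISM'] = 'false'  # turn off tokenizer parallelism
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'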

2025-03-12 15:02:20.828251: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-12 15:02:23.268421: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Dataset

* Loading and chunking

We will use the Simple English Wikipedia dataset, a version of Wikipedia that requires articles to be written in plain language.

1. We have articles
2. We need to turn these into resources we can give to an LLM
3. We do this by breaking our raw article into smaller chunks (in this case sentences)

This is a little **confusing** because there are two different 'tokenizers':
1. An nltk 'tokenizer' that takes raw article text and breaks it into sentences
2. The model 'tokenizer' that takes words and turns them into integer tokens, which is not the same thing
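
To make the distinction concrete, the sketch below contrasts the two kinds of 'tokenizer' using toy stand-ins: a regex sentence splitter in place of `nltk.sent_tokenize`, and a tiny hand-made vocabulary lookup in place of a model tokenizer. Both helpers (`split_sentences`, `to_token_ids`) and the vocabulary are hypothetical and far simpler than the real tools.

```python
import re

def split_sentences(text):
    """Toy sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def to_token_ids(sentence, vocab):
    """Toy model tokenizer: map each lowercase word to an integer ID."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [vocab.get(w, vocab["<unk>"]) for w in words]

article = "April is the fourth month. It has 30 days."
vocab = {"<unk>": 0, "april": 1, "is": 2, "the": 3, "fourth": 4, "month": 5}

sentences = split_sentences(article)      # text -> list of sentence strings
ids = to_token_ids(sentences[0], vocab)   # sentence -> list of integers

print(sentences)
print(ids)
```

The first function produces the chunks we retrieve over; the second produces the integers a model consumes.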

In [3]:
dataset_name = "wikipedia"
dataset_config_name = "20220301.simple"
split = 'all'

dataset = load_dataset(dataset_name, dataset_config_name, split=split)
documents = []
for article in dataset:
    content = article['text']
    # Keep only articles that mention Spain
    if 'spain' not in content.lower():
        continue

    # Break the article into sentences; each sentence is one chunk
    chunks = nltk.sent_tokenize(content)
    chunks = [c for c in chunks if len(c.split()) > 5]  # Get rid of short text
    documents.extend(chunks)


In [4]:
print(len(documents))

118167


In [5]:
for d in documents[0:10]: print(d)

April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May.
It is one of four months to have 30 days.
April always begins on the same day of week as July, and additionally, January in leap years.
April always ends on the same day of the week as December.
April's flowers are the Sweet Pea and Daisy.
The meaning of the diamond is innocence.
The Month 

April comes between March and May, making it the fourth month of the year.
It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
April begins on the same day of the week as July every year and on the same day of the week as January in leap years.
April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.


# Embedding and Similarity

* sentence_transformers is a package that converts sentences into vectors
* We can use cosine similarity, (A dot B) / (|A|*|B|), to see how similar two sentences are
* Try finding a sentence that has high similarity to l1 below
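
As a small worked example of the cosine similarity formula above, here is a minimal NumPy sketch. The helper name `cosine_similarity` and the example vectors are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """(A dot B) / (|A| * |B|) for two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # unrelated vectors
opposite = cosine_similarity([1, 0], [-1, 0])             # opposite vectors
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).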

In [6]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2').to(device)

In [7]:
import numpy as np
l1="RAG models are tools used for question and answering"
l2="Retrieval augmented generation models are used for Q&A pipelines"
l3="What is a RAG model?"
l4="What is a  retrieval augmented generation model?"
l5 ="CHANGE ME to find text with a high similarity"

test=embedding_model.encode([l1,l2,l3,l4,l5]) # one embedding row per sentence

mat=embedding_model.similarity(test,test)

print("Similarity Matrix\n\n",mat)

print('\nSimilarity with your Sentence',mat[0,4].item())

Similarity Matrix

 tensor([[ 1.0000,  0.4785,  0.6538,  0.4681,  0.1585],
        [ 0.4785,  1.0000,  0.0838,  0.6197,  0.2542],
        [ 0.6538,  0.0838,  1.0000,  0.3537, -0.0532],
        [ 0.4681,  0.6197,  0.3537,  1.0000,  0.1770],
        [ 0.1585,  0.2542, -0.0532,  0.1770,  1.0000]])

Similarity with your Sentence 0.15851107239723206


# Vector Store
This is the lingo for the collection of all our encoded chunks, kept in one place. It can get big and is often stored in a purpose-built database, but here we'll just embed our documents once and keep the result in memory.
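
An in-memory store is lost when the kernel restarts, and encoding is the slow step. One possible approach is to cache the embeddings on disk. The sketch below assumes a hypothetical helper `load_or_build` and a hypothetical file name `vector_store.npy`; random vectors stand in for the output of `embedding_model.encode(documents)`.

```python
import os
import tempfile
import numpy as np

def load_or_build(path, build_fn):
    """Load cached vectors from `path`, or build them once and save them."""
    if os.path.exists(path):
        return np.load(path)
    vectors = build_fn()
    np.save(path, vectors)
    return vectors

def fake_encode():
    """Stand-in for embedding_model.encode(documents): 5 chunks, 8 dimensions."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(5, 8)).astype(np.float32)

cache_path = os.path.join(tempfile.mkdtemp(), "vector_store.npy")
first = load_or_build(cache_path, fake_encode)   # computed, then saved
second = load_or_build(cache_path, fake_encode)  # loaded from disk
print(first.shape, np.array_equal(first, second))
```

The cache is only valid while the documents and the embedding model stay the same, so the file should be deleted whenever either changes.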


In [8]:
if 'vector_store' not in locals():
    vector_store=embedding_model.encode(documents)

# Query and Context


We now want to get some chunks of text that match a query:
1. Encode Query
2. Calculate similarity with the vector store
3. Pull the sentences with the highest similarity

Combine those as context for our LLM. 
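
The three retrieval steps can be sketched with plain NumPy on a tiny made-up store. The helper `top_k` and the 2-D vectors are hypothetical; the notebook itself uses `embedding_model.similarity` and `torch.topk` for the same job.

```python
import numpy as np

def top_k(query_vec, store, k=3):
    """Return (indices, scores) of the k rows of `store` most similar to `query_vec`."""
    # Normalizing first makes the dot product equal to cosine similarity
    store_norm = store / np.linalg.norm(store, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = store_norm @ query_norm
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# A toy vector store: one row per chunk
store = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query_vec = np.array([1.0, 0.1])

indices, scores = top_k(query_vec, store, k=2)
print(indices, scores)
```

The returned indices are then used to look up the matching sentences in `documents`.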


In [22]:
#query = "What is Paella?"
#query = "What is the captial of Spain?"
query = "What are the main differences between machine learning and deep learning?"

query_encode=embedding_model.encode(query)

sim_mat=embedding_model.similarity(query_encode,vector_store).squeeze()

vals,indices=torch.topk(sim_mat,3)
context=""

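window=0 # Number of neighboring sentences to include on each side of a match
for i,idx in enumerate(indices):
    idx=int(idx) # tensor -> plain int for list slicing
    start=max(idx-window,0) # avoid a negative start wrapping around the list
    passage=" ".join(documents[start:idx+window+1])
    print("---")
    print(passage,vals[i])
    context+=passage # note: matches are appended with no separator
    print("---")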



---
Each has its own unique characteristics. tensor(0.3310)
---
---
The exact reasons for each is not yet clearly understood. tensor(0.3299)
---
---
They say that even complex computers cannot model connections between molecules, cells, tissues, organs, organisms, and the environment. tensor(0.3167)
---


# Build our Prompt

Now we just want to combine our query and context together. This is the augmentation part of RAG.
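
A minimal sketch of one alternative way to lay out the prompt, with each retrieved chunk on its own line so sentences do not run together. The helper `build_prompt` and the example chunks are hypothetical; the wording of the prompt follows the one used in this notebook.

```python
def build_prompt(query, chunks):
    """Put each retrieved chunk on its own bullet line, then the question."""
    context = "\n".join(f"- {c.strip()}" for c in chunks)
    return (
        "Based on the following information:\n"
        f"{context}\n"
        f"Answer the question: {query}"
    )

example = build_prompt(
    "What is Paella?",
    ["Paella is a rice dish.", "It comes from Valencia."],
)
print(example)
```

Whether a bulleted layout helps a given model is something to try out rather than assume.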

In [23]:

prompt = f"Based on the following information: {context}\nAnswer the question: {query}"

print(prompt)



Based on the following information: Each has its own unique characteristics.The exact reasons for each is not yet clearly understood.They say that even complex computers cannot model connections between molecules, cells, tissues, organs, organisms, and the environment.
Answer the question: What are the main differences between machine learning and deep learning?


# Pass to a Generative LLM

We're going to use Granite, a 'small' large language model from IBM that will run on this class's smaller GPUs.

In [11]:
model_path = "ibm-granite/granite-3.1-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto')
model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


GraniteForCausalLM(
  (model): GraniteModel(
    (embed_tokens): Embedding(49155, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x GraniteDecoderLayer(
        (self_attn): GraniteSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): GraniteMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): GraniteRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): GraniteRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): GraniteRMSNorm((20

# LLM Templates

Many models have their own format for processing 'chats', broken into categories like roles, context, tasks, etc. We can use the model's template, which is saved with its tokenizer.
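
To show what a chat template does, here is a hand-rolled imitation of the role markers that appear in this notebook's output. It is not the real Granite template, which lives in the tokenizer and also inserts a default system message; the function `render_chat` is hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Wrap each message in role markers, imitating the format seen in the output."""
    parts = [
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|start_of_role|>assistant<|end_of_role|>")
    return "".join(parts)

chat_text = render_chat([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is Paella?"},
])
print(chat_text)
```

In practice `tokenizer.apply_chat_template` should be used, since each model expects its own markers.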

In [12]:
chat = [
    { "role": "user", "content": prompt },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)

print(chat)

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: March 12, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Based on the following information: Spain is a country in Southern Europe.Spain is a country in Europe.It used to be part of the Spanish Empire.
Answer the question: What is the captial of Spain?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>


# Run LLM

In [13]:
# generate output tokens
output = model.generate(**input_tokens, 
                        max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)

["<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: March 12, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Based on the following information: Spain is a country in Southern Europe.Spain is a country in Europe.It used to be part of the Spanish Empire.\nAnswer the question: What is the captial of Spain?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>The information provided does not include details about the capital of Spain. The capital of Spain is Madrid.<|end_of_text|>"]
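
The decoded output repeats the whole prompt before the answer. A minimal sketch of pulling out just the assistant's reply by splitting on the role markers shown above; the helper `extract_answer` is hypothetical and the example string is a shortened version of the output.

```python
ASSISTANT_MARK = "<|start_of_role|>assistant<|end_of_role|>"
END_MARK = "<|end_of_text|>"

def extract_answer(decoded):
    """Return only the text of the last assistant turn."""
    answer = decoded.rsplit(ASSISTANT_MARK, 1)[-1]
    return answer.replace(END_MARK, "").strip()

decoded = (
    "<|start_of_role|>user<|end_of_role|>What is the capital of Spain?<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>The capital of Spain is Madrid.<|end_of_text|>"
)
answer = extract_answer(decoded)
print(answer)
```

Another option is to slice the prompt tokens off the generated tensor before decoding and pass `skip_special_tokens=True` to `batch_decode`.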


# Exercises
1. Ask a question about Spain
    * Try running the above with a new question
    * Did the model find useful context?
    * Did the LLM answer correctly?


This is the prompt that was built: "Based on the following information: Spain is a country in Southern Europe.Spain is a country in Europe.It used to be part of the Spanish Empire.Answer the question: What is the captial of Spain?". The retriever found somewhat relevant context: general information about Spain, but nothing about its capital. The LLM noted that the context did not include the capital and answered "Madrid" anyway, so the answer was correct but came from the model's general knowledge rather than from the retrieved text. The question would probably need to mention Madrid for the retriever to pull relevant information.

2. Ask a question that isn't about Spain
    * Try running the above with a new question
    * Did the model find useful context?
    * Did the LLM answer correctly?

This is the prompt that was built: "Based on the following information: Each has its own unique characteristics.The exact reasons for each is not yet clearly understood.They say that even complex computers cannot model connections between molecules, cells, tissues, organs, organisms, and the environment. Answer the question: What are the main differences between machine learning and deep learning?". No, the model did not find useful context or answer the question correctly.

3. Based on your answers above, describe some strengths and weaknesses of this process.

Some strengths are that the system attempts to retrieve relevant information before answering, which improves accuracy when retrieval works. If retrieval is well implemented, the system can dynamically pull from vast datasets, making it useful for a wide range of queries. Some weaknesses are that the retriever sometimes pulls unrelated or incomplete context, as in the questions above, and that when this happens the model may prioritize the retrieved context over its general knowledge, leading to flawed answers.
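
One possible mitigation for the unrelated-context weakness is to drop retrieved chunks whose similarity score is too low. This is only a sketch: the helper `build_prompt_with_fallback` and the 0.5 threshold are assumptions, not something tuned in this notebook. The chunks and scores are the ones retrieved above for the machine learning question.

```python
def build_prompt_with_fallback(query, chunks, scores, min_score=0.5):
    """Use only chunks scoring at least `min_score`; otherwise ask with no context."""
    kept = [c for c, s in zip(chunks, scores) if s >= min_score]
    if not kept:
        # Nothing relevant was retrieved, so do not distract the LLM
        return f"Answer the question: {query}"
    return (
        f"Based on the following information: {' '.join(kept)}\n"
        f"Answer the question: {query}"
    )

query = "What are the main differences between machine learning and deep learning?"
chunks = [
    "Each has its own unique characteristics.",
    "The exact reasons for each is not yet clearly understood.",
    "They say that even complex computers cannot model connections between "
    "molecules, cells, tissues, organs, organisms, and the environment.",
]
scores = [0.3310, 0.3299, 0.3167]

prompt_low = build_prompt_with_fallback(query, chunks, scores)
print(prompt_low)
```

With every score near 0.33, all three chunks fall under the threshold and the question is sent without context. A suitable threshold depends on the embedding model and the data.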

# Controlling Generation (optional)
The output of LLMs can be controlled. If you want to dig deeper, take a look at https://huggingface.co/docs/transformers/en/main_classes/text_generation for the things you can tune. How does this affect your output?
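
As one example of a tunable setting, `temperature` rescales the model's scores before sampling. The sketch below shows the effect on a made-up set of three logits; the function `softmax_with_temperature` is hypothetical and only illustrates the idea, not the library's internals.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp, flat)
```

A low temperature concentrates probability on the top token, while a high temperature spreads it out and gives more varied text. In `model.generate`, temperature only takes effect when sampling is enabled with `do_sample=True`.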