## Lab 4: RAG - Getting Started with Retrieval-Augmented Generation


Jay Urbain, PhD
2/11/2025


Retrieval-Augmented Generation (RAG) is a method for improving the output of language models by incorporating external knowledge sources, such as databases, knowledge graphs, or search engines. The basic idea is to retrieve information relevant to the input query from an external source and add it to the prompt, so the model can ground its answer in that context.
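
To make the retrieve-then-augment idea concrete before the real tools come in, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the three-sentence `DOCS` list stands in for the knowledge source, a bag-of-words vector stands in for a learned embedding, and the helper names (`bow`, `cosine`, `retrieve`, `build_prompt`) are made up for this example. The resulting prompt is what would be handed to a language model.

```python
import math
from collections import Counter

# Hypothetical three-document "knowledge source", for illustration only.
DOCS = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Llama 3 is a family of large language models released by Meta.",
    "Wikipedia is a free online encyclopedia written by volunteers.",
]

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector; a crude stand-in for an embedding."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bow(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, k: int = 1) -> str:
    """Augment the question with the retrieved context."""
    context = "\n".join(retrieve(query, k))
    return f"Question: {query}\nContext:\n{context}"

print(build_prompt("What is FAISS?"))
```

The rest of the lab follows the same shape, with a sentence-transformer model for the embeddings, a FAISS index for retrieval, and Llama 3 for generation.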

This lab is built entirely with open-source tools. It can be run locally, in Colab, or in another cloud computing environment.

At the bottom of the lab there are questions and further experiments to complete.

Resources:   

embedding model : https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1  
dataset : https://huggingface.co/datasets/not-lain/wikipedia  
faiss docs : https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Dataset.add_faiss_index  
chatbot : https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct  
Full documentation : https://huggingface.co/blog/not-lain/rag-chatbot-using-llama3  

## Installations

In [2]:
# !pip install --upgrade pip
# !pip install -q datasets sentence-transformers faiss-cpu accelerate bitsandbytes
# !pip install -q datasets sentence-transformers faiss-gpu-cu12 accelerate bitsandbytes




In [5]:
!pip install faiss-gpu-cu12

Collecting faiss-gpu-cu12
  Downloading faiss_gpu_cu12-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading faiss_gpu_cu12-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.9/47.9 MB 46.8 MB/s eta 0:00:00
Installing collected packages: faiss-gpu-cu12
Successfully installed faiss-gpu-cu12-1.10.0


## Load a cleaned subset of Wikipedia

https://huggingface.co/datasets/not-lain/wikipedia

https://huggingface.co/docs/datasets/en/tutorial

In [6]:
from datasets import load_dataset

dataset = load_dataset("not-lain/wikipedia")


In [7]:
# Review dataset

dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 3000
    })
})

In [8]:
#dataset['train'][0]

## Load a sentence transformer for embedding

In [9]:
from sentence_transformers import SentenceTransformer
ST = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")


## Embed the dataset

Note: this is a little slow. GPUs help.

In [10]:
def embed(batch):
    """
    adds a column to the dataset called 'embeddings'
    """
    # or you can combine multiple columns here
    # For example the title and the text
    information = batch["text"]
    return {"embeddings" : ST.encode(information)}

dataset = dataset.map(embed, batched=True,batch_size=16)

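
The comment inside `embed` mentions that several columns can be combined before encoding. One possible way to do that is sketched below; `combine_title_text` is a hypothetical helper name, and the two-row batch is hand-made sample data in the column-oriented shape that `Dataset.map(..., batched=True)` passes to its function. No model is called, so the snippet runs on its own.

```python
def combine_title_text(batch: dict) -> list[str]:
    """Join each title with its text so both contribute to the embedding input."""
    return [f"{title}\n\n{text}" for title, text in zip(batch["title"], batch["text"])]

# Hand-made sample batch (not taken from the real dataset).
batch = {
    "title": ["Anarchism", "Capitalism"],
    "text": ["Anarchism is a political philosophy.", "Capitalism is an economic system."],
}

combined = combine_title_text(batch)
print(combined[0])
```

Inside `embed`, the returned list could be passed to `ST.encode` in place of `batch["text"]`. Whether this helps retrieval is something to check empirically with your own queries.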

## Examine the contents of the dataset

In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text', 'embeddings'],
        num_rows: 3000
    })
})

## Save your dataset with embeddings

In [12]:
#dataset.push_to_hub("not-lain/wikipedia", revision="embedded")
dataset.save_to_disk('not-lain_wikipedia')



Verify that the saved dataset loads back from disk.

In [13]:
from datasets import load_from_disk
dataset = load_from_disk('not-lain_wikipedia')


## Index the embeddings with the faiss vector database

faiss docs : https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Dataset.add_faiss_index

https://huggingface.co/docs/datasets/v1.17.0/faiss_es.html

In [14]:
data = dataset["train"]
data = data.add_faiss_index("embeddings")


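
For intuition about what a nearest-neighbor lookup over embeddings returns, here is a brute-force sketch in plain Python. It is a teaching aid under stated assumptions, not a description of how the FAISS index is implemented: the distance measure (Euclidean), the helper names `l2_distance` and `nearest`, and the toy 2-D vectors are all chosen for illustration.

```python
import math

def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query: list[float], vectors: list[list[float]], k: int = 2) -> list[tuple[float, int]]:
    """Return (distance, row index) pairs for the k vectors closest to the query."""
    scored = sorted((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

# Toy 2-D "embeddings"; real embeddings have many more dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]

for distance, index in nearest([0.9, 0.1], vectors, k=2):
    print(f"row {index}: distance {distance:.3f}")
```

A scan like this compares the query against every row, which is fine for a few thousand vectors but motivates dedicated index structures at larger scale.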

## On to search!

Search query function

In [15]:
def search(query: str, k: int = 3 ):
    """Embed a query and return the scores and rows of the k nearest dataset entries."""
    embedded_query = ST.encode(query) # embed new query
    scores, retrieved_examples = data.get_nearest_examples( # retrieve results
        "embeddings", embedded_query, # compare our new embedded query with the dataset embeddings
        k=k # get only top k results
    )
    return scores, retrieved_examples


Experiment with search

In [16]:
# search for the word "anarchy" and return the 4 best-matching rows from the dataset
scores , result = search("anarchy", 4 )
result['title']


['Anarchism', 'Anarcho-capitalism', 'Community', 'Capitalism']

## Build a complete search agent

The cell below repeats the steps above in one place. Instead of re-embedding the dataset, it loads the `embedded` revision from the Hub, which already contains the `embeddings` column.



In [None]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

ST = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

dataset = load_dataset("not-lain/wikipedia",revision = "embedded")

data = dataset["train"]
data = data.add_faiss_index("embeddings") # column name that has the embeddings of the dataset

def search(query: str, k: int = 3 ):
    """Embed a query and return the scores and rows of the k nearest dataset entries."""
    embedded_query = ST.encode(query) # embed new query
    scores, retrieved_examples = data.get_nearest_examples( # retrieve results
        "embeddings", embedded_query, # compare our new embedded query with the dataset embeddings
        k=k # get only top k results
    )
    return scores, retrieved_examples



## Load the Llama 3 LLM to serve as our search agent

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# use quantization to lower GPU usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config
)
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

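
To see why the 4-bit configuration above lowers GPU usage, a rough back-of-the-envelope estimate of weight memory can help. The sketch below assumes roughly 8 billion parameters (read off the "8B" in the model name) and counts weights only. It ignores activations, the KV cache, and any extra metadata the quantized format stores, so treat the numbers as order-of-magnitude guidance. `weight_memory_gb` is a hypothetical helper written for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: about 8 billion parameters, taken from the "8B" in the model name.
NUM_PARAMS = 8e9

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Actual usage will be higher than the weights-only figure, and it grows with prompt length and `max_new_tokens`.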

## Prompt engineering

In [18]:
SYS_PROMPT = """You are an assistant for answering questions.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer."""


In [19]:
def format_prompt(prompt,retrieved_documents,k):
  """using the retrieved documents we will prompt the model to generate our responses"""
  PROMPT = f"Question:{prompt}\nContext:"
  for idx in range(k) :
    PROMPT+= f"{retrieved_documents['text'][idx]}\n"
  return PROMPT

def generate(formatted_prompt):
  formatted_prompt = formatted_prompt[:2000] # truncate to 2000 characters (not tokens) to avoid GPU OOM
  messages = [{"role":"system","content":SYS_PROMPT},{"role":"user","content":formatted_prompt}]
  # apply the chat template and tokenize the conversation
  input_ids = tokenizer.apply_chat_template(
      messages,
      add_generation_prompt=True,
      return_tensors="pt"
  ).to(model.device)
  outputs = model.generate(
      input_ids,
      max_new_tokens=1024,
      eos_token_id=terminators,
      do_sample=True,
      temperature=0.6,
      top_p=0.9,
  )
  response = outputs[0][input_ids.shape[-1]:]
  return tokenizer.decode(response, skip_special_tokens=True)

def rag_search_agent(prompt:str,k:int=2):
  scores , retrieved_documents = search(prompt, k)
  formatted_prompt = format_prompt(prompt,retrieved_documents,k)
  return generate(formatted_prompt)
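
Because `generate` cuts the formatted prompt at 2000 characters, a long first document can push later documents out of the context entirely. One possible alternative, sketched here as an assumption rather than as part of the lab's pipeline, is to give each retrieved text an equal share of the character budget. `format_prompt_with_budget` is a hypothetical name, and the two "documents" are made-up strings.

```python
def format_prompt_with_budget(prompt: str, texts: list[str], budget: int = 2000) -> str:
    """Build the prompt so every retrieved text gets an equal share of `budget` characters."""
    header = f"Question: {prompt}\nContext:\n"
    if not texts:
        return header
    # Reserve room for the header and for the newlines that separate documents.
    available = budget - len(header) - (len(texts) - 1)
    per_doc = max(available // len(texts), 0)
    pieces = [text[:per_doc] for text in texts]
    return header + "\n".join(pieces)

# Two made-up "documents": one much longer than the budget, one short.
docs = ["A" * 5000, "B" * 50]
out = format_prompt_with_budget("what's anarchy ?", docs, budget=200)
print(len(out), out.count("A"), out.count("B"))
```

With the notebook's objects, `texts` would be `retrieved_documents['text']`. Equal shares are only one policy; weighting by retrieval score or truncating by tokens rather than characters are other options to experiment with.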


In [21]:
rag_search_agent("what's anarchy ?", k = 2)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


"Anarchy, in this context, refers to a political philosophy and movement that seeks to abolish institutions that maintain unnecessary coercion and hierarchy. Anarchists believe that the state and capitalism are the main sources of oppression and seek to replace them with stateless societies and voluntary free associations. This philosophy has been around for centuries, with roots in the Enlightenment, and has played a significant role in workers' struggles for emancipation throughout history."


TODO:

What algorithm is used in `get_nearest_examples`?

What is quantization?

Create embeddings using the title and the text. Experiment with different queries and see if you can see a difference.

Perform prompt engineering to improve search results. List your experiments and results.

Experiment with at least one other embedding method.

How can this search agent be improved?

Extra credit: Try another dataset.

Submit your notebook and a PDF report with your answers to the questions above and the results from your experiments.

Also please provide feedback on the lab.