<a href="https://colab.research.google.com/github/hussamalafandi/Generative_AI/blob/main/notebooks/10/10_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG): Enhancing LLMs with External Knowledge

## Introduction to Retrieval-Augmented Generation (RAG)

<div style="text-align: center;">
    <img src="rag_flow.png" alt="RAG Flow Diagram" style="width: 60%; height: auto;">
</div>

## What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by integrating external knowledge retrieved from document databases or knowledge stores. Unlike conventional generative models, which rely solely on the parameters learned during training, a RAG system fetches up-to-date, contextually relevant information at query time, which can improve the accuracy, reliability, and usefulness of the generated responses.

The core idea behind RAG is simple yet powerful:

- **Retrieve**: When a user provides a query or prompt, RAG first retrieves relevant documents or passages from an external knowledge base.

- **Generate**: The model then uses the retrieved documents as context to generate accurate, informed, and detailed responses.
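
The two steps can be sketched with a toy example. This is only an illustration under simplifying assumptions: the `retrieve` and `generate` helpers, the word-overlap scorer, and the two-passage knowledge base are all hypothetical, and `generate` merely assembles the prompt an LLM would receive instead of calling a model.

```python
# Toy retrieve-then-generate loop (illustrative only, not a real RAG stack).

def retrieve(query, knowledge_base, top_k=1):
    """Rank passages by how many query words they share (hypothetical scorer)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(passage.lower().split())), passage)
        for passage in knowledge_base
    ]
    # Highest overlap first; the sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def generate(query, passages):
    """Stand-in for an LLM call: it only builds the prompt the model would see."""
    context = "\n".join(passages)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"


toy_knowledge_base = [
    "FAISS is a library for similarity search over dense vectors.",
    "BM25 is a keyword-based ranking function.",
]
toy_query = "what is bm25"
toy_passages = retrieve(toy_query, toy_knowledge_base)
toy_prompt = generate(toy_query, toy_passages)
print(toy_prompt)
```

In a real system the word-overlap scorer would be replaced by a sparse or dense retriever, and `generate` would send the prompt to a language model.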

### Why Retrieval Matters in Generative AI

Retrieval methods address fundamental limitations of purely parametric generative models:

* **Factual Accuracy**: Retrieval lets models draw on current, accurate data rather than relying solely on training data that may be out of date.
* **Reducing Hallucinations**: Grounding generation in retrieved information reduces the likelihood of incorrect, nonsensical, or fabricated output.
* **Scalability**: Retrieval lets LLMs use large, dynamic knowledge bases without retraining the model whenever the information changes.

### Limitations of Traditional LLMs

Traditional language models have some well-known drawbacks:

* **Hallucination**: Generating plausible but incorrect or unsupported information.
* **Stale Knowledge**: Limited to static training data, lacking awareness of recent updates or newly available information.
* **Context Limitations**: LLMs have fixed-size context windows, so without a retrieval step that selects the relevant passages they cannot reference an extensive external knowledge base.

### Real-world Examples and Use Cases

**Knowledge-base Q&A Systems**
* Quickly answering user questions by retrieving precise, authoritative information from structured or unstructured sources.
* Example: Customer support systems retrieving relevant FAQ or product manuals to answer customer queries.

**Chatbots with External Knowledge Bases**
* Dynamic chatbots integrated with knowledge bases or external databases to offer up-to-date, personalized interactions.
* Example: Travel assistant chatbot retrieving flight schedules, weather data, and travel restrictions.

**Enterprise-level AI Assistants**
* Assisting professionals in fields such as law, medicine, or technical documentation by providing quick access to domain-specific knowledge.
* Example: Medical assistants that generate treatment suggestions based on the latest clinical guidelines and patient histories.

## Core Concepts and Components of RAG

To effectively build and deploy Retrieval-Augmented Generation systems, it’s crucial to understand their core components: the **Retriever**, the **Generator (Reader)**, and the overall **End-to-End Flow**.

### Retriever

The retriever component is responsible for identifying and fetching the most relevant documents or information chunks from an external knowledge base given a query. Retrieval methods typically fall into two categories: **Sparse** and **Dense**.

##### Sparse Methods (Keyword-Based)

Sparse retrieval methods rely on exact term matches and statistical weighting (like TF-IDF or BM25).

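* **TF-IDF**: Weights a term by how often it appears in a document (term frequency), scaled down by how many documents in the collection contain it (inverse document frequency).
* **BM25**: An improvement on TF-IDF that normalizes for document length and saturates the contribution of repeated terms.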
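
To make the length normalization and term saturation in BM25 concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. The helper name `bm25_scores`, the naive whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; production systems usually rely on a search library instead.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Naive tokenizer: lowercase, drop periods, split on whitespace
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Non-negative IDF variant: rarer terms receive a larger weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through this factor
            length_norm = 1 - b + b * len(tokens) / avg_len
            # The ratio saturates: repeating a term yields diminishing gains
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


bm25_docs = [
    "The cat sat on the mat.",
    "Dogs and cats are pets.",
    "The mat was red and soft.",
    "Pets are lovely companions.",
]
bm25_result = bm25_scores("red mat", bm25_docs)
bm25_best = max(range(len(bm25_docs)), key=lambda i: bm25_result[i])
print(bm25_docs[bm25_best])
```

Documents that share no term with the query score exactly zero, which is the typical weakness of sparse methods that dense retrieval tries to address.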

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
docs = [
    "The cat sat on the mat.",
    "Dogs and cats are pets.",
    "The mat was red and soft.",
    "Pets are lovely companions."
]

# Step 1: Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(docs)

# Step 3: Get the list of terms (features)
terms = vectorizer.get_feature_names_out()

# Step 4: Convert the TF-IDF matrix into a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)

# Step 5: Display the terms
print("Vocabulary terms:\n")
print(terms)

# Step 6: Display the TF-IDF matrix nicely
print("\nTF-IDF Weighted Document-Term Matrix:\n")
tfidf_df.round(3)


Vocabulary terms:

['and' 'are' 'cat' 'cats' 'companions' 'dogs' 'lovely' 'mat' 'on' 'pets'
 'red' 'sat' 'soft' 'the' 'was']

TF-IDF Weighted Document-Term Matrix:



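|   | and | are | cat | cats | companions | dogs | lovely | mat | on | pets | red | sat | soft | the | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.405 | 0.0 | 0.0 | 0.0 | 0.0 | 0.319 | 0.405 | 0.0 | 0.0 | 0.405 | 0.0 | 0.638 | 0.0 |
| 1 | 0.401 | 0.401 | 0.0 | 0.509 | 0.0 | 0.509 | 0.0 | 0.0 | 0.0 | 0.401 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.357 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.357 | 0.0 | 0.0 | 0.453 | 0.0 | 0.453 | 0.357 | 0.453 |
| 3 | 0.0 | 0.438 | 0.0 | 0.0 | 0.555 | 0.0 | 0.555 | 0.0 | 0.0 | 0.438 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |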


##### **Exercise**:
Modify the code above to find which document is most relevant to the query "cats love mats".
(Hint: vectorize the query with the fitted vectorizer and compute its cosine similarity to each document.)

##### Dense Retrieval (Embeddings)

Dense retrieval methods use vector embeddings to capture semantic similarity rather than exact keyword matches.
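
Semantic similarity between two embeddings is commonly measured with cosine similarity, the cosine of the angle between the vectors. The sketch below computes it by hand on made-up 3-dimensional vectors; real embeddings have hundreds of dimensions, and the labels and values here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (assumption: real ones come from an embedding model)
toy_query_vec = [1.0, 0.0, 1.0]
toy_doc_vecs = {
    "same direction": [2.0, 0.0, 2.0],
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
}

toy_similarities = {
    label: cosine_similarity(toy_query_vec, vec) for label, vec in toy_doc_vecs.items()
}
for label, value in toy_similarities.items():
    print(f"{label}: {value:.3f}")
```

Because the measure depends only on direction, a vector and any positive multiple of it have a similarity of 1, while orthogonal vectors score 0.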

In [10]:
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create document embeddings
doc_embeddings = model.encode(docs, convert_to_tensor=True)

# Query
query = "A soft mat for pets"
query_embedding = model.encode(query, convert_to_tensor=True)

# Compute similarity
cos_scores = util.cos_sim(query_embedding, doc_embeddings)

# Find the most similar document (convert the tensor index to a plain int)
most_similar_idx = int(cos_scores.argmax())
print(f"Most similar document to '{query}': {docs[most_similar_idx]}")


Most similar document to 'A soft mat for pets': The cat sat on the mat.


##### **Exercise**

Try a different query like "Companions for humans" and check which document ranks highest!

### Vector Database

Vector databases efficiently store, manage, and retrieve dense embeddings at scale. They are critical in modern RAG implementations.

Popular options:

* **FAISS** (open-source, very fast for local, in-memory search)
* **ChromaDB** (easy for prototyping)
* **Pinecone** (scalable, cloud-based)

> To run the next code cells, you need to [install FAISS](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md) (`pip install faiss-cpu`) and ChromaDB (`pip install chromadb`).

We will use FAISS to build a fast in-memory index of document embeddings and perform a similarity search.


In [None]:
import faiss
import numpy as np

# Convert embeddings to numpy
doc_embeddings_np = doc_embeddings.cpu().detach().numpy()

# Build FAISS index
dimension = doc_embeddings_np.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings_np)

# Search with query
query_np = query_embedding.cpu().detach().numpy().reshape(1, -1)
distances, indices = index.search(query_np, k=1)

print(f"Most similar document using FAISS: {docs[indices[0][0]]}")

Most similar document using FAISS: The cat sat on the mat.
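
`IndexFlatL2` ranks by squared Euclidean distance, whereas the earlier dense-retrieval cell used cosine similarity. When the vectors are unit-length, the two are related by `||a - b||^2 = 2 - 2 * cos(a, b)`, so both produce the same ranking. The sketch below checks this on made-up vectors; whether a given embedding model returns unit-length vectors is an assumption you should verify for your own model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors, normalized so that dot product equals cosine similarity
unit_query = normalize([1.0, 2.0, 0.5])
unit_docs = [normalize(v) for v in ([1.0, 1.5, 0.0], [0.0, 0.2, 3.0], [2.0, 4.0, 1.0])]

# Each pair holds (squared L2 distance, 2 - 2 * cosine); the values should match
identity_pairs = [
    (squared_l2(unit_query, vec), 2 - 2 * dot(unit_query, vec)) for vec in unit_docs
]

# Smallest distance first versus largest cosine first
by_distance = sorted(range(len(unit_docs)), key=lambda i: squared_l2(unit_query, unit_docs[i]))
by_cosine = sorted(range(len(unit_docs)), key=lambda i: -dot(unit_query, unit_docs[i]))
print(by_distance, by_cosine)
```

If the embeddings are not normalized, the two rankings can differ, so it is worth choosing the index type to match the similarity measure you intend to use.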


Now, we'll use ChromaDB to store both embeddings and documents, and perform a semantic search with document retrieval.

In [None]:
import chromadb

# Initialize ChromaDB client
chroma_client = chromadb.Client()

# Create a collection (like an index in FAISS)
collection = chroma_client.create_collection(name="my_collection")

# Add documents and embeddings
collection.add(
    embeddings=doc_embeddings_np.tolist(),   # convert numpy to list
    documents=docs,                          # documents list
    ids=[str(i) for i in range(len(docs))]   # unique string IDs
)

# Query ChromaDB
results = collection.query(
    query_embeddings=query_np.tolist(),  # query as a list
    n_results=1
)

print(f"Most similar document using ChromaDB: {results['documents'][0][0]}")

Most similar document using ChromaDB: The cat sat on the mat.


##### Key Differences: **FAISS** vs **ChromaDB**

* **FAISS** stores only **embeddings**, while **ChromaDB** stores **both embeddings and documents**.
* With **FAISS**, you work with implicit numeric indices; **ChromaDB** requires you to provide **document IDs**.
* **FAISS** returns distances and indices; **ChromaDB** directly returns the **matching documents** along with their distances.
* **ChromaDB** also supports **persistence** out of the box, making it easy to save and reload collections.

> **Note:**  
> This notebook shows minimal FAISS and ChromaDB usage for RAG systems. For production, consider proper persistence and indexing configurations.


##### **Exercise**

Index more documents and retrieve the top 3 most similar documents instead of just 1.

### Reader/Generator

After retrieval, the generator (also called the reader) synthesizes the retrieved information into a coherent, relevant answer.

The retrieved documents are placed in the model's input prompt as context, so the response can be grounded in that information rather than relying solely on what the model learned during training.

In [31]:
# Retrieved document
context = results['documents'][0][0]

# User question
question = "Where do cats usually sit?"

# Simple prompt
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"

print(prompt)


Context: The cat sat on the mat.

Question: Where do cats usually sit?

Answer:
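
When more than one passage is retrieved, the same prompt pattern extends naturally. The sketch below uses a hypothetical `build_prompt` helper that numbers each passage inside the context block; the numbering format and the example passages are illustrative choices, not a required convention.

```python
def build_prompt(passages, question):
    """Number each retrieved passage and place them all in the context block."""
    context = "\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


example_passages = ["The cat sat on the mat.", "The mat was red and soft."]
multi_prompt = build_prompt(example_passages, "What colour was the mat?")
print(multi_prompt)
```

Numbering the passages makes it possible to ask the model to cite which passage supports its answer. The cells that follow continue with the single-passage `prompt`.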


Now we pass the prompt to an LLM to generate an answer grounded in the retrieved context.

In [33]:
from transformers import pipeline

# Create a text-generation pipeline with Gemma 3 (1B, instruction-tuned), running on CPU
generator = pipeline("text-generation", model="google/gemma-3-1b-it", device='cpu')

# Generate an answer
response = generator(prompt, max_new_tokens=100, do_sample=True)

# Print the generated answer
print(response[0]['generated_text'])


Device set to use cpu


Context: The cat sat on the mat.

Question: Where do cats usually sit?

Answer: Mats.
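
The generated text echoes the prompt before the answer. One way to keep only the answer is to strip the prompt prefix, as in the sketch below; `extract_answer` is a hypothetical helper and the sample strings are copied from the output above. The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text.

```python
def extract_answer(generated_text, prompt_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):].strip()
    # Fallback: keep whatever follows the last "Answer:" marker
    return generated_text.rsplit("Answer:", 1)[-1].strip()


sample_prompt = (
    "Context: The cat sat on the mat.\n\n"
    "Question: Where do cats usually sit?\n\n"
    "Answer:"
)
sample_output = sample_prompt + " Mats."
extracted = extract_answer(sample_output, sample_prompt)
print(extracted)
```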



##### **Exercise**

Rewrite the prompt to:

- Instruct the model to only answer using the provided context.
- Tell the model to say "I don't know" if the answer is missing.

## Additional Resources

* LangChain Conceptual Guides [RAG](https://python.langchain.com/docs/concepts/rag/) and [RAG From Scratch](https://github.com/langchain-ai/rag-from-scratch)
* LangChain Tutorial on RAG [Part 1](https://python.langchain.com/docs/tutorials/rag/) and [Part 2](https://python.langchain.com/docs/tutorials/qa_chat_history/)
* WandB Course [RAG++ : From POC to Production](https://wandb.ai/site/courses/rag)
* LangChain [Multi-Vector Retriever](https://blog.langchain.dev/semi-structured-multi-modal-rag/)
* [How to pass multimodal data to models](https://python.langchain.com/docs/how_to/multimodal_inputs/)
* [Chroma multi-modal RAG](https://github.com/langchain-ai/langchain/blob/master/cookbook/multi_modal_RAG_chroma.ipynb)
* [pdf-retrieval-with-ColQwen2-vlm_Vespa-cloud](https://colab.research.google.com/github/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/pdf-retrieval-with-ColQwen2-vlm_Vespa-cloud.ipynb#scrollTo=PUqnrKWLak3O)