**RAG (Retrieval-Augmented Generation)** is a technique that enhances large language models (LLMs) by retrieving relevant information from a knowledge base before generating responses. This improves accuracy, reduces hallucinations, and allows models to access up-to-date or domain-specific knowledge without retraining.

The notebook demonstrates a complete RAG pipeline, focusing on the retrieval component. It uses LangChain for document loading and splitting, SentenceTransformer for embeddings, and ChromaDB for vector storage and querying. Below, I'll explain RAG in detail, using the notebook's code as examples.

## Key Components of RAG
1. Data Ingestion: Load and preprocess documents.
2. Chunking: Split documents into manageable pieces.
3. Embedding Generation: Convert text into vector representations.
4. Vector Storage: Store embeddings in a database for efficient retrieval.
5. Retrieval: Query the database to find relevant documents based on similarity.
6. Generation: Use retrieved documents as context for an LLM to generate answers (sketched in the final sections of the notebook).

The notebook covers steps 1-5 in detail. For generation, the retrieved documents are passed to an LLM via an API or a local model; the final sections do this with `llama-3.1-8b-instant` served through Groq.
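The six steps listed above can be sketched end to end with the standard library alone. This is a toy illustration, not the notebook's implementation: word-count vectors stand in for SentenceTransformer embeddings, a plain list stands in for ChromaDB, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2: ingest documents and split them into chunks (here: one sentence each)
documents = [
    "Python is a high-level programming language. It was first released in 1991.",
    "Machine learning lets systems learn from data. Supervised learning uses labeled data.",
]
chunks = [s.strip() for doc in documents for s in doc.split(". ") if s.strip()]

# Steps 3-4: embed every chunk and keep (chunk, vector) pairs as the "vector store"
store = [(c, embed(c)) for c in chunks]

def retrieve(query, top_k=2):
    """Step 5: rank stored chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query, top_k=2):
    """Step 6 (input side): assemble the prompt an LLM would receive."""
    context = "\n".join(retrieve(query, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("When was Python released?"))
```

The real pipeline replaces each toy piece with a stronger component, but the data flow (chunks in, ranked chunks out, prompt assembled last) stays the same.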

## Step-by-Step Explanation with Code Examples
1. **Data Ingestion**
RAG starts by ingesting documents from various sources (e.g., PDFs, text files). The notebook loads sample text files and PDFs.

In [125]:
from langchain_core.documents import Document

document = Document(
    page_content = "This is the content of the document.",
    metadata = {
        "source": "example.txt",
        "pages": 1,
        "author": "Kunal Sankhe", 
        "date": "2026-01-15"}
)

This creates a `Document` object with content and metadata. For real data, the notebook first writes two sample text files to disk and then loads them:

In [97]:
import os
os.makedirs('../data/text_files', exist_ok=True)

In [98]:
sample_texts={
    "../data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "../data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!


In [126]:
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader

loader = TextLoader('../data/text_files/python_intro.txt', encoding = "utf-8")
document = loader.load()
print(document)  # Display the loaded document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nKey Features:\n- Easy to learn and use\n- Extensive standard library\n- Cross-platform compatibility\n- Strong community support\n\nPython is widely used in web development, data science, artificial intelligence, and automation.')]
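To make the loader output above less opaque, here is a minimal standard-library sketch of what a text loader conceptually produces: one record per file, holding the file's text plus a `source` entry in its metadata. The function `load_text_files` and the plain-dict record shape are assumptions for illustration, not LangChain's actual implementation.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Read every matching file into a dict with content and source metadata."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records

# Demo with a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "python_intro.txt").write_text("Python Programming Introduction", encoding="utf-8")
    Path(tmp, "machine_learning.txt").write_text("Machine Learning Basics", encoding="utf-8")
    records = load_text_files(tmp)

print([Path(r["metadata"]["source"]).name for r in records])
```

Keeping the source path in the metadata is what later allows retrieved chunks to be traced back to the file they came from.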


In [101]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# Create the PDF directory if it doesn't exist
import os
os.makedirs('../data/pdf', exist_ok=True)

dir_loader = DirectoryLoader(
    '../data/pdf',
    glob="*.pdf",
    loader_cls=PyMuPDFLoader
)
pdf_documents = dir_loader.load()
# PyMuPDFLoader returns one Document per PDF page, so this counts pages
print(f"Loaded {len(pdf_documents)} documents from directory.")
print(pdf_documents)  # Display the loaded documents

Loaded 194 documents from directory.


In [102]:
# Creating Data Chunks 

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents,chunk_size=1000,chunk_overlap=200):
    """
    Split documents into smaller chunks for better RAG performance.
    
    Parameters:
    - chunk_size: Maximum characters per chunk (adjust based on your LLM)
    - chunk_overlap: Characters to overlap between chunks (preserves context)
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, # Each chunk: ~1000 characters
        chunk_overlap=chunk_overlap, # 200 chars overlap for context
        length_function=len, # How to measure length
        separators=["\n\n", "\n", " ", ""] # Split hierarchy
    )
    # Actually split the documents
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show what a chunk looks like
    if split_docs:
        print(f"\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs

chunks = split_documents(pdf_documents)

Split 194 documents into 672 chunks

Example chunk:
Content: What are embeddings
Vicki Boykis...
Metadata: {'producer': 'pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'creator': 'LaTeX with hyperref', 'creationdate': '2026-01-17T00:33:18+00:00', 'source': '../data/pdf/embeddings.pdf', 'file_path': '../data/pdf/embeddings.pdf', 'total_pages': 83, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2026-01-17T00:33:18+00:00', 'trapped': '', 'modDate': 'D:20260117003318Z', 'creationDate': 'D:20260117003318Z', 'page': 0}
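The effect of `chunk_size` and `chunk_overlap` is easiest to see with a plain sliding window. The sketch below is a simplified fixed-window splitter, not `RecursiveCharacterTextSplitter` (which additionally tries to break on the listed separators first); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 2500 characters of dummy text
text = "".join(str(i % 10) for i in range(2500))
pieces = sliding_window_chunks(text)
print([len(p) for p in pieces])
# The tail of one chunk is repeated at the head of the next
print(pieces[0][-200:] == pieces[1][:200])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.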


In [103]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict,  Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity



In [104]:
class EmbeddingManager:
    """Manages embeddings using SentenceTransformer and ChromaDB."""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Loads the SentenceTransformer model."""
        try:
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension  is  {self.get_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model: {e}")
            raise  # re-raise with the original traceback
    
    def generate_embedding(self, texts: List[str]) -> np.ndarray:
        """Generates embeddings for a list of texts."""
        if not self.model:
            raise ValueError("Model not loaded.")
        print(f"Generating embeddings for {len(texts)} texts.")
        embeddings = self.model.encode(texts, convert_to_numpy=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings

    def get_embedding_dimension(self) -> int:
        """Returns the dimension of the embeddings."""
        if not self.model:
            raise ValueError("Model not loaded.")
        return self.model.get_sentence_embedding_dimension()

embedding_manager = EmbeddingManager()
embedding_manager

Model loaded successfully. Embedding dimension  is  384


<__main__.EmbeddingManager at 0x34f09be00>

In [105]:
class VectorStore:

    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str="../data/vector_store"):
        
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None 
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        try:
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path = self.persist_directory)
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF Document Embeddings for RAG System"})
            print(f"Vector store initialized at {self.persist_directory} with collection '{self.collection_name}'.")
            # Note: count() returns the number of items stored in this collection
            print("Existing collections:", self.collection.count())

        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise e

    def add_embeddings(self, documents: List[Document], embeddings: np.ndarray):

        if len(documents) != embeddings.shape[0]:
            raise ValueError("Number of documents and embeddings must match.")

        # prepare data for insertion
        ids = []
        metadatas = []
        documents_texts = []
        embeddings_list = []

        for i, (doc, emb) in enumerate(zip(documents, embeddings)):
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            metadata = dict(doc.metadata)
            metadata['index'] = i 
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            documents_texts.append(doc.page_content)
             
            embeddings_list.append(emb.tolist())

        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_texts,
                embeddings=embeddings_list
            )
            print(f"Added {len(documents)} embeddings to the vector store.")
        except Exception as e:
            print(f"Error adding embeddings: {e}")
            raise e
    
vector_store = VectorStore()
vector_store

Vector store initialized at ../data/vector_store with collection 'pdf_documents'.
Existing collections: 0


<__main__.VectorStore at 0x3521e5a90>

In [106]:
texts = [doc.page_content for doc in chunks]
print(f"Preparing to generate embeddings for {len(texts)} text chunks.")
embeddings = embedding_manager.generate_embedding(texts)
print(f"Embeddings generated with shape: {embeddings.shape}")
vector_store.add_embeddings(chunks, embeddings)

Preparing to generate embeddings for 672 text chunks.
Generating embeddings for 672 texts.
Generated embeddings with shape: (672, 384)
Embeddings generated with shape: (672, 384)
Added 672 embeddings to the vector store.
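Because `add_embeddings` builds each ID from a random UUID, re-running the cell would store the same chunks again under new IDs. One possible alternative, sketched here under that assumption, is to derive the ID from the chunk's origin and content so it stays stable across runs. The helper `stable_chunk_id` is hypothetical and not part of the notebook.

```python
import hashlib

def stable_chunk_id(source, page, text):
    """Derive an ID from the chunk's origin and content so re-runs produce the same ID."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"

first = stable_chunk_id("../data/pdf/embeddings.pdf", 0, "What are embeddings")
again = stable_chunk_id("../data/pdf/embeddings.pdf", 0, "What are embeddings")
other = stable_chunk_id("../data/pdf/embeddings.pdf", 1, "What are embeddings")

print(first == again)  # same inputs give the same ID
print(first == other)  # a different page gives a different ID
```

With stable IDs, a store that supports an upsert-style write can overwrite existing entries instead of accumulating duplicates.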


In [119]:
class RAGRetriever:
    """Retrieves relevant documents from the vector store based on a query."""

    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """Retrieves top_k relevant documents for the given query."""
        query_embedding = self.embedding_manager.generate_embedding([query])[0].tolist()
        
        results = self.vector_store.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        
        retrieved_docs = []
        if results['documents'] and len(results['documents'][0]) > 0:
            documents = results['documents'][0]
            metadatas = results['metadatas'][0]
            distances = results['distances'][0]
            ids = results['ids'][0]

            for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
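                # 1 - distance equals cosine similarity only if the collection uses cosine distance;
                # with another metric (e.g. squared L2) treat it as a rough score that can be negative
                similarity_score = 1 - distance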
                if similarity_score >= score_threshold:
                    retrieved_docs.append({
                        "id": doc_id,
                        "content": document,
                        "metadata": metadata,
                        "similarity_score": similarity_score, 
                        "distance": distance, 
                        "rank": i + 1
                    })
            print(f"Retrieved {len(retrieved_docs)} documents for the query: '{query}'")
        else:
            print("No documents retrieved.")
        return retrieved_docs

In [117]:
rag_retriever = RAGRetriever(vector_store, embedding_manager)

In [121]:
rag_retriever.retrieve("Explain Self-Attention")

Generating embeddings for 1 texts.
Generated embeddings with shape: (1, 384)
Retrieved 1 documents for the query: 'Explain Self-Attention'


[{'id': 'doc_cb90eb27_163',
  'content': 'or the word encoding.\nNext, these positional vectors are passed in parallel to the model. Within\nthe Transformer paper, the model consists of six layers that perform encod-\ning and six that perform decoding. We start with the encoder layer, which\nconsists of two sub-layers: the self-attention layer, and a feed-forward neural\nnetwork. The self-attention layer is the key piece, which performs the process\nof learning the relationship of each term in relation to the other through scaled\ndot-product attention. We can think of self-attention in several ways: as a\ndifferentiable lookup table, or as a large lookup dictionary that contains both\nthe terms and their positions, with the weights of each term in relationship to\nthe other obtained from previous layers.\nThe scaled dot-product attention is the product of three matrices: key,\nquery, and value. These are initially all the same values that are outputs of\nprevious layers - in the first
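Note that only one result passed the threshold even though `top_k` defaults to 5. A plausible cause, stated here as an assumption rather than a verified diagnosis, is the distance metric: if the collection reports squared L2 distance (which I believe is Chroma's default when no metric is configured) and the embeddings are unit length, then `1 - distance` understates cosine similarity and often goes negative. The sketch below shows the relationship with made-up two-dimensional unit vectors.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - distance / 2."""
    return 1.0 - distance / 2.0

# Two made-up unit vectors (0.6^2 + 0.8^2 = 1)
a = [0.6, 0.8]
b = [0.8, 0.6]

d = squared_l2(a, b)
print(round(cosine_from_squared_l2(d), 4))  # true cosine similarity
print(round(1 - d, 4))                      # what `1 - distance` would report
```

If the collection is configured for cosine distance instead, `1 - distance` is already the cosine similarity and no conversion is needed.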

## Integration of Context with Prompt to LLM 

In [None]:
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

# Read the Groq API key from the environment (loaded from .env above); never hard-code it
groq_api_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(api_key=groq_api_key, temperature=0.1, model="llama-3.1-8b-instant", max_tokens=1024)

# Simple RAG function: retrieve context, then prompt the LLM with it

def rag_simple(query, retriever, llm, top_k=3):
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join([ doc["content"] for doc in results]) if results else "" 
    if not context:
        return "No relevant documents found."
    
    # Generate answer using LLM
    prompt = f"""You are an AI assistant. Use the following context to answer the question.
    Context:
    {context}
    Question: {query}
    Answer:""" 
    
    # The f-string is already interpolated; calling .format() again would fail on any braces in the context
    response = llm.invoke(prompt)

    return response.content

In [154]:
answer = rag_simple("Explain Self-Attention", rag_retriever, llm, top_k=3)
print("Answer:", answer)

Generating embeddings for 1 texts.
Generated embeddings with shape: (1, 384)
Retrieved 1 documents for the query: 'Explain Self-Attention'
Answer: Self-Attention is a key component of the Transformer model, introduced in the Transformer paper. It's a mechanism that allows the model to learn the relationships between different input elements, such as words or tokens, and their positions within the input sequence.

In essence, Self-Attention is a differentiable lookup table or a large dictionary that contains both the input terms and their positions. The weights of each term in relation to the other are obtained from previous layers, which enables the model to weigh the importance of each input element in relation to the others.

The Self-Attention mechanism is based on the scaled dot-product attention, which is the product of three matrices: Key, Query, and Value. These matrices are initially all the same values that are outputs of previous layers. In the first pass through the model, t
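`rag_simple` joins every retrieved chunk into the prompt. With a larger `top_k` or larger chunks, the prompt could outgrow the model's context window, so one option is to cap the context by a character budget. The sketch below assumes chunks arrive in rank order; `build_context` and `max_chars` are hypothetical names, and a character count is only a rough proxy for tokens.

```python
def build_context(chunks, max_chars=4000, separator="\n\n"):
    """Join chunks in rank order, stopping before the character budget is exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        # Account for the separator that precedes every chunk except the first
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)

ranked_chunks = ["a" * 10, "b" * 10, "c" * 10]
context = build_context(ranked_chunks, max_chars=25)
print(len(context))
```

Stopping at the first chunk that does not fit keeps the highest-ranked material and never cuts a chunk in half.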

## RAG Advanced 

In [None]:
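def rag_advanced(query, retriever, llm, top_k=3, min_score=0.2, return_context=False):
    """Answer a query from retrieved context; return the answer with sources and a confidence."""
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {'answer': "No relevant documents found.", 'sources': [], 'confidence': 0.0, 'context':""}
    
    # Build the context and a source list from the retrieved chunks
    context = "\n\n".join(doc["content"] for doc in results)
    sources = [{'source': doc["metadata"].get("source", "unknown"),
                'page': doc["metadata"].get("page"),
                'score': doc["similarity_score"]} for doc in results]
    # Assumption: use the best similarity score as a rough confidence value
    confidence = max(doc["similarity_score"] for doc in results)
    
    # Generate answer using LLM with citations
    prompt = f"""You are an AI assistant. Use the following context to answer the question.
    Context:
    {context}
    Question: {query}
    Provide citations from the context in your answer.
    Answer:""" 
    
    response = llm.invoke(prompt)  # prompt is already an f-string; no .format() needed

    # Return the same dict shape as the no-results branch
    return {'answer': response.content, 'sources': sources, 'confidence': confidence,
            'context': context if return_context else ""}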
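Asking the LLM for citations works better when each chunk in the context carries a label it can refer to. A minimal sketch of one way to do that, assuming results shaped like the dictionaries returned by `RAGRetriever.retrieve`; the helper `format_context_with_sources` is hypothetical, and the page number and score in the demo are made up.

```python
def format_context_with_sources(results):
    """Prefix each chunk with a numbered source tag the LLM can cite, e.g. [1]."""
    blocks, sources = [], []
    for n, doc in enumerate(results, start=1):
        meta = doc.get("metadata", {})
        label = f"{meta.get('source', 'unknown')} (page {meta.get('page', '?')})"
        blocks.append(f"[{n}] {label}\n{doc['content']}")
        sources.append({"id": n, "label": label, "score": doc.get("similarity_score")})
    return "\n\n".join(blocks), sources

# Made-up retrieval result for illustration only
results = [
    {"content": "Self-attention relates each term to the others.",
     "metadata": {"source": "../data/pdf/embeddings.pdf", "page": 40},
     "similarity_score": 0.41},
]
context, sources = format_context_with_sources(results)
print(context)
print(sources)
```

The numbered tags let the prompt request answers such as "... [1]", and the returned `sources` list maps each number back to a file and page.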