# Retrieval Augmented Generation (RAG)

This notebook covers Retrieval Augmented Generation (RAG), a technique for grounding LLM responses in external knowledge:
- **What is RAG?**: Combining retrieval of relevant information with LLM generation
- **Why RAG?**: Overcome LLM limitations like knowledge cutoff, hallucinations, and lack of domain-specific knowledge
- **RAG Components**: Document loading, chunking, embedding, vector storage, retrieval, and generation
- **Building RAG Systems**: End-to-end implementation using LangChain and vector databases
- **Advanced Techniques**: Query rewriting, reranking, and hybrid search

## Learning Objectives

- Understand what RAG is and why it's important
- Learn the components and architecture of RAG systems
- Build a complete RAG pipeline from scratch
- Implement document processing, chunking, and embedding
- Use vector databases for efficient retrieval
- Combine retrieval with LLM generation
- Apply RAG to real-world use cases


## Installation

Run this cell to install required packages (uncomment if needed):


In [None]:
# Install packages (uncomment if needed)
# !pip install langchain langchain-openai langchain-community chromadb sentence-transformers pypdf python-dotenv


## 1. What is RAG?

**Retrieval Augmented Generation (RAG)** is a technique that enhances Large Language Models (LLMs) by:

1. **Retrieving** relevant information from external knowledge sources (documents, databases, etc.)
2. **Augmenting** the LLM's context with this retrieved information
3. **Generating** responses based on both the LLM's training and the retrieved context

### Why RAG?

LLMs have several limitations:
- **Knowledge Cutoff**: Training data has a cutoff date, missing recent information
- **Hallucinations**: May generate plausible but incorrect information
- **Domain-Specific Knowledge**: Limited knowledge in specialized domains
- **Static Knowledge**: Cannot access real-time or private information

### RAG Benefits

✅ **Up-to-date Information**: Access current information beyond training cutoff  
✅ **Reduced Hallucinations**: Ground responses in retrieved documents  
✅ **Domain Expertise**: Incorporate specialized knowledge bases  
✅ **Transparency**: Can cite sources for generated answers  
✅ **Cost-Effective**: No need to retrain models for new knowledge

### RAG Architecture

```
User Query
    ↓
Query Embedding
    ↓
Vector Search (Vector DB)
    ↓
Retrieve Relevant Documents
    ↓
Augment Prompt with Context
    ↓
LLM Generation
    ↓
Response
```
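
The flow above can be sketched end to end in plain Python. This is a toy illustration, not the LangChain implementation used later in this notebook: `embed`, `cosine`, `retrieve`, and `augment` are hypothetical helpers, the "embedding" is just a word count standing in for a real embedding model, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Embed the query and return the k most similar documents."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query, context_chunks):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Vector databases store embeddings and support similarity search.",
    "Deep learning uses neural networks with multiple layers.",
]
question = "What do vector databases store?"
top = retrieve(question, docs, k=1)
prompt = augment(question, top)
print(prompt)  # this string is what would be sent to the LLM
```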


In [1]:
# Import libraries
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load environment variables
load_dotenv()

print("Libraries imported successfully!")


Libraries imported successfully!


## 2. RAG Components

A RAG system consists of several key components:

### 2.1 Document Loading
- Load documents from various sources (PDFs, text files, web pages, databases)
- Extract text content from different formats

### 2.2 Text Chunking
- Split documents into smaller chunks
- Balance chunk size: chunks that are too small lose surrounding context, while chunks that are too large dilute relevance and waste context-window space
- Common strategies: fixed-size, sentence-aware, semantic chunking
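
As a rough illustration of the fixed-size strategy, the sketch below slides a character window over the text with a configurable overlap. `chunk_text` is a hypothetical helper written for this example; it is simpler than the `RecursiveCharacterTextSplitter` used later, which also tries to split on natural separators.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghij" * 5  # 50 characters
pieces = chunk_text(text, chunk_size=20, overlap=5)
print([len(p) for p in pieces])
```

The last 5 characters of each piece reappear at the start of the next one, which is what lets a sentence cut at a boundary still be seen whole in at least one chunk.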

### 2.3 Embedding Generation
- Convert text chunks into vector embeddings
- Use embedding models (e.g., OpenAI, Sentence Transformers)
- Embeddings capture semantic meaning

### 2.4 Vector Storage
- Store embeddings in vector databases (ChromaDB, FAISS, Pinecone)
- Enable fast similarity search

### 2.5 Retrieval
- Given a query, find the most similar document chunks
- Use similarity metrics (cosine similarity, dot product, etc.)
- Return the top-k most relevant chunks, where k is a tunable parameter
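
A minimal sketch of top-k retrieval over dense vectors, assuming the embeddings are already available as plain lists of floats. `cosine_similarity` and `top_k` are illustrative names; a vector database performs the same ranking, usually with approximate indexes so it scales beyond a linear scan.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = ((i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up 3-dimensional vectors purely for illustration
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, chunk_vecs, k=2)
print(results)
```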

### 2.6 Generation
- Augment LLM prompt with retrieved context
- Generate response based on query + context
- Optionally cite sources
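
One way to support citation is to number the retrieved chunks in the prompt and ask the model to refer to those numbers. The sketch below assumes each chunk is a dict with hypothetical `source` and `text` keys; the instruction wording is only an example.

```python
def build_prompt(question, chunks):
    """Format retrieved chunks as numbered sources and append the question."""
    if not chunks:
        context = "(no relevant context found)"
    else:
        context = "\n".join(
            f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
        )
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

example_prompt = build_prompt(
    "What is NLP?",
    [{"source": "sample_doc.txt", "text": "NLP is a branch of AI."}],
)
print(example_prompt)
```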


## 3. Building a Simple RAG System

Let's build a complete RAG system step by step:


### 3.1 Step 1: Load Documents


In [2]:
# Example: Create a sample document for demonstration
sample_text = """
Machine Learning is a subset of artificial intelligence that focuses on the development of algorithms 
and statistical models that enable computer systems to improve their performance on a specific task 
through experience. Unlike traditional programming where explicit instructions are provided, machine 
learning systems learn patterns from data.

Deep Learning is a specialized subset of machine learning that uses neural networks with multiple 
layers (hence "deep") to model and understand complex patterns in data. Deep learning has been 
particularly successful in areas like image recognition, natural language processing, and speech recognition.

Natural Language Processing (NLP) is a branch of AI that helps computers understand, interpret, and 
manipulate human language. NLP combines computational linguistics with statistical, machine learning, 
and deep learning models to process human language in text or voice form.

Vector databases are specialized databases designed to store and efficiently search high-dimensional 
vectors (embeddings). They enable similarity search, which is crucial for applications like RAG, 
semantic search, and recommendation systems.
"""

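# Save to a temporary file for demonstration (explicit encoding keeps
# writing and loading consistent across platforms)
with open("sample_doc.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)

# Load the document
loader = TextLoader("sample_doc.txt", encoding="utf-8")
documents = loader.load()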

print(f"Loaded {len(documents)} document(s)")
print(f"Document length: {len(documents[0].page_content)} characters")
print(f"\nFirst 200 characters:\n{documents[0].page_content[:200]}...")


Loaded 1 document(s)
Document length: 1181 characters

First 200 characters:

Machine Learning is a subset of artificial intelligence that focuses on the development of algorithms 
and statistical models that enable computer systems to improve their performance on a specific t...


### 3.2 Step 2: Chunk Documents


In [3]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  # Characters per chunk
    chunk_overlap=50,  # Overlap between chunks to preserve context
    length_function=len,
)

chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
print(f"\nChunk sizes: {[len(chunk.page_content) for chunk in chunks]}")
print(f"\nFirst chunk:\n{chunks[0].page_content}")


Split into 9 chunks

Chunk sizes: [101, 99, 144, 193, 109, 99, 176, 100, 142]

First chunk:
Machine Learning is a subset of artificial intelligence that focuses on the development of algorithms


### 3.3 Step 3: Create Embeddings and Vector Store


In [4]:
# Initialize embeddings (using OpenAI - requires API key)
# For demonstration, we'll check if API key is available
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)
    
    # Create vector store from chunks
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./rag_chroma_db"  # Persist to disk
    )
    
    print("Vector store created successfully!")
    print(f"Number of vectors stored: {len(chunks)}")
else:
    print("⚠️  OPENAI_API_KEY not found in environment variables.")
    print("Please set your OpenAI API key to use embeddings.")
    print("You can use: export OPENAI_API_KEY='your-key-here'")
    print("\nFor now, we'll continue with a placeholder structure.")


⚠️  OPENAI_API_KEY not found in environment variables.
Please set your OpenAI API key to use embeddings.
You can use: export OPENAI_API_KEY='your-key-here'

For now, we'll continue with a placeholder structure.


### 3.4 Step 4: Create Retrieval Chain


In [5]:
# Create a retrieval chain that combines retrieval and generation
if api_key:
    # Initialize LLM
    llm = ChatOpenAI(
        model_name="gpt-3.5-turbo",
        temperature=0,
        openai_api_key=api_key
    )
    
    # Create a custom prompt template (built line by line so the code's
    # indentation does not leak into the prompt text)
    prompt_template = (
        "Use the following pieces of context to answer the question at the end.\n"
        "If you don't know the answer, just say that you don't know; "
        "don't try to make up an answer.\n\n"
        "Context: {context}\n\n"
        "Question: {question}\n\n"
        "Answer:"
    )
    
    PROMPT = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    # Create retrieval QA chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Insert all retrieved chunks into one prompt; alternatives: "map_reduce", "refine", "map_rerank"
        retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),  # Retrieve top 2 chunks
        chain_type_kwargs={"prompt": PROMPT},
        return_source_documents=True
    )
    
    print("RAG chain created successfully!")
else:
    print("⚠️  Skipping chain creation - API key required")


⚠️  Skipping chain creation - API key required


### 3.5 Step 5: Query the RAG System


In [6]:
# Example queries
if api_key:
    queries = [
        "What is machine learning?",
        "How does deep learning differ from machine learning?",
        "What are vector databases used for?"
    ]
    
    for query in queries:
        print(f"\n{'='*60}")
        print(f"Query: {query}")
        print(f"{'='*60}")
        
        result = qa_chain.invoke({"query": query})
        
        print(f"\nAnswer: {result['result']}")
        print(f"\nSource documents retrieved: {len(result['source_documents'])}")
        if result['source_documents']:
            print("\nFirst source chunk preview:")
            print(result['source_documents'][0].page_content[:200] + "...")
else:
    print("⚠️  API key required to run queries")
    print("\nExample of what the RAG system would do:")
    print("1. Convert query to embedding")
    print("2. Search vector store for similar chunks")
    print("3. Retrieve top-k relevant chunks")
    print("4. Augment LLM prompt with retrieved context")
    print("5. Generate answer based on query + context")


⚠️  API key required to run queries

Example of what the RAG system would do:
1. Convert query to embedding
2. Search vector store for similar chunks
3. Retrieve top-k relevant chunks
4. Augment LLM prompt with retrieved context
5. Generate answer based on query + context


## 4. Advanced RAG Techniques

### 4.1 Query Rewriting
- Rephrase queries to improve retrieval
- Generate multiple query variations
- Use query expansion techniques
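
A sketch of multi-query retrieval: the original query and its rewrites are each sent to the retriever, and the results are merged without duplicates. `multi_query_retrieve` is a hypothetical helper; in practice `rewrite` would usually call an LLM, whereas `toy_rewrite` and `toy_retrieve` here are deliberately trivial stand-ins.

```python
def multi_query_retrieve(query, rewrite, retrieve, k=3):
    """Retrieve with the original query plus its rewrites, merging results without duplicates."""
    variations = [query] + [v for v in rewrite(query) if v != query]
    merged, seen = [], set()
    for variation in variations:
        for chunk in retrieve(variation, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

CORPUS = [
    "machine learning systems learn patterns from data",
    "deep learning uses neural networks with multiple layers",
    "vector databases enable similarity search",
]

def toy_rewrite(query):
    """Stand-in rewriter: expand one abbreviation."""
    return [query.replace("ML", "machine learning")]

def toy_retrieve(query, k):
    """Stand-in retriever: rank corpus entries by shared words, dropping non-matches."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.split())), doc) for doc in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

results = multi_query_retrieve("what is ML", toy_rewrite, toy_retrieve, k=1)
print(results)
```

In this toy setup the original query matches nothing because the corpus never uses the abbreviation, and only the rewritten query finds the relevant entry.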

### 4.2 Reranking
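- First-stage retrieval returns a broad candidate set quickly
- Rerank candidates with a more accurate but slower model (e.g., a cross-encoder that scores each query-chunk pair jointly)
- Improves the precision of the chunks passed to the LLM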
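
The two-stage pattern can be sketched with a pluggable scoring function. `rerank` is a hypothetical helper, and `overlap_score` is a toy word-overlap scorer used only so the example runs without a model; a real system would plug in a cross-encoder there.

```python
def rerank(query, candidates, score_fn, top_n=2):
    """Re-score first-stage candidates with a more expensive scorer and keep the best top_n."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query, passage):
    """Toy scorer: fraction of query words found in the passage (stand-in for a cross-encoder)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

# Pretend these came back from a first-stage vector search
candidates = [
    "neural networks are inspired by biology",
    "deep learning uses neural networks with multiple layers",
    "databases store records",
]
best = rerank("deep neural networks", candidates, overlap_score, top_n=1)
print(best)
```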

### 4.3 Hybrid Search
- Combine semantic search (embeddings) with keyword search (BM25)
- Get benefits of both approaches
- Weighted combination of results
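
A minimal sketch of the weighted combination, assuming semantic and keyword scores have already been computed per document. Because the two score types live on different scales, each is min-max normalized first; `min_max`, `hybrid_scores`, the `alpha` weight, and the sample scores are all illustrative choices rather than a standard API.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(semantic, keyword, alpha=0.5):
    """Blend normalized semantic and keyword scores; alpha weights the semantic side."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}

semantic = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
keyword = {"b": 12.0, "c": 3.0, "d": 0.0}   # e.g. BM25 scores
combined = hybrid_scores(semantic, keyword, alpha=0.5)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Document `b` ranks first here because it scores reasonably on both signals, even though `a` has the highest semantic score alone.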

### 4.4 Chunking Strategies
- **Fixed-size**: Simple but may break context
- **Sentence-aware**: Split at sentence boundaries
- **Semantic chunking**: Group semantically related content
- **Sliding window**: Overlap chunks to preserve context
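
To illustrate the sentence-aware strategy, the sketch below packs whole sentences into chunks up to a character limit. `sentence_chunks` is a hypothetical helper, and the regex-based sentence split is a simplification that will mis-handle abbreviations such as "e.g."; a production system would likely use a proper sentence tokenizer.

```python
import re

def sentence_chunks(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "RAG retrieves documents. It augments the prompt. Then the LLM answers."
chunks = sentence_chunks(text, max_chars=50)
print(chunks)
```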

### 4.5 Metadata Filtering
- Filter documents by metadata (date, source, category)
- Improve retrieval relevance
- Support multi-tenant scenarios
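
Conceptually, metadata filtering is an exact-match predicate applied alongside similarity search. The sketch below shows the idea on plain dicts; `filter_chunks` and the `tenant` and `year` fields are made up for illustration, and vector databases typically expose their own filter syntax for the same purpose.

```python
def filter_chunks(chunks, **required):
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]

chunks = [
    {"text": "Q3 revenue summary", "metadata": {"tenant": "acme", "year": 2024}},
    {"text": "Q3 hiring plan", "metadata": {"tenant": "globex", "year": 2024}},
    {"text": "Old revenue summary", "metadata": {"tenant": "acme", "year": 2021}},
]
visible = filter_chunks(chunks, tenant="acme", year=2024)
print([c["text"] for c in visible])
```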


## 5. RAG Best Practices

### Document Processing
- ✅ Clean and normalize text
- ✅ Remove irrelevant content
- ✅ Preserve important formatting
- ✅ Handle multiple languages

### Chunking
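- ✅ Choose appropriate chunk size (typically 200-1000 tokens; note that the `chunk_size=200` in Step 2 counts characters, not tokens, and is kept small for demonstration)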
- ✅ Use overlap to preserve context
- ✅ Consider document structure (paragraphs, sections)
- ✅ Test different chunking strategies

### Retrieval
- ✅ Tune number of retrieved chunks (k)
- ✅ Use appropriate similarity metric
- ✅ Consider reranking for better precision
- ✅ Implement metadata filtering when needed

### Generation
- ✅ Design clear prompt templates
- ✅ Include instructions for using context
- ✅ Handle cases where no relevant context is found
- ✅ Enable source citation

### Evaluation
- ✅ Measure retrieval quality (precision, recall)
- ✅ Evaluate answer quality (accuracy, relevance)
- ✅ Test with diverse queries
- ✅ Monitor for hallucinations
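
Retrieval quality can be measured once you have, for each test query, a set of chunk IDs judged relevant. The sketch below computes precision and recall at k; `precision_recall_at_k` and the chunk IDs are hypothetical, and the relevance labels are assumed to come from manual annotation.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the first k retrieved items."""
    top = retrieved[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["c1", "c7", "c3", "c9"]   # ranked retriever output
relevant = {"c1", "c3", "c5"}          # hand-labeled relevant chunks
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={p:.2f}, recall@3={r:.2f}")
```

Here two of the top three results are relevant and two of the three relevant chunks were found, so both values are 2/3.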


## 6. Common RAG Challenges and Solutions

### Challenge 1: Irrelevant Retrieval
**Problem**: Retrieved chunks don't match the query  
**Solutions**: 
- Improve query understanding (query rewriting)
- Use better embedding models
- Implement reranking
- Tune retrieval parameters

### Challenge 2: Context Window Limits
**Problem**: Too many retrieved chunks exceed the LLM's context window  
**Solutions**:
- Limit number of retrieved chunks
- Use compression techniques
- Implement hierarchical retrieval
- Use chain types like "map_reduce"
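
The simplest of these mitigations, limiting what goes into the prompt, can be sketched as a token budget applied to the ranked chunks. `fit_to_budget` is a hypothetical helper, and the default whitespace word count is only a crude proxy for tokens; a real system would pass in the model's own tokenizer as `count_tokens`.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in ranked order until the token budget would be exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked = ["one two three", "four five", "six seven eight nine"]
kept = fit_to_budget(ranked, max_tokens=5)
print(kept)
```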

### Challenge 3: Outdated Information
**Problem**: Vector store contains outdated information  
**Solutions**:
- Implement incremental updates
- Use versioning for documents
- Add timestamp metadata
- Periodic re-indexing
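
Incremental updates need a way to tell which documents changed since the last indexing run. One common approach, sketched here under the assumption that you store a content hash per document ID, is to compare hashes and only re-embed what differs. `plan_updates` and the document IDs are illustrative.

```python
import hashlib

def plan_updates(indexed_hashes, current_docs):
    """Compare stored content hashes with current documents and report what changed."""
    current_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    to_add = sorted(d for d in current_hashes if d not in indexed_hashes)
    to_update = sorted(
        d for d in current_hashes
        if d in indexed_hashes and indexed_hashes[d] != current_hashes[d]
    )
    to_delete = sorted(d for d in indexed_hashes if d not in current_hashes)
    return to_add, to_update, to_delete, current_hashes

# Hashes recorded at the previous indexing run
indexed = {
    "a": hashlib.sha256(b"v1").hexdigest(),
    "b": hashlib.sha256(b"same").hexdigest(),
    "c": hashlib.sha256(b"gone").hexdigest(),
}
current = {"a": "v2", "b": "same", "d": "new"}
to_add, to_update, to_delete, new_hashes = plan_updates(indexed, current)
print(to_add, to_update, to_delete)
```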

### Challenge 4: Hallucinations
**Problem**: LLM generates information not in retrieved context  
**Solutions**:
- Improve prompt instructions
- Use `temperature=0` to reduce sampling randomness (this makes outputs more repeatable but does not by itself prevent hallucinations)
- Implement answer validation
- Add source citation requirements
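
As a very rough form of answer validation, you can flag answer sentences whose words barely appear in the retrieved context. The sketch below is a lexical heuristic only: `unsupported_sentences` and its `threshold` are assumptions made for this example, it will miss paraphrased hallucinations, and it can flag legitimate rewording, so treat it as a cheap first filter rather than a real groundedness check.

```python
import re

def unsupported_sentences(answer, context, threshold=0.5):
    """Flag answer sentences whose words mostly do not appear in the retrieved context."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged

context = "Vector databases store embeddings and enable similarity search."
answer = "Vector databases store embeddings. They were invented in 1975 by NASA."
flagged = unsupported_sentences(answer, context)
print(flagged)
```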


## 7. Next Steps

Now that you understand RAG fundamentals:

1. **Experiment with different chunking strategies** - Try various chunk sizes and overlap
2. **Test different embedding models** - Compare OpenAI, Sentence Transformers, etc.
3. **Explore advanced techniques** - Implement reranking, query rewriting, hybrid search
4. **Build domain-specific RAG** - Apply to your own documents and use cases
5. **Integrate with LangChain** - Explore LangChain's RAG capabilities and tools
6. **Evaluate your system** - Measure retrieval and generation quality

### Resources
- [LangChain RAG Documentation](https://python.langchain.com/docs/use_cases/question_answering/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [FAISS Documentation](https://github.com/facebookresearch/faiss)
- Research paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)


In [7]:
# Cleanup: Remove temporary file
import os
if os.path.exists("sample_doc.txt"):
    os.remove("sample_doc.txt")
    print("Temporary file cleaned up!")


Temporary file cleaned up!
