# Building a RAG System from Scratch - Step by Step

**Retrieval-Augmented Generation (RAG)** is a technique that enhances Large Language Models by providing them with relevant context from a knowledge base before generating answers.

**Why RAG?**
- ✅ Reduces hallucinations by grounding answers in real data
- ✅ Enables LLMs to access up-to-date information
- ✅ Allows working with private/proprietary documents
- ✅ Can cite sources for answers

**What we'll build:**
A complete RAG system that can answer questions about a single document: the 2024 State of the Union address.

## Step 1: Install Required Dependencies

In [None]:
import subprocess
import sys

packages = [
    "langchain",                  # Core LangChain framework
    "langchain-chroma",           # Chroma vector store integration
    "langchain-openai",           # OpenAI models integration
    "langchain-core",             # Core LangChain utilities
    "langchain-text-splitters",   # Text splitters (imported in Step 2)
    "python-dotenv",              # Environment variable management
    "chromadb"                    # Vector database
]

print("Installing RAG dependencies...\n")
for package in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
    print(f"✓ {package}")

print("\n✅ All packages installed successfully!")

**📝 Explanation:**
We install 7 packages:
- **langchain**: Main framework for building LLM applications
- **langchain-chroma**: Allows us to use ChromaDB as our vector database
- **langchain-openai**: Provides OpenAI's GPT models and embeddings
- **langchain-core**: Core utilities for chains and prompts
- **langchain-text-splitters**: Provides `CharacterTextSplitter`; listed explicitly because Step 2 imports from it
- **python-dotenv**: Loads API keys from a .env file so they stay out of the notebook
- **chromadb**: Lightweight vector database for storing document embeddings

## Step 2: Import Libraries

In [None]:
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import CharacterTextSplitter

# Load API keys from .env file
load_dotenv()

print("✅ All imports successful and environment loaded!")

**📝 Explanation:**
Each import serves a specific purpose in our RAG pipeline:
- **Chroma**: Vector database for storing embeddings
- **PromptTemplate**: Structures prompts with variables
- **RunnablePassthrough**: Passes data through pipeline unchanged
- **StrOutputParser**: Extracts text from LLM response
- **OpenAIEmbeddings**: Converts text to vector embeddings
- **ChatOpenAI**: OpenAI's chat model (GPT)
- **CharacterTextSplitter**: Splits large documents into chunks
- **load_dotenv()**: Loads your OPENAI_API_KEY from .env file
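
`load_dotenv()` does not raise an error when the `.env` file or the key is missing, so a quick check can surface the problem before the first API call. The sketch below is one way to do that; `require_env` is a hypothetical helper name introduced here for illustration and is not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Usage, after load_dotenv() has run:
# api_key = require_env("OPENAI_API_KEY")
```

Failing early like this tends to give a clearer message than an authentication error raised later by the API client.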

## Step 3: Initialize Embeddings Model

In [None]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("✅ Embeddings model initialized")
print(f"   Model: text-embedding-3-large")
print(f"   Dimensions: 3072 (vector size)")

**📝 Explanation:**
Embeddings convert text into numerical vectors that capture semantic meaning. Similar concepts have similar vectors.

**Why text-embedding-3-large?**
- High quality: 3072-dimensional vectors
- Captures nuanced meaning
- Good for semantic similarity search

**Example:** 
- "dog" and "puppy" → similar vectors
- "dog" and "car" → different vectors

These embeddings allow us to find relevant documents even when they don't contain exact keyword matches.
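
To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration; real `text-embedding-3-large` vectors have 3072 dimensions, and the metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors (NOT real embeddings), chosen so that
# "dog" and "puppy" point in nearly the same direction.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])

print(f"dog vs puppy: {dog_puppy:.3f}")
print(f"dog vs car:   {dog_car:.3f}")
```

A score close to 1 means the vectors point in nearly the same direction; a score near 0 means they are largely unrelated.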

## Step 4: Create Vector Store (ChromaDB)

In [None]:
vector_store = Chroma(
    collection_name="state_of_union_rag",
    embedding_function=embeddings
)

print("✅ Vector store created")
print(f"   Database: ChromaDB")
print(f"   Collection: state_of_union_rag")
print(f"   Ready to store document embeddings")

**📝 Explanation:**
ChromaDB is a vector database that stores and retrieves embeddings efficiently.

**What it does:**
- Stores document embeddings (vectors)
- Performs fast similarity searches
- Returns the most relevant documents for a query

**How it works:**
1. Documents → Embeddings → Stored in ChromaDB
2. Query → Embedding → Search similar vectors
3. ChromaDB returns most similar documents

**Why ChromaDB?**
- Lightweight and easy to use
- No separate server needed
- Perfect for development and small-to-medium projects
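
The store-then-search flow described above can be sketched with a toy in-memory store. This is not how ChromaDB is implemented: `ToyVectorStore` and `toy_embed` are hypothetical names, and the word-count "embedding" is only a stand-in for a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts (ASSUMPTION, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (embedding, text) pairs and ranks them by similarity to a query."""

    def __init__(self):
        self._items = []

    def add(self, texts):
        for text in texts:
            self._items.append((toy_embed(text), text))

    def search(self, query, k=1):
        query_vec = toy_embed(query)
        ranked = sorted(
            self._items, key=lambda item: cosine(query_vec, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add(
    [
        "the economy added new jobs this year",
        "the weather was cold and rainy",
        "new jobs in manufacturing are growing",
    ]
)
results = store.search("jobs in the economy", k=2)
print(results)
```

A real vector database replaces the full sort with an index built for fast nearest-neighbor lookup, but the add and search roles are the same.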

## Step 5: Load the Document

In [None]:
with open("2024_state_of_the_union.txt", "r", encoding="utf-8") as f:
    document = f.read()

print("✅ Document loaded successfully")
print(f"   File: 2024_state_of_the_union.txt")
print(f"   Total characters: {len(document):,}")
print(f"   Total words: ~{len(document.split()):,}")
print(f"\n   Preview (first 200 chars):")
print(f"   {document[:200]}...")

**📝 Explanation:**
We load the document that will serve as our knowledge base.

**Why this step?**
- RAG needs a source of information to retrieve from
- This document contains facts the LLM can reference
- In production, this could be PDFs, databases, APIs, etc.

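**Note:** A single speech may fit within the context window of recent models, but most real knowledge bases will not. Retrieving only the relevant passages also keeps prompts short and focused, which is why we chunk the document in the next step.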

## Step 6: Split Document into Chunks

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size=1000,        # Target maximum chunk length: 1000 characters
    chunk_overlap=200,      # Up to 200 characters shared between consecutive chunks
    length_function=len,    # Measure length in characters
    separator="\n"          # Split only on newlines; longer lines are kept whole
)

chunks = text_splitter.create_documents([document])

print("✅ Document split into chunks")
print(f"   Total chunks: {len(chunks)}")
print(f"   Chunk size: ~1000 characters")
print(f"   Overlap: 200 characters")
print(f"\n   Example chunk:")
print(f"   {chunks[0].page_content[:300]}...")

**📝 Explanation:**
Chunking breaks large documents into smaller, manageable pieces.

**Why chunk?**
- LLMs have finite context windows, so a large corpus cannot be passed to the model in full
- Smaller chunks = more precise retrieval
- Each chunk can be embedded and searched independently

**Key parameters:**
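- **chunk_size=1000**: Each chunk is at most about 1000 characters; because `CharacterTextSplitter` only splits on the separator, a single line longer than that is kept whole
- **chunk_overlap=200**: Consecutive chunks can share up to 200 characters to preserve context across boundaries
- This reduces the chance that important information is cut off at a chunk boundary

**Example:** If a passage spans a chunk boundary, the overlap lets its trailing lines reappear at the start of the next chunk, so it is more likely to be retrievable in full.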
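
The overlap idea can be illustrated with a fixed-width sliding window. This is a simplified sketch, not the algorithm `CharacterTextSplitter` uses (which splits on the separator first and then merges pieces); `sliding_window_chunks` is a hypothetical helper, and the tiny sizes are chosen only to keep the output readable.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk begins with the last four characters of the previous one, so text near a boundary is present in two chunks.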

## Step 7: Store Chunks in Vector Database

In [None]:
print("⏳ Adding chunks to vector store (this may take a moment)...")

document_ids = vector_store.add_documents(chunks)

print(f"✅ All chunks stored in vector database")
print(f"   Total documents indexed: {len(document_ids)}")
print(f"   Each chunk has been:")
print(f"   1. Converted to embedding (vector)")
print(f"   2. Stored in ChromaDB")
print(f"   3. Ready for similarity search")

**📝 Explanation:**
This is where the magic happens! Each chunk is:

1. **Converted to embedding**: OpenAI's model converts text → 3072-dimensional vector
2. **Stored in ChromaDB**: Vector + original text saved together
3. **Indexed**: Database organizes vectors for fast retrieval

**What happens behind the scenes:**
```
Chunk 1: "Putin invaded Ukraine..." → [0.234, -0.567, 0.891, ...] (3072 numbers)
Chunk 2: "The economy is strong..." → [0.123, -0.234, 0.456, ...] (3072 numbers)
...
```

Now when you search, ChromaDB can quickly find chunks with similar vectors to your query!

## Step 8: Create a Retriever

In [None]:
retriever = vector_store.as_retriever(
    search_kwargs={"k": 3}  # Retrieve top 3 most relevant chunks
)

print("✅ Retriever created")
print(f"   Will retrieve: Top 3 most similar chunks")
print(f"   Search method: Similarity search using vector distance")

**📝 Explanation:**
The retriever is responsible for finding relevant documents based on a query.

**How it works:**
1. Takes a query (e.g., "Who invaded Ukraine?")
2. Converts query to embedding
3. Compares query embedding to all chunk embeddings
4. Returns the k=3 most similar chunks

**Why k=3?**
- Balance between context and token limits
- More chunks = more context but longer prompts
- Fewer chunks = faster but might miss relevant info
- 3 is a good starting point; adjust based on your needs
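
The trade-off between `k` and prompt length can be estimated with simple arithmetic. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text rather than an exact figure, and uses the 1000-character `chunk_size` from Step 6; `estimated_context_tokens` is a hypothetical helper.

```python
CHUNK_SIZE_CHARS = 1000  # chunk_size from Step 6
CHARS_PER_TOKEN = 4      # ASSUMPTION: rough rule of thumb for English text


def estimated_context_tokens(k, chunk_size_chars=CHUNK_SIZE_CHARS,
                             chars_per_token=CHARS_PER_TOKEN):
    """Upper-bound estimate of the tokens that k retrieved chunks add to a prompt."""
    return (k * chunk_size_chars) // chars_per_token


for k in (1, 3, 5, 10):
    print(f"k={k:>2}: up to ~{estimated_context_tokens(k):,} context tokens")
```

Under these assumptions, k=3 adds at most about 750 tokens of context, which leaves plenty of room for the question and the answer.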

## Step 9: Initialize the Language Model

In [None]:
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0  # Minimize sampling randomness for consistent answers
)

print("✅ Language Model initialized")
print(f"   Model: GPT-4o-mini")
print(f"   Temperature: 0 (consistent, low-randomness answers)")
print(f"   Purpose: Generate answers based on retrieved context")

**📝 Explanation:**
The LLM generates the final answer using the retrieved context.

**Model choice:**
- **GPT-4o-mini**: Faster and cheaper than GPT-4, still high quality
- Good balance of performance and cost for RAG applications

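**Temperature=0:**
- Controls randomness in token sampling
- 0 = the model almost always picks the most likely token, so answers are consistent across runs (a good default for RAG), though not guaranteed to be identical
- Consistency is not the same as correctness: factual grounding comes from the retrieved context and the prompt
- Higher values = more varied, creative output but less repeatable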

The LLM will receive both the retrieved chunks and the user's question, then generate an answer based on that context.
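
For intuition about what temperature does, the sketch below applies a temperature-scaled softmax to made-up scores for three candidate tokens. It is a conceptual illustration only; the scores are invented, and how the API handles `temperature=0` internally is not shown here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature approaches 0, nearly all probability mass moves to the highest-scoring token, which is why low temperatures give more repeatable output.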

## Step 10: Create the Prompt Template

In [None]:
template = """You are a helpful AI assistant. Answer the question based ONLY on the provided context.
If the answer is not in the context, say "I don't have enough information in the provided context to answer that question."

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(template)

print("✅ Prompt template created")
print("   Template has 2 variables:")
print("   - {context}: Retrieved chunks will go here")
print("   - {question}: User's question will go here")

**📝 Explanation:**
The prompt template structures how we communicate with the LLM.

**Key instructions:**
1. **"Answer based ONLY on the provided context"** - Discourages hallucination
2. **"Say 'I don't have enough information...' if not in context"** - Encourages honest responses when the context has no answer

**Variables:**
- `{context}`: Filled with retrieved chunks
- `{question}`: Filled with the user's query

**Why this matters:**
Without proper prompting, LLMs might make up answers. This template instructs the LLM to rely only on the retrieved information, which makes responses more reliable and easier to verify, though it cannot guarantee that the model always complies.
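
To see what the model actually receives, the template can be filled in with plain `str.format`, which is conceptually similar to the substitution `PromptTemplate` performs. In this sketch, `build_prompt` is a hypothetical helper, the chunk texts are placeholders rather than real retrieval results, and joining chunks with a blank line is an assumption.

```python
TEMPLATE = """You are a helpful AI assistant. Answer the question based ONLY on the provided context.
If the answer is not in the context, say "I don't have enough information in the provided context to answer that question."

Context:
{context}

Question: {question}

Answer:"""


def build_prompt(context_chunks, question):
    """Fill the template; chunks are joined with a blank line (ASSUMPTION)."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts, standing in for whatever the retriever returns.
prompt_text = build_prompt(
    ["[text of first retrieved chunk]", "[text of second retrieved chunk]"],
    "What is RAG?",
)
print(prompt_text)
```

Printing the filled prompt is a useful debugging step: if the answer is wrong, checking whether the relevant passage appears under "Context:" shows whether retrieval or generation is at fault.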

## Step 11: Create Document Formatter Function