# 🔍 Local RAG Tutorial

A hands-on tutorial for building a **Retrieval-Augmented Generation** system that runs entirely on your machine.

Sources: https://culinary.sonoma.edu/sites/culinary/files/2025-03/Pasta%20Carbonara%20Recipe.pdf

## What we'll build

```
Your Question → Embed → Search Vector DB → Retrieve Chunks → LLM + Context → Answer
```

## Prerequisites

1. **Ollama** installed and running ([download here](https://ollama.com/download))
2. A model pulled: `ollama pull llama3.1:8b`
3. Some PDFs in the `./documents` folder


Make sure Ollama is running: in a separate terminal, run `ollama serve`.

To confirm that the server is up, open http://localhost:11434/ in a browser.
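
The same check can be done from Python. This is a minimal sketch using only the standard library; the function name `ollama_is_running` is hypothetical, and it assumes the server answers a plain GET on its root URL with status 200 at the default port used above.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434/", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to base_url succeeds with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the server with `ollama serve` before continuing.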

---
## Step 0: Install Dependencies

Run this cell once to install required packages.

In [None]:
# Uncomment and run this cell if you haven't installed the dependencies yet

# !pip install llama-index
# !pip install llama-index-llms-ollama
# !pip install llama-index-embeddings-huggingface
# !pip install llama-index-vector-stores-chroma
# !pip install chromadb
# !pip install gradio
# !pip install pypdf
# !pip install sentence-transformers

---
## Step 1: Configuration

Set up paths and model choices. **Edit these to match your setup!**

In [None]:
# =============================================================================
# CONFIGURATION - Adjust these for your setup
# =============================================================================
import os

# Folder containing your PDF documents
DOCUMENTS_FOLDER = "./documents"

# Folder for the vector database (persists embeddings between runs)
CHROMA_DB_FOLDER = "./chroma_db"

# Create folders if they don't exist
os.makedirs(DOCUMENTS_FOLDER, exist_ok=True)
os.makedirs(CHROMA_DB_FOLDER, exist_ok=True)

# Ollama LLM model (must be pulled first: ollama pull llama3.1:8b)
LLM_MODEL = "llama3.1:8b"  # Alternatives: "mistral", "phi3", "gemma2"

# HuggingFace embedding model (downloaded automatically)
# This replaces the need for nomic-embed-text in Ollama!
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Fast, 384 dimensions
# Alternatives:
# - "sentence-transformers/all-mpnet-base-v2"  (better quality, slower)
# - "BAAI/bge-small-en-v1.5"  (good balance)
# - "BAAI/bge-large-en-v1.5"  (best quality, needs more RAM)

# Chunking settings
CHUNK_SIZE = 512      # Tokens per chunk (SentenceSplitter measures size in tokens, not characters)
CHUNK_OVERLAP = 50    # Tokens shared between consecutive chunks

# Retrieval settings
TOP_K = 3  # Number of chunks to retrieve per query

print(f"✅ Folders ready: {DOCUMENTS_FOLDER}, {CHROMA_DB_FOLDER}")

---
## Step 2: Import Libraries

We're using:
- **LlamaIndex**: Orchestrates the RAG pipeline
- **Ollama**: Runs the LLM locally
- **HuggingFace**: Provides the embedding model (downloaded automatically on first use, no `ollama pull` needed)
- **ChromaDB**: Stores vectors locally

In [None]:
import os
from pathlib import Path

# LlamaIndex core
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
)
from llama_index.core.node_parser import SentenceSplitter

# Local LLM via Ollama
from llama_index.llms.ollama import Ollama

# HuggingFace embeddings (no Ollama embedding model needed!)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Vector store
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

print("✅ All imports successful!")

---
## Step 3: Initialize the LLM

We use **Ollama** to run a local LLM. This is the model that generates answers.

Make sure Ollama is running! In a terminal: `ollama serve`

In [None]:
print(f"🔧 Connecting to Ollama with model: {LLM_MODEL}")

# Initialize the LLM
llm = Ollama(
    model=LLM_MODEL,
    request_timeout=120.0,  # Local models can be slow without GPU
)

# Quick test to make sure it's working
try:
    test = llm.complete("Say 'hello' in one word.")
    print(f"✅ Ollama is working! Test response: {test}")
except Exception as e:
    print(f"❌ Error connecting to Ollama: {e}")
    print("\nTroubleshooting:")
    print("  1. Is Ollama running? Start it with: ollama serve")
    print(f"  2. Is the model pulled? Run: ollama pull {LLM_MODEL}")

---
## Step 4: Initialize the Embedding Model

We use **HuggingFace sentence-transformers**.

Benefits:
- Downloads automatically (no manual `ollama pull`)
- Wide variety of models available
- Well-documented and widely used

The embedding model converts text into vectors (lists of numbers) that capture semantic meaning.
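
To make "capture semantic meaning" concrete: texts with similar meaning should map to vectors that point in similar directions, which is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for real embeddings (the model configured above produces 384 dimensions); the vectors and their labels are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any real model
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(f"cat vs kitten:  {cosine_similarity(cat, kitten):.3f}")
print(f"cat vs invoice: {cosine_similarity(cat, invoice):.3f}")
```

The first score is much higher than the second, which is the property retrieval relies on.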

In [None]:
print(f"🔧 Loading embedding model: {EMBEDDING_MODEL}")
print("   (This may download the model on first run - ~90MB for MiniLM)")

# Initialize the embedding model from HuggingFace
embed_model = HuggingFaceEmbedding(
    model_name=EMBEDDING_MODEL,
    # Uncomment below if you have a GPU and want faster embeddings
    # device="cuda",
)

# Quick test
test_embedding = embed_model.get_text_embedding("Hello world")
print("✅ Embedding model loaded!")
print(f"   Vector dimensions: {len(test_embedding)}")

---
## Step 5: Configure LlamaIndex Settings

We tell LlamaIndex which models to use and how to chunk documents.
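
To see what chunk size and overlap mean, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` works internally (that splitter also tries to respect sentence boundaries and counts tokens with a real tokenizer); here whitespace-separated words stand in for tokens, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Words stand in for tokens in this toy example
words = "the quick brown fox jumps over the lazy dog today".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each chunk after the first begins with the last word of the previous one, which is the overlap at work.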

In [None]:
# Set the LLM and embedding model as defaults
Settings.llm = llm
Settings.embed_model = embed_model

# Configure document chunking
Settings.node_parser = SentenceSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

print("✅ Settings configured:")
print(f"   LLM: {LLM_MODEL}")
print(f"   Embeddings: {EMBEDDING_MODEL}")
print(f"   Chunk size: {CHUNK_SIZE} tokens")
print(f"   Chunk overlap: {CHUNK_OVERLAP} tokens")

---
## Step 6: Set Up ChromaDB Vector Store

**ChromaDB** stores our document embeddings locally. It persists to disk, so you don't need to re-embed documents every time.

Think of it as a database optimized for finding similar vectors.
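
As a mental model only, the sketch below is a tiny in-memory vector store that finds the most similar vectors by brute force. The class `TinyVectorStore` is hypothetical and is not ChromaDB's API; real vector databases add persistence and typically use approximate nearest-neighbor indexes so that search stays fast on large collections.

```python
import math
from dataclasses import dataclass, field


def _normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


@dataclass
class TinyVectorStore:
    ids: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(_normalize(vector))

    def query(self, vector, top_k=3):
        """Return up to top_k (id, score) pairs, most similar first."""
        q = _normalize(vector)
        scored = [
            (item_id, sum(a * b for a, b in zip(q, v)))
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("chunk-a", [1.0, 0.0])
store.add("chunk-b", [0.7, 0.7])
store.add("chunk-c", [0.0, 1.0])
print(store.query([0.9, 0.1], top_k=2))
```

The query returns `chunk-a` first and `chunk-b` second, because their directions are closest to the query vector.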

In [None]:
print("💾 Initializing ChromaDB vector store...")

# Create the database folder if it doesn't exist
os.makedirs(CHROMA_DB_FOLDER, exist_ok=True)

# Create a persistent ChromaDB client
chroma_client = chromadb.PersistentClient(path=CHROMA_DB_FOLDER)

# Get or create a collection for our documents
chroma_collection = chroma_client.get_or_create_collection(
    name="tutorial_documents",
    metadata={"description": "RAG tutorial document collection"}
)

# Wrap for LlamaIndex
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print(f"✅ Vector store ready at: {CHROMA_DB_FOLDER}")
print(f"   Existing chunks in collection: {chroma_collection.count()}")

---
## Step 7: Load and Index Documents

This is where the magic happens:

1. **Load** PDFs from the documents folder
2. **Chunk** them into smaller pieces
3. **Embed** each chunk into a vector
4. **Store** vectors in ChromaDB

⚠️ **Make sure you have PDFs in the `./documents` folder!**

In [None]:
# Create documents folder if it doesn't exist
os.makedirs(DOCUMENTS_FOLDER, exist_ok=True)

# Check for PDFs
pdf_files = list(Path(DOCUMENTS_FOLDER).glob("*.pdf"))

if not pdf_files:
    print(f"⚠️  No PDFs found in {DOCUMENTS_FOLDER}")
    print("\n   Please add some PDF files and re-run this cell!")
    print(f"\n   Example: Copy a PDF to {os.path.abspath(DOCUMENTS_FOLDER)}")
else:
    print(f"📄 Found {len(pdf_files)} PDF(s):")
    for pdf in pdf_files:
        print(f"   - {pdf.name}")

In [None]:
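# Load and index the documents
# This cell may take a few minutes depending on document size

if chroma_collection.count() > 0:
    # Embeddings are already persisted, so reuse them instead of re-embedding.
    # Calling from_documents again would add the same chunks a second time.
    # To index new or changed PDFs, delete CHROMA_DB_FOLDER and restart the notebook.
    print("💾 Reusing embeddings already stored in ChromaDB...")
    index = VectorStoreIndex.from_vector_store(vector_store)
else:
    print("📖 Loading documents...")
    documents = SimpleDirectoryReader(
        input_dir=DOCUMENTS_FOLDER,
        required_exts=[".pdf"],
    ).load_data()

    print(f"   Loaded {len(documents)} document sections")

    print("\n🔢 Creating embeddings and building index...")
    print("   (This may take a few minutes the first time)")

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )

print("\n✅ Index ready!")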

---
## Step 8: Create the Query Engine

The query engine ties everything together:
1. Takes your question
2. Embeds it
3. Finds similar chunks in the vector store
4. Sends chunks + question to the LLM
5. Returns the answer
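
Step 4 is the "augmented" part of RAG: the retrieved chunks are pasted into the prompt ahead of the question. The sketch below shows one plausible way to assemble such a prompt. The template wording and the `build_prompt` helper are assumptions for illustration; LlamaIndex ships its own prompt templates, which differ in detail.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number the chunks so the model (and the reader) can refer to them
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the vector store would return
retrieved = [
    "Chunks are embedded and stored in a vector database.",
    "At query time the most similar chunks are retrieved.",
]
print(build_prompt("How does retrieval work?", retrieved))
```

Printing the prompt like this is a useful debugging habit: if the answer is wrong, checking whether the needed fact is actually in the context separates retrieval problems from generation problems.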

In [None]:
# Create the query engine
query_engine = index.as_query_engine(
    similarity_top_k=TOP_K,  # How many chunks to retrieve
    response_mode="compact",  # Pack retrieved chunks into as few LLM calls as possible
)

print("✅ Query engine ready!")
print(f"   Will retrieve top {TOP_K} chunks for each query")

---
## Step 9: Query Function with Source Display

This function shows both the answer AND the retrieved chunks, so you can see what context the LLM used.

In [None]:
from IPython.display import display, Markdown

def ask(question: str, show_sources: bool = True):
    """
    Ask a question and display the answer with sources.
    
    Args:
        question: Your question about the documents
        show_sources: Whether to display retrieved chunks
    """
    print(f"🔍 Question: {question}\n")
    print("⏳ Thinking...")
    
    # Query the RAG system
    response = query_engine.query(question)
    
    # Display the answer
    print("\n" + "="*60)
    display(Markdown(f"## 💬 Answer\n\n{response}"))
    print("="*60)
    
    # Display retrieved sources
    if show_sources:
        print("\n📚 Retrieved Chunks:\n")
        
        for i, node in enumerate(response.source_nodes, 1):
            filename = node.metadata.get("file_name", "Unknown")
            page = node.metadata.get("page_label", "?")
            score = node.score if node.score else 0
            
            print(f"--- Chunk {i} (Score: {score:.3f}) ---")
            print(f"Source: {filename}, Page {page}")
            print(f"\n{node.text[:400]}{'...' if len(node.text) > 400 else ''}")
            print()
    
    return response

---
## 🎯 Try It Out!

Now you can ask questions about your documents. Edit the question below and run the cell.

In [None]:
# Ask a question about your document(s)!

response = ask("What is the main topic of this document?", show_sources=False)

In [None]:
# Try another question

response = ask("What methodology or approach is described?")

In [None]:
# Try asking something NOT in the documents to see how it handles it

response = ask("What is the capital of France?")

---
## 🧪 Experiments to Try

Use the cells below to explore how RAG behaves in different situations.

### Experiment 1: Specificity

Compare vague vs. specific questions. How does retrieval quality change?

In [None]:
# Vague question
response = ask("What is this about?")

In [None]:
# Specific question
response = ask("What specific methods or techniques are used in section 3?")

### Experiment 2: Attribution Prompting

Can you get the model to cite its sources more explicitly?

In [None]:
# Without attribution request
response = ask("What are the main findings?")

In [None]:
# With attribution request
response = ask("What are the main findings? Quote the relevant passages and cite page numbers.")

### Experiment 3: Changing Retrieval Settings

What happens if we retrieve more or fewer chunks?

In [None]:
# Create a query engine that retrieves MORE chunks
query_engine_more = index.as_query_engine(similarity_top_k=5)

print("Retrieving 5 chunks instead of 3:\n")
response = query_engine_more.query("Summarize the key points.")
print(response)
print(f"\nUsed {len(response.source_nodes)} chunks")

In [None]:
# Create a query engine that retrieves FEWER chunks
query_engine_less = index.as_query_engine(similarity_top_k=1)

print("Retrieving only 1 chunk:\n")
response = query_engine_less.query("Summarize the key points.")
print(response)
print(f"\nUsed {len(response.source_nodes)} chunk")

---
## 📝 Reflection Questions

After experimenting, consider:

1. **What made the difference between good and bad RAG responses?**

2. **When did the system fail to use the context properly?**

3. **How might you evaluate RAG quality systematically?**

4. **What would you change about the chunking strategy?**

5. **When might RAG NOT be the right approach?**

6. **What happens when you have contradicting documents in your folder?**
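
As a starting point for question 3, retrieval can be scored separately from generation. The sketch below computes two common retrieval metrics, hit rate at k and mean reciprocal rank (MRR), over a small hand-labelled set. The chunk IDs and labels are invented for illustration, and `hit_rate_and_mrr` is a hypothetical helper; frameworks such as RAGAS cover answer quality as well.

```python
def hit_rate_and_mrr(ranked_results, expected_ids, k=3):
    """
    ranked_results: one ranked list of chunk IDs per query
    expected_ids:   the chunk ID that should be retrieved for each query
    Returns (hit rate at k, mean reciprocal rank at k).
    """
    if not expected_ids or len(ranked_results) != len(expected_ids):
        raise ValueError("need one expected ID per query, and at least one query")
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, expected in zip(ranked_results, expected_ids):
        top = ranked[:k]
        if expected in top:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top.index(expected) + 1)
    n = len(expected_ids)
    return hits / n, reciprocal_rank_sum / n


# Invented evaluation set: three queries with the top-3 chunk IDs retrieved for each
ranked_results = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
expected_ids = ["c1", "c6", "c0"]  # the third query's chunk was never retrieved

hit_rate, mrr = hit_rate_and_mrr(ranked_results, expected_ids, k=3)
print(f"hit rate@3 = {hit_rate:.2f}, MRR@3 = {mrr:.2f}")
```

Re-running such a script after changing `CHUNK_SIZE`, `TOP_K`, or the embedding model gives a number to compare instead of an impression.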

## 📚 References & Further Reading

### Foundational Papers
- **RAG**: Lewis et al. (2020). [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401). *NeurIPS 2020*.
- **Dense Retrieval**: Karpukhin et al. (2020). [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906). *EMNLP 2020*.
- **Sentence Embeddings**: Reimers & Gurevych (2019). [Sentence-BERT](https://arxiv.org/abs/1908.10084). *EMNLP 2019*.

### Why RAG Works This Way
- Liu et al. (2023). [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172). *Explains why we chunk and retrieve rather than stuffing everything in context.*

### Tools We Used
- [LlamaIndex Documentation](https://docs.llamaindex.ai/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Ollama](https://ollama.com/) | [Model Library](https://ollama.com/library)
- [Sentence Transformers](https://www.sbert.net/) | [Model Hub](https://huggingface.co/sentence-transformers)

### Going Deeper
- Gao et al. (2024). [RAG Survey](https://arxiv.org/abs/2312.10997). *Comprehensive overview of RAG techniques.*
- [Anthropic RAG Guide](https://docs.anthropic.com/en/docs/build-with-claude/retrieval-augmented-generation)
- [RAGAS Evaluation Framework](https://github.com/explodinggradients/ragas)

---
## 🔧 Appendix: Understanding the Components

### Embedding Models Comparison

| Model | Dimensions | Speed | Quality | Size |
|-------|------------|-------|---------|------|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good | ~90MB |
| all-mpnet-base-v2 | 768 | Medium | Better | ~420MB |
| bge-small-en-v1.5 | 384 | ⚡ Fast | Good | ~130MB |
| bge-large-en-v1.5 | 1024 | Slow | Best | ~1.3GB |

### Chunking Strategies

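- **Smaller chunks** (256-512 tokens): More precise retrieval, but each chunk may lack surrounding context
- **Larger chunks** (1024+ tokens): More context per chunk, but more irrelevant text comes along with the relevant part
- **Overlap**: Repeats the end of one chunk at the start of the next, so content near a boundary keeps some of its context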
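
The trade-off also shows up in how many chunks get embedded and stored. Assuming a plain sliding window (a simplification: sentence-aware splitters will produce somewhat different counts), the number of chunks for a document of N tokens is roughly ceil((N - overlap) / (chunk_size - overlap)). The helper name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Estimate sliding-window chunk count for a document of total_tokens tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    step = chunk_size - chunk_overlap  # new tokens covered by each additional chunk
    return max(1, math.ceil((total_tokens - chunk_overlap) / step))


# A 10,000-token document with the 50-token overlap used in this tutorial
for size in (256, 512, 1024):
    print(f"chunk_size={size}: about {estimate_chunk_count(10_000, size, 50)} chunks")
```

For this example the estimate is 49, 22, and 11 chunks respectively, so halving the chunk size roughly doubles the number of embeddings to compute and store.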