[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kgweber-cwru/coding-with-ai-wn26/blob/main/week-5-building-rag-systems/concepts.ipynb)

# Week 5: Building Complete RAG Systems

This week, we'll move our vector document store into a more permanent database in ChromaDB and work
toward creating a complete RAG application.

## Learning Objectives
- Build production-quality RAG pipelines
- Implement document chunking strategies
- Handle multiple document sources
- Improve retrieval quality
- Add metadata and filtering

In [1]:
import os
import sys
from pathlib import Path

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install -q google-genai google-auth python-dotenv numpy chromadb
    from google.colab import auth
    auth.authenticate_user()
    try:
        PROJECT_ID = input("Enter your Google Cloud Project ID (press Enter to use default ADC): ").strip()
    except Exception:
        PROJECT_ID = ""
    if PROJECT_ID:
        os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
else:
    def find_service_account_json(max_up=6):
        p = Path.cwd()
        for _ in range(max_up):
            candidate = p / "series-2-coding-llms" / "creds"
            if candidate.exists():
                for f in candidate.glob("*.json"):
                    return str(f.resolve())
            candidate2 = p / "creds"
            if candidate2.exists():
                for f in candidate2.glob("*.json"):
                    return str(f.resolve())
            p = p.parent
        return None

    sa_path = find_service_account_json()
    if sa_path:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = sa_path
    else:
        try:
            from dotenv import load_dotenv
            load_dotenv()
        except Exception:
            pass


In [2]:
import chromadb
from datetime import datetime
from google import genai
from google.genai import types
import google.auth

creds, project = google.auth.default()
project = os.environ.get("GOOGLE_CLOUD_PROJECT", project)
client = genai.Client(vertexai=True, project=project, location="us-central1")
print(f"Using project: {project}")

print("✅ Environment loaded successfully!")

Using project: coding-with-ai-wn-26
✅ Environment loaded successfully!


## Part 1: Document Processing Pipeline

These two helper classes represent a document with its metadata and split document text into chunks.

In [3]:
class Document:
    """Represents a document with metadata"""
    def __init__(self, content, metadata=None):
        self.content = content
        # Copy the dict so adding 'created' below does not mutate the caller's metadata
        self.metadata = dict(metadata or {})
        self.metadata['created'] = datetime.now().isoformat()
    
    def __repr__(self):
        return f"Document(content={self.content[:50]}..., metadata={self.metadata})"

class DocumentProcessor:
    """Process and chunk documents"""
    
    @staticmethod
    def chunk_by_sentences(text, chunk_size=3, overlap=1):
        """Chunk by sentence count"""
        sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]
        chunks = []
        
        # Guard the step so overlap >= chunk_size cannot produce a zero or negative stride
        for i in range(0, len(sentences), max(1, chunk_size - overlap)):
            chunk = ' '.join(sentences[i:i + chunk_size])
            if chunk:
                chunks.append(chunk)
        
        return chunks
    
    @staticmethod
    def chunk_by_words(text, chunk_size=200, overlap=50):
        """Chunk by word count"""
        words = text.split()
        chunks = []
        
        # Guard the step so overlap >= chunk_size cannot produce a zero or negative stride
        for i in range(0, len(words), max(1, chunk_size - overlap)):
            chunk = ' '.join(words[i:i + chunk_size])
            if chunk:
                chunks.append(chunk)
        
        return chunks

print("✅ Document classes ready")

✅ Document classes ready
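
To see how `chunk_size` and `overlap` interact, here is a minimal standalone sketch of the same sliding-window idea that `chunk_by_sentences` uses. The function name `sliding_chunks` and the sample sentences are illustrative only and are not part of the notebook's classes.

```python
def sliding_chunks(items, chunk_size=3, overlap=1):
    """Group items into windows of chunk_size that share `overlap` items."""
    step = max(1, chunk_size - overlap)  # guard against a zero or negative step
    return [items[i:i + chunk_size] for i in range(0, len(items), step)]

# Hypothetical sample data: five one-word "sentences"
sentences = ["A.", "B.", "C.", "D.", "E."]
windows = sliding_chunks(sentences, chunk_size=3, overlap=1)

for w in windows:
    print(" ".join(w))
```

With a window of 3 and an overlap of 1, each window starts 2 items after the previous one. Note that the final window can be a short fragment whose content already appears at the end of the previous window.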


## Part 2: From In-Memory Store to a Real Vector Database

In Week 4, we built `SimpleDocumentStore`, which kept our vectors in memory. This is great for learning, but has major limitations:
- **Volatility**: All data is lost when the script ends.
- **Scalability**: It doesn't scale to millions of documents.
- **Performance**: Searching requires comparing the query with *every single document*, which is slow.

To solve this, we use a dedicated **vector database**. For this series, we'll use **ChromaDB**.

### Why ChromaDB?
- **Open-source and easy to use**: Perfect for getting started.
- **Persistent**: It can save your database to disk. In Colab, we can save it to our Google Drive.
- **Scalable**: It's designed to handle large-scale vector search efficiently.
- **Feature-rich**: It supports metadata filtering, which we'll use.

### ChromaDB Documentation

https://github.com/chroma-core/chroma

### How it Works
Instead of simple lists, ChromaDB organizes data into **collections**. When you add documents, it stores the text, its vector embedding, and any metadata you provide. For searching, it uses an approximate nearest-neighbor index (HNSW) to find close matches without comparing the query against every stored vector, which keeps search fast as the collection grows.

Let's build a store that uses ChromaDB. We'll save our database to Google Drive so it persists across Colab sessions.

For this exercise, we'll continue to use Google's `gemini-embedding-001` model. If you're working locally,
ChromaDB ships with a built-in default embedding function you could use instead, or you can specify any other embedding model you like.

In [4]:
class ChromaDocumentStore:
    """Document store using ChromaDB for persistence and efficient search"""
    
    def __init__(self, collection_name="rag_collection"):
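        if IN_COLAB:
            # Save to Google Drive in Colab; Drive must be mounted before the path exists
            from google.colab import drive
            drive.mount("/content/drive")
            db_path = "/content/drive/MyDrive/chroma_db"
            os.makedirs(db_path, exist_ok=True)
            self.client = chromadb.PersistentClient(path=db_path)
        else:
            # chromadb.Client() is in-memory only, so data is lost when the process ends.
            # Use chromadb.PersistentClient(path=...) to save to local disk instead.
            self.client = chromadb.Client()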
            
        self.collection = self.client.get_or_create_collection(name=collection_name)
    
    def add_document(self, document, chunk_strategy='sentences', chunk_size=3):
        """Add document with chunking"""
        processor = DocumentProcessor()
        
        if chunk_strategy == 'sentences':
            chunks = processor.chunk_by_sentences(document.content, chunk_size)
        else:
            chunks = processor.chunk_by_words(document.content, chunk_size)
        
        if not chunks:
            return 0

        # Get embeddings for all chunks in one API call
        response = client.models.embed_content(
            model="gemini-embedding-001",
            contents=[c.replace("\n", " ") for c in chunks]
        )
        embeddings = [e.values for e in response.embeddings]
        
        # Create a unique document ID for tracking
        document_id = f"{document.metadata.get('source', 'doc')}_{datetime.now().timestamp()}"
        
        # Prepare metadata and IDs for ChromaDB
        metadatas = []
        ids = []
        for i, chunk in enumerate(chunks):
            chunk_metadata = document.metadata.copy()
            chunk_metadata['document_id'] = document_id
            chunk_metadata['chunk_index'] = i
            chunk_metadata['total_chunks'] = len(chunks)
            metadatas.append(chunk_metadata)
            # Create a unique ID for each chunk
            ids.append(f"{document_id}_{i}")

        # Add to ChromaDB collection
        self.collection.add(
            embeddings=embeddings,
            documents=chunks,
            metadatas=metadatas,
            ids=ids
        )
        
        return len(chunks)
    
    def search(self, query, top_k=5, filters=None):
        """Search with optional metadata filtering"""
        response = client.models.embed_content(
            model="gemini-embedding-001",
            contents=query
        )
        query_embedding = response.embeddings[0].values
        
        # ChromaDB handles filtering with the 'where' clause
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            where=filters
        )
        
        # Format results to match our RAG system's expectations
        formatted_results = []
        if results and results['documents']:
            for i, doc in enumerate(results['documents'][0]):
                formatted_results.append({
                    "chunk": doc,
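                    # Chroma returns a distance (squared L2 by default), so this is a rough
                    # relevance score rather than a true cosine similarity; it can be negative
                    "similarity": 1 - results['distances'][0][i],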
                    "metadata": results['metadatas'][0][i]
                })
        
        return formatted_results
    
    def __len__(self):
        return self.collection.count()
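
The `1 - distance` conversion in `search` deserves a closer look. Assuming the collection uses Chroma's default squared-L2 distance and the embeddings are unit-length (both are assumptions here; the notebook does not verify either), the distance equals `2 - 2 * cos`, so `1 - distance` works out to `2 * cos - 1` rather than the cosine similarity itself. That would explain how a score can drop below zero. A pure-Python sketch with made-up 2-D vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity only for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Illustrative vectors, not real embeddings
a = normalize([1.0, 0.0])
b = normalize([1.0, 1.0])

cos_sim = cosine(a, b)      # true cosine similarity
dist = squared_l2(a, b)     # what a squared-L2 index would report
score = 1 - dist            # the conversion used in `search`
recovered = 1 - dist / 2    # recovers cosine similarity for unit vectors

print(f"cosine={cos_sim:.3f} distance={dist:.3f} score={score:.3f} recovered={recovered:.3f}")
```

Under these assumptions the ranking is unchanged, since `2 * cos - 1` rises and falls with `cos`; only the scale of the printed score differs. Chroma also lets you choose a cosine distance when creating a collection; check its documentation for the exact setting in your version.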

## Part 3: Complete RAG System

In [5]:
class RAGSystem:
    """Complete RAG system with a pluggable document store"""
    
    def __init__(self, store_impl=ChromaDocumentStore):
        self.store = store_impl()
    
    def add_document(self, content, metadata=None, **kwargs):
        """Add document to system"""
        doc = Document(content, metadata)
        chunks_added = self.store.add_document(doc, **kwargs)
        return chunks_added
    
    def query(self, question, top_k=3, filters=None, temperature=0.3):
        """Query system with RAG"""
        # Retrieve
        results = self.store.search(question, top_k=top_k, filters=filters)
        
        if not results:
            return {
                "answer": "No relevant documents found.",
                "sources": []
            }
        
        # Build context
        context = "\n\n".join([
            f"[Source {i+1}] {r['chunk']}"
            for i, r in enumerate(results)
        ])
        
        # Generate
        prompt = f"""Answer the question based on the provided context. 
If the answer is not in the context, say so.

Context:
{context}

Question: {question}

Answer:"""
        
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=prompt,
            config=types.GenerateContentConfig(temperature=temperature)
        )
        
        return {
            "answer": response.text,
            "sources": results
        }
    
    def stats(self):
        """Get system statistics"""
        total_chunks = len(self.store)
        # To get unique docs, count unique document_id values in metadata
        if isinstance(self.store, ChromaDocumentStore):
            all_meta = self.store.collection.get(include=['metadatas'])
            unique_docs = len(set(m.get('document_id', '') for m in all_meta['metadatas']))
        else:
            unique_docs = "N/A"

        return {
            "total_chunks": total_chunks,
            "unique_documents": unique_docs,
            "store_implementation": self.store.__class__.__name__
        }

print("✅ RAG system ready")

✅ RAG system ready
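
The context-building step inside `query` is easy to inspect in isolation. The sketch below mirrors it with a hypothetical helper, `build_context`, and made-up retrieval results shaped like the dictionaries that `search` returns; no API calls are involved.

```python
def build_context(results):
    """Number each retrieved chunk so the answer can be traced back to a source."""
    return "\n\n".join(
        f"[Source {i}] {r['chunk']}" for i, r in enumerate(results, start=1)
    )

# Made-up results in the same shape that `search` returns
sample_results = [
    {"chunk": "Treatment includes ACE inhibitors or diuretics.", "similarity": 0.53, "metadata": {}},
    {"chunk": "Regular monitoring is essential.", "similarity": 0.30, "metadata": {}},
]

context = build_context(sample_results)
print(context)
```

Printing the assembled context before sending it to the model is a quick way to check what the model will actually see when an answer looks wrong.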


## Part 4: Example - Medical Knowledge Base

In [6]:
# Create RAG system
rag = RAGSystem()

# Add documents with metadata
documents = [
    {
        "content": """Hypertension, or high blood pressure, is a common condition where blood 
        pressure is consistently elevated. Treatment includes lifestyle changes like diet and 
        exercise, and medications such as ACE inhibitors or diuretics. Regular monitoring is essential.""",
        "metadata": {"topic": "cardiovascular", "source": "clinical_guide", "date": "2024"}
    },
    {
        "content": """Type 2 diabetes is characterized by insulin resistance and high blood sugar. 
        Management includes blood glucose monitoring, dietary modifications, exercise, and medications 
        like metformin. Complications can affect kidneys, eyes, and nerves if uncontrolled.""",
        "metadata": {"topic": "endocrine", "source": "clinical_guide", "date": "2024"}
    },
    {
        "content": """Asthma is a chronic respiratory condition causing airway inflammation and bronchospasm. 
        Symptoms include wheezing, shortness of breath, and coughing. Treatment involves inhaled 
        corticosteroids for prevention and bronchodilators for acute symptoms.""",
        "metadata": {"topic": "respiratory", "source": "clinical_guide", "date": "2024"}
    }
]

for doc in documents:
    chunks = rag.add_document(doc["content"], doc["metadata"], chunk_strategy='sentences', chunk_size=2)
    print(f"Added document: {chunks} chunks")

print(f"\nSystem stats: {rag.stats()}")

Added document: 3 chunks
Added document: 3 chunks
Added document: 3 chunks

System stats: {'total_chunks': 9, 'unique_documents': 3, 'store_implementation': 'ChromaDocumentStore'}


In [11]:
# Query the system
questions = [
    "What medications are used to treat high blood pressure?",
    "What is the air-speed velocity of a swallow?",
    "How is diabetes managed?",
    "Tell me about respiratory conditions"
]

for q in questions:
    print(f"\nQ: {q}")
    print("="*70)
    result = rag.query(q, top_k=2)
    print(f"A: {result['answer']}")
    print("\nSources used:")
    for i, src in enumerate(result['sources'], 1):
        print(f"  {i}. [Score: {src['similarity']:.3f}] {src['chunk'][:80]}...")
        print(f"     Metadata: {src['metadata']}")


Q: What medications are used to treat high blood pressure?
A: ACE inhibitors or diuretics are used to treat high blood pressure.

Sources used:
  1. [Score: 0.528] Hypertension, or high blood pressure, is a common condition where blood 
       ...
     Metadata: {'source': 'clinical_guide', 'total_chunks': 3, 'document_id': 'clinical_guide_1769447226.974153', 'created': '2026-01-26T12:07:06.549193', 'topic': 'cardiovascular', 'date': '2024', 'chunk_index': 0}
  2. [Score: 0.302] Treatment includes lifestyle changes like diet and 
        exercise, and medica...
     Metadata: {'created': '2026-01-26T12:07:06.549193', 'date': '2024', 'total_chunks': 3, 'source': 'clinical_guide', 'topic': 'cardiovascular', 'chunk_index': 1, 'document_id': 'clinical_guide_1769447226.974153'}

Q: What is the air-speed velocity of a swallow?
A: The provided context does not contain information about the air-speed velocity of a swallow.

Sources used:
  1. [Score: -0.419] Hypertension, or high blood pressu

## Part 5: Filtered Search with ChromaDB

ChromaDB makes filtering easy and efficient. Instead of checking metadata in Python, we pass a `where` clause to the `query` method. This is much faster as the filtering happens inside the database.

The `where` clause format is a dictionary, just like we used before. For example: `{"topic": "cardiovascular"}`.

In [10]:
# Search only cardiovascular topics
result = rag.query(
    "What treatments are available?",
    filters={"topic": "cardiovascular"},
    top_k=2
)

print("Filtered to cardiovascular only:")
print(result['answer'])


Filtered to cardiovascular only:
Treatments include lifestyle changes like diet and exercise, and medications such as ACE inhibitors or diuretics. Regular monitoring is also essential.
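
Filtering on more than one field takes a little more structure. To my knowledge, Chroma expects multiple conditions to be combined under a logical operator such as `$and` rather than listed as several keys in one dictionary; treat that as an assumption to confirm against the Chroma documentation for your version. The helper `build_where` below is hypothetical and only constructs the dictionary.

```python
def build_where(**conditions):
    """Build a Chroma-style `where` filter from keyword equality conditions."""
    clauses = [{key: value} for key, value in conditions.items()]
    if not clauses:
        return None           # no filter at all
    if len(clauses) == 1:
        return clauses[0]     # a single condition needs no operator
    return {"$and": clauses}  # combine several conditions explicitly

print(build_where(topic="cardiovascular"))
print(build_where(topic="cardiovascular", date="2024"))
```

The result can be passed straight through the existing interface, for example `rag.query("What treatments are available?", filters=build_where(topic="cardiovascular", date="2024"))`.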


## Key Takeaways

1. **Vector Databases are Essential**: For any real-world RAG system, an in-memory store is not enough. A vector database like ChromaDB provides persistence, scalability, and performance.
2. **Persistence in Colab**: By mounting Google Drive, we can save our ChromaDB database and reuse it across sessions, which is critical for iterative development.
3. **Efficient Filtering**: Vector databases handle metadata filtering internally, which is much more efficient than post-filtering in Python.
4. **Batching is Better**: When adding multiple documents or chunks, it's more efficient to get all embeddings in a single API call rather than one by one.
5. **Abstraction is Key**: Our `RAGSystem` can now work with different storage backends (`SimpleDocumentStore`, `ChromaDocumentStore`, etc.) without changing its core logic.
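
On the batching point: embedding APIs generally cap how many inputs a single request may carry, so a long document may need several batched calls instead of one. The sketch below shows only the splitting logic. The helper name `batched` and the batch size of 3 are illustrative assumptions, not limits taken from the Gemini API.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical chunks standing in for real document text
chunks = [f"chunk {n}" for n in range(7)]
batches = list(batched(chunks, batch_size=3))

for b in batches:
    print(len(b), b)
```

Each batch would then go to one `embed_content` call, and the returned embeddings can be concatenated in order so they still line up with the original chunks.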

## Next Week

Best practices and production patterns:
- Error handling
- Cost optimization
- Testing strategies
- Deployment considerations