# Week 1: RAG Foundations - Lecture Demo

**Goal:** Build a complete RAG system for Toyota specifications

**What we'll build:**
- Load and analyze Toyota PDF documents
- Implement a simple chunking strategy
- Create embeddings with Vertex AI
- Store them in a ChromaDB vector database
- Query and retrieve relevant information
- Generate answers with Gemini 2.5 Pro

**Duration:** 2 hours live session


## Part 1: Environment Setup and Data Exploration

First, let's verify our environment and explore the Toyota dataset.


In [1]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv()

# Verify GCP configuration
print("GCP Configuration:")
print("="*70)
print(f"Project ID: {os.getenv('GCP_PROJECT_ID')}")
print(f"Region: {os.getenv('GCP_REGION', 'us-central1')}")
print("\n✓ Environment variables loaded")


GCP Configuration:
Project ID: agentapps-473813
Region: us-central1

✓ Environment variables loaded
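
Because `os.getenv` returns `None` for an unset variable, the cell above would print `Project ID: None` rather than fail. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the demo project value are hypothetical and not part of the original notebook.

```python
import os

def require_env(name, default=None):
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name, default)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value

# Demo value so this sketch runs on its own; in the notebook it comes from .env
os.environ.setdefault("GCP_PROJECT_ID", "my-demo-project")

project_id = require_env("GCP_PROJECT_ID")
region = require_env("GCP_REGION", "us-central1")
print(f"Project ID: {project_id}")
print(f"Region: {region}")
```

Failing here, before any Vertex AI call, tends to give a clearer error than an authentication failure later in the notebook.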


In [2]:
# Check Python version and key imports
import sys
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Verify key packages are installed
try:
    import pypdf
    import chromadb
    from langchain_google_vertexai import VertexAIEmbeddings, VertexAI
    print("\n✓ All required packages are installed!")
except ImportError as e:
    print(f"\n❌ Missing package: {e}")
    print("Run: pip install -r requirements.txt")


Python version: 3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]
Python executable: /Users/itversity/Projects/Internal/aicohort-content/.venv/bin/python


✓ All required packages are installed!


### 1.1 Explore Toyota Dataset - File Sizes

Let's see what PDF files we have and their sizes.


In [3]:
import os
from pathlib import Path

import pypdf

# Path to Toyota specs directory
data_dir = Path("../data/car-specs/toyota-specs")

# List all PDF files with sizes
pdfs = sorted(data_dir.glob("*.pdf"))

print("Toyota Specifications Dataset")
print("="*70)
print(f"{'Filename':<45} {'Size (KB)':<10} {'Pages'}")
print("-"*70)

total_size = 0
for pdf in pdfs:
    size_kb = os.path.getsize(pdf) / 1024
    total_size += size_kb
    
    # Get page count
    with open(pdf, 'rb') as f:
        reader = pypdf.PdfReader(f)
        pages = len(reader.pages)
    
    print(f"{pdf.name:<45} {size_kb:>7.1f} KB {pages:>5}")

print("-"*70)
print(f"{'Total':<45} {total_size:>7.1f} KB")
print(f"\n✓ Found {len(pdfs)} Toyota specification documents")


Toyota Specifications Dataset
Filename                                      Size (KB)  Pages
----------------------------------------------------------------------
Introduction_to_Toyota_Car_Sales.pdf             48.6 KB     2
Toyota_Camry_Specifications.pdf                  74.3 KB     1
Toyota_Corolla_Specifications.pdf                73.4 KB     1
Toyota_Highlander_Specifications.pdf             74.9 KB     1
Toyota_Prius_Specifications.pdf                  74.9 KB     1
Toyota_RAV4_Specifications.pdf                   75.9 KB     1
Toyota_Tacoma_Specifications.pdf                 73.6 KB     1
Toyota_bZ4X_Specifications.pdf                   83.9 KB     4
----------------------------------------------------------------------
Total                                           579.5 KB

✓ Found 8 Toyota specification documents


### 1.2 Load PDF and Analyze Content

Extract the text from the Toyota Camry PDF and inspect its size and opening lines.


In [4]:
def load_pdf(pdf_path):
    """Load and extract text from a PDF file."""
    with open(pdf_path, 'rb') as f:
        reader = pypdf.PdfReader(f)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text

# Load Toyota Camry specifications
camry_path = data_dir / "Toyota_Camry_Specifications.pdf"
camry_text = load_pdf(camry_path)

# Analyze the text
print("Toyota Camry Text Analysis")
print("="*70)
print(f"Total characters: {len(camry_text):,}")
print(f"Total words: {len(camry_text.split()):,}")

# Count non-empty lines
line_count = len([line for line in camry_text.split('\n') if line.strip()])
print(f"Total lines: {line_count:,}")

print("\nFirst 500 characters:")
print("-"*70)
print(camry_text[:500])
print("...")

Toyota Camry Text Analysis
Total characters: 3,073
Total words: 434
Total lines: 68

First 500 characters:
----------------------------------------------------------------------
rag/Toyota_Camry_Specifications.md
Toyota Camry: The Sophisticated Midsize Sedan
Overview
The Toyota Camry is a premium midsize sedan renowned for its reliability, spacious interiors, and smooth performance. It caters to
professionals, small families, and those seeking a balance of luxury and efﬁciency, with hybrid options available for eco-conscious buyers.
Engine Options
2.5L 4-Cylinder Gasoline Engine
Power Output: 203 HP
Transmission: 8-speed automatic
Key Feature: Efﬁcient and reﬁned perfor
...
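
The extracted text contains typographic ligatures such as `ﬁ` (U+FB01) in "efﬁciency", so a plain string search for "efficiency" would not match it. One possible cleanup step, which this notebook does not apply, is Unicode NFKC normalization. The helper name `normalize_text` is hypothetical.

```python
import unicodedata

def normalize_text(text):
    """Fold compatibility characters (such as the 'fi' ligature) into plain letters."""
    return unicodedata.normalize("NFKC", text)

# Sample taken from the extracted Camry text, ligatures included
sample = "Key Feature: Efﬁcient and reﬁned"
cleaned = normalize_text(sample)

print("efficient" in sample.lower())   # the ligature hides the match
print("efficient" in cleaned.lower())  # matches after normalization
```

Applying this before chunking would change the character counts reported in the rest of the notebook, since each ligature expands to two letters.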


## Part 2: Load All Documents

Load all 8 Toyota PDFs into memory.


In [5]:
# Load all Toyota PDFs
documents = []

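for pdf in pdfs:  # already sorted above
    text = load_pdf(pdf)
    # Derive a display name from the filename,
    # e.g. "Toyota_Camry_Specifications" -> "Toyota Camry"
    model = pdf.stem.replace("_", " ").replace(" Specifications", "")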
    
    documents.append({
        "content": text,
        "source": pdf.name,
        "model": model
    })
    print(f"✓ Loaded: {model:<30} ({len(text):,} characters)")

print(f"\n✓ Successfully loaded {len(documents)} documents")


✓ Loaded: Introduction to Toyota Car Sales (1,871 characters)
✓ Loaded: Toyota Camry                   (3,073 characters)
✓ Loaded: Toyota Corolla                 (2,685 characters)
✓ Loaded: Toyota Highlander              (2,877 characters)
✓ Loaded: Toyota Prius                   (3,310 characters)
✓ Loaded: Toyota RAV4                    (3,174 characters)
✓ Loaded: Toyota Tacoma                  (2,668 characters)
✓ Loaded: Toyota bZ4X                    (3,090 characters)

✓ Successfully loaded 8 documents


## Part 3: Chunking Strategy

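Split each document into fixed-size chunks (500 characters with a 50-character overlap) so that retrieval returns focused passages instead of whole documents.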


In [6]:
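def simple_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    # Without this guard the window would never advance
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        
        # Skip chunks that contain only whitespace
        if chunk.strip():
            chunks.append(chunk)
        
        # Step forward, keeping `overlap` characters of shared context
        start = end - overlap
        
        # Stop once the last chunk has already reached the end of the text
        if start >= len(text) - overlap:
            break
    
    return chunks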

# Chunk all documents
all_chunks = []

for doc in documents:
    chunks = simple_chunk(doc["content"], chunk_size=500, overlap=50)
    
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "content": chunk,
            "model": doc["model"],
            "source": doc["source"],
            "chunk_id": f"{doc['source']}_{i}"
        })
    
    print(f"✓ {doc['model']:<30} -> {len(chunks):>3} chunks")

print(f"\n✓ Created {len(all_chunks)} chunks from {len(documents)} documents")


✓ Introduction to Toyota Car Sales ->   5 chunks
✓ Toyota Camry                   ->   7 chunks
✓ Toyota Corolla                 ->   6 chunks
✓ Toyota Highlander              ->   7 chunks
✓ Toyota Prius                   ->   8 chunks
✓ Toyota RAV4                    ->   7 chunks
✓ Toyota Tacoma                  ->   6 chunks
✓ Toyota bZ4X                    ->   7 chunks

✓ Created 53 chunks from 8 documents
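
The chunk counts above follow from the window arithmetic: each chunk starts `chunk_size - overlap` = 450 characters after the previous one. The sketch below predicts the counts from the character totals reported in Part 2. The helper name `expected_chunk_count` is hypothetical, and the formula assumes no chunk is whitespace-only (such chunks are skipped by `simple_chunk`).

```python
import math

def expected_chunk_count(n_chars, chunk_size=500, overlap=50):
    """Predict how many chunks a fixed-size sliding window produces."""
    if n_chars <= 0:
        return 0
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    return max(1, math.ceil((n_chars - overlap) / stride))

# Character counts reported when the documents were loaded
char_counts = {
    "Introduction to Toyota Car Sales": 1871,
    "Toyota Camry": 3073,
    "Toyota Corolla": 2685,
    "Toyota Highlander": 2877,
    "Toyota Prius": 3310,
    "Toyota RAV4": 3174,
    "Toyota Tacoma": 2668,
    "Toyota bZ4X": 3090,
}

counts = {model: expected_chunk_count(n) for model, n in char_counts.items()}
for model, n in counts.items():
    print(f"{model:<35} {n:>3} chunks")
print(f"Total: {sum(counts.values())} chunks")
```

For the Camry, (3,073 - 50) / 450 is about 6.7, which rounds up to the 7 chunks shown above.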


## Part 4: ChromaDB Storage

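Embed each chunk with Vertex AI and store the vectors in ChromaDB for semantic search. The first cell authenticates with Google Cloud using Application Default Credentials.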


In [10]:
!gcloud auth application-default login


Credentials saved to file: [/Users/itversity/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "agentapps-473813" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.



In [11]:
os.getenv("GCP_PROJECT_ID")

'agentapps-473813'

In [15]:
import chromadb
from langchain_google_vertexai import VertexAIEmbeddings
from chromadb.api.types import EmbeddingFunction, Documents
import os

In [16]:
# Custom wrapper to adapt LangChain embeddings to ChromaDB interface
class VertexAIEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str, project: str, location: str):
        self.embeddings = VertexAIEmbeddings(
            model_name=model_name,
            project=project,
            location=location
        )
    
    def __call__(self, input: Documents) -> list:
        """Embed documents using Vertex AI"""
        return self.embeddings.embed_documents(input)

In [17]:
# Initialize Vertex AI embeddings wrapper
print("Initializing Vertex AI embeddings...")
embedding_function = VertexAIEmbeddingFunction(
    model_name="text-embedding-004",
    project=os.getenv("GCP_PROJECT_ID"),
    location=os.getenv("GCP_REGION", "us-central1")
)

Initializing Vertex AI embeddings...


In [18]:
# Initialize an in-memory ChromaDB client (the Vertex AI embedding function is attached to the collection below)
client = chromadb.Client()

In [19]:
# Delete collection if it exists (for clean state)
try:
    client.delete_collection(name="toyota_specs_week1")
    print("✓ Deleted existing collection")
except Exception:
    # The collection did not exist yet, so there is nothing to delete
    pass

✓ Deleted existing collection


In [20]:
# Create new collection
collection = client.create_collection(
    name="toyota_specs_week1",
    embedding_function=embedding_function,
    metadata={"description": "Toyota specifications - Week 1 demo"}
)

In [21]:
# Prepare data
documents_list = [chunk["content"] for chunk in all_chunks]
metadatas = [{"model": chunk["model"], "source": chunk["source"]} for chunk in all_chunks]
ids = [chunk["chunk_id"] for chunk in all_chunks]

In [22]:
# Add to collection (will use Vertex AI for embeddings)
print("Adding chunks to ChromaDB (creating embeddings with Vertex AI)...")
collection.add(documents=documents_list, metadatas=metadatas, ids=ids)

print(f"✓ Stored {collection.count()} chunks in ChromaDB with Vertex AI embeddings")

Adding chunks to ChromaDB (creating embeddings with Vertex AI)...
✓ Stored 53 chunks in ChromaDB with Vertex AI embeddings


## Part 5: Query and Retrieve

Test semantic search: embed a query and retrieve the chunks closest to it.


In [23]:
query = "What is the Toyota Camry's horsepower?"

print(f"Query: '{query}'\n")
print("Retrieving top 3 relevant chunks...\n")
print("="*70)

Query: 'What is the Toyota Camry's horsepower?'

Retrieving top 3 relevant chunks...


In [24]:
# Query ChromaDB
results = collection.query(query_texts=[query], n_results=3)

In [25]:
# Display results
for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    print(f"\nResult {i}:")
    print(f"  Model: {metadata['model']}")
    # The collection uses Chroma's default distance metric (squared L2),
    # so 1 - distance is a rough relevance score, not a true cosine similarity.
    print(f"  Similarity: {1 - distance:.4f}")
    print(f"  Content: {doc[:150]}...")
    print("-"*70)


Result 1:
  Model: Toyota Camry
  Similarity: 0.7040
  Content: rag/Toyota_Camry_Specifications.md
Toyota Camry: The Sophisticated Midsize Sedan
Overview
The Toyota Camry is a premium midsize sedan renowned for its...
----------------------------------------------------------------------

Result 2:
  Model: Toyota Camry
  Similarity: 0.6471
  Content: ront-Wheel Drive) only
What is the warranty for the Toyota Camry?
3 years/36,000 miles basic warranty and 5 years/60,000 miles powertrain warranty
Qui...
----------------------------------------------------------------------

Result 3:
  Model: Toyota Camry
  Similarity: 0.5994
  Content: fessionals: Modern interiors and advanced tech for a sophisticated experience
Eco-Conscious Buyers: Hybrid models with outstanding fuel efﬁciency
Smal...
----------------------------------------------------------------------
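
The `Similarity` value above is computed as `1 - distance`. The collection was created without a distance setting, and Chroma's default metric is squared L2, so that value is not a cosine similarity. For unit-length vectors the two are related by d = 2 - 2·cos. The sketch below checks this identity on made-up 3-dimensional vectors (not real embeddings), and it assumes the embeddings are normalized to unit length.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and a chunk embedding
query_vec = normalize([0.2, 0.7, 0.1])
chunk_vec = normalize([0.3, 0.6, 0.2])

d = squared_l2(query_vec, chunk_vec)
cos = cosine_similarity(query_vec, chunk_vec)

print(f"squared L2 distance: {d:.4f}")
print(f"cosine similarity:   {cos:.4f}")
print(f"1 - d/2:             {1 - d / 2:.4f}")  # equals the cosine for unit vectors
```

Under that assumption, the top result's score of 0.7040 (a distance of 0.2960) would correspond to a cosine similarity of about 0.852. Chroma also documents a `hnsw:space` collection metadata key (for example `"cosine"`) for choosing the metric at creation time; check the documentation for your installed version.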


## Part 6: Generate Answer with Gemini Pro

Use the LLM to generate a natural language answer based on retrieved context.


In [26]:
from langchain_google_vertexai import VertexAI

In [29]:
# Initialize Gemini 2.5 Pro (temperature=0 for more repeatable answers)
llm = VertexAI(model_name="gemini-2.5-pro", temperature=0)

In [30]:
# Build prompt with context
context = "\n\n".join(results['documents'][0])

prompt = f"""You are a helpful Toyota sales assistant. Answer the customer's question based on the provided information.

Context from Toyota specifications:
{context}

Customer question: {query}

Provide a clear, accurate answer. If the information isn't in the context, say so.

Answer:"""

# Generate answer
answer = llm.invoke(prompt)

print("="*70)
print(f"Question: {query}")
print("="*70)
print(f"Answer: {answer}")
print("="*70)

Question: What is the Toyota Camry's horsepower?
Answer: The horsepower of the Toyota Camry depends on the engine you choose:

*   The 2.5L 4-cylinder gasoline engine has **203 horsepower**.
*   The V6 gasoline engine has **301 horsepower**.
*   The hybrid models have a combined horsepower of **208**.
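
The prompt above joins the retrieved chunks without saying where each one came from. A possible variation, sketched below with a hypothetical `build_prompt` helper, numbers each chunk and labels it with the model name from its metadata so an answer can be traced back to a source. The sample chunks are shortened stand-ins, not actual retrieval output.

```python
def build_prompt(question, docs, metadatas):
    """Assemble a RAG prompt whose context chunks are numbered and labeled by source."""
    labeled = [
        f"[{i}] Source: {meta['model']}\n{doc.strip()}"
        for i, (doc, meta) in enumerate(zip(docs, metadatas), 1)
    ]
    context = "\n\n".join(labeled)
    return (
        "You are a helpful Toyota sales assistant. "
        "Answer using only the context below and cite chunk numbers like [1].\n\n"
        f"Context from Toyota specifications:\n{context}\n\n"
        f"Customer question: {question}\n\n"
        "If the information isn't in the context, say so.\n\n"
        "Answer:"
    )

# Shortened stand-ins for results['documents'][0] and results['metadatas'][0]
docs = [
    "Power Output: 203 HP\nTransmission: 8-speed automatic",
    "Eco-Conscious Buyers: Hybrid models with outstanding fuel efficiency",
]
metadatas = [{"model": "Toyota Camry"}, {"model": "Toyota Camry"}]

prompt = build_prompt("What is the Toyota Camry's horsepower?", docs, metadatas)
print(prompt)
```

The returned string could be passed to `llm.invoke(prompt)` in the same way as the prompt built in the cell above.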


## Part 7: Complete RAG Function

Wrap retrieval, prompt construction, and generation in a reusable function.


In [31]:
def ask_toyota_question(question, collection, llm):
    """
    Ask a question about Toyota vehicles using RAG.
    
    Args:
        question: User's question
        collection: ChromaDB collection
        llm: Language model
        
    Returns:
        tuple: (answer, sources)
    """
    # Retrieve
    results = collection.query(query_texts=[question], n_results=3)
    
    # Build context
    context = "\n\n".join(results['documents'][0])
    sources = results['metadatas'][0]
    
    # Create prompt
    prompt = f"""You are a helpful Toyota sales assistant. Answer based on the provided information.

Context:
{context}

Question: {question}

Answer:"""
    
    # Generate
    answer = llm.invoke(prompt)
    
    return answer, sources

In [32]:
# Test with multiple queries
test_queries = [
    "What's the Camry's horsepower?",
    "What safety features does the RAV4 have?",
    "Tell me about Toyota reliability"
]

In [33]:
print("Testing Complete RAG System")
print("="*70)

for i, q in enumerate(test_queries, 1):
    print(f"\n{i}. Q: {q}")
    answer, sources = ask_toyota_question(q, collection, llm)
    print(f"   A: {answer}")
    print(f"   Sources: {[s['model'] for s in sources]}")


Testing Complete RAG System

1. Q: What's the Camry's horsepower?
   A: Based on the provided information, the Toyota Camry with the 2.5L 4-Cylinder gasoline engine has a power output of 203 HP.
   Sources: ['Toyota Camry', 'Toyota Camry', 'Toyota Camry']

2. Q: What safety features does the RAV4 have?
   A: Based on the information provided, the Toyota RAV4 has "top-tier safety features," which make it an ideal vehicle for families.
   Sources: ['Toyota RAV4', 'Toyota RAV4', 'Toyota RAV4']

3. Q: Tell me about Toyota reliability
   A: Based on the information provided, Toyota has long been known for its reliability.

Specifically, the Toyota Corolla is renowned for its long-term reliability and resale value, especially when compared to competitors like the Hyundai Elantra.
   Sources: ['Introduction to Toyota Car Sales', 'Introduction to Toyota Car Sales', 'Toyota Corolla']
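
The `Sources` lists above repeat a model name once per retrieved chunk. A small sketch for collapsing them while preserving retrieval order follows. The helper name `unique_models` is hypothetical, and the sample metadata mirrors the third query's output.

```python
def unique_models(sources):
    """Return model names from chunk metadata, de-duplicated in first-seen order."""
    # dict.fromkeys keeps insertion order, so the best-ranked source stays first
    return list(dict.fromkeys(s["model"] for s in sources))

# Metadata in the shape returned by results['metadatas'][0]
sources = [
    {"model": "Introduction to Toyota Car Sales", "source": "Introduction_to_Toyota_Car_Sales.pdf"},
    {"model": "Introduction to Toyota Car Sales", "source": "Introduction_to_Toyota_Car_Sales.pdf"},
    {"model": "Toyota Corolla", "source": "Toyota_Corolla_Specifications.pdf"},
]

print(unique_models(sources))
```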
