# Week 1 Assignment: Build Your Toyota RAG System - SOLUTION

**Instructor Reference Solution**

This notebook contains complete implementations for all assignment tasks.


## Part 1: Environment Setup


In [None]:
import sys
from pathlib import Path
import pypdf
import chromadb
from langchain_google_vertexai import VertexAI

print(f"Python version: {sys.version}")
print("✓ All imports successful!")


## Part 2: Load Toyota PDFs


In [None]:
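def load_pdf(pdf_path):
    """Load and extract text from a PDF file."""
    with open(pdf_path, 'rb') as f:
        reader = pypdf.PdfReader(f)
        # Fall back to "" for pages that yield no text (e.g. image-only pages),
        # and join pages with a newline so words do not run together across pages.
        pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)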

# Test
data_dir = Path("../data/car-specs/toyota-specs")
test_pdf = data_dir / "Toyota_Camry_Specifications.pdf"
text = load_pdf(test_pdf)

assert len(text) > 2000, "Should extract substantial text"
print(f"✓ Loaded {len(text)} characters from {test_pdf.name}")


## Part 3: Chunking Implementation


In [None]:
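def simple_chunk(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks."""
    # Guard: with overlap >= chunk_size, start would not advance
    # and the loop could run forever.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Skip whitespace-only chunks.
        if chunk.strip():
            chunks.append(chunk)

        # Step back by `overlap` so consecutive chunks share context.
        start = end - overlap

        # The remaining tail is already covered by the chunk just emitted.
        if start >= len(text) - overlap:
            break

    return chunks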

# Load all PDFs and chunk
pdfs = sorted(data_dir.glob("*.pdf"))
all_chunks = []

for pdf in pdfs:
    text = load_pdf(pdf)
    model = pdf.stem.replace("_", " ").replace(" Specifications", "")
    chunks = simple_chunk(text, chunk_size=500, overlap=50)
    
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "content": chunk,
            "model": model,
            "source": pdf.name,
            "chunk_id": f"{pdf.name}_{i}"
        })

print(f"✓ Created {len(all_chunks)} chunks from {len(pdfs)} documents")


## Part 4: Store in ChromaDB


In [None]:
client = chromadb.Client()
collection = client.create_collection(
    name="toyota_specs_assignment",
    metadata={"description": "Toyota specs - Assignment solution"}
)

documents_list = [chunk["content"] for chunk in all_chunks]
metadatas = [{"model": chunk["model"], "source": chunk["source"]} for chunk in all_chunks]
ids = [chunk["chunk_id"] for chunk in all_chunks]

collection.add(documents=documents_list, metadatas=metadatas, ids=ids)
print(f"✓ Stored {collection.count()} chunks in ChromaDB")


## Part 5: Complete RAG Function


In [None]:
llm = VertexAI(model_name="gemini-pro", temperature=0)

def ask_toyota_question(question, collection, llm):
    """Ask a question about Toyota using RAG."""
    results = collection.query(query_texts=[question], n_results=3)
    # Join retrieved chunks with real blank lines (not a literal backslash-n).
    context = "\n\n".join(results['documents'][0])
    sources = results['metadatas'][0]
    
    prompt = f"""You are a helpful Toyota sales assistant. Answer based on the provided information.

Context:
{context}

Question: {question}

Answer:"""
    
    answer = llm.invoke(prompt)
    return answer, sources

# Test
answer, sources = ask_toyota_question("What's the Camry's horsepower?", collection, llm)
print(f"Answer: {answer}")
print(f"Sources: {[s['model'] for s in sources]}")


## Part 6: Reflection Questions

### Question 1: Chunking Strategy

**Answer:** Fixed 500-character chunking is easy to implement and produces chunks of consistent size, but it has significant drawbacks. It can split mid-sentence or mid-paragraph, which breaks semantic coherence, and it can separate related information, such as an engine specification from its description. The approach fails on structured content like tables and whenever important context spans a chunk boundary; the 50-character overlap only helps when that context is short.
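
To make the boundary problem concrete, the sketch below applies the same fixed-window slicing idea to a short sample and flags chunks that end mid-sentence. The sample text is invented for illustration (it is not taken from the Toyota PDFs), `fixed_size_chunks` is a simplified stand-in for `simple_chunk`, and `ends_mid_sentence` is a rough heuristic rather than a real sentence detector.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Simplified fixed-window slicing: same idea as simple_chunk, fixed stride."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


def ends_mid_sentence(chunk):
    """Heuristic: a chunk that does not end in '.', '!' or '?' was cut mid-sentence."""
    return not chunk.rstrip().endswith((".", "!", "?"))


# Hypothetical sample text, invented for illustration only.
sample = (
    "The sedan uses a four-cylinder engine. "
    "It is paired with an automatic transmission. "
    "Front-wheel drive is standard."
)

chunks = fixed_size_chunks(sample, chunk_size=60, overlap=10)
broken = [c for c in chunks if ends_mid_sentence(c)]

for c in chunks:
    marker = "cut" if ends_mid_sentence(c) else "ok "
    print(f"[{marker}] {c!r}")
print(f"{len(broken)} of {len(chunks)} chunks end mid-sentence")
```

Running the same check over `all_chunks` would give a rough measure of how often the 500-character boundary interrupts a sentence in the real documents.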

### Question 2: Retrieval Quality

**Answer:** Specification queries (e.g., "What's the horsepower?") worked best because they target a specific fact that is likely to sit in a single chunk. Feature queries also performed well when the feature description was self-contained. General queries struggled more: they require synthesizing information across multiple sections or documents, and retrieving only the top 3 chunks may miss relevant context.
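
One way to check these observations is to look at what was actually retrieved for each query type. This is a minimal sketch, assuming the `collection` from Part 4 and Chroma's `include` option on `query`; the helper name `show_retrieval` is hypothetical.

```python
def show_retrieval(collection, question, n_results=3):
    """Return (model, distance, preview) rows for the top hits of one query."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    rows = []
    # Each result field holds one list per query; index 0 is our single question.
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        preview = doc[:60].replace("\n", " ")
        rows.append((meta.get("model"), round(dist, 3), preview))
    return rows


# Usage (assumes `collection` exists as built in Part 4):
# for row in show_retrieval(collection, "What's the Camry's horsepower?"):
#     print(row)
```

Comparing the rows for a specification query against those for a general query shows whether the hits come from the expected model and how close the matches are.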

### Question 3: Improvements

**Answer:** The most impactful improvement would be section-based chunking that respects document structure. This keeps complete thoughts and related information together, improving both retrieval accuracy and answer quality. Metadata filtering would also allow more precise retrieval for specific queries: filtering by model works with the metadata already stored, while filtering by section type would require storing a section label with each chunk at indexing time.
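
Both ideas can be sketched briefly. The code below is one possible approach, not the assignment's required implementation: `section_chunk` assumes that section headings appear as standalone ALL-CAPS lines (the real PDFs may need a different pattern), and `query_one_model` is a hypothetical helper that uses Chroma's `where` filter on the `model` metadata stored in Part 4.

```python
import re


def section_chunk(text, heading_pattern=r"^[A-Z][A-Z &/-]{3,}$"):
    """Split text into (heading, body) pairs at lines that look like headings.

    Assumption: headings are standalone ALL-CAPS lines. Text before the
    first heading is collected under the placeholder heading "PREAMBLE".
    """
    heading_re = re.compile(heading_pattern)
    sections = []
    heading, body = "PREAMBLE", []

    def flush():
        # Emit the current section only if it has non-blank content.
        if any(part.strip() for part in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        stripped = line.strip()
        if heading_re.match(stripped):
            flush()
            heading, body = stripped, []
        else:
            body.append(line)
    flush()
    return sections


def query_one_model(collection, question, model, n_results=3):
    """Restrict retrieval to chunks whose 'model' metadata equals `model`."""
    return collection.query(
        query_texts=[question],
        n_results=n_results,
        where={"model": model},
    )
```

Storing each section's heading as a `section` metadata field alongside `model` and `source` would then make section-type filtering possible with the same `where` mechanism.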
