# Exercise 8: Chunk Size Experiment

**Hypothesis:** Chunk size is one of the most impactful RAG hyperparameters.
Too small → retrieved chunks lack context. Too large → irrelevant content dilutes the signal.

| Chunk size | Expected behaviour |
|---|---|
| **128** chars | Very precise retrieval, but answers may be incomplete because chunks cut mid-thought |
| **512** chars | Balanced; the Exercise 1 default |
| **2048** chars | Broad context, but retrieval precision drops as irrelevant content creeps in |
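
Chunk size also changes how many vectors the index has to hold. The sketch below estimates that count for a plain fixed-size splitter. The helper name `estimate_chunk_count` and the 400,000-character corpus are assumptions made for illustration, not values from this exercise. A splitter that breaks early at paragraph or sentence boundaries produces shorter chunks, so treat the estimate as a lower bound.

```python
import math


def estimate_chunk_count(corpus_chars: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Lower-bound estimate of chunk count for a fixed-size sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if corpus_chars <= 0:
        return 0
    if corpus_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    return 1 + math.ceil((corpus_chars - chunk_size) / stride)


# Hypothetical corpus size, chosen only to show the scaling.
corpus_chars = 400_000
for size in (128, 512, 2048):
    print(f"chunk_size={size:>4}: at least {estimate_chunk_count(corpus_chars, size):,} chunks")
```

Quadrupling the chunk size cuts the number of vectors to roughly a quarter, which is why embedding time falls as the chunks grow.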

**For each chunk size this notebook will:**
1. Re-chunk the corpus
2. Re-embed and rebuild the FAISS index
3. Run the same 5 queries
4. Record retrieved chunk scores, sources, and the final answer
5. Save all results to CSV for comparison
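
The scores recorded in step 4 are inner products of L2-normalized embeddings, which is the same as cosine similarity. The sketch below reproduces that scoring in pure Python. The helper names and the toy 3-dimensional vectors are assumptions for illustration; the notebook itself uses FAISS and 384-dimensional embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_inner_product(query, vectors, k):
    """Return (index, score) pairs for the k highest normalized inner products."""
    q = l2_normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for chunk embeddings and a query embedding.
chunk_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
query_vector = [2.0, 0.0, 0.0]

for index, score in top_k_by_inner_product(query_vector, chunk_vectors, k=2):
    print(f"chunk {index}: score={score:.3f}")
```

Because every vector is normalized first, vector length has no effect on the ranking; only direction does.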

> **Runtime note:** Re-embedding at every chunk size is slow on CPU.
> Run this on **Colab with a T4 GPU** (Runtime → Change runtime type → T4 GPU).

---
## Setup

In [13]:
try:
    ip = get_ipython()
    ip.run_line_magic('pip', 'install -q torch transformers sentence-transformers faiss-cpu pymupdf accelerate ipyfilechooser pandas')
except NameError:
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q',
        'torch', 'transformers', 'sentence-transformers',
        'faiss-cpu', 'pymupdf', 'accelerate', 'ipyfilechooser', 'pandas'])

In [None]:
import os, time
import torch
import faiss
import numpy as np
import pandas as pd
from pathlib import Path
from dataclasses import dataclass
from typing import List, Tuple
import fitz  # PyMuPDF

os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

# ── Device detection ──────────────────────────────────────────
def get_device():
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        mem  = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f'✔ CUDA GPU: {name} ({mem:.1f} GB)')
        return 'cuda', torch.float16
    elif torch.backends.mps.is_available():
        print('✔ Apple Silicon MPS')
        return 'mps', torch.float32
    else:
        print('⚠ CPU only: this will be slow. Consider using Colab + T4 GPU.')
        return 'cpu', torch.float32

try:
    import google.colab
    ENVIRONMENT = 'colab'
except ImportError:
    ENVIRONMENT = 'local'

DEVICE, DTYPE = get_device()
print(f'Environment: {ENVIRONMENT.upper()} | Device: {DEVICE} | Dtype: {DTYPE}')

---
## Load Documents
Same Google Drive / upload pattern as Exercise 1. Select your corpus folder, then run Cell 2.

In [None]:
# =============================================================================
# CELL 1 - SELECT DOCUMENT SOURCE  (DO NOT CHANGE)
# =============================================================================
USE_GOOGLE_DRIVE = True

DOC_FOLDER = 'documents'
folder_chooser = None

if ENVIRONMENT == 'colab':
    if USE_GOOGLE_DRIVE:
        from google.colab import drive
        print('Mounting Google Drive...')
        drive.mount('/content/drive')
        print('✔ Google Drive mounted\n')
        try:
            from ipyfilechooser import FileChooser
            folder_chooser = FileChooser(
                path='/content/drive/MyDrive',
                title='Select your documents folder in Google Drive',
                show_only_dirs=True, select_default=True)
            print('📁 Select your documents folder below, then run Cell 2:')
            display(folder_chooser)
        except ImportError:
            DOC_FOLDER = '/content/drive/MyDrive/your_documents_folder'
            print(f"Edit DOC_FOLDER: '{DOC_FOLDER}'")
    else:
        from google.colab import files as colab_files
        os.makedirs(DOC_FOLDER, exist_ok=True)
        print('Upload your documents:')
        uploaded = colab_files.upload()
        for fn in uploaded:
            os.rename(fn, f'{DOC_FOLDER}/{fn}')
else:
    try:
        from ipyfilechooser import FileChooser
        folder_chooser = FileChooser(path=str(Path.home()),
            title='Select documents folder', show_only_dirs=True, select_default=True)
        display(folder_chooser)
    except ImportError:
        print(f'Using default folder: {DOC_FOLDER}')

In [None]:
# =============================================================================
# CELL 2 - CONFIRM SELECTION  (DO NOT CHANGE)
# =============================================================================
if folder_chooser is not None and folder_chooser.selected_path:
    DOC_FOLDER = folder_chooser.selected_path
    print(f'✔ Using: {DOC_FOLDER}')
elif folder_chooser is not None:
    print('⚠ No folder selected. Go back, select one, then re-run this cell.')
else:
    print(f'✔ Using: {DOC_FOLDER}')

In [26]:
import fitz  # PyMuPDF
from typing import List, Tuple

def load_text_file(filepath: str) -> str:
    """Load a plain text file."""
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()


def load_pdf_file(filepath: str) -> str:
    """
    Extract text from a PDF with embedded text.

    PyMuPDF reads the text layer directly.
    For scanned PDFs without embedded text, you'd need OCR.
    """
    doc = fitz.open(filepath)
    text_parts = []

    for page_num, page in enumerate(doc):
        text = page.get_text()
        if text.strip():
            # Add page marker for debugging/citation
            text_parts.append(f"\n[Page {page_num + 1}]\n{text}")

    doc.close()
    return "\n".join(text_parts)


def load_documents(doc_folder: str) -> List[Tuple[str, str]]:
    """Load all documents from a folder. Returns list of (filename, content)."""
    documents = []
    folder = Path(doc_folder)

    for filepath in folder.rglob("*"):
        try:
            if not filepath.is_file():
                continue
        except OSError:
            continue
        if filepath.suffix.lower() not in ('.pdf', '.txt', '.md', '.text'):
            continue
        try:
            if filepath.suffix.lower() == '.pdf':
                content = load_pdf_file(str(filepath))
            elif filepath.suffix.lower() in ['.txt', '.md', '.text']:
                content = load_text_file(str(filepath))
            else:
                continue

            if content.strip():
                documents.append((filepath.name, content))
                print(f"✓ Loaded: {filepath.name} ({len(content):,} chars)")
        except Exception as e:
            print(f"✗ Error loading {filepath}: {e}")

    return documents

In [None]:
# Load your documents
documents = load_documents(DOC_FOLDER)
print(f"\nLoaded {len(documents)} documents")

if len(documents) == 0:
    print("\n⚠ No documents loaded! Please add PDF or TXT files to the documents folder.")

In [None]:
# Inspect a document to verify loading worked
if documents:
    filename, content = documents[0]
    print(f"First document: {filename}")
    print(f"Total length: {len(content):,} characters")
    print(f"\nFirst 1000 characters:\n{'-'*40}")
    print(content[:1000])

In [None]:
# def load_documents(folder: str) -> List[Tuple[str, str]]:
#     docs = []
#     folder_path = Path(folder)
#     if not folder_path.exists():
#         print(f'⚠ Folder not found: {folder}')
#         return docs
#     for path in sorted(folder_path.iterdir()):
#         if path.suffix.lower() == '.pdf':
#             doc  = fitz.open(str(path))
#             text = ''.join(f'[Page {i+1}]\n{page.get_text()}\n' for i, page in enumerate(doc))
#             docs.append((path.name, text))
#         elif path.suffix.lower() in ('.txt', '.md'):
#             docs.append((path.name, path.read_text(encoding='utf-8', errors='replace')))
#     print(f'✔ Loaded {len(docs)} documents from {folder}')
#     for name, content in docs:
#         print(f'   {name}: {len(content):,} chars')
#     return docs

# documents = load_documents(DOC_FOLDER)

---
## Define Your 5 Queries
Choose questions that test different retrieval needs:
- A **narrow factual** question (single-sentence answer)
- A **procedural** question (multi-step answer)
- A **broad conceptual** question
- A question whose answer **spans multiple sections**
- A question that is **hard to answer** from the corpus
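
Before running a long experiment it can help to check the query list against the guidance above. The sketch below is one possible check. The helper `check_queries` is hypothetical, and it assumes each query is a dict with `id`, `type`, and `text` keys. A repeated type is reported only as a hint, since covering the same type twice may be deliberate.

```python
from collections import Counter

REQUIRED_KEYS = {"id", "type", "text"}


def check_queries(queries):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    id_counts = Counter(q.get("id") for q in queries)
    for qid, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {qid}")
    for q in queries:
        missing = REQUIRED_KEYS - set(q)
        if missing:
            problems.append(f"{q.get('id', '?')}: missing keys {sorted(missing)}")
    type_counts = Counter(q.get("type") for q in queries if "type" in q)
    for qtype, count in type_counts.items():
        if count > 1:
            problems.append(f"hint: type '{qtype}' is used {count} times")
    return problems


# Illustrative input with one repeated type.
example = [
    {"id": "Q1", "type": "narrow_factual", "text": "Example question one?"},
    {"id": "Q2", "type": "procedural", "text": "Example question two?"},
    {"id": "Q3", "type": "procedural", "text": "Example question three?"},
]
for problem in check_queries(example):
    print(problem)
```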

In [None]:
# =============================================================================
# CONFIGURE YOUR QUERIES HERE
# =============================================================================
QUERIES = [
    {
        'id'  : 'Q1',
        'type': 'narrow_factual',
        'text': 'What is the correct spark plug gap for a Model T Ford?'
    },
    {
        'id'  : 'Q2',
        'type': 'procedural',
        'text': 'How do I fix a slipping transmission band?'
    },
    {
        'id'  : 'Q3',
        'type': 'procedural',
        'text': 'How do I adjust the carburetor on a Model T?'
    },
    {
        'id'  : 'Q4',
        'type': 'broad_conceptual',
        'text': 'What oil should I use in a Model T engine?'
    },
    {
        'id'  : 'Q5',
        'type': 'multi_section',
        'text': 'What are all the steps to prepare a Model T for winter driving?'
    },
]

TOP_K      = 5    # chunks retrieved per query
CHUNK_OVERLAP = 0   # fixed overlap = 0 so chunk SIZE is the only variable

print(f'✔ {len(QUERIES)} queries defined')
print(f'   TOP_K={TOP_K} | CHUNK_OVERLAP={CHUNK_OVERLAP} (fixed)')
for q in QUERIES:
    print(f"   [{q['id']}] ({q['type']}) {q['text']}")

---
## Load Embedding Model and LLM
Loaded once and reused across all chunk-size experiments.

In [None]:
from sentence_transformers import SentenceTransformer

EMBED_MODEL_NAME = 'all-MiniLM-L6-v2'
EMBEDDING_DIM    = 384

print(f'Loading embedding model: {EMBED_MODEL_NAME} ...')
embed_model = SentenceTransformer(EMBED_MODEL_NAME, device=DEVICE)
print('✔ Embedding model ready')

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

LLM_MODEL = 'Qwen/Qwen2.5-1.5B-Instruct'
print(f'Loading LLM: {LLM_MODEL} ...')
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)

if DEVICE == 'cuda':
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL, device_map='auto', dtype=DTYPE, trust_remote_code=True)
elif DEVICE == 'mps':
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL, dtype=DTYPE, trust_remote_code=True).to(DEVICE)
else:
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL, dtype=DTYPE, trust_remote_code=True)

print(f'✔ LLM loaded on {DEVICE}')

---
## Pipeline Functions

In [None]:
@dataclass
class Chunk:
    text: str
    source_file: str
    chunk_index: int
    start_char: int
    end_char: int


def chunk_text(text: str, source_file: str,
               chunk_size: int, chunk_overlap: int) -> List[Chunk]:
    """Split text into overlapping chunks, breaking at paragraph/sentence boundaries."""
    chunks, start, idx = [], 0, 0
    while start < len(text):
        end = start + chunk_size
        if end < len(text):
            pb = text.rfind('\n\n', start + chunk_size // 2, end)
            if pb != -1:
                end = pb + 2
            else:
                sb = text.rfind('. ', start + chunk_size // 2, end)
                if sb != -1:
                    end = sb + 2
        s = text[start:end].strip()
        if s:
            chunks.append(Chunk(s, source_file, idx, start, end))
            idx += 1
        prev = start
        start = end - chunk_overlap
        if start <= prev:
            start = end   # safety: always make progress
    return chunks


def build_pipeline(chunk_size: int, chunk_overlap: int) -> Tuple[List[Chunk], faiss.IndexFlatIP]:
    """Re-chunk all documents, re-embed, and rebuild the FAISS index."""
    # 1. Chunk
    all_chunks = []
    for filename, content in documents:
        all_chunks.extend(chunk_text(content, filename, chunk_size, chunk_overlap))

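    if not all_chunks:  # avoid a division by zero below when no documents were loaded
        raise ValueError('No chunks produced; load documents before building the index.')
    char_counts = [len(c.text) for c in all_chunks]
    print(f'  Chunks: {len(all_chunks):,}  |  '
          f'avg {sum(char_counts)/len(char_counts):.0f} chars  |  '
          f'min {min(char_counts)}  max {max(char_counts)}')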

    # 2. Embed
    t0 = time.time()
    embeddings = embed_model.encode(
        [c.text for c in all_chunks],
        show_progress_bar=True,
        batch_size=64
    ).astype('float32')
    print(f'  Embedded in {time.time()-t0:.1f}s')

    # 3. FAISS index
    idx = faiss.IndexFlatIP(EMBEDDING_DIM)
    faiss.normalize_L2(embeddings)
    idx.add(embeddings)
    print(f'  FAISS index: {idx.ntotal:,} vectors')

    return all_chunks, idx

# def rebuild_pipeline(chunk_size: int = 512, chunk_overlap: int = 128):
#     """Re-chunk documents, re-embed, and rebuild FAISS index. Updates global all_chunks and index."""
#     global all_chunks, index
#     all_chunks = []
#     for filename, content in documents:
#         all_chunks.extend(chunk_text(content, filename, chunk_size=chunk_size, chunk_overlap=chunk_overlap))
#     chunk_embeddings = embed_model.encode([c.text for c in all_chunks], show_progress_bar=True).astype("float32")
#     faiss.normalize_L2(chunk_embeddings)
#     index = faiss.IndexFlatIP(EMBEDDING_DIM)
#     index.add(chunk_embeddings)
#     print(f"Rebuilt: {len(all_chunks)} chunks, chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")

PROMPT_TEMPLATE = (
    'You are a helpful assistant. Answer the question using ONLY the context below.\n'
    'If the context does not contain enough information, say so.\n\n'
    'CONTEXT:\n{context}\n\n'
    'QUESTION: {question}\n\n'
    'ANSWER:'
)


def generate_response(prompt: str, max_new_tokens: int = 300) -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding; sampling flags such as temperature do not apply
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(
        out[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    ).strip()


def run_rag_query(question: str, all_chunks: List[Chunk],
                  faiss_index: faiss.IndexFlatIP,
                  top_k: int = 5) -> dict:
    """
    Run a full RAG query and return a dict with retrieval stats and the final answer.
    """
    # Retrieve
    q_emb = embed_model.encode([question]).astype('float32')
    faiss.normalize_L2(q_emb)
    scores, indices = faiss_index.search(q_emb, top_k)

    retrieved = [
        (all_chunks[i], float(s))
        for s, i in zip(scores[0], indices[0]) if i != -1
    ]

    # Retrieval stats
    score_list    = [s for _, s in retrieved]
    top_score     = max(score_list) if score_list else 0.0
    avg_score     = sum(score_list) / len(score_list) if score_list else 0.0
    score_spread  = max(score_list) - min(score_list) if len(score_list) > 1 else 0.0
    unique_sources = len({c.source_file for c, _ in retrieved})

    # Build context
    context = '\n\n---\n\n'.join(
        f'[Source: {c.source_file} | Score: {s:.3f} | Chars: {len(c.text)}]\n{c.text}'
        for c, s in retrieved
    )

    # Generate answer
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    answer = generate_response(prompt)

    return {
        'answer'            : answer,
        'top_score'         : round(top_score, 4),
        'avg_score'         : round(avg_score, 4),
        'score_spread'      : round(score_spread, 4),
        'unique_sources'    : unique_sources,
        'retrieved_chunks'  : [
            {
                'source'   : c.source_file,
                'score'    : round(s, 4),
                'char_len' : len(c.text),
                'preview'  : c.text[:120].replace('\n', ' ')
            }
            for c, s in retrieved
        ]
    }

print('✔ Pipeline functions defined')

---
## Run the Chunk Size Experiment

For each of the three chunk sizes, this cell:
- Rebuilds chunks + FAISS index
- Runs all 5 queries
- Prints a live summary
- Appends results to a master list

**Expected runtime on T4 GPU:** ~5–10 min total
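
Each row in the master list carries the top retrieved chunks as flat columns so that the CSV stays rectangular. The sketch below shows one way to build those columns with a loop. The helper name `flatten_top_chunks` is hypothetical, and it assumes each retrieved chunk is a dict with `source`, `score`, `char_len`, and `preview` keys.

```python
def flatten_top_chunks(retrieved_chunks, n=3):
    """Spread the first n retrieved chunks across chunk1_*, chunk2_*, ... columns."""
    # Maps the CSV column suffix to the key in each retrieved-chunk dict.
    fields = {"source": "source", "score": "score", "len": "char_len", "preview": "preview"}
    row = {}
    for rank in range(1, n + 1):
        # Fall back to an empty dict when fewer than n chunks were retrieved.
        chunk = retrieved_chunks[rank - 1] if rank <= len(retrieved_chunks) else {}
        for column_suffix, key in fields.items():
            row[f"chunk{rank}_{column_suffix}"] = chunk.get(key, "")
    return row


# Illustrative input: a single retrieved chunk with placeholder values.
example_chunks = [
    {"source": "manual.pdf", "score": 0.5, "char_len": 120, "preview": "placeholder text"},
]
for column, value in flatten_top_chunks(example_chunks, n=2).items():
    print(f"{column}: {value!r}")
```

Missing ranks are filled with empty strings, so every row has the same columns no matter how many chunks came back.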

In [33]:
# =============================================================================
# CHUNK SIZES TO TEST  (modify if you want more configurations)
# =============================================================================
CHUNK_SIZES = [128, 512, 2048]

all_results = []   # master list: one row per (chunk_size × query)

for chunk_size in CHUNK_SIZES:
    print(f'\n{"="*65}')
    print(f'CHUNK SIZE = {chunk_size} chars  (overlap={CHUNK_OVERLAP})')
    print(f'{"="*65}')

    # 1. Rebuild index
    chunks, idx = build_pipeline(chunk_size, CHUNK_OVERLAP)
    n_chunks     = len(chunks)

    # 2. Run all queries
    for q in QUERIES:
        print(f"\n  [{q['id']}] {q['text']}")

        result = run_rag_query(q['text'], chunks, idx, top_k=TOP_K)

        print(f"    top_score={result['top_score']:.4f}  "
              f"avg_score={result['avg_score']:.4f}  "
              f"sources={result['unique_sources']}")
        print(f"    Answer preview: {result['answer'][:200]}...")

        all_results.append({
            'chunk_size'      : chunk_size,
            'chunk_overlap'   : CHUNK_OVERLAP,
            'total_chunks'    : n_chunks,
            'query_id'        : q['id'],
            'query_type'      : q['type'],
            'question'        : q['text'],
            'top_score'       : result['top_score'],
            'avg_score'       : result['avg_score'],
            'score_spread'    : result['score_spread'],
            'unique_sources'  : result['unique_sources'],
            'answer'          : result['answer'],
            # Flatten top-3 retrieved chunks for the CSV
            'chunk1_source'   : result['retrieved_chunks'][0]['source']  if len(result['retrieved_chunks']) > 0 else '',
            'chunk1_score'    : result['retrieved_chunks'][0]['score']   if len(result['retrieved_chunks']) > 0 else '',
            'chunk1_len'      : result['retrieved_chunks'][0]['char_len']if len(result['retrieved_chunks']) > 0 else '',
            'chunk1_preview'  : result['retrieved_chunks'][0]['preview'] if len(result['retrieved_chunks']) > 0 else '',
            'chunk2_source'   : result['retrieved_chunks'][1]['source']  if len(result['retrieved_chunks']) > 1 else '',
            'chunk2_score'    : result['retrieved_chunks'][1]['score']   if len(result['retrieved_chunks']) > 1 else '',
            'chunk2_len'      : result['retrieved_chunks'][1]['char_len']if len(result['retrieved_chunks']) > 1 else '',
            'chunk2_preview'  : result['retrieved_chunks'][1]['preview'] if len(result['retrieved_chunks']) > 1 else '',
            'chunk3_source'   : result['retrieved_chunks'][2]['source']  if len(result['retrieved_chunks']) > 2 else '',
            'chunk3_score'    : result['retrieved_chunks'][2]['score']   if len(result['retrieved_chunks']) > 2 else '',
            'chunk3_len'      : result['retrieved_chunks'][2]['char_len']if len(result['retrieved_chunks']) > 2 else '',
            'chunk3_preview'  : result['retrieved_chunks'][2]['preview'] if len(result['retrieved_chunks']) > 2 else '',
        })

    # Save after each chunk size so results survive if the next one crashes
    df_partial = pd.DataFrame(all_results)
    df_partial.to_csv('exercise8_chunk_size_results.csv', index=False)
    print(f'\n  ✔ Progress saved → exercise8_chunk_size_results.csv  '
          f'({len(all_results)} rows so far)')

print(f'\n{"="*65}')
print('ALL CHUNK SIZES COMPLETE')
print(f'{"="*65}')


CHUNK SIZE = 128 chars  (overlap=0)
  Chunks: 3,353  |  avg 113 chars  |  min 11  max 128


Batches:   0%|          | 0/53 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


  Embedded in 2.4s
  FAISS index: 3,353 vectors

  [Q1] What is the correct spark plug gap for a Model T Ford?
    top_score=0.6039  avg_score=0.5844  sources=5
    Answer preview: The correct spark plug gap for a Model T Ford is approximately 0.03 inches (0.76 mm). This value was determined by analyzing multiple sources including the original text and OCR scans of historical do...

  [Q2] How do I fix a slipping transmission band?
    top_score=0.5198  avg_score=0.5088  sources=3
    Answer preview: The slow speed band may be tightened by loosening the lock nut at the tight side of the transmission cover. If that doesn't work, you can try tightening the adjusting nuts on the shafts to the right. ...

  [Q3] How do I adjust the carburetor on a Model T?
    top_score=0.6446  avg_score=0.6297  sources=4
    Answer preview: To adjust the carburetor on a Model T, you should turn it to the right as far as possible without reducing speed. This will ensure that the gasoline is mixed with air 

Batches:   0%|          | 0/14 [00:00<?, ?it/s]

  Embedded in 1.3s
  FAISS index: 888 vectors

  [Q1] What is the correct spark plug gap for a Model T Ford?
    top_score=0.5627  avg_score=0.5355  sources=4
    Answer preview: The correct spark plug gap for a Model T Ford is 7/8 inch or approximately 0.875 inches. This value was mentioned in multiple sources including the ModelT-11-20-ocr.pdf, Ford-Model-T-Man-1919-ocr.pdf,...

  [Q2] How do I fix a slipping transmission band?
    top_score=0.5174  avg_score=0.5110  sources=4
    Answer preview: Loosen the lock nut at the tight side of the transmission cover, then adjust the screw until it's tight. Remove the transmission cover, and turn the adjusting nuts on the shafts to the right. Ensure t...

  [Q3] How do I adjust the carburetor on a Model T?
    top_score=0.6096  avg_score=0.5799  sources=3
    Answer preview: To adjust the carburetor on a Model T, you need to advance the throttle lever to the sixth notch while retarding the spark about the fourth notch. Cut off the gasoline f

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

  Embedded in 0.8s
  FAISS index: 220 vectors

  [Q1] What is the correct spark plug gap for a Model T Ford?
    top_score=0.5351  avg_score=0.5129  sources=3
    Answer preview: The correct spark plug gap for a Model T Ford is 7/16 inch, about the thickness of a smooth dime.

This answer is derived directly from the text provided, specifically from the passage discussing how ...

  [Q2] How do I fix a slipping transmission band?
    top_score=0.4549  avg_score=0.4307  sources=4
    Answer preview: If the transmission bands are slipping, loosen the lock nut on the transmission cover and adjust the tightening screw. Ensure that the bands do not drag the drums when disengaging, as they can cause o...

  [Q3] How do I adjust the carburetor on a Model T?
    top_score=0.5348  avg_score=0.5064  sources=5
    Answer preview: For the convenience of the driver in adjusting the carburetor. After the new car has become thoroughly worked in, the driver should observe the angle ofthe carburetor ad

---
## Results: Side-by-Side Comparison

In [None]:
df = pd.DataFrame(all_results)
pd.set_option('display.max_colwidth', 80)

# ── 1. Retrieval quality per chunk size ───────────────────────
print('=== RETRIEVAL QUALITY BY CHUNK SIZE ===')
retrieval_summary = (
    df.groupby('chunk_size')[['top_score', 'avg_score', 'score_spread', 'unique_sources']]
      .mean()
      .round(4)
)
retrieval_summary['total_chunks'] = df.groupby('chunk_size')['total_chunks'].first()
display(retrieval_summary)

# ── 2. Per-query, per-chunk-size answer comparison ────────────
print('\n=== ANSWERS BY QUERY × CHUNK SIZE ===')
for q in QUERIES:
    print(f"\n{'─'*65}")
    print(f"[{q['id']}] ({q['type']})")
    print(q['text'])
    print(f"{'─'*65}")
    sub = df[df['query_id'] == q['id']].sort_values('chunk_size')
    for _, row in sub.iterrows():
        print(f"\n  chunk_size={row['chunk_size']:>4}  "
              f"top_score={row['top_score']:.4f}  "
              f"avg_score={row['avg_score']:.4f}  "
              f"total_chunks={row['total_chunks']:,}")
        print(f"  Answer: {row['answer'][:300]}")

In [None]:
# ── 3. Score heatmap: chunk_size × query_id ───────────────────
try:
    import matplotlib.pyplot as plt
    import matplotlib

    pivot_top  = df.pivot_table(index='query_id', columns='chunk_size', values='top_score')
    pivot_avg  = df.pivot_table(index='query_id', columns='chunk_size', values='avg_score')

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    for ax, pivot, title in zip(
            axes,
            [pivot_top, pivot_avg],
            ['Top-1 Retrieval Score', 'Avg Retrieval Score (top-K)']):
        im = ax.imshow(pivot.values, cmap='YlOrRd', aspect='auto',
                       vmin=0, vmax=pivot.values.max())
        ax.set_xticks(range(len(pivot.columns)))
        ax.set_xticklabels([f'{c}' for c in pivot.columns])
        ax.set_yticks(range(len(pivot.index)))
        ax.set_yticklabels(pivot.index)
        ax.set_xlabel('Chunk Size (chars)')
        ax.set_ylabel('Query')
        ax.set_title(title)
        plt.colorbar(im, ax=ax)
        # Annotate cells
        for i in range(len(pivot.index)):
            for j in range(len(pivot.columns)):
                ax.text(j, i, f"{pivot.values[i, j]:.3f}",
                        ha='center', va='center', fontsize=9,
                        color='black' if pivot.values[i, j] < pivot.values.max()*0.7 else 'white')

    plt.suptitle('Retrieval Score Heatmap: Query × Chunk Size', fontsize=13, y=1.02)
    plt.tight_layout()
    plt.savefig('exercise8_score_heatmap.png', dpi=150, bbox_inches='tight')
    plt.show()
    print('✔ Saved: exercise8_score_heatmap.png')
except ImportError:
    print('matplotlib not available, skipping heatmap')

In [None]:
# ── 4. Bar chart: total chunks produced at each size ──────────
try:
    chunk_counts = df.drop_duplicates('chunk_size').set_index('chunk_size')['total_chunks']

    fig, ax = plt.subplots(figsize=(7, 4))
    bars = ax.bar([str(s) for s in chunk_counts.index], chunk_counts.values,
                  color=['#4e79a7', '#f28e2b', '#e15759'])
    for bar, val in zip(bars, chunk_counts.values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
                f'{val:,}', ha='center', va='bottom', fontsize=10)
    ax.set_xlabel('Chunk Size (chars)')
    ax.set_ylabel('Number of Chunks')
    ax.set_title('Total Chunks Produced per Chunk Size')
    plt.tight_layout()
    plt.savefig('exercise8_chunk_counts.png', dpi=150)
    plt.show()
    print('✔ Saved: exercise8_chunk_counts.png')
except Exception as e:
    print(f'Chart skipped: {e}')

---
## Documentation Questions

Fill in your observations after reviewing the outputs above.

### 1. How does chunk size affect retrieval precision?
*Which chunk size produced the highest top-1 scores? Did small chunks retrieve very precise but context-poor text? Did large chunks retrieve broad passages with mixed relevance?*

---

### 2. How does chunk size affect answer completeness?
*Which queries suffered most at chunk_size=128? Were procedural questions (multi-step answers) harder to answer with small chunks? Did large chunks help or hurt?*

---

### 3. Is there a sweet spot for your corpus?
*Look at the retrieval summary table. Which chunk size gave the best balance of top_score, avg_score, and answer quality?*

---

### 4. Does optimal chunk size depend on question type?
*Compare narrow_factual vs. procedural vs. multi_section queries across chunk sizes. Does a different size win for different question types?*
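
One way to approach question 4 is to average a score for every (query type, chunk size) pair and see which size comes out on top for each type. The sketch below does this on a list of dicts shaped like the rows this notebook collects. The helper names are hypothetical and the sample rows are made-up placeholders, not results from the experiment. A higher similarity score does not by itself mean a better answer, so use the output as a pointer to which answers to read first.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_type_and_size(rows, score_key="top_score"):
    """Average a score column for every (query_type, chunk_size) pair."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["query_type"], row["chunk_size"])].append(row[score_key])
    return {key: round(mean(values), 4) for key, values in buckets.items()}


def best_size_per_type(rows, score_key="top_score"):
    """For each query type, return the (chunk_size, mean_score) with the highest mean."""
    best = {}
    for (qtype, size), score in mean_score_by_type_and_size(rows, score_key).items():
        if qtype not in best or score > best[qtype][1]:
            best[qtype] = (size, score)
    return best


# Made-up placeholder rows, for illustration only.
sample_rows = [
    {"query_type": "narrow_factual", "chunk_size": 128, "top_score": 0.60},
    {"query_type": "narrow_factual", "chunk_size": 512, "top_score": 0.56},
    {"query_type": "procedural", "chunk_size": 128, "top_score": 0.50},
    {"query_type": "procedural", "chunk_size": 128, "top_score": 0.60},
    {"query_type": "procedural", "chunk_size": 512, "top_score": 0.58},
]
for qtype, (size, score) in best_size_per_type(sample_rows).items():
    print(f"{qtype}: best chunk_size={size} (mean top_score={score})")
```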

In [None]:
# Download all output files in Colab
try:
    from google.colab import files as colab_files
    for fname in [
        'exercise8_chunk_size_results.csv',
        'exercise8_score_heatmap.png',
        'exercise8_chunk_counts.png',
    ]:
        if os.path.exists(fname):
            colab_files.download(fname)
            print(f'⬇ Downloading: {fname}')
        else:
            print(f'⚠ Not found (skipping): {fname}')
except ImportError:
    print('Files saved locally:')
    for fname in [
        'exercise8_chunk_size_results.csv',
        'exercise8_score_heatmap.png',
        'exercise8_chunk_counts.png',
    ]:
        print(f'  {fname}')