# PDF Loading & Chunking Experiment

**Goal:** Learn how to load PDFs and split them into chunks for a RAG (retrieval-augmented generation) system.

**What we'll do:**
1. Load a single PDF
2. Inspect its structure
3. Split into chunks
4. Examine chunk metadata

## Setup: Import Libraries

First, import the necessary libraries.

In [1]:
# TODO: Import these libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
# Your code here:


## Step 1: Load a Single PDF

Let's start with the smallest PDF: `On-board Diagnostics.pdf` (1.5 MB).

**What `PyPDFLoader` does:**
- Reads the PDF file
- Extracts the text of each page
- Returns a list of `Document` objects (one per page)

**Each Document has:**
- `page_content`: The text
- `metadata`: A dict with details such as the source file path and the page number

In [2]:
# TODO: Load the PDF
# 1. Define the path to the PDF
pdf_path = "../data/automotive/On-board Diagnostics.pdf"

# 2. Create a PyPDFLoader instance
loader = PyPDFLoader(pdf_path)

# 3. Load the pages
pages = loader.load()

# 4. Print how many pages loaded
print(f"Loaded {len(pages)} pages")

# Your code here:


Loaded 20 pages
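
A check worth running right after loading: pages that are scanned images usually have no text layer, so their `page_content` comes back empty or nearly empty. The sketch below is one way to spot them. The helper name `find_sparse_pages` and the 20-character threshold are assumptions made for illustration, and the `SimpleNamespace` objects only stand in for the real `Document` objects in `pages`.

```python
from types import SimpleNamespace


def find_sparse_pages(pages, min_chars=20):
    """Return the page indices whose extracted text is shorter than min_chars."""
    sparse = []
    for doc in pages:
        text = doc.page_content.strip()
        if len(text) < min_chars:
            # .get() yields None if the loader did not record a page index.
            sparse.append(doc.metadata.get("page"))
    return sparse


# Stand-in pages; in the notebook, pass the real `pages` list instead.
demo_pages = [
    SimpleNamespace(page_content="On-board diagnostics (OBD) is a term ...", metadata={"page": 0}),
    SimpleNamespace(page_content="   \n", metadata={"page": 1}),
]
print(find_sparse_pages(demo_pages))
```

If this returns any indices for a real PDF, those pages probably need OCR before they can contribute to retrieval.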


## Step 2: Inspect the First Page

Let's see what a single page looks like.

In [3]:
# TODO: Examine first page
# 1. Get the first page
first_page = pages[0]

# 2. Print the metadata
print("Metadata:")
print(first_page.metadata)

# 3. Print first 500 characters of content
print("\nContent preview:")
print(first_page.page_content[:500])

# 4. Print total character count
print(f"\nTotal characters: {len(first_page.page_content)}")

# Your code here:


Metadata:
{'source': '../data/automotive/On-board Diagnostics.pdf', 'page': 0, 'page_label': '1'}

Content preview:
Various views of a "MaxScan OE509" – a
fairly typical onboard diagnostics (OBD)
scanner, 2015.
On-board diagnostics
On-board diagnostics (OBD) is a term referring to a
vehicle's self-diagnostic and reporting capability. In the
United States, this capability is a requirement to comply
with federal emissions standards to detect failures that
may increase the vehicle tailpipe emissions to more than
150% of the standard to which it was originally
certified.[1][2]
OBD systems give the vehicle owner o

Total characters: 3118
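
In the metadata printed above, `page` looks like a zero-based index, while `page_label` holds the page label as a string. When a RAG answer needs to cite its source, a small formatter can turn that dict into a readable label. This is a minimal sketch; `source_label` is a hypothetical helper, not part of LangChain.

```python
from pathlib import Path


def source_label(metadata):
    """Build a short, human-readable citation from a page's metadata dict."""
    file_name = Path(metadata["source"]).name
    # Prefer the page label if present; otherwise convert the zero-based index.
    page = metadata.get("page_label") or str(metadata["page"] + 1)
    return f"{file_name}, p. {page}"


# Metadata copied from the output above.
meta = {"source": "../data/automotive/On-board Diagnostics.pdf", "page": 0, "page_label": "1"}
print(source_label(meta))
```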


## Step 3: Split into Chunks

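**Why chunk?**
- LLMs have context-window limits (for example, about 4,096 tokens for the original GPT-3.5 Turbo)
- A full PDF page can be long (the first page here has 3,118 characters) and may cover several topics
- Smaller chunks tend to make retrieval more precise, because each chunk covers a narrower piece of text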

**RecursiveCharacterTextSplitter:**
- `chunk_size`: Maximum number of characters per chunk (1000 characters ≈ 150-200 English words)
- `chunk_overlap`: Number of characters shared by consecutive chunks, so that context near a chunk boundary is not lost
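
To see what `chunk_overlap` means in isolation, here is a deliberately naive fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` picks its boundaries (that splitter prefers natural separators). The function name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return windows


demo_windows = fixed_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for window in demo_windows:
    print(window)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one window are the first 4 of the next.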

**How it works:**
1. Tries to split at paragraph breaks (`\n\n`)
2. If still too big, tries newlines (`\n`)
3. If still too big, tries spaces
4. Last resort: splits at exact character count
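
The four steps above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it ignores `chunk_overlap`, and it will not reproduce exactly how `RecursiveCharacterTextSplitter` merges pieces. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse only when a piece is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cuts at exact character counts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long, so retry it with the next finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer.\nIt has two lines.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

With a 40-character limit, the second paragraph does not fit as a whole, so only that paragraph falls back to splitting at the single newline. The other paragraphs stay intact.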

In [4]:
# TODO: Create text splitter and split documents
# 1. Create the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# 2. Split all pages into chunks
chunks = text_splitter.split_documents(pages)

# 3. Print how many chunks created
print(f"Split {len(pages)} pages into {len(chunks)} chunks")

# Your code here:


Split 20 pages into 72 chunks


## Step 4: Examine a Few Chunks

Let's see what the chunks look like.

In [5]:
# TODO: Print first 3 chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\n{'='*60}")
    print(f"Chunk {i+1}")
    print(f"{'='*60}")
    print(f"Metadata: {chunk.metadata}")
    print(f"Length: {len(chunk.page_content)} characters")
    print(f"\nContent:\n{chunk.page_content}")

# Your code here:



Chunk 1
Metadata: {'source': '../data/automotive/On-board Diagnostics.pdf', 'page': 0, 'page_label': '1'}
Length: 996 characters

Content:
Various views of a "MaxScan OE509" – a
fairly typical onboard diagnostics (OBD)
scanner, 2015.
On-board diagnostics
On-board diagnostics (OBD) is a term referring to a
vehicle's self-diagnostic and reporting capability. In the
United States, this capability is a requirement to comply
with federal emissions standards to detect failures that
may increase the vehicle tailpipe emissions to more than
150% of the standard to which it was originally
certified.[1][2]
OBD systems give the vehicle owner or repair technician
access to the status of the various vehicle sub-systems. The
amount of diagnostic information available via OBD has
varied widely since its introduction in the early 1980s
versions of onboard vehicle computers. Early versions of
OBD would simply illuminate a tell-tale light if a problem
was detected, but would not provide any information 
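
The chunk above carries the same metadata as the page it came from, which makes it possible to count how many chunks each page produced. This is a minimal sketch: `chunks_per_page` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the real chunk `Document` objects.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_page(chunks):
    """Count how many chunks came from each source page (keyed by metadata['page'])."""
    return Counter(chunk.metadata["page"] for chunk in chunks)


# Stand-in chunks; in the notebook, pass the real `chunks` list instead.
stand_in_chunks = [
    SimpleNamespace(page_content="...", metadata={"page": p}) for p in (0, 0, 0, 0, 1, 1)
]
for page, count in sorted(chunks_per_page(stand_in_chunks).items()):
    print(f"page {page}: {count} chunks")
```

Pages with unusually many or few chunks are good candidates for a closer look at the extracted text.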

## Step 5: Test Different Chunk Sizes

Let's experiment with different chunk sizes to see the impact.

**Trade-offs:**
- **Larger chunks (2000+):** More context per chunk and fewer chunks, but several retrieved chunks together may exceed the model's token limit
- **Smaller chunks (500):** More precise retrieval, but a single chunk may lack the surrounding context
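
The token-limit side of this trade-off can be estimated with simple arithmetic. The sketch below assumes roughly 4 characters per token (a common rule of thumb for English text; real tokenizers vary), a 4,096-token context window, and 1,000 tokens reserved for the question and the answer. All three numbers, and the name `chunks_that_fit`, are illustrative assumptions.

```python
def chunks_that_fit(chunk_size, context_tokens=4096, reserved_tokens=1000, chars_per_token=4):
    """Estimate how many chunks of chunk_size characters fit in the prompt budget."""
    tokens_per_chunk = -(-chunk_size // chars_per_token)  # ceiling division
    return (context_tokens - reserved_tokens) // tokens_per_chunk


for size in (500, 1000, 2000):
    print(f"chunk_size={size}: about {chunks_that_fit(size)} chunks fit")
```

Under these assumptions, doubling the chunk size halves the number of retrieved chunks that can be placed in one prompt.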

In [6]:
# TODO: Test different chunk sizes
chunk_sizes = [500, 1000, 2000]

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=int(size * 0.2)  # 20% overlap
    )
    test_chunks = splitter.split_documents(pages)
    print(f"Chunk size {size}: {len(test_chunks)} chunks created")

# Your code here:


Chunk size 500: 135 chunks created
Chunk size 1000: 72 chunks created
Chunk size 2000: 38 chunks created


## Step 6: Load All PDFs from Directory

Now let's load ALL PDFs from the automotive folder.

In [7]:
# TODO: Load all PDFs
# 1. Get all PDF files
pdf_dir = Path("../data/automotive")
pdf_files = list(pdf_dir.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")

# 2. Create the splitter once and reuse it for every file
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# 3. Load and chunk each PDF
# (loop-local names keep `pages` and `chunks` from earlier cells intact)
all_chunks = []
for pdf_file in pdf_files:
    print(f"\nLoading: {pdf_file.name}")
    file_pages = PyPDFLoader(str(pdf_file)).load()
    file_chunks = text_splitter.split_documents(file_pages)
    all_chunks.extend(file_chunks)
    print(f"  - {len(file_pages)} pages → {len(file_chunks)} chunks")

# 4. Print total
print(f"\nTotal chunks: {len(all_chunks)}")

# Your code here:


Found 3 PDF files

Loading: automotive_infotainment.pdf
  - 62 pages → 136 chunks

Loading: On-board Diagnostics.pdf
  - 20 pages → 72 chunks

Loading: CAN.pdf
  - 167 pages → 427 chunks

Total chunks: 635


## Summary

**What you learned:**
1. How to load PDFs with `PyPDFLoader`
2. How to split documents into chunks with `RecursiveCharacterTextSplitter`
3. How chunk size affects the number of chunks
4. How to process multiple PDFs from a directory

**Next step:** Move this logic to `src/pdf_loader.py` as a reusable function!
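
One possible shape for that function is sketched below. The name `load_and_chunk_directory` and its signature are assumptions, not an existing API. The loader class and the splitter are passed in as arguments, so the function can be exercised with simple stand-ins before it is wired to real PDFs. Unlike the notebook cell, this sketch sorts the file list, so the files may be processed in a different order than the output above.

```python
from pathlib import Path


def load_and_chunk_directory(pdf_dir, loader_cls, splitter):
    """Load every *.pdf in pdf_dir and return all chunks in one list.

    loader_cls: called as loader_cls(path_string); the result must offer .load().
    splitter: any object that offers .split_documents(pages).
    """
    all_chunks = []
    # Sort the paths so the processing order is the same on every run.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        pages = loader_cls(str(pdf_path)).load()
        all_chunks.extend(splitter.split_documents(pages))
    return all_chunks
```

With the imports from the setup cell, the call would look like `load_and_chunk_directory("../data/automotive", PyPDFLoader, RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200))`.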