# Text Splitting Test
This notebook tests the DocumentSplitter class, which splits loaded documents into overlapping chunks sized for embedding.

In [1]:
# Setup: Add parent directory to path
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

Project root: c:\Users\kissa\OneDrive\Desktop\research-assistant
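The setup cell assumes the notebook is launched from a directory directly below the project root. A slightly more defensive variant is sketched here: it walks up from the working directory until it finds a folder containing a marker directory. The helper name `find_project_root` and the choice of `src` as the marker are assumptions made for illustration, not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (including `start`) that contains `marker`.

    `marker` is a hypothetical convention: a directory name that only the
    project root is expected to contain.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory
```

With this in place, `find_project_root(Path.cwd())` could replace `Path.cwd().parent`, with a fallback for the case where it returns `None`.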


In [2]:
# Import classes
from src.processing.document_loader import DocumentLoader
from src.processing.text_splitter import DocumentSplitter

In [3]:
# Load the sample PDF (from previous notebook)
pdf_path = project_root / "data" / "samples" / "sample.pdf"

loader = DocumentLoader()
docs = loader.load_pdf(str(pdf_path))
print(f"Loaded {len(docs)} pages from PDF")

Loaded 9 pages from PDF


In [4]:
# Split documents into chunks
splitter = DocumentSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} pages into {len(chunks)} chunks")

Split 9 pages into 18 chunks
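The internals of DocumentSplitter are not shown in this notebook. To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-stride splitter. It is a sketch of the general idea only, not the class's implementation: the function name `split_with_overlap` is hypothetical, and it cuts at exact character offsets, whereas real splitters usually prefer to break at separators such as newlines or spaces.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbours share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same effect the 200-character overlap has at a larger scale.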


In [5]:
# Inspect the first chunk
print("First chunk content:")
print(chunks[0].page_content)
print("\nFirst chunk metadata:")
print(chunks[0].metadata)

First chunk content:
A Brief Introduction to Artificial Intelligence
What is AI and how is it going to shape the future 
By Dibbyo Saha, Undergraduate Student, Computer Science,
Ryerson University
What is Artificial Intelligence?
Image by Gerd Altmann from Pixabay
Generally speaking, Artificial Intelligence is a computing concept that helps a
machine think and solve complex problems as we humans do with our intelligence.
For example, we perform a task, make mistakes and learn from our mistakes (At
least the wise ones of us do!). Likewise, an AI or Artificial Intelligence is supposed
to work on a problem, make some mistakes in solving the problem and learn from
the problems in a self-correcting manner as a part of its self-improvement. Or in
other words, think of this like playing a game of chess. Every bad move you make
reduces your chances of winning the game. So, every time you lose against your
friend, you try remembering the moves you made which you shouldn’t have and
app
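Text extracted from PDFs often carries hard line breaks in the middle of sentences, which consume part of the character budget of each chunk without adding meaning. One possible cleanup step, sketched here with a hypothetical helper name, joins wrapped lines inside a paragraph while keeping blank-line paragraph breaks. It is an assumption that such a step suits this pipeline; neither DocumentLoader nor DocumentSplitter is known to do this.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace inside each paragraph to single spaces.

    Paragraphs are taken to be separated by one or more blank lines;
    those breaks are preserved as a single blank line.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = (" ".join(paragraph.split()) for paragraph in paragraphs)
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


print(normalize_whitespace("Generally\nspeaking,\nArtificial\nIntelligence"))
# Generally speaking, Artificial Intelligence
```

Because normalization changes text length, it would need to run on each page's `page_content` before splitting, not on the finished chunks.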

In [6]:
# Analyze chunk sizes
chunk_lengths = [len(chunk.page_content) for chunk in chunks]

print("Chunk Statistics:")
print(f"  Total chunks: {len(chunks)}")
print(f"  Average size: {sum(chunk_lengths) / len(chunk_lengths):.0f} characters")
print(f"  Min size: {min(chunk_lengths)} characters")
print(f"  Max size: {max(chunk_lengths)} characters")
print("  Chunk size limit: 1000 characters")
print("  Chunk overlap: 200 characters")

Chunk Statistics:
  Total chunks: 18
  Average size: 786 characters
  Min size: 288 characters
  Max size: 999 characters
  Chunk size limit: 1000 characters
  Chunk overlap: 200 characters
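The statistics cell reports the mean, minimum, and maximum. When comparing splitter settings it can also help to see the median and to confirm that no chunk exceeds the configured limit. The helper below is a small sketch using only the standard library; its name and the keys of the returned dictionary are illustrative choices.

```python
from statistics import mean, median
from typing import Dict, List


def summarize_lengths(lengths: List[int], limit: int) -> Dict[str, float]:
    """Summarize chunk lengths and count how many exceed `limit`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "over_limit": sum(1 for n in lengths if n > limit),
    }


# Toy values, not taken from the notebook's data.
print(summarize_lengths([3, 5, 10], limit=8))
# {'count': 3, 'mean': 6, 'median': 5, 'min': 3, 'max': 10, 'over_limit': 1}
```

In this notebook it could be called as `summarize_lengths(chunk_lengths, limit=1000)`.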


In [8]:
# Experiment: Try different chunk sizes
chunk_sizes = [500, 1000, 1500, 2000]

print("Testing different chunk sizes:\n")
for size in chunk_sizes:
    splitter = DocumentSplitter(chunk_size=size, chunk_overlap=200)
    test_chunks = splitter.split_documents(docs)
    avg_size = sum(len(c.page_content) for c in test_chunks) / len(test_chunks)
    print(f"Chunk size {size:4d}: {len(test_chunks):3d} chunks (avg: {avg_size:.0f} chars)")

Testing different chunk sizes:

Chunk size  500:  41 chunks (avg: 455 chars)
Chunk size 1000:  18 chunks (avg: 786 chars)
Chunk size 1500:  12 chunks (avg: 1086 chars)
Chunk size 2000:  10 chunks (avg: 1264 chars)
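The chunk count falls more slowly than the chunk size grows. One plausible reason, assuming `split_documents` splits each page on its own, is that the count cannot drop below the number of pages (9 here). For a single text cut with a fixed stride, the count can be estimated as in the sketch below. The function name is hypothetical, and real counts will differ when the splitter breaks at separators instead of exact offsets.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-stride chunks for one text.

    The first chunk covers `chunk_size` characters; each further chunk
    adds `chunk_size - chunk_overlap` new characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


# Hypothetical 2600-character page with the notebook's settings.
print(estimate_chunk_count(2600, chunk_size=1000, chunk_overlap=200))
# 3
```

Summing this estimate over the pages gives a rough lower bound to compare against the measured counts.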


In [9]:
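# Show the last 100 characters of chunk 1 and the first 100 of chunk 2.
# With a 200-character overlap, these two windows are adjacent pieces of
# the shared text rather than identical strings.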
if len(chunks) >= 2:
    print("Checking overlap between first two chunks:\n")
    chunk1_end = chunks[0].page_content[-100:]
    chunk2_start = chunks[1].page_content[:100]
    
    print("End of chunk 1:")
    print(chunk1_end)
    print("\nStart of chunk 2:")
    print(chunk2_start)
    print("\n(Notice the overlap preserves context)")

Checking overlap between first two chunks:

End of chunk 1:
friend, you try remembering the moves you made which you shouldn’t have and
apply that knowledge in

Start of chunk 2:
bad move you make
reduces your chances of winning the game. So, every time you lose against your
f

(Notice the overlap preserves context)
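Printing the two snippets shows the overlap only by eye. To measure it, one option is to find the longest suffix of one chunk that is also a prefix of the next. The helper below is a sketch with a hypothetical name. It compares exact characters, so it may under-report if the splitter strips whitespace at chunk boundaries.

```python
def overlap_length(left: str, right: str) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    limit = min(len(left), len(right))
    # Try the longest candidate first so the first match is the answer.
    for size in range(limit, 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


print(overlap_length("abcdef", "defghi"))
# 3
```

Applied to this notebook, `overlap_length(chunks[0].page_content, chunks[1].page_content)` should return a value close to the configured 200 characters if the overlap works as intended.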
