# Full Pipeline Test

**Goal**: Test the complete document processing pipeline with a real PDF.

This notebook tests the full integration:
1. Load PDF documents
2. Split into chunks
3. Generate embeddings
4. Store in ChromaDB
5. Perform similarity search

## Setup

In [1]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

from src.processing.document_loader import DocumentLoader
from src.processing.text_splitter import DocumentSplitter
from src.processing.embeddings import EmbeddingsGenerator
from src.vectorstore.chroma_store import ChromaVectorStore

## Test 1: Initialize All Components

Create instances of all pipeline components

In [2]:
print("Initializing pipeline components...\n")

# Document loader
loader = DocumentLoader()
print("✓ DocumentLoader initialized")

# Text splitter
splitter = DocumentSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
print("✓ DocumentSplitter initialized")
print(f"  Chunk size: {splitter.chunk_size}")
print(f"  Chunk overlap: {splitter.chunk_overlap}")

# Embeddings generator
embedder = EmbeddingsGenerator()
print("✓ EmbeddingsGenerator initialized")

# Vector store manager
vectorstore_manager = ChromaVectorStore(
    embeddings=embedder,
    persist_directory="./data/vectorstore_pipeline_test"
)
print("✓ ChromaVectorStore initialized")
print(f"  Persist directory: {vectorstore_manager.persist_directory}")

Initializing pipeline components...

✓ DocumentLoader initialized
✓ DocumentSplitter initialized
  Chunk size: 1000
  Chunk overlap: 200
✓ EmbeddingsGenerator initialized
✓ ChromaVectorStore initialized
  Persist directory: ./data/vectorstore_pipeline_test


## Test 2: Load PDF Document

Load the sample PDF from data/samples/sample.pdf

In [3]:
pdf_path = "../data/samples/sample.pdf"
print(f"Loading PDF: {pdf_path}\n")

docs = loader.load_pdf(pdf_path)

print(f"✓ Loaded {len(docs)} pages from PDF\n")

# Show first document info
if docs:
    print("First page preview:")
    print(f"  Content length: {len(docs[0].page_content)} characters")
    print(f"  Metadata: {docs[0].metadata}")
    print(f"  First 200 chars: {docs[0].page_content[:200]}...")

Loading PDF: ../data/samples/sample.pdf

✓ Loaded 9 pages from PDF

First page preview:
  Content length: 1134 characters
  Metadata: {'source': '../data/samples/sample.pdf', 'page': 0, 'filename': 'sample.pdf', 'upload_date': '2026-02-07T18:27:02.529060'}
  First 200 chars: A Brief Introduction to Artificial Intelligence
What is AI and how is it going to shape the future 
By Dibbyo Saha, Undergraduate Student, Computer Science,
Ryerson University
What is Artificial Intel...


## Test 3: Split Documents into Chunks

Split the loaded documents into smaller chunks for embedding

In [4]:
print("Splitting documents into chunks...\n")

chunks = splitter.split_documents(docs)

print(f"✓ Created {len(chunks)} chunks\n")

# Show chunk statistics
chunk_lengths = [len(chunk.page_content) for chunk in chunks]
print("Chunk statistics:")
print(f"  Average length: {sum(chunk_lengths) / len(chunk_lengths):.0f} characters")
print(f"  Min length: {min(chunk_lengths)} characters")
print(f"  Max length: {max(chunk_lengths)} characters")

# Show first chunk
print("\nFirst chunk preview:")
print(f"  Length: {len(chunks[0].page_content)} characters")
print(f"  Content: {chunks[0].page_content[:200]}...")
print(f"  Metadata: {chunks[0].metadata}")

Splitting documents into chunks...

✓ Created 18 chunks

Chunk statistics:
  Average length: 786 characters
  Min length: 288 characters
  Max length: 999 characters

First chunk preview:
  Length: 999 characters
  Content: A Brief Introduction to Artificial Intelligence
What is AI and how is it going to shape the future 
By Dibbyo Saha, Undergraduate Student, Computer Science,
Ryerson University
What is Artificial Intel...
  Metadata: {'source': '../data/samples/sample.pdf', 'page': 0, 'filename': 'sample.pdf', 'upload_date': '2026-02-07T18:27:02.529060', 'chunk_id': 0}
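
The effect of `chunk_size=1000` and `chunk_overlap=200` can be illustrated with a minimal sliding-window splitter. This is only a sketch under the assumption of plain character windows; the notebook does not show how `DocumentSplitter` works internally, and the 288 to 999 character range above suggests it also respects separators, which this sketch does not. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return pieces


# Synthetic 2500-character input, used only for illustration
sample_text = "".join(str(i % 10) for i in range(2500))
demo_chunks = split_with_overlap(sample_text)
print([len(c) for c in demo_chunks])
# The last 200 characters of one chunk reappear at the start of the next
print(demo_chunks[0][-200:] == demo_chunks[1][:200])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.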


## Test 4: Create Vector Store

Generate embeddings and store chunks in ChromaDB

In [5]:
print("Creating vector store with embeddings...\n")
print("This may take a moment...\n")

vectorstore = vectorstore_manager.create_from_documents(
    documents=chunks,
    collection_name="pipeline_test"
)

print("✓ Vector store created successfully!")
print(f"✓ Stored {len(chunks)} chunks with embeddings")
print("✓ Collection name: pipeline_test")

Creating vector store with embeddings...

This may take a moment...



Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✓ Vector store created successfully!
✓ Stored 18 chunks with embeddings
✓ Collection name: pipeline_test
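
To make "generate embeddings" concrete, the sketch below maps text to a fixed-length, unit-norm vector by hashing words into buckets. It is a toy stand-in, not what `EmbeddingsGenerator` does: the real embedder is a learned model, and `toy_embed` and its `dim` parameter are hypothetical names used only to show the shape of the operation (text in, fixed-size vector out).

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0  # count how many words landed in this bucket
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave the all-zero vector untouched to avoid dividing by zero
    return [v / norm for v in vec] if norm else vec


vec_a = toy_embed("machine learning and deep learning")
vec_b = toy_embed("deep learning and machine learning")
print(len(vec_a))      # every text maps to the same dimensionality
print(vec_a == vec_b)  # word order is ignored by this toy scheme
```

Unlike this bag-of-words toy, a learned embedding model places texts with similar meaning near each other even when they share no words, which is what makes the searches in the following tests semantic.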


## Test 5: Similarity Search

Test semantic search on the stored documents

In [6]:
# Test query - adjust based on your PDF content
query = "What is Artificial Intelligence?"

print(f"Query: '{query}'\n")

results = vectorstore_manager.similarity_search(query, k=3)

print(f"✓ Found {len(results)} most relevant chunks:\n")

for i, result in enumerate(results, 1):
    print(f"--- Result {i} ---")
    print(f"Content: {result.page_content[:200]}...")
    print(f"Metadata: {result.metadata}")
    print()

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


Query: 'What is Artificial Intelligence?'

✓ Found 3 most relevant chunks:

--- Result 1 ---
Content: A Brief Introduction to Artificial Intelligence
What is AI and how is it going to shape the future 
By Dibbyo Saha, Undergraduate Student, Computer Science,
Ryerson University
What is Artificial Intel...
Metadata: {'chunk_id': 0, 'filename': 'sample.pdf', 'page': 0, 'source': '../data/samples/sample.pdf', 'upload_date': '2026-02-07T18:27:02.529060'}

--- Result 2 ---
Content: Intelligence
as
a 
process
that
is
going
to
help
machines
achieve
a
humanlike
mental
behaviour.
AI
is 
a
vast
and
growing
field
which
also
includes
a
lot
more
subfields
like
machine 
learning
and
deep...
Metadata: {'chunk_id': 6, 'filename': 'sample.pdf', 'page': 2, 'source': '../data/samples/sample.pdf', 'upload_date': '2026-02-07T18:27:02.529060'}

--- Result 3 ---
Content: complicated
and
intuitive
sense
of
thinking
and
problem-solving
abilities
of
the 
human mind.
A Brief History of AI
The
concept
of
Artific
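
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below does this with a brute-force scan over hand-written 3-dimensional vectors. It assumes squared Euclidean distance as the metric; Chroma uses an approximate nearest-neighbour index rather than a linear scan, and the names `squared_l2`, `top_k`, and the sample data are hypothetical.

```python
import heapq


def squared_l2(u: list[float], v: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = ((squared_l2(query_vec, vec), text) for text, vec in stored)
    return heapq.nsmallest(k, scored)


# Made-up vectors standing in for embedded chunks
stored_chunks = [
    ("chunk about AI definitions", [1.0, 0.0, 0.0]),
    ("chunk about AI history", [0.8, 0.6, 0.0]),
    ("chunk about education", [0.0, 0.0, 1.0]),
]

hits = top_k([0.9, 0.1, 0.0], stored_chunks, k=2)
for distance, text in hits:
    print(f"{distance:.2f}  {text}")
```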

## Test 6: Multiple Queries

Test different types of queries

In [7]:
# Adjust these queries based on your PDF content
test_queries = [
    "What is the difference between AI and traditional robotics?",
    "What are the subfields of AI mentioned in the document?",
    "How will AI impact jobs in the future?",
    "What are some current applications of AI?"
]

print("Testing multiple queries:\n")
print("=" * 70)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 70)
    
    results = vectorstore_manager.similarity_search(query, k=2)
    
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result.page_content[:100]}...")
        print(f"     Source: page {result.metadata.get('page', 'unknown')}")

Testing multiple queries:


Query: 'What is the difference between AI and traditional robotics?'
----------------------------------------------------------------------
  1. improves
by
a
noteworthy
extent.
AI
is
programmed
to
do
something
similar
to 
that!
Artificial Intel...
     Source: page 1
  2. complicated
and
intuitive
sense
of
thinking
and
problem-solving
abilities
of
the 
human mind.
A Brie...
     Source: page 2

Query: 'What are the subfields of AI mentioned in the document?'
----------------------------------------------------------------------
  1. A Brief Introduction to Artificial Intelligence
What is AI and how is it going to shape the future 
...
     Source: page 0
  2. Sour ce:h ttp://da tasciencecen tral.com
Deep
Learning,
on
the
other
hand
is
the
concept
of
computer...
     Source: page 4

Query: 'How will AI impact jobs in the future?'
----------------------------------------------------------------------
  1. great
tool
in
the
future
of
education.
AI
can
be
used
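
Several previews above print one word per line because the extracted page text contains a newline after almost every word. For display purposes, whitespace can be collapsed before truncating. The helper below is a sketch; `preview` is a hypothetical name and is not part of the project code.

```python
def preview(text: str, limit: int = 100) -> str:
    """Collapse every run of whitespace to one space, then truncate to limit characters."""
    flat = " ".join(text.split())
    if len(flat) <= limit:
        return flat
    return flat[:limit].rstrip() + "..."


# Text shaped like the extracted output shown above
raw = "great\ntool\nin\nthe\nfuture\nof\neducation.\nAI\ncan\nbe\nused"
print(preview(raw))
# → great tool in the future of education. AI can be used
```

This changes only what is printed. Whether to normalize whitespace before splitting and embedding is a separate decision that would alter chunk boundaries and search results.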


## Test 7: Search with Scores

Get relevance scores to understand search quality

In [8]:
query = "Explain Machine Learning and Deep Learning"

print(f"Query: '{query}'\n")

results_with_scores = vectorstore_manager.similarity_search_with_score(query, k=5)

print(f"✓ Top {len(results_with_scores)} results with relevance scores:\n")

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"--- Result {i} (Score: {score:.4f}) ---")
    print(f"Page: {doc.metadata.get('page', 'unknown')}")
    print(f"Content: {doc.page_content[:150]}...")
    print()

print("Note: ChromaDB returns distances, so lower scores mean higher similarity")

Query: 'Explain Machine Learning and Deep Learning'

✓ Top 5 results with relevance scores:

--- Result 1 (Score: 0.7966) ---
Page: 2
Content: Intelligence
as
a 
process
that
is
going
to
help
machines
achieve
a
humanlike
mental
behaviour.
AI
is 
a
vast
and
growing
field
which
also
includes
a
...

--- Result 2 (Score: 0.8995) ---
Page: 4
Content: Sour ce:h ttp://da tasciencecen tral.com
Deep
Learning,
on
the
other
hand
is
the
concept
of
computers
simulating
the 
process
a
human
brain
takes
to
a...

--- Result 3 (Score: 0.9833) ---
Page: 3
Content: which
is
not
apparently
comprehendible
by
the
human
eyes.
The 
machine
looks
for
patterns
and
draws
conclusions
on
its
own
from
the
patterns
of 
the
d...

--- Result 4 (Score: 1.0602) ---
Page: 3
Content: is
being
trained
by
giving
it
access
to
a
huge
amount
of
data
and
training
the 
machine
to
analyze
it.
For
instance,
the
machine
is
given
a
number
of
...

--- Result 5 (Score: 1.0617) ---
Page: 0
Content: A Brief Introduction to Artificial 
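
The scores above are distances, which is why smaller means more similar. If the collection uses squared Euclidean distance and the embeddings are unit-length (both are assumptions; the notebook confirms neither), a score d converts to cosine similarity as 1 - d/2. The sketch below checks that identity on small vectors and applies it to the top score; all function names are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))


def squared_l2(u: list[float], v: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_to_cosine(d2: float) -> float:
    """Convert squared L2 distance to cosine similarity; valid only for unit vectors."""
    return 1.0 - d2 / 2.0


u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(math.isclose(squared_l2(u, v), 2 - 2 * cosine(u, v)))

# Under the stated assumptions, the top score 0.7966 corresponds to this cosine similarity
print(round(distance_to_cosine(0.7966), 4))
```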

## Test 8: Using the Pipeline Class

Test the DocumentProcessingPipeline class for streamlined processing

In [9]:
from src.processing.document_processing_pipeline import DocumentProcessingPipeline, PipelineConfig

# Create configuration
config = PipelineConfig(
    chunk_size=1000,
    chunk_overlap=200,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    vectorstore_path="./data/vectorstore_pipeline_class_test"
)

# Initialize pipeline
pipeline = DocumentProcessingPipeline(config)

print("✓ Pipeline initialized with config:")
print(f"  Chunk size: {config.chunk_size}")
print(f"  Chunk overlap: {config.chunk_overlap}")
print(f"  Embedding model: {config.embedding_model}")
print(f"  Vectorstore path: {config.vectorstore_path}")

✓ Pipeline initialized with config:
  Chunk size: 1000
  Chunk overlap: 200
  Embedding model: sentence-transformers/all-MiniLM-L6-v2
  Vectorstore path: ./data/vectorstore_pipeline_class_test
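
A configuration object like this is a natural place to validate settings before any work starts. The sketch below is a hypothetical stand-in, not the project's `PipelineConfig`: the field names mirror the ones printed above, but the defaults, the class name `SketchPipelineConfig`, and the validation rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPipelineConfig:
    """Hypothetical config holder; field names mirror those printed in the notebook."""
    chunk_size: int = 1000
    chunk_overlap: int = 200
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    vectorstore_path: str = "./data/vectorstore_pipeline_class_test"

    def __post_init__(self) -> None:
        # Reject settings that would make a splitter stall or never advance
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")


sketch_config = SketchPipelineConfig()
print(sketch_config)

try:
    SketchPipelineConfig(chunk_size=100, chunk_overlap=100)
except ValueError as err:
    print(f"Rejected: {err}")
```

Freezing the dataclass keeps the settings from changing after the pipeline has been built from them.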


In [10]:
# Process PDF using pipeline
pdf_paths = ["../data/samples/sample.pdf"]

print("\nProcessing PDF with pipeline...\n")
vectorstore = pipeline.process_pdfs(pdf_paths)

print("\n✓ Pipeline processing complete!")


Processing PDF with pipeline...

📥 Loading 1 PDFs...
✅ Loaded 9 pages
✂️  Splitting documents into chunks...
✅ Created 18 chunks
🔢 Generating embeddings and storing in vector database...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✅ Processing complete! Vector store ready for search.

✓ Pipeline processing complete!


In [11]:
# Search using pipeline
query = "What concerns exist about AI and automation?"

print(f"\nSearching with query: '{query}'\n")

results = pipeline.search(query, k=3)

print(f"✓ Found {len(results)} results:\n")

for i, result in enumerate(results, 1):
    print(f"{i}. {result.page_content[:150]}...")
    print(f"   Page: {result.metadata.get('page', 'unknown')}\n")


Searching with query: 'What concerns exist about AI and automation?'



Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


✓ Found 3 results:

1. great
tool
in
the
future
of
education.
AI
can
be
used
to
analyze
data
from
an 
individual’s
personal
and
intellectual
needs,
capabilities,
choices
and...
   Page: 6

2. Intelligence
is
also
viewed
as
a
great
tool
for
better
cybersecurity.
Many
banks
are 
using
AI
as
a
means
to
identify
unauthorized
credit
cards
uses.
...
   Page: 5

3. fears
regarding
AI 
includes
the
scenario
whereas
machines
become
smarter
and
smarter
they
going
to 
end
up
being
as
opinionated
and
biased
like
some
...
   Page: 5



## Summary

### What We Tested:
1. ✅ Initialized all pipeline components
2. ✅ Loaded PDF document
3. ✅ Split documents into chunks
4. ✅ Created vector store with embeddings
5. ✅ Performed similarity search
6. ✅ Tested multiple queries
7. ✅ Got relevance scores
8. ✅ Used DocumentProcessingPipeline class

### Full Pipeline Verified:
```
PDF → DocumentLoader → Documents
  ↓
DocumentSplitter → Chunks
  ↓
EmbeddingsGenerator → Vectors
  ↓
ChromaVectorStore → Searchable Database
  ↓
Similarity Search → Relevant Results
```

### Key Findings:
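- **Complete Integration**: The loader, splitter, embedder, and vector store ran end to end, both as separate components and through `DocumentProcessingPipeline`
- **Real PDF Processing**: The 9-page sample PDF was loaded and split into 18 chunks (288 to 999 characters each)
- **Semantic Search**: Queries returned topically related chunks; for example, "What is Artificial Intelligence?" ranked the introduction chunk first
- **Pipeline Class**: `process_pdfs` and `search` cover loading, splitting, embedding, and querying in two calls
- **Persistence**: Each vector store was created with an on-disk persist directory for reuse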

### Next Steps:
1. Integrate with RAG (Retrieval-Augmented Generation) for Q&A
2. Add support for multiple document types
3. Implement advanced filtering and metadata search
4. Build a user interface for document upload and search
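
As a starting point for the metadata search step, results can be post-filtered on the metadata fields already attached to each chunk (`page`, `filename`, `source`). The sketch below uses a hypothetical `Hit` class in place of the real document objects. Chroma itself supports metadata filters at query time; whether the project's `ChromaVectorStore` wrapper exposes them is not shown in this notebook, so this example filters after retrieval instead.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical stand-in for a search result with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(results: list[Hit], **required) -> list[Hit]:
    """Keep only results whose metadata matches every required key/value pair."""
    return [
        r for r in results
        if all(r.metadata.get(key) == value for key, value in required.items())
    ]


# Made-up results shaped like the ones printed in this notebook
sample_hits = [
    Hit("fears regarding AI ...", {"page": 5, "filename": "sample.pdf"}),
    Hit("great tool in the future of education ...", {"page": 6, "filename": "sample.pdf"}),
    Hit("better cybersecurity ...", {"page": 5, "filename": "sample.pdf"}),
]

page_five = filter_by_metadata(sample_hits, page=5)
print([h.page_content for h in page_five])
```

Post-filtering can return fewer than `k` results, so a production version would more likely push the filter into the vector store query.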