# Step 1.6: Vector Store (Chroma) Test

**Goal**: Store embeddings and enable similarity search

**File**: `src/vectorstore/chroma_store.py`

This notebook tests the `ChromaVectorStore` class, which:
- Stores document embeddings in ChromaDB
- Enables fast similarity search
- Persists data to disk for reuse
- Supports adding documents incrementally

## Setup

In [1]:
import sys
from pathlib import Path

# Add project root to path (assumes the notebook runs from a direct subdirectory of the project root)
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

from src.processing.embeddings import EmbeddingsGenerator
from src.vectorstore.chroma_store import ChromaVectorStore
from langchain.schema import Document

## Test 1: Initialize Components

Create embeddings generator and vectorstore manager

In [2]:
print("Initializing components...\n")

# Create embeddings generator
embedder = EmbeddingsGenerator()
print("✓ EmbeddingsGenerator initialized")

# Create vectorstore manager with test directory
vectorstore_manager = ChromaVectorStore(
    embeddings=embedder,
    persist_directory="./data/vectorstore_test"
)
print("✓ ChromaVectorStore manager initialized")
print(f"  Persist directory: {vectorstore_manager.persist_directory}")

Initializing components...

✓ EmbeddingsGenerator initialized
✓ ChromaVectorStore manager initialized
  Persist directory: ./data/vectorstore_test


## Test 2: Create Sample Documents

Create test documents with metadata

In [3]:
# Create sample documents about different topics
sample_docs = [
    Document(
        page_content="Climate change refers to long-term shifts in global temperatures and weather patterns. These shifts may be natural, but since the 1800s, human activities have been the main driver of climate change.",
        metadata={"source": "climate_basics.pdf", "page": 1, "topic": "climate"}
    ),
    Document(
        page_content="The greenhouse effect is the warming of Earth's surface and lower atmosphere caused by greenhouse gases. Carbon dioxide and methane are the primary greenhouse gases.",
        metadata={"source": "climate_basics.pdf", "page": 2, "topic": "climate"}
    ),
    Document(
        page_content="Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn.",
        metadata={"source": "ai_intro.pdf", "page": 1, "topic": "ai"}
    ),
    Document(
        page_content="Deep learning is a type of machine learning based on artificial neural networks. The learning process is deep because the structure of artificial neural networks consists of multiple input, output, and hidden layers.",
        metadata={"source": "ai_intro.pdf", "page": 3, "topic": "ai"}
    ),
    Document(
        page_content="Python is a high-level, interpreted programming language known for its simplicity and readability. It's widely used in data science, web development, automation, and artificial intelligence.",
        metadata={"source": "python_guide.pdf", "page": 1, "topic": "programming"}
    ),
    Document(
        page_content="Renewable energy comes from natural sources that are constantly replenished. Solar, wind, and hydroelectric power are examples of renewable energy sources that help reduce carbon emissions.",
        metadata={"source": "energy_report.pdf", "page": 5, "topic": "energy"}
    )
]

print(f"✓ Created {len(sample_docs)} sample documents\n")
for i, doc in enumerate(sample_docs, 1):
    print(f"{i}. [{doc.metadata['topic']}] {doc.page_content[:60]}...")

✓ Created 6 sample documents

1. [climate] Climate change refers to long-term shifts in global temperat...
2. [climate] The greenhouse effect is the warming of Earth's surface and ...
3. [ai] Machine learning is a subset of artificial intelligence that...
4. [ai] Deep learning is a type of machine learning based on artific...
5. [programming] Python is a high-level, interpreted programming language kno...
6. [energy] Renewable energy comes from natural sources that are constan...


## Test 3: Create Vectorstore from Documents

Store documents in ChromaDB with embeddings

In [4]:
print("Creating vectorstore from documents...\n")

vectorstore = vectorstore_manager.create_from_documents(
    documents=sample_docs,
    collection_name="test_collection"
)

print("✓ Vectorstore created successfully!")
print("✓ Collection name: test_collection")
print(f"✓ Number of documents stored: {len(sample_docs)}")

Creating vectorstore from documents...



Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✓ Vectorstore created successfully!
✓ Collection name: test_collection
✓ Number of documents stored: 6


## Test 4: Basic Similarity Search

Search for documents similar to a query

In [5]:
query = "What is climate change?"
print(f"Query: '{query}'\n")

results = vectorstore_manager.similarity_search(query, k=3)

print(f"✓ Found {len(results)} most similar documents:\n")

for i, result in enumerate(results, 1):
    print(f"--- Result {i} ---")
    print(f"Content: {result.page_content[:150]}...")
    print(f"Metadata: {result.metadata}")
    print()

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


Query: 'What is climate change?'

✓ Found 3 most similar documents:

--- Result 1 ---
Content: Climate change refers to long-term shifts in global temperatures and weather patterns. These shifts may be natural, but since the 1800s, human activit...
Metadata: {'page': 1, 'source': 'climate_basics.pdf', 'topic': 'climate'}

--- Result 2 ---
Content: The greenhouse effect is the warming of Earth's surface and lower atmosphere caused by greenhouse gases. Carbon dioxide and methane are the primary gr...
Metadata: {'page': 2, 'source': 'climate_basics.pdf', 'topic': 'climate'}

--- Result 3 ---
Content: Renewable energy comes from natural sources that are constantly replenished. Solar, wind, and hydroelectric power are examples of renewable energy sou...
Metadata: {'page': 5, 'source': 'energy_report.pdf', 'topic': 'energy'}
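
The search above can be pictured as "embed the query, measure the distance to every stored vector, keep the `k` closest". The following is a minimal sketch of that idea, not the `ChromaVectorStore` implementation: `toy_embed`, `squared_l2`, and `top_k` are hypothetical helpers, and the bag-of-words vectors stand in for a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Toy bag-of-words embedding, unit-normalized (stand-in for a real model)."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    counts = Counter(cleaned.split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, docs, vocab, k):
    """Return the k (distance, text) pairs with the smallest distance to the query."""
    q = toy_embed(query, vocab)
    scored = [(squared_l2(q, toy_embed(doc, vocab)), doc) for doc in docs]
    return sorted(scored)[:k]


docs = [
    "climate change shifts global temperatures",
    "machine learning is a subset of artificial intelligence",
    "python is a programming language",
]
vocab = sorted({word for doc in docs for word in doc.split()})

results = top_k("What is climate change?", docs, vocab, k=2)
for distance, text in results:
    print(f"{distance:.4f}  {text}")
```

A real vector store typically replaces the linear scan with an approximate nearest-neighbor index, but the ranking principle is the same.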



## Test 5: Similarity Search with Scores

Get relevance scores along with results

In [6]:
query = "Tell me about artificial intelligence and neural networks"
print(f"Query: '{query}'\n")

results_with_scores = vectorstore_manager.similarity_search_with_score(query, k=4)

print(f"✓ Found {len(results_with_scores)} results with scores:\n")

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"--- Result {i} (Score: {score:.4f}) ---")
    print(f"Topic: {doc.metadata['topic']}")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content[:100]}...")
    print()

print("Note: Lower scores indicate higher similarity in Chroma")

Query: 'Tell me about artificial intelligence and neural networks'

✓ Found 4 results with scores:

--- Result 1 (Score: 0.8297) ---
Topic: ai
Source: ai_intro.pdf
Content: Deep learning is a type of machine learning based on artificial neural networks. The learning proces...

--- Result 2 (Score: 0.9514) ---
Topic: ai
Source: ai_intro.pdf
Content: Machine learning is a subset of artificial intelligence that enables systems to learn and improve fr...

--- Result 3 (Score: 1.4755) ---
Topic: programming
Source: python_guide.pdf
Content: Python is a high-level, interpreted programming language known for its simplicity and readability. I...

--- Result 4 (Score: 1.8221) ---
Topic: climate
Source: climate_basics.pdf
Content: The greenhouse effect is the warming of Earth's surface and lower atmosphere caused by greenhouse ga...

Note: Lower scores indicate higher similarity in Chroma
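
The scores above behave like distances, which is why lower means more similar. The sketch below shows how such a distance relates to cosine similarity under two assumptions that the notebook does not confirm: that the store uses squared L2 distance, and that the embeddings are unit-normalized. In that case `d² = 2 - 2·cos`, so `cos = 1 - d²/2`. The function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distance_to_cosine(d2):
    """Convert squared L2 distance to cosine similarity (unit vectors only)."""
    return 1.0 - d2 / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

d2 = squared_l2(a, b)
print(f"squared L2 distance: {d2:.4f}")
print(f"cosine (direct):     {cosine(a, b):.4f}")
print(f"cosine (from d2):    {distance_to_cosine(d2):.4f}")
```

If both assumptions hold for this store, the top score of 0.8297 would correspond to a cosine similarity of roughly 0.585. If the collection uses a different metric, the conversion does not apply.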


## Test 6: Multiple Different Queries

Test various queries to verify semantic search works correctly

In [7]:
test_queries = [
    "How does global warming work?",
    "What programming language should I learn?",
    "Explain machine learning algorithms",
    "What are clean energy sources?"
]

print("Testing multiple queries:\n")
print("=" * 70)

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 70)
    
    results = vectorstore_manager.similarity_search(query, k=2)
    
    for i, result in enumerate(results, 1):
        print(f"  {i}. [{result.metadata['topic']}] {result.page_content[:80]}...")

Testing multiple queries:


Query: 'How does global warming work?'
----------------------------------------------------------------------
  1. [climate] The greenhouse effect is the warming of Earth's surface and lower atmosphere cau...
  2. [climate] Climate change refers to long-term shifts in global temperatures and weather pat...

Query: 'What programming language should I learn?'
----------------------------------------------------------------------
  1. [programming] Python is a high-level, interpreted programming language known for its simplicit...
  2. [ai] Machine learning is a subset of artificial intelligence that enables systems to ...

Query: 'Explain machine learning algorithms'
----------------------------------------------------------------------
  1. [ai] Machine learning is a subset of artificial intelligence that enables systems to ...
  2. [ai] Deep learning is a type of machine learning based on artificial neural networks....

Query: 'What are clean energy sources?

## Test 7: Add More Documents

Test adding documents to existing vectorstore

In [8]:
# Create new documents to add
new_docs = [
    Document(
        page_content="Natural language processing (NLP) is a branch of AI that helps computers understand, interpret, and manipulate human language. It combines computational linguistics with machine learning.",
        metadata={"source": "nlp_guide.pdf", "page": 1, "topic": "ai"}
    ),
    Document(
        page_content="Solar panels convert sunlight directly into electricity using photovoltaic cells. This clean energy technology has become increasingly efficient and affordable in recent years.",
        metadata={"source": "solar_tech.pdf", "page": 2, "topic": "energy"}
    )
]

print(f"Adding {len(new_docs)} new documents to existing vectorstore...\n")

vectorstore_manager.add_documents(new_docs)

print("✓ Documents added successfully!")
print(f"✓ Total documents now: {len(sample_docs) + len(new_docs)}")

# Test search with new content
print("\nTesting search with newly added content:")
results = vectorstore_manager.similarity_search("What is NLP?", k=2)
print(f"\nTop result: {results[0].page_content[:100]}...")
print(f"Source: {results[0].metadata['source']}")

Adding 2 new documents to existing vectorstore...

✓ Documents added successfully!
✓ Total documents now: 8

Testing search with newly added content:

Top result: Natural language processing (NLP) is a branch of AI that helps computers understand, interpret, and ...
Source: nlp_guide.pdf
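
Note that the total above is computed from the list lengths, not read back from the store. Depending on how `add_documents()` assigns IDs, re-running the cell may add the same documents again. One common way to make incremental adds repeatable is to derive each ID from the document content. The sketch below illustrates this with a hypothetical in-memory `ToyStore`; it does not reflect the actual `ChromaVectorStore` API.

```python
import hashlib


class ToyStore:
    """Hypothetical in-memory store keyed by a hash of the document content."""

    def __init__(self):
        self._docs = {}

    def add_documents(self, docs):
        """Insert documents, skipping duplicates; return how many were new."""
        added = 0
        for doc in docs:
            doc_id = hashlib.sha256(doc["page_content"].encode("utf-8")).hexdigest()
            if doc_id not in self._docs:
                added += 1
            self._docs[doc_id] = doc  # same content maps to the same key
        return added

    def count(self):
        """Number of distinct documents currently stored."""
        return len(self._docs)


store = ToyStore()
new_docs = [
    {"page_content": "NLP helps computers understand human language.",
     "metadata": {"topic": "ai"}},
    {"page_content": "Solar panels convert sunlight into electricity.",
     "metadata": {"topic": "energy"}},
]

first_run = store.add_documents(new_docs)
second_run = store.add_documents(new_docs)  # simulates re-running the cell
print(f"added on first run: {first_run}, on second run: {second_run}")
print(f"documents in store: {store.count()}")
```

Reading the count from the store itself, rather than summing list lengths, also makes the "Total documents now" line a real check.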


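## Test 8: Full Pipeline Integration (Template)

Template for the load, split, embed, and search pipeline (commented out)

In [9]: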
# Note: This test requires actual PDF files and the DocumentLoader/Splitter classes
# Uncomment and run if you have the required components:

'''
from src.processing.document_loader import DocumentLoader
from src.processing.document_splitter import DocumentSplitter

# Full pipeline test
print("Running full pipeline test...\n")

# Step 1: Load PDF
loader = DocumentLoader()
docs = loader.load_pdf("data/samples/sample.pdf")
print(f"✓ Loaded {len(docs)} documents from PDF")

# Step 2: Split into chunks
splitter = DocumentSplitter()
chunks = splitter.split_documents(docs)
print(f"✓ Split into {len(chunks)} chunks")

# Step 3: Create embeddings and vectorstore
embedder = EmbeddingsGenerator()
vectorstore_manager = ChromaVectorStore(embedder)
vectorstore = vectorstore_manager.create_from_documents(chunks)
print(f"✓ Created vectorstore with {len(chunks)} chunks")

# Step 4: Search test
results = vectorstore_manager.similarity_search("climate change", k=3)
print(f"\n✓ Search completed! Found {len(results)} results:\n")

for i, result in enumerate(results, 1):
    print(f"--- Result {i} ---")
    print(result.page_content[:200])
    print(result.metadata)
    print()
'''

print("Full pipeline test code is ready (commented out).")
print("Uncomment the code above when you have:")
print("  1. DocumentLoader class implemented")
print("  2. DocumentSplitter class implemented")
print("  3. Sample PDF file at data/samples/sample.pdf")

Full pipeline test code is ready (commented out).
Uncomment the code above when you have:
  1. DocumentLoader class implemented
  2. DocumentSplitter class implemented
  3. Sample PDF file at data/samples/sample.pdf


## Test 9: Persistence Test

Verify that vectorstore persists to disk and can be reloaded

In [10]:
print("Testing persistence...\n")

# Create a new vectorstore manager instance
new_embedder = EmbeddingsGenerator()
new_manager = ChromaVectorStore(
    embeddings=new_embedder,
    persist_directory="./data/vectorstore_test"
)

# Load existing vectorstore
print("Loading existing vectorstore from disk...")
loaded_vectorstore = new_manager.load_existing(collection_name="test_collection")
print("✓ Vectorstore loaded successfully!\n")

# Test search on loaded vectorstore
print("Testing search on loaded vectorstore:")
results = new_manager.similarity_search("climate change", k=2)

print(f"✓ Found {len(results)} results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. {result.page_content[:80]}...")

print("\n✓ Persistence works! Data survives across sessions.")

Testing persistence...



Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


Loading existing vectorstore from disk...
✓ Vectorstore loaded successfully!

Testing search on loaded vectorstore:
✓ Found 2 results:

1. Climate change refers to long-term shifts in global temperatures and weather pat...
2. The greenhouse effect is the warming of Earth's surface and lower atmosphere cau...

✓ Persistence works! Data survives across sessions.
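
The test above reloads the collection with a second manager inside the same kernel. The pattern it relies on is "write on add, read on load, both keyed by the persist directory". Here is a minimal sketch of that round trip using a JSON file in a temporary directory. `ToyPersistentStore` and its file layout are hypothetical and unrelated to how Chroma stores data on disk.

```python
import json
import tempfile
from pathlib import Path


class ToyPersistentStore:
    """Hypothetical store that writes its records to one JSON file."""

    def __init__(self, persist_directory):
        self.path = Path(persist_directory) / "collection.json"
        self.records = []

    def add(self, page_content, metadata):
        """Append a record and write the whole collection to disk."""
        self.records.append({"page_content": page_content, "metadata": metadata})
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.records), encoding="utf-8")

    def load_existing(self):
        """Read records written earlier by any instance using the same directory."""
        self.records = json.loads(self.path.read_text(encoding="utf-8"))
        return self.records


with tempfile.TemporaryDirectory() as tmp_dir:
    writer = ToyPersistentStore(tmp_dir)
    writer.add("Climate change refers to long-term shifts.", {"topic": "climate"})

    reader = ToyPersistentStore(tmp_dir)  # fresh instance, no shared state
    reloaded = reader.load_existing()

print(f"reloaded {len(reloaded)} record(s): {reloaded[0]['metadata']}")
```

For a stricter check of the real store, restarting the kernel before running the persistence cell rules out any state shared in memory.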


## Test 10: Metadata Filtering (Advanced)

Demonstrate filtering by metadata

In [11]:
print("Testing metadata filtering...\n")

# Search only within climate-related documents
query = "What causes environmental changes?"

# The `filter` argument restricts the search to documents whose metadata matches (Chroma's `where` clause)
results = vectorstore_manager.vectorstore.similarity_search(
    query,
    k=3,
    filter={"topic": "climate"}
)

print(f"Query: '{query}'")
print("Filter: topic = 'climate'\n")
print(f"✓ Found {len(results)} results (all should be climate-related):\n")

for i, result in enumerate(results, 1):
    print(f"{i}. Topic: {result.metadata['topic']}")
    print(f"   {result.page_content[:100]}...\n")

Testing metadata filtering...

Query: 'What causes environmental changes?'
Filter: topic = 'climate'

✓ Found 2 results (all should be climate-related):

1. Topic: climate
   Climate change refers to long-term shifts in global temperatures and weather patterns. These shifts ...

2. Topic: climate
   The greenhouse effect is the warming of Earth's surface and lower atmosphere caused by greenhouse ga...
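
The query asked for `k=3` but only two results came back, because only two stored documents have `topic == "climate"`. The sketch below shows the filter-then-rank logic behind that behavior. `matches` and `filtered_top_k` are hypothetical helpers and the distances are made-up illustration values, not scores from the notebook.

```python
def matches(metadata, where):
    """True if every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())


def filtered_top_k(scored_docs, k, where):
    """Keep documents matching the filter, then return the k closest."""
    candidates = [doc for doc in scored_docs if matches(doc["metadata"], where)]
    return sorted(candidates, key=lambda doc: doc["distance"])[:k]


# Illustrative distances only
scored_docs = [
    {"distance": 0.9, "metadata": {"topic": "climate", "page": 1}},
    {"distance": 1.0, "metadata": {"topic": "energy", "page": 5}},
    {"distance": 1.1, "metadata": {"topic": "climate", "page": 2}},
    {"distance": 1.7, "metadata": {"topic": "ai", "page": 1}},
]

hits = filtered_top_k(scored_docs, k=3, where={"topic": "climate"})
print(f"requested k=3, got {len(hits)} result(s)")
for hit in hits:
    print(hit["metadata"])
```

Without the filter, the energy document would rank second. With it, that document is excluded before ranking, so fewer than `k` results can be returned.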



## Cleanup (Optional)

Remove test vectorstore directory if needed

In [12]:
import shutil
import os

# Uncomment to clean up test data
'''
test_dir = "./data/vectorstore_test"
if os.path.exists(test_dir):
    shutil.rmtree(test_dir)
    print(f"✓ Cleaned up test directory: {test_dir}")
'''

print("Cleanup code ready (commented out)")
print("Uncomment to remove test vectorstore directory")

Cleanup code ready (commented out)
Uncomment to remove test vectorstore directory


## Summary

### What We Tested:
1. ✅ Initialize ChromaVectorStore with embeddings
2. ✅ Create vectorstore from documents
3. ✅ Perform similarity search
4. ✅ Get similarity scores
5. ✅ Test multiple different queries
6. ✅ Add documents to existing vectorstore
7. ⚠️ Full pipeline integration (template provided, not executed)
8. ✅ Persistence and reloading
9. ✅ Metadata filtering

### Key Findings:
- **Semantic Search Works**: Climate queries return climate docs, AI queries return AI docs
- **Persistence**: Data is saved to disk and can be reloaded
- **Incremental Updates**: Can add documents without recreating entire store
- **Metadata Support**: Can filter searches by metadata fields
- **Scoring**: Scores are distances, so lower scores = higher similarity in Chroma
- **Integration Ready**: Accepts an `EmbeddingsGenerator` instance as its embedding provider

### ChromaVectorStore Methods Verified:
- `create_from_documents()` - Create new vectorstore
- `load_existing()` - Load persisted vectorstore
- `add_documents()` - Add more documents
- `similarity_search()` - Find similar documents
- `similarity_search_with_score()` - Get relevance scores

### Next Steps:
1. Test with real PDF documents using DocumentLoader
2. Integrate with RAG pipeline for question answering
3. Experiment with different collection names for organizing documents
4. Optimize chunk size and overlap for your use case