# Step 1.5: Embeddings Generator Test

**Goal**: Convert text chunks to vector embeddings using HuggingFace models

**File**: `src/processing/embeddings.py`

This notebook tests the `EmbeddingsGenerator` class, which:
- Uses free HuggingFace models that run locally (no API calls)
- Converts text to numerical vectors
- Enables semantic similarity search

## Setup

In [1]:
import sys
from pathlib import Path

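# Add the project root to sys.path so that `src` is importable.
# This assumes the notebook runs from a direct subdirectory of the project root.
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))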

from src.processing.embeddings import EmbeddingsGenerator

## Test 1: Initialize Embeddings Generator

Create an instance using the default model: `sentence-transformers/all-MiniLM-L6-v2`
- Fast and lightweight
- Good for general-purpose semantic search
- Produces 384-dimensional vectors

In [2]:
print("Initializing EmbeddingsGenerator...")
embedder = EmbeddingsGenerator()
print("✓ Embeddings model loaded successfully!")

Initializing EmbeddingsGenerator...
✓ Embeddings model loaded successfully!


## Test 2: Embed a Single Query

Test the `embed_query()` method with a sample question

In [3]:
query = "What is climate change?"
print(f"Query: '{query}'\n")

query_vector = embedder.embed_query(query)

print(f"✓ Vector dimension: {len(query_vector)}")
print(f"✓ First 5 values: {query_vector[:5]}")
print(f"✓ Last 5 values: {query_vector[-5:]}")
print(f"\nVector type: {type(query_vector)}")

Query: 'What is climate change?'

✓ Vector dimension: 384
✓ First 5 values: [-0.037313830107450485, 0.09820153564214706, 0.0566871240735054, 0.06354967504739761, 0.03308780491352081]
✓ Last 5 values: [0.046277157962322235, -0.033581286668777466, -0.04362424835562706, -0.00490030599758029, 0.0240323543548584]

Vector type: <class 'list'>


## Test 3: Embed Multiple Documents

Test the `embed_documents()` method with multiple text chunks

In [4]:
documents = [
    "Climate change refers to long-term shifts in global temperatures and weather patterns.",
    "Machine learning is a subset of artificial intelligence.",
    "The greenhouse effect is the warming of Earth's surface and lower atmosphere.",
    "Python is a popular programming language for data science."
]

print("Embedding multiple documents...\n")
doc_vectors = embedder.embed_documents(documents)

print(f"✓ Number of documents embedded: {len(doc_vectors)}")
print(f"✓ Each vector dimension: {len(doc_vectors[0])}")
print(f"\nFirst document vector (first 5 values): {doc_vectors[0][:5]}")

Embedding multiple documents...

✓ Number of documents embedded: 4
✓ Each vector dimension: 384

First document vector (first 5 values): [-0.02915601246058941, 0.019391369074583054, 0.13551342487335205, 0.05121537297964096, 0.02877393178641796]
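
Tests 2 and 3 only print the vector dimension. A slightly stricter sanity check is sketched below, assuming `embed_documents()` returns a list of equal-length lists of floats, as the output above suggests. It confirms that every vector has the expected length, that no value is NaN or infinite, and it reports each vector's L2 norm. The helper name `check_embeddings` and the stand-in vectors are hypothetical; in the notebook, `doc_vectors` and `expected_dim=384` would be passed instead.

```python
import numpy as np

def check_embeddings(vectors, expected_dim):
    """Validate a batch of embeddings and return the L2 norm of each vector.

    Hypothetical helper: raises ValueError if the batch is empty or ragged,
    has the wrong dimension, or contains NaN/infinite values.
    """
    lengths = {len(v) for v in vectors}
    if lengths != {expected_dim}:
        raise ValueError(
            f"expected every vector to have {expected_dim} dims, got {sorted(lengths)}"
        )
    arr = np.asarray(vectors, dtype=float)
    if not np.isfinite(arr).all():
        raise ValueError("embeddings contain NaN or infinite values")
    # One norm per row (per document).
    return np.linalg.norm(arr, axis=1)

# Stand-in data (not real model output): two 3-dimensional vectors.
sample_vectors = [[3.0, 4.0, 0.0], [0.0, 0.6, 0.8]]
norms = check_embeddings(sample_vectors, expected_dim=3)
print(norms)  # approximately 5.0 and 1.0 for this stand-in data
```

Norms close to 1.0 for every row would indicate unit-normalized embeddings, in which case a plain dot product gives the same ranking as cosine similarity.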


## Test 4: Semantic Similarity Demonstration

Calculate cosine similarity between query and documents to show semantic understanding

In [5]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Query about climate
climate_query = "Tell me about global warming"
climate_vector = embedder.embed_query(climate_query)

print(f"Query: '{climate_query}'\n")
print("Similarity scores with documents:\n")

for i, doc in enumerate(documents):
    similarity = cosine_similarity(climate_vector, doc_vectors[i])
    print(f"{i+1}. [{similarity:.4f}] {doc[:60]}...")

print("\n✓ Notice: Climate-related documents have higher similarity scores!")

Query: 'Tell me about global warming'

Similarity scores with documents:

1. [0.4786] Climate change refers to long-term shifts in global temperat...
2. [0.1482] Machine learning is a subset of artificial intelligence....
3. [0.5693] The greenhouse effect is the warming of Earth's surface and ...
4. [0.1785] Python is a popular programming language for data science....

✓ Notice: Climate-related documents have higher similarity scores!
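
The loop above calls `cosine_similarity` once per document. The same scores can be computed in one step by stacking the document vectors into a matrix. The sketch below uses small stand-in vectors so the result is easy to verify by hand; `cosine_similarities` is a hypothetical helper, not part of `EmbeddingsGenerator`.

```python
import numpy as np

def cosine_similarities(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Dot product of every document row with the query, divided by both norms.
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Stand-in 2-D vectors (not real embeddings).
toy_query = [1.0, 0.0]
toy_docs = [
    [2.0, 0.0],   # same direction as the query  -> similarity 1
    [0.0, 3.0],   # orthogonal to the query      -> similarity 0
    [-1.0, 0.0],  # opposite direction           -> similarity -1
]
scores = cosine_similarities(toy_query, toy_docs)
print(scores)
```

In the notebook, `cosine_similarities(climate_vector, doc_vectors)` should yield the same four scores as the loop, up to floating-point rounding.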


## Test 5: Different Queries Comparison

Compare how different queries relate to the same set of documents

In [6]:
queries = [
    "What causes global warming?",
    "How does AI work?",
    "Best programming language for data analysis"
]

print("Comparing different queries:\n")

for query in queries:
    query_vec = embedder.embed_query(query)
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    
    # Find most similar document
    similarities = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vectors]
    best_match_idx = np.argmax(similarities)
    
    print(f"Best match: {documents[best_match_idx][:60]}...")
    print(f"Similarity: {similarities[best_match_idx]:.4f}")

Comparing different queries:


Query: 'What causes global warming?'
------------------------------------------------------------
Best match: The greenhouse effect is the warming of Earth's surface and ...
Similarity: 0.6267

Query: 'How does AI work?'
------------------------------------------------------------
Best match: Machine learning is a subset of artificial intelligence....
Similarity: 0.5104

Query: 'Best programming language for data analysis'
------------------------------------------------------------
Best match: Python is a popular programming language for data science....
Similarity: 0.6565
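
Test 5 reports only the single best match via `np.argmax`. When more than one candidate is wanted, the scores can be ranked with `np.argsort`. The sketch below reuses the four similarity scores printed in Test 4; `top_k` is a hypothetical helper name.

```python
import numpy as np

def top_k(similarities, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(similarities, dtype=float)
    # argsort sorts ascending, so reverse the order and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Scores copied from the Test 4 output above.
test4_scores = [0.4786, 0.1482, 0.5693, 0.1785]
ranked = top_k(test4_scores, k=2)
print(ranked)  # [(2, 0.5693), (0, 0.4786)]
```

Indices 2 and 0 correspond to the greenhouse-effect and climate-change documents, the two climate-related entries in `documents`.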


## Test 6: Get Embeddings Instance

Test the `get_embeddings()` method (used by vectorstore)

In [7]:
embeddings_instance = embedder.get_embeddings()
print(f"✓ Embeddings instance type: {type(embeddings_instance)}")
print(f"✓ Model name: {embeddings_instance.model_name}")
print("✓ This instance will be passed to ChromaVectorStore")

✓ Embeddings instance type: <class 'langchain_community.embeddings.huggingface.HuggingFaceEmbeddings'>
✓ Model name: sentence-transformers/all-MiniLM-L6-v2
✓ This instance will be passed to ChromaVectorStore


## Test 7: Performance Check

Measure embedding speed for different batch sizes

In [8]:
import time

# Test with different batch sizes
test_texts = [f"This is test document number {i}" for i in range(100)]

print("Performance test:\n")

# time.perf_counter() is a monotonic, high-resolution clock intended for timing

# Single query
start = time.perf_counter()
_ = embedder.embed_query(test_texts[0])
single_time = time.perf_counter() - start
print(f"Single query: {single_time:.4f} seconds")

# Batch of 10
start = time.perf_counter()
_ = embedder.embed_documents(test_texts[:10])
batch_10_time = time.perf_counter() - start
print(f"Batch of 10: {batch_10_time:.4f} seconds ({batch_10_time/10:.4f} per doc)")

# Batch of 100
start = time.perf_counter()
_ = embedder.embed_documents(test_texts)
batch_100_time = time.perf_counter() - start
print(f"Batch of 100: {batch_100_time:.4f} seconds ({batch_100_time/100:.4f} per doc)")

print("\n✓ Batch processing is more efficient!")

Performance test:

Single query: 0.0649 seconds
Batch of 10: 0.0715 seconds (0.0071 per doc)
Batch of 100: 0.5873 seconds (0.0059 per doc)

✓ Batch processing is more efficient!
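
Each figure above comes from a single run, so it is sensitive to caching and background load. A more stable measurement repeats each call and takes the median. The sketch below shows one way to do that; `median_runtime` and the stand-in `fake_embed` workload are hypothetical, and in the notebook `embedder.embed_documents` would be passed instead.

```python
import statistics
import time

def median_runtime(func, *args, repeats=5):
    """Call func(*args) `repeats` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload (not a real model): maps each text to a 1-element vector.
def fake_embed(texts):
    return [[float(len(t))] for t in texts]

sample_texts = [f"This is test document number {i}" for i in range(100)]
elapsed = median_runtime(fake_embed, sample_texts, repeats=5)
print(f"Median over 5 runs: {elapsed:.6f} seconds "
      f"({elapsed / len(sample_texts):.8f} per doc)")
```

The median is used rather than the mean so that one slow outlier run does not dominate the result.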


## Summary

### What We Tested:
1. ✅ Initialize EmbeddingsGenerator with default model
2. ✅ Embed single queries with `embed_query()`
3. ✅ Embed multiple documents with `embed_documents()`
4. ✅ Verify vector dimensions (384 for all-MiniLM-L6-v2)
5. ✅ Demonstrate semantic similarity
6. ✅ Get embeddings instance for vectorstore
7. ✅ Check performance characteristics

### Key Findings:
- Model runs locally (no API calls needed)
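- Produces 384-dimensional vectors (unit normalization was not checked in this notebook)
- Captures semantic similarity: in Test 4 the climate query scored 0.48 to 0.57 against climate documents and 0.15 to 0.18 against unrelated ones
- Batch processing was cheaper per document in this run: about 0.006 to 0.007 seconds per document in batches, versus 0.065 seconds for a single query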
- Ready to integrate with ChromaVectorStore

### Next Step:
Test the ChromaVectorStore in notebook `04_vectorstore_test.ipynb`