# Lab 3: Vector Embeddings and Hybrid Search

This notebook processes HTML documents from the Databricks volume, generates vector embeddings,
stores them in Neo4j as a document graph, and enables hybrid search.

## Prerequisites

1. **Neo4j database** running (Aura or self-hosted, version 5.11+)
2. **Databricks Secrets** configured with `neo4j-creds` scope
3. **Unity Catalog Volume** with HTML files (from Lab 1)
4. **Lab 2 completed** - financial entity graph populated

## Embedding Model

This lab uses the **Databricks Foundation Model** `databricks-gte-large-en` (1024 dimensions, 8K-token context window).

---

## Step 1: Configuration

Load secrets from Databricks and configure connections.

In [None]:
# =============================================================================
# CONFIGURATION - Load from Databricks Secrets
# =============================================================================
import time

print("=" * 70)
print("CONFIGURATION")
print("=" * 70)

# Retrieve credentials from Databricks Secrets
NEO4J_URL = dbutils.secrets.get(scope="neo4j-creds", key="url")
NEO4J_USER = dbutils.secrets.get(scope="neo4j-creds", key="username")
NEO4J_PASS = dbutils.secrets.get(scope="neo4j-creds", key="password")
VOLUME_PATH = dbutils.secrets.get(scope="neo4j-creds", key="volume_path")

NEO4J_DATABASE = "neo4j"
HTML_PATH = f"{VOLUME_PATH}/html"

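# Note: Databricks redacts secret values in cell output, so these lines
# typically show [REDACTED] instead of the actual URL and volume path.
print(f"Neo4j URL:  {NEO4J_URL}")
print(f"HTML Path:  {HTML_PATH}")
print("[OK] Configuration loaded")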

## Step 2: Import Lab 3 Package

Import models, processing functions, and search components from the lab package.

In [None]:
# =============================================================================
# IMPORTS FROM LAB 3 PACKAGE
# =============================================================================
from neo4j import GraphDatabase

# Models
from lab_3_vector_embeddings.models import (
    ChunkConfig,
    EmbeddingConfig,
    EmbeddingProvider,
    IndexConfig,
    Neo4jConfig,
)

# Processing
from lab_3_vector_embeddings.processing import (
    process_html_content,
    chunk_document_sync,
)

# Embeddings
from lab_3_vector_embeddings.embeddings import (
    create_embedder,
    embed_chunks,
    get_default_config_for_provider,
)

# Graph writing
from lab_3_vector_embeddings.graph import write_document_graph

# Search
from lab_3_vector_embeddings.search import create_searcher

print("[OK] Lab 3 package imported")
print("  - Models: ChunkConfig, IndexConfig, Neo4jConfig, etc.")
print("  - Processing: process_html_content, chunk_document_sync")
print("  - Embeddings: create_embedder, embed_chunks")
print("  - Graph: write_document_graph")
print("  - Search: create_searcher")

In [None]:
# =============================================================================
# INITIALIZE CONFIGURATION OBJECTS
# =============================================================================

# Neo4j connection config
neo4j_config = Neo4jConfig(
    uri=NEO4J_URL,
    username=NEO4J_USER,
    password=NEO4J_PASS,
    database=NEO4J_DATABASE,
)

# Chunking config (neo4j-graphrag FixedSizeSplitter settings)
chunk_config = ChunkConfig(chunk_size=4000, chunk_overlap=200)

# Index config
index_config = IndexConfig()

# Embedding config - Databricks GTE Large (1024 dims, 8K context)
embedding_config = get_default_config_for_provider(EmbeddingProvider.DATABRICKS)

print("[OK] Configuration objects created")
print(f"  Chunk size: {chunk_config.chunk_size}, overlap: {chunk_config.chunk_overlap}")
print(f"  Embedding: {embedding_config.model_name} ({embedding_config.dimensions} dims)")
print(f"  Vector index: {index_config.vector_index_name}")
print(f"  Fulltext index: {index_config.fulltext_index_name}")

## Step 3: Verify Prerequisites

Check Neo4j connectivity and HTML file availability.

In [None]:
# =============================================================================
# VERIFY PREREQUISITES
# =============================================================================
print("=" * 70)
print("PREREQUISITE VERIFICATION")
print("=" * 70)

# Check HTML files
print(f"\n[1] HTML Files in {HTML_PATH}:")
files = dbutils.fs.ls(HTML_PATH)
html_files = [f for f in files if f.name.endswith('.html')]
for f in html_files:
    print(f"    - {f.name}")
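print(f"    Total: {len(html_files)} files")
# Fail early: the rest of the pipeline has nothing to do without input files
if not html_files:
    raise FileNotFoundError(f"No .html files found in {HTML_PATH}")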

# Check Neo4j connection
print("\n[2] Neo4j Connection:")
# Use the driver as a context manager so it is closed even if a check fails
with GraphDatabase.driver(NEO4J_URL, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    with driver.session(database=NEO4J_DATABASE) as session:
        session.run("RETURN 1 AS test").single()
    print(f"    [OK] Connected to {NEO4J_URL}")

    # Check existing entity graph from Lab 2
    print("\n[3] Existing Entity Graph (from Lab 2):")
    with driver.session(database=NEO4J_DATABASE) as session:
        result = session.run("""
            MATCH (n) WHERE n:Customer OR n:Company OR n:Stock
            RETURN labels(n)[0] AS label, count(n) AS count
            ORDER BY label
        """)
        for record in result:
            print(f"    {record['label']}: {record['count']} nodes")

print("\n[OK] All prerequisites verified")

## Step 4: Process Documents

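Load HTML files, extract text, and create chunks using the `step1_process_documents` pattern. The `STEP 1` to `STEP 3` banners printed by the following cells number the pipeline stages, not the notebook steps.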

In [None]:
# =============================================================================
# STEP 1: DOCUMENT PROCESSING
# =============================================================================
print("=" * 70)
print("STEP 1: Document Processing")
print("=" * 70)

start_time = time.time()
documents = []
all_chunks = []

for i, file_info in enumerate(html_files):
    filename = file_info.name
    filepath = f"{HTML_PATH}/{filename}"
    
    print(f"  [{i+1}/{len(html_files)}] Processing: {filename}")
    
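    # Read HTML content from Databricks volume.
    # Note: dbutils.fs.head returns at most the requested number of bytes
    # (100,000 here), so larger files are silently truncated.
    content = dbutils.fs.head(filepath, 100000)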
    
    # Process document using lab package
    doc = process_html_content(content, filename, filepath)
    documents.append(doc)
    print(f"           Type: {doc.document_type.value}, Chars: {len(doc.raw_text)}")
    
    # Chunk document using lab package
    chunks = chunk_document_sync(doc, chunk_config)
    all_chunks.extend(chunks)
    print(f"           Chunks: {len(chunks)}")

doc_processing_time = time.time() - start_time

print(f"\n  [OK] Processed {len(documents)} documents into {len(all_chunks)} chunks")
print(f"       Time: {doc_processing_time:.2f}s")
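
To make the `chunk_size=4000` and `chunk_overlap=200` settings concrete, the following is a minimal sketch of fixed-size splitting with overlap. `split_fixed` is a hypothetical helper written for illustration; it is not the implementation behind `chunk_document_sync()` or the neo4j-graphrag splitter, and it assumes sizes are measured in characters.

```python
def split_fixed(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 10_000
pieces = split_fixed(sample)
print(len(pieces), [len(p) for p in pieces])  # 3 [4000, 4000, 2400]
```

Each window starts 3,800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.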

## Step 5: Generate Embeddings

Create vector embeddings using the `step2_generate_embeddings` pattern.

In [None]:
# =============================================================================
# STEP 2: EMBEDDING GENERATION
# =============================================================================
from lab_3_vector_embeddings.embeddings import validate_embeddings

print("=" * 70)
print("STEP 2: Embedding Generation")
print("=" * 70)
print(f"  Provider: {embedding_config.provider.value}")
print(f"  Model: {embedding_config.model_name}")
print(f"  Dimensions: {embedding_config.dimensions}")

start_time = time.time()

# Create embedder using lab package
embedder = create_embedder(embedding_config)
print("\n  [OK] Embedder initialized")

# Generate embeddings using lab package
print(f"  Embedding {len(all_chunks)} chunks...")
embedded_chunks = embed_chunks(all_chunks, embedder)

# Validate embeddings
is_valid, errors = validate_embeddings(embedded_chunks, embedding_config.dimensions)
if not is_valid:
    print(f"  [WARNING] Embedding validation errors: {errors[:5]}")
else:
    print(f"  [OK] All embeddings validated ({embedding_config.dimensions} dimensions)")

embedding_time = time.time() - start_time
print(f"       Time: {embedding_time:.2f}s")
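
The validation step above returns an `(is_valid, errors)` pair. As a rough sketch of what such a check might cover, the hypothetical `check_embeddings` helper below verifies dimensionality and rejects non-finite or all-zero vectors. The checks actually performed by the lab package's `validate_embeddings()` are not shown in this notebook and may differ.

```python
import math


def check_embeddings(vectors: list[list[float]], expected_dims: int) -> tuple[bool, list[str]]:
    """Return (is_valid, errors) for a batch of embedding vectors.

    Hypothetical helper; mirrors only the return shape used in the cell above.
    """
    errors = []
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dims:
            errors.append(f"vector {i}: expected {expected_dims} dims, got {len(vec)}")
        elif not all(math.isfinite(x) for x in vec):
            errors.append(f"vector {i}: contains NaN or infinite values")
        elif not any(vec):
            # An all-zero vector has no direction, so cosine similarity is undefined
            errors.append(f"vector {i}: all zeros")
    return (not errors, errors)


ok, problems = check_embeddings([[0.1] * 1024, [0.2] * 512], expected_dims=1024)
print(ok, problems)
```

Catching a dimension mismatch here is cheaper than discovering it when Neo4j rejects the vector or the index returns no matches.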

## Step 6: Write Document Graph to Neo4j

Create indexes, constraints, nodes, and relationships using the `step3_write_to_neo4j` pattern.

In [None]:
# =============================================================================
# STEP 3: NEO4J GRAPH WRITING
# =============================================================================
print("=" * 70)
print("STEP 3: Neo4j Graph Writing")
print("=" * 70)
print(f"  URI: {neo4j_config.uri}")
print(f"  Database: {neo4j_config.database}")

start_time = time.time()

# Use the lab package's write_document_graph function
# This handles: constraints, indexes, documents, chunks, and relationships
graph_results = write_document_graph(
    neo4j_config=neo4j_config,
    documents=documents,
    chunks=embedded_chunks,
    embedding_dimensions=embedding_config.dimensions,
    index_config=index_config,
    clear_existing=False,  # Set True to clear existing document graph first
)

graph_writing_time = time.time() - start_time

print("\n  [OK] Graph writing complete")
print(f"       Documents written: {graph_results['documents'].get('documents_written', 0)}")
print(f"       Chunks written: {graph_results['chunks'].get('chunks_written', 0)}")
print(f"       FROM_DOCUMENT relationships: {graph_results['from_document'].get('from_document_relationships', 0)}")
print(f"       NEXT_CHUNK relationships: {graph_results['next_chunk'].get('next_chunk_relationships', 0)}")
print(f"       DESCRIBES relationships: {graph_results['describes'].get('describes_relationships', 0)}")
print(f"       Time: {graph_writing_time:.2f}s")

print("\n  Index Status:")
for name, status in graph_results.get('index_status', {}).items():
    print(f"    {name}: {status.get('state', 'unknown')}")
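
The `NEXT_CHUNK` count reported above is easier to interpret with a small model of how such links could be derived. The sketch below assumes that `NEXT_CHUNK` connects consecutive chunks within the same document; the field names `document_id`, `chunk_id`, and `index` are hypothetical, and the logic inside `write_document_graph()` is not shown in this notebook.

```python
from collections import defaultdict


def next_chunk_pairs(chunks: list[dict]) -> list[tuple[str, str]]:
    """Pair each chunk with its successor within the same document.

    Hypothetical helper; field names are assumptions made for illustration.
    """
    by_doc = defaultdict(list)
    for chunk in chunks:
        by_doc[chunk["document_id"]].append(chunk)

    pairs = []
    for doc_chunks in by_doc.values():
        ordered = sorted(doc_chunks, key=lambda c: c["index"])
        # zip(ordered, ordered[1:]) yields (chunk_i, chunk_i+1) for each neighbour pair
        pairs.extend((a["chunk_id"], b["chunk_id"]) for a, b in zip(ordered, ordered[1:]))
    return pairs


chunks = [
    {"document_id": "doc-a", "chunk_id": "a0", "index": 0},
    {"document_id": "doc-a", "chunk_id": "a1", "index": 1},
    {"document_id": "doc-a", "chunk_id": "a2", "index": 2},
    {"document_id": "doc-b", "chunk_id": "b0", "index": 0},
]
print(next_chunk_pairs(chunks))  # [('a0', 'a1'), ('a1', 'a2')]
```

Under this assumption, a document with n chunks contributes n - 1 links, so the expected `NEXT_CHUNK` total is the chunk count minus the number of documents that have at least one chunk.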

## Step 7: Search Demonstrations

Demonstrate vector, full-text, hybrid, and graph-aware search.

In [None]:
# =============================================================================
# INITIALIZE SEARCHER
# =============================================================================

# Create searcher using lab package
searcher = create_searcher(neo4j_config, embedder, index_config)
print("[OK] DocumentSearcher initialized")

In [None]:
# =============================================================================
# DEMO 1: VECTOR SEARCH
# =============================================================================
print("=" * 70)
print("DEMO 1: Vector Search")
print("=" * 70)
print("Query: 'investment strategies for moderate risk'\n")

results = searcher.vector_search("investment strategies for moderate risk")

for i, r in enumerate(results[:3]):
    print(f"{i+1}. Score: {r.score:.4f}")
    print(f"   Document: {r.document_title}")
    print(f"   Text: {r.text[:150]}...\n")
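
Vector search ranks chunks by how close their embeddings are to the query embedding. The sketch below shows the idea with brute-force cosine similarity over toy 3-dimensional vectors. It assumes cosine similarity as the metric (the index's similarity function is not shown in this notebook), and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of the lab package.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float],
                       chunk_vecs: dict[str, list[float]],
                       top_k: int = 3) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the top_k best matches."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


toy_chunks = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}
for chunk_id, score in rank_by_similarity([1.0, 0.0, 0.0], toy_chunks):
    print(f"{chunk_id}: {score:.4f}")
```

A database vector index avoids this full scan by using an approximate nearest-neighbour structure, and the scores it reports may be scaled differently from the raw cosine values computed here.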

In [None]:
# =============================================================================
# DEMO 2: FULL-TEXT SEARCH
# =============================================================================
print("=" * 70)
print("DEMO 2: Full-Text Search")
print("=" * 70)
print("Query: 'renewable energy'\n")

results = searcher.fulltext_search("renewable energy")

for i, r in enumerate(results[:3]):
    print(f"{i+1}. Score: {r.score:.4f}")
    print(f"   Document: {r.document_title}")
    print(f"   Text: {r.text[:150]}...\n")
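
Neo4j full-text indexes interpret the search string as a Lucene query, so characters such as `:`, `(`, or `&` act as operators rather than literal text. Whether `fulltext_search()` already escapes user input is not shown in this notebook. If it does not, a hypothetical helper along these lines could sanitize free-form queries before they are passed in.

```python
# Characters that Lucene's classic query parser treats as syntax
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(text: str) -> str:
    """Backslash-escape Lucene operator characters so they match literally.

    Hypothetical helper; not part of the lab package.
    """
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


print(escape_lucene("renewable energy"))  # renewable energy
print(escape_lucene("AT&T (NYSE: T)"))    # AT\&T \(NYSE\: T\)
```

Plain keyword queries like the one in this demo pass through unchanged; only queries containing operator characters are affected.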

In [None]:
# =============================================================================
# DEMO 3: HYBRID SEARCH
# =============================================================================
print("=" * 70)
print("DEMO 3: Hybrid Search")
print("=" * 70)
print("Query: 'customer technology portfolio'\n")

results = searcher.hybrid_search("customer technology portfolio")

for i, r in enumerate(results[:3]):
    print(f"{i+1}. Score: {r.score:.4f}")
    print(f"   Document: {r.document_title}")
    print(f"   Text: {r.text[:150]}...\n")
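
Hybrid search has to merge two result lists whose scores are not directly comparable: vector similarity and full-text relevance use different scales. How `hybrid_search()` combines them is not shown in this notebook. One widely used option, sketched here as a stand-in rather than as the lab's method, is Reciprocal Rank Fusion (RRF), which ignores raw scores and uses only each result's rank. The function name and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each appearance contributes 1 / (k + rank).

    Illustrative stand-in; the lab's hybrid_search() may combine scores differently.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


vector_hits = ["c1", "c2", "c3"]    # best-first results from vector search
fulltext_hits = ["c3", "c1", "c4"]  # best-first results from full-text search

for chunk_id, score in reciprocal_rank_fusion([vector_hits, fulltext_hits]):
    print(f"{chunk_id}: {score:.5f}")
```

Chunks that appear in both lists (`c1` and `c3`) rise above chunks found by only one method, which is the behaviour hybrid search is meant to provide. The constant `k` dampens the influence of top ranks; 60 is a conventional default.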

In [None]:
# =============================================================================
# DEMO 4: GRAPH-AWARE SEARCH
# =============================================================================
print("=" * 70)
print("DEMO 4: Graph-Aware Search")
print("=" * 70)
print("Query: 'retirement planning'")
print("Traverses: Chunk -> Document -> Customer\n")

result = searcher.vector_search_with_graph_traversal("retirement planning")

print(f"Found {len(result.search_results)} chunks")
print(f"Related customers: {len(result.related_customers)}")
print(f"Related companies: {len(result.related_companies)}")
print(f"Related stocks: {len(result.related_stocks)}")

if result.related_customers:
    print("\nConnected Customers:")
    for c in result.related_customers[:5]:
        name = f"{c.get('first_name', '')} {c.get('last_name', '')}".strip()
        print(f"  - {name}")

In [None]:
# =============================================================================
# CLEANUP AND SUMMARY
# =============================================================================
searcher.close()
print("[OK] Searcher closed")

# Pipeline Summary
total_time = doc_processing_time + embedding_time + graph_writing_time

print("\n" + "=" * 70)
print("PIPELINE COMPLETE")
print("=" * 70)
print(f"  Documents processed: {len(documents)}")
print(f"  Chunks created: {len(all_chunks)}")
print(f"  Total time: {total_time:.2f}s")
print("\n  Timing breakdown:")
print(f"    Document processing: {doc_processing_time:.2f}s")
print(f"    Embedding generation: {embedding_time:.2f}s")
print(f"    Graph writing: {graph_writing_time:.2f}s")
print("=" * 70)

## Summary

Lab 3 demonstrated:

1. **Document Processing** - HTML parsing and chunking via `process_html_content()` and `chunk_document_sync()`
2. **Embedding Generation** - Databricks `databricks-gte-large-en` model (1024 dims) via `create_embedder()`
3. **Graph Storage** - Document/Chunk nodes with relationships via `write_document_graph()`
4. **Search Capabilities** via `DocumentSearcher`:
   - `vector_search()` - Semantic similarity
   - `fulltext_search()` - Keyword matching
   - `hybrid_search()` - Combined vector + keyword
   - `vector_search_with_graph_traversal()` - Graph-aware retrieval

### Package Structure

```
lab_3_vector_embeddings/
├── models/      # Pydantic models (Neo4jConfig, ChunkConfig, etc.)
├── processing/  # Document processing (HTML parsing, chunking)
├── embeddings/  # Embedding providers (Databricks, SentenceTransformers)
├── graph/       # Neo4j graph writing
└── search/      # Search implementations
```