# Embeddings and Vector Search

This notebook demonstrates how to generate embeddings for text chunks and perform vector similarity search using Neo4j.

**Prerequisites:** Complete [01_01 Data Loading](01_01_data_loading.ipynb) first.

**Learning Objectives:**
- Understand what embeddings are and why they matter for RAG
- Use `FixedSizeSplitter` to automatically chunk text
- Generate embeddings using Azure OpenAI
- Create a vector index in Neo4j
- Perform similarity search to find relevant chunks

---

## What are Embeddings?

Embeddings are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meanings map to nearby vectors, which enables **semantic search**: finding content by meaning rather than by exact keywords.

```
"Apple makes iPhones" → [0.12, -0.45, 0.78, ...] (1536 dimensions)
"The company produces smartphones" → [0.11, -0.44, 0.77, ...] (similar vector!)
```
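
"Similar" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below is a minimal illustration in plain Python. It reuses the first three values of the two illustrative vectors above and adds a third, made-up vector (`unrelated`) for contrast; none of these are real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for the 1536-dimensional vectors above
iphones = [0.12, -0.45, 0.78]
smartphones = [0.11, -0.44, 0.77]
unrelated = [-0.80, 0.30, 0.10]  # made-up vector for contrast

print(cosine_similarity(iphones, smartphones))  # close to 1
print(cosine_similarity(iphones, unrelated))    # much lower
```

Values near 1 mean the vectors point in almost the same direction; values near 0 or below mean the texts are unrelated under this measure.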

## Setup

Import required modules.

In [13]:
import sys
sys.path.insert(0, '../solutions')

from neo4j import GraphDatabase
from neo4j_graphrag.indexes import create_vector_index
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

from config import Neo4jConfig, get_embedder

## Sample Data

We use the same sample SEC 10-K text as the previous notebook.

In [14]:
SAMPLE_TEXT = """
Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. AppleCare offers a portfolio of fee-based service and support products. Cloud 
Services store and keep customers' content up-to-date across all devices. Digital Content 
operates various platforms for discovering, purchasing, streaming and downloading digital 
content and apps. Payment Services include Apple Card and Apple Pay.
""".strip()

DOCUMENT_PATH = "form10k-sample/apple-2023-10k.pdf"
print(f"Sample text length: {len(SAMPLE_TEXT)} characters")

Sample text length: 1079 characters


## Connect to Neo4j

In [15]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")

Connected to Neo4j successfully!


## Clear Existing Data

In [16]:
def clear_graph(driver):
    """Remove all Document and Chunk nodes."""
    with driver.session() as session:
        result = session.run("""
            MATCH (n) WHERE n:Document OR n:Chunk
            DETACH DELETE n
            RETURN count(n) as deleted
        """)
        count = result.single()["deleted"]
        print(f"Deleted {count} nodes")

clear_graph(driver)

Deleted 2 nodes


## Split Text with FixedSizeSplitter

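The `FixedSizeSplitter` from neo4j-graphrag splits text into fixed-size chunks by character count, with a configurable overlap between consecutive chunks. Because the split is by character count, a chunk may begin or end mid-word, as the output below shows.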

- **chunk_size**: Maximum characters per chunk
- **chunk_overlap**: Characters to overlap between chunks (preserves context)
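
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not the library's implementation, and the function name `split_fixed` is hypothetical; it only illustrates how each chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters,
    where each chunk repeats the last chunk_overlap characters
    of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):  # the last chunk reached the end of the text
            break
        start += step
    return chunks

# A synthetic 1079-character text (same length as SAMPLE_TEXT)
demo_text = "".join(chr(97 + i % 26) for i in range(1079))
demo_chunks = split_fixed(demo_text, chunk_size=400, chunk_overlap=50)
print([len(c) for c in demo_chunks])
```

With `chunk_size=400` and `chunk_overlap=50`, chunk starts fall at offsets 0, 350 and 700, so a 1079-character text yields three chunks, the last one shorter than the others.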

In [17]:
# Create splitter with small chunk size for demonstration
splitter = FixedSizeSplitter(chunk_size=400, chunk_overlap=50)

# Split the text (async method - Jupyter supports top-level await)
chunks = await splitter.run(text=SAMPLE_TEXT)

print(f"Split into {len(chunks.chunks)} chunks:\n")
for i, chunk in enumerate(chunks.chunks):
    print(f"Chunk {i}: {len(chunk.text)} chars")
    print(f"  {chunk.text}\n")

Split into 3 chunks:

Chunk 0: 400 chars
  Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line 

Chunk 1: 400 chars
  ts iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. 

Chunk 2: 379 chars
  nts and the Company's own advertising 
platforms. AppleCare offers a portfolio of fee-based service and support

## Initialize Embedder

Create an embedder using Microsoft Foundry. It uses the `text-embedding-ada-002` model, which produces 1536-dimensional vectors.

In [18]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

Embedder initialized: text-embedding-ada-002


## Generate Embeddings

Generate an embedding vector for each chunk. This calls the Azure OpenAI embedding API.

In [19]:
# Generate embeddings for each chunk
chunk_embeddings = []
for i, chunk in enumerate(chunks.chunks):
    embedding = embedder.embed_query(chunk.text)
    chunk_embeddings.append({
        "text": chunk.text,
        "index": i,
        "embedding": embedding
    })
    print(f"Chunk {i}: Generated {len(embedding)}-dimensional embedding")

print(f"\nFirst 5 values of chunk 0's embedding: {chunk_embeddings[0]['embedding'][:5]}")

Chunk 0: Generated 1536-dimensional embedding
Chunk 1: Generated 1536-dimensional embedding
Chunk 2: Generated 1536-dimensional embedding

First 5 values of chunk 0's embedding: [0.0033809475135058165, -0.003419476794078946, -0.008592037484049797, -0.023002002388238907, 0.000592789554502815]


## Store in Neo4j with Embeddings

Create Document and Chunk nodes, storing the embedding vector on each Chunk.

In [20]:
def store_chunks_with_embeddings(driver, doc_path: str, chunk_data: list[dict]):
    """Store Document and Chunk nodes with embeddings."""
    with driver.session() as session:
        # Create Document
        session.run("""
            CREATE (d:Document {path: $path})
        """, path=doc_path)
        print(f"Created Document: {doc_path}")
        
        # Create Chunks with embeddings
        for chunk in chunk_data:
            session.run("""
                MATCH (d:Document {path: $path})
                CREATE (c:Chunk {
                    text: $text,
                    index: $index,
                    embedding: $embedding
                })
                CREATE (c)-[:FROM_DOCUMENT]->(d)
            """, path=doc_path, text=chunk["text"], 
               index=chunk["index"], embedding=chunk["embedding"])
        print(f"Created {len(chunk_data)} Chunk nodes with embeddings")
        
        # Create NEXT_CHUNK relationships
        session.run("""
            MATCH (d:Document {path: $path})<-[:FROM_DOCUMENT]-(c:Chunk)
            WITH c ORDER BY c.index
            WITH collect(c) as chunks
            UNWIND range(0, size(chunks)-2) as i
            WITH chunks[i] as c1, chunks[i+1] as c2
            CREATE (c1)-[:NEXT_CHUNK]->(c2)
        """, path=doc_path)
        print("Created NEXT_CHUNK relationships")

store_chunks_with_embeddings(driver, DOCUMENT_PATH, chunk_embeddings)

Created Document: form10k-sample/apple-2023-10k.pdf
Created 3 Chunk nodes with embeddings
Created NEXT_CHUNK relationships
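
The function above sends one query per chunk, which is fine for three chunks but adds a network round trip for each one. A batched variant could pass all chunks as a single list parameter and expand it with `UNWIND`. The sketch below is one possible way to do that, not part of the notebook's solution code; `store_chunks_batched` and `BATCH_CHUNK_QUERY` are hypothetical names, and it assumes the Document node already exists.

```python
# Hypothetical batched alternative to the per-chunk loop above.
BATCH_CHUNK_QUERY = """
MATCH (d:Document {path: $path})
UNWIND $rows AS row
CREATE (c:Chunk {
    text: row.text,
    index: row.index,
    embedding: row.embedding
})
CREATE (c)-[:FROM_DOCUMENT]->(d)
"""

def store_chunks_batched(driver, doc_path, chunk_data):
    """Create all Chunk nodes for an existing Document in one query.

    chunk_data is a list of dicts with "text", "index" and "embedding"
    keys, the same shape as chunk_embeddings in this notebook.
    """
    with driver.session() as session:
        # consume() makes the driver send the query and discard the result
        session.run(BATCH_CHUNK_QUERY, path=doc_path, rows=chunk_data).consume()
    return len(chunk_data)
```

Usage would mirror the original call, for example `store_chunks_batched(driver, DOCUMENT_PATH, chunk_embeddings)` after the Document node has been created.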


## Create Vector Index

Create a vector index in Neo4j for efficient similarity search. The index compares embeddings using cosine similarity, and its `dimensions` setting must match the embedder's output size (1536 here).

In [21]:
INDEX_NAME = "chunkEmbeddings"

# Drop the index if it already exists; IF EXISTS makes this a no-op otherwise,
# so no try/except is needed and real errors (e.g. connection failures) surface
with driver.session() as session:
    session.run(f"DROP INDEX {INDEX_NAME} IF EXISTS")
print(f"Dropped existing index: {INDEX_NAME}")

# Create new vector index
create_vector_index(
    driver=driver,
    name=INDEX_NAME,
    label="Chunk",
    embedding_property="embedding",
    dimensions=1536,
    similarity_fn="cosine"
)
print(f"Created vector index: {INDEX_NAME}")

Dropped existing index: chunkEmbeddings
Created vector index: chunkEmbeddings
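
For reference, the index created above can also be expressed directly in Cypher. The helper below builds a `CREATE VECTOR INDEX` statement with the same settings. It is a sketch that assumes a Neo4j 5.x version supporting this syntax; the exact statement that `create_vector_index` issues may differ, and `vector_index_cypher` is a hypothetical name.

```python
def vector_index_cypher(name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement.

    Index, label and property names cannot be passed as query parameters,
    so they are interpolated here; only use trusted, hard-coded names.
    """
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )

statement = vector_index_cypher("chunkEmbeddings", "Chunk", "embedding", 1536)
print(statement)
```

The resulting string could be run with `session.run(statement)` in place of the helper call, which is occasionally handy when working in the Neo4j Browser rather than Python.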


## Vector Similarity Search

Now we can search for chunks that are semantically similar to a query. The search:
1. Converts the query to an embedding
2. Finds chunks with similar embedding vectors
3. Returns results ranked by similarity score
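
The three steps above can be sketched without a database as a brute-force scan over every stored vector. This only illustrates the ranking logic: a vector index uses an approximate nearest-neighbour structure rather than a full scan. The score rescaling `(1 + cosine) / 2` reflects how the Neo4j documentation describes cosine scores (normalized to the range 0 to 1), which is treated here as an assumption; `top_k_chunks` and the toy vectors are hypothetical.

```python
import heapq
import math

def unit(v):
    """Scale a vector to length 1 so that a dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def top_k_chunks(query_vec, chunk_vecs, top_k=3):
    """Return (score, chunk_index) pairs for the top_k best matches,
    highest score first."""
    q = unit(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        cos = sum(x * y for x, y in zip(q, unit(vec)))
        scored.append(((1 + cos) / 2, idx))  # rescale cosine from [-1, 1] to [0, 1]
    return heapq.nlargest(top_k, scored)

# Toy 2-dimensional "embeddings" for three chunks and one query
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

for score, idx in top_k_chunks(toy_query, toy_chunks, top_k=2):
    print(f"Chunk {idx}: score {score:.4f}")
```

Under this rescaling, even loosely related texts score well above 0.5, which may be why all three results for the query below land within a narrow band above 0.9.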

In [22]:
def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    # Generate query embedding
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)

# Test search
query = "What products does Apple make?"
print(f"Query: \"{query}\"\n")
print("=" * 60)

results = vector_search(driver, embedder, query)
for i, record in enumerate(results):
    print(f"\n[{i+1}] Score: {record['score']:.4f} (Chunk {record['idx']})")
    print(f"    {record['text']}")

Query: "What products does Apple make?"


[1] Score: 0.9340 (Chunk 0)
    Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line 

[2] Score: 0.9182 (Chunk 1)
    ts iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. 

[3] Score: 0.9140 (Chunk 2)
    nts and the Company's own advertising 
platforms. AppleCar

## Compare Different Queries

Try different queries to see how semantic search finds relevant content even with different wording.

In [23]:
queries = [
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=1)
    if results:
        record = results[0]
        print(f"Best match (score: {record['score']:.4f}):")
        print(f"  {record['text']}")


Query: "Tell me about iPhone and Mac computers"
--------------------------------------------------
Best match (score: 0.9302):
  ts iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. 

Query: "What services does the company offer?"
--------------------------------------------------
Best match (score: 0.9181):
  nts and the Company's own advertising 
platforms. AppleCare offers a portfolio of fee-based service and support products. Cloud 
Services store and keep customers' content up-to-date across all devices. Digital Content 
operates various platforms for discovering, purchasing, streaming and downloading digital 
content and apps. Payment Ser

## Summary

In this notebook, you learned:

1. **Embeddings** - Numerical vectors that capture semantic meaning
2. **FixedSizeSplitter** - Automatic text chunking with overlap
3. **Vector storage** - Storing embeddings as node properties
4. **Vector index** - Enabling efficient similarity search
5. **Semantic search** - Finding content by meaning, not just keywords

The chunks now have embeddings that enable semantic retrieval. In the next notebook, you'll learn to extract **entities** from these chunks to build a richer knowledge graph.

---

**Next:** [Entity Extraction Basics](01_03_entity_extraction.ipynb)

In [24]:
# Cleanup
driver.close()
print("Connection closed.")

Connection closed.
