# Embeddings and Vector Search

This notebook demonstrates how to generate embeddings for text chunks and perform vector similarity search using Neo4j.

**Prerequisites:** Complete [01_01 Data Loading](01_01_data_loading.ipynb) first.

**Learning Objectives:**
- Understand what embeddings are and why they matter for GraphRAG
- Use `FixedSizeSplitter` to automatically chunk text
- Generate embeddings using Microsoft Foundry
- Create a vector index in Neo4j
- Perform similarity search to find relevant chunks

---

## What are Embeddings?

Embeddings are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meanings map to nearby vectors, which enables **semantic search**: finding content by meaning rather than by exact keywords.

```
"Apple makes iPhones" → [0.12, -0.45, 0.78, ...] (1536 dimensions)
"The company produces smartphones" → [0.11, -0.44, 0.77, ...] (similar vector!)
```
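
To make "similar vector" concrete, the sketch below computes cosine similarity between short, made-up vectors. The three-element vectors and the `weather` example are hypothetical stand-ins for illustration; real embeddings from the model have 1536 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical values, not real model output).
iphones = [0.12, -0.45, 0.78]
smartphones = [0.11, -0.44, 0.77]
weather = [-0.70, 0.60, 0.05]

print(cosine_similarity(iphones, smartphones))  # close to 1.0: similar meaning
print(cosine_similarity(iphones, weather))      # much lower: unrelated meaning
```

A value near 1.0 means the vectors point in almost the same direction, which is how semantically related texts are expected to behave.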

## Setup

Import required modules.

In [None]:
from neo4j_graphrag.indexes import create_vector_index
from data_utils import Neo4jConnection, DataLoader, split_text, get_embedder

## Sample Data

Load the sample SEC 10-K text from file using `DataLoader`.

In [None]:
# Load text from file
loader = DataLoader("company_data.txt")
SAMPLE_TEXT = loader.text
# Path recorded on the Document node; the text itself comes from company_data.txt
DOCUMENT_PATH = "form10k-sample/apple-2023-10k.pdf"

metadata = loader.get_metadata()
print(f"Loaded from: {metadata['name']}")
print(f"Sample text length: {metadata['size']} characters")

## Connect to Neo4j

Create a connection using the `Neo4jConnection` utility class.

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver

## Clear Existing Data

Remove any existing Document and Chunk nodes from previous runs.

In [None]:
neo4j.clear_graph()

## Split Text with FixedSizeSplitter

The `FixedSizeSplitter` from neo4j-graphrag automatically splits text into chunks of a specified size with overlap.

- **chunk_size**: Maximum characters per chunk
- **chunk_overlap**: Characters to overlap between chunks (preserves context)
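
The two parameters can be illustrated with a plain-Python sliding window. This is only a sketch of the idea, not the library's implementation: `fixed_size_chunks` is a hypothetical helper, and `FixedSizeSplitter` may treat boundaries differently.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, which is how overlap carries context across chunk boundaries.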

In [None]:
# Split text using the utility function (smaller chunks for demo)
chunks_text = split_text(SAMPLE_TEXT, chunk_size=400, chunk_overlap=50)

print(f"Split into {len(chunks_text)} chunks:\n")
for i, chunk in enumerate(chunks_text):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk}\n")

## Initialize Embedder

Create an embedder using Microsoft Foundry. It uses the `text-embedding-ada-002` model, which produces 1536-dimensional vectors.

In [None]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

## Generate Embeddings

Generate an embedding vector for each chunk. This calls the Microsoft Foundry embedding API.

In [None]:
# Generate embeddings for each chunk
chunk_embeddings = []
for i, text in enumerate(chunks_text):
    embedding = embedder.embed_query(text)
    chunk_embeddings.append({
        "text": text,
        "index": i,
        "embedding": embedding
    })
    print(f"Chunk {i}: Generated {len(embedding)}-dimensional embedding")

print(f"\nFirst 5 values of chunk 0's embedding: {chunk_embeddings[0]['embedding'][:5]}")

## Store in Neo4j with Embeddings

Create Document and Chunk nodes, storing the embedding vector on each Chunk.
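
Before writing vectors to the graph, it can be worth checking that every embedding has the length the vector index will expect. The helper below is a hypothetical sketch: `find_bad_embeddings` and the `demo` list are illustrative names, with `demo` shaped like the `chunk_embeddings` list built earlier.

```python
EXPECTED_DIMENSIONS = 1536  # assumed to match the embedding model in use


def find_bad_embeddings(chunk_data: list[dict],
                        expected_dim: int = EXPECTED_DIMENSIONS) -> list[int]:
    """Return the chunk indexes whose embedding is missing or has the wrong length."""
    bad = []
    for chunk in chunk_data:
        embedding = chunk.get("embedding")
        if embedding is None or len(embedding) != expected_dim:
            bad.append(chunk["index"])
    return bad


# Stand-in data with the same keys as chunk_embeddings.
demo = [
    {"text": "ok", "index": 0, "embedding": [0.0] * 1536},
    {"text": "too short", "index": 1, "embedding": [0.0] * 10},
]
print(find_bad_embeddings(demo))  # [1]
```

In the notebook, calling `find_bad_embeddings(chunk_embeddings)` should return an empty list if every chunk was embedded by the same model.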

In [None]:
def store_chunks_with_embeddings(driver, doc_path: str, chunk_data: list[dict]):
    """Store Document and Chunk nodes with embeddings."""
    with driver.session() as session:
        # Create Document
        session.run("""
            CREATE (d:Document {path: $path})
        """, path=doc_path)
        print(f"Created Document: {doc_path}")
        
        # Create Chunks with embeddings
        for chunk in chunk_data:
            session.run("""
                MATCH (d:Document {path: $path})
                CREATE (c:Chunk {
                    text: $text,
                    index: $index,
                    embedding: $embedding
                })
                CREATE (c)-[:FROM_DOCUMENT]->(d)
            """, path=doc_path, text=chunk["text"], 
               index=chunk["index"], embedding=chunk["embedding"])
        print(f"Created {len(chunk_data)} Chunk nodes with embeddings")
        
        # Create NEXT_CHUNK relationships
        session.run("""
            MATCH (d:Document {path: $path})<-[:FROM_DOCUMENT]-(c:Chunk)
            WITH c ORDER BY c.index
            WITH collect(c) as chunks
            UNWIND range(0, size(chunks)-2) as i
            WITH chunks[i] as c1, chunks[i+1] as c2
            CREATE (c1)-[:NEXT_CHUNK]->(c2)
        """, path=doc_path)
        print("Created NEXT_CHUNK relationships")

store_chunks_with_embeddings(driver, DOCUMENT_PATH, chunk_embeddings)
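
The function above issues one query per chunk, which is fine for a small demo. For larger documents, one possible variation is to send chunks in batches with `UNWIND`. The sketch below assumes the Document node already exists; `batched`, `store_chunks_batched`, and `BATCH_CYPHER` are hypothetical names, not part of `data_utils` or neo4j-graphrag.

```python
from typing import Iterator

# Creates one Chunk per row and links it to an existing Document node.
BATCH_CYPHER = """
MATCH (d:Document {path: $path})
UNWIND $rows AS row
CREATE (c:Chunk {text: row.text, index: row.index, embedding: row.embedding})
CREATE (c)-[:FROM_DOCUMENT]->(d)
"""


def batched(rows: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of rows with at most batch_size items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def store_chunks_batched(driver, doc_path: str, chunk_data: list[dict],
                         batch_size: int = 100) -> None:
    """Write chunks in batches; assumes the Document node was created beforehand."""
    with driver.session() as session:
        for rows in batched(chunk_data, batch_size):
            session.run(BATCH_CYPHER, path=doc_path, rows=rows).consume()


# The batching itself can be checked without a database connection.
print([len(b) for b in batched([{"index": i} for i in range(5)], 2)])  # [2, 2, 1]
```

Fewer round trips usually matter more as the number of chunks grows; for the handful of chunks in this notebook, the per-chunk loop is equally reasonable.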

## Create Vector Index

Create a vector index in Neo4j for efficient similarity search. The index compares embeddings using cosine similarity, and its `dimensions` setting must match the length of the stored vectors (1536 here).
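
For orientation, the sketch below builds the kind of Cypher statement that defines such an index in recent Neo4j 5 releases. It is an assumption about the Cypher syntax for illustration, not necessarily the exact statement `create_vector_index` sends, and `vector_index_cypher` is a hypothetical helper.

```python
def vector_index_cypher(name: str, label: str, embedding_property: str,
                        dimensions: int, similarity_fn: str) -> str:
    """Build a CREATE VECTOR INDEX statement as a string.

    Identifiers are interpolated directly, so only pass trusted values.
    """
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{embedding_property}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity_fn}'"
        "}}"
    )


print(vector_index_cypher("chunkEmbeddings", "Chunk", "embedding", 1536, "cosine"))
```

Seeing the statement spelled out makes it clear that the index is tied to one label, one property, one vector length, and one similarity function.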

In [None]:
INDEX_NAME = "chunkEmbeddings"

# Drop the index from a previous run; IF EXISTS makes this a no-op when it is absent
with driver.session() as session:
    session.run(f"DROP INDEX {INDEX_NAME} IF EXISTS").consume()
print(f"Dropped index if it existed: {INDEX_NAME}")

# Create new vector index
create_vector_index(
    driver=driver,
    name=INDEX_NAME,
    label="Chunk",
    embedding_property="embedding",
    dimensions=1536,
    similarity_fn="cosine"
)
print(f"Created vector index: {INDEX_NAME}")

## Vector Similarity Search

Now we can search for chunks that are semantically similar to a query. The search:
1. Converts the query to an embedding
2. Finds chunks with similar embedding vectors
3. Returns results ranked by similarity score
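
Steps 2 and 3 can be mimicked in memory with a brute-force scan, which may help clarify what the index does at scale. This is a conceptual sketch only: `brute_force_search` and the two-dimensional vectors are hypothetical, the query embedding from step 1 is assumed to be given, and scores reported by the Neo4j index may be scaled differently, so only the ranking is comparable.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_search(query_embedding: list[float], chunks: list[dict],
                       top_k: int = 3) -> list[tuple[float, int]]:
    """Score every chunk against the query and return the top_k (score, index) pairs."""
    scored = [(cosine(query_embedding, c["embedding"]), c["index"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:top_k]


# Toy 2-dimensional data (hypothetical values for illustration).
toy_chunks = [
    {"index": 0, "embedding": [1.0, 0.0]},
    {"index": 1, "embedding": [0.0, 1.0]},
    {"index": 2, "embedding": [0.7, 0.7]},
]
toy_query = [0.9, 0.1]

for score, index in brute_force_search(toy_query, toy_chunks, top_k=2):
    print(f"Chunk {index}: {score:.4f}")
```

A scan like this compares the query with every chunk, so its cost grows linearly with the number of chunks; a vector index exists to avoid that full comparison.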

In [None]:
def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    # Generate query embedding
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)

# Test search
query = "What products does Apple make?"
print(f"Query: \"{query}\"\n")
print("=" * 60)

results = vector_search(driver, embedder, query)
for i, record in enumerate(results):
    print(f"\n[{i+1}] Score: {record['score']:.4f} (Chunk {record['idx']})")
    print(f"    {record['text']}")

## Compare Different Queries

Try different queries to see how semantic search finds relevant content even with different wording.

In [None]:
queries = [
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=1)
    if results:
        record = results[0]
        print(f"Best match (score: {record['score']:.4f}):")
        print(f"  {record['text']}")

## Summary

In this notebook, you learned:

1. **Embeddings** - Numerical vectors that capture semantic meaning
2. **FixedSizeSplitter** - Automatic text chunking with overlap
3. **Vector storage** - Storing embeddings as node properties
4. **Vector index** - Enabling efficient similarity search
5. **Semantic search** - Finding content by meaning, not just keywords

The chunks now have embeddings that enable semantic retrieval. In the next notebook, you'll learn to extract **entities** from these chunks to build a richer knowledge graph.

---

**Next:** [Entity Extraction Basics](01_03_entity_extraction.ipynb)

In [None]:
# Cleanup
neo4j.close()