# Embeddings and Vector Search

This notebook demonstrates how to generate embeddings for text chunks and perform vector similarity search using Neo4j. Embeddings are the foundation of semantic search: the ability to find content by meaning rather than exact keyword matches.

**Prerequisites:** Review [01 Data Loading](01_data_loading.ipynb) to understand the manufacturing graph structure. This notebook is self-contained and will create its own data.

**Learning Objectives:**
- Understand what embeddings are and why they matter for GraphRAG
- Use `FixedSizeSplitter` to automatically chunk requirement text
- Generate embeddings using AWS Bedrock (Amazon Titan)
- Create a vector index in Neo4j
- Perform similarity search to find relevant requirement chunks

---

## What are Embeddings?

Embeddings are numerical representations (vectors) of text that capture semantic meaning. The key insight is that **similar texts produce similar vectors**, enabling semantic search.

```
"thermal management system cooling" → [0.12, -0.45, 0.78, ...] (1024 dimensions)
"the cooling system must provide heat transfer" → [0.11, -0.44, 0.77, ...] (similar vector!)
```

This is powerful because:
- A search for "battery cooling requirements" can find content about "thermal management system" even though those exact words don't appear in it
- The embedding model understands that "cooling" and "thermal management" are semantically related
- You don't need to anticipate every possible way an engineer might phrase their query

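**How similarity is measured:** We use **cosine similarity** to compare vectors. Raw cosine similarity ranges from -1.0 to 1.0: a value of 1.0 means the vectors point in the same direction (very similar), 0.0 means they are perpendicular (unrelated), and negative values mean they point in opposing directions. Neo4j's vector index reports a normalized score between 0 and 1 rather than the raw cosine, so the scores printed later in this notebook are not raw cosine values. What counts as a strong match depends on the embedding model and the data, so treat a threshold such as 0.8 as a starting point to tune rather than a fixed rule.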
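
To make the measure concrete, here is a minimal sketch of raw cosine similarity in plain Python. The three-dimensional vectors are made-up toy values for illustration, not real Titan output, and the function name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, 3 dimensions instead of 1024)
cooling = [0.12, -0.45, 0.78]
thermal = [0.11, -0.44, 0.77]    # points in almost the same direction
unrelated = [0.45, 0.12, 0.0]    # perpendicular to `cooling`

print(cosine_similarity(cooling, thermal))    # close to 1.0
print(cosine_similarity(cooling, unrelated))  # 0.0
```

Only the direction of the vectors matters here: scaling either vector by a positive constant leaves the result unchanged.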

## Install Dependencies

First, install the required packages. This only needs to be run once per session.

In [None]:
# Install neo4j-graphrag with Bedrock support
%pip install "neo4j-graphrag[bedrock] @ git+https://github.com/neo4j-partners/neo4j-graphrag-python.git@bedrock-embeddings" python-dotenv pydantic-settings nest-asyncio -q

## Setup

Import required modules.

In [None]:
from neo4j_graphrag.indexes import create_vector_index
from data_utils import Neo4jConnection, DataLoader, split_text, get_embedder

## Sample Data

Load requirement description text from `manufacturing_data.txt`. This contains consolidated engineering requirement descriptions for the HVB_3900 high-voltage battery component.

In [None]:
# Load requirement text from file
loader = DataLoader("manufacturing_data.txt")
SAMPLE_TEXT = loader.text

metadata = loader.get_metadata()
print(f"Loaded from: {metadata['name']}")
print(f"Sample text length: {metadata['size']} characters")

## Connect to Neo4j

Create a connection using the `Neo4jConnection` utility class.

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver

## Clear Existing Data

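Remove any existing nodes from previous runs. This deletes data from the connected database, so run it only against a database you can safely wipe.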

In [None]:
neo4j.clear_graph()

## Split Requirement Text with FixedSizeSplitter

The `FixedSizeSplitter` from neo4j-graphrag automatically splits text into chunks of a specified size with overlap.

- **chunk_size**: Maximum characters per chunk
- **chunk_overlap**: Characters to overlap between chunks (preserves context at boundaries)
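
To show how the two parameters interact, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration of the idea only, not the `FixedSizeSplitter` implementation, and the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

demo = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, which is why the last character of one chunk reappears as the first character of the next in the demo.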

In [None]:
# Split requirement text using the utility function (smaller chunks for demo)
chunks_text = split_text(SAMPLE_TEXT, chunk_size=400, chunk_overlap=50)

print(f"Split into {len(chunks_text)} chunks:\n")
for i, chunk in enumerate(chunks_text):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk}\n")

## Initialize Embedder

Create an embedder using AWS Bedrock. The embedding model is configured in `CONFIG.txt` via the `EMBEDDING_MODEL_ID` setting.

We're using **Amazon Titan Text Embeddings V2**, which produces 1024-dimensional vectors. This is important because:
- The vector index dimensions must match the embedder output
- All chunks and queries must use the same embedding model for meaningful comparisons

In [None]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model_id}")

## Generate Embeddings

Generate an embedding vector for each chunk. This calls the AWS Bedrock embedding API.

Each call to `embed_query()`:
1. Sends the text to Amazon Titan via AWS Bedrock
2. Returns a list of 1024 floating-point numbers that encode the semantic meaning of the text

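> **Note:** Embedding generation has a cost (typically fractions of a cent per call), but each chunk only needs to be embedded once. Once stored, chunk embeddings can be searched repeatedly without re-embedding the chunks. Each new search query still requires one embedding call to convert the query text into a vector.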
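
If the same text is likely to be embedded more than once (for example, repeated queries while experimenting), one option is a small in-memory cache around the embedding call. The sketch below is an assumption-laden illustration: `CachedEmbedder` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real API call so the example runs offline.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function and reuse results for repeated texts."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # any callable: str -> list of floats
        self._cache = {}

    def embed_query(self, text):
        # Hash the text so long chunks do not become long dictionary keys
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._embed_fn(text)
        return list(self._cache[key])  # copy so callers cannot mutate the cache

# Stand-in for a real embedding call (hypothetical; records each invocation)
api_calls = []

def fake_embed(text):
    api_calls.append(text)
    return [float(len(text)), 0.0, 1.0]

cached = CachedEmbedder(fake_embed)
first = cached.embed_query("coolant flow rate")
second = cached.embed_query("coolant flow rate")
print(len(api_calls))  # 1
```

In this notebook the wrapper could be constructed as `CachedEmbedder(embedder.embed_query)`. The cache lives only for the current session, so it does not replace storing chunk embeddings in Neo4j.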

In [None]:
# Generate embeddings for each chunk
chunk_embeddings = []
for i, text in enumerate(chunks_text):
    embedding = embedder.embed_query(text)
    chunk_embeddings.append({
        "text": text,
        "index": i,
        "embedding": embedding
    })
    print(f"Chunk {i}: Generated {len(embedding)}-dimensional embedding")

print(f"\nFirst 5 values of chunk 0's embedding: {chunk_embeddings[0]['embedding'][:5]}")

## Store in Neo4j with Embeddings

Create a Requirement node and Chunk nodes, storing the embedding vector on each Chunk as a property. The Requirement is linked to its Chunks via `HAS_CHUNK` relationships, and Chunks are linked to each other via `NEXT_CHUNK`.

Neo4j can store vectors (lists of floats) as node properties. This is efficient because:
- The embedding is stored directly with the chunk text it represents
- No need for a separate vector database
- Graph traversals can access both text and embeddings seamlessly

In [None]:
def store_requirement_with_chunks(driver, req_name, chunk_data):
    """Store a Requirement node and Chunk nodes with embeddings."""
    with driver.session() as session:
        # Create Requirement
        result = session.run("""
            CREATE (r:Requirement {requirement_id: '1_1', name: $name,
                description: 'Battery Cell and Module Design'})
            RETURN elementId(r) as req_id
        """, name=req_name)
        req_id = result.single()["req_id"]
        print(f"Created Requirement: {req_name}")

        # Create Chunks with embeddings and HAS_CHUNK relationships
        for chunk in chunk_data:
            session.run("""
                MATCH (r:Requirement) WHERE elementId(r) = $req_id
                CREATE (c:Chunk {
                    text: $text,
                    index: $index,
                    embedding: $embedding
                })
                CREATE (r)-[:HAS_CHUNK]->(c)
            """, req_id=req_id, text=chunk["text"],
               index=chunk["index"], embedding=chunk["embedding"])
        print(f"Created {len(chunk_data)} Chunk nodes with embeddings")

        # Create NEXT_CHUNK relationships
        session.run("""
            MATCH (r:Requirement) WHERE elementId(r) = $req_id
            MATCH (r)-[:HAS_CHUNK]->(c:Chunk)
            WITH c ORDER BY c.index
            WITH collect(c) as chunks
            UNWIND range(0, size(chunks)-2) as i
            WITH chunks[i] as c1, chunks[i+1] as c2
            CREATE (c1)-[:NEXT_CHUNK]->(c2)
        """, req_id=req_id)
        print("Created NEXT_CHUNK relationships")

store_requirement_with_chunks(driver, "Battery Cell and Module Design", chunk_embeddings)
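
The function above issues one query per chunk, which is fine for a handful of chunks. For larger inputs, a common variation is to send chunks in batches with Cypher's `UNWIND`. The sketch below is one possible way to do that, not the notebook's method: `store_chunks_batched`, `BATCH_CHUNK_QUERY`, and the `batch_size` default are our own assumptions, and `session` is expected to be a Neo4j session (or any object with a compatible `run` method).

```python
# Cypher that creates one Chunk per row and links it to the Requirement
BATCH_CHUNK_QUERY = """
MATCH (r:Requirement) WHERE elementId(r) = $req_id
UNWIND $rows AS row
CREATE (c:Chunk {text: row.text, index: row.index, embedding: row.embedding})
CREATE (r)-[:HAS_CHUNK]->(c)
"""

def store_chunks_batched(session, req_id, chunk_data, batch_size=100):
    """Write chunks in groups of batch_size; return the number of queries sent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = 0
    for start in range(0, len(chunk_data), batch_size):
        rows = chunk_data[start:start + batch_size]
        # Each row is a dict with "text", "index", and "embedding" keys
        session.run(BATCH_CHUNK_QUERY, req_id=req_id, rows=rows)
        batches += 1
    return batches
```

Batching reduces the number of round trips to the database. A suitable `batch_size` depends on the embedding dimensions and available memory, so the default here is only a placeholder.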

## Create Vector Index

Create a vector index in Neo4j for efficient similarity search. The index uses cosine similarity to compare embeddings.

> **Note:** The vector dimensions must match your embedding model. Amazon Titan Text Embeddings V2 produces 1024-dimensional vectors. If you change `EMBEDDING_MODEL_ID` in `CONFIG.txt`, update the dimensions accordingly.
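
One way to avoid a mismatch is to derive the dimension count from the embeddings you already generated instead of hardcoding it. The helper below is a minimal sketch; `infer_dimensions` is a hypothetical name, not part of neo4j-graphrag.

```python
def infer_dimensions(embeddings):
    """Return the shared length of all embedding vectors, or raise if they differ."""
    if not embeddings:
        raise ValueError("no embeddings provided")
    lengths = {len(vector) for vector in embeddings}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(lengths)}")
    return lengths.pop()

# Toy example with 3-dimensional vectors
sample = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(infer_dimensions(sample))  # 3
```

In this notebook, `infer_dimensions([c["embedding"] for c in chunk_embeddings])` would yield the value to pass as `dimensions` when creating the index.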

In [None]:
INDEX_NAME = "requirement_embeddings"

# Drop the index if it already exists; IF EXISTS makes this a no-op otherwise,
# so real errors (e.g. connection failures) are not silently swallowed
with driver.session() as session:
    session.run(f"DROP INDEX {INDEX_NAME} IF EXISTS")
print(f"Ensured no existing index named: {INDEX_NAME}")

# Create new vector index (1024 dimensions for Titan V2)
create_vector_index(
    driver=driver,
    name=INDEX_NAME,
    label="Chunk",
    embedding_property="embedding",
    dimensions=1024,
    similarity_fn="cosine"
)
print(f"Created vector index: {INDEX_NAME}")

## Vector Similarity Search

Now we can search for chunks that are semantically similar to a query. The search process:

1. **Embed the query** - Convert the search query to a vector using the same embedding model
2. **Find similar vectors** - Use the vector index to find chunks with similar embeddings
3. **Rank by score** - Return results ordered by cosine similarity (higher = more similar)

The Neo4j procedure `db.index.vector.queryNodes()` performs an efficient approximate nearest neighbor (ANN) search, which scales well even with millions of chunks.
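
For intuition, the three steps above can be written as an exact (brute-force) search in plain Python. This is only a sketch of what the index approximates: it compares the query against every vector, which the ANN index avoids. The names `cosine` and `exact_top_k` and the two-dimensional vectors are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Raw cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query_vector, vectors, top_k=3):
    """Score every vector against the query and return the top_k
    (index, score) pairs, highest score first."""
    scored = [(idx, cosine(query_vector, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "chunk embeddings" (made-up 2-dimensional values)
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]

for idx, score in exact_top_k(query_vector, chunk_vectors, top_k=2):
    print(f"chunk {idx}: {score:.3f}")
```

The brute-force version does work proportional to the number of stored vectors for every query. An approximate index trades a small chance of missing the true nearest neighbor for much lower query cost on large collections.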

In [None]:
def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    # Generate query embedding
    query_embedding = embedder.embed_query(query)

    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)

        return list(result)

# Test search
query = "What are the thermal management requirements?"
print(f"Query: \"{query}\"\n")
print("=" * 60)

results = vector_search(driver, embedder, query)
for i, record in enumerate(results):
    print(f"\n[{i+1}] Score: {record['score']:.4f} (Chunk {record['idx']})")
    print(f"    {record['text']}")

## Compare Different Queries

Try different queries to see how semantic search finds relevant content even with different wording.

For each query in the cell below, compare the wording of the question with the wording of the best-matching chunk:
- The best match may share few exact keywords with the query and still cover the same topic
- A low best score suggests the text does not cover that topic well
- The embedding model accounts for synonyms, related concepts, and engineering context

In [None]:
queries = [
    "What are the energy density specifications for battery cells?",
    "How is the battery pack protected against water?",
    "What safety monitoring is required for the BMS?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=1)
    if results:
        record = results[0]
        print(f"Best match (score: {record['score']:.4f}):")
        print(f"  {record['text']}")

## Summary

In this notebook, you learned how to enable semantic search in Neo4j:

1. **Embeddings** - Numerical vectors (1024 floats for Titan V2) that capture semantic meaning. Similar texts have similar vectors.

2. **FixedSizeSplitter** - Automatic text chunking with overlap to ensure context isn't lost at chunk boundaries.

3. **Vector storage** - Storing embeddings as node properties in Neo4j, keeping text and vectors together with the requirement they belong to.

4. **Vector index** - A specialized index (`requirement_embeddings`) that enables fast approximate nearest neighbor search.

5. **Semantic search** - Finding content by meaning rather than keywords, using `db.index.vector.queryNodes()`.

At this point, you have a working semantic search system over manufacturing requirement descriptions. In the next notebook, you'll learn to use the **VectorRetriever** class, which abstracts all of this into a simple API.

---

**Next:** [Vector Retriever](03_vector_retriever.ipynb)

In [None]:
# Cleanup
neo4j.close()