# Data Loading and Embeddings

This notebook covers the complete data preparation pipeline for GraphRAG applications: loading text data into Neo4j as a Document-Chunk graph structure, then enriching it with embeddings for semantic search.

**Learning Objectives:**
- Understand the Document → Chunk graph structure
- Connect to Neo4j and create Document and Chunk nodes
- Link chunks with `FROM_DOCUMENT` and `NEXT_CHUNK` relationships
- Understand what embeddings are and why they matter
- Generate embeddings using Microsoft Foundry
- Create a vector index and perform similarity search

---

## Why Documents and Chunks?

When building GraphRAG applications, we split documents into smaller pieces called **chunks** because:

1. **Context windows are limited** - LLMs can only process a certain amount of text at once
2. **Retrieval precision** - Smaller chunks allow more precise matching to user queries
3. **Cost efficiency** - Processing smaller chunks is faster and cheaper

The graph structure we'll build:
```
(:Document) <-[:FROM_DOCUMENT]- (:Chunk) -[:NEXT_CHUNK]-> (:Chunk)
```

We'll use `FixedSizeSplitter` from the [neo4j-graphrag-python](https://neo4j.com/docs/neo4j-graphrag-python/current/) library to split text into chunks:

- `chunk_size`: Maximum characters per chunk
- `chunk_overlap`: Characters shared between consecutive chunks for context continuity
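
To make the two parameters concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. It is an illustration only, not the library's implementation: `split_fixed_size` is a hypothetical helper, and `FixedSizeSplitter` may treat chunk boundaries differently.

```python
def split_fixed_size(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in split_fixed_size(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

With `chunk_size=10` and `chunk_overlap=3`, each chunk begins with the last three characters of the previous one, which is the "context continuity" the overlap is meant to provide.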

## Setup

Import required modules and configure the environment.

In [None]:
from neo4j_graphrag.indexes import create_vector_index
from data_utils import Neo4jConnection, DataLoader, split_text, get_embedder

## Sample Data

We'll load text from `company_data.txt` representing content from an SEC 10-K filing.

> **Note:** In production, you would use `pypdf` or similar libraries to extract text from PDF files. We use a pre-defined text file here for fast, reproducible results.

In [None]:
# Load text from file using DataLoader
loader = DataLoader("company_data.txt")
SAMPLE_TEXT = loader.text

# Document metadata stored on the Document node (the sample text stands in for this PDF)
DOCUMENT_PATH = "form10k-sample/apple-2023-10k.pdf"
DOCUMENT_PAGE = 1

metadata = loader.get_metadata()
print(f"Loaded from: {metadata['name']}")
print(f"Sample text length: {metadata['size']} characters")
print(f"\n{SAMPLE_TEXT}")

## Connect to Neo4j

Create a connection to your Neo4j database using the `Neo4jConnection` utility class.

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver

## Clear Existing Data (Optional)

For a clean start, remove any existing Document and Chunk nodes from previous runs.

In [None]:
neo4j.clear_graph()

---

# Part 1: Building the Document-Chunk Graph

First, we'll create the basic graph structure with Document and Chunk nodes.

## Create Document Node

Create a Document node to represent the source file. This node stores metadata about where the content came from.

In [None]:
def create_document(driver, path: str, page: int) -> str:
    """Create a Document node and return its element ID."""
    with driver.session() as session:
        result = session.run("""
            CREATE (d:Document {path: $path, page: $page})
            RETURN elementId(d) as doc_id
        """, path=path, page=page)
        return result.single()["doc_id"]

doc_id = create_document(driver, DOCUMENT_PATH, DOCUMENT_PAGE)
print(f"Created Document node with ID: {doc_id}")

## Split Text into Chunks

Use the `split_text` helper, which applies `FixedSizeSplitter` from neo4j-graphrag-python, to split the text into chunks with configurable size and overlap.

In [None]:
# Split text using the utility function
chunks = split_text(SAMPLE_TEXT, chunk_size=400, chunk_overlap=50)

print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk[:100]}...\n")

## Create Chunk Nodes

Create Chunk nodes for each piece of text and link them to the Document with `FROM_DOCUMENT` relationships.

In [None]:
def create_chunks(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Create Chunk nodes linked to a Document. Returns chunk element IDs."""
    chunk_ids = []
    with driver.session() as session:
        for index, text in enumerate(chunks):
            result = session.run("""
                MATCH (d:Document) WHERE elementId(d) = $doc_id
                CREATE (c:Chunk {text: $text, index: $index})
                CREATE (c)-[:FROM_DOCUMENT]->(d)
                RETURN elementId(c) as chunk_id
            """, doc_id=doc_id, text=text, index=index)
            chunk_id = result.single()["chunk_id"]
            chunk_ids.append(chunk_id)
            print(f"Created Chunk {index}")
    return chunk_ids

chunk_ids = create_chunks(driver, doc_id, chunks)
print(f"\nCreated {len(chunk_ids)} chunks")
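
`create_chunks` above issues one query per chunk. For larger documents, a common alternative is to pass all chunks as a single list parameter and expand it with `UNWIND`. The sketch below assumes the same `driver` and `doc_id` as above; `build_chunk_rows` and `create_chunks_batched` are hypothetical names and are not part of `data_utils`.

```python
# One query for all chunks: UNWIND expands the $rows list into one row per chunk.
CREATE_CHUNKS_QUERY = """
MATCH (d:Document) WHERE elementId(d) = $doc_id
UNWIND $rows AS row
CREATE (c:Chunk {text: row.text, index: row.index})
CREATE (c)-[:FROM_DOCUMENT]->(d)
RETURN row.index AS chunk_index, elementId(c) AS chunk_id
ORDER BY chunk_index
"""


def build_chunk_rows(chunks: list[str]) -> list[dict]:
    """Turn chunk texts into the parameter rows consumed by UNWIND."""
    return [{"index": i, "text": text} for i, text in enumerate(chunks)]


def create_chunks_batched(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Hypothetical batched variant of create_chunks (single round trip)."""
    with driver.session() as session:
        result = session.run(
            CREATE_CHUNKS_QUERY, doc_id=doc_id, rows=build_chunk_rows(chunks)
        )
        # Records come back ordered by chunk_index, matching the input order
        return [record["chunk_id"] for record in result]
```

The per-chunk loop is easier to follow in a tutorial; the batched form mainly pays off once a document produces hundreds of chunks.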

## Link Chunks with NEXT_CHUNK

Create `NEXT_CHUNK` relationships between sequential chunks. This preserves the original document order.

In [None]:
def link_chunks(driver, chunk_ids: list[str]):
    """Create NEXT_CHUNK relationships between sequential chunks."""
    with driver.session() as session:
        for i in range(len(chunk_ids) - 1):
            session.run("""
                MATCH (c1:Chunk) WHERE elementId(c1) = $id1
                MATCH (c2:Chunk) WHERE elementId(c2) = $id2
                CREATE (c1)-[:NEXT_CHUNK]->(c2)
            """, id1=chunk_ids[i], id2=chunk_ids[i+1])
        print(f"Created {max(len(chunk_ids) - 1, 0)} NEXT_CHUNK relationships")

link_chunks(driver, chunk_ids)

## Verify the Graph Structure

Query the graph to see what we created.

In [None]:
def show_graph_structure(driver):
    """Display the Document-Chunk graph structure."""
    with driver.session() as session:
        # Count chunks per document
        result = session.run("""
            MATCH (d:Document)
            OPTIONAL MATCH (d)<-[:FROM_DOCUMENT]-(c:Chunk)
            RETURN d.path as document, d.page as page, count(c) as chunks
        """)
        print("=== Graph Structure ===")
        for record in result:
            print(f"Document: {record['document']} (page {record['page']})")
            print(f"  Chunks: {record['chunks']}")
        
        # Show chunk chain
        result = session.run("""
            MATCH (c:Chunk)
            OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(next:Chunk)
            RETURN c.index as idx, 
                   substring(c.text, 0, 50) as text,
                   next.index as next_idx
            ORDER BY c.index
        """)
        print("\n=== Chunk Chain ===")
        for record in result:
            next_str = f" -> Chunk {record['next_idx']}" if record['next_idx'] is not None else " (end)"
            print(f"Chunk {record['idx']}: \"{record['text']}...\"{next_str}")

show_graph_structure(driver)

---

# Part 2: Adding Embeddings for Semantic Search

Now that we have our Document-Chunk graph, we'll add embeddings to enable semantic search. Embeddings are numerical representations (vectors) of text that capture semantic meaning.

```
"Apple makes iPhones" → [0.12, -0.45, 0.78, ...] (1536 dimensions)
"The company produces smartphones" → [0.11, -0.44, 0.77, ...] (similar vector!)
```

Similar texts have similar embeddings, enabling **semantic search** - finding content by meaning rather than exact keywords.
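
As a rough illustration of what "similar vectors" means, the sketch below computes cosine similarity for toy 3-dimensional vectors. The numbers are made up for the example and are not real model output; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration (not produced by an embedding model)
iphones = [0.12, -0.45, 0.78]
smartphones = [0.11, -0.44, 0.77]
weather = [-0.70, 0.60, 0.10]

print(cosine_similarity(iphones, smartphones))  # close to 1: similar direction
print(cosine_similarity(iphones, weather))      # negative: dissimilar direction
```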

## Initialize Embedder

Create an embedder using Microsoft Foundry. This uses the `text-embedding-ada-002` model, which produces 1536-dimensional vectors.

In [None]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

## Generate and Store Embeddings

Generate an embedding vector for each chunk and update the nodes in Neo4j.

In [None]:
def add_embeddings_to_chunks(driver, embedder, chunk_ids: list[str]):
    """Generate embeddings for chunks and store them in Neo4j."""
    with driver.session() as session:
        for i, chunk_id in enumerate(chunk_ids):
            # Get chunk text
            result = session.run("""
                MATCH (c:Chunk) WHERE elementId(c) = $chunk_id
                RETURN c.text as text
            """, chunk_id=chunk_id)
            text = result.single()["text"]
            
            # Generate embedding
            embedding = embedder.embed_query(text)
            
            # Store embedding
            session.run("""
                MATCH (c:Chunk) WHERE elementId(c) = $chunk_id
                SET c.embedding = $embedding
            """, chunk_id=chunk_id, embedding=embedding)
            
            print(f"Chunk {i}: Generated {len(embedding)}-dimensional embedding")
    
    print(f"\nAdded embeddings to {len(chunk_ids)} chunks")

add_embeddings_to_chunks(driver, embedder, chunk_ids)

## Create Vector Index

Create a vector index in Neo4j for efficient similarity search. The index uses cosine similarity to compare embeddings.

In [None]:
INDEX_NAME = "chunkEmbeddings"

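# Drop the index if it already exists (IF EXISTS makes this a no-op otherwise)
try:
    with driver.session() as session:
        session.run(f"DROP INDEX {INDEX_NAME} IF EXISTS").consume()
    print(f"Dropped index if present: {INDEX_NAME}")
except Exception as exc:
    print(f"Could not drop index {INDEX_NAME}: {exc}")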

# Create new vector index
create_vector_index(
    driver=driver,
    name=INDEX_NAME,
    label="Chunk",
    embedding_property="embedding",
    dimensions=1536,
    similarity_fn="cosine"
)
print(f"Created vector index: {INDEX_NAME}")
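
The `dimensions` value should match the length of the stored vectors. One way to guard against a mismatch, for example after switching embedding models, is a small check before indexing. `check_embedding_dimensions` is a hypothetical helper, shown here on toy vectors rather than real embeddings.

```python
def check_embedding_dimensions(embeddings: list[list[float]], expected: int) -> None:
    """Raise ValueError if any embedding's length differs from `expected`."""
    for position, vector in enumerate(embeddings):
        if len(vector) != expected:
            raise ValueError(
                f"Embedding {position} has {len(vector)} dimensions, expected {expected}"
            )


# Toy vectors stand in for real embeddings; both have 3 dimensions, so this passes
check_embedding_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected=3)
```

In this notebook, the same check could be applied to a single vector returned by `embedder.embed_query(...)` with `expected=1536`.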

## Vector Similarity Search

Now we can search for chunks that are semantically similar to a query. The search:
1. Converts the query to an embedding
2. Finds chunks with similar embedding vectors
3. Returns results ranked by similarity score

In [None]:
def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    # Generate query embedding
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)

# Test search
query = "What products does Apple make?"
print(f"Query: \"{query}\"\n")
print("=" * 60)

results = vector_search(driver, embedder, query)
for i, record in enumerate(results):
    print(f"\n[{i+1}] Score: {record['score']:.4f} (Chunk {record['idx']})")
    print(f"    {record['text'][:200]}...")
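
Conceptually, the index answers roughly the same question as scoring every chunk against the query and keeping the best `top_k`; it just does so far more efficiently. The sketch below shows that brute-force version on toy vectors. `brute_force_search` and the sample data are assumptions made for illustration and do not reflect how Neo4j implements the index or scales its scores.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def brute_force_search(query_embedding, chunk_embeddings, top_k=3):
    """Score every chunk against the query and keep the top_k highest.

    chunk_embeddings maps a chunk index to its embedding vector.
    Returns (score, chunk_index) pairs, best match first.
    """
    scored = (
        (cosine(query_embedding, embedding), idx)
        for idx, embedding in chunk_embeddings.items()
    )
    return heapq.nlargest(top_k, scored)


# Toy data invented for illustration
toy_chunks = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.0],
    2: [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

for score, idx in brute_force_search(toy_query, toy_chunks, top_k=2):
    print(f"Chunk {idx}: {score:.4f}")
```

Scanning every chunk is fine for a handful of vectors, but the cost grows with the number of chunks, which is why the notebook relies on a vector index instead.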

## Compare Different Queries

Try different queries to see how semantic search finds relevant content even with different wording.

In [None]:
queries = [
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=1)
    if results:
        record = results[0]
        print(f"Best match (score: {record['score']:.4f}):")
        print(f"  {record['text'][:150]}...")

## Summary

In this notebook, you learned the complete data preparation pipeline for GraphRAG:

**Part 1 - Graph Structure:**
1. **Document-Chunk structure** - Documents are split into chunks for efficient retrieval
2. **FROM_DOCUMENT relationship** - Links chunks back to their source document
3. **NEXT_CHUNK relationship** - Preserves the sequential order of chunks

**Part 2 - Embeddings:**
4. **Embeddings** - Numerical vectors that capture semantic meaning
5. **Vector storage** - Storing embeddings as node properties
6. **Vector index** - Enabling efficient similarity search
7. **Semantic search** - Finding content by meaning, not just keywords

Your knowledge graph is now ready for retrieval! In the next notebook, you'll learn to use **retrievers** to build GraphRAG pipelines that combine vector search with LLM-generated answers.

---

**Next:** [GraphRAG Retrievers](02_graphrag_retrievers.ipynb)

In [None]:
# Cleanup
neo4j.close()