# Data Loading Fundamentals

This notebook introduces the core concepts of loading text data into Neo4j for GraphRAG applications.

**Learning Objectives:**
- Understand the Document → Chunk graph structure
- Connect to Neo4j from a Jupyter notebook
- Create Document and Chunk nodes
- Link chunks to documents and to each other

---

## Why Documents and Chunks?

When building RAG (Retrieval-Augmented Generation) applications, we split documents into smaller pieces called **chunks**. This is necessary because:

1. **Context windows are limited** - LLMs can only process a certain amount of text at once
2. **Retrieval precision** - Smaller chunks allow more precise matching to user queries
3. **Cost efficiency** - Sending only the relevant chunks to the LLM, rather than the whole document, reduces token usage and latency

The graph structure we'll build:
```
(:Document) <-[:FROM_DOCUMENT]- (:Chunk) -[:NEXT_CHUNK]-> (:Chunk)
```
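
To make the pattern concrete before touching the database, here is a minimal in-memory sketch of the relationships this structure implies. The `build_edges` helper and the `chunk_0`, `chunk_1`, ... names are hypothetical and exist only for illustration; the notebook itself creates these relationships with Cypher.

```python
def build_edges(doc_name: str, chunk_count: int) -> list[tuple[str, str, str]]:
    """Return (source, relationship, target) triples for one document."""
    chunk_names = [f"chunk_{i}" for i in range(chunk_count)]
    edges = []
    # Every chunk points back to its source document.
    for name in chunk_names:
        edges.append((name, "FROM_DOCUMENT", doc_name))
    # Each chunk points to the one that follows it; the last chunk has no successor.
    for current, following in zip(chunk_names, chunk_names[1:]):
        edges.append((current, "NEXT_CHUNK", following))
    return edges


edges = build_edges("doc", 3)
for source, rel, target in edges:
    print(f"({source})-[:{rel}]->({target})")
```

For N chunks this yields N `FROM_DOCUMENT` relationships and N - 1 `NEXT_CHUNK` relationships.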

## Setup

Import required modules and configure the environment.

In [None]:
import sys
sys.path.insert(0, '../solutions')

from neo4j import GraphDatabase
from config import Neo4jConfig

## Sample Data

We'll use sample text representing content from an SEC 10-K filing. This text is similar to what you would extract from a real PDF document.

> **Note:** In production, you would use `pypdf` or similar libraries to extract text from PDF files. We use pre-defined text here for fast, reproducible results.
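
As a rough sketch of what that extraction step might look like, the helpers below use `pypdf`'s `PdfReader` and `extract_text()`. The function names `extract_page_texts` and `load_pdf_pages` are hypothetical, and this notebook does not run them.

```python
def extract_page_texts(reader) -> list[str]:
    """Return the extracted text of every page, one string per page."""
    # Guard against pages that yield no text (for example, scanned images).
    return [(page.extract_text() or "").strip() for page in reader.pages]


def load_pdf_pages(path: str) -> list[str]:
    """Open a PDF with pypdf and return its page texts."""
    # Imported lazily so that extract_page_texts has no hard dependency on pypdf.
    from pypdf import PdfReader

    return extract_page_texts(PdfReader(path))
```

Each returned string could then play the role of `SAMPLE_TEXT`, with the list position giving the page number.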

In [None]:
# Sample text representing a page from an SEC 10-K filing
SAMPLE_TEXT = """
Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. AppleCare offers a portfolio of fee-based service and support products. Cloud 
Services store and keep customers' content up-to-date across all devices. Digital Content 
operates various platforms for discovering, purchasing, streaming and downloading digital 
content and apps. Payment Services include Apple Card and Apple Pay.
""".strip()

# Document metadata
DOCUMENT_PATH = "form10k-sample/apple-2023-10k.pdf"
DOCUMENT_PAGE = 1

print(f"Sample text length: {len(SAMPLE_TEXT)} characters")
print(f"\n{SAMPLE_TEXT}")

## Connect to Neo4j

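Create a driver for your Neo4j database. `Neo4jConfig` reads the URI and credentials from environment variables, and `verify_connectivity()` raises an exception if the database cannot be reached.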

In [None]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")
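
The `Neo4jConfig` class lives in the `config` module and is not shown in this notebook. The sketch below is one plausible way such a class could read its settings; the class name `EnvNeo4jConfig` and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions, not the actual implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed variable names; the real config module may use different ones.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


@dataclass(frozen=True)
class EnvNeo4jConfig:
    """Hypothetical stand-in for Neo4jConfig."""

    uri: str
    username: str
    password: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "EnvNeo4jConfig":
        # Fail early with a readable message instead of a bare KeyError.
        missing = [name for name in REQUIRED_VARS if name not in env]
        if missing:
            raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
        return cls(env["NEO4J_URI"], env["NEO4J_USERNAME"], env["NEO4J_PASSWORD"])
```

Accepting the environment as a mapping keeps the class easy to exercise with a plain dictionary.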

## Clear Existing Data (Optional)

For a clean start, remove any existing Document and Chunk nodes from previous runs.

In [None]:
def clear_graph(driver):
    """Remove all Document and Chunk nodes."""
    with driver.session() as session:
        result = session.run("""
            MATCH (n) WHERE n:Document OR n:Chunk
            DETACH DELETE n
            RETURN count(n) as deleted
        """)
        count = result.single()["deleted"]
        print(f"Deleted {count} nodes")

clear_graph(driver)

## Create Document Node

First, create a Document node to represent the source file. This node stores metadata about where the content came from.

In [None]:
def create_document(driver, path: str, page: int) -> str:
    """Create a Document node and return its element ID."""
    with driver.session() as session:
        result = session.run("""
            CREATE (d:Document {path: $path, page: $page})
            RETURN elementId(d) as doc_id
        """, path=path, page=page)
        return result.single()["doc_id"]

doc_id = create_document(driver, DOCUMENT_PATH, DOCUMENT_PAGE)
print(f"Created Document node with ID: {doc_id}")
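
Because `create_document` uses `CREATE`, re-running the cell without clearing the graph adds another Document node each time. If that is a concern, a `MERGE`-based variant is one option. The sketch below is hypothetical (`merge_document` is not part of the notebook) and assumes a 5.x driver recent enough to provide `Driver.execute_query`.

```python
MERGE_DOCUMENT_QUERY = """
MERGE (d:Document {path: $path, page: $page})
RETURN elementId(d) AS doc_id
"""


def merge_document(driver, path: str, page: int) -> str:
    """Find or create the Document for (path, page) and return its element ID."""
    # MERGE matches an existing node with these properties or creates one.
    result = driver.execute_query(
        MERGE_DOCUMENT_QUERY,
        parameters_={"path": path, "page": page},
    )
    return result.records[0]["doc_id"]
```

Calling `merge_document(driver, DOCUMENT_PATH, DOCUMENT_PAGE)` twice in a row should return the same element ID rather than creating a second node.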

## Split Text into Chunks

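Split the sample text into smaller chunks. For this demonstration, we'll split on blank lines so that each paragraph becomes one chunk. The standalone section headings ("Products", "Services") are also separated by blank lines, so each of them becomes its own short chunk.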

> **Note:** In the next notebook, you'll learn to use `FixedSizeSplitter` for automatic chunking.

In [None]:
def split_into_chunks(text: str) -> list[str]:
    """Split text into chunks by double newlines (paragraphs)."""
    chunks = [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]
    return chunks

chunks = split_into_chunks(SAMPLE_TEXT)
print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk}\n")
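
Paragraph splitting produces chunks of very uneven length. For comparison, here is a hand-rolled sketch of fixed-size chunking with overlap. It is not the `FixedSizeSplitter` implementation; the function name `split_fixed_size` and its parameters are assumptions chosen for illustration.

```python
def split_fixed_size(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that is already fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_fixed_size("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, which reduces the chance that a sentence cut at a boundary is lost to retrieval.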

## Create Chunk Nodes

Create Chunk nodes for each piece of text and link them to the Document with `FROM_DOCUMENT` relationships.

In [None]:
def create_chunks(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Create Chunk nodes linked to a Document. Returns chunk element IDs."""
    chunk_ids = []
    with driver.session() as session:
        for index, text in enumerate(chunks):
            result = session.run("""
                MATCH (d:Document) WHERE elementId(d) = $doc_id
                CREATE (c:Chunk {text: $text, index: $index})
                CREATE (c)-[:FROM_DOCUMENT]->(d)
                RETURN elementId(c) as chunk_id
            """, doc_id=doc_id, text=text, index=index)
            chunk_id = result.single()["chunk_id"]
            chunk_ids.append(chunk_id)
            print(f"Created Chunk {index}")
    return chunk_ids

chunk_ids = create_chunks(driver, doc_id, chunks)
print(f"\nCreated {len(chunk_ids)} chunks")
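
The loop above sends one query per chunk, which is easy to follow but means one round trip per chunk. For larger documents, a common alternative is to pass all chunks as a single list parameter and let Cypher iterate with `UNWIND`. The names `build_chunk_rows` and `create_chunks_batched` below are hypothetical; treat this as a sketch rather than the notebook's method.

```python
CREATE_CHUNKS_QUERY = """
MATCH (d:Document) WHERE elementId(d) = $doc_id
UNWIND $rows AS row
CREATE (c:Chunk {text: row.text, index: row.index})-[:FROM_DOCUMENT]->(d)
RETURN elementId(c) AS chunk_id
ORDER BY row.index
"""


def build_chunk_rows(chunks: list[str]) -> list[dict]:
    """Turn chunk texts into the list-of-maps parameter the query expects."""
    return [{"index": index, "text": text} for index, text in enumerate(chunks)]


def create_chunks_batched(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Create all Chunk nodes in one query and return their element IDs in order."""
    with driver.session() as session:
        result = session.run(
            CREATE_CHUNKS_QUERY, doc_id=doc_id, rows=build_chunk_rows(chunks)
        )
        return [record["chunk_id"] for record in result]
```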

## Link Chunks with NEXT_CHUNK

Create `NEXT_CHUNK` relationships between sequential chunks. This preserves the original document order.

In [None]:
def link_chunks(driver, chunk_ids: list[str]):
    """Create NEXT_CHUNK relationships between sequential chunks."""
    with driver.session() as session:
        for i in range(len(chunk_ids) - 1):
            session.run("""
                MATCH (c1:Chunk) WHERE elementId(c1) = $id1
                MATCH (c2:Chunk) WHERE elementId(c2) = $id2
                CREATE (c1)-[:NEXT_CHUNK]->(c2)
            """, id1=chunk_ids[i], id2=chunk_ids[i+1])
        print(f"Created {len(chunk_ids) - 1} NEXT_CHUNK relationships")

link_chunks(driver, chunk_ids)
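
The index arithmetic in `link_chunks` amounts to pairing each ID with its successor. As a small illustration, the hypothetical helper below builds those pairs with `zip`; the resulting list could also be passed to a single `UNWIND` query instead of looping in Python.

```python
def consecutive_pairs(chunk_ids: list[str]) -> list[dict]:
    """Pair each chunk ID with the ID that follows it."""
    # zip stops at the shorter sequence, so the last ID never appears as a source.
    return [
        {"src": current, "dst": following}
        for current, following in zip(chunk_ids, chunk_ids[1:])
    ]


print(consecutive_pairs(["a", "b", "c"]))
# [{'src': 'a', 'dst': 'b'}, {'src': 'b', 'dst': 'c'}]
```

An empty list or a single ID produces no pairs, matching the `range(len(chunk_ids) - 1)` loop.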

## Verify the Graph Structure

Query the graph to confirm that the Document has the expected number of chunks and that each chunk links to the next one.

In [None]:
def show_graph_structure(driver):
    """Display the Document-Chunk graph structure."""
    with driver.session() as session:
        # Count nodes
        result = session.run("""
            MATCH (d:Document)
            OPTIONAL MATCH (d)<-[:FROM_DOCUMENT]-(c:Chunk)
            RETURN d.path as document, d.page as page, count(c) as chunks
        """)
        print("=== Graph Structure ===")
        for record in result:
            print(f"Document: {record['document']} (page {record['page']})")
            print(f"  Chunks: {record['chunks']}")
        
        # Show chunk chain
        result = session.run("""
            MATCH (c:Chunk)
            OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(next:Chunk)
            RETURN c.index as idx, 
                   c.text as text,
                   next.index as next_idx
            ORDER BY c.index
        """)
        print("\n=== Chunk Chain ===")
        for record in result:
            next_str = f" -> Chunk {record['next_idx']}" if record['next_idx'] is not None else " (end)"
            print(f"Chunk {record['idx']}: \"{record['text']}\"{next_str}")

show_graph_structure(driver)
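
Reading the printed chain by eye works for a handful of chunks. As a sketch of an automated check, the hypothetical helper below takes `(idx, next_idx)` pairs, such as those returned by the chunk chain query, and reports whether they form one unbroken sequence starting at index 0.

```python
from typing import Optional


def chain_is_linear(rows: list[tuple[int, Optional[int]]]) -> bool:
    """Return True if the (idx, next_idx) rows form a single 0..N-1 chain."""
    ordered = sorted(rows, key=lambda row: row[0])
    for position, (idx, next_idx) in enumerate(ordered):
        is_last = position == len(ordered) - 1
        # Every chunk except the last must point at the following index.
        expected_next = None if is_last else position + 1
        if idx != position or next_idx != expected_next:
            return False
    return True


print(chain_is_linear([(0, 1), (1, 2), (2, None)]))
# True
```

A `False` result would suggest a missing `NEXT_CHUNK` relationship or duplicate chunks left over from an earlier run.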

## Summary

In this notebook, you learned:

1. **Document-Chunk structure** - Documents are split into chunks for efficient retrieval
2. **FROM_DOCUMENT relationship** - Links chunks back to their source document
3. **NEXT_CHUNK relationship** - Preserves the sequential order of chunks

This basic structure is the foundation for RAG applications. In the next notebooks, you'll learn to:
- Add **embeddings** to chunks for semantic search (01_02)
- Extract **entities** from chunks to build a knowledge graph (01_03)

---

**Next:** [Embeddings and Vector Search](01_02_embeddings.ipynb)

In [None]:
# Cleanup
driver.close()
print("Connection closed.")