# Data Loading Fundamentals

This notebook introduces the core concepts of loading text data into Neo4j for GraphRAG applications.

**Learning Objectives:**
- Understand the Document → Chunk graph structure
- Connect to Neo4j from a Jupyter notebook
- Create Document and Chunk nodes
- Link chunks to documents and to each other

---

## Why Documents and Chunks?

When building GraphRAG applications, we split documents into smaller pieces called **chunks** because:

1. **Context windows are limited** - LLMs can only process a certain amount of text at once
2. **Retrieval precision** - Smaller chunks allow more precise matching to user queries
3. **Cost efficiency** - Embedding and sending smaller chunks to an LLM is faster and cheaper than processing whole documents

The graph structure we'll build:
```
(:Document) <-[:FROM_DOCUMENT]- (:Chunk) -[:NEXT_CHUNK]-> (:Chunk)
```

---

## Text Splitting with neo4j-graphrag-python

We'll use `FixedSizeSplitter` from the [neo4j-graphrag-python](https://neo4j.com/docs/neo4j-graphrag-python/current/) library to split text into chunks:

- `chunk_size`: Maximum characters per chunk
- `chunk_overlap`: Characters shared between consecutive chunks for context continuity
- `approximate=True` (default): Avoids splitting words in the middle at chunk boundaries
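
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size splitting. It is not the library's implementation: the function name `fixed_size_chunks` is hypothetical, and the sketch ignores the word-boundary adjustment that `approximate=True` performs.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1).
print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The window advances by `chunk_size - chunk_overlap` characters, so a larger overlap produces more chunks for the same text.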

## Setup

Import required modules and configure the environment.

In [None]:
from data_utils import Neo4jConnection, DataLoader, split_text

## Sample Data

We'll load text from `company_data.txt`, which contains sample content from an SEC 10-K filing.

> **Note:** In production, you would use `pypdf` or similar libraries to extract text from PDF files. We use a pre-defined text file here for fast, reproducible results.
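
The production path mentioned in the note could look roughly like the sketch below. It assumes the third-party `pypdf` package is installed and that the PDF has a text layer (scanned pages would need OCR instead). The helper names `join_page_texts` and `extract_pdf_text` are hypothetical and are not part of `data_utils`.

```python
from collections.abc import Iterable


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text with newlines, skipping pages that yielded no text."""
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(pdf_path: str) -> str:
    """Extract the text of every page in a PDF using pypdf."""
    # Imported here so join_page_texts stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_page_texts(page.extract_text() for page in reader.pages)
```

The returned string could then stand in for `SAMPLE_TEXT` in the cells that follow.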

In [None]:
# Load text from file using DataLoader
loader = DataLoader("company_data.txt")
SAMPLE_TEXT = loader.text

# Document metadata
DOCUMENT_PATH = "form10k-sample/apple-2023-10k.pdf"
DOCUMENT_PAGE = 1

metadata = loader.get_metadata()
print(f"Loaded from: {metadata['name']}")
print(f"Sample text length: {metadata['size']} characters")
print(f"\n{SAMPLE_TEXT}")

## Connect to Neo4j

Create a connection to your Neo4j database using the `Neo4jConnection` utility class.

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver
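
`Neo4jConnection` comes from this course's `data_utils` module, and its internals are not shown here. As a rough sketch of what such a wrapper typically does with the official `neo4j` Python driver, the code below reads connection settings and verifies connectivity. The environment variable names, the defaults, and both function names are assumptions for illustration.

```python
import os


def load_neo4j_config(env=None) -> dict:
    """Read connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),  # assumed variable name
        "user": env.get("NEO4J_USERNAME", "neo4j"),            # assumed variable name
        "password": env.get("NEO4J_PASSWORD", ""),             # assumed variable name
    }


def connect(config: dict):
    """Create a driver and fail fast if the database is unreachable."""
    from neo4j import GraphDatabase  # official Neo4j Python driver

    driver = GraphDatabase.driver(
        config["uri"], auth=(config["user"], config["password"])
    )
    driver.verify_connectivity()  # raises if the server cannot be reached
    return driver
```

Calling `connect(load_neo4j_config())` would return a driver that can be passed to the functions in the following cells.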

## Clear Existing Data (Optional)

For a clean start, call the `clear_graph()` utility method to remove any Document and Chunk nodes left over from previous runs.

In [None]:
neo4j.clear_graph()
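
The body of `clear_graph()` is not shown in this notebook. A plausible equivalent, offered as an assumption rather than the utility's actual code, deletes only `Document` and `Chunk` nodes together with their relationships. The function name is hypothetical.

```python
def clear_documents_and_chunks(driver) -> int:
    """Delete all Document and Chunk nodes; return how many nodes were removed."""
    with driver.session() as session:
        result = session.run("""
            MATCH (n)
            WHERE n:Document OR n:Chunk
            DETACH DELETE n
        """)
        # consume() finishes the query and exposes the update counters.
        summary = result.consume()
        return summary.counters.nodes_deleted
```

`DETACH DELETE` removes each node's relationships first, so the `FROM_DOCUMENT` and `NEXT_CHUNK` links do not block the deletion.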

## Create Document Node

First, create a Document node to represent the source file. This node stores metadata about where the content came from.

In [None]:
def create_document(driver, path: str, page: int) -> str:
    """Create a Document node and return its element ID."""
    with driver.session() as session:
        result = session.run("""
            CREATE (d:Document {path: $path, page: $page})
            RETURN elementId(d) as doc_id
        """, path=path, page=page)
        return result.single()["doc_id"]

doc_id = create_document(driver, DOCUMENT_PATH, DOCUMENT_PAGE)
print(f"Created Document node with ID: {doc_id}")

## Split Text into Chunks

Call the `split_text` utility, which uses `FixedSizeSplitter` from neo4j-graphrag-python, to split the text into chunks with configurable size and overlap.

In [None]:
# Split text using the utility function
chunks = split_text(SAMPLE_TEXT, chunk_size=500, chunk_overlap=50)

print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk[:100]}...\n")

## Create Chunk Nodes

Create Chunk nodes for each piece of text and link them to the Document with `FROM_DOCUMENT` relationships.

In [None]:
def create_chunks(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Create Chunk nodes linked to a Document. Returns chunk element IDs."""
    chunk_ids = []
    with driver.session() as session:
        for index, text in enumerate(chunks):
            result = session.run("""
                MATCH (d:Document) WHERE elementId(d) = $doc_id
                CREATE (c:Chunk {text: $text, index: $index})
                CREATE (c)-[:FROM_DOCUMENT]->(d)
                RETURN elementId(c) as chunk_id
            """, doc_id=doc_id, text=text, index=index)
            chunk_id = result.single()["chunk_id"]
            chunk_ids.append(chunk_id)
            print(f"Created Chunk {index}")
    return chunk_ids

chunk_ids = create_chunks(driver, doc_id, chunks)
print(f"\nCreated {len(chunk_ids)} chunks")
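
The function above sends one query per chunk, which is easy to follow but slow for large documents. As a possible alternative, the sketch below passes all chunks as a single list parameter and lets Cypher's `UNWIND` create them in one query. The function name `create_chunks_batched` is hypothetical.

```python
def create_chunks_batched(driver, doc_id: str, chunks: list[str]) -> list[str]:
    """Create all Chunk nodes for a Document in one query. Returns chunk element IDs."""
    rows = [{"index": i, "text": text} for i, text in enumerate(chunks)]
    with driver.session() as session:
        result = session.run("""
            MATCH (d:Document) WHERE elementId(d) = $doc_id
            UNWIND $rows AS row
            CREATE (c:Chunk {text: row.text, index: row.index})
            CREATE (c)-[:FROM_DOCUMENT]->(d)
            RETURN elementId(c) AS chunk_id, row.index AS chunk_index
            ORDER BY chunk_index
        """, doc_id=doc_id, rows=rows)
        # Read the records while the session is still open.
        return [record["chunk_id"] for record in result]
```

The `ORDER BY` keeps the returned IDs in document order, which the linking step relies on.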

## Link Chunks with NEXT_CHUNK

Create `NEXT_CHUNK` relationships between sequential chunks. This preserves the original document order.

In [None]:
def link_chunks(driver, chunk_ids: list[str]):
    """Create NEXT_CHUNK relationships between sequential chunks."""
    with driver.session() as session:
        for i in range(len(chunk_ids) - 1):
            session.run("""
                MATCH (c1:Chunk) WHERE elementId(c1) = $id1
                MATCH (c2:Chunk) WHERE elementId(c2) = $id2
                CREATE (c1)-[:NEXT_CHUNK]->(c2)
            """, id1=chunk_ids[i], id2=chunk_ids[i+1])
        print(f"Created {len(chunk_ids) - 1} NEXT_CHUNK relationships")

link_chunks(driver, chunk_ids)
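
Because `link_chunks` uses `CREATE`, running the cell twice would add a second `NEXT_CHUNK` relationship between each pair. One way to make the step safe to re-run, sketched here under the hypothetical name `link_chunks_idempotent`, is to build the consecutive pairs with `zip` and use `MERGE` instead:

```python
def link_chunks_idempotent(driver, chunk_ids: list[str]) -> int:
    """Link consecutive chunks with NEXT_CHUNK, creating each link at most once."""
    # Pair every chunk with its successor: (0, 1), (1, 2), ...
    pairs = [
        {"source": current, "target": following}
        for current, following in zip(chunk_ids, chunk_ids[1:])
    ]
    if not pairs:
        return 0  # zero or one chunk: nothing to link

    with driver.session() as session:
        session.run("""
            UNWIND $pairs AS pair
            MATCH (c1:Chunk) WHERE elementId(c1) = pair.source
            MATCH (c2:Chunk) WHERE elementId(c2) = pair.target
            MERGE (c1)-[:NEXT_CHUNK]->(c2)
        """, pairs=pairs).consume()
    return len(pairs)
```

For `n` chunks this produces `n - 1` pairs, matching the count printed by `link_chunks`.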

## Verify the Graph Structure

Query the graph to see what we created.

In [None]:
def show_graph_structure(driver):
    """Display the Document-Chunk graph structure."""
    with driver.session() as session:
        # Count the chunks attached to each document
        result = session.run("""
            MATCH (d:Document)
            OPTIONAL MATCH (d)<-[:FROM_DOCUMENT]-(c:Chunk)
            RETURN d.path as document, d.page as page, count(c) as chunks
        """)
        print("=== Graph Structure ===")
        for record in result:
            print(f"Document: {record['document']} (page {record['page']})")
            print(f"  Chunks: {record['chunks']}")
        
        # Show chunk chain
        result = session.run("""
            MATCH (c:Chunk)
            OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(next:Chunk)
            RETURN c.index as idx, 
                   c.text as text,
                   next.index as next_idx
            ORDER BY c.index
        """)
        print("\n=== Chunk Chain ===")
        for record in result:
            next_str = f" -> Chunk {record['next_idx']}" if record['next_idx'] is not None else " (end)"
            print(f"Chunk {record['idx']}: \"{record['text']}\"{next_str}")

show_graph_structure(driver)
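
The `NEXT_CHUNK` chain is also useful at retrieval time: once a chunk is matched, its neighbours can be fetched to give an LLM more surrounding context. The sketch below illustrates the idea. The function name is hypothetical, and it assumes a single document is loaded so that `index` identifies exactly one chunk.

```python
def get_chunk_with_neighbors(driver, index: int):
    """Return the text of a chunk plus its previous and next chunks, or None if absent."""
    with driver.session() as session:
        record = session.run("""
            MATCH (c:Chunk {index: $index})
            OPTIONAL MATCH (prev:Chunk)-[:NEXT_CHUNK]->(c)
            OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(following:Chunk)
            RETURN prev.text AS previous_text,
                   c.text AS current_text,
                   following.text AS following_text
        """, index=index).single()

    if record is None:
        return None  # no chunk with that index
    return {
        "previous": record["previous_text"],    # None for the first chunk
        "current": record["current_text"],
        "following": record["following_text"],  # None for the last chunk
    }
```

With several documents in the graph, the `MATCH` would also need to filter by document, for example by following `FROM_DOCUMENT` to a specific `Document` node.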

## Summary

In this notebook, you learned:

1. **Document-Chunk structure** - Documents are split into chunks for efficient retrieval
2. **FROM_DOCUMENT relationship** - Links chunks back to their source document
3. **NEXT_CHUNK relationship** - Preserves the sequential order of chunks

This basic structure is the foundation for GraphRAG applications. In the next notebooks, you'll learn to:
- Add **embeddings** to chunks for semantic search (01_02)
- Extract **entities** from chunks to build a knowledge graph (01_03)

---

**Next:** [Embeddings and Vector Search](01_02_embeddings.ipynb)

In [None]:
# Cleanup
neo4j.close()