# Data Loading Fundamentals

This notebook introduces the core concepts of loading structured manufacturing data into Neo4j for GraphRAG applications. You'll learn how to build a manufacturing traceability graph and create text chunks from requirement descriptions for semantic search.

**Learning Objectives:**
- Understand the manufacturing traceability graph structure
- Connect to Neo4j from a Jupyter notebook
- Create Product, TechnologyDomain, and Component nodes using Cypher
- Create relationships between nodes to form the traceability chain
- Load Requirement descriptions and split them into chunks for embedding

---

## Why a Manufacturing Traceability Graph?

In automotive manufacturing, traceability is critical: engineers need to trace from a product all the way down to individual test results and defects. A knowledge graph represents these relationships naturally. This notebook builds the upper part of that chain, from Product down to Requirement chunks:

```
(Product) -[:PRODUCT_HAS_DOMAIN]-> (TechnologyDomain) -[:DOMAIN_HAS_COMPONENT]-> (Component)
    (Component) -[:COMPONENT_HAS_REQ]-> (Requirement) -[:HAS_CHUNK]-> (Chunk) -[:NEXT_CHUNK]-> (Chunk)
```

This structure enables:
- **Impact analysis**: When a change is proposed, trace which requirements, components, and tests are affected
- **Traceability**: Follow the chain from product definition to test results and defects
- **Semantic search**: Find requirements by meaning using vector embeddings on chunk text
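
To make the impact-analysis idea concrete, here is a minimal in-memory sketch of the same chain. The edges are illustrative assumptions, not data loaded from the CSVs, and `downstream` is a hypothetical helper. In the notebook itself this traversal would be a Cypher query against Neo4j.

```python
from collections import deque

# Toy copy of the traceability chain; these edges are illustrative only.
EDGES = {
    "Product:R2D2": ["TechnologyDomain:Electric Powertrain"],
    "TechnologyDomain:Electric Powertrain": ["Component:HVB_3900", "Component:PDU_1500"],
    "Component:HVB_3900": ["Requirement:1_1"],
    "Requirement:1_1": ["Chunk:0", "Chunk:1"],
}

def downstream(start, edges=EDGES):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Everything affected by a change to the HVB_3900 component
print(downstream("Component:HVB_3900"))
# ['Requirement:1_1', 'Chunk:0', 'Chunk:1']
```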

---

## Text Splitting with neo4j-graphrag-python

Requirement descriptions contain detailed engineering specifications. We split them into smaller **chunks** for embedding because:

1. **Retrieval precision** - A 500-character chunk about "thermal management" is more relevant to a query about cooling than an entire multi-paragraph requirement
2. **Embedding quality** - Embedding models produce better representations for focused text segments
3. **Context windows** - LLMs can only process a certain amount of text at once

We'll use `FixedSizeSplitter` from the [neo4j-graphrag-python](https://neo4j.com/docs/neo4j-graphrag-python/current/) library:

- `chunk_size`: Maximum characters per chunk (e.g., 500 or 1000)
- `chunk_overlap`: Characters shared between consecutive chunks for context continuity
- `approximate=True` (default): Adjusts chunk boundaries so that words are not split in the middle
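
The interaction between `chunk_size` and `chunk_overlap` can be seen in a simplified stand-in. This is not the library's implementation: `split_fixed` is a hypothetical helper that cuts purely by character position and ignores word boundaries, unlike `approximate=True`.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Word boundaries
    are ignored, so this only approximates what the library does.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
parts = split_fixed(sample, chunk_size=500, chunk_overlap=50)
print([len(p) for p in parts])  # [500, 500, 300]
```

Each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear as the first 50 of the next.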

## Install Dependencies

First, install the required packages. This only needs to be run once per session.

In [None]:
# Install neo4j-graphrag with Bedrock support
%pip install "neo4j-graphrag[bedrock] @ git+https://github.com/neo4j-partners/neo4j-graphrag-python.git@bedrock-embeddings" python-dotenv pydantic-settings nest-asyncio -q

## Setup

Import required modules and configure the environment.

In [None]:
from data_utils import Neo4jConnection, CSVLoader, DataLoader, split_text

## Load Manufacturing Data from CSVs

We'll load structured manufacturing data from CSV files in the `TransformedData/` directory. The CSVs contain products, technology domains, components, and their relationships.

We'll also load requirement description text from `manufacturing_data.txt` for chunking and embedding demonstrations.
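
The loading cell indexes each row by its CSV header (for example `p['Product Name']`), which suggests that `load_csv` returns a list of dictionaries. A minimal sketch of that shape with the standard library follows. It assumes nothing about how `CSVLoader` is actually implemented, and the sample row is made up. Only the column names come from this notebook.

```python
import csv
import io

def load_rows(file_obj):
    """Read a CSV stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(file_obj))

# Hypothetical stand-in for products.csv
sample_csv = io.StringIO(
    "product_id,Product Name,Description\n"
    "P1,R2D2,Example vehicle\n"
)
rows = load_rows(sample_csv)
print(rows[0]["Product Name"])  # R2D2
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would need explicit conversion before being stored as numbers.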

In [None]:
# Load CSV data from TransformedData directory
csv_loader = CSVLoader("../TransformedData")

products = csv_loader.load_csv("products.csv")
tech_domains = csv_loader.load_csv("technology_domains.csv")
components = csv_loader.load_csv("components.csv")
product_domains = csv_loader.load_csv("product_technology_domains.csv")
domain_components = csv_loader.load_csv("technology_domains_components.csv")

print(f"Products: {len(products)}")
for p in products:
    print(f"  {p['product_id']}: {p['Product Name']} - {p['Description']}")

print(f"\nTechnology Domains: {len(tech_domains)}")
for td in tech_domains:
    print(f"  {td['technology_domain_id']}: {td['Technology Domain']}")

print(f"\nComponents: {len(components)}")
for c in components:
    print(f"  {c['component_id']}: {c['Component']} - {c['Component Description']}")

# Also load requirement text for chunking demo
loader = DataLoader("manufacturing_data.txt")
SAMPLE_TEXT = loader.text
metadata = loader.get_metadata()
print(f"\nRequirement text: {metadata['size']} characters")

## Connect to Neo4j

Create a connection to your Neo4j database using the `Neo4jConnection` utility class.
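
For readers without the `data_utils` module, a connection helper might look roughly like the sketch below. The environment variable names and default values are assumptions, and this is not the actual `Neo4jConnection` implementation. `GraphDatabase.driver` and `verify_connectivity` are part of the official Neo4j Python driver.

```python
import os

def read_neo4j_settings(env=os.environ):
    """Collect connection settings from environment variables (names are assumptions)."""
    return {
        "uri": env.get("NEO4J_URI", "neo4j://localhost:7687"),
        "user": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }

def connect(settings):
    """Open a driver and fail fast if the database is unreachable."""
    # Imported lazily so the settings helper works without the driver installed
    from neo4j import GraphDatabase
    driver = GraphDatabase.driver(
        settings["uri"], auth=(settings["user"], settings["password"])
    )
    driver.verify_connectivity()
    return driver

settings = read_neo4j_settings({"NEO4J_URI": "neo4j://example:7687"})
print(settings["uri"])  # neo4j://example:7687
```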

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver

## Clear Existing Data (Optional)

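For a clean start, remove any existing nodes from previous runs. The loading cells below use `CREATE`, so running them again without clearing first produces duplicate nodes. Skip this step if the database holds data you need to keep.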
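
A common way to empty a small graph is a single `DETACH DELETE` statement. The sketch below assumes that `clear_graph()` does something equivalent, which this notebook does not confirm. `clear_all` is a hypothetical name.

```python
def clear_all(driver):
    """Delete every node and its relationships. Destructive: use only on a scratch database."""
    with driver.session() as session:
        # consume() waits for the statement to finish before the session closes
        session.run("MATCH (n) DETACH DELETE n").consume()
```

For large graphs, a single delete statement can exhaust transaction memory, so deleting in batches is usually preferable there.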

In [None]:
neo4j.clear_graph()

## Create Product, TechnologyDomain, and Component Nodes

First, create the top-level entities from the CSV data. These form the structural backbone of the manufacturing traceability graph.

- **Product**: The vehicle under development (R2D2)
- **TechnologyDomain**: Major technology areas (Electric Powertrain, Chassis, Body, Infotainment)
- **Component**: Hardware components (HVB_3900 battery, PDU_1500 power distribution unit, etc.)

In [None]:
def create_products(driver, products):
    """Create Product nodes from CSV data."""
    with driver.session() as session:
        for p in products:
            session.run("""
                CREATE (p:Product {product_id: $id, name: $name, description: $desc})
            """, id=p["product_id"], name=p["Product Name"], desc=p["Description"])
    print(f"Created {len(products)} Product nodes")

def create_technology_domains(driver, domains):
    """Create TechnologyDomain nodes from CSV data."""
    with driver.session() as session:
        for td in domains:
            session.run("""
                CREATE (t:TechnologyDomain {technology_domain_id: $id, name: $name})
            """, id=td["technology_domain_id"], name=td["Technology Domain"])
    print(f"Created {len(domains)} TechnologyDomain nodes")

def create_components(driver, components):
    """Create Component nodes from CSV data."""
    with driver.session() as session:
        for c in components:
            session.run("""
                CREATE (c:Component {component_id: $id, name: $name, description: $desc})
            """, id=c["component_id"], name=c["Component"], desc=c["Component Description"])
    print(f"Created {len(components)} Component nodes")

create_products(driver, products)
create_technology_domains(driver, tech_domains)
create_components(driver, components)
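
The functions above issue one `CREATE` per row, so re-running the cell duplicates every node. One possible alternative, sketched here for Product nodes only, sends all rows in a single query with `UNWIND` and uses `MERGE` on the ID so the load can be repeated safely. `merge_products` is a hypothetical name, not part of `data_utils`.

```python
def merge_products(driver, products):
    """Upsert Product nodes in one round trip; safe to re-run."""
    rows = [
        {"id": p["product_id"], "name": p["Product Name"], "desc": p["Description"]}
        for p in products
    ]
    query = """
        UNWIND $rows AS row
        MERGE (p:Product {product_id: row.id})
        SET p.name = row.name, p.description = row.desc
    """
    with driver.session() as session:
        session.run(query, rows=rows).consume()
    return len(rows)
```

`MERGE` matches on `product_id` alone and `SET` then refreshes the other properties, so a changed description updates the existing node instead of creating a second one. A uniqueness constraint on `product_id` would additionally guard against duplicates created elsewhere.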

## Create Relationships

Now create the relationships that form the traceability chain. We use the junction table CSVs to connect:
- **PRODUCT_HAS_DOMAIN**: Product → TechnologyDomain
- **DOMAIN_HAS_COMPONENT**: TechnologyDomain → Component

These relationships are created by matching nodes on their IDs from the junction CSVs.

In [None]:
def create_product_domain_rels(driver, product_domains):
    """Create PRODUCT_HAS_DOMAIN relationships."""
    with driver.session() as session:
        for row in product_domains:
            session.run("""
                MATCH (p:Product {product_id: $pid})
                MATCH (t:TechnologyDomain {technology_domain_id: $tid})
                CREATE (p)-[:PRODUCT_HAS_DOMAIN]->(t)
            """, pid=row["product_id"], tid=row["technology_domain_id"])
    print(f"Created {len(product_domains)} PRODUCT_HAS_DOMAIN relationships")

def create_domain_component_rels(driver, domain_components):
    """Create DOMAIN_HAS_COMPONENT relationships."""
    with driver.session() as session:
        for row in domain_components:
            session.run("""
                MATCH (t:TechnologyDomain {technology_domain_id: $tid})
                MATCH (c:Component {component_id: $cid})
                CREATE (t)-[:DOMAIN_HAS_COMPONENT]->(c)
            """, tid=row["technology_domain_id"], cid=row["component_id"])
    print(f"Created {len(domain_components)} DOMAIN_HAS_COMPONENT relationships")

create_product_domain_rels(driver, product_domains)
create_domain_component_rels(driver, domain_components)
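
When a `MATCH` finds no node for an ID, the query produces no rows and the `CREATE` is skipped without an error, yet the print statements above still report the number of CSV rows. A small pre-check, sketched below with made-up sample rows, can flag junction rows that reference missing IDs before loading. `find_dangling` is a hypothetical helper.

```python
def find_dangling(junction_rows, left_key, left_ids, right_key, right_ids):
    """Return junction rows whose IDs have no matching row on either side."""
    left_ids, right_ids = set(left_ids), set(right_ids)
    return [
        row for row in junction_rows
        if row[left_key] not in left_ids or row[right_key] not in right_ids
    ]

# Hypothetical sample rows using the notebook's column names
sample_products = [{"product_id": "P1"}]
sample_domains = [{"technology_domain_id": "TD1"}]
sample_links = [
    {"product_id": "P1", "technology_domain_id": "TD1"},
    {"product_id": "P1", "technology_domain_id": "TD9"},  # no such domain
]

dangling = find_dangling(
    sample_links,
    "product_id", [p["product_id"] for p in sample_products],
    "technology_domain_id", [d["technology_domain_id"] for d in sample_domains],
)
print(dangling)  # [{'product_id': 'P1', 'technology_domain_id': 'TD9'}]
```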

## Split Requirement Text into Chunks

Now we demonstrate text chunking on the requirement descriptions. This is how we prepare text-rich content for vector embedding and semantic search.

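We take the consolidated requirement text loaded earlier (`SAMPLE_TEXT`) and split it with the `split_text` helper from `data_utils`, using the `chunk_size` and `chunk_overlap` parameters described above for `FixedSizeSplitter`. In the next step, each chunk becomes a node linked to its parent Requirement node with `HAS_CHUNK` and to the chunk that follows it with `NEXT_CHUNK`.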

In [None]:
# Split requirement text into chunks
chunks = split_text(SAMPLE_TEXT, chunk_size=500, chunk_overlap=50)

print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk[:100]}...\n")

## Create Requirement and Chunk Nodes

Create a Requirement node to represent the source requirement, then create Chunk nodes for each piece of text. Link chunks to the requirement with `HAS_CHUNK` and to each other with `NEXT_CHUNK`.

The `NEXT_CHUNK` relationships preserve the original text order, enabling **context expansion**: when a vector search returns a relevant chunk, you can retrieve the adjacent chunks for additional context.
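
Context expansion over `NEXT_CHUNK` could be sketched as follows. `expand_context` and the returned column names are hypothetical. The query assumes the `Chunk` label, `text` property, and `NEXT_CHUNK` relationship used in this notebook, and that `chunk_id` is an element ID such as one returned by a vector search.

```python
EXPAND_QUERY = """
    MATCH (hit:Chunk) WHERE elementId(hit) = $chunk_id
    OPTIONAL MATCH (prev:Chunk)-[:NEXT_CHUNK]->(hit)
    OPTIONAL MATCH (hit)-[:NEXT_CHUNK]->(nxt:Chunk)
    RETURN prev.text AS prev_text, hit.text AS hit_text, nxt.text AS next_text
"""

def expand_context(driver, chunk_id):
    """Return the matched chunk's text joined with its immediate neighbors."""
    with driver.session() as session:
        record = session.run(EXPAND_QUERY, chunk_id=chunk_id).single()
    if record is None:
        return None  # no chunk with that element ID
    parts = [record["prev_text"], record["hit_text"], record["next_text"]]
    # First and last chunks have only one neighbor, so drop the missing side
    return "\n".join(p for p in parts if p is not None)
```

Because consecutive chunks overlap, the joined text repeats the shared characters at each boundary. That is usually harmless as LLM context, but it can be trimmed if exact text is needed.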

In [None]:
def create_requirement_with_chunks(driver, requirement_name, chunks):
    """Create a Requirement node and linked Chunk nodes."""
    with driver.session() as session:
        # Create Requirement node (requirement_id and description are hardcoded for this demo)
        result = session.run("""
            CREATE (r:Requirement {requirement_id: '1_1', name: $name,
                description: 'Battery Cell and Module Design'})
            RETURN elementId(r) as req_id
        """, name=requirement_name)
        req_id = result.single()["req_id"]
        print(f"Created Requirement: {requirement_name}")

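        # Link Requirement to Component (creates nothing if HVB_3900 is missing)
        result = session.run("""
            MATCH (r:Requirement) WHERE elementId(r) = $req_id
            MATCH (c:Component {name: 'HVB_3900'})
            CREATE (c)-[:COMPONENT_HAS_REQ]->(r)
            RETURN count(*) AS linked
        """, req_id=req_id)
        linked = result.single()["linked"]
        print(f"Created {linked} COMPONENT_HAS_REQ relationship(s)")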

        # Create Chunk nodes with HAS_CHUNK relationships
        chunk_ids = []
        for index, text in enumerate(chunks):
            result = session.run("""
                MATCH (r:Requirement) WHERE elementId(r) = $req_id
                CREATE (c:Chunk {text: $text, index: $index})
                CREATE (r)-[:HAS_CHUNK]->(c)
                RETURN elementId(c) as chunk_id
            """, req_id=req_id, text=text, index=index)
            chunk_ids.append(result.single()["chunk_id"])
            print(f"Created Chunk {index}")

        # Create NEXT_CHUNK relationships
        for i in range(len(chunk_ids) - 1):
            session.run("""
                MATCH (c1:Chunk) WHERE elementId(c1) = $id1
                MATCH (c2:Chunk) WHERE elementId(c2) = $id2
                CREATE (c1)-[:NEXT_CHUNK]->(c2)
            """, id1=chunk_ids[i], id2=chunk_ids[i+1])
        print(f"Created {len(chunk_ids) - 1} NEXT_CHUNK relationships")

create_requirement_with_chunks(driver, "Battery Cell and Module Design", chunks)

## Verify the Graph Structure

Query the graph to see what we created.
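
Beyond printing the counts, they can be compared with the number of rows in each CSV to catch partial or duplicated loads. The helper below is a hypothetical sketch and the numbers are made up. In the notebook, the expected values would come from `len(products)`, `len(tech_domains)`, and `len(components)`.

```python
def count_mismatches(expected, actual):
    """Map each label to (expected, actual) wherever the two counts differ."""
    return {
        label: (want, actual.get(label, 0))
        for label, want in expected.items()
        if actual.get(label, 0) != want
    }

# Hypothetical counts: Component nodes were loaded twice
expected = {"Product": 1, "TechnologyDomain": 4, "Component": 8}
actual = {"Product": 1, "TechnologyDomain": 4, "Component": 16}
print(count_mismatches(expected, actual))  # {'Component': (8, 16)}
```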

In [None]:
def show_graph_structure(driver):
    """Display the manufacturing graph structure."""
    with driver.session() as session:
        # Count nodes by label
        result = session.run("""
            CALL db.labels() YIELD label
            CALL (label) {
                MATCH (n) WHERE label IN labels(n)
                RETURN count(n) as count
            }
            RETURN label, count
            ORDER BY count DESC
        """)
        print("=== Node Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")

        # Show traceability chain
        print("\n=== Traceability Chain ===")
        result = session.run("""
            MATCH (p:Product)-[:PRODUCT_HAS_DOMAIN]->(td:TechnologyDomain)
                    -[:DOMAIN_HAS_COMPONENT]->(c:Component)
            RETURN p.name as product, td.name as domain, c.name as component
            ORDER BY td.name, c.name
        """)
        for record in result:
            print(f"  {record['product']} -> {record['domain']} -> {record['component']}")

        # Show chunk chain
        print("\n=== Requirement Chunks ===")
        result = session.run("""
            MATCH (r:Requirement)-[:HAS_CHUNK]->(c:Chunk)
            RETURN r.name as requirement, count(c) as chunks
        """)
        for record in result:
            print(f"  {record['requirement']}: {record['chunks']} chunks")

show_graph_structure(driver)

## Summary

In this notebook, you learned the foundational graph structure for manufacturing GraphRAG applications:

1. **Structured data loading** - Products, TechnologyDomains, and Components loaded from CSV files into Neo4j as nodes.

2. **Traceability relationships** - PRODUCT_HAS_DOMAIN and DOMAIN_HAS_COMPONENT relationships form the top of the traceability chain, enabling queries like "what components belong to the Electric Powertrain domain?"

3. **Requirement chunking** - Requirement description text split into Chunk nodes linked by HAS_CHUNK and NEXT_CHUNK relationships, preparing the text for vector embedding.

4. **Graph advantages** - The graph structure lets us combine structured manufacturing data (products, components) with unstructured text (requirement descriptions) in a single queryable model.

In the next notebook, you'll learn to add **embeddings** to these chunks, enabling semantic similarity search over requirement descriptions.

---

**Next:** [Embeddings and Vector Search](02_embeddings.ipynb)

In [None]:
# Cleanup
neo4j.close()