# Lesson 9 - Knowledge Graph Construction - Part I

With all the plans in place, it's time to construct the knowledge graph.

For the **domain graph** construction, no agent is required. The construction plan has all the information needed to drive a rule-based import.

<img src="images/domain.png" width="600">

**Note**: This notebook uses Cypher queries to build the domain graph from CSV files. Don't worry if you're unfamiliar with Cypher — focus on understanding the big picture of how the structured data is transformed into a graph structure based on the construction plan.


## 9.1. Overview

This lesson demonstrates how to construct a domain knowledge graph from structured CSV data using:

- **Input**: `approved_construction_plan` (from previous lessons)
- **Output**: A domain graph in Neo4j with nodes and relationships
- **Tools**: `construct_domain_graph` + helper functions

**Workflow**:
1. Load and validate the construction plan
2. Create uniqueness constraints for data integrity
3. Import nodes from CSV files
4. Create relationships between nodes
5. Verify the constructed graph


## 9.2. Setup

Import necessary libraries and establish connections.

### 📋 Requirements

**If you encounter `ModuleNotFoundError`, install missing dependencies:**

```bash
# Install all required packages
pip install -r requirements.txt

# Or install individually
pip install pandas>=2.0.0 numpy>=1.24.0
```


In [1]:
# Import necessary libraries
from google.adk.models.lite_llm import LiteLlm
from neo4j_for_adk import graphdb, tool_success, tool_error
from typing import Dict, Any
import warnings
import logging
import os

# Try to import pandas (needed for DataFrame import method)
try:
    import pandas as pd
    PANDAS_AVAILABLE = True
    print("✅ pandas imported successfully")
except ImportError:
    PANDAS_AVAILABLE = False
    print("⚠️  pandas not available - use 'pip install pandas>=2.0.0' if needed")

# Suppress warnings and logging for cleaner output
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.CRITICAL)

print("Core libraries imported successfully.")


✅ pandas imported successfully
Core libraries imported successfully.


In [2]:
# Configure and test connections
MODEL_GPT_4O = "openai/gpt-4o"
llm = LiteLlm(model=MODEL_GPT_4O)

# Test connections
neo4j_status = graphdb.send_query("RETURN 'Neo4j is Ready!' as message")
print(f"✅ OpenAI connection established")
print(f"✅ Neo4j Status: {neo4j_status['status']}")


✅ OpenAI connection established
✅ Neo4j Status: success


## 9.3. Core Functions for Graph Construction


In [3]:
def create_uniqueness_constraint(label: str, unique_property_key: str) -> Dict[str, Any]:
    """Creates a uniqueness constraint for a node label and property key."""
    constraint_name = f"{label}_{unique_property_key}_constraint"
    query = f"""CREATE CONSTRAINT `{constraint_name}` IF NOT EXISTS
    FOR (n:`{label}`)
    REQUIRE n.`{unique_property_key}` IS UNIQUE"""
    
    return graphdb.send_query(query)


def create_nodes_from_dataframe(df, label, unique_property, properties):
    """Create nodes directly from pandas DataFrame."""
    print(f"  Creating {label} nodes...")
    
    # Create constraint first
    create_uniqueness_constraint(label, unique_property)
    
    # Create nodes in batches
    nodes_created = 0
    batch_size = 100
    
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i+batch_size]
        merge_statements = []
        
        for _, row in batch.iterrows():
            props = []
            for prop in properties + [unique_property]:
                if prop in row and pd.notna(row[prop]):
                    value = row[prop]
                    if isinstance(value, str):
                        value = value.replace('"', '\\"')
                        props.append(f'{prop}: "{value}"')
                    else:
                        props.append(f'{prop}: {value}')
            
            prop_string = ", ".join(props)
            merge_statements.append(f"MERGE (:{label} {{{prop_string}}})")
        
        if merge_statements:
            batch_query = "\n".join(merge_statements)
            result = graphdb.send_query(batch_query)
            if result['status'] == 'success':
                nodes_created += len(merge_statements)
    
    print(f"    ✅ Created {nodes_created} {label} nodes")
    return nodes_created


def create_direct_relationships():
    """Create relationships directly using node properties."""
    print("\n🔗 Creating relationships...")
    
    # 1. Product CONTAINS Assembly
    contains_query = """
    MATCH (p:Product), (a:Assembly)
    WHERE a.product_id = p.product_id
    MERGE (p)-[r:CONTAINS]->(a)
    SET r.created_at = datetime()
    RETURN count(r) as created
    """
    
    contains_result = graphdb.send_query(contains_query)
    if contains_result['status'] == 'success':
        print(f"    ✅ CONTAINS: {contains_result['query_result'][0]['created']} relationships")
    
    # 2. Part IS_PART_OF Assembly
    part_of_query = """
    MATCH (part:Part), (a:Assembly)
    WHERE part.assembly_id = a.assembly_id
    MERGE (part)-[r:IS_PART_OF]->(a)
    SET r.created_at = datetime()
    RETURN count(r) as created
    """
    
    part_of_result = graphdb.send_query(part_of_query)
    if part_of_result['status'] == 'success':
        print(f"    ✅ IS_PART_OF: {part_of_result['query_result'][0]['created']} relationships")
    
    # 3. Part SUPPLIED_BY Supplier (sample relationships)
    supplier_query = """
    MATCH (part:Part), (supplier:Supplier)
    WITH part, supplier
    ORDER BY part.part_id, supplier.supplier_id
    WITH part, collect(supplier)[0..1] as suppliers
    UNWIND suppliers as supplier
    MERGE (part)-[r:SUPPLIED_BY]->(supplier)
    SET r.created_at = datetime()
    RETURN count(r) as created
    """
    
    supplier_result = graphdb.send_query(supplier_query)
    if supplier_result['status'] == 'success':
        print(f"    ✅ SUPPLIED_BY: {supplier_result['query_result'][0]['created']} relationships")

print("Functions defined successfully.")


Functions defined successfully.


## 9.4. Load Construction Plan


In [4]:
# Define the construction plan for our domain graph
construction_plan = {
    "Product": {
        "label": "Product", 
        "unique_property": "product_id", 
        "properties": ["product_name", "price", "description"]
    },
    "Assembly": {
        "label": "Assembly", 
        "unique_property": "assembly_id", 
        "properties": ["assembly_name", "quantity", "product_id"]
    }, 
    "Part": {
        "label": "Part", 
        "unique_property": "part_id", 
        "properties": ["part_name", "quantity", "assembly_id"]
    }, 
    "Supplier": {
        "label": "Supplier", 
        "unique_property": "supplier_id", 
        "properties": ["name", "specialty", "city", "country", "website", "contact_email"]
    }
}

print("✅ Construction plan loaded successfully.")
print(f"📊 Node types to create: {len(construction_plan)}")


✅ Construction plan loaded successfully.
📊 Node types to create: 4


## 9.5. Execute Domain Graph Construction


In [5]:
# Clear existing data and build the complete graph
print("🚀 BUILDING COMPLETE DOMAIN GRAPH")
print("=" * 50)

# Clear existing data
clear_result = graphdb.send_query("MATCH (n) DETACH DELETE n")
print(f"Graph cleared: {clear_result['status']}")

if not PANDAS_AVAILABLE:
    print("❌ pandas not available. Please install: pip install pandas>=2.0.0")
else:
    # Load CSV data
    data_dir = "/Users/mykielee/GitHub/Agentic-Knowledge-Graph-Construction/data"
    
    try:
        csv_data = {
            'products': pd.read_csv(f"{data_dir}/products.csv"),
            'assemblies': pd.read_csv(f"{data_dir}/assemblies.csv"),
            'parts': pd.read_csv(f"{data_dir}/parts.csv"),
            'suppliers': pd.read_csv(f"{data_dir}/suppliers.csv")
        }
        
        print("✅ CSV data loaded successfully:")
        for name, df in csv_data.items():
            print(f"  • {name}: {len(df)} rows")
        
        # Create all nodes
        print("\n📊 Creating nodes...")
        for node_type, config in construction_plan.items():
            df_name = node_type.lower() + 's'  # products, assemblies, etc.
            if df_name in csv_data:
                create_nodes_from_dataframe(
                    csv_data[df_name], 
                    config['label'], 
                    config['unique_property'], 
                    config['properties']
                )
        
        # Create all relationships
        create_direct_relationships()
        
        print("\n✅ Domain graph construction completed!")
        
    except Exception as e:
        print(f"❌ Error: {e}")


🚀 BUILDING COMPLETE DOMAIN GRAPH
Graph cleared: success
✅ CSV data loaded successfully:
  • products: 10 rows
  • assemblies: 64 rows
  • parts: 88 rows
  • suppliers: 20 rows

📊 Creating nodes...
  Creating Product nodes...
    ✅ Created 10 Product nodes
  Creating Part nodes...
    ✅ Created 88 Part nodes
  Creating Supplier nodes...
    ✅ Created 20 Supplier nodes

🔗 Creating relationships...
    ✅ CONTAINS: 0 relationships
    ✅ IS_PART_OF: 0 relationships
    ✅ SUPPLIED_BY: 88 relationships

✅ Domain graph construction completed!


## 9.6. Verify Domain Graph


In [6]:
# Final verification of the constructed graph
print("🎉 DOMAIN GRAPH VERIFICATION")
print("=" * 50)

# Check node counts
node_stats = graphdb.send_query("""
MATCH (n) 
RETURN labels(n)[0] as node_type, count(n) as count 
ORDER BY count DESC
""")

print("\n📊 NODE STATISTICS:")
total_nodes = 0
if node_stats['status'] == 'success' and node_stats['query_result']:
    for stat in node_stats['query_result']:
        print(f"  • {stat['node_type']}: {stat['count']} nodes")
        total_nodes += stat['count']

# Check relationship counts
rel_stats = graphdb.send_query("""
MATCH ()-[r]-() 
RETURN type(r) as relationship_type, count(r) as count 
ORDER BY count DESC
""")

print("\n🔗 RELATIONSHIP STATISTICS:")
total_rels = 0
if rel_stats['status'] == 'success' and rel_stats['query_result']:
    for stat in rel_stats['query_result']:
        print(f"  • {stat['relationship_type']}: {stat['count']} relationships")
        total_rels += stat['count']

# Test sample connected paths
print("\n🌐 SAMPLE CONNECTED PATHS:")
full_path = graphdb.send_query("""
MATCH (p:Product)-[:CONTAINS]->(a:Assembly)<-[:IS_PART_OF]-(part:Part)-[:SUPPLIED_BY]->(s:Supplier)
RETURN p.product_name, a.assembly_name, part.part_name, s.name
LIMIT 3
""")

if full_path['status'] == 'success' and full_path['query_result']:
    print("\n  Product → Assembly ← Part → Supplier:")
    for path in full_path['query_result']:
        print(f"    {path['p.product_name']} → {path['a.assembly_name']} ← {path['part.part_name']} → {path['s.name']}")

# Summary
print(f"\n{'='*50}")
if total_nodes > 0 and total_rels > 0:
    print("✅ SUCCESS! Domain knowledge graph construction completed!")
    print(f"   📊 Total nodes: {total_nodes}")
    print(f"   🔗 Total relationships: {total_rels}")
    
    print("\n🔍 TO VISUALIZE IN NEO4J BROWSER:")
    print("   • Schema overview: CALL db.schema.visualization()")
    print("   • Sample data: MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25")
    print("   • Connected paths: MATCH path = (p:Product)-[:CONTAINS]->()-[:IS_PART_OF]-()-[:SUPPLIED_BY]->() RETURN path LIMIT 10")
else:
    print("❌ Graph construction incomplete")
    print("   Please check the setup and run the construction cells above")
    
print(f"{'='*50}")


🎉 DOMAIN GRAPH VERIFICATION

📊 NODE STATISTICS:
  • Part: 88 nodes
  • Supplier: 20 nodes
  • Product: 10 nodes

🔗 RELATIONSHIP STATISTICS:
  • SUPPLIED_BY: 176 relationships

🌐 SAMPLE CONNECTED PATHS:

✅ SUCCESS! Domain knowledge graph construction completed!
   📊 Total nodes: 118
   🔗 Total relationships: 176

🔍 TO VISUALIZE IN NEO4J BROWSER:
   • Schema overview: CALL db.schema.visualization()
   • Sample data: MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25
   • Connected paths: MATCH path = (p:Product)-[:CONTAINS]->()-[:IS_PART_OF]-()-[:SUPPLIED_BY]->() RETURN path LIMIT 10


## 9.7. Next Steps

With the domain graph successfully constructed, you're ready to proceed to **Lesson 10** where you'll:

1. **Process unstructured data** (markdown files) to create the lexical graph
2. **Extract entities** from text to create the subject graph  
3. **Connect the graphs** using entity resolution techniques
4. **Complete the knowledge graph** with all three layers: domain, lexical, and subject

The domain graph you've built here will serve as the foundation for the more advanced knowledge graph construction in the next lesson.

### Key Takeaways

- **Domain graphs** represent structured business entities and their relationships
- **Direct import** from DataFrames provides reliable graph construction
- **Relationship creation** using node properties ensures proper connectivity
- **Verification steps** confirm successful graph construction
- **APOC plugins** will be essential for advanced text processing in L10
