# Phase 1 User Journey - Agentic Graph RAG

This notebook demonstrates the core functionality of Phase 1:

1. **Project Management** - Create and manage projects
2. **Schema Definition** - Define custom node and edge schemas
3. **Node Creation** - Create nodes with structured and unstructured data
4. **Edge Creation** - Create relationships between nodes
5. **Querying** - Query the knowledge graph
6. **Schema Versioning** - Manage schema versions

---

## Setup

In [None]:
# Add the project's code directory to the import path
import sys
from pathlib import Path

# Get the project root (assumes the working directory is three levels below it)
project_root = Path().absolute().parent.parent.parent
sys.path.insert(0, str(project_root / "code"))

print(f"Project root: {project_root}")

In [None]:
# Imports
from datetime import datetime
from uuid import uuid4

from graph_rag.models.project import Project, ProjectStatus, ProjectConfig
from graph_rag.models.schema import Schema, SchemaType
from graph_rag.models.node import Node, UnstructuredBlob, ChunkMetadata, NodeMetadata
from graph_rag.models.edge import Edge, EdgeDirection, EdgeMetadata
from graph_rag.validation import (
    StructuredDataValidator,
    UnstructuredDataValidator,
    VectorValidator,
    SchemaVersionValidator
)
from graph_rag.db import (
    init_database,
    test_connection,
    get_db,
    close_database
)

print("✓ Imports successful")

---

## 1. Database Setup

Initialize the database connection and create tables.

In [None]:
# Test connection
print("Testing database connection...")
connection_ok = test_connection()

if not connection_ok:
    print("\n⚠️  Database connection failed. Please check your .env file.")
    print("Required environment variables:")
    print("  - SNOWFLAKE_ACCOUNT")
    print("  - SNOWFLAKE_USER")
    print("  - SNOWFLAKE_PASSWORD")
    print("  - SNOWFLAKE_WAREHOUSE")
    print("  - SNOWFLAKE_DATABASE")
else:
    print("\n✓ Database connection established")
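
When the connection test fails, checking which variables are actually set can narrow down the problem. The following is a minimal sketch, assuming the `.env` values have already been loaded into the process environment; `missing_env_vars` is a hypothetical helper and not part of `graph_rag`.

```python
import os

# Variable names taken from the message printed by the cell above.
REQUIRED_ENV_VARS = [
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set")
```

Passing a plain dict as `environ` makes the helper easy to try out without touching the real environment.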

In [None]:
# Initialize database (create tables)
print("Initializing database tables...")
init_database()
print("\n✓ Database initialized")

---

## 2. Create a Project

Projects are the top-level containers for knowledge graphs.

In [None]:
# Create a new project
db = get_db()

with db.get_session() as session:
    # Create project
    project = Project(
        project_name="research-papers",
        display_name="Research Papers Knowledge Graph",
        description="A knowledge graph of research papers, authors, and institutions",
        owner_id="user_123",
        owner_email="researcher@example.com"
    )
    
    # Add tags
    project.add_tag("research")
    project.add_tag("academic")
    project.add_tag("papers")
    
    # Configure project settings
    project.update_config(
        default_embedding_model="text-embedding-3-small",
        embedding_dimension=1536,
        default_chunk_size=512,
        enable_auto_embedding=True,
        enable_entity_resolution=True
    )
    
    session.add(project)
    session.commit()
    session.refresh(project)
    
    project_id = project.project_id
    
    print(f"✓ Project created: {project.display_name}")
    print(f"  ID: {project.project_id}")
    print(f"  Name: {project.project_name}")
    print(f"  Status: {project.status.value}")
    print(f"  Tags: {', '.join(project.tags)}")
    print(f"  Created: {project.created_at}")

---

## 3. Define Node Schemas

Define schemas for different node types (entities).

In [None]:
# Define schema for Author nodes
with db.get_session() as session:
    author_schema = Schema(
        schema_name="Author",
        schema_type=SchemaType.NODE,
        version="1.0.0",
        description="Schema for author entities",
        project_id=project_id,
        structured_data_schema={
            "name": {
                "type": "string",
                "required": True,
                "description": "Full name of the author"
            },
            "h_index": {
                "type": "integer",
                "required": False,
                "min": 0,
                "description": "H-index score"
            },
            "affiliation": {
                "type": "string",
                "required": False,
                "description": "Primary affiliation/institution"
            },
            "email": {
                "type": "string",
                "required": False,
                "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$",
                "description": "Contact email"
            }
        },
        unstructured_data_config={
            "chunk_size": 512,
            "chunk_overlap": 50,
            "allowed_blob_types": ["biography", "research_interests"]
        },
        vector_config={
            "dimension": 1536,
            "model": "text-embedding-3-small",
            "precision": "float32"
        },
        is_active=True
    )
    
    session.add(author_schema)
    session.commit()
    session.refresh(author_schema)
    
    author_schema_id = author_schema.schema_id
    
    print("✓ Author schema created")
    print(f"  ID: {author_schema.schema_id}")
    print(f"  Version: {author_schema.version}")
    print(f"  Attributes: {', '.join(author_schema.get_attribute_names())}")

In [None]:
# Define schema for Paper nodes
with db.get_session() as session:
    paper_schema = Schema(
        schema_name="Paper",
        schema_type=SchemaType.NODE,
        version="1.0.0",
        description="Schema for research paper entities",
        project_id=project_id,
        structured_data_schema={
            "title": {
                "type": "string",
                "required": True,
                "description": "Paper title"
            },
            "year": {
                "type": "integer",
                "required": True,
                "min": 1900,
                "max": 2100,
                "description": "Publication year"
            },
            "venue": {
                "type": "string",
                "required": False,
                "description": "Publication venue (journal/conference)"
            },
            "citations": {
                "type": "integer",
                "required": False,
                "min": 0,
                "description": "Number of citations"
            },
            "doi": {
                "type": "string",
                "required": False,
                "description": "Digital Object Identifier"
            }
        },
        unstructured_data_config={
            "chunk_size": 512,
            "chunk_overlap": 50,
            "allowed_blob_types": ["abstract", "full_text", "conclusions"]
        },
        vector_config={
            "dimension": 1536,
            "model": "text-embedding-3-small"
        },
        is_active=True
    )
    
    session.add(paper_schema)
    session.commit()
    session.refresh(paper_schema)
    
    paper_schema_id = paper_schema.schema_id
    
    print("✓ Paper schema created")
    print(f"  ID: {paper_schema.schema_id}")
    print(f"  Version: {paper_schema.version}")
    print(f"  Attributes: {', '.join(paper_schema.get_attribute_names())}")
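
Both node schemas declare a `vector_config` with a fixed `dimension`. As a rough illustration of what checking an embedding against that config could involve, here is a stand-alone sketch. `check_vector` is a hypothetical function; the imported `VectorValidator` is not exercised in this notebook and may behave differently.

```python
import math


def check_vector(vector, vector_config):
    """Return (ok, error) for `vector` against a config with a `dimension` key."""
    expected = vector_config["dimension"]
    if len(vector) != expected:
        return False, f"expected {expected} dimensions, got {len(vector)}"
    # Reject NaN, infinities, and non-numeric entries.
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        return False, "vector contains non-finite or non-numeric values"
    return True, None


config = {"dimension": 1536, "model": "text-embedding-3-small"}
print(check_vector([0.0] * 1536, config))
print(check_vector([0.0] * 768, config))
```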

---

## 4. Define Edge Schema

Define a schema for relationships (edges) between nodes.

In [None]:
# Define schema for AUTHORED relationship
with db.get_session() as session:
    authored_schema = Schema(
        schema_name="AUTHORED",
        schema_type=SchemaType.EDGE,
        version="1.0.0",
        description="Author wrote a paper",
        project_id=project_id,
        structured_data_schema={
            "author_position": {
                "type": "integer",
                "required": True,
                "min": 1,
                "description": "Position in author list (1=first, 2=second, etc.)"
            },
            "is_corresponding": {
                "type": "boolean",
                "required": False,
                "default": False,
                "description": "Whether this is the corresponding author"
            },
            "contribution": {
                "type": "string",
                "required": False,
                "description": "Contribution statement"
            }
        },
        unstructured_data_config={
            "chunk_size": 256,
            "chunk_overlap": 25
        },
        is_active=True
    )
    
    session.add(authored_schema)
    session.commit()
    session.refresh(authored_schema)
    
    authored_schema_id = authored_schema.schema_id
    
    print("✓ AUTHORED schema created")
    print(f"  ID: {authored_schema.schema_id}")
    print(f"  Version: {authored_schema.version}")

---

## 5. Validate Schema Definitions

Check that each schema's attribute definitions are well-formed before using them to validate data.

In [None]:
# Validate author schema
is_valid, error = StructuredDataValidator.validate_schema_definition(
    author_schema.structured_data_schema
)

if is_valid:
    print("✓ Author schema definition is valid")
else:
    print(f"✗ Author schema validation failed: {error}")

# Validate paper schema
is_valid, error = StructuredDataValidator.validate_schema_definition(
    paper_schema.structured_data_schema
)

if is_valid:
    print("✓ Paper schema definition is valid")
else:
    print(f"✗ Paper schema validation failed: {error}")

# Validate authored schema
is_valid, error = StructuredDataValidator.validate_schema_definition(
    authored_schema.structured_data_schema
)

if is_valid:
    print("✓ AUTHORED schema definition is valid")
else:
    print(f"✗ AUTHORED schema validation failed: {error}")
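
The checks behind `validate_schema_definition` are not shown in this notebook. The sketch below illustrates the kind of rules such a check might apply to the attribute dictionaries defined above. The function name, the set of known types, and the specific rules are all assumptions made for illustration.

```python
import re

# Assumed set of type names; the schemas above only use these three.
KNOWN_TYPES = {"string", "integer", "boolean"}


def check_schema_definition(definition):
    """Return (is_valid, error) for a dict mapping attribute names to specs."""
    for attr, spec in definition.items():
        if not isinstance(spec, dict):
            return False, f"{attr}: spec must be a dict"
        if spec.get("type") not in KNOWN_TYPES:
            return False, f"{attr}: unknown type {spec.get('type')!r}"
        if not isinstance(spec.get("required", False), bool):
            return False, f"{attr}: 'required' must be a boolean"
        if "min" in spec and "max" in spec and spec["min"] > spec["max"]:
            return False, f"{attr}: min is greater than max"
        if "pattern" in spec:
            try:
                re.compile(spec["pattern"])  # the pattern must be a valid regex
            except re.error as exc:
                return False, f"{attr}: invalid pattern ({exc})"
    return True, None


year_only = {"year": {"type": "integer", "required": True, "min": 1900, "max": 2100}}
print(check_schema_definition(year_only))
```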

---

## 6. Create Nodes (Entities)

Create author and paper nodes.

In [None]:
# Create an author node
with db.get_session() as session:
    # Prepare structured data
    author_data = {
        "name": "Dr. Alice Johnson",
        "h_index": 42,
        "affiliation": "Stanford University",
        "email": "alice.johnson@stanford.edu"
    }
    
    # Validate structured data
    is_valid, error, coerced_data = StructuredDataValidator.validate_structured_data(
        author_data,
        author_schema.structured_data_schema
    )
    
    if not is_valid:
        print(f"✗ Validation failed: {error}")
    else:
        # Create unstructured blob (biography)
        biography_blob = UnstructuredBlob(
            blob_id="biography",
            content="Dr. Alice Johnson is a leading researcher in machine learning "
                    "and natural language processing. She has published over 100 papers "
                    "in top-tier conferences and journals, focusing on transformer "
                    "architectures and their applications to knowledge graphs.",
            content_type="text/plain",
            language="en",
            chunks=[]  # Would be populated by chunking service
        )
        
        # Create author node
        author = Node(
            node_name="Dr. Alice Johnson",
            entity_type="Author",
            schema_id=author_schema_id,
            structured_data=coerced_data,
            unstructured_data=[biography_blob],
            project_id=project_id,
            metadata=NodeMetadata(
                extraction_method="manual",
                tags=["ml", "nlp", "transformer"],
                confidence_score=1.0
            ),
            created_by="user_123"
        )
        
        session.add(author)
        session.commit()
        session.refresh(author)
        
        author_node_id = author.node_id
        
        print(f"✓ Author node created: {author.node_name}")
        print(f"  ID: {author.node_id}")
        print(f"  H-index: {author.get_structured_attribute('h_index')}")
        print(f"  Affiliation: {author.get_structured_attribute('affiliation')}")

In [None]:
# Create a paper node
with db.get_session() as session:
    # Prepare structured data
    paper_data = {
        "title": "Attention Is All You Need",
        "year": 2017,
        "venue": "NeurIPS",
        "citations": 85000,
        "doi": "10.5555/3295222.3295349"
    }
    
    # Validate
    is_valid, error, coerced_data = StructuredDataValidator.validate_structured_data(
        paper_data,
        paper_schema.structured_data_schema
    )
    
    if not is_valid:
        print(f"✗ Validation failed: {error}")
    else:
        # Create unstructured blob (abstract)
        abstract_blob = UnstructuredBlob(
            blob_id="abstract",
            content="The dominant sequence transduction models are based on complex "
                    "recurrent or convolutional neural networks in an encoder-decoder "
                    "configuration. The best performing models also connect the encoder "
                    "and decoder through an attention mechanism. We propose a new simple "
                    "network architecture, the Transformer, based solely on attention "
                    "mechanisms, dispensing with recurrence and convolutions entirely.",
            content_type="text/plain",
            language="en",
            chunks=[]
        )
        
        # Create paper node
        paper = Node(
            node_name="Attention Is All You Need",
            entity_type="Paper",
            schema_id=paper_schema_id,
            structured_data=coerced_data,
            unstructured_data=[abstract_blob],
            project_id=project_id,
            metadata=NodeMetadata(
                extraction_method="manual",
                tags=["transformer", "attention", "nlp"],
                confidence_score=1.0
            ),
            created_by="user_123"
        )
        
        session.add(paper)
        session.commit()
        session.refresh(paper)
        
        paper_node_id = paper.node_id
        
        print(f"✓ Paper node created: {paper.node_name}")
        print(f"  ID: {paper.node_id}")
        print(f"  Year: {paper.get_structured_attribute('year')}")
        print(f"  Citations: {paper.get_structured_attribute('citations')}")
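
Both cells in this section rely on `validate_structured_data` returning an `(is_valid, error, coerced_data)` triple. The sketch below shows one way such a function could apply the `required`, `min`/`max`, `pattern`, and `default` keys used in the schemas. It is an illustration under assumed rules, not the `graph_rag` implementation, and `validate_record` is a hypothetical name.

```python
import re


def validate_record(data, definition):
    """Return (is_valid, error, coerced) for `data` against attribute specs.

    Attributes in `data` that the definition does not mention are dropped.
    """
    coerced = {}
    for attr, spec in definition.items():
        if attr not in data:
            if spec.get("required", False):
                return False, f"missing required attribute: {attr}", None
            if "default" in spec:
                coerced[attr] = spec["default"]
            continue
        value = data[attr]
        kind = spec["type"]
        if kind == "integer":
            if isinstance(value, str):
                try:
                    value = int(value)  # coerce numeric strings such as "2017"
                except ValueError:
                    return False, f"{attr}: expected integer", None
            elif isinstance(value, bool) or not isinstance(value, int):
                # bool is a subclass of int, so it is rejected explicitly
                return False, f"{attr}: expected integer", None
            if "min" in spec and value < spec["min"]:
                return False, f"{attr}: below minimum {spec['min']}", None
            if "max" in spec and value > spec["max"]:
                return False, f"{attr}: above maximum {spec['max']}", None
        elif kind == "boolean":
            if not isinstance(value, bool):
                return False, f"{attr}: expected boolean", None
        elif kind == "string":
            if not isinstance(value, str):
                return False, f"{attr}: expected string", None
            if "pattern" in spec and not re.fullmatch(spec["pattern"], value):
                return False, f"{attr}: does not match pattern", None
        coerced[attr] = value
    return True, None, coerced


paper_definition = {
    "title": {"type": "string", "required": True},
    "year": {"type": "integer", "required": True, "min": 1900, "max": 2100},
}
print(validate_record({"title": "Attention Is All You Need", "year": "2017"}, paper_definition))
```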

---

## 7. Create Edge (Relationship)

Create an AUTHORED relationship between author and paper.

In [None]:
# Create AUTHORED edge
with db.get_session() as session:
    # Prepare structured data
    authored_data = {
        "author_position": 1,  # First author
        "is_corresponding": True,
        "contribution": "Conceived the Transformer architecture and led the research"
    }
    
    # Validate
    is_valid, error, coerced_data = StructuredDataValidator.validate_structured_data(
        authored_data,
        authored_schema.structured_data_schema
    )
    
    if not is_valid:
        print(f"✗ Validation failed: {error}")
    else:
        # Create edge
        authored_edge = Edge(
            edge_name="alice_authored_transformer",
            relationship_type="AUTHORED",
            schema_id=authored_schema_id,
            start_node_id=author_node_id,
            end_node_id=paper_node_id,
            direction=EdgeDirection.DIRECTED,
            structured_data=coerced_data,
            project_id=project_id,
            metadata=EdgeMetadata(
                extraction_method="manual",
                weight=1.0,
                confidence_score=1.0
            ),
            created_by="user_123"
        )
        
        session.add(authored_edge)
        session.commit()
        session.refresh(authored_edge)
        
        print("✓ AUTHORED edge created")
        print(f"  ID: {authored_edge.edge_id}")
        print(f"  From: {author_node_id}")
        print(f"  To: {paper_node_id}")
        print(f"  Position: {authored_edge.get_structured_attribute('author_position')}")
        print(f"  Corresponding: {authored_edge.get_structured_attribute('is_corresponding')}")

---

## 8. Query the Knowledge Graph

Retrieve data from the knowledge graph.

In [None]:
# Query all authors
with db.get_session() as session:
    authors = session.query(Node).filter(
        Node.entity_type == "Author",
        Node.project_id == project_id
    ).all()
    
    print(f"\nFound {len(authors)} author(s):")
    for author in authors:
        print(f"  - {author.node_name} (H-index: {author.get_structured_attribute('h_index')})")

In [None]:
# Query all papers
with db.get_session() as session:
    papers = session.query(Node).filter(
        Node.entity_type == "Paper",
        Node.project_id == project_id
    ).all()
    
    print(f"\nFound {len(papers)} paper(s):")
    for paper in papers:
        print(f"  - {paper.node_name} ({paper.get_structured_attribute('year')})")
        print(f"    Citations: {paper.get_structured_attribute('citations')}")

In [None]:
# Query edges (relationships)
with db.get_session() as session:
    edges = session.query(Edge).filter(
        Edge.relationship_type == "AUTHORED",
        Edge.project_id == project_id
    ).all()
    
    print(f"\nFound {len(edges)} AUTHORED relationship(s):")
    for edge in edges:
        print(f"  - {edge.edge_name}")
        print(f"    Position: {edge.get_structured_attribute('author_position')}")
        print(f"    Corresponding: {edge.get_structured_attribute('is_corresponding')}")
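
The queries above return nodes and edges separately. To read an edge as "author wrote paper", its `start_node_id` and `end_node_id` have to be matched back to nodes. The sketch below does this in memory with plain dataclasses standing in for the ORM rows. `NodeRow`, `EdgeRow`, and `resolve_edges` are hypothetical names that mirror only the fields used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class NodeRow:
    node_id: str
    node_name: str


@dataclass
class EdgeRow:
    start_node_id: str
    end_node_id: str
    relationship_type: str


def resolve_edges(edges, nodes, relationship_type):
    """Return (start_name, end_name) pairs for edges of the given type."""
    by_id = {node.node_id: node for node in nodes}
    pairs = []
    for edge in edges:
        if edge.relationship_type != relationship_type:
            continue
        start = by_id.get(edge.start_node_id)
        end = by_id.get(edge.end_node_id)
        if start is None or end is None:
            continue  # skip edges whose endpoints were not loaded
        pairs.append((start.node_name, end.node_name))
    return pairs


sample_nodes = [NodeRow("n1", "Dr. Alice Johnson"), NodeRow("n2", "Attention Is All You Need")]
sample_edges = [EdgeRow("n1", "n2", "AUTHORED")]
for author_name, paper_title in resolve_edges(sample_edges, sample_nodes, "AUTHORED"):
    print(f"{author_name} -> {paper_title}")
```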

---

## 9. Schema Versioning

Evolve a schema by adding a new version, then check compatibility between versions.

In [None]:
# Create a new version of Author schema with additional field
with db.get_session() as session:
    author_schema_v2 = Schema(
        schema_name="Author",
        schema_type=SchemaType.NODE,
        version="1.1.0",  # Minor version bump (backward compatible)
        description="Schema for author entities (v1.1 - added ORCID field)",
        project_id=project_id,
        structured_data_schema={
            "name": {
                "type": "string",
                "required": True,
                "description": "Full name of the author"
            },
            "h_index": {
                "type": "integer",
                "required": False,
                "min": 0,
                "description": "H-index score"
            },
            "affiliation": {
                "type": "string",
                "required": False,
                "description": "Primary affiliation/institution"
            },
            "email": {
                "type": "string",
                "required": False,
                "pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$",
                "description": "Contact email"
            },
            "orcid": {
                "type": "string",
                "required": False,
                "description": "ORCID identifier (NEW in v1.1)"
            }
        },
        unstructured_data_config=author_schema.unstructured_data_config,
        vector_config=author_schema.vector_config,
        is_active=True
    )
    
    session.add(author_schema_v2)
    session.commit()
    session.refresh(author_schema_v2)
    
    print("✓ Author schema v1.1.0 created")
    print("  New attribute: orcid")
    
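    # Mark v1.0.0 as inactive (superseded)
    author_schema_v1 = session.query(Schema).filter(
        Schema.schema_id == author_schema_id
    ).first()
    if author_schema_v1 is None:
        # .first() returns None when no row matches
        print("✗ Author schema v1.0.0 not found; nothing to deactivate")
    else:
        author_schema_v1.is_active = False
        session.commit()
        print("✓ Author schema v1.0.0 marked as inactive")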

In [None]:
# Check version compatibility
v1 = "1.0.0"
v2 = "1.1.0"
v3 = "2.0.0"

print("\nVersion compatibility checks:")
print(f"  {v1} -> {v2}: {SchemaVersionValidator.is_compatible(v1, v2)}")
print(f"  {v2} -> {v1}: {SchemaVersionValidator.is_compatible(v2, v1)}")
print(f"  {v1} -> {v3}: {SchemaVersionValidator.is_compatible(v1, v3)}")
print(f"  {v2} -> {v3}: {SchemaVersionValidator.is_compatible(v2, v3)}")
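
The exact rule behind `SchemaVersionValidator.is_compatible` is not shown here. One common semantic-versioning convention is sketched below as an assumption: a newer version is treated as backward compatible when the major number is unchanged and the version has not gone backwards. `parse_version` and `is_backward_compatible` are hypothetical names.

```python
def parse_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    return tuple(int(part) for part in parts)


def is_backward_compatible(old, new):
    """True if `new` has the same major version as `old` and is not older."""
    old_v, new_v = parse_version(old), parse_version(new)
    return old_v[0] == new_v[0] and new_v >= old_v


for old, new in [("1.0.0", "1.1.0"), ("1.1.0", "1.0.0"), ("1.0.0", "2.0.0")]:
    print(f"{old} -> {new}: {is_backward_compatible(old, new)}")
```

Under this convention, a minor bump that adds an optional field (such as `orcid`) stays compatible, while a major bump signals a breaking change.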

---

## 10. Project Statistics

Recompute and view project statistics.

In [None]:
# Update project statistics
with db.get_session() as session:
    project = session.query(Project).filter(
        Project.project_id == project_id
    ).first()
    
    # Count entities
    schema_count = session.query(Schema).filter(
        Schema.project_id == project_id
    ).count()
    
    node_count = session.query(Node).filter(
        Node.project_id == project_id
    ).count()
    
    edge_count = session.query(Edge).filter(
        Edge.project_id == project_id
    ).count()
    
    # Update stats
    project.update_stats(
        schema_count=schema_count,
        node_count=node_count,
        edge_count=edge_count
    )
    
    session.commit()
    session.refresh(project)
    
    print(f"\nProject: {project.display_name}")
    print(f"  Status: {project.status.value}")
    print(f"  Owner: {project.owner_id}")
    print(f"  Tags: {', '.join(project.tags)}")
    print("\nStatistics:")
    print(f"  Schemas: {project.stats.schema_count}")
    print(f"  Nodes: {project.stats.node_count}")
    print(f"  Edges: {project.stats.edge_count}")
    print(f"  Last updated: {project.stats.last_updated}")

---

## Summary

In this notebook, we demonstrated:

✅ **Project Management** - Created a project with an owner, tags, and configuration settings

✅ **Schema Definition** - Defined schemas for Author and Paper nodes and for AUTHORED edges

✅ **Schema Validation** - Validated schema definitions and structured data

✅ **Node Creation** - Created nodes with structured and unstructured data

✅ **Edge Creation** - Created relationships between nodes

✅ **Querying** - Queried nodes and edges from the knowledge graph

✅ **Schema Versioning** - Evolved schemas with semantic versioning

✅ **Statistics** - Tracked project statistics

### Next Steps

Phase 2 will add:
- Document ingestion and entity extraction
- Automatic embedding generation
- Entity resolution and deduplication
- Vector similarity search
- Graph traversal queries
- Agentic retrieval (vector + graph + filter)

---

In [None]:
# Cleanup (optional)
# close_database()
# print("✓ Database connections closed")