Summary
Add a native type system to GFQL covering schema specification, inference, query validation, and typed data representation (Arrow). This is a foundational capability that would improve correctness, performance, and developer experience across the GFQL stack.
Motivation
- Correctness: Catch schema mismatches at query compile time instead of runtime
- Performance: Arrow-typed columns enable zero-copy GPU transfer and columnar optimization
- Developer experience: Autocompletion, documentation, and error messages that reference schema
- Interop: Typed schemas enable code generation for downstream consumers (TypeScript, Rust, etc.)
Scope
1. Schema specification
Define node and edge schemas declaratively:
from datetime import datetime
from typing import Optional

from graphistry.schema import NodeType, EdgeType, GraphSchema

Person = NodeType("Person", {
    "id": int,
    "name": str,
    "age": Optional[int],
    "scores": list[float],
})

Company = NodeType("Company", {
    "id": int,
    "name": str,
    "founded": datetime,
})

WorksAt = EdgeType("WORKS_AT",
    source=Person,
    destination=Company,
    properties={
        "since": datetime,
        "role": str,
    },
)

schema = GraphSchema(
    node_types=[Person, Company],
    edge_types=[WorksAt],
)
Design considerations:
- Multi-label: Cypher nodes can have multiple labels (`:Person:Employee`). Schema should support this — a node satisfies a type if it has ALL required labels.
- Topology constraints: Edge types should declare valid (source_type, destination_type) pairs. `WORKS_AT` can only connect Person → Company.
- Union types: A property might be `str | int` (heterogeneous data). Support via Python union syntax or explicit `Union[str, int]`.
- Optional fields: Properties that may be null/missing. Align with `Optional[T]` / `T | None`.
- Pydantic alignment: Consider using or extending Pydantic models for schema definition, getting validation, serialization, and IDE support for free.
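As a starting point for the "lightweight dataclasses" option, a minimal sketch of the three schema objects (names and shapes are assumptions mirroring the proposed API, not an existing implementation):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass(frozen=True)
class NodeType:
    """A named node type with typed properties (name -> Python type)."""
    name: str
    properties: Dict[str, Any]

@dataclass(frozen=True)
class EdgeType:
    """A named edge type constrained to (source, destination) node types."""
    name: str
    source: NodeType
    destination: NodeType
    properties: Dict[str, Any] = field(default_factory=dict)

@dataclass
class GraphSchema:
    node_types: List[NodeType]
    edge_types: List[EdgeType]

    def node_type(self, name: str) -> NodeType:
        # Look up a node type by name; raises KeyError if undeclared
        return {nt.name: nt for nt in self.node_types}[name]
```

Swapping these for Pydantic models later would add field validation and serialization without changing the declaration syntax much.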
2. Schema inference
Infer schemas from existing graph data:
schema = g.infer_schema()
# Returns GraphSchema with node types derived from label__* columns,
# edge types from relationship type column, property types from DataFrame dtypes
Design considerations:
- Infer from `label__X` boolean columns (existing GFQL convention)
- Infer property types from pandas/cudf dtypes → Arrow types
- Handle mixed-type columns (object dtype) gracefully
- Detect topology patterns (which edge types connect which node types)
- Support incremental refinement: infer base schema, then user annotates/overrides
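The inference steps above can be sketched roughly as follows. The `label__*` column convention is from the source; the dtype-to-type mapping and function name are illustrative assumptions (a real implementation would target Arrow types directly and handle object-dtype columns more carefully):

```python
from datetime import datetime
import pandas as pd

# Hypothetical mapping from pandas dtype kinds to schema-level Python types
_DTYPE_TO_TYPE = {"i": int, "u": int, "f": float, "b": bool, "O": str, "M": datetime}

def infer_node_types(nodes_df: pd.DataFrame) -> dict:
    """Derive {label: {property: type}} from label__* boolean columns and dtypes."""
    label_cols = [c for c in nodes_df.columns if c.startswith("label__")]
    prop_cols = [c for c in nodes_df.columns if not c.startswith("label__")]
    inferred = {}
    for label_col in label_cols:
        label = label_col[len("label__"):]
        members = nodes_df[nodes_df[label_col] == True]  # noqa: E712
        # Keep only properties that are non-null for at least one member row
        props = {
            c: _DTYPE_TO_TYPE.get(members[c].dtype.kind, str)
            for c in prop_cols
            if members[c].notna().any()
        }
        inferred[label] = props
    return inferred
```

This also illustrates the incremental-refinement flow: the inferred dict is a base the user can annotate or override before promoting it to a declared schema.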
3. Query validation against schema
Validate GFQL chains and Cypher queries against a schema:
schema = GraphSchema(...)
g = g.bind(schema=schema)
# Compile-time validation:
g.gfql("MATCH (p:Person)-[:WORKS_AT]->(c:Company) RETURN p.age, c.nonexistent")
# → SchemaValidationError: Company has no property 'nonexistent'
g.gfql("MATCH (p:Person)-[:WORKS_AT]->(q:Person) RETURN p, q")
# → SchemaValidationError: WORKS_AT edge type requires destination=Company, got Person
Design considerations:
- Validate at Cypher compile time (in lowering.py) — no runtime cost
- Validate native GFQL chains via schema-aware `n()` / `e()` constructors
- Provide helpful error messages referencing the schema definition
- Optional strict mode vs permissive mode (warn vs error on unknown properties)
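A minimal sketch of the property-reference check that would plug into the lowering path (the `SchemaValidationError` name comes from the examples above; the flat `{label: {prop: type}}` schema shape is a simplifying assumption):

```python
class SchemaValidationError(ValueError):
    pass

def validate_property_ref(schema: dict, label: str, prop: str) -> None:
    """Raise if `label.prop` is not declared; schema is {label: {prop: type}}."""
    props = schema.get(label)
    if props is None:
        raise SchemaValidationError(f"Unknown node type '{label}'")
    if prop not in props:
        known = ", ".join(sorted(props)) or "(none)"
        raise SchemaValidationError(
            f"{label} has no property '{prop}'; declared properties: {known}"
        )
```

Listing the declared properties in the error message is what makes the "helpful error messages referencing the schema definition" goal concrete; a permissive mode would downgrade the raise to a warning.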
4. Arrow representation
Map schema types to Arrow types for efficient columnar storage:
schema.to_arrow_schema()
# Returns pyarrow.Schema with typed fields for each property
# Load/save with enforced types:
g = graphistry.from_arrow(nodes_table, edges_table, schema=schema)
g.to_arrow(schema=schema) # Validates and casts to schema types
Design considerations:
- Map Python types → Arrow types: `int → int64`, `str → utf8`, `list[float] → list<float64>`, `datetime → timestamp`, etc.
- Support cudf/RAPIDS Arrow interop
- Enable zero-copy roundtrip: Arrow IPC → cudf → GFQL → Arrow IPC
- Schema evolution: handle missing columns, extra columns, type coercion
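The type mapping can be sketched with a small recursive function. For illustration it returns Arrow type names as strings; a real implementation would return `pyarrow.DataType` objects (`pa.int64()`, `pa.utf8()`, ...) and cover more types (decimals, structs, dictionary-encoded categories):

```python
from datetime import datetime
from typing import Optional, get_args, get_origin

# Hypothetical base mapping from Python types to Arrow type names
_BASE = {int: "int64", float: "float64", str: "utf8",
         bool: "bool", datetime: "timestamp[us]"}

def to_arrow_type(py_type) -> str:
    origin = get_origin(py_type)
    if origin is list:
        (item,) = get_args(py_type)
        return f"list<{to_arrow_type(item)}>"
    if origin is not None:  # Optional[T] / unions arrive as typing constructs
        args = [a for a in get_args(py_type) if a is not type(None)]
        if len(args) == 1:
            # Arrow columns are nullable by default, so Optional[T] maps to T
            return to_arrow_type(args[0])
        raise TypeError(f"unsupported union: {py_type}")
    return _BASE[py_type]
```

Note the `Optional[T] → T` collapse: Arrow encodes nullability per field rather than in the type, which keeps the schema-level `Optional` annotation and the columnar representation aligned.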
Architecture questions
- Where does schema live? On the Plottable? As a separate object? Both?
- GFQL-first or Cypher-first? If we start at GFQL (schema-aware `n()`/`e()`), Cypher gets validation for free via the existing lowering path. Starting at Cypher requires mapping Cypher types to GFQL types.
- Pydantic integration depth: Full Pydantic models (with validation, serialization) vs lightweight dataclasses with Pydantic-style annotations?
- Inference vs declaration: Should `infer_schema()` produce the same schema objects as manual declaration? Or separate "inferred" vs "declared" types?
- Incremental adoption: How to add schemas to existing untyped graphs without breaking anything?
Prior art
- Neo4j constraints: `CREATE CONSTRAINT ... REQUIRE (n.prop) IS :: INTEGER` — runtime enforcement
- Apache AGE: PostgreSQL-based, inherits PostgreSQL type system
- Kuzu: Built-in schema with typed node/edge tables
- Pydantic: Python schema validation library — potential building block
- Apache Arrow Schema: Columnar type system — the target representation
- GraphQL: Typed schema for API queries — similar schema-first philosophy
- openCypher type system: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `LIST`, `MAP`, `PATH`, `NODE`, `RELATIONSHIP`
Suggested approach
- Spike: Define `NodeType`, `EdgeType`, `GraphSchema` dataclasses. Implement `infer_schema()` from existing graph data.
- Validate: Add schema validation to the Cypher lowering path — check property references and topology constraints at compile time.
- Arrow: Map schema to `pyarrow.Schema`, add `to_arrow()` / `from_arrow()` with schema enforcement.
- Iterate: Pydantic integration, multi-label support, union types based on real usage patterns.
Relationship to existing code
- `graphistry/compute/gfql/cypher/lowering.py`: Cypher property references validated here — schema validation plugs in naturally
- `graphistry/compute/ast.py`: `ASTNode`, `ASTEdge` could carry schema type info
- `graphistry/Engine.py`: Engine resolution (pandas/cudf) — Arrow bridge point
- `graphistry/Plottable.py`: Schema could attach here as `._schema`
- `label__X` convention: Existing multi-label encoding — schema inference reads these
AI contributor notes
- Repo AI guidance: `AGENTS.md`, `ai/README.md`
- GFQL architecture: `ai/docs/` has guides for the query pipeline
- Test patterns: `graphistry/tests/compute/gfql/cypher/test_lowering.py` (600+ tests)
- The Cypher compiler is pure Python (no pandas dependency) — schema validation can be added without runtime overhead