[FEA] GFQL native type system: schemas, inference, validation, and Arrow representation #1046

@lmeyerov

Summary

Add a native type system to GFQL covering schema specification, inference, query validation, and typed data representation (Arrow). This is a foundational capability that would improve correctness, performance, and developer experience across the GFQL stack.

Motivation

  • Correctness: Catch schema mismatches at query compile time instead of runtime
  • Performance: Arrow-typed columns enable zero-copy GPU transfer and columnar optimization
  • Developer experience: Autocompletion, documentation, and error messages that reference schema
  • Interop: Typed schemas enable code generation for downstream consumers (TypeScript, Rust, etc.)

Scope

1. Schema specification

Define node and edge schemas declaratively:

from datetime import datetime
from typing import Optional

from graphistry.schema import NodeType, EdgeType, GraphSchema

Person = NodeType("Person", {
    "id": int,
    "name": str,
    "age": Optional[int],
    "scores": list[float],
})

Company = NodeType("Company", {
    "id": int,
    "name": str,
    "founded": datetime,
})

WorksAt = EdgeType("WORKS_AT",
    source=Person,
    destination=Company,
    properties={
        "since": datetime,
        "role": str,
    },
)

schema = GraphSchema(
    node_types=[Person, Company],
    edge_types=[WorksAt],
)

Design considerations:

  • Multi-label: Cypher nodes can have multiple labels (`:Person:Employee`). Schema should support this — a node satisfies a type if it has ALL required labels.
  • Topology constraints: Edge types should declare valid (source_type, destination_type) pairs. `WORKS_AT` can only connect Person → Company.
  • Union types: A property might be `str | int` (heterogeneous data). Support via Python union syntax or explicit `Union[str, int]`.
  • Optional fields: Properties that may be null/missing. Align with `Optional[T]` / `T | None`.
  • Pydantic alignment: Consider using or extending Pydantic models for schema definition, getting validation, serialization, and IDE support for free.
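
To make the Pydantic-alignment option concrete, here is a minimal sketch (assuming Pydantic v2) of node properties declared as BaseModel subclasses; how NodeType/GraphSchema would wrap these models is left open:

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Person(BaseModel):
    id: int
    name: str
    age: Optional[int] = None   # nullable / possibly missing property
    scores: list[float] = []

class Company(BaseModel):
    id: int
    name: str
    founded: datetime

# Property names and annotations are recoverable for inference and Arrow mapping:
for name, field in Person.model_fields.items():
    print(name, field.annotation)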

2. Schema inference

Infer schemas from existing graph data:

schema = g.infer_schema()
# Returns GraphSchema with node types derived from label__* columns,
# edge types from relationship type column, property types from DataFrame dtypes

Design considerations:

  • Infer from `label__X` boolean columns (existing GFQL convention)
  • Infer property types from pandas/cudf dtypes → Arrow types (see the sketch after this list)
  • Handle mixed-type columns (object dtype) gracefully
  • Detect topology patterns (which edge types connect which node types)
  • Support incremental refinement: infer base schema, then user annotates/overrides
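
A minimal sketch of the dtype-driven part of inference, assuming pandas nodes with `label__X` boolean columns; the function names here are illustrative, not existing API:

from typing import Dict, List

import pandas as pd
import pyarrow as pa

def infer_node_labels(df: pd.DataFrame) -> List[str]:
    """Node type names derived from label__X boolean columns."""
    return [c[len("label__"):] for c in df.columns if c.startswith("label__")]

def infer_property_types(df: pd.DataFrame) -> Dict[str, pa.DataType]:
    """Map each non-label column's pandas dtype to an Arrow type."""
    out: Dict[str, pa.DataType] = {}
    for col in df.columns:
        if col.startswith("label__"):
            continue
        dtype = df[col].dtype
        if dtype == object:
            out[col] = pa.string()  # mixed/object columns: fall back to utf8 rather than guess
        else:
            out[col] = pa.from_numpy_dtype(dtype)
    return out

nodes = pd.DataFrame({"id": [0, 1], "age": [34, None], "label__Person": [True, True]})
infer_node_labels(nodes)      # ['Person']
infer_property_types(nodes)   # {'id': int64, 'age': double}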

3. Query validation against schema

Validate GFQL chains and Cypher queries against a schema:

schema = GraphSchema(...)
g = g.bind(schema=schema)

# Compile-time validation:
g.gfql("MATCH (p:Person)-[:WORKS_AT]->(c:Company) RETURN p.age, c.nonexistent")
# → SchemaValidationError: Company has no property 'nonexistent'

g.gfql("MATCH (p:Person)-[:WORKS_AT]->(q:Person) RETURN p, q")
# → SchemaValidationError: WORKS_AT edge type requires destination=Company, got Person

Design considerations:

  • Validate at Cypher compile time (in lowering.py) — no runtime cost; a sketch of the property check follows this list
  • Validate native GFQL chains via schema-aware `n()` / `e()` constructors
  • Provide helpful error messages referencing the schema definition
  • Optional strict mode vs permissive mode (warn vs error on unknown properties)
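
A sketch of the compile-time property check this could plug into; `SchemaValidationError` and the lookup structure are proposed here, not existing code:

from typing import Dict, Set

class SchemaValidationError(ValueError):
    pass

def check_property_ref(node_props: Dict[str, Set[str]], label: str, prop: str) -> None:
    """Raise at compile time if a query references a property the schema does not declare."""
    known = node_props.get(label)
    if known is not None and prop not in known:
        raise SchemaValidationError(
            f"{label} has no property '{prop}' (known: {sorted(known)})"
        )

node_props = {
    "Person": {"id", "name", "age", "scores"},
    "Company": {"id", "name", "founded"},
}
check_property_ref(node_props, "Company", "nonexistent")
# SchemaValidationError: Company has no property 'nonexistent' (known: ['founded', 'id', 'name'])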

4. Arrow representation

Map schema types to Arrow types for efficient columnar storage:

schema.to_arrow_schema()
# Returns pyarrow.Schema with typed fields for each property

# Load/save with enforced types:
g = graphistry.from_arrow(nodes_table, edges_table, schema=schema)
g.to_arrow(schema=schema)  # Validates and casts to schema types

Design considerations:

  • Map Python types → Arrow types: `int → int64`, `str → utf8`, `list[float] → list`, etc. (sketched after this list)
  • Support cudf/RAPIDS Arrow interop
  • Enable zero-copy roundtrip: Arrow IPC → cudf → GFQL → Arrow IPC
  • Schema evolution: handle missing columns, extra columns, type coercion
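
A possible shape for the mapping behind `to_arrow_schema()`, covering only a few scalar types plus `list[...]` (a sketch, not the final mapping):

from datetime import datetime
from typing import Dict

import pyarrow as pa

PY_TO_ARROW: Dict[type, pa.DataType] = {
    int: pa.int64(),
    float: pa.float64(),
    str: pa.utf8(),
    bool: pa.bool_(),
    datetime: pa.timestamp("us"),
}

def to_arrow_schema(properties: Dict[str, type]) -> pa.Schema:
    """Build a pyarrow.Schema from a {property name: Python type} mapping."""
    fields = []
    for name, py_type in properties.items():
        if getattr(py_type, "__origin__", None) is list:
            (item_type,) = py_type.__args__          # e.g. list[float]
            fields.append(pa.field(name, pa.list_(PY_TO_ARROW[item_type])))
        else:
            fields.append(pa.field(name, PY_TO_ARROW[py_type]))
    return pa.schema(fields)

to_arrow_schema({"id": int, "name": str, "scores": list[float]})
# id: int64, name: string, scores: list<item: double>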

Architecture questions

  1. Where does schema live? On the Plottable? As a separate object? Both?
  2. GFQL-first or Cypher-first? If we start at GFQL (schema-aware `n()`/`e()`), Cypher gets validation for free via the existing lowering path. Starting at Cypher requires mapping Cypher types to GFQL types.
  3. Pydantic integration depth: Full Pydantic models (with validation, serialization) vs lightweight dataclasses with Pydantic-style annotations?
  4. Inference vs declaration: Should `infer_schema()` produce the same schema objects as manual declaration? Or separate "inferred" vs "declared" types?
  5. Incremental adoption: How to add schemas to existing untyped graphs without breaking anything?

Prior art

  • Neo4j constraints: `CREATE CONSTRAINT ... REQUIRE (n.prop) IS :: INTEGER` — runtime enforcement
  • Apache AGE: PostgreSQL-based, inherits PostgreSQL type system
  • Kuzu: Built-in schema with typed node/edge tables
  • Pydantic: Python schema validation library — potential building block
  • Apache Arrow Schema: Columnar type system — the target representation
  • GraphQL: Typed schema for API queries — similar schema-first philosophy
  • openCypher type system: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `LIST`, `MAP`, `PATH`, `NODE`, `RELATIONSHIP`

Suggested approach

  1. Spike: Define `NodeType`, `EdgeType`, `GraphSchema` dataclasses. Implement `infer_schema()` from existing graph data. (A minimal dataclass sketch follows this list.)
  2. Validate: Add schema validation to the Cypher lowering path — check property references and topology constraints at compile time.
  3. Arrow: Map schema to `pyarrow.Schema`, add `to_arrow()` / `from_arrow()` with schema enforcement.
  4. Iterate: Pydantic integration, multi-label support, union types based on real usage patterns.
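
One possible starting point for the step-1 spike (plain dataclasses, before any Pydantic integration); the names are the ones proposed above, none exist yet:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class NodeType:
    name: str
    properties: Dict[str, type]

@dataclass(frozen=True)
class EdgeType:
    name: str
    source: NodeType
    destination: NodeType
    properties: Dict[str, type] = field(default_factory=dict)

@dataclass(frozen=True)
class GraphSchema:
    node_types: List[NodeType]
    edge_types: List[EdgeType]

    def topology(self) -> List[Tuple[str, str, str]]:
        """(source label, edge type, destination label) triples for topology validation."""
        return [(e.source.name, e.name, e.destination.name) for e in self.edge_types]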

Relationship to existing code

  • `graphistry/compute/gfql/cypher/lowering.py`: Cypher property references validated here — schema validation plugs in naturally
  • `graphistry/compute/ast.py`: `ASTNode`, `ASTEdge` could carry schema type info
  • `graphistry/Engine.py`: Engine resolution (pandas/cudf) — Arrow bridge point
  • `graphistry/Plottable.py`: Schema could attach here as `._schema`
  • `label__X` convention: Existing multi-label encoding — schema inference reads these

AI contributor notes

  • Repo AI guidance: `AGENTS.md`, `ai/README.md`
  • GFQL architecture: `ai/docs/` has guides for the query pipeline
  • Test patterns: `graphistry/tests/compute/gfql/cypher/test_lowering.py` (600+ tests)
  • The Cypher compiler is pure Python (no pandas dependency) — schema validation can be added without runtime overhead
