[FEA] GFQL native type system: schemas, inference, validation, and Arrow representation #1046

@lmeyerov

Summary

Add a native type system to GFQL covering schema specification, inference, query validation, and typed data representation (Arrow). This is a foundational capability that would improve correctness, performance, and developer experience across the GFQL stack.

Motivation

  • Correctness: Catch schema mismatches at query compile time instead of runtime
  • Performance: Arrow-typed columns enable zero-copy GPU transfer and columnar optimization
  • Developer experience: Autocompletion, documentation, and error messages that reference schema
  • Interop: Typed schemas enable code generation for downstream consumers (TypeScript, Rust, etc.)

Scope

1. Schema specification

Define node and edge schemas declaratively:

from datetime import datetime
from typing import Optional

from graphistry.schema import NodeType, EdgeType, GraphSchema

Person = NodeType("Person", {
    "id": int,
    "name": str,
    "age": Optional[int],
    "scores": list[float],
})

Company = NodeType("Company", {
    "id": int,
    "name": str,
    "founded": datetime,
})

WorksAt = EdgeType("WORKS_AT",
    source=Person,
    destination=Company,
    properties={
        "since": datetime,
        "role": str,
    },
)

schema = GraphSchema(
    node_types=[Person, Company],
    edge_types=[WorksAt],
)

Design considerations:

  • Multi-label: Cypher nodes can have multiple labels (`:Person:Employee`). Schema should support this — a node satisfies a type if it has ALL required labels.
  • Topology constraints: Edge types should declare valid (source_type, destination_type) pairs. `WORKS_AT` can only connect Person → Company.
  • Union types: A property might be `str | int` (heterogeneous data). Support via Python union syntax or explicit `Union[str, int]`.
  • Optional fields: Properties that may be null/missing. Align with `Optional[T]` / `T | None`.
  • Pydantic alignment: Consider using or extending Pydantic models for schema definition, getting validation, serialization, and IDE support for free.
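
To make the Pydantic-alignment option concrete, here is a minimal sketch (assuming Pydantic v2) of node properties declared as BaseModel subclasses; how NodeType/GraphSchema would wrap these models is left open:

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Person(BaseModel):
    id: int
    name: str
    age: Optional[int] = None   # nullable / possibly missing property
    scores: list[float] = []

class Company(BaseModel):
    id: int
    name: str
    founded: datetime

# Property names and annotations are recoverable for inference and Arrow mapping:
for name, field in Person.model_fields.items():
    print(name, field.annotation)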

2. Schema inference

Infer schemas from existing graph data:

schema = g.infer_schema()
# Returns GraphSchema with node types derived from label__* columns,
# edge types from relationship type column, property types from DataFrame dtypes

Design considerations:

  • Infer from `label__X` boolean columns (existing GFQL convention)
  • Infer property types from pandas/cudf dtypes → Arrow types (see the sketch after this list)
  • Handle mixed-type columns (object dtype) gracefully
  • Detect topology patterns (which edge types connect which node types)
  • Support incremental refinement: infer base schema, then user annotates/overrides
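
A minimal sketch of the dtype-driven part of inference, assuming pandas nodes with `label__X` boolean columns; the function names here are illustrative, not existing API:

from typing import Dict, List

import pandas as pd
import pyarrow as pa

def infer_node_labels(df: pd.DataFrame) -> List[str]:
    """Node type names derived from label__X boolean columns."""
    return [c[len("label__"):] for c in df.columns if c.startswith("label__")]

def infer_property_types(df: pd.DataFrame) -> Dict[str, pa.DataType]:
    """Map each non-label column's pandas dtype to an Arrow type."""
    out: Dict[str, pa.DataType] = {}
    for col in df.columns:
        if col.startswith("label__"):
            continue
        dtype = df[col].dtype
        if dtype == object:
            out[col] = pa.string()  # mixed/object columns: fall back to utf8 rather than guess
        else:
            out[col] = pa.from_numpy_dtype(dtype)
    return out

nodes = pd.DataFrame({"id": [0, 1], "age": [34, None], "label__Person": [True, True]})
infer_node_labels(nodes)      # ['Person']
infer_property_types(nodes)   # {'id': int64, 'age': double}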

3. Query validation against schema

Validate GFQL chains and Cypher queries against a schema:

schema = GraphSchema(...)
g = g.bind(schema=schema)

# Compile-time validation:
g.gfql("MATCH (p:Person)-[:WORKS_AT]->(c:Company) RETURN p.age, c.nonexistent")
# → SchemaValidationError: Company has no property 'nonexistent'

g.gfql("MATCH (p:Person)-[:WORKS_AT]->(q:Person) RETURN p, q")
# → SchemaValidationError: WORKS_AT edge type requires destination=Company, got Person

Design considerations:

  • Validate at Cypher compile time (in lowering.py) — no runtime cost; a sketch of the property check follows this list
  • Validate native GFQL chains via schema-aware `n()` / `e()` constructors
  • Provide helpful error messages referencing the schema definition
  • Optional strict mode vs permissive mode (warn vs error on unknown properties)
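
A sketch of the compile-time property check this could plug into; `SchemaValidationError` and the lookup structure are proposed here, not existing code:

from typing import Dict, Set

class SchemaValidationError(ValueError):
    pass

def check_property_ref(node_props: Dict[str, Set[str]], label: str, prop: str) -> None:
    """Raise at compile time if a query references a property the schema does not declare."""
    known = node_props.get(label)
    if known is not None and prop not in known:
        raise SchemaValidationError(
            f"{label} has no property '{prop}' (known: {sorted(known)})"
        )

node_props = {
    "Person": {"id", "name", "age", "scores"},
    "Company": {"id", "name", "founded"},
}
check_property_ref(node_props, "Company", "nonexistent")
# SchemaValidationError: Company has no property 'nonexistent' (known: ['founded', 'id', 'name'])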

4. Arrow representation

Map schema types to Arrow types for efficient columnar storage:

schema.to_arrow_schema()
# Returns pyarrow.Schema with typed fields for each property

# Load/save with enforced types:
g = graphistry.from_arrow(nodes_table, edges_table, schema=schema)
g.to_arrow(schema=schema)  # Validates and casts to schema types

Design considerations:

  • Map Python types → Arrow types: `int → int64`, `str → utf8`, `list[float] → list`, etc. (sketched after this list)
  • Support cudf/RAPIDS Arrow interop
  • Enable zero-copy roundtrip: Arrow IPC → cudf → GFQL → Arrow IPC
  • Schema evolution: handle missing columns, extra columns, type coercion
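
A possible shape for the mapping behind `to_arrow_schema()`, covering only a few scalar types plus `list[...]` (a sketch, not the final mapping):

from datetime import datetime
from typing import Dict

import pyarrow as pa

PY_TO_ARROW: Dict[type, pa.DataType] = {
    int: pa.int64(),
    float: pa.float64(),
    str: pa.utf8(),
    bool: pa.bool_(),
    datetime: pa.timestamp("us"),
}

def to_arrow_schema(properties: Dict[str, type]) -> pa.Schema:
    """Build a pyarrow.Schema from a {property name: Python type} mapping."""
    fields = []
    for name, py_type in properties.items():
        if getattr(py_type, "__origin__", None) is list:
            (item_type,) = py_type.__args__          # e.g. list[float]
            fields.append(pa.field(name, pa.list_(PY_TO_ARROW[item_type])))
        else:
            fields.append(pa.field(name, PY_TO_ARROW[py_type]))
    return pa.schema(fields)

to_arrow_schema({"id": int, "name": str, "scores": list[float]})
# id: int64, name: string, scores: list<item: double>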

Architecture questions

  1. Where does schema live? On the Plottable? As a separate object? Both?
  2. GFQL-first or Cypher-first? If we start at GFQL (schema-aware `n()`/`e()`), Cypher gets validation for free via the existing lowering path. Starting at Cypher requires mapping Cypher types to GFQL types.
  3. Pydantic integration depth: Full Pydantic models (with validation, serialization) vs lightweight dataclasses with Pydantic-style annotations?
  4. Inference vs declaration: Should `infer_schema()` produce the same schema objects as manual declaration? Or separate "inferred" vs "declared" types?
  5. Incremental adoption: How to add schemas to existing untyped graphs without breaking anything?

Prior art

  • Neo4j constraints: `CREATE CONSTRAINT ... REQUIRE (n.prop) IS :: INTEGER` — runtime enforcement
  • Apache AGE: PostgreSQL-based, inherits PostgreSQL type system
  • Kuzu: Built-in schema with typed node/edge tables
  • Pydantic: Python schema validation library — potential building block
  • Apache Arrow Schema: Columnar type system — the target representation
  • GraphQL: Typed schema for API queries — similar schema-first philosophy
  • openCypher type system: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `LIST`, `MAP`, `PATH`, `NODE`, `RELATIONSHIP`

Suggested approach

  1. Spike: Define `NodeType`, `EdgeType`, `GraphSchema` dataclasses. Implement `infer_schema()` from existing graph data. (A minimal dataclass sketch follows this list.)
  2. Validate: Add schema validation to the Cypher lowering path — check property references and topology constraints at compile time.
  3. Arrow: Map schema to `pyarrow.Schema`, add `to_arrow()` / `from_arrow()` with schema enforcement.
  4. Iterate: Pydantic integration, multi-label support, union types based on real usage patterns.
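
One possible starting point for the step-1 spike (plain dataclasses, before any Pydantic integration); the names are the ones proposed above, none exist yet:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class NodeType:
    name: str
    properties: Dict[str, type]

@dataclass(frozen=True)
class EdgeType:
    name: str
    source: NodeType
    destination: NodeType
    properties: Dict[str, type] = field(default_factory=dict)

@dataclass(frozen=True)
class GraphSchema:
    node_types: List[NodeType]
    edge_types: List[EdgeType]

    def topology(self) -> List[Tuple[str, str, str]]:
        """(source label, edge type, destination label) triples for topology validation."""
        return [(e.source.name, e.name, e.destination.name) for e in self.edge_types]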

Relationship to existing code

  • `graphistry/compute/gfql/cypher/lowering.py`: Cypher property references validated here — schema validation plugs in naturally
  • `graphistry/compute/ast.py`: `ASTNode`, `ASTEdge` could carry schema type info
  • `graphistry/Engine.py`: Engine resolution (pandas/cudf) — Arrow bridge point
  • `graphistry/Plottable.py`: Schema could attach here as `._schema`
  • `label__X` convention: Existing multi-label encoding — schema inference reads these

AI contributor notes

  • Repo AI guidance: `AGENTS.md`, `ai/README.md`
  • GFQL architecture: `ai/docs/` has guides for the query pipeline
  • Test patterns: `graphistry/tests/compute/gfql/cypher/test_lowering.py` (600+ tests)
  • The Cypher compiler is pure Python (no pandas dependency) — schema validation can be added without runtime overhead
