Cypher WHERE: static validation gap on row-boolean shapes (OR / NOT / XOR among row predicates) — emergent from #1217 Earley swap #1219

@lmeyerov

Context

PR #1217 (closes #1031 slice 1) swaps the Cypher parser from Lark LALR(1) to Earley. This cleanly solves `#1031`'s mixed-pattern+row WHERE problem. As an architectural side effect, however, it lifts an implicit syntax gate that was previously hiding a missing semantic gate.

Pre-#1217 (LALR):

  • Queries like `WHERE a.x = 1 OR b.y = 2` failed at parse time because LALR(1) couldn't unify FIRST sets across the OR boundary.
  • The implicit "if I can't parse it, you can't write it" was the de facto semantic gate.
  • No explicit Cypher-side validator existed for these shapes — none was needed, because the syntax gate caught them.

Post-#1217 (Earley):

  • Earley parses OR/NOT/XOR among row predicates without complaint.
  • Binder routes the result through the existing raw-expr path: `BoundPredicate(expression="a.x = 1 OR b.y = 2")`.
  • Lowering's runtime evaluator (`graphistry.compute.gfql.expr_parser.parse_expr` at `lowering.py:7031`) handles AND/OR/NOT/XOR with correct SQL three-valued logic — so simple cases produce correct results.
  • No static gate exists for compositions we haven't validated.
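For reference, the three-valued semantics the runtime evaluator is expected to follow can be sketched as below. This is a minimal illustration, not the code in `expr_parser`: the helper names (`tri_and`, `tri_or`, `tri_xor`, `tri_not`, `tri_eq`, `where_keeps`) are hypothetical, and Python's `None` stands in for NULL/unknown.

```python
from typing import Optional

# Hypothetical helpers for illustration only; None models SQL NULL / unknown.
Tri = Optional[bool]


def tri_not(a: Tri) -> Tri:
    # NOT unknown stays unknown.
    return None if a is None else (not a)


def tri_and(a: Tri, b: Tri) -> Tri:
    # A definite FALSE dominates, even when the other side is unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def tri_or(a: Tri, b: Tri) -> Tri:
    # A definite TRUE dominates, even when the other side is unknown.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def tri_xor(a: Tri, b: Tri) -> Tri:
    # XOR needs both operands to be known.
    if a is None or b is None:
        return None
    return a != b


def tri_eq(left: object, right: object) -> Tri:
    # Comparing anything with NULL yields unknown.
    if left is None or right is None:
        return None
    return left == right


def where_keeps(result: Tri) -> bool:
    # WHERE keeps a row only when the predicate is definitely TRUE.
    return result is True
```

Under these rules, `a.x = 1 OR b.y = 2` keeps a row where `a.x` is 1 even if `b.y` is NULL, while `NOT a.x = 1` drops a row where `a.x` is NULL because the result is unknown rather than TRUE.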

Asymmetry made visible

PR #1217 explicitly gates pattern-side compositions (NOT-pattern, OR-pattern, multi-positive-pattern) with E108 at the lift step in `parser.py::generic_where_clause` → `_build_where_with_pattern_lift`. Row-boolean compositions silently pass through:

| Newly-accepted shape | Static gate? | Outcome |
| --- | --- | --- |
| `WHERE n.p1 = X OR n.p2 = Y` | ❌ none | silent accept → expr_parser → runtime |
| `WHERE NOT n.p1 = X` | ❌ none | silent accept |
| `WHERE n.p1 = X XOR n.p2 = Y` | ❌ none | silent accept |
| `WHERE NOT (n)-[:R]->()` | ✅ E108 at lift | explicit "unsupported" |
| `WHERE (n)-[:R]->() OR n.x = 1` | ✅ E108 at lift | explicit "unsupported" |
| `WHERE (n)-[:R*]->() AND (m)-[:R*]->()` | ✅ E108 at lift | explicit "unsupported" |

Empirical evidence

PR #1217 added 5 native execution tests covering simple disjunction shapes (TCK match-where1-10 mirror, same-alias OR, NOT with NaN, OR-then-AND, multi-row union). All pass — the runtime evaluator does the right thing for property-comparison disjunctions under SQL three-valued logic.

What's not validated

Compositional edge cases that may interact with parts of the planner/runtime that were designed assuming AND-only:

  1. Predicate pushdown — `pushdown_safety.py:58` explicitly says "Compound OR is not analyzed". AND has explicit handling (line 73); OR doesn't. Likely safe-by-default (predicates with OR don't push) but unverified.
  2. OPTIONAL MATCH null-extension — `is_null_rejecting()` is AND-aware; behavior on `WHERE x.field = 'val' OR ` where `x` is null-extended is unverified.
  3. Type coercion in OR operands — `WHERE n.p = 12 OR n.p = 'twelve'` mixed-type branches.
  4. OR with NULL literals — `WHERE n.p = 12 OR n.p IS NULL` three-valued logic edge cases.
  5. Nested compositions — `(a OR b) AND (c OR d)`, `NOT (a OR b)`, etc.
  6. Pushdown into PatternMatch — predicates with OR being pushed should respect the OR-aware safety analysis.
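To make item 2 concrete, here is a sketch of what an OR-aware null-rejection check could look like. It is not the existing `is_null_rejecting()`; the node classes (`Cmp`, `IsNull`, `And`, `Or`) and the function `rejects_null` are hypothetical stand-ins. The key asymmetry: an AND rejects a null-extended alias if either side does, whereas an OR rejects it only if both sides do.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical mini-AST for illustration; not the real binder/IR types.
@dataclass(frozen=True)
class Cmp:
    alias: str  # alias whose property is compared, e.g. "x" in x.field = 'val'


@dataclass(frozen=True)
class IsNull:
    alias: str  # alias tested by "<alias>.<prop> IS NULL"


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Cmp, IsNull, And, Or]


def rejects_null(expr: Expr, alias: str) -> bool:
    """Conservatively report whether expr can never be TRUE when alias is null-extended."""
    if isinstance(expr, Cmp):
        # A comparison on a NULL property is unknown, so the row is dropped.
        return expr.alias == alias
    if isinstance(expr, IsNull):
        # IS NULL can be TRUE for a null-extended row, so it never rejects.
        return False
    if isinstance(expr, And):
        # One rejecting conjunct is enough to make the whole AND non-TRUE.
        return rejects_null(expr.left, alias) or rejects_null(expr.right, alias)
    if isinstance(expr, Or):
        # Every disjunct must reject; otherwise a sibling branch can keep the row.
        return rejects_null(expr.left, alias) and rejects_null(expr.right, alias)
    raise TypeError(f"unsupported node: {expr!r}")
```

An analysis that treats OR like AND would wrongly classify `x.field = 'val' OR y.other = 1` as null-rejecting on `x`, which is the kind of interaction this list is meant to flag.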

Decision space (intentionally open)

Several reasonable solutions; pick based on appetite + signal:

Option A: Static validator that rejects un-validated shapes

Option B: Accept simpler cases provably; reject the rest

  • Define a "safe subset" of OR/NOT/XOR shapes (e.g., property comparisons with non-null-extended aliases, no nested compositions). Accept those; reject the rest.
  • Pro: surfaces the gradient (some OR is supported, some isn't) more honestly.
  • Con: "safe subset" is hard to specify formally; risks fragmentation.
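One possible shape for such a gate, as a sketch only: the node classes, `validate_where`, and the `UnsupportedRowBoolean` exception are hypothetical, and the rule encoded here (OR/XOR/NOT only directly over property comparisons on aliases that are not null-extended) is just one candidate definition of the safe subset.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


# Hypothetical mini-AST for illustration; not the real parser output.
@dataclass(frozen=True)
class Cmp:
    alias: str  # alias of the compared property, e.g. "n" in n.p1 = X


@dataclass(frozen=True)
class Not:
    child: "Expr"


@dataclass(frozen=True)
class BoolOp:
    op: str  # "AND", "OR", or "XOR"
    left: "Expr"
    right: "Expr"


Expr = Union[Cmp, Not, BoolOp]


class UnsupportedRowBoolean(ValueError):
    """Raised for row-boolean shapes outside the validated subset (hypothetical)."""


def _is_safe_leaf(expr: Expr, null_extended: FrozenSet[str]) -> bool:
    # Only plain comparisons on non-null-extended aliases count as safe operands.
    return isinstance(expr, Cmp) and expr.alias not in null_extended


def validate_where(expr: Expr, null_extended: FrozenSet[str] = frozenset()) -> None:
    """Accept AND freely; accept OR/XOR/NOT only over safe leaves; reject the rest."""
    if isinstance(expr, Cmp):
        return
    if isinstance(expr, BoolOp) and expr.op == "AND":
        validate_where(expr.left, null_extended)
        validate_where(expr.right, null_extended)
        return
    if isinstance(expr, Not):
        label, operands = "NOT", (expr.child,)
    elif isinstance(expr, BoolOp) and expr.op in ("OR", "XOR"):
        label, operands = expr.op, (expr.left, expr.right)
    else:
        raise TypeError(f"unsupported node: {expr!r}")
    for operand in operands:
        if not _is_safe_leaf(operand, null_extended):
            raise UnsupportedRowBoolean(
                f"{label} operand is outside the validated subset: {operand!r}"
            )
```

With this rule, `n.p1 = X OR n.p2 = Y` passes, while `NOT (a OR b)` and an OR that touches a null-extended alias are rejected up front instead of reaching the runtime evaluator.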

Option C: Implement full disjunction support across the pipeline

  • Audit pushdown, OPTIONAL MATCH null-extension, type coercion paths for OR-correctness. Add tests for each. Then accept all of OR/NOT/XOR with full validated support.
  • Pro: "correctness by construction" across the planner.
  • Con: largest scope; may need M-series collaboration on pushdown's OR-awareness.

Option D: Defer + document

Why this isn't M-series's responsibility

The IR verifier (M2-PR3) checks plan-shape invariants (op_id uniqueness, schema consistency, dangling refs, optional-arm nullability) — not "is this Cypher WHERE a supported subset". That's a Cypher front-end concern. The pattern-shape gates in #1217 live in `parser.py::_build_where_with_pattern_lift`, which is the right home for the row-boolean gates too.

Related

Priority

p3 — same as #1031 (architectural cleanup, no user-visible bug for simple cases; correctness risk for unvalidated compositions).
