refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft (#1031 slice 1) by lmeyerov · Pull Request #1217 · graphistry/pygraphistry

lmeyerov · 2026-04-26T00:42:00Z

Summary

Slice 1 of N for issue #1031, done per tech-lead direction (1-day timebox) after an LALR(1) state-collapse on `<comparable> AND <pattern>` shapes could not be resolved by grammar restructuring within Lark's LALR(1) constraints.

Earley parser swap eliminates four LALR-induced workarounds:

The 3 dedicated pattern-shape `where_clause` grammar alternatives.
`_canonicalize_where_single_pattern_and_expr` regex source-rewrite that reordered `expr AND pattern AND expr` → `pattern AND <rest>`.
`_mixed_where_pattern_expr_error` pre-flight rejector.
The `except LarkError` retry block in `parse_cypher`.

Net diff: +341 / -186 across 7 files.

Two commits, logically separate

`fix(predicates): allow raw strings in NE/GT/LT/GE/LE` — extends EQ's pre-existing string acceptance to its sibling comparison ops. Latent bug: LALR's routing meant string comparisons mixed with labels never reached structured comparison predicates. Earley unifies the route, surfacing the strictness on idiomatic Cypher patterns. ~50 LOC self-contained in `comparison.py`.
`refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft` — the main change. Grammar restructure + Earley swap + pre-flight retirement + lift step + AST extension.

Strict-improvement consequences (Earley accepts what LALR rejected)

| Query shape | Pre-#1031 | Post-#1031 |
| --- | --- | --- |
| `WHERE expr OR expr` | E107 syntax error | parses as `or` tree |
| `WHERE expr AND (expr OR expr)` (Shape B) | E107 | parses as `and(left, or(...))` |
| `WHERE n:Label AND n.prop = X` (Shape F) | raw expr text | structured `where_predicates` |
| `WHERE prop OP 'string'` mixed with label | ran via raw text path | runs via structured ops (now safe via comparison.py fix) |
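
To make the tree shapes in the table concrete, here is a minimal sketch of a precedence-aware parser for a toy WHERE body. It is not the Lark grammar from this PR: `parse_where`, the tuple encoding, and the whitespace tokenizer are hypothetical and only illustrate how `AND` binds tighter than `OR` and how parentheses produce `and(left, or(...))`.

```python
import re
from typing import Tuple, Union

# A tree is either an opaque atom string or ("and" | "or", left, right).
Tree = Union[str, Tuple]

KEYWORDS = {"AND", "OR", "(", ")"}


def parse_where(text: str) -> Tree:
    """Parse a toy WHERE body into nested ("and"/"or", left, right) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def peek_is(word: str) -> bool:
        return pos < len(tokens) and tokens[pos].upper() == word

    def parse_or() -> Tree:
        nonlocal pos
        left = parse_and()
        while peek_is("OR"):
            pos += 1
            left = ("or", left, parse_and())
        return left

    def parse_and() -> Tree:
        nonlocal pos
        left = parse_primary()
        while peek_is("AND"):
            pos += 1
            left = ("and", left, parse_primary())
        return left

    def parse_primary() -> Tree:
        nonlocal pos
        if peek_is("("):
            pos += 1
            inner = parse_or()
            if not peek_is(")"):
                raise SyntaxError("expected ')'")
            pos += 1
            return inner
        # An atom is every token up to the next keyword or parenthesis.
        parts = []
        while pos < len(tokens) and tokens[pos].upper() not in KEYWORDS:
            parts.append(tokens[pos])
            pos += 1
        if not parts:
            raise SyntaxError("expected an expression")
        return " ".join(parts)

    tree = parse_or()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


shape_b = parse_where("n.a = 1 AND (n.b = 2 OR n.c = 3)")
```

Here `shape_b` is the nested tuple form of Shape B, with the `or` node as the right child of the `and` node.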

What this PR adds

`BooleanExpr.op` literal extended with `"pattern"`; new `BooleanExpr.pattern` field.
Parser `pattern_atom` transformer wrapping `WHERE_PATTERN` as a tree leaf.
`_split_top_level_and_pattern_leaves` + `_rebuild_and_tree` + `_build_where_with_pattern_lift` helpers that extract pattern leaves from `expr_tree` into `WhereClause.predicates` as `WherePatternPredicate` entries before lowering.
Parser swap: `Lark(parser="earley")` for both `_parser` and `_pattern_parser`.
`_StringAllowingComparisonMixin` applied to NE/GT/LT/GE/LE (separate commit).
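
The lift step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `Node`, `split_and_spine`, `rebuild_and`, and `lift` are hypothetical stand-ins for `BooleanExpr`, `_split_top_level_and_pattern_leaves`, `_rebuild_and_tree`, and `_build_where_with_pattern_lift`, and the rejection rules follow the slice 2 to 4 deferrals listed under "#1031 progress".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical boolean-tree node; op is and/or/not/atom/pattern."""
    op: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    text: str = ""


def has_pattern(node: Optional[Node]) -> bool:
    if node is None:
        return False
    if node.op == "pattern":
        return True
    return has_pattern(node.left) or has_pattern(node.right)


def split_and_spine(node: Node) -> Tuple[List[Node], List[Node]]:
    """Return (pattern leaves, residual conjuncts) along the top-level AND spine."""
    if node.op == "and":
        left_patterns, left_rest = split_and_spine(node.left)
        right_patterns, right_rest = split_and_spine(node.right)
        return left_patterns + right_patterns, left_rest + right_rest
    if node.op == "pattern":
        return [node], []
    if has_pattern(node):
        # Patterns under OR / XOR / NOT are deferred to later slices.
        raise ValueError(f"unsupported: pattern under {node.op!r}")
    return [], [node]


def rebuild_and(conjuncts: List[Node]) -> Optional[Node]:
    """Fold the residual conjuncts back into a left-leaning AND tree."""
    tree: Optional[Node] = None
    for conjunct in conjuncts:
        tree = conjunct if tree is None else Node("and", tree, conjunct)
    return tree


def lift(node: Node) -> Tuple[List[Node], Optional[Node]]:
    patterns, residual = split_and_spine(node)
    if len(patterns) > 1:
        raise ValueError("unsupported: multiple positive patterns")
    return patterns, rebuild_and(residual)


a = Node("atom", text="n.x = 1")
p = Node("pattern", text="(n)-->(m)")
b = Node("atom", text="n.y = 2")
patterns, rest = lift(Node("and", Node("and", a, p), b))
```

The pattern leaf can sit at any position on the AND spine; the residual tree keeps the remaining conjuncts in source order.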

What this PR retires

3 grammar alternatives, 3 transformer methods, `_mixed_where_clause`, 2 pre-flight functions, 3 derived regex constants, 3 helper utilities, the `parse_cypher` retry block.

#1031 progress

✅ Retires `_canonicalize_where_single_pattern_and_expr`.
✅ Removes one of three `split_top_level_and` callers (parser side). Remaining caller: `predicate_pushdown._split_conjuncts`, retired separately under #1200 ("gfql/ir: expose Lark and_op/or_op tree in WhereClause so passes walk structure instead of re-parsing text"). Together, these bring `expr_split.py` deletion within reach.
⏭️ `NOT pattern` lowering (anti-semi-join) — slice 2. Emits explicit `unsupported` error today.
⏭️ Multi-positive patterns — slice 3. Rejected at lift step.
⏭️ `OR` / `XOR` around patterns — slice 4. Rejected at lift step.

Test plan

`pytest graphistry/tests/compute/gfql/` — 1539 pass / 80 skip / 15 xfail / 0 fail.
10 tests updated to match Earley's strict-improvement routing (Categories A, B, C, D documented in `plans/issue-1031-grammar-mixed-where-pattern-expr/plan.md`).
gfql-benchmarks (per tech-lead's <2× parse-delta bound) — to run on CI.

Notes for review

The plan file (`plans/issue-1031-...`) has the full LALR investigation, planning-review (12 deltas), and tech-lead correspondence baked in.
Earley introduces a new failure mode (grammar ambiguity). Lark's default `ambiguity="resolve"` picks the highest-priority parse. Conformance + benchmark suites are the validation oracle.

🤖 Generated with Claude Code

…cates

EQ already accepts raw strings via inline ``_normalize_value`` + ``_validate_fields`` overrides. The other comparison ops lacked the same overrides and rejected raw strings unconditionally; their ``__call__`` methods also explicitly raised ``TypeError`` on strings.

This was latent: LALR couldn't unify label-predicate and property-predicate alternatives in Cypher's ``where_predicates``, so queries like ``WHERE n.name <> 'value' AND n:Label`` routed through raw expr text and never reached ``ComparisonPredicate.__call__``. #1031's Earley swap unifies the route, surfacing the strictness on idiomatic Cypher patterns.

Pandas Series supports lexicographic ``>``/``<``/``!=`` on strings natively, so the fix is to extend EQ's pre-existing string acceptance to siblings.

Implementation:

- Extract ``_StringAllowingComparisonMixin`` with the ``_normalize_value`` + ``_validate_fields`` overrides.
- Apply mixin to NE/GT/LT/GE/LE.
- Extend each ``__call__`` to handle ``isinstance(self.val, str)``.

The strict raw-string rejection on the base ``ComparisonPredicate`` stays: it still applies to ``Between`` (and any future direct-IR constructors) where the datetime-vs-string ambiguity matters for direct-IR users.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
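
A minimal sketch of the mixin idea from this commit message, assuming a toy base class. The class bodies here are hypothetical and use plain Python lists instead of pandas Series so the example stays stdlib-only; only the names ``_normalize_value`` and ``_validate_fields`` come from the text above.

```python
from typing import Any, List


class ComparisonPredicate:
    """Toy base class: rejects raw strings, like the strict default above."""

    def __init__(self, val: Any) -> None:
        self.val = self._normalize_value(val)
        self._validate_fields()

    def _normalize_value(self, val: Any) -> Any:
        return val

    def _validate_fields(self) -> None:
        if isinstance(self.val, str):
            raise TypeError("raw string values are not accepted")


class StringAllowingComparisonMixin:
    """Hypothetical mixin: accept raw strings, defer everything else to the base."""

    def _normalize_value(self, val: Any) -> Any:
        # Keep strings as-is instead of coercing them.
        return val if isinstance(val, str) else super()._normalize_value(val)

    def _validate_fields(self) -> None:
        if isinstance(self.val, str):
            return
        super()._validate_fields()


class GT(StringAllowingComparisonMixin, ComparisonPredicate):
    def __call__(self, values: List[Any]) -> List[bool]:
        # Python compares strings lexicographically, as pandas does for str Series.
        return [v > self.val for v in values]


class NE(StringAllowingComparisonMixin, ComparisonPredicate):
    def __call__(self, values: List[Any]) -> List[bool]:
        return [v != self.val for v in values]


gt_mask = GT("b")(["a", "b", "c"])
ne_mask = NE("b")(["a", "b", "c"])
```

Placing the mixin first in the base list lets its overrides win in the method resolution order while the strict base class stays unchanged for other subclasses.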

…hape cruft (#1031 slice 1)

Per tech-lead direction (1-day timebox) following an LALR(1) state-collapse on ``<comparable> AND <pattern>`` shapes that could not be resolved by grammar restructure.

Earley parser swap eliminates four LALR-induced workarounds:

1. The 3 dedicated pattern-shape ``where_clause`` grammar alternatives (``where_pattern_and_expr_clause`` / ``expr_and_where_pattern_clause`` / ``where_pattern_only_clause``), which collapse to the unified ``expr`` route via a new ``WHERE_PATTERN -> pattern_atom`` leaf in ``?primary``.
2. ``_canonicalize_where_single_pattern_and_expr``: regex source-rewrite that reordered ``expr AND pattern AND expr`` to ``pattern AND <rest>`` so the LALR grammar could match. Earley accepts patterns at any AND-spine position natively.
3. ``_mixed_where_pattern_expr_error``: pre-flight rejector for shapes LALR could parse but the IR couldn't represent. Replaced with a structural lift step that inspects the parsed boolean tree.
4. The ``except LarkError`` retry block in ``parse_cypher`` that fed the canonicalize-then-reparse path.

What this PR adds:

- ``BooleanExpr.op`` literal extended with ``"pattern"``; new ``BooleanExpr.pattern: Optional[Tuple[PatternElement, ...]]`` field.
- Parser ``pattern_atom`` transformer: wraps ``WHERE_PATTERN`` token as ``BooleanExpr(op="pattern", pattern=parsed)``; reuses the existing ``_parse_where_pattern_predicate_text`` helper.
- ``_split_top_level_and_pattern_leaves`` walker + ``_rebuild_and_tree`` helper + ``_build_where_with_pattern_lift`` that extracts pattern leaves from ``expr_tree`` into ``WhereClause.predicates`` as ``WherePatternPredicate`` entries before lowering.
- Parser swap: ``Lark(parser="earley")`` for both ``_parser`` and ``_pattern_parser``.

What this PR retires:

- 3 grammar alternatives, 3 transformer methods, 1 helper (``_mixed_where_clause``), 2 pre-flight functions, 3 derived regex constants (``_WHERE_PATTERN_SEQUENCE_RE``, ``_WHERE_PATTERN_THEN_EXPR_RE``, ``_WHERE_EXPR_THEN_PATTERN_RE``), 3 helper utilities (``_WHERE_CLAUSE_BODY_RE``, ``_BOOLEAN_KEYWORD_RE``, ``_line_and_column_from_offset``), and the ``parse_cypher`` ``except LarkError`` retry block.

Net diff: +204 / -144.

Strict-improvement consequences (Earley accepts what LALR rejected):

- ``WHERE expr OR expr`` now parses as a structured ``or`` tree.
- ``WHERE expr AND (expr OR expr)`` (Shape B) now parses as ``and(left, or(...))``.
- ``WHERE n:Label AND n.prop = X`` now routes through structured ``where_predicates`` (was raw expr); label narrowing applies.
- ``WHERE prop OP 'string'`` mixed with label predicates now hits ``ComparisonPredicate`` for strings; separately fixed in the preceding commit by extending EQ's string-accepting overrides to NE/GT/LT/GE/LE.

#1031 milestones in this PR:

- ✅ Retires ``_canonicalize_where_single_pattern_and_expr``.
- ✅ Removes one of three ``split_top_level_and`` callers (parser side). Remaining caller: ``predicate_pushdown._split_conjuncts``, retired separately under #1200. Together this brings ``expr_split.py`` deletion within reach.
- ⏭️ ``NOT pattern`` lowering (anti-semi-join): slice 2 territory; emits explicit ``unsupported`` error today.
- ⏭️ Multi-positive patterns: slice 3 territory; rejected at lift step.
- ⏭️ ``OR``/``XOR`` around patterns: slice 4 territory; rejected at lift step.

Tests updated:

- 7 tests adjusted to match Earley's strict-improvement routing (``test_or_where_not_yet_supported``, conformance Shape B parser + binder, conformance Shape F parser + binder, label-narrowing mixed case, logical planner WITH-stage attachment).
- ``test_parse_supports_where_pattern_predicate_and_expr_mix`` paren-preservation case updated to expect ``boolean_expr_to_text``-reconstructed paren-stripped form (semantically equivalent).

1539 / 1539 GFQL tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ngs entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…#1217)

Pre-#1217: LALR rejected OR / NOT inside WHERE bodies, so disjunctive and negation shapes had zero native end-to-end test coverage. Earley accepts them and the runtime evaluates via pandas, but until now the correctness was implicit: only the (separate-repo) tck-gfql harness exercised these paths via xfail-marked scenarios.

Adds 5 native execution tests in test_lowering.py:

- ``test_string_cypher_executes_disjunctive_property_predicate_returns_union``: mirrors TCK match-where1-10. ``WHERE n.p1 = 12 OR n.p2 = 13`` returns the union of matches (a, b), not just one.
- ``test_string_cypher_executes_disjunctive_same_alias_property_predicate``: ``WHERE n.p1 = 12 OR n.p1 = 99``, same alias on both sides.
- ``test_string_cypher_executes_negation_property_predicate_returns_complement``: ``WHERE NOT n.p1 = 12``. Locks pandas's NaN-NOT-NaN semantics (NaN is falsy under NOT, so rows with null p1 are excluded).
- ``test_string_cypher_executes_disjunctive_then_conjunction``: ``WHERE (n.p1 = 12 OR n.p2 = 13) AND n.id = 'a'``, mixed OR-AND tree narrowing.
- ``test_string_cypher_executes_disjunction_returns_correct_count_with_more_rows``: larger fixture (6 rows) confirming OR doesn't silently union too many or too few rows.

Pairs with the tck-gfql sibling branch ``issue-1031-grammar-mixed-where-pattern-expr`` which registers ``match-where1-10`` as ``success_matches_expected`` in the xfail-contract (kept as ``xfail`` rather than promoted to ``supported`` until broader OR-disjunction conformance is validated).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
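
For orientation, the disjunctive and mixed OR-AND shapes named in this commit correspond to ordinary pandas boolean masks. The sketch below is not taken from test_lowering.py: the three-row fixture and its values are assumptions chosen to mirror the ``WHERE n.p1 = 12 OR n.p2 = 13`` case.

```python
import pandas as pd

# Hypothetical fixture: node "a" matches on p1, node "b" matches on p2, "c" on neither.
nodes = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "p1": [12, None, 7],
        "p2": [None, 13, 8],
    }
)

# WHERE n.p1 = 12 OR n.p2 = 13: element-wise OR of two equality masks.
mask_or = (nodes["p1"] == 12) | (nodes["p2"] == 13)
union_ids = nodes.loc[mask_or, "id"].tolist()

# WHERE (n.p1 = 12 OR n.p2 = 13) AND n.id = 'a': narrow the union with AND.
mask_mixed = mask_or & (nodes["id"] == "a")
narrowed_ids = nodes.loc[mask_mixed, "id"].tolist()
```

Equality against a missing value evaluates to False in pandas, so each side of the OR contributes only its own matching row and the union holds both.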

lmeyerov · 2026-04-26T02:41:34Z

Filed follow-up: #1219 — "Cypher WHERE: static validation gap on row-boolean shapes (OR/NOT/XOR among row predicates) — emergent from #1217 Earley swap". The validator question is genuinely out of scope for #1031 (which is specifically about mixing pattern + row predicates). Earley's broader unification incidentally lifts the implicit LALR rejection on row-side OR/NOT/XOR, exposing a missing semantic gate that pre-#1217 wasn't needed. Decision space (validator vs partial-accept vs full-disjunction-support vs defer-with-docs) is left open in #1219.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

IMPORTANT findings from multi-dimension review:

1. Dead-code in OR test (test_lowering.py:3199-3205): the ``if False else`` made the first comprehension permanently dead; the fallback branch then asserted on either of two contradictory shapes. Replaced with a single deterministic assertion: 5 rows expected, every kept row carries either ``p1: 1`` or ``p2: 2`` in its rendered form.
2. EQ duplicated the new mixin's overrides (comparison.py:245-263). The mixin docstring already invited the retrofit ("EQ predates this mixin with an inline override"). EQ now uses ``_StringAllowingComparisonMixin`` for its single-source contract.

SUGGESTIONS deferred (documented in plan):

- NaN-NOT semantics test name (locks pandas-not-Cypher behavior; acknowledged in test docstring + #1219 link).
- ``_rebuild_and_tree`` invents source-spans across non-adjacent residuals (cosmetic, error-reporting only).
- No deeply-nested pattern-under-multi-level-boolean test (``_has_pattern_descendant`` handles correctly by inspection).
- PEP-8 spacing nit after constant deletes.
- Defensive ``_split_top_level_and_pattern_leaves`` bad-flag drop on malformed AND nodes (#1214 invariant prevents the case).

1680 GFQL + predicates tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mixed-where-pattern-expr

# Conflicts:
#	CHANGELOG.md
#	graphistry/compute/gfql/cypher/parser.py
#	graphistry/tests/compute/gfql/cypher/test_where_bool_conformance.py

…der Earley ambiguity (#1031)

Earley with ambiguous grammar parsed `MATCH ... WITH ... WHERE ... RETURN` two ways:

(a) `with_stage[with_clause + with_where_clause]` (correct), or
(b) `stage[with_clause] + standalone where_clause` (wrong; query_body then attached the where to the preceding MATCH at parser.py:1582-1587).

Earley's default ambiguity resolution silently picked the wrong parse: the predicate ended up populated on both matches[0].where and top-level where, with with_stages[0].where = None.

Surfaced by Wave 2 fresh-eyes review of #1217: the renamed/loosened test_logical_planner_applies_predicates_attached_to_with_stage_filter masked the regression.

Fix: bump Lark rule priority on with_where_clause so Earley prefers it during ambiguity resolution.

Smoke-trace post-fix: match.where=0 preds, top.where=0 preds, with[0].where=ExpressionText(...)

Validated on dgx-spark:

- graphistry/tests/compute/gfql/cypher/ → 1038 passed
- graphistry/tests/compute/gfql/ → 1543 passed
- graphistry/tests/compute/ → 2441 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ring mixin

Two Wave-2 review fixes:

1. test_logical_planner_applies_predicates_attached_to_with_part: restored original test name + WITH-part predicate assertion. The previous rename (..._to_with_stage_filter) and "predicate anywhere in plan" loosening had MASKED the WITH-WHERE re-routing regression that the previous commit (parser priority bump) fixed. Now the binder correctly attaches the predicate to the WITH query_part, so the original assertion holds again.
2. New test_comparison_strings.py: direct unit tests for _StringAllowingComparisonMixin on EQ/NE/GT/LT/GE/LE. Each op verified for: validate() acceptance of raw-string val, JSON round-trip preserving string val, lexicographic Series application matching pandas semantics. Closes the wave-2 IMPORTANT gap that only end-to-end coverage existed for these ops.

Validated on dgx-spark: graphistry/tests/compute/ → 2458 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lmeyerov and others added 4 commits April 25, 2026 17:40

docs(changelog): add #1031/#1217 Earley parser swap + comparison stri…

624b7cd


lmeyerov mentioned this pull request Apr 26, 2026

Cypher WHERE: static validation gap on row-boolean shapes (OR / NOT / XOR among row predicates) — emergent from #1217 Earley swap #1219

Open

lmeyerov and others added 2 commits April 25, 2026 19:43

docs(changelog): cross-link #1219 row-boolean validation gap

4494518


This was referenced Apr 26, 2026

Phase 1: #1219 row-boolean WHERE audit + TCK-linked tests #1223

Open

Draft: Investigate #1219 row-boolean WHERE validation gap #1222

Closed

lmeyerov and others added 3 commits April 25, 2026 22:00

Merge remote-tracking branch 'origin/master' into issue-1031-grammar-…

d8aed83


lmeyerov wants to merge 9 commits into master from issue-1031-grammar-mixed-where-pattern-expr
