refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft (#1031 slice 1)#1217
Open
refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft (#1031 slice 1)#1217
Conversation
…cates EQ already accepts raw strings via inline ``_normalize_value`` + ``_validate_fields`` overrides. The other comparison ops lacked the same overrides and rejected raw strings unconditionally — their ``__call__`` methods also explicitly raised ``TypeError`` on strings. This was latent: LALR couldn't unify label-predicate and property-predicate alternatives in Cypher's ``where_predicates``, so queries like ``WHERE n.name <> 'value' AND n:Label`` routed through raw expr text and never reached ``ComparisonPredicate.__call__``. #1031's Earley swap unifies the route, surfacing the strictness on idiomatic Cypher patterns. Pandas Series supports lexicographic ``>``/``<``/``!=`` on strings natively, so the fix is to extend EQ's pre-existing string acceptance to siblings. Implementation: - Extract ``_StringAllowingComparisonMixin`` with the ``_normalize_value`` + ``_validate_fields`` overrides. - Apply mixin to NE/GT/LT/GE/LE. - Extend each ``__call__`` to handle ``isinstance(self.val, str)``. The strict raw-string rejection on the base ``ComparisonPredicate`` stays — it still applies to ``Between`` (and any future direct-IR constructors) where the datetime-vs-string ambiguity matters for direct-IR users. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hape cruft (#1031 slice 1) Per tech-lead direction (1-day timebox) following an LALR(1) state-collapse on ``<comparable> AND <pattern>`` shapes that could not be resolved by grammar restructure. Earley parser swap eliminates four LALR-induced workarounds: 1. The 3 dedicated pattern-shape ``where_clause`` grammar alternatives (``where_pattern_and_expr_clause`` / ``expr_and_where_pattern_clause`` / ``where_pattern_only_clause``) — collapse to the unified ``expr`` route via a new ``WHERE_PATTERN -> pattern_atom`` leaf in ``?primary``. 2. ``_canonicalize_where_single_pattern_and_expr`` — regex source-rewrite that reordered ``expr AND pattern AND expr`` to ``pattern AND <rest>`` so the LALR grammar could match. Earley accepts patterns at any AND-spine position natively. 3. ``_mixed_where_pattern_expr_error`` — pre-flight rejector for shapes LALR could parse but the IR couldn't represent. Replaced with a structural lift step that inspects the parsed boolean tree. 4. The ``except LarkError`` retry block in ``parse_cypher`` that fed the canonicalize-then-reparse path. What this PR adds: - ``BooleanExpr.op`` literal extended with ``"pattern"``; new ``BooleanExpr.pattern: Optional[Tuple[PatternElement, ...]]`` field. - Parser ``pattern_atom`` transformer: wraps ``WHERE_PATTERN`` token as ``BooleanExpr(op="pattern", pattern=parsed)``; reuses the existing ``_parse_where_pattern_predicate_text`` helper. - ``_split_top_level_and_pattern_leaves`` walker + ``_rebuild_and_tree`` helper + ``_build_where_with_pattern_lift`` that extracts pattern leaves from ``expr_tree`` into ``WhereClause.predicates`` as ``WherePatternPredicate`` entries before lowering. - Parser swap: ``Lark(parser="earley")`` for both ``_parser`` and ``_pattern_parser``. What this PR retires: - 3 grammar alternatives, 3 transformer methods, 1 helper (``_mixed_where_clause``), 2 pre-flight functions, 3 derived regex constants (``_WHERE_PATTERN_SEQUENCE_RE``, ``_WHERE_PATTERN_THEN_EXPR_RE``, ``_WHERE_EXPR_THEN_PATTERN_RE``), 2 helper utilities (``_WHERE_CLAUSE_BODY_RE``, ``_BOOLEAN_KEYWORD_RE``, ``_line_and_column_from_offset``), and the ``parse_cypher`` ``except LarkError`` retry block. Net diff: +204 / -144. Strict-improvement consequences (Earley accepts what LALR rejected): - ``WHERE expr OR expr`` now parses as a structured ``or`` tree. - ``WHERE expr AND (expr OR expr)`` (Shape B) now parses as ``and(left, or(...))``. - ``WHERE n:Label AND n.prop = X`` now routes through structured ``where_predicates`` (was raw expr); label narrowing applies. - ``WHERE prop OP 'string'`` mixed with label predicates now hits ``ComparisonPredicate`` for strings — separately fixed in the preceding commit by extending EQ's string-accepting overrides to NE/GT/LT/GE/LE. #1031 milestones in this PR: - ✅ Retires ``_canonicalize_where_single_pattern_and_expr``. - ✅ Removes one of three ``split_top_level_and`` callers (parser side). Remaining caller: ``predicate_pushdown._split_conjuncts``, retired separately under #1200. Together this brings ``expr_split.py`` deletion within reach. - ⏭️ ``NOT pattern`` lowering (anti-semi-join) — slice 2 territory; emits explicit ``unsupported`` error today. - ⏭️ Multi-positive patterns — slice 3 territory; rejected at lift step. - ⏭️ ``OR``/``XOR`` around patterns — slice 4 territory; rejected at lift step. Tests updated: - 7 tests adjusted to match Earley's strict-improvement routing (``test_or_where_not_yet_supported``, conformance Shape B parser + binder, conformance Shape F parser + binder, label-narrowing mixed case, logical planner WITH-stage attachment). - ``test_parse_supports_where_pattern_predicate_and_expr_mix`` paren- preservation case updated to expect ``boolean_expr_to_text``- reconstructed paren-stripped form (semantically equivalent). 1539 / 1539 GFQL tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ngs entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…#1217) Pre-#1217: LALR rejected OR / NOT inside WHERE bodies, so disjunctive and negation shapes had zero native end-to-end test coverage. Earley accepts them and the runtime evaluates via pandas, but until now the correctness was implicit — only the (separate-repo) tck-gfql harness exercised these paths via xfail-marked scenarios. Adds 5 native execution tests in test_lowering.py: - ``test_string_cypher_executes_disjunctive_property_predicate_returns_union`` mirrors TCK match-where1-10: ``WHERE n.p1 = 12 OR n.p2 = 13`` returns the union of matches (a, b), not just one. - ``test_string_cypher_executes_disjunctive_same_alias_property_predicate`` ``WHERE n.p1 = 12 OR n.p1 = 99`` — same alias on both sides. - ``test_string_cypher_executes_negation_property_predicate_returns_complement`` ``WHERE NOT n.p1 = 12`` — locks pandas's NaN-NOT-NaN semantics (NaN is falsy under NOT, so rows with null p1 are excluded). - ``test_string_cypher_executes_disjunctive_then_conjunction`` ``WHERE (n.p1 = 12 OR n.p2 = 13) AND n.id = 'a'`` — mixed OR-AND tree narrowing. - ``test_string_cypher_executes_disjunction_returns_correct_count_with_more_rows`` larger fixture (6 rows) confirming OR doesn't silently union too many or too few rows. Pairs with the tck-gfql sibling branch ``issue-1031-grammar-mixed-where-pattern-expr`` which registers ``match-where1-10`` as ``success_matches_expected`` in the xfail-contract (kept as ``xfail`` rather than promoted to ``supported`` until broader OR-disjunction conformance is validated). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Author
|
Filed follow-up: #1219 — "Cypher WHERE: static validation gap on row-boolean shapes (OR/NOT/XOR among row predicates) — emergent from #1217 Earley swap". The validator question is genuinely out of scope for #1031 (which is specifically about mixing pattern + row predicates). Earley's broader unification incidentally lifts the implicit LALR rejection on row-side OR/NOT/XOR, exposing a missing semantic gate that pre-#1217 wasn't needed. Decision space (validator vs partial-accept vs full-disjunction-support vs defer-with-docs) is left open in #1219. |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IMPORTANT findings from multi-dimension review:
1. Dead-code in OR test (test_lowering.py:3199-3205) — the
``if False else`` made the first comprehension permanently dead;
the fallback branch then asserted on either of two contradictory
shapes. Replaced with a single deterministic assertion: 5 rows
expected, every kept row carries either ``p1: 1`` or ``p2: 2`` in
its rendered form.
2. EQ duplicated the new mixin's overrides (comparison.py:245-263).
The mixin docstring already invited the retrofit ("EQ predates this
mixin with an inline override"). EQ now uses
``_StringAllowingComparisonMixin`` for its single-source contract.
SUGGESTIONS deferred (documented in plan):
- NaN-NOT semantics test name (locks pandas-not-Cypher behavior;
acknowledged in test docstring + #1219 link).
- ``_rebuild_and_tree`` invents source-spans across non-adjacent
residuals (cosmetic, error-reporting only).
- No deeply-nested pattern-under-multi-level-boolean test
(``_has_pattern_descendant`` handles correctly by inspection).
- PEP-8 spacing nit after constant deletes.
- Defensive ``_split_top_level_and_pattern_leaves`` bad-flag drop on
malformed AND nodes (#1214 invariant prevents the case).
1680 GFQL + predicates tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced Apr 26, 2026
…mixed-where-pattern-expr # Conflicts: # CHANGELOG.md # graphistry/compute/gfql/cypher/parser.py # graphistry/tests/compute/gfql/cypher/test_where_bool_conformance.py
…der Earley ambiguity (#1031) Earley with ambiguous grammar parsed `MATCH ... WITH ... WHERE ... RETURN` two ways: (a) `with_stage[with_clause + with_where_clause]` (correct), or (b) `stage[with_clause] + standalone where_clause` (wrong — query_body then attached the where to the preceding MATCH at parser.py:1582-1587). Earley's default ambiguity resolution silently picked the wrong parse: the predicate ended up populated on both matches[0].where and top-level where, with with_stages[0].where = None. Surfaced by Wave 2 fresh-eyes review of #1217 — the renamed/loosened test_logical_planner_applies_predicates_attached_to_with_stage_filter masked the regression. Fix: bump Lark rule priority on with_where_clause so Earley prefers it during ambiguity resolution. Smoke-trace post-fix: match.where=0 preds, top.where=0 preds, with[0].where=ExpressionText(...) Validated on dgx-spark: - graphistry/tests/compute/gfql/cypher/ → 1038 passed - graphistry/tests/compute/gfql/ → 1543 passed - graphistry/tests/compute/ → 2441 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ring mixin Two Wave-2 review fixes: 1. test_logical_planner_applies_predicates_attached_to_with_part — restored original test name + WITH-part predicate assertion. The previous rename (..._to_with_stage_filter) and "predicate anywhere in plan" loosening had MASKED the WITH-WHERE re-routing regression that the previous commit (parser priority bump) fixed. Now the binder correctly attaches the predicate to the WITH query_part, so the original assertion holds again. 2. New test_comparison_strings.py — direct unit tests for _StringAllowingComparisonMixin on EQ/NE/GT/LT/GE/LE. Each op verified for: validate() acceptance of raw-string val, JSON round-trip preserving string val, lexicographic Series application matching pandas semantics. Closes the wave-2 IMPORTANT gap that only end-to-end coverage existed for these ops. Validated on dgx-spark: graphistry/tests/compute/ → 2458 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Slice 1 of N for issue #1031. Per tech-lead direction (1-day timebox) following an LALR(1) state-collapse on ` AND ` shapes that could not be resolved by grammar restructure within Lark's LALR(1) constraints.
Earley parser swap eliminates four LALR-induced workarounds:
Net diff: +341 / -186 across 7 files.
Two commits, logically separate
Strict-improvement consequences (Earley accepts what LALR rejected)
What this PR adds
What this PR retires
#1031 progress
Test plan
Notes for review
🤖 Generated with Claude Code