Skip to content

refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft (#1031 slice 1)#1217

Open
lmeyerov wants to merge 9 commits intomasterfrom
issue-1031-grammar-mixed-where-pattern-expr
Open

refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft (#1031 slice 1)#1217
lmeyerov wants to merge 9 commits intomasterfrom
issue-1031-grammar-mixed-where-pattern-expr

Conversation

@lmeyerov
Copy link
Copy Markdown
Contributor

Summary

Slice 1 of N for issue #1031. Per tech-lead direction (1-day timebox) following an LALR(1) state-collapse on ` AND ` shapes that could not be resolved by grammar restructure within Lark's LALR(1) constraints.

Earley parser swap eliminates four LALR-induced workarounds:

  1. The 3 dedicated pattern-shape `where_clause` grammar alternatives.
  2. `_canonicalize_where_single_pattern_and_expr` regex source-rewrite that reordered `expr AND pattern AND expr` → `pattern AND `.
  3. `_mixed_where_pattern_expr_error` pre-flight rejector.
  4. The `except LarkError` retry block in `parse_cypher`.

Net diff: +341 / -186 across 7 files.

Two commits, logically separate

  1. `fix(predicates): allow raw strings in NE/GT/LT/GE/LE` — extends EQ's pre-existing string acceptance to its sibling comparison ops. Latent bug: LALR's routing meant string comparisons mixed with labels never reached structured comparison predicates. Earley unifies the route, surfacing the strictness on idiomatic Cypher patterns. ~50 LOC self-contained in `comparison.py`.
  2. `refactor(gfql/cypher): switch to Earley parser; retire LALR pattern-shape cruft` — the main change. Grammar restructure + Earley swap + pre-flight retirement + lift step + AST extension.

Strict-improvement consequences (Earley accepts what LALR rejected)

Query shape Pre-#1031 Post-#1031
`WHERE expr OR expr` E107 syntax error parses as `or` tree
`WHERE expr AND (expr OR expr)` (Shape B) E107 parses as `and(left, or(...))`
`WHERE n:Label AND n.prop = X` (Shape F) raw expr text structured `where_predicates`
`WHERE prop OP 'string'` mixed with label ran via raw text path runs via structured ops (now safe via comparison.py fix)

What this PR adds

  • `BooleanExpr.op` literal extended with `"pattern"`; new `BooleanExpr.pattern` field.
  • Parser `pattern_atom` transformer wrapping `WHERE_PATTERN` as a tree leaf.
  • `_split_top_level_and_pattern_leaves` + `_rebuild_and_tree` + `_build_where_with_pattern_lift` helpers that extract pattern leaves from `expr_tree` into `WhereClause.predicates` as `WherePatternPredicate` entries before lowering.
  • Parser swap: `Lark(parser="earley")` for both `_parser` and `_pattern_parser`.
  • `_StringAllowingComparisonMixin` applied to NE/GT/LT/GE/LE (separate commit).

What this PR retires

  • 3 grammar alternatives, 3 transformer methods, `_mixed_where_clause`, 2 pre-flight functions, 3 derived regex constants, 3 helper utilities, the `parse_cypher` retry block.

#1031 progress

  • ✅ Retires `_canonicalize_where_single_pattern_and_expr`.
  • ✅ Removes one of three `split_top_level_and` callers (parser side). Remaining: `predicate_pushdown._split_conjuncts` (retired under gfql/ir: expose Lark and_op/or_op tree in WhereClause so passes walk structure instead of re-parsing text #1200). Together: `expr_split.py` deletion is within reach.
  • ⏭️ `NOT pattern` lowering (anti-semi-join) — slice 2. Emits explicit `unsupported` error today.
  • ⏭️ Multi-positive patterns — slice 3. Rejected at lift step.
  • ⏭️ `OR` / `XOR` around patterns — slice 4. Rejected at lift step.

Test plan

  • `pytest graphistry/tests/compute/gfql/` — 1539 pass / 80 skip / 15 xfail / 0 fail.
  • 10 tests updated to match Earley's strict-improvement routing (Categories A, B, C, D documented in `plans/issue-1031-grammar-mixed-where-pattern-expr/plan.md`).
  • gfql-benchmarks (per tech-lead's <2× parse-delta bound) — to run on CI.

Notes for review

  • The plan file (`plans/issue-1031-...`) has the full LALR investigation, planning-review (12 deltas), and tech-lead correspondence baked in.
  • Earley introduces a new failure mode (grammar ambiguity). Lark's default `ambiguity="resolve"` picks the highest-priority parse. Conformance + benchmark suites are the validation oracle.

🤖 Generated with Claude Code

lmeyerov and others added 4 commits April 25, 2026 17:40
…cates

EQ already accepts raw strings via inline ``_normalize_value`` +
``_validate_fields`` overrides.  The other comparison ops lacked the
same overrides and rejected raw strings unconditionally — their
``__call__`` methods also explicitly raised ``TypeError`` on strings.

This was latent: LALR couldn't unify label-predicate and
property-predicate alternatives in Cypher's ``where_predicates``, so
queries like ``WHERE n.name <> 'value' AND n:Label`` routed through
raw expr text and never reached ``ComparisonPredicate.__call__``.

#1031's Earley swap unifies the route, surfacing the strictness on
idiomatic Cypher patterns.  Pandas Series supports lexicographic
``>``/``<``/``!=`` on strings natively, so the fix is to extend EQ's
pre-existing string acceptance to siblings.

Implementation:
- Extract ``_StringAllowingComparisonMixin`` with the
  ``_normalize_value`` + ``_validate_fields`` overrides.
- Apply mixin to NE/GT/LT/GE/LE.
- Extend each ``__call__`` to handle ``isinstance(self.val, str)``.

The strict raw-string rejection on the base ``ComparisonPredicate``
stays — it still applies to ``Between`` (and any future direct-IR
constructors) where the datetime-vs-string ambiguity matters for
direct-IR users.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hape cruft (#1031 slice 1)

Per tech-lead direction (1-day timebox) following an LALR(1) state-collapse
on ``<comparable> AND <pattern>`` shapes that could not be resolved by
grammar restructure.

Earley parser swap eliminates four LALR-induced workarounds:

1. The 3 dedicated pattern-shape ``where_clause`` grammar alternatives
   (``where_pattern_and_expr_clause`` / ``expr_and_where_pattern_clause``
   / ``where_pattern_only_clause``) — collapse to the unified ``expr``
   route via a new ``WHERE_PATTERN -> pattern_atom`` leaf in ``?primary``.
2. ``_canonicalize_where_single_pattern_and_expr`` — regex source-rewrite
   that reordered ``expr AND pattern AND expr`` to ``pattern AND <rest>``
   so the LALR grammar could match.  Earley accepts patterns at any
   AND-spine position natively.
3. ``_mixed_where_pattern_expr_error`` — pre-flight rejector for shapes
   LALR could parse but the IR couldn't represent.  Replaced with a
   structural lift step that inspects the parsed boolean tree.
4. The ``except LarkError`` retry block in ``parse_cypher`` that fed
   the canonicalize-then-reparse path.

What this PR adds:

- ``BooleanExpr.op`` literal extended with ``"pattern"``; new
  ``BooleanExpr.pattern: Optional[Tuple[PatternElement, ...]]`` field.
- Parser ``pattern_atom`` transformer: wraps ``WHERE_PATTERN`` token as
  ``BooleanExpr(op="pattern", pattern=parsed)``; reuses the existing
  ``_parse_where_pattern_predicate_text`` helper.
- ``_split_top_level_and_pattern_leaves`` walker + ``_rebuild_and_tree``
  helper + ``_build_where_with_pattern_lift`` that extracts pattern
  leaves from ``expr_tree`` into ``WhereClause.predicates`` as
  ``WherePatternPredicate`` entries before lowering.
- Parser swap: ``Lark(parser="earley")`` for both ``_parser`` and
  ``_pattern_parser``.

What this PR retires:

- 3 grammar alternatives, 3 transformer methods, 1 helper
  (``_mixed_where_clause``), 2 pre-flight functions, 3 derived regex
  constants (``_WHERE_PATTERN_SEQUENCE_RE``, ``_WHERE_PATTERN_THEN_EXPR_RE``,
  ``_WHERE_EXPR_THEN_PATTERN_RE``), 2 helper utilities
  (``_WHERE_CLAUSE_BODY_RE``, ``_BOOLEAN_KEYWORD_RE``,
  ``_line_and_column_from_offset``), and the ``parse_cypher``
  ``except LarkError`` retry block.

Net diff: +204 / -144.

Strict-improvement consequences (Earley accepts what LALR rejected):

- ``WHERE expr OR expr`` now parses as a structured ``or`` tree.
- ``WHERE expr AND (expr OR expr)`` (Shape B) now parses as
  ``and(left, or(...))``.
- ``WHERE n:Label AND n.prop = X`` now routes through structured
  ``where_predicates`` (was raw expr); label narrowing applies.
- ``WHERE prop OP 'string'`` mixed with label predicates now hits
  ``ComparisonPredicate`` for strings — separately fixed in the
  preceding commit by extending EQ's string-accepting overrides
  to NE/GT/LT/GE/LE.

#1031 milestones in this PR:

- ✅ Retires ``_canonicalize_where_single_pattern_and_expr``.
- ✅ Removes one of three ``split_top_level_and`` callers (parser side).
  Remaining caller: ``predicate_pushdown._split_conjuncts``, retired
  separately under #1200.  Together this brings ``expr_split.py``
  deletion within reach.
- ⏭️  ``NOT pattern`` lowering (anti-semi-join) — slice 2 territory;
  emits explicit ``unsupported`` error today.
- ⏭️  Multi-positive patterns — slice 3 territory; rejected at lift step.
- ⏭️  ``OR``/``XOR`` around patterns — slice 4 territory; rejected at lift step.

Tests updated:

- 7 tests adjusted to match Earley's strict-improvement routing
  (``test_or_where_not_yet_supported``, conformance Shape B parser +
  binder, conformance Shape F parser + binder, label-narrowing mixed
  case, logical planner WITH-stage attachment).
- ``test_parse_supports_where_pattern_predicate_and_expr_mix`` paren-
  preservation case updated to expect ``boolean_expr_to_text``-
  reconstructed paren-stripped form (semantically equivalent).

1539 / 1539 GFQL tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ngs entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…#1217)

Pre-#1217: LALR rejected OR / NOT inside WHERE bodies, so disjunctive
and negation shapes had zero native end-to-end test coverage.  Earley
accepts them and the runtime evaluates via pandas, but until now the
correctness was implicit — only the (separate-repo) tck-gfql harness
exercised these paths via xfail-marked scenarios.

Adds 5 native execution tests in test_lowering.py:

- ``test_string_cypher_executes_disjunctive_property_predicate_returns_union``
  mirrors TCK match-where1-10: ``WHERE n.p1 = 12 OR n.p2 = 13`` returns
  the union of matches (a, b), not just one.
- ``test_string_cypher_executes_disjunctive_same_alias_property_predicate``
  ``WHERE n.p1 = 12 OR n.p1 = 99`` — same alias on both sides.
- ``test_string_cypher_executes_negation_property_predicate_returns_complement``
  ``WHERE NOT n.p1 = 12`` — locks pandas's NaN-NOT-NaN semantics
  (NaN is falsy under NOT, so rows with null p1 are excluded).
- ``test_string_cypher_executes_disjunctive_then_conjunction``
  ``WHERE (n.p1 = 12 OR n.p2 = 13) AND n.id = 'a'`` — mixed OR-AND
  tree narrowing.
- ``test_string_cypher_executes_disjunction_returns_correct_count_with_more_rows``
  larger fixture (6 rows) confirming OR doesn't silently union too many
  or too few rows.

Pairs with the tck-gfql sibling branch ``issue-1031-grammar-mixed-where-pattern-expr``
which registers ``match-where1-10`` as ``success_matches_expected`` in
the xfail-contract (kept as ``xfail`` rather than promoted to
``supported`` until broader OR-disjunction conformance is validated).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lmeyerov
Copy link
Copy Markdown
Contributor Author

Filed follow-up: #1219 — "Cypher WHERE: static validation gap on row-boolean shapes (OR/NOT/XOR among row predicates) — emergent from #1217 Earley swap". The validator question is genuinely out of scope for #1031 (which is specifically about mixing pattern + row predicates). Earley's broader unification incidentally lifts the implicit LALR rejection on row-side OR/NOT/XOR, exposing a missing semantic gate that pre-#1217 wasn't needed. Decision space (validator vs partial-accept vs full-disjunction-support vs defer-with-docs) is left open in #1219.

lmeyerov and others added 2 commits April 25, 2026 19:43
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IMPORTANT findings from multi-dimension review:

1. Dead-code in OR test (test_lowering.py:3199-3205) — the
   ``if False else`` made the first comprehension permanently dead;
   the fallback branch then asserted on either of two contradictory
   shapes.  Replaced with a single deterministic assertion: 5 rows
   expected, every kept row carries either ``p1: 1`` or ``p2: 2`` in
   its rendered form.
2. EQ duplicated the new mixin's overrides (comparison.py:245-263).
   The mixin docstring already invited the retrofit ("EQ predates this
   mixin with an inline override").  EQ now uses
   ``_StringAllowingComparisonMixin`` for its single-source contract.

SUGGESTIONS deferred (documented in plan):
- NaN-NOT semantics test name (locks pandas-not-Cypher behavior;
  acknowledged in test docstring + #1219 link).
- ``_rebuild_and_tree`` invents source-spans across non-adjacent
  residuals (cosmetic, error-reporting only).
- No deeply-nested pattern-under-multi-level-boolean test
  (``_has_pattern_descendant`` handles correctly by inspection).
- PEP-8 spacing nit after constant deletes.
- Defensive ``_split_top_level_and_pattern_leaves`` bad-flag drop on
  malformed AND nodes (#1214 invariant prevents the case).

1680 GFQL + predicates tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lmeyerov and others added 3 commits April 25, 2026 22:00
…mixed-where-pattern-expr

# Conflicts:
#	CHANGELOG.md
#	graphistry/compute/gfql/cypher/parser.py
#	graphistry/tests/compute/gfql/cypher/test_where_bool_conformance.py
…der Earley ambiguity (#1031)

Earley with ambiguous grammar parsed `MATCH ... WITH ... WHERE ... RETURN`
two ways: (a) `with_stage[with_clause + with_where_clause]` (correct), or
(b) `stage[with_clause] + standalone where_clause` (wrong — query_body
then attached the where to the preceding MATCH at parser.py:1582-1587).

Earley's default ambiguity resolution silently picked the wrong parse:
the predicate ended up populated on both matches[0].where and top-level
where, with with_stages[0].where = None.  Surfaced by Wave 2 fresh-eyes
review of #1217 — the renamed/loosened
test_logical_planner_applies_predicates_attached_to_with_stage_filter
masked the regression.

Fix: bump Lark rule priority on with_where_clause so Earley prefers it
during ambiguity resolution.  Smoke-trace post-fix:
  match.where=0 preds, top.where=0 preds, with[0].where=ExpressionText(...)

Validated on dgx-spark:
- graphistry/tests/compute/gfql/cypher/  → 1038 passed
- graphistry/tests/compute/gfql/         → 1543 passed
- graphistry/tests/compute/              → 2441 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ring mixin

Two Wave-2 review fixes:

1. test_logical_planner_applies_predicates_attached_to_with_part —
   restored original test name + WITH-part predicate assertion. The
   previous rename (..._to_with_stage_filter) and "predicate anywhere
   in plan" loosening had MASKED the WITH-WHERE re-routing regression
   that the previous commit (parser priority bump) fixed.  Now the
   binder correctly attaches the predicate to the WITH query_part, so
   the original assertion holds again.

2. New test_comparison_strings.py — direct unit tests for
   _StringAllowingComparisonMixin on EQ/NE/GT/LT/GE/LE.  Each op
   verified for: validate() acceptance of raw-string val, JSON
   round-trip preserving string val, lexicographic Series application
   matching pandas semantics.  Closes the wave-2 IMPORTANT gap that
   only end-to-end coverage existed for these ops.

Validated on dgx-spark: graphistry/tests/compute/ → 2458 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant