Skip to content

feat(gfql/ir): expose Lark boolean-expression tree as WhereClause.expr_tree (#1200 slice 1/N)#1202

Merged
lmeyerov merged 4 commits intomasterfrom
issue-1200-boolean-expr-tree
Apr 24, 2026
Merged

feat(gfql/ir): expose Lark boolean-expression tree as WhereClause.expr_tree (#1200 slice 1/N)#1202
lmeyerov merged 4 commits intomasterfrom
issue-1200-boolean-expr-tree

Conversation

@lmeyerov
Copy link
Copy Markdown
Contributor

Summary

First of N slices for issue #1200. Parser now captures Lark's already-parsed `and_op` / `or_op` / `xor_op` / `not_op` tree structurally and surfaces it on `WhereClause.expr_tree` alongside the existing raw-text `.expr`.

Downstream consumers (binder, predicate pushdown) stay on the text path for now; subsequent PRs migrate them to walk the structural tree and eventually retire `expr_split.py`.

What this PR adds

  • `BooleanExpr` dataclass in `cypher/ast.py` with `op` (`and` / `or` / `xor` / `not` / `atom`), span, left/right children, and atom_text/atom_span for leaves.
  • Parser transformer methods `and_op` / `or_op` / `xor_op` / `not_op` that construct `BooleanExpr` nodes bottom-up.
  • `grouped_expr` passthrough so `(...)` parens do not collapse nested boolean structure into atom leaves.
  • `_wrap_as_boolean_atom` helper handles BooleanExpr passthrough, `_ExpressionSlice` spans, Lark Tree `.meta` spans, and primitive literal operands (bool/int/None) without meta.
  • `WhereClause.expr_tree: Optional[BooleanExpr]` additive field; `.expr` preserved for backward compat.
  • Wiring in `generic_where_clause` to extract BooleanExpr from items[0] when Lark produces one.

What this PR does NOT change

  • Binder (`_where_predicates` serialization)
  • Predicate pushdown (`_split_conjuncts`)
  • `expr_split.py` (retained as fallback — retirement tracked for a later slice)
  • `WhereClause.expr` (retained for backward compat)

Strictly additive producer; zero downstream behavior change.

Routing note

`expr_tree` is populated only when Lark routes through the generic `expr` path. That happens for:

  • `OR` / `XOR` / `NOT` operators (not supported by `where_predicates` structured rule).
  • Parenthesized subexpressions that mix operator flavors.

Pure AND conjunctions of comparable predicates (`WHERE a = 1 AND b = 2`) continue to route through the structured `where_predicates` rule and populate `.predicates` rather than `.expr_tree`. AND-joined bare labels (`n:A AND n:B`) continue to lift to structured predicates via the existing label-narrowing logic.

Tests

12 cases in `test_boolean_expr.py`:

  • Structured paths correctly leave `expr_tree=None` (single predicate, pure-AND of comparables, structured label narrowing).
  • OR / XOR / NOT shapes populate `expr_tree` with the expected op.
  • Precedence: AND binds tighter than OR; NOT binds tighter than AND/OR.
  • Parenthesized subexpressions preserve nested boolean structure via the `grouped_expr` passthrough.
  • Spans: atom spans quote the source text exactly; branch spans cover the full subexpression.
  • Backward compat: `expr_tree` and `expr` both populated when Lark routes through generic `expr`.

Test plan

  • `pytest graphistry/tests/compute/gfql/cypher/test_boolean_expr.py` — 12/12 pass
  • `pytest graphistry/tests/compute/gfql/` — 1469 pass, 80 skipped, 15 xfailed (+12 new)
  • `ruff check` clean
  • Existing parser/binder/lowering tests unchanged behavior

Follow-ups (tracked in #1200)

  1. Binder walks `expr_tree` to emit one `BoundPredicate` per top-level AND conjunct.
  2. `predicate_pushdown._split_conjuncts` becomes fallback-only.
  3. Delete `expr_split.py` once both call sites migrated.

🤖 Generated with Claude Code

lmeyerov and others added 4 commits April 24, 2026 00:42
…r_tree (#1200 slice 1/N)

Parser now captures Lark's already-parsed and_op / or_op / xor_op /
not_op tree structurally and surfaces it on ``WhereClause.expr_tree``
alongside the existing raw-text ``.expr``.  Downstream consumers
(binder, predicate pushdown) stay on the text path for now;
subsequent PRs migrate them to walk the structural tree and
eventually retire ``expr_split.py``.

What this PR adds:

- ``BooleanExpr`` dataclass in ``cypher/ast.py`` with ``op``
  (``and`` / ``or`` / ``xor`` / ``not`` / ``atom``), span, left/right
  children, and atom_text/atom_span for leaves.
- Parser transformer methods ``and_op`` / ``or_op`` / ``xor_op`` /
  ``not_op`` that construct ``BooleanExpr`` nodes bottom-up.
- ``grouped_expr`` passthrough so ``(...)`` parens do not collapse
  nested boolean structure into atom leaves.
- ``_wrap_as_boolean_atom`` helper that handles BooleanExpr
  passthrough, _ExpressionSlice spans, Lark Tree .meta spans, and
  primitive literal operands (bool/int/None) without meta.
- ``WhereClause.expr_tree: Optional[BooleanExpr]`` additive field;
  ``.expr`` preserved for backward compat.
- Wiring in ``generic_where_clause`` to extract BooleanExpr from
  items[0] when Lark produces one; ``.expr`` still populated.

Tests (``test_boolean_expr.py``, 12 cases):

- Structured paths (single predicate, pure-AND of comparables,
  structured label narrowing) correctly leave expr_tree=None.
- OR / XOR / NOT shapes populate expr_tree with the expected op.
- Precedence: AND binds tighter than OR; NOT binds tighter than AND/OR.
- Parenthesized subexpressions preserve nested boolean structure via
  the grouped_expr passthrough.
- Spans: atom spans quote the source text exactly; branch spans cover
  the full subexpression.
- Backward compat: expr_tree and expr both populated when Lark routes
  through the generic expr path.

No binder, pushdown, or expr_split.py changes this PR — those are
tracked in follow-up slices of #1200.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IMPORTANT fixes:

- ast_normalizer._rewrite_where_pattern_predicates_to_matches: when
  reconstructing a remaining WhereClause after pulling a WHERE pattern
  predicate into its own MatchClause, the constructor now preserves
  ``expr_tree=query.where.expr_tree``.  Previously it was silently
  dropped, breaking structural capture for any query mixing a WHERE
  pattern predicate with a boolean-expression tail.
- Documented the primitive-literal ``atom_text`` fidelity limitation on
  ``_wrap_as_boolean_atom`` (``str(True)`` → ``\"True\"`` not Cypher
  ``\"true\"``).  No consumer currently reads atom_text on literal
  atoms; locked by a new test ``test_literal_boolean_atoms_known_
  limitation_python_style_text`` so any future fix flips intentionally.

SUGGESTION fixes:

- DRY: ``_boolean_binary`` helper collapses ``and_op`` / ``or_op`` /
  ``xor_op`` bodies into single-line delegates, centralizing the arity
  check and error message.
- ``WhereClause`` class docstring now states field coexistence contract
  (structured path vs raw-text path vs structured-tree path).
- Tighter branch-span assertion (no ``.strip()``) so future drift is
  caught.
- Three new tests: deeply nested OR grouping preserves structure,
  ``NOT NOT`` produces nested NOT nodes, mixed AND/XOR precedence via
  parens.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ause (wave-2 #1202)

The where_pattern_and_expr_clause and expr_and_where_pattern_clause
routes reconstruct expr from source text and do not capture Lark's
structural boolean-expression tree.  This is pre-existing (they already
re-parse from text for the pattern-side anyway) but the new WhereClause
docstring contract around expr_tree made it worth noting inline as
a known limitation for the slice-1 scope of #1200.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lmeyerov lmeyerov merged commit 4b3e58c into master Apr 24, 2026
93 checks passed
@lmeyerov lmeyerov deleted the issue-1200-boolean-expr-tree branch April 24, 2026 14:43
lmeyerov added a commit that referenced this pull request Apr 25, 2026
…vel AND (#1200 slice 2/N) (#1207)

* feat(gfql/binder): walk expr_tree, emit one BoundPredicate per top-level AND (#1200 slice 2/N)

Slice 2 of issue #1200.  Slice 1 (PR #1202) populated
``WhereClause.expr_tree``; this slice teaches the binder to walk it.

When ``where.expr_tree`` is present, ``_where_predicates`` flattens
top-level AND conjuncts and emits one ``BoundPredicate`` per conjunct
instead of one giant ``BoundPredicate`` with the whole expression
text.  Downstream passes (predicate pushdown, filters) now see
pre-split single-conjunct strings, eliminating one source of text
re-parsing.

When ``where.expr_tree`` is None (mixed-clause routes; pure structured
``where_predicates``; queries with no boolean operators), behavior is
unchanged — fallback to the existing ``where.expr.text`` path or the
structured ``where.predicates`` path.

What this PR adds:

- ``_flatten_top_level_ands(expr) -> List[BooleanExpr]``: descends
  through ``and`` nodes only; or/xor/not/atom stop the recursion and
  count as one conjunct.  Handles left-, right-, and mixed-associative
  AND chains.
- ``_boolean_expr_to_text(expr) -> str``: recursive surface-text
  reconstruction for a boolean subtree.  Atoms emit atom_text; binary
  ops glue with ``L OP R``; NOT prefixes; branch operands are wrapped
  in parens for unambiguous round-tripping.  Inherits the slice-1
  literal-atom fidelity caveat (``str(True) == "True"``).
- Updated ``_where_predicates`` branches on ``expr_tree`` availability.

20 tests in ``test_binder_expr_tree.py`` covering helper invariants,
``_where_predicates`` with hand-built ``WhereClause`` fixtures
(backward-compat, OR root, AND root, no-expr), and end-to-end through
``parse_cypher`` + binder for OR / NOT / nested AND-with-OR queries.

What this PR does NOT change:

- ``predicate_pushdown._split_conjuncts`` (slice 3 — relax to fallback
  only since binder now pre-splits).
- ``graphistry/compute/gfql/expr_split.py`` (retired in slice 4 once
  pushdown migrates).
- ``WhereClause.expr`` field (kept for backward compat with mixed-clause
  routes that still use the text path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test+docs: address wave-1 review findings on #1207

- Tighten test_e2e_pure_and_of_comparables_unchanged to assert
  expr_tree is None directly, not just BoundPredicate count.  A future
  parser change that accidentally populates expr_tree on the pure-AND
  path would now fail loudly rather than silently match the count.
- Document the parser-layer mutual-exclusion invariant inline in
  _where_predicates so a future refactor introducing overlap does not
  silently double-emit BoundPredicates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(changelog): add #1200/#1207 binder-walks-expr-tree slice-2 entry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(gfql/conformance): update WHERE boolean-shape contract for #1200 slice 2

The conformance suite added in #1204 (commit 91c9777) was explicitly
designed to track the #1194 grammar-disambiguation and #1200 IR
boolean-tree migrations.  Slice 2 (#1207) flips three shape contracts
from one combined BoundPredicate to per-conjunct BoundPredicates as
the binder now flattens top-level AND.

Updates:

- Shape C ``(A OR B) AND C`` → 2 BoundPredicates (OR-compound + bare C)
- Shape D ``NOT A AND B`` → 2 BoundPredicates (NOT-compound + bare B)
- Shape F ``n:Admin AND n.active = true`` → 2 BoundPredicates (label
  + property, top-level AND splits regardless of operand flavor)

Assertions now check both count AND content (each conjunct identifiable
by substring) so future shape-contract drift is caught precisely.

Other shapes in the suite (A, B, E, G) are unaffected: shape A already
emitted N BoundPredicates via where_predicates structured route; shape
B raises GFQLSyntaxError; shape E (XOR) stays one conjunct because
top-level isn't AND; shape G (quoted AND) already produced 2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request Apr 25, 2026
… (#1209)

Replace the text-level `split_top_level_and` + regex loop in
`generic_where_clause` with a structural walker over Lark's parsed
`BooleanExpr` (and `_ExpressionSlice` for the single-atom path).

Closes #1194 in the spirit of the issue: the grammar already declares
the structured rule we want, and slice 1 (#1202) already exposes the
parsed tree on `WhereClause.expr_tree` — `generic_where_clause` should
trust that source of truth instead of re-splitting the WHERE body on
top-level AND.

Adds two helpers:
- `_match_bare_label_atom(text)` — fullmatch atom text against
  `_BARE_LABEL_PREDICATE_RE` (preserves the #1125 false-positive guard).
- `_lift_label_only_and_spine(node)` — DFS over a `BooleanExpr` AND-spine;
  returns lifted `(alias, labels)` tuples iff every leaf is a bare-label
  atom, else `None` (all-or-nothing — mixed/OR/XOR/NOT fall through).

Behaviorally identical to master across the WHERE-shape matrix: single
label, multi-AND chains, mixed label+property, OR/XOR/NOT, parenthesized
boolean trees, and string-literal false-positive guards all produce the
same `WhereClause` shape. Verified by parser, binder, slice-1 producer,
slice-2 conformance + binder_expr_tree, and lowering test suites
(1528 passed, no regressions). Targeted mypy clean.

Adds 4 focused unit tests for the new helpers in `test_parser.py` and
updates the stale comment in
`test_parse_where_triple_and_label_conjunction_through_generic_where_clause`
to reference the walker (was: `split_top_level_and`).

`expr_split.py` and the regex are unchanged — both still have other
callers (where-pattern canonicalization, lowering); their retirement
is deferred to later slices of #1200.

Supersedes the Earley-sub-parser approach in PR #1203, which would
have introduced a second Lark parser path; the structural walker is
cleaner now that slice 1 has landed on master.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant