Skip to content

fix(cypher): walk expr_tree to lift label-only WHERE predicates (#1194)#1209

Merged
lmeyerov merged 2 commits intomasterfrom
m4-1194-expr-tree-walk
Apr 25, 2026
Merged

fix(cypher): walk expr_tree to lift label-only WHERE predicates (#1194)#1209
lmeyerov merged 2 commits intomasterfrom
m4-1194-expr-tree-walk

Conversation

@lmeyerov
Copy link
Copy Markdown
Contributor

Summary

Closes #1194 in the spirit of the issue: lift bare label predicates and AND-spines of bare label predicates by walking Lark's already-parsed BooleanExpr in cypher/parser.py::generic_where_clause, replacing the previous text-level split_top_level_and + regex loop. The grammar already declares the structured rule we want; this PR makes the transformer trust it.

Builds on slice 1 (#1202) which exposed WhereClause.expr_tree. Independent of slice 2 (#1207, binder-side). Shrinks the duplicate-parsing surface flagged by #1200.

What changed

graphistry/compute/gfql/cypher/parser.py:

  • New helpers (module level):
    • _match_bare_label_atom(text) -> Optional[(alias, labels)] — fullmatches atom text against _BARE_LABEL_PREDICATE_RE. fullmatch is load-bearing as the false-positive guard from Binder follow-up: replace regex WHERE label narrowing with AST-based analysis #1125 — atom fragments that merely look label-shaped (e.g. quoted-string contents) must not lift.
    • _lift_label_only_and_spine(node) — DFS over a BooleanExpr; returns lifted (alias, labels) tuples iff every AND-spine leaf is a bare-label atom, else None. Any non-AND op (OR/XOR/NOT), missing child, or non-label atom rejects the whole spine (all-or-nothing).
  • generic_where_clause simplified: walk BooleanExpr when items[0] is one (AND-spine path); for the single-atom path (items[0] is _ExpressionSlice) call _match_bare_label_atom on items[0].text directly. No top-level AND splitting on the WHERE body text inside this function. Mixed/OR/XOR/NOT inputs fall through to raw expr (with expr_tree retained for downstream slices).

graphistry/tests/compute/gfql/cypher/test_parser.py:

  • 4 focused unit tests for the new helpers (positive + negative, single atom, AND-spine, OR/NOT/mixed rejection).
  • Updated stale comment in test_parse_where_triple_and_label_conjunction_through_generic_where_clause to point at the walker (was: split_top_level_and).

CHANGELOG.md: entry under Internal.

What did not change

  • expr_split.py and split_top_level_and — still imported by other call sites in the parser (where_pattern_* canonicalization at parser.py:524) and by passes/predicate_pushdown.py. Their retirement is the eventual goal of gfql/ir: expose Lark and_op/or_op tree in WhereClause so passes walk structure instead of re-parsing text #1200 slices 3+, not this PR.
  • Binder, predicate pushdown, lowering — untouched.
  • Grammar — no change to rule alternatives, priorities, or routing. Lark still produces the same parse tree; only the transformer's interpretation of generic_where_clause is restructured.

Behavioral compatibility

End-to-end I/O contract is identical to master across the WHERE-shape matrix:

Input Routing predicates count expr expr_tree
WHERE n:Admin where_predicates rule 1 None None
WHERE n:Admin AND n:Active [AND ...] generic_where_clause, walker lifts n None None
WHERE n.x = 1 AND n:Admin where_predicates rule 2 None None
WHERE n:Admin AND n.x = 1 generic_where_clause, walker rejects (mixed) 0 text BooleanExpr
WHERE n:Admin OR n:Active generic_where_clause, walker rejects (OR root) 0 text BooleanExpr
WHERE NOT n:Admin generic_where_clause, walker rejects (NOT root) 0 text BooleanExpr
WHERE (n:Admin OR n:Active) AND n.k = 1 generic_where_clause, walker rejects (mixed leaves) 0 text BooleanExpr
WHERE n.x = 'A AND B' (string with AND) where_predicates rule 1 None None

#1207's conformance matrix and test_binder_expr_tree.py continue to pass — verified.

Test plan

  • pytest graphistry/tests/compute/gfql/cypher/test_parser.py — 111 pass (was 108 + 3 new helper tests; +1 for the existing single-label test still green)
  • pytest graphistry/tests/compute/gfql/cypher/test_binder.py — 44 pass
  • pytest graphistry/tests/compute/gfql/cypher/test_boolean_expr.py — slice-1 producer tests, all pass
  • pytest graphistry/tests/compute/gfql/cypher/test_where_bool_conformance.py — slice-2 conformance, all pass
  • pytest graphistry/tests/compute/gfql/cypher/test_binder_expr_tree.py — slice-2 binder, all pass
  • pytest graphistry/tests/compute/gfql/cypher/test_lowering.py — all pass
  • pytest graphistry/tests/compute/gfql/ (full sweep) — 1528 passed, 80 skipped, 15 xfailed, no regressions
  • mypy graphistry/compute/gfql/cypher/parser.py — clean

DRY scoreboard delta

Before this PR (master post-#1207):

Site Status
generic_where_clause regex+split duplicated and_op + bare_label_predicate_expr
expr_split.py::split_top_level_and still used at parser.py:524 (where-pattern canon) and predicate_pushdown.py
_BARE_LABEL_PREDICATE_RE (parser:305) one of two copies in repo
_CYPHER_BARE_LABEL_PREDICATE_RE (lowering:505) second copy

After this PR:

Site Status
generic_where_clause structural walker over Lark's parse tree (#1194 closed)
expr_split.py::split_top_level_and unchanged (1 fewer caller — parser.py retains its other call site)
_BARE_LABEL_PREDICATE_RE unchanged — still load-bearing inside the walker (atom-text fullmatch guard)
Lowering's regex unchanged — independent path, deferred to a later slice

Closes

Closes #1194.

Related

Supersedes

This PR replaces the abandoned approach in PR #1203 (separate Earley sub-parser instance), which would have introduced a second Lark parser path. With slice 1 already merged on master, the structural walker is the cleaner mechanism and aligns with the #1200 trajectory.

Replace the text-level `split_top_level_and` + regex loop in
`generic_where_clause` with a structural walker over Lark's parsed
`BooleanExpr` (and `_ExpressionSlice` for the single-atom path).

Closes #1194 in the spirit of the issue: the grammar already declares
the structured rule we want, and slice 1 (#1202) already exposes the
parsed tree on `WhereClause.expr_tree` — `generic_where_clause` should
trust that source of truth instead of re-splitting the WHERE body on
top-level AND.

Adds two helpers:
- `_match_bare_label_atom(text)` — fullmatch atom text against
  `_BARE_LABEL_PREDICATE_RE` (preserves the #1125 false-positive guard).
- `_lift_label_only_and_spine(node)` — DFS over a `BooleanExpr` AND-spine;
  returns lifted `(alias, labels)` tuples iff every leaf is a bare-label
  atom, else `None` (all-or-nothing — mixed/OR/XOR/NOT fall through).

Behaviorally identical to master across the WHERE-shape matrix: single
label, multi-AND chains, mixed label+property, OR/XOR/NOT, parenthesized
boolean trees, and string-literal false-positive guards all produce the
same `WhereClause` shape. Verified by parser, binder, slice-1 producer,
slice-2 conformance + binder_expr_tree, and lowering test suites
(1528 passed, no regressions). Targeted mypy clean.

Adds 4 focused unit tests for the new helpers in `test_parser.py` and
updates the stale comment in
`test_parse_where_triple_and_label_conjunction_through_generic_where_clause`
to reference the walker (was: `split_top_level_and`).

`expr_split.py` and the regex are unchanged — both still have other
callers (where-pattern canonicalization, lowering); their retirement
is deferred to later slices of #1200.

Supersedes the Earley-sub-parser approach in PR #1203, which would
have introduced a second Lark parser path; the structural walker is
cleaner now that slice 1 has landed on master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compose with #1214's parser invariant `(expr is None) == (expr_tree is None)`:

- `generic_where_clause` now relies on #1214's single-atom synthesis to
  guarantee `expr_tree` is always a `BooleanExpr`. The walker
  (`_lift_label_only_and_spine`) runs over that uniform tree.
- Drops the `_ExpressionSlice` branch this PR added — redundant now that
  #1214 wraps non-BooleanExpr operands as single-atom trees upstream of
  this point.
- Merged CHANGELOG: kept both entries (this PR's #1194 walker entry +
  #1214's invariant entry); updated this PR's entry to note the
  composition with #1214.

Verified: 940 targeted (parser/binder/boolean_expr/conformance/
binder_expr_tree/where_clause_expr_tree_invariant/lowering) passed
including #1214's 11 new invariant tests; 1539 full GFQL sweep passed.
mypy on parser.py clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lmeyerov
Copy link
Copy Markdown
Contributor Author

PR Review: PR #1209 — fix(cypher): walk expr_tree to lift label-only WHERE predicates (#1194)

Branch: m4-1194-expr-tree-walk @ 163d885f6
Base: master (post-#1207 merge at 0b1fb0cc6)
Stack: not stacked, no downstream PRs
Files changed: graphistry/compute/gfql/cypher/parser.py, graphistry/tests/compute/gfql/cypher/test_parser.py, CHANGELOG.md
LOC: +140 / −39 (3 files, 1 commit)
CI: 131/131 conclusive checks pass; mergeable; awaiting REVIEW_REQUIRED.


Blockers (must fix before merge)

None.

Important (should fix)

None.

Suggestions (nice to have)

  1. [Quality] Test fixtures _bx_atom / _bx_branch in test_parser.py:457-465 use # type: ignore[no-untyped-def] instead of explicit return annotations. Sibling helpers (_parse_query, _match_parts) are typed properly; aligning is a 4-LOC cleanup. Not blocking. — test_parser.py:457-465.

(Other Wave-1 SUGGESTIONs — right-assoc test, direct _ExpressionSlice unit test, nested mixed AND test, CHANGELOG bullet style — all rejected or downgraded under adversarial pressure. See adversarial/wave-1.md for proofs.)

Human checks required (operator decision needed)

  1. Approve and merge. No code-level concerns.
  2. Confirm Closes #1194 triggers correctly on merge (squash or merge-commit; squash is the predominant pattern for slice-style PRs in this repo).
  3. Optional post-merge cleanup: archive the abandoned m4-1194-lark-where-ambiguity branch (git push origin --delete m4-1194-lark-where-ambiguity) if no longer needed for cherry-pick reference.

Rejected / False positives (with proof)

Wave 1 finding Disposition Proof
Right-assoc AND test missing DOWNGRADED Lark grammar at parser.py:177-183 is left-recursive; right-assoc trees only via grouped_expr parens which the walker handles via the same op == "and" recursion.
Direct unit test for _ExpressionSlice branch missing DOWNGRADED E2E coverage exists via test_parse_where_single_label_predicate_produces_structured_ast and test_parse_where_label_predicate (both verified to route through the branch via plan.md Step 12.4 debug probe).
Nested mixed-AND test missing DOWNGRADED DFS visits every leaf; depth-1 mixed-leaf rejection test covers the same code path.
CHANGELOG should use bullets REJECTED #1202 / #1207 entries in the same Internal section also use long-paragraph form; matching convention.

Methodology

Run per /home/lmeyerov/Work/graphistry/.agents/skills/review/SKILL.md.

  • Phase 0: PR stack check, scope identification, plan setup at plans/1194-lark-where-predicates-ambiguity/review-1209/.
  • Phase 1: research → research/policies.md (relevance/freshness/key-rules per discovered policy doc — cypher_frontend_ci_gates.md, conformance.md, predicates_checklist.md, ARCHITECTURE.md, AGENTS.md, mypy.ini, pyproject.toml, .github/workflows/ci.yml).
  • Phase 2 Wave 1: 9 dimension passes (spec, correctness, tests, security, dry, quality, architecture, operability, conventions) with per-dimension report files. Adversarial pressure test in adversarial/wave-1.md.
  • Phase 2 Wave 2: confirmation pass with re-pressure of the surviving SUGGESTION + new dimension probes (producer invariants, primitive-literal atom edge case, items[0] type handling, WITH-WHERE separation, backtick aliases, CHANGELOG re-read). 0 new findings.
  • Convergence: 2 consecutive non-significant waves → converged.

Cross-references

Recommendation

Approve and merge. No defects. The single confirmed SUGGESTION is a 4-LOC test-fixture typing cleanup that does not warrant blocking the merge.

assert _match_bare_label_atom("'A:B'") is None # quoted string fragment


def _bx_atom(text: str): # type: ignore[no-untyped-def]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION, non-blocking] These two helpers use # type: ignore[no-untyped-def] while sibling helpers in this file (_parse_query, _match_parts) carry explicit type annotations. ~4 LOC cleanup:

def _bx_atom(text: str) -> "BooleanExpr": ...
def _bx_branch(op: str, left: "BooleanExpr", right: Optional["BooleanExpr"] = None) -> "BooleanExpr": ...

Trivial; can land in this PR or as a follow-up. Surfaced by Wave 1 of the review skill (plans/1194-lark-where-predicates-ambiguity/review-1209/waves/wave-1/quality/report.md).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gfql/cypher: resolve Lark ambiguity between where_predicates and expr (grammar-level follow-up to #1125)

1 participant