Skip to content

gfql/ir: expose Lark and_op/or_op tree in WhereClause so passes walk structure instead of re-parsing text #1200

@lmeyerov

Description

@lmeyerov

Context

`WhereClause` currently has two shapes for a WHERE body:

```python
class WhereClause:
predicates: Tuple[Union[WherePredicate, WherePatternPredicate], ...] = ()
expr: Optional[ExpressionText] = None
```

Structured predicates land in `.predicates`. Anything Lark's `where_predicates` rule cannot decompose (nested OR/XOR/NOT, mixed boolean forms, parenthesized subexpressions) falls into `.expr` as a raw text blob:

```python
@DataClass
class ExpressionText:
text: str
span: Optional[SourceSpan] = None
```

Lark's grammar already parses the expression structurally via dedicated rules (`parser.py:177-183`):

```
?or_expr: xor_expr
| or_expr "OR"i xor_expr -> or_op
?xor_expr: and_expr
| xor_expr "XOR"i and_expr -> xor_op
?and_expr: not_expr
| and_expr "AND"i not_expr -> and_op
```

…but the transformer throws the `and_op` / `or_op` / `xor_op` / `not_op` tree away and captures only `self._slice(span)`. Downstream code (binder serialization → `BoundPredicate.expression` string → `predicate_pushdown._split_conjuncts`) re-parses the text via regex-based top-level-AND splitting (`graphistry/compute/gfql/expr_split.py`, introduced by PR #1198 / issue #1195).

Why this is a smell

Two separate systems re-implement what Lark already computed:

  1. `parser.py::generic_where_clause` re-parses label-only AND conjunctions. Tracked by gfql/cypher: resolve Lark ambiguity between where_predicates and expr (grammar-level follow-up to #1125) #1194 (grammar-level disambiguation between `where_predicates` and `expr`).
  2. `predicate_pushdown._split_conjuncts` re-parses top-level AND conjuncts in arbitrary boolean expressions. This issue.

Each re-parse is a potential source of divergence (e.g., the `\\b` vs `\b` bug fixed in #1198 `_refs_for_segment`).

Proposed follow-up

Expose the parsed boolean-expression tree structurally in IR:

```python
@DataClass
class BooleanExpr:
op: Literal["and", "or", "xor", "not", "atom"]
left: Optional["BooleanExpr"] = None
right: Optional["BooleanExpr"] = None
atom_text: Optional[str] = None # leaf: raw predicate text
atom_span: Optional[SourceSpan] = None

@DataClass
class WhereClause:
predicates: Tuple[Union[WherePredicate, WherePatternPredicate], ...] = ()
expr_tree: Optional[BooleanExpr] = None # replaces ExpressionText
span: Optional[SourceSpan] = None
```

  1. Parser transformer: implement `and_op` / `or_op` / `xor_op` / `not_op` methods that build `BooleanExpr` nodes; the leaf case captures atom text + span.
  2. Binder: continue serializing leaves to `BoundPredicate` entries, but emit one BoundPredicate per top-level AND conjunct by walking `BooleanExpr` structure instead of producing one `expression=where.expr.text` blob. The existing `_split_conjuncts` in predicate pushdown becomes redundant for most real inputs.
  3. `predicate_pushdown.py`: once BoundPredicates already arrive as single conjuncts, `_split_conjuncts` is needed only for the residual case of nested subexpressions the binder chose not to flatten. Optionally, drop it entirely once gfql/cypher: resolve Lark ambiguity between where_predicates and expr (grammar-level follow-up to #1125) #1194 also lands.
  4. Delete `graphistry/compute/gfql/expr_split.py` once both call sites are gone; Lark becomes the single source of truth for AND-splitting.

Tests to keep green

  • All 20 unit tests in `test_expr_split.py` (either migrate to test the new tree-walking, or delete).
  • All predicate-pushdown tests in `test_predicate_pushdown_pass.py`, especially:
  • All parser tests in `test_parser.py` including label-narrowing integration.

Constraints

  • `BoundPredicate.expression` is still a string (other code depends on this shape). The refactor produces more BoundPredicates, not a structured replacement.
  • Backward-compatible migration: add `expr_tree` alongside `expr: Optional[ExpressionText]`, migrate callers one by one, remove `expr` in a follow-up.

Related

Priority

p3 — architectural cleanup, no user-visible bug. The current text-split is correct and tested; this change removes duplicated logic and lets the parser be authoritative.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions