Self-pointing Cypher pattern causes lance-graph to hang indefinitely for high cardinality datasets

I've created a second benchmark to push lance-graph to its breaking point, and it breaks :)

This one uses the well-known LDBC SNB (SF1) dataset, and the query suite and results are [here](https://github.com/prrao87/graph-benchmark-ldbc) - should be quite easy to reproduce.

## Issue

The query pattern that breaks is Q30 in the suite: "Are there comments replying to posts created by the same person?". Basically, a self-reference to a node variable in a query.

### Attempt 1
Here's how it looks in Neo4j/Kuzu/Ladybug:

```cypher
// Are there comments replying to posts created by the same person?
MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN COUNT(DISTINCT c.ID) > 0 AS has_self_reply;
```

The above pattern doesn't work in lance-graph because `COUNT(DISTINCT x.id)` isn't supported yet (see #108).

### Attempt 2

I then tried returning just the distinct IDs so that I could aggregate them in an external system (e.g., Polars). That's not ideal from a performance standpoint, since it requires materializing a lot of results.

But this doesn't work either.
Trying:
```cypher
MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN DISTINCT c.id AS id
```
Gives:
```
Traceback (most recent call last):
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 685, in <module>
    main(selected_queries)
    ~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 678, in main
    func(cfg, datasets)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 606, in run_query30
    return _execute_distinct_count(
        cfg,
    ...<5 lines>...
        as_bool=True,
    )
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 135, in _execute_distinct_count
    result = execute_query(query, cfg, datasets)
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 108, in execute_query
    result = cypher.with_config(cfg).execute(datasets)
ValueError: Query planning error: Failed to join source to relationship: Schema error: No field named posthascreator_1__dst. Valid fields are c__id, c__creationdate, c__locationip, c__browserused, c__content, c__length, comment.id, comment.creationdate, comment.locationip, comment.browserused, comment.content, comment.length.
```
Based on the error message, it seems that the query planner can't handle patterns that refer back to a variable already bound earlier in the same pattern (here, `c` appears at both ends of the path)?

### Attempt 3

Finally, I tried rewriting it another way, using two separate comment variables plus a `WHERE` predicate that checks whether `c1` and `c2` are the same comment, so as to force a different query plan.

```cypher
// Are there comments replying to posts created by the same person?
MATCH (c1:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c2:Comment)
WHERE c1.id = c2.id
RETURN DISTINCT c1.id AS id
```
This doesn't work either: due to the extremely high cardinality of the intermediate paths, the query never returns (it just hangs).
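
To illustrate why the intermediate paths could get so large, here is a minimal pure-Python sketch on a made-up toy graph (not LDBC data). It assumes the rewritten query is evaluated as a left-to-right chain with the `c1.id = c2.id` filter applied last; that is only a guess at what happens, not a description of lance-graph's actual plan. The names `chain_then_filter` and `direct_check` are hypothetical.

```python
from collections import defaultdict

# Toy graph (hypothetical data): 3 people, 4 posts each,
# and every post gets 5 comments from each person.
post_creator = {}     # post id -> person id
comment_creator = {}  # comment id -> person id
reply_of = {}         # comment id -> post id

for person in range(3):
    for k in range(4):
        post = f"post-{person}-{k}"
        post_creator[post] = person
        for commenter in range(3):
            for j in range(5):
                comment = f"c-{post}-{commenter}-{j}"
                comment_creator[comment] = commenter
                reply_of[comment] = post


def chain_then_filter():
    """Expand c1 -> person -> post -> c2, then keep rows where c1 == c2."""
    posts_by_person = defaultdict(list)
    for post, person in post_creator.items():
        posts_by_person[person].append(post)
    replies_by_post = defaultdict(list)
    for comment, post in reply_of.items():
        replies_by_post[post].append(comment)

    intermediate_rows = 0
    matches = set()
    for c1, person in comment_creator.items():
        for post in posts_by_person[person]:
            for c2 in replies_by_post[post]:
                intermediate_rows += 1  # one row of the unfiltered chain
                if c1 == c2:
                    matches.add(c1)
    return intermediate_rows, matches


def direct_check():
    """Bind c once: compare the comment's creator with its post's creator."""
    probes = 0
    matches = set()
    for c, post in reply_of.items():
        probes += 1
        if comment_creator[c] == post_creator[post]:
            matches.add(c)
    return probes, matches


if __name__ == "__main__":
    rows, ids_chain = chain_then_filter()
    probes, ids_direct = direct_check()
    print(rows, probes, ids_chain == ids_direct)
```

On this toy input, the chain enumerates 10,800 intermediate rows, while binding the comment once needs only 180 probes, and both return the same 60 comment IDs. If the planner does something similar, the gap on SF1 would be far larger.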

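As can be seen, the issue has several layers: `COUNT(DISTINCT ...)` isn't supported, the planner fails when a variable is reused within a pattern, and the rewritten query hangs on the high-cardinality intermediate results.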

## How to reproduce

The LDBC SNB SF1 dataset was used for this experiment. The dataset can be downloaded via the script provided [here](https://github.com/prrao87/graph-benchmark-ldbc?tab=readme-ov-file#dataset).

Once the dataset is downloaded locally, the code to build the graph in lance-graph and query it is available in the `lance_graph` directory of the project, [here](https://github.com/prrao87/graph-benchmark-ldbc/tree/main/lance_graph).

```bash
# Ingest the dataset into lance-graph
cd lance_graph
uv run build_graph.py

# Uncomment L649 of query.py, then run Q30 as follows
uv run query.py "30"
```
