Self-pointing Cypher pattern causes lance-graph to hang indefinitely for high cardinality datasets #111

@prrao87

Description

I've created a second benchmark to push lance-graph to its breaking point, and it breaks :)

This one uses the well-known LDBC SNB (SF1) dataset, and the query suite and results are here - should be quite easy to reproduce.

Issue

The query pattern that breaks is Q30 in the suite: "Are there comments replying to posts created by the same person?". Basically, the pattern refers back to a node variable (c) that already appears earlier in the same MATCH clause.

Attempt 1

Here's how it looks in Neo4j/Kuzu/Ladybug:

// Are there comments replying to posts created by the same person?
MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN COUNT(DISTINCT c.ID) > 0 AS has_self_reply;

The above pattern doesn't work in lance-graph, because of #108: COUNT(DISTINCT x.id) isn't yet supported in lance-graph.

Attempt 2

I then tried just returning the distinct IDs to take them to an external system (e.g., Polars) to aggregate. Not ideal from a performance standpoint, as this requires materializing a lot of results.

But this doesn't work either.
Trying:

MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN DISTINCT c.id AS id

Gives:

Traceback (most recent call last):
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 685, in <module>
    main(selected_queries)
    ~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 678, in main
    func(cfg, datasets)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 606, in run_query30
    return _execute_distinct_count(
        cfg,
    ...<5 lines>...
        as_bool=True,
    )
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 135, in _execute_distinct_count
    result = execute_query(query, cfg, datasets)
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 108, in execute_query
    result = cypher.with_config(cfg).execute(datasets)
ValueError: Query planning error: Failed to join source to relationship: Schema error: No field named posthascreator_1__dst. Valid fields are c__id, c__creationdate, c__locationip, c__browserused, c__content, c__length, comment.id, comment.creationdate, comment.locationip, comment.browserused, comment.content, comment.length.

Based on the error message, it seems that the query planner can't handle query patterns that point back to the same variable?

Attempt 3

Finally, I tried rewriting it with two comment variables, c1 and c2, plus a WHERE predicate that checks that they refer to the same comment, so as to force a different query plan.

// Are there comments replying to posts created by the same person?
MATCH (c1:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c2:Comment)
WHERE c1.id = c2.id
RETURN DISTINCT c1.id AS id

This doesn't work either: the intermediate paths have extremely high cardinality, so the query never returns (it just hangs).
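To illustrate why the two-variable rewrite can blow up, here is a minimal sketch in plain Python over toy edge lists. The data and the function names (`two_variable_plan`, `bound_variable_plan`) are made up for illustration, and this is only my guess at the shape of the problem, not a description of how lance-graph actually plans the query. It compares enumerating every (c1, p, post, c2) path before filtering on c1 = c2 against treating the second occurrence of c as an already-bound variable.

```python
from collections import defaultdict

# Toy edge lists (hypothetical data, not taken from LDBC SNB).
# Each tuple is (source_id, target_id).
comment_has_creator = [(100, 1), (101, 1), (102, 2), (103, 2)]    # comment -> person
post_has_creator = [(200, 1), (201, 2)]                            # post -> person
reply_of_post = [(100, 200), (101, 201), (102, 200), (103, 201)]  # comment -> post


def two_variable_plan():
    """Enumerate all (c1, person, post, c2) paths, then filter on c1 == c2."""
    posts_by_person = defaultdict(list)
    for post, person in post_has_creator:
        posts_by_person[person].append(post)

    replies_by_post = defaultdict(list)
    for comment, post in reply_of_post:
        replies_by_post[post].append(comment)

    # Every comment by a person is paired with every reply to every post
    # by that same person, before the equality filter is applied.
    paths = [
        (c1, person, post, c2)
        for c1, person in comment_has_creator
        for post in posts_by_person[person]
        for c2 in replies_by_post[post]
    ]
    matches = {c1 for c1, _, _, c2 in paths if c1 == c2}
    return len(paths), matches


def bound_variable_plan():
    """Treat the repeated variable as bound: one lookup per replyOfPost edge."""
    # Assumes each comment and each post has exactly one creator.
    creator_of_comment = dict(comment_has_creator)
    creator_of_post = dict(post_has_creator)
    matches = {
        comment
        for comment, post in reply_of_post
        if creator_of_comment.get(comment) == creator_of_post.get(post)
    }
    return len(reply_of_post), matches


if __name__ == "__main__":
    print("two-variable plan (rows, matches):", two_variable_plan())
    print("bound-variable plan (rows, matches):", bound_variable_plan())
```

Both functions find the same matching comments, but the first builds one intermediate row per (comment by p, reply to a post by p) pair, which grows roughly with the product of those two counts per person, while the second touches each replyOfPost edge once.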

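So the issue has several layers: COUNT(DISTINCT ...) isn't supported (#108), the planner fails when a pattern points back to the same variable, and the two-variable rewrite hangs on the high-cardinality intermediate paths.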

How to reproduce

The LDBC SNB SF1 dataset was used for this experiment. The dataset can be downloaded via the script provided here.

The code to build the graph in lance-graph and query it is available in the lance_graph directory of the project, here. Once the dataset is downloaded locally, run the following:

# Ingest the dataset into lance-graph
cd lance_graph
uv run build_graph.py

# Uncomment L649 of query.py, then run Q30 as follows
uv run query.py "30"
