Description
I've created a second benchmark to push lance-graph to its breaking point, and it breaks :)
This one uses the well-known LDBC SNB (SF1) dataset, and the query suite and results are here - should be quite easy to reproduce.
Issue
The query pattern that breaks is Q30 in the suite: "Are there comments replying to posts created by the same person?". Basically, a self-reference to a node variable in a query.
Attempt 1
Here's how it looks in Neo4j/Kuzu/Ladybug:
// Are there comments replying to posts created by the same person?
MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN COUNT(DISTINCT c.ID) > 0 AS has_self_reply;
The above pattern doesn't work in lance-graph because of #108: COUNT(DISTINCT x.id) isn't supported yet.
Attempt 2
I then tried just returning the distinct IDs to take them to an external system (e.g., Polars) to aggregate. Not ideal from a performance standpoint, as this requires materializing a lot of results.
But this doesn't work either.
Trying:
MATCH (c:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c)
RETURN DISTINCT c.id AS id
Gives:
Traceback (most recent call last):
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 685, in <module>
    main(selected_queries)
    ~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 678, in main
    func(cfg, datasets)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 606, in run_query30
    return _execute_distinct_count(
        cfg,
        ...<5 lines>...
        as_bool=True,
    )
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 135, in _execute_distinct_count
    result = execute_query(query, cfg, datasets)
  File "/Users/prrao/code/graph-benchmark-ldbc-snb/lance_graph/query.py", line 108, in execute_query
    result = cypher.with_config(cfg).execute(datasets)
ValueError: Query planning error: Failed to join source to relationship: Schema error: No field named posthascreator_1__dst. Valid fields are c__id, c__creationdate, c__locationip, c__browserused, c__content, c__length, comment.id, comment.creationdate, comment.locationip, comment.browserused, comment.content, comment.length.
Based on the error message, it seems that the query planner can't handle patterns that point back to an already-bound variable (here, c appears at both ends of the path and closes a cycle)?
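For reference, the result the cyclic pattern is expected to produce can be sketched outside lance-graph as a single pass over the replyOfPost edges. This only illustrates the query's semantics on made-up toy data; it is not how lance-graph plans the query, and the dictionaries and function name are hypothetical.

```python
# Toy edge tables; every ID below is made up for illustration.
comment_has_creator = {101: 1, 102: 2, 103: 3}   # comment id -> person id
post_has_creator = {201: 1, 202: 2}              # post id -> person id
reply_of_post = [(101, 201), (102, 201), (103, 202)]  # (comment id, post id)


def self_reply_comment_ids(comment_creator, post_creator, replies):
    """Return IDs of comments whose author also wrote the post they reply to."""
    return {
        c
        for c, post in replies
        if c in comment_creator
        and post in post_creator
        and comment_creator[c] == post_creator[post]
    }


ids = self_reply_comment_ids(comment_has_creator, post_has_creator, reply_of_post)
has_self_reply = len(ids) > 0
print(sorted(ids), has_self_reply)  # [101] True
```

Because the same variable c sits at both ends of the path, the pattern reduces to an equality check per reply edge rather than an open-ended path expansion.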
Attempt 3
Finally, I tried rewriting it another way, using two separate comment variables and an additional WHERE predicate that checks whether c1 and c2 are the same comment, so as to force a different query plan.
// Are there comments replying to posts created by the same person?
MATCH (c1:Comment)-[:commentHasCreator]->(p:Person)<-[:postHasCreator]-(post:Post)<-[:replyOfPost]-(c2:Comment)
WHERE c1.id = c2.id
RETURN DISTINCT c1.id AS id
This doesn't work either: due to the extremely high cardinality of the intermediate paths, this query never returns (it just hangs).
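As the three attempts show, this issue is multifold: COUNT(DISTINCT ...) is unsupported (#108), the planner fails when a pattern refers back to an already-bound variable, and the acyclic rewrite blows up on intermediate results. The underlying cause is likely more complex than any one of these.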
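To give a rough sense of why the Attempt 3 rewrite could blow up, the sketch below counts the rows the acyclic pattern matches before `c1.id = c2.id` is applied, and compares that with the rows that survive the filter. It assumes the filter runs only after the full path has been expanded, which is an assumption about the plan and not something confirmed here. The data is a made-up toy example and the function names are hypothetical.

```python
from collections import Counter


def unfiltered_path_count(comment_creator, post_creator, replies):
    """Rows matched by (c1)->(p)<-(post)<-(c2) before `c1.id = c2.id` is applied."""
    comments_per_person = Counter(comment_creator.values())
    replies_per_post = Counter(post for _, post in replies)
    # Each post pairs every comment by its creator (c1) with every reply to it (c2).
    return sum(
        comments_per_person[person] * replies_per_post[post]
        for post, person in post_creator.items()
    )


def filtered_path_count(comment_creator, post_creator, replies):
    """Rows that survive the filter: replies written by the post's own creator."""
    return sum(
        1
        for c, post in replies
        if c in comment_creator
        and post in post_creator
        and comment_creator[c] == post_creator[post]
    )


# Toy data (made up): one person, 1,000 comments, 100 posts, 10 replies per post.
comment_creator = {c: 0 for c in range(1000)}
post_creator = {1000 + i: 0 for i in range(100)}
replies = [(c, 1000 + c % 100) for c in range(1000)]

unfiltered = unfiltered_path_count(comment_creator, post_creator, replies)
filtered = filtered_path_count(comment_creator, post_creator, replies)
print(unfiltered, filtered)  # 1000000 1000
```

Even in this tiny example the unfiltered path count is three orders of magnitude larger than the final answer, and the gap grows with the number of comments and posts per person.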
How to reproduce
The LDBC SNB SF1 dataset was used for this experiment. The dataset can be downloaded via the script provided here.
Once the dataset is downloaded locally, the code to build the graph in lance-graph and query it is available in the lance_graph directory of the project, here.
# Ingest the dataset into lance_graph
cd lance_graph
uv run build_graph.py
# Uncomment L649 of query.py and aim to run Q30 as follows
uv run query.py "30"