Adding index incurs performance regression in some cases #13010
Comments
While at the outset this looks like a case where the data in the graph explains the problem, there does seem to be an opportunity to do an early check via counts store data to quickly determine whether any pattern could possibly match. No :Person node meeting those conditions has an :IN_GENRE relationship, so rows go to 0 early. The other approach chooses to start at :Movie nodes, which IS more selective than a label scan, but when expanding in the other direction it ends up expanding through supernodes, resulting in 43388090 rows that all require expensive filtering, which ends up eliminating all of them.

In this case, the data model should be that :IN_GENRE relationships never connect to :Person nodes. The counts store knows this, and as a result it should be possible to infer that no paths could possibly match this pattern. As such, even without any special fail-fast condition from the counts store, a plan that either starts with :Person nodes and fails to expand the non-existent relationships (as happens in the first, without-index query) or that starts with a relationship type scan might be weighted higher by the planner. That would circumvent the choice of the ultimately terrible plan that has to expand through supernodes before it can perform the critical filtering step that eliminates all rows.

Additionally, there's some opportunity for the planner to have better insight into supernodes, so we could avoid picking plans that end up performing highly expensive expansions.

From the counts store, there are 20340 :IN_GENRE relationships, and from the counts of that relationship for each of the relationship types, we can conclude that the only possible pattern for these is
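To make the fail-fast idea concrete, here is a minimal Python sketch of the kind of check described above. It is not Neo4j's planner or its counts store API: the `COUNTS` dictionary, `rel_count`, and `can_match_undirected` are hypothetical names, and the per-label entries are illustrative assumptions about the dataset. Only the 20340 total is taken from this thread.

```python
from typing import Optional

# Hypothetical toy counts store keyed by (start_label, rel_type, end_label).
# None means "any label". Only the 20340 total comes from this issue; the
# per-label entries are illustrative assumptions, not measured values.
COUNTS = {
    (None, "IN_GENRE", None): 20340,
    ("Movie", "IN_GENRE", None): 20340,
    (None, "IN_GENRE", "Genre"): 20340,
}


def rel_count(start: Optional[str], rel_type: str, end: Optional[str]) -> int:
    """Look up a relationship count; a missing entry is treated as zero."""
    return COUNTS.get((start, rel_type, end), 0)


def can_match_undirected(label: str, rel_type: str) -> bool:
    """True if any rel_type relationship starts or ends at a node with label."""
    outgoing = rel_count(label, rel_type, None)
    incoming = rel_count(None, rel_type, label)
    return outgoing + incoming > 0


# (a:Person)-[:IN_GENRE]-() can never match, so a planner could stop early.
print(can_match_undirected("Person", "IN_GENRE"))
print(can_match_undirected("Movie", "IN_GENRE"))
```

With these assumed counts, the check on :Person returns false after two constant-time lookups, which is the signal a planner could use either to short-circuit the pattern or to prefer a plan that starts from the side known to produce no rows.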
It is unfortunate that the planner picks a worse plan in the presence of an index. The results are correct, however, so this is not a bug. @InverseFalcon made a good analysis of the issue. In its current state, the planner is right to prefer starting with the index, if it exists. Unfortunately, the suggested fixes to the planner are non-trivial and would have to be prioritized against other features. The issue also concerns a query that does not find any data because it does not fit the schema. Optimizing this sort of query is generally a fairly low priority compared to queries that can and do find data. I hope this makes sense.
Neo4j version: 5.2.0
Operating system: Ubuntu 20.04
API/Driver: Cypher
Dataset: Recommendations, https://github.com/neo4j-graph-examples/recommendations
Query:
OPTIONAL MATCH (a:Person)--(b:Movie)-->()-[c:IN_GENRE]-(a:Person) WHERE a.name CONTAINS " " AND b.year>1920 AND c.genre IS NULL RETURN DISTINCT b ORDER BY b LIMIT 1;
Executing the query without the index is more than 100x faster than executing it with the index.
Without Index(Using NodeByLabelScan):
With Index(Using NodeIndexSeekByRange):