Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cost planner prefers 2 NodeIndexSeek and CartesianProduct to Expand(All #12225

Closed
frant-hartm opened this issue May 30, 2019 · 10 comments

Comments

@frant-hartm
Copy link

commented May 30, 2019

The cost planner fails to produce viable query plan on following query:

MATCH (k:Keyword)-[r:CONTAINS_KEYWORD]-(t:Tweet)
WHERE k.value IN [{keywords}] AND t.created_date > {createdDate}
RETURN t,k

where there are indexes on Keyword(value) and Tweet(created_date)

Neo4j Version: 3.5.5 (both community and enterprise)
Operating System: Ubuntu 18.04
API: Browser

Steps to reproduce

Create indexes on

CREATE INDEX ON :Tweet(created_date)
CREATE INDEX ON :Keyword(value)

Generate some random data

unwind range (1, 100000) as i
create (t:Tweet {created_date:timestamp()})
with t
unwind range(1,10) as j
create (k:Keyword {value:randomUUID()})
create (t)-[:CONTAINS_KEYWORD]->(k)

Get some of the generated values, pass to followoing query

MATCH (t:Tweet)--(k:Keyword)
RETURN t.created_date-1,k.value LIMIT 1

See the query profile:

PROFILE
MATCH (k:Keyword)-[r:CONTAINS_KEYWORD]-(t:Tweet)
WHERE k.value IN ['a72e0d93-5445-444c-a022-455d7958e0a6'] AND t.created_date > 1559128418010
return t,k

Query plan

It picks the wrong index to start with, fair enough. Let's provide the index hint.

PROFILE
MATCH (k:Keyword)-[r:CONTAINS_KEYWORD]-(t:Tweet)
USING INDEX k:Keyword(value)
WHERE k.value IN ["25563f3e-8a89-46d5-bacb-ce8fce43cda2"] AND t.created_date > 1559197470542
RETURN t,k

Query plan

Now it uses both indexes and does cartesian product, bad.

I had to resort to following to trick the planner to execute the desired query plan, note that using only WITH k,t doesn't work, need to alias the t.created_date as cd

PROFILE
MATCH (k:Keyword)-[r:CONTAINS_KEYWORD]-(t:Tweet)
USING INDEX k:Keyword(value)
WHERE k.value IN ["25563f3e-8a89-46d5-bacb-ce8fce43cda2"]
WITH k,t,t.created_date as cd
WHERE cd > 1559197470542
RETURN t,k

Query plan

Expected behaviour

Produce the last query plan without any hints, or with USING INDEX k:Keyword(value) hint only.

The currently generated query plan might be suitable in certain cases, it should be achievable by using 2 index hints:

PROFILE
MATCH (k:Keyword)-[r:CONTAINS_KEYWORD]-(t:Tweet)
USING INDEX k:Keyword(value)
USING INDEX t:Tweet(created_date)
WHERE k.value IN ["25563f3e-8a89-46d5-bacb-ce8fce43cda2"] AND t.created_date > 1559197470542
return t,k

@frant-hartm frant-hartm added the bug label May 30, 2019

@tinwelint tinwelint added the cypher label Jun 3, 2019

@Hunterness Hunterness self-assigned this Jun 3, 2019

@Hunterness

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2019

Hi, Thanks for reporting this. We will look into it.
/Team Cypher

@Lojjs

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2019

@frant-hartm When looking at your plans we can see that the estimated number of rows for the NodeIndexSeekByRange over Tweet(created_date) is 0. That could indicate that the index statistics have not been updated for some reason. Could you try to add CALL db.prepareForReplanning after your data generation to help us confirm that this is the issue?

The procedure will resample the indexes and clear query caches to make sure that the latest index statistics are used.

@Lojjs

This comment has been minimized.

Copy link
Contributor

commented Jun 20, 2019

@frant-hartm Since we have not heard back from you about the additional information we asked for 2 weeks ago, I will close this issue. If you still have problems, you are very welcome to re-open it with further information included or create a new issue.

@Lojjs Lojjs closed this Jun 20, 2019

@luanne

This comment has been minimized.

Copy link

commented Jul 26, 2019

@Lojjs I'm investigating this at the moment. I could not find a db.prepareForReplanning so instead did both call db.resampleOutdatedIndexes(), call db.resampleIndex(":Keyword(value)")
and call db.resampleIndex(":Tweet(created_date)").

The issue persists. To be sure that the Keyword index was being picked up, I dropped the Tweet index and profiled the query- works as expected by doing a NodeIndexSeek on the Keyword index and filters down to the 1 row.
Then I recreated the Tweet index on created_date and the planner immediately picks up this index to do a NodeIndexSeekByRange instead of picking the more selective Keyword index. Note that estimated rows is still at 0, which is probably why it appears to be the most selective when it's not.

I'm unable to reopen this issue on behalf of Frantisek. Let me know if you prefer a new issue, or whether you will re-open.
BTW I tried this on Neo4j Enterprise 3.5.7

@sherfert sherfert reopened this Jul 29, 2019

@fickludd

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2019

@luanne Thank you for trying all those things.

I am investigating this issue now, and the problem turned out to be caused by two surprising things in combination.

  1. Are you aware that with this setup, all Tweets get the same created_date, because timestamp() uses the query/statement clock? If you want each Tweet to get a slightly different created_date, use datetime.realtime().epochMillis instead. This also fixed the planning problem.
  2. The selectivity estimator for range seeks does not behave nicely for indexes with very low selectivity (like here with only one unique value). In fact when there is only one unique value it believed range scans would never return any rows. That will be fixed in 3.5.9.

Also note that the db.resample* procedures are asynchronous, so it might take some time after that call until the resampling is actually completed. For this setup, you could populate the graph first, and then create indexes, which would ensure sampling was correct immediately when coming online.

@luanne

This comment has been minimized.

Copy link

commented Aug 18, 2019

@fickludd thanks for investigating. The timestamp() is indeed something I was not aware of, thanks for pointing that out.
Good to know about the fix for the index selectivity in 3.5.9!

@luanne

This comment has been minimized.

Copy link

commented Aug 18, 2019

@fickludd sorry, I have a few more observations. Assuming that we did indeed have many tweets with exactly the same timestamp, providing the index hint only partially solves the issue.

In my test database, the keyword is related to exactly one tweet which matches the date filter:
image

So, I'd expect that the expansion happens via the single keyword to the single tweet but the plan for this query with the index hint

PROFILE
MATCH (k:Keyword)<-[r:CONTAINS_KEYWORD]-(t:Tweet)
USING INDEX k:Keyword(value)
WHERE k.value IN ['c4160609-ebf9-4197-9dc3-86839e3bd278'] AND t.created_date > 1566123035893
return t,k

is
image

And now for a bit of fun ;-)
If you flip the order of the MATCH expression:

PROFILE
MATCH (t:Tweet)-[r:CONTAINS_KEYWORD]->(k:Keyword)
USING INDEX k:Keyword(value)
WHERE k.value IN ['c4160609-ebf9-4197-9dc3-86839e3bd278'] AND t.created_date > 1566123035893
return t,k

the query plan changes to

image

:-)

@sherfert

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

@luanne I expect both of these plans to change in 3.5.9 with the bugfix. Could you verify that when 3.5.9 is out?

@tomswinkels

This comment has been minimized.

Copy link

commented Sep 3, 2019

@luanne

This comment has been minimized.

Copy link

commented Sep 3, 2019

@sherfert I'm confirming that these are fixed in 3.5.9, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.