Using Neo4j 2.0.1, the query:

match (n:States)
where n.name =~ '.*rul.*'
return n.name order by n.name desc
limit 200

returns its result after 14 secs. The number of n:States nodes is about 450,000, the number of nodes fulfilling the condition is about 130,000, and a count with the same condition returns in 1 sec. There is an index on :States(name).
ColumnFilter(symKeys=["n", "n.name"], returnItemNames=["n.name"], _rows=200, _db_hits=0)
Top(orderBy=["SortItem(Cached(n.name of type Any),false)"], limit="Literal(200)", _rows=200, _db_hits=0)
Extract(symKeys=["n"], exprKeys=["n.name"], _rows=131072, _db_hits=131072)
Filter(pred="LiteralRegularExpression(Property(n,name(1)),Literal(.*rul.*))", _rows=131072, _db_hits=458752)
NodeByLabel(label="States", identifier="n", _rows=458752, _db_hits=0)
Now, this type of match is very common to all applications - table with filter and ordering, so performance is important.
This query probably doesn't use the index (because of the condition) but it should because of the order by.
When running this type of query, SQL Server works as follows:
It scans the index in descending order and evaluates the condition for every value. When the number of items fulfilling the condition reaches 200, the work is done and the result is returned. In many cases this happens very quickly (enough items fulfilling the condition are found near the end of the indexed values).
In Neo4j with Cypher, the server apparently sorts all items fulfilling the condition and only then returns the 200 from the beginning.
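The difference between the two strategies can be sketched in Python (purely illustrative data and function names, not Neo4j internals; the names are chosen so that the matches sort last, i.e. the favourable case for a descending scan):

```python
import re

# Hypothetical ascending index over the "name" property.
# 1,000 matching names ("xrule-...") sort after 3,000
# non-matching ones ("state-..."), so a descending scan
# hits the matches first.
index = sorted([f"state-{i:06d}" for i in range(3000)]
               + [f"xrule-{i:06d}" for i in range(1000)])

pattern = re.compile(r".*rul.*")
LIMIT = 200

def early_termination_scan(index, pattern, limit):
    """SQL Server style: descending index scan, stop at `limit` hits."""
    result, scanned = [], 0
    for name in reversed(index):      # walk the index in descending order
        scanned += 1
        if pattern.match(name):
            result.append(name)
            if len(result) == limit:  # enough rows -- stop scanning
                break
    return result, scanned

def filter_sort_take(index, pattern, limit):
    """What the profile suggests: filter everything, sort, take `limit`."""
    matches = [n for n in index if pattern.match(n)]
    return sorted(matches, reverse=True)[:limit], len(index)

a, a_scanned = early_termination_scan(index, pattern, LIMIT)
b, b_scanned = filter_sort_take(index, pattern, LIMIT)
assert a == b                 # identical result rows
print(a_scanned, b_scanned)   # 200 vs 4000 rows touched
```

Both strategies return the same rows, but the early-terminating scan touches only as many index entries as it needs, which is exactly why the worst case (no matches at the top of the index) degrades gracefully to a full scan while the common case stays fast.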
Is there some plan to optimize this type of queries?
Thanks, and sorry, I'm new to Neo and GitHub :)
What you are asking for is unfortunately not a graph query, it's a full-text search query.
Currently Neo4j's schema indexes are pure exact lookup indexes, so no luck there.
What happens here is that all data with that label is pulled through the comparison, and then all names are sorted into that 200-element window to be returned.
Do you run this query on cold caches, or is the 14s from after the second run? The second run is the more realistic value (if you have enough RAM available for caching).
If you really need to do this now, you have to resort to legacy indexes and the node_auto_index for sub-pattern matching:
start n=node:node_auto_index("name:*rul*") return n
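Keep in mind that node_auto_index only covers properties written after auto-indexing was enabled. If I remember the 2.x configuration correctly, the setup in conf/neo4j.properties would look something like:

```
node_auto_indexing=true
node_keys_indexable=name
```

Properties that were set before enabling this are not indexed retroactively; they have to be re-written for the nodes to show up in index queries.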
Fulltext and other special indexes will be added to Neo4j's schema index approach in a future version.
Thank you for the quick response.
Yes, it's not a graph query, but having the possibility to use these types of queries can be very useful - for example, to get a list of starting points for the real graph search when you know just the name of the starting node or nodes.
I tried to use a legacy index, but it didn't help:
start n=node:node_auto_index("name:urra")
return n.name order by n.name desc
limit 200
The profiler shows that it used the auto index:
ColumnFilter(symKeys=["n", "n.name"], returnItemNames=["n.name"], _rows=200, _db_hits=0)
Top(orderBy=["SortItem(Cached(n.name of type Any),false)"], limit="Literal(200)", _rows=200, _db_hits=0)
Extract(symKeys=["n"], exprKeys=["n.name"], _rows=65536, _db_hits=65536)
NodeByIndexQuery(identifier="n", _db_hits=65536, _rows=65536, query="Literal(name:urra)", identifiers=["n"], idxName="node_auto_index", producer="NodeByIndexQuery")
The problem is not the usage of the index, but the order by. Cypher should be clever enough to do the same as SQL Server - start the search in ordered fashion using the index and stop after gathering enough nodes for the result.
This could also speed up many graph search queries whose results are ordered and limited.
I hope that many applications could benefit from this feature and spread the "Graph is everywhere" idea :)
But I also understand that for real graph-targeted applications this is not a major feature.
Just an off-topic question:
I deleted all nodes (about 500,000 of them) using

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r

Afterwards,

match (n) return n;

on the empty DB took about 4 secs, even after a restart of the DB and multiple runs.
Did I do anything wrong? :)