Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

very slow queries when "expand" is needed from subqueries #7220

Closed
rdelangh opened this issue Mar 8, 2017 · 47 comments
Closed

very slow queries when "expand" is needed from subqueries #7220

rdelangh opened this issue Mar 8, 2017 · 47 comments

Comments

@rdelangh
Copy link

rdelangh commented Mar 8, 2017

OrientDB Version: <= 2.2.18

Java Version: N/A

OS: N/A

Expected behavior

  • I have classes that store documents of an entire week, one week per class
  • each such class holds roughly 120M records
  • some 7 indexes are defined for each class, 6 of which are FULLTEXT Lucene indexes with "allowLeadingWildcard": true
  • when I perform a count in such index, the result is obtained relatively timeframe (could be better, tough) of 61 secs:
select count(*) from index:idx_cdr_af_20170226_3 WHERE key LUCENE ' *08001* '
29884 records
  • this subset of 29884 can only be filtered further, via "expand" to obtain the properties of these documents on which I can apply a WHERE clause, such as selecting a specific date:
SELECT count(*) FROM (select expand(rid) from index:idx_cdr_af_20170226_3 WHERE key LUCENE ' *08001* ') WHERE PRT_DATE BETWEEN '2017-02-21 00:00:00.000' AND '2017-02-21 23:59:59.999'
4643 records

This query completed in 9 seconds, entirely due to caching from the previous query (which lasted for about 1 minute)

  • so if I run this counting on another class, which is not in the cache, the counting query lasts for about 4 minutes !
SELECT count( * ) FROM (select expand(rid) from index:idx_cdr_af_20170212_3 WHERE key LUCENE ' *08001* ')
WHERE PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999'
2941 records

Query executed in 261.234 sec

-> is there no better (=faster) way to run these countings ? It is a factor 3 slower than exactly the same query on a traditional, relational database with partitioning of a single big table.
So why would we use a big-database concept...

@luigidellaquila
Copy link
Member

Hi @rdelangh

A couple of points here:

  1. Lucene by itself (I mean the Lucene engine, nothing about OrientDB) is not designed/optimized to do queries with leading wildcards, this is why the index query is so slow

  2. About the rest, it would be interesting to see the result of an EXPLAIN, probably it's randomly fetching a lot of records from the cluster to do the filtering. Did you try to rewrite the query and use a LIKE instead? What is the difference in performance

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

@luigidellaquila
What do you mean with the LIKE ? Create a extra index on the property that I want to search for "0800" patterns? That would have to be an SB-Tree index, because only these (like Lucene) allow range queries.

@luigidellaquila
Copy link
Member

luigidellaquila commented Mar 8, 2017

Hi @rdelangh

No, I mean just running the query without the index. Something like

SELECT count( * ) FROM YourClass WHERE theField LIKE '*08001*'
AND PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999'

Just wondering if a full scan of the cluster is more efficient than an inefficient index access plus an inefficient random record access...

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

@luigidellaquila
scanning 120M records of the class, to find some 2000 relevant records ?
I can try, but it's likely going to last for many hours ...

@luigidellaquila
Copy link
Member

luigidellaquila commented Mar 8, 2017

Sorry, right syntax

SELECT count( * ) FROM YourClass WHERE theField LIKE '%08001%'
AND PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999'

@luigidellaquila
Copy link
Member

Well, no, with these numbers for sure it will take a lot.
How many record would this query return?

SELECT count( * ) FROM YourClass WHERE theField LIKE '%08001%'

And what about this?

SELECT count( * ) FROM YourClass WHERE  PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999'

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

@luigidellaquila
I have a compisite UNIQUE SBTREE index on some properties ENV, PRT_DATE, and others.
I launched now this query:

SELECT count(*) FROM cdr_af_20170226 
WHERE CALLED_NUMBER LIKE '%08001%' AND 
ENV='prod' AND PRT_DATE BETWEEN '2017-02-28 00:00:00.000' AND '2017-02-28 23:59:59.999'

I learned that I have to specify the index properties (ENV, PRT_DATE) as last criteria in the WHERE, otherwise the index will not be used. Correct?
Let's see how that performs...

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

Result:
6538 records found,
Query executed in 2640.024 sec

So not good.

@luigidellaquila
Copy link
Member

Yep, I see...

But you did not answer my question: how many records have theField LIKE '%08001%'? and how many have PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999' ?

My guess is that both index calls are returning a lot of records, so the executor is doing a lot of operations.

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

There are about 19M records for the date-range (= 1 day)
Counting with the LIKE is unfeasable because there is no index that can be used with the LIKE.
Counting with the Lucene returns about 29K records for the pattern '08001'

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

@luigidellaquila
Do you have any suggestions how such query can be speed up?
Parallelism ? If so, how?
Server parameters? If so, which?

@luigidellaquila
Copy link
Member

luigidellaquila commented Mar 8, 2017

I'm afraid you will hardly optimize it under 60 seconds. The LUCENE '*foo*' or LIKE '*foo*' cannot be optimized further.
Probably the only way to really optimize it is to store that 08001 as a separate field and query it with a =, but I don't know if it's feasible for your use case

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

@luigidellaquila
That is bad. The best timing was not 60 seconds, but 261 seconds (subquery with Lucene index, then expand, then filter on PRT_DATE property). This was a small class of 41M records. The queries will need to run against big classes of about 120-200M records...

Disappointing.

@luigidellaquila
Copy link
Member

@rdelangh I think if you get rid of that direct index lookup (ie. select from index:...) you will be able to write a query with plain filters and without the EXPAND, this should speed up the execution to a few seconds

Thanks

Luigi

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

hi @luigidellaquila
pls see my earlier comment above:

SELECT count(*) FROM cdr_af_20170226 
WHERE CALLED_NUMBER LIKE '%08001%' AND 
ENV='prod' AND PRT_DATE BETWEEN '2017-02-28 00:00:00.000' AND '2017-02-28 23:59:59.999'

Result:
6538 records found,
Query executed in 2640.024 sec

So not good.

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

How about parallellism ? I noticed that this is by default automatic, but

  1. how can I find out whether anything was executed in parallel ?
  2. in which conditions/use cases would parallellism speed up my above query ?
    For example, would it make sense to split up the selection in multiple parts, each for e.g. 5 mins range of PRT_DATE ?

@robfrank
Copy link
Contributor

robfrank commented Mar 8, 2017

On the lucene side, can you avoid the leading wildcard? It is really the wrong way to use an inverted index such as lucene. Maybe writing a more expressive lucene query.

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

hi @robfrank
No sorry, the leading wildcard feature is absolutely necessary. Even without the leading wildcard, the selection from the Lucene index is ready after 9 secs, which is ok, and returns some 20K RIDs. But then the "expand", which is necessary to apply a further filtering on eg PRT_DATE, makes the whole query last for 6 minutes.

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

And can you please explain what you mean with "writing a more expressive lucene query" ?

@rdelangh
Copy link
Author

rdelangh commented Mar 8, 2017

Maybe another strategy then: can we configure how the Lucene indexes are built?
For example:

  • we only need to search on substrings of the entire original STRING property value. We do not need Lucene to decompose strings into 'words' according to some delimiter characters (such as "." or ",").
  • We can also require that these substrings on which we want to search, have a minimum length of 4 characters.

How can we do that? Can you please share any examples of how the CREATE INDEX command would need to be adapted to do such Lucene optimisation?

@robfrank
Copy link
Contributor

robfrank commented Mar 8, 2017

If you can recap to me with some samples of data and what kind of queries / results you need, I can try to help

@rdelangh
Copy link
Author

rdelangh commented Mar 9, 2017

@robfrank

  • we have classes where data (telecoms call records) of one week is stored in a single class ; we have tought about storing data in a single class per day, but with the number of different kind of data (hence different classes, 365 classes per year) that would mean that we exceed the max number of clusters that OrientDB can support

  • each such class contains approx 120M documents ('records')

  • to quickly find records for a certain CALLED-number (for example), we need to be able to search all called-nrs for a given substring (hence we need the leading and trailing wildcards)

  • we need to be able to identify and/or just count such records for several consecutive days (possibly spanning multiple weekly classes) or even for a few months (several weeks, hence classes)

  • so we need also a filtering criterium on their date, which is contained in a property PRT_DATE

The quickest way to reduce the number of relevant records, is by filtering first on that CALLED-number. Only Lucene index supports leading and trailing wildcard searches, and does so fairly quickly (matter of seconds, or 1-2 minutes). Scanning the Lucene index for this, yields only the RIDs of the filtered records.

But then we need also the filtering on PRT_DATE, so we need to "expand" these RIDs from the previous step. It is apparently this "expand" step that is very inefficient, and makes the queries run for hours !

For example (only a single day is searched for, but takes many minutes to complete, see above):

SELECT count( * ) FROM (
   select expand(rid) from index:idx_cdr_af_20170212_3 WHERE key LUCENE ' *08001* ')
WHERE PRT_DATE BETWEEN '2017-02-15 00:00:00.000' AND '2017-02-15 23:59:59.999'

The inner query from the Lucene index returns in this case some 29K RIDs.
In real cases, the pattern will consist of even less significant digits, such as 0800. Such search string returns approx 322K RIDs.

The total number of records for a single day is approx 19M records. Therefor, swapping the order of the SELECTs would not give better results: first selecting all RIDs for one full day, then doing "expand" on these appro 20M RIDs, then filtering these records on their value of CALLED-number with LIKE.

I have tried this, it did not terminate even after 3 hours...

@robfrank
Copy link
Contributor

O don't know if it could work but you're using OrientDB mostly as a KV datastore. My suggestion is to try to build a multifield index on the cdr and date/datetime and use range queries :

http://orientdb.com/docs/last/Full-Text-Index.html#numeric-and-date-range-queries-from-2214

so your query would be something like

SELECT count( * ) FROM (
   select expand(rid) from index:idx_cdr_af_20170212_3 WHERE key LUCENE ' cdr_filed_name:*08001*  prt_date:[201612221000 TO 201612221100] ')

for sure, you need to do some experiments.

@rdelangh
Copy link
Author

rdelangh commented Mar 21, 2017

hi @robfrank , I was able to create such multi-field Lucene index, but the resulting counts from that kind of query seem impossible:

With property SERVEDIMSI being a STRING, and PRT_DATE being a DATETIME:

create index idx_cdr_eno_20170319_3 on cdr_eno_20170319 (SERVEDIMSI, PRT_DATE) FULLTEXT ENGINE LUCENE

Then I search for a combination of SERVEDIMSI (exact string) and PRT_DATE (range):

SELECT count(*) FROM (
SELECT expand(rid) FROM index:idx_cdr_eno_20170319_3 WHERE key LUCENE ' SERVEDIMSI:"206012224204810" AND PRT_DATE:[201703201204 TO 201703201205]'
    )
Returned 0 record(s)

whereas there exists only one single (=1) record in the class for these criteria!

Also other syntaxes do not seem to be correct:

  SELECT expand(rid) FROM index:idx_cdr_eno_20170319_3 WHERE key LUCENE ' SERVEDIMSI:206012224204810 PRT_DATE:[20170320120400000 TO 20170320120500000]'

returns records for many different values of SERVEDIMSI and many different PRT_DATE ranges outside of my specified range.

Also:


SELECT * FROM cdr_eno_20170319 WHERE [SERVEDIMSI,PRT_DATE] LUCENE 'SERVEDIMSI:206012224204810* AND PRT_DATE:[20170320120400000 TO 20170320120500000]'
Returned 0 record(s)

-> what is the correct syntax for such Lucene indexes used by OrientDB-SQL, and where can I find documentation about this syntax?

@rdelangh
Copy link
Author

can anyone please provide me the correct syntax to perform a query against a Lucene index that has been defined on multiple properties, such as the examples above in my previous post?

@rdelangh
Copy link
Author

hi @robfrank
can you please check and provide me the correct syntax for such multi-field Lucene queries?
many thanks in advance!

@rdelangh
Copy link
Author

can anyone please help me out with the correct multi-field Lucene queries syntax ?

@robfrank
Copy link
Contributor

robfrank commented Mar 28, 2017

The default Lucene's operator is 'OR'. So the query you wrote is asking for SERVEDIMSI:... OR PRT_DATE:.
Try this:

+SERVEDIMSI:206012224204810 +PRT_DATE:[20170320120400000 TO 20170320120500000]

The + means MUST

As a reference: https://lucene.apache.org/core/5_3_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

@rdelangh
Copy link
Author

the none-Lucene query shows that the record definitely exists:

select * from cdr_eno_20170319 WHERE SERVEDIMSI='206012224204810' AND ENV='prod' AND ORIGTERMMSCID='000399' AND CALLTRANSACTIONTYPE='01' AND PRT_DATE BETWEEN '2017-03-20 12:04:46.000' AND '2017-03-20 12:04:46.900'
Query executed in 0.057 sec. Returned 1 record(s)

However the Lucene multi-field query does not work:

orientdb {db=mobile}> desc cdr_eno_20170319
...
INDEXES (3 altogether)
+----+----------------------+---------------------------------------------------------------------------------------------------------------------------+
|#   |NAME                  |PROPERTIES                                                                                                                 |
+----+----------------------+---------------------------------------------------------------------------------------------------------------------------+
|0   |idx_cdr_eno_20170319_3|[SERVEDIMSI,PRT_DATE] 
...

orientdb {db=mobile}>  SELECT expand(rid) FROM index:idx_cdr_eno_20170319_3 WHERE key LUCENE '+SERVEDIMSI:206012224204810 +PRT_DATE:[20170320120400000 TO 20170320120500000]'

0 item(s) found. Query executed in 0.019 sec(s).

@robfrank
Copy link
Contributor

I'll check, because I usually don't write queries using the index: notation, but a select from "class" where [field1,field2] lucene "query" .

@rdelangh
Copy link
Author

rdelangh commented Apr 4, 2017

hi @robfrank ,
did you manage to verify which is the correct syntax to use a multi-property Lucene index?

@robfrank
Copy link
Contributor

robfrank commented Apr 5, 2017

I added some working tests on our suite, you can look at them in the linked commits.

@rdelangh
Copy link
Author

hi @robfrank , can you please explain what the "linked commits" are, and where I find your tests?

@smolinari
Copy link
Contributor

@rdelangh - the links (to the commits) are right above Rob's comment.

Scott

@robfrank
Copy link
Contributor

@rdelangh
Copy link
Author

hi @robfrank ,
thanks for the info, I had a look at your tests and I see a very similar test with a multi-property Lucene index, however in my case it does not work:
the class consists of:

orientdb {db=mobile}> desc cdr_eno_20170312
PROPERTIES
+----+---------------------------+--------+-----------------+---------+--------+--------+----+----+-------+-------+
|#   |NAME                       |TYPE    |LINKED-TYPE/CLASS|MANDATORY|READONLY|NOT-NULL|MIN |MAX |COLLATE|DEFAULT|
+----+---------------------------+--------+-----------------+---------+--------+--------+----+----+-------+-------+
|29  |SERVEDIMSI                 |STRING  |                 |true     |false   |false   |    |    |default|       |
|61  |PRT_DATE                   |DATETIME|                 |true     |false   |false   |    |    |default|       |
...
INDEXES (3 altogether)
|0   |idx_cdr_eno_20170312_3|[SERVEDIMSI,PRT_DATE]                                                                     

A direct query of one particular record:

orientdb {db=mobile}> select SERVEDIMSI, PRT_DATE from #453:9
+----+---------------+-------------------+
|#   |SERVEDIMSI     |PRT_DATE           |
+----+---------------+-------------------+
|0   |206012221582810|2017-03-12 00:00:09|
+----+---------------+-------------------+
1 item(s) found. Query executed in 0.003 sec(s).

The query via this index:

orientdb {db=mobile}> select * from index:idx_cdr_eno_20170312_3 WHERE key = '+SERVEDIMSI:206012221582810 +PRT_DATE:[20170312000001000 TO 20170312000010000]'
0 item(s) found. Query executed in 0.009 sec(s).

The query via the class:

orientdb {db=mobile}> select * from cdr_eno_20170312 WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:206012221582810 +PRT_DATE:[20170312000001000 TO 20170312000110000]'
0 item(s) found. Query executed in 0.011 sec(s).

Sorry, but I do not find the error in my syntax, if any.

@robfrank
Copy link
Contributor

may you try with dateTime values at minute? e.g.:

orientdb {db=mobile}> select * from cdr_eno_20170312 WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:206012221582810 +PRT_DATE:[201703120000 TO 201703120001]'

Just as a test, while I investigate if it is possible to query at least on seconds interval

@rdelangh
Copy link
Author

@robfrank
no luck:

orientdb {db=mobile}> select * from cdr_eno_20170312 WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:206012221582810 +PRT_DATE:[201703120000 TO 201703120001]'
0 item(s) found. Query executed in 0.022 sec(s).

@robfrank
Copy link
Contributor

Can you please send me a sample data set and what/how do you want to search over them?
e.g.:

  • we have this 100 records, want to search the ones from 12:00 to 13:00, should return 50 records

@rdelangh
Copy link
Author

rdelangh commented Apr 13, 2017

hi @robfrank ,

  1. we have way too much data to store everything in a single class per data-type; since the data can be grouped on a timestamp, we chose to store them either in a single class per day (but that means 365 classes per year, or way too many clusters required than what ODB currently supports) or in a single class per week (52 classes per year, with 6 fixed clusters per class, remains a supported number of clusters)
  2. sometimes we need to do very precise queries on one or some of the properties with quite a long substring that needs to be searched, in a reasonable timeframe of let's say 1 day; for that we can do with the single-property Lucene indexes because they return few results when we search for a long string
  3. other times, we need to search on a very generic (=short) substring, such as "0800", in a long period of time like a week or a month; for this, we do not find a fast method using ODB features, and you advised to try with multi-field Lucene indexes. But that syntax is not possible, apparently.

@rdelangh
Copy link
Author

rdelangh commented Apr 14, 2017

hi,
some extra info, in case my previous answer was not clear:
We need to be able to perform ad-hoc searches like this:

  • given an exact value of some string property, search all records in a given timerange: like
select * from cdr_eno_20170312 
WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:206012221582810 +PRT_DATE:[201703120000 TO 201703120001]
  • given a substring of some string property, search all records in a large timerange, like:
select * from cdr_eno_20170312 
WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:*20601* +PRT_DATE:[201703120000 TO 201703190000]

@robfrank
Copy link
Contributor

Ok, thank you for the clarification. Having a little sample of real data, in form of SQL insert, little database or java code, will help a lot. Feel free to send it privately.
I will take care, other than of the queries, also of the memory/threads.

@rdelangh
Copy link
Author

rdelangh commented Apr 18, 2017

@robfrank

  1. in attach is the full description of class "cdr_eno_20170312", one of the per-weekly classes from which we hope to be able to run such multi-field Lucene queries as described above
    desc-cdr-eno-20170312.txt
  2. by private email I have sent you some sample records

Mind you that the study of how we might be able to obtain our query-results, is still independent from the fact that the documented syntax to run a multi-field Lucene query is not working in Orientdb...

@rdelangh
Copy link
Author

hello @robfrank ,
can we have some update, please?

@robfrank
Copy link
Contributor

Hi @rdelangh, I'm on it, stay tuned. Thanks for sample datasa

@robfrank
Copy link
Contributor

About range queries, I understand the problem, that is "obviously" related to time-zones. I'm working on it to fix. As a very temporary workaround, just for testing, you should transform your date in UTC adding or subtracting the number of hours to "reach" utc:

201703120000
to
201703110000

(I'm on UTC+1 right now)

@robfrank
Copy link
Contributor

I created a dedicated issue related to timezone mismatch: #7382

@laa laa closed this as completed Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants