
A selection of minor changes, including coping with larger data sets #3

Open
wants to merge 5 commits into master
Conversation

spither
Contributor

@spither spither commented Jul 11, 2011

Hi,

These commits cover several little problems I found and fixed, including adding support for a cluster.name config option and allowing mass-index operations to survive for larger data sets (eg 50,000+ rows on a 20 field table).

Simon

@mstein
Owner

mstein commented Jul 26, 2011

Hi spither, sorry for the very late answer, but I was (and still am) kind of on vacation, so I didn't really look into the project for some time.

I took a look at your changes and noted that some of them have already been done on HEAD, like the cluster.name config (it was committed just a few days before your pull request, actually).
I'll accept the 2a6b6fb commit at least, "force per-domain methods to always set domain based filters".

For your Hibernate session split (which is a good idea), there's just one thing that bothers me: you're assuming that the id field of the domain instances holds numeric values (max() on the id, sorting on the id), which may not be the case every time.

@spither
Contributor Author

spither commented Jul 26, 2011

Sorry, I didn't notice the cluster.name additions - I'll rebase and merge things in my fork soon.

If you've got any suggestions on changing the Hibernate session split, I'd be happy to put in a little extra work to improve it.

@mstein
Owner

mstein commented Aug 10, 2011

Probably using an offset (firstResult/maxResults) instead of the "findAllByIdGreaterThan" is the first step to look into.
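mstein's offset suggestion maps to Hibernate's setFirstResult/setMaxResults. As a rough, database-agnostic sketch of the paging loop (a Python stand-in for illustration only: `fetch_page` and `index_by_offset` are invented names, and a real implementation would run a query instead of slicing an in-memory list):

```python
def fetch_page(rows, offset, limit):
    # Stand-in for a LIMIT/OFFSET query
    # (Hibernate's setFirstResult/setMaxResults).
    return rows[offset:offset + limit]

def index_by_offset(rows, batch_size):
    """Visit every row in fixed-size pages using offsets only.

    This works regardless of the id type, which is the appeal
    over the numeric-id-based approach.
    """
    batches = []
    offset = 0
    while True:
        page = fetch_page(rows, offset, batch_size)
        if not page:
            break
        batches.append(page)
        offset += batch_size
    return batches
```

The trade-off, as the discussion below shows, is that the database still has to skip `offset` rows on every page, which gets slower the deeper the scan goes.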

@spither
Contributor Author

spither commented Aug 10, 2011

Unfortunately, using an offset isn't suitable, as the performance of offsets in MySQL (and possibly others?) sucks:

http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
http://forums.mysql.com/read.php?20,428637,428771

That means the offset approach definitely isn't going to be suitable for anyone using MySQL, which includes me in this case.

Perhaps a config option to switch between offset and numeric id? It could default to offset so that it works out of the box, and users with a numeric id could set the option to get good performance on MySQL.
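A minimal sketch of the proposed switch, assuming an in-memory list of rows as a stand-in for real queries (`index_all` and the `mode` flag are hypothetical names for illustration, not plugin API):

```python
def index_all(rows, batch_size, mode="offset"):
    """Iterate over `rows` (dicts with a numeric 'id') in batches.

    mode="offset": generic OFFSET/LIMIT paging (works for any id type).
    mode="id":     keyset paging on a numeric id, i.e.
                   WHERE id > last_id ORDER BY id LIMIT n,
                   which avoids the offset-skipping cost on MySQL.
    """
    batches = []
    if mode == "offset":
        offset = 0
        while True:
            page = rows[offset:offset + batch_size]  # LIMIT offset, n
            if not page:
                break
            batches.append(page)
            offset += batch_size
    else:  # mode == "id"
        ordered = sorted(rows, key=lambda r: r["id"])
        last_id = float("-inf")
        while True:
            # WHERE id > last_id ORDER BY id LIMIT batch_size
            page = [r for r in ordered if r["id"] > last_id][:batch_size]
            if not page:
                break
            batches.append(page)
            last_id = page[-1]["id"]
    return batches
```

Both modes visit every row exactly once; only the cost model differs, which is why a config option defaulting to "offset" would stay correct everywhere while letting MySQL users opt into the faster path.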

@mstein
Owner

mstein commented Oct 24, 2011

A small update about this issue (yeah I know, it was about time... sorry about that).
I've implemented the solution with offsets (it should be generic), and I'd like to run a few benchmarks with MySQL to see if performance is really that bad, comparing it with the numeric-id-based solution.
If there is a noticeable difference, then I'm OK with including the id-based implementation as a user-configurable mode.
Have you run a benchmark yourself and noticed that much of a difference in performance?

@mstein mstein closed this Oct 24, 2011
@mstein mstein reopened this Oct 24, 2011
@spither
Contributor Author

spither commented Nov 2, 2011

These are some very quick numbers (from direct MySQL queries, not indexing code) on a fairly small testing table. They were taken on an unloaded system, and while I'm only going to paste one example of each command, I've run them several times and the timings were very consistent. I've omitted the actual data returned.

Just to size the table:

mysql> select count(*), max(id) from hit_info;
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|  1689490 | 1690211 |
+----------+---------+

mysql> select * from hit_info limit 0, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 1000, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 100000, 1;
1 row in set (0.03 sec)

mysql> select * from hit_info limit 1000000, 1;
1 row in set (0.32 sec)

mysql> select * from hit_info limit 1500000, 1;
1 row in set (0.49 sec)

So to index row 1 million to row 1.5 million, 1,000 rows at a time (my tests on some wide tables needed far fewer rows at a time to avoid memory exhaustion, but 1,000 makes the maths easy), would take 500 queries averaging 0.4 seconds each, which is about 3.3 minutes.
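A quick sanity check of that arithmetic:

```python
rows_to_index = 1_500_000 - 1_000_000  # rows 1M through 1.5M
batch_size = 1_000
queries = rows_to_index // batch_size  # 500 queries
avg_seconds = 0.4                      # rough average from the timings above
total_seconds = queries * avg_seconds  # 200 seconds
print(total_seconds / 60)              # about 3.3 minutes
```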

However an equivalent query using IDs:

mysql> select * from hit_info where id > 1500000 limit 1;
1 row in set (0.00 sec)

...is too fast to measure in such a simple test. Quite clearly it's going to be an awful lot faster than the LIMIT based approach though.

Sorry the numbers are direct SQL queries rather than output from test code, but hopefully they illustrate the problem (with MySQL!).

@confile

confile commented Aug 10, 2013

@spither do you have an updated version of the elasticsearch plugin for version 0.90.3?
