
A selection of minor changes, including coping with larger data sets #3

Open
wants to merge 5 commits into master
Conversation

spither
Contributor

@spither spither commented Jul 11, 2011

Hi,

These commits cover several little problems I found and fixed, including adding support for a cluster.name config option and allowing mass-index operations to survive for larger data sets (eg 50,000+ rows on a 20 field table).

Simon

@mstein
Owner

mstein commented Jul 26, 2011

Hi spither, sorry for the very late answer, but I was (and still am) kind of on vacation, so I didn't really look into the project for some time.

I took a look at your changes and noted that some of them have already been done on HEAD, like the cluster.name config (it was committed just a few days before your pull request, actually).
I'll accept the 2a6b6fb commit at least, "force per-domain methods to always set domain based filters".

For your Hibernate session split (which is a good idea), there's just one thing that bothers me: you're assuming that the id field of the domain instances holds numeric values (max() on the id, sorting on the id), which may not be the case every time.

@spither
Contributor Author

spither commented Jul 26, 2011

Sorry, I didn't notice the cluster.name additions - I'll rebase and merge things in my fork soon.

If you've got any suggestions on changing the Hibernate session split, I'd be happy to put in a little extra work to improve it.

@mstein
Owner

mstein commented Aug 10, 2011

Probably using an offset (firstResult/maxResults) instead of the "findAllByIdGreaterThan" is the first step to look into.
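mstein's offset suggestion maps to Hibernate's setFirstResult/setMaxResults. As a rough, database-agnostic sketch of the paging loop (a Python stand-in for illustration only: `fetch_page` and `index_by_offset` are invented names, and a real implementation would run a query instead of slicing an in-memory list):

```python
def fetch_page(rows, offset, limit):
    # Stand-in for a LIMIT/OFFSET query
    # (Hibernate's setFirstResult/setMaxResults).
    return rows[offset:offset + limit]

def index_by_offset(rows, batch_size):
    """Visit every row in fixed-size pages using offsets only.

    This works regardless of the id type, which is the appeal
    over the numeric-id-based approach.
    """
    batches = []
    offset = 0
    while True:
        page = fetch_page(rows, offset, batch_size)
        if not page:
            break
        batches.append(page)
        offset += batch_size
    return batches
```

The trade-off, as the discussion below shows, is that the database still has to skip `offset` rows on every page, which gets slower the deeper the scan goes.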

@spither
Contributor Author

spither commented Aug 10, 2011

Unfortunately, using an offset isn't suitable, as the performance of offsets in MySQL (and possibly others?) sucks:

http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
http://forums.mysql.com/read.php?20,428637,428771

That means the offset approach definitely isn't going to be suitable for anyone using MySQL, which includes me in this case.

Perhaps a config option to switch between offset and numeric id? It could default to offset so that it works out of the box, and users with a numeric id could set the option to get good performance on MySQL.
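A minimal sketch of the proposed switch, assuming an in-memory list of rows as a stand-in for real queries (`index_all` and the `mode` flag are hypothetical names for illustration, not plugin API):

```python
def index_all(rows, batch_size, mode="offset"):
    """Iterate over `rows` (dicts with a numeric 'id') in batches.

    mode="offset": generic OFFSET/LIMIT paging (works for any id type).
    mode="id":     keyset paging on a numeric id, i.e.
                   WHERE id > last_id ORDER BY id LIMIT n,
                   which avoids the offset-skipping cost on MySQL.
    """
    batches = []
    if mode == "offset":
        offset = 0
        while True:
            page = rows[offset:offset + batch_size]  # LIMIT offset, n
            if not page:
                break
            batches.append(page)
            offset += batch_size
    else:  # mode == "id"
        ordered = sorted(rows, key=lambda r: r["id"])
        last_id = float("-inf")
        while True:
            # WHERE id > last_id ORDER BY id LIMIT batch_size
            page = [r for r in ordered if r["id"] > last_id][:batch_size]
            if not page:
                break
            batches.append(page)
            last_id = page[-1]["id"]
    return batches
```

Both modes visit every row exactly once; only the cost model differs, which is why a config option defaulting to "offset" would stay correct everywhere while letting MySQL users opt into the faster path.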

@mstein
Owner

mstein commented Oct 24, 2011

A small update about this issue (yeah I know, it was about time... sorry about that).
I've implemented the solution with offsets (it should be generic), and I'd like to run a few benchmarks with MySQL to see if performance is really that bad, comparing it with the numeric-id-based solution.
If there is a noticeable difference, then I'm OK with including the id-based implementation as a user-configurable mode.
Have you run a benchmark yourself and noticed that much of a difference in performance?

@mstein mstein closed this Oct 24, 2011
@mstein mstein reopened this Oct 24, 2011
@spither
Contributor Author

spither commented Nov 2, 2011

These are some very quick numbers (from direct MySQL queries, not indexing code) on a fairly small testing table. They were taken on an unloaded system, and while I'm only going to paste one example of each command, I've run them several times and the timings were very consistent. I've omitted the actual data returned.

Just to size the table:

mysql> select count(*), max(id) from hit_info;
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|  1689490 | 1690211 |
+----------+---------+

mysql> select * from hit_info limit 0, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 1000, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 100000, 1;
1 row in set (0.03 sec)

mysql> select * from hit_info limit 1000000, 1;
1 row in set (0.32 sec)

mysql> select * from hit_info limit 1500000, 1;
1 row in set (0.49 sec)

So to index row 1 million to row 1.5 million, 1,000 rows at a time (my tests on some wide tables needed far fewer rows at a time to avoid memory exhaustion, but 1,000 makes the maths easy), would take 500 queries averaging 0.4 seconds each, which is about 3.3 minutes.
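A quick sanity check of that arithmetic:

```python
rows_to_index = 1_500_000 - 1_000_000  # rows 1M through 1.5M
batch_size = 1_000
queries = rows_to_index // batch_size  # 500 queries
avg_seconds = 0.4                      # rough average from the timings above
total_seconds = queries * avg_seconds  # 200 seconds
print(total_seconds / 60)              # about 3.3 minutes
```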

However an equivalent query using IDs:

mysql> select * from hit_info where id > 1500000 limit 1;
1 row in set (0.00 sec)

...is too fast to measure in such a simple test. Quite clearly it's going to be an awful lot faster than the LIMIT based approach though.

Sorry the numbers are direct SQL queries rather than output from test code, but hopefully they illustrate the problem (with MySQL!).

@confile

confile commented Aug 10, 2013

@spither do you have an updated version of the elasticsearch plugin for version 0.90.3?
