
[5.2] Speed up chunk for large sets of data #12861

Merged — 2 commits merged into laravel:5.2 on Mar 28, 2016

7 participants
@kevindoole (Contributor) commented Mar 25, 2016

I've been working on a project that involves sifting through some pretty large databases, and the query builder's chunk method wasn't working for me because it slowed down dramatically as it worked through the table.

As chunk gets deeper and deeper into a table, it asks MySQL to scan past more and more rows. When chunk tells MySQL `select whatever from wherever limit 800000, 1000`, MySQL reads and discards 800,000 rows, without using an index at all, before returning the next 1,000. Then, for the next page, it scans past all of the previous rows again to find the next set. Poor MySQL... It gets really, really slow (details below).
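The offset-based pattern the stock chunk method uses boils down to the loop below — a minimal, language-agnostic sketch in Python with SQLite, not the actual Laravel code; the `scenes` table and its columns are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO scenes (name) VALUES (?)",
                 [("row %d" % i,) for i in range(10000)])

def chunk_by_offset(conn, size):
    """Offset pagination: for every page the database must read and
    discard all rows before the offset, so page N costs roughly
    N * size row reads -- later pages keep getting slower."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM scenes LIMIT ? OFFSET ?",
            (size, offset)).fetchall()
        if not rows:
            return
        yield rows
        offset += size

pages = list(chunk_by_offset(conn, 1000))
print(len(pages), sum(len(p) for p in pages))  # 10 10000
```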

I realize this may be too much of an edge case, so I didn't want to spend too much time working on the code; I'm just looking for feedback at this point.

If you're still reading, here are some scrappy benchmarks. :)


After 4 seconds, chunk is very optimistic:
187500/862443 [▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░] 21% 4 secs/18 secs 10.0 MiB

However, in the end the process has taken substantially longer:
862443/862443 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 2 mins/2 mins 10.0 MiB

As you can imagine, the problem gets progressively worse as the database gets larger.

If we instead filter with `id > $theLastIdFromTheLastSet`, the chunking is much faster.

Again, after 4 seconds:
387000/862443 [▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░] 44% 4 secs/9 secs 10.0 MiB

And the same set takes only 21s in the end.
862443/862443 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 21 secs/21 secs 10.0 MiB
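For contrast, the keyset ("seek") version described above looks like this — the same illustrative Python/SQLite stand-in as before, with invented names, not the PR's PHP:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO scenes (name) VALUES (?)",
                 [("row %d" % i,) for i in range(10000)])

def chunk_by_id(conn, size):
    """Keyset pagination: remember the last id seen and seek straight
    to id > last_id via the primary-key index. No rows are scanned
    and thrown away, so every page costs about the same."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM scenes WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, size)).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

pages = list(chunk_by_id(conn, 1000))
print(len(pages), pages[-1][-1][0])  # 10 10000
```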

@GrahamCampbell changed the title from "Speed up chunk for large sets of data" to "[5.2] Speed up chunk for large sets of data" on Mar 25, 2016

@GrahamCampbell (Member) commented Mar 25, 2016

This only works if we are actually ordering by id.

@GrahamCampbell (Member) commented Mar 25, 2016

Also, the each function is still forced to use the poorly performing chunk method.

@kevindoole (Contributor, Author) commented Mar 25, 2016

Yeah, it's definitely a pretty specific use case (a large set, ordered by id), but that might be common enough. For people using an auto-incrementing id, using a where, or better yet a whereBetween, will always be faster, regardless of the size of the dataset. So I guess there's actually some performance benefit for smaller sets too.

Fwiw, for my needs I ended up implementing a method using whereBetween instead of using chunk. It makes sense when you don't necessarily care how many rows you get back in each set, which is probably not useful a whole lot of the time.
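The whereBetween idea can be sketched like this — again an illustrative Python/SQLite stand-in with invented names, not kevindoole's actual implementation. Each query is a cheap index range scan, but chunk sizes vary wherever the ids have gaps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO scenes (name) VALUES (?)",
                 [("row %d" % i,) for i in range(10000)])
conn.execute("DELETE FROM scenes WHERE id % 7 = 0")  # leave gaps in the ids

def chunk_by_id_window(conn, span):
    """whereBetween-style chunking: walk the id space in fixed windows.
    Rows per chunk is not guaranteed -- deleted ids shrink a window."""
    lo = 1
    max_id = conn.execute("SELECT MAX(id) FROM scenes").fetchone()[0]
    while lo <= max_id:
        rows = conn.execute(
            "SELECT id, name FROM scenes WHERE id BETWEEN ? AND ?",
            (lo, lo + span - 1)).fetchall()
        if rows:
            yield rows
        lo += span

total = sum(len(p) for p in chunk_by_id_window(conn, 1000))
print(total)  # every surviving row is visited exactly once
```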

@taylorotwell (Member) commented Mar 25, 2016

Yeah it's common enough to order by ID that it's probably worth the improvement here. Thanks!

@kevindoole (Contributor, Author) commented Mar 27, 2016

Ok, great!

So far I've kept the changed functionality completely separate in a chunkById method. Is it worth spending a bit of extra time to make the chunk method use a where clause instead of an offset by default when possible, or would you rather keep it separate? I guess the logic would be something like: if there's already an order by, or if there's no id available, use an offset; otherwise, use the where id > x.

Come to think of it, I guess where id < x order by id desc is more natural.

Sound good?
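A rough shape for such a chunkById helper, using the same callback contract as Laravel's chunk (stop early when the callback returns false), might look like this — a Python/SQLite stand-in for the real PHP, with invented table and helper names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO scenes (name) VALUES (?)",
                 [("row %d" % i,) for i in range(5000)])

def chunk_by_id(conn, count, callback):
    """chunkById-style loop: fetch `count` rows past the last seen id,
    hand them to the callback, and stop early if it returns False."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM scenes WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, count)).fetchall()
        if not rows:
            return True           # table exhausted, full run completed
        if callback(rows) is False:
            return False          # caller aborted the run
        last_id = rows[-1][0]

seen = []

def collect(rows):
    seen.extend(rows)
    return len(seen) < 2500  # returning False stops the loop

completed = chunk_by_id(conn, 1000, collect)
print(completed, len(seen))  # False 3000
```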

@taylorotwell (Member) commented Mar 28, 2016

I would just keep them separate how you have them now.

@taylorotwell merged commit bd09dda into laravel:5.2 on Mar 28, 2016

2 checks passed: StyleCI (the StyleCI analysis has passed) and continuous-integration/travis-ci/pr (the Travis CI build passed)
@taylorotwell (Member) commented Mar 28, 2016

I just merged this, but noticed the query builder version is all messed up. The query builder doesn't return a collection at all in 5.2. I guess you only wrote tests for the Eloquent version?

@kevindoole (Contributor, Author) commented Mar 28, 2016

Yeah, sorry! I was expecting to need to put a little more work in -- I should've put a 'WIP' in the PR title or something.

This evening I can send another PR to polish up the query builder side and add tests.

@vlakoff (Contributor) commented Mar 29, 2016

  • There are two chunkById methods defined, but they are very similar, so maybe the code duplication could be avoided. If you define chunkById only in the query Builder, it will also be callable from Eloquent because of __call().
  • More importantly, because of the select statement in pageAfterId, the results contain only the id column...
@halaei (Contributor) commented Apr 12, 2016

Could there be a similar thing for pagination?

```diff
@@ -1315,6 +1315,23 @@ public function forPage($page, $perPage = 15)
 }

 /**
  * Set the limit and query for the next set of results, given the
  * last scene ID in a set.
```

This comment has been minimized.

@hughgrigg

hughgrigg Apr 12, 2016

Contributor

Scene or seen?

@barryvdh (Contributor) commented May 16, 2016

@kevindoole Did you perhaps also benchmark chunking with a whereIn() or a subquery for the ids?
E.g. for large datasets, make two queries (or one query plus a subquery): one that selects just the ids for the result set, and a second that fetches the actual data. That way MySQL can look up the ids from the index alone, which should be faster (see https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/ and https://explainextended.com/2011/02/11/late-row-lookups-innodb/, though I'm not sure whether this has been optimised already).
That could work for other orderBys too.
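The late-row-lookup idea barryvdh describes can be sketched as two steps — an illustrative Python/SQLite stand-in with invented names; a real MySQL version would typically join the data table against the id subquery instead of two round trips:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scenes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO scenes (name) VALUES (?)",
                 [("row %d" % i,) for i in range(10000)])

def page_with_late_lookup(conn, size, offset):
    """Step 1: page through ids only, which the database can satisfy
    from the index alone. Step 2: fetch full rows for just those ids."""
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM scenes ORDER BY id LIMIT ? OFFSET ?",
        (size, offset))]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return conn.execute(
        "SELECT id, name FROM scenes WHERE id IN (%s) ORDER BY id" % marks,
        ids).fetchall()

page = page_with_late_lookup(conn, 500, 8000)
print(len(page), page[0][0])  # 500 8001
```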
