speed up package_list action by only retrieving column we need (not whole package) from db #1042

wardi · 2013-06-25T19:16:28Z

importing package_revision_table here might be frowned on. What sorts of optimizations would you consider acceptable?

It would be even faster if we did some select array(...) magic in our query, but that might be taking it a little too far.

edit: performance numbers from comments below:

option	time(s)	improvement
original	0.479	1.0x
this PR	0.048	10.0x
select array	0.027	17.7x
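
For readers unfamiliar with the idea behind this PR, here is a minimal sketch of "fetch only the column you need". It uses the standard-library sqlite3 module rather than CKAN's SQLAlchemy models, and the table layout, sample rows and function names are made up for illustration; only the state/current filter and the ordering by name mirror what is discussed in this thread.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for package_revision (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE package_revision ("
        " id TEXT, name TEXT, notes TEXT, state TEXT, current INTEGER)"
    )
    rows = [
        ("id-3", "gamma", "placeholder text", "active", 1),
        ("id-1", "alpha", "placeholder text", "active", 1),
        ("id-2", "beta", "placeholder text", "deleted", 1),
        ("id-1", "alpha-old", "superseded revision", "active", 0),
    ]
    conn.executemany("INSERT INTO package_revision VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def names_via_whole_rows(conn):
    """Load every column of every matching row, then keep one field."""
    rows = conn.execute(
        "SELECT * FROM package_revision WHERE state = 'active' AND current = 1"
    ).fetchall()
    return sorted(row[1] for row in rows)  # row[1] is the name column


def names_via_single_column(conn):
    """Ask the database for the one column we need, already sorted."""
    rows = conn.execute(
        "SELECT name FROM package_revision"
        " WHERE state = 'active' AND current = 1 ORDER BY name"
    )
    # Each row is a 1-tuple; unpack it so the result is a flat list.
    return [name for (name,) in rows]
```

Both functions return the same list; the second one simply moves less data out of the database. How large the saving is on a real CKAN instance depends on the size of the rows, so the sketch says nothing about the timings reported in this thread.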

tobes · 2013-06-26T10:09:56Z

ckan/logic/action/get.py

-
-    packages = query.all()
-    return [getattr(p, ref_package_by) for p in packages]
+    from ckan.model.package import package_revision_table


This is frowned upon in CKAN:

a) we avoid from ckan... import ... because it leads to circular imports, of which we have too many (though we are slowly improving)
b) imports should be at the top of the file (unless circular import hell prevents this)

Additionally, logic functions get model from the context.
Also, package_revision_table should be available from ckan.model, which has crazy imports due to Pylons.

Something like this should work (model is already in the function on line 70):

    model = context['model']
    package_revision_table = model.package_revision_table

tobes · 2013-06-26T10:18:26Z

What sort of performance improvement did this give, and how much more did the array madness give?

One thing to be aware of is our long-term goal of pushing the SQL side of things back into the models rather than keeping it in the logic functions. What you've done here is good, but just so you know, that is where we want to end up eventually.

You also have several failing tests, but I'm not sure why.

@kindly what is your view of this PR? It seems sensible to me.

kindly · 2013-06-26T10:26:17Z

@tobes Seems fine with me. If somebody cleans up the import stuff I can merge.

tobes · 2013-06-26T10:33:45Z

@wardi do you want to do this? If not, I'm happy to.

wardi · 2013-06-26T12:43:36Z

@tobes drat. Don't merge this yet, I forgot to cherry-pick the commit that actually makes this work (I think this code is creating an extra list, hence the failing tests)

Let me clean this up with the changes you suggested and apply the fix.

You mentioned wanting this sort of low level thing in the model, should I create some methods on the package model instead? e.g. active_package_ids() + active_package_names()

tobes · 2013-06-26T13:05:16Z

@wardi if you can push it back into the model that would be great (otherwise we will end up doing it much later)

could you do something like def active_package_ids(self, limit=None, offset=0): as that would seem to be useful

wardi · 2013-06-26T13:09:22Z

@tobes this was 2 or 3x faster for us. I never implemented the array() trick because I think I would have to drop to raw SQL to do it. Would raw SQL be acceptable if it was hidden in a model method?

I'm not a big fan of limit+offset. How about limit and "following" (i.e. pass the last id/name that was retrieved)? That way ids+names won't be repeated or disappear while making repeated calls to collect all the ids/names.

tobes · 2013-06-26T13:14:08Z

limit and "following" hmm - this might be good

@seanh @kindly @amercader @johnglover @joetsoi @johnmartin any thoughts on this offset/following stuff - consistency is king

tobes · 2013-06-26T13:15:29Z

I think we would try not to use raw SQL unless there was a BIG performance gain. Others may agree/disagree.

wardi · 2013-06-26T17:31:01Z

Here are some numbers: On a CKAN instance with 6k datasets these are the best times I saw:

2013-06-26 13:29:24,005 INFO [ckan.lib.base] /api/action/package_list_original render time 0.479 seconds
2013-06-26 13:30:18,356 INFO [ckan.lib.base] /api/action/package_list render time 0.048 seconds

So, about a 10x speed-up in this case

wardi · 2013-06-26T18:07:12Z

raw SQL with select array best time:

2013-06-26 14:02:01,016 INFO [ckan.lib.base] /api/action/package_list_array render time 0.027 seconds

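    # ARRAY(subquery) is PostgreSQL syntax: the whole result comes back
    # as a single array value in one row.
    conn = model.Session.connection()
    # The column is one of two fixed names, never user input.
    column = 'id' if api == 2 else 'name'
    result = conn.execute("""SELECT ARRAY(
        SELECT package_revision.{0}
        FROM package_revision
        WHERE package_revision.state = 'active'
        AND package_revision.current = True
        ORDER BY package_revision.{0}
        )
    """.format(column))
    # First row, first column: the array itself.
    return list(result)[0][0]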

tobes · 2013-06-27T07:23:38Z

@wardi The first change is now in master and I've requested that it be accepted in 2.1, but it may have missed the merge window. We will see.

@kindly What's your view of the raw SQL approach?

amercader · 2013-06-28T10:40:58Z

@tobes what is in master, and what should go into 2.1? Is it the current 4 commits or just the first one?

@wardi Not a big deal, but if you can prepend the issue number on the commit message it makes life easier when managing different releases, see https://github.com/okfn/ckan/blob/master/CONTRIBUTING.rst#commit-messages

tobes · 2013-06-28T10:56:33Z

@amercader it is the 4 commits here - thanks

joetsoi · 2013-06-30T22:50:26Z

@tobes a quick grep suggests we use offset/limit in the other get action functions, the datastore and a bunch of other places.

wardi · 2013-07-01T01:51:53Z

@joetsoi no question. My argument is that for new interfaces we should build something that's easier to use correctly.

offset/limit assumes that the data isn't changing from one call to the next. That assumption often isn't true, and it can lead to code with bugs that are subtle and hard to reproduce. A parameter like "following" makes the caller tell the API which was the last record returned, so that the API can resume sending entries in a sensible way.

For this API imagine trying to write code that will list all the package IDs while only retrieving 100 at a time using offset/limit parameters. If you want it to be reliable (not missing any ids that existed prior to the first call) then you need logic that will intentionally overlap previous results and rewind if the records it gets back don't cover ones it has already seen -- not something I would expect most users to program. With limit/following there's just one obvious and simple way to get all the ids.
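
The difference between the two styles can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the in-memory set stands in for the package table, and page_by_offset, page_by_following and demo are hypothetical names, not part of CKAN's API.

```python
def page_by_offset(names, offset, limit):
    """offset/limit paging: skip a fixed number of rows, then take a page."""
    return sorted(names)[offset:offset + limit]


def page_by_following(names, following, limit):
    """limit/following paging: resume strictly after the last name seen."""
    ordered = sorted(names)
    if following is not None:
        ordered = [n for n in ordered if n > following]
    return ordered[:limit]


def demo():
    names = {"a", "b", "c", "d", "e", "f"}
    # Between the first and second call, package "a" is deleted.
    changed = names - {"a"}

    # offset/limit: every remaining row shifts up by one, so "d" slides
    # into the range the caller believes it has already read.
    first = page_by_offset(names, 0, 3)
    second = page_by_offset(changed, 3, 3)
    seen_offset = first + second

    # limit/following: the caller passes the last name it received,
    # so the deletion does not move the resume point.
    first_k = page_by_following(names, None, 3)
    second_k = page_by_following(changed, first_k[-1], 3)
    seen_following = first_k + second_k

    return seen_offset, seen_following
```

In this run the offset/limit caller never sees "d", even though "d" existed before the first call, while the limit/following caller collects every name. A real implementation would push the "greater than the last name" comparison into the database query instead of filtering a list.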

speed up package_list action by only retrieving column we need (not whole package) from db

package_list performance fix

9d4ba31

tobes reviewed Jun 26, 2013

wardi added 3 commits June 26, 2013 10:35

package_list: return sorted by name

c84e9ea

fix for package_list returning nested list

fd6e639

package_list: use context['model'] instead of import, from @tobes

874abf5

ghost assigned kindly Jul 23, 2013

kindly added a commit that referenced this pull request Aug 1, 2013

Merge pull request #1042 from wardi/package_list-performance

7292794

speed up package_list action by only retrieving column we need (not whole package) from db

kindly merged commit 7292794 into ckan:master Aug 1, 2013
