Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up package_list action by only retrieving column we need (not whole package) from db #1042

Merged
merged 4 commits into from Aug 1, 2013

Conversation

wardi
Copy link
Contributor

@wardi wardi commented Jun 25, 2013

importing package_revision_table here might be frowned on. What sorts of optimizations would you consider acceptable?

It would be even faster if we did some select array(...) magic in our query, but that might be taking it a little too far.

edit: performance numbers from comments below:

option time(s) improvement
original 0.479 1.0x
this PR 0.048 10.0x
select array 0.027 17.7x


packages = query.all()
return [getattr(p, ref_package_by) for p in packages]
from ckan.model.package import package_revision_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is frowned upon in ckan

a) we avoid from ckan... import ... as it leads to circular imports of which we have too many but are slowly improving
b) imports should be in the head of the file (unless circular import hell prevents this)

additionally logic functions get model from the context
also package_revision_table should be in ckan.model which has crazy imports due to pylons

this sort of thing should work (model is already in the function on ln 70)

model = context['model']
package_revision_table  = model.package_revision_table

@tobes
Copy link
Contributor

tobes commented Jun 26, 2013

what sort of performance improvement did this give and how much more did the array madness?

One thing to be aware of is our long term goal of pushing the sql side of stuff back into the models rather than in the logic functions. What you've done here is good but just so you know that is where we want to end up eventually

you have also got several failing test but not sure why

@kindly what is your view of this pr it seems sensible to me

@kindly
Copy link
Contributor

kindly commented Jun 26, 2013

@tobes Seems fine with me. If somebody cleans up the import stuff I can merge.

@tobes
Copy link
Contributor

tobes commented Jun 26, 2013

@wardi do you want to do this if not I'm happy to

@wardi
Copy link
Contributor Author

wardi commented Jun 26, 2013

@tobes drat. Don't merge this yet, I forgot to cherry-pick the commit that actually makes this work (I think this code is creating an extra list, hence the failing tests)

Let me clean this up with the changes you suggested and apply the fix.

You mentioned wanting this sort of low level thing in the model, should I create some methods on the package model instead? e.g. active_package_ids() + active_package_names()

@tobes
Copy link
Contributor

tobes commented Jun 26, 2013

@wardi if you can push it back into the model that would be great (otherwise we will end up doing it much later)

could you do something like def active_package_ids(self, limit=None, offset=0): as that would seem to be useful

@wardi
Copy link
Contributor Author

wardi commented Jun 26, 2013

@tobes this was 2 or 3x faster for us, I never implemented the array() trick because I think I would have to drop to raw SQL to do it. Would raw SQL be acceptable if it was hidden in a model method?

I'm not a big fan of limit+offset. How about limit and "following" (i.e. pass the last id/name that was retrieved)? That way ids+names won't be repeated or disappear while making repeated calls to collect all the ids/names.

@tobes
Copy link
Contributor

tobes commented Jun 26, 2013

limit and "following" hmm - this might be good

@seanh @kindly @amercader @johnglover @joetsoi @johnmartin any thoughts on this offset/following stuff - consistency is king

@tobes
Copy link
Contributor

tobes commented Jun 26, 2013

I think we would try to not use raw sql unless there was a BIG performance gain others may agree/disagree

@wardi
Copy link
Contributor Author

wardi commented Jun 26, 2013

Here are some numbers: On a CKAN instance with 6k datasets these are the best times I saw:

2013-06-26 13:29:24,005 INFO [ckan.lib.base] /api/action/package_list_original render time 0.479 seconds
2013-06-26 13:30:18,356 INFO [ckan.lib.base] /api/action/package_list render time 0.048 seconds

So, about a 10x speed-up in this case

@wardi
Copy link
Contributor Author

wardi commented Jun 26, 2013

raw SQL with select array best time:

2013-06-26 14:02:01,016 INFO [ckan.lib.base] /api/action/package_list_array render time 0.027 seconds

    conn = model.Session.connection()
    result = conn.execute("""SELECT ARRAY(
        SELECT package_revision.{0}
        FROM package_revision
        WHERE package_revision.state = 'active'
        AND package_revision.current = True
        ORDER BY package_revision.{0}
        )
    """.format('id' if api == 2 else 'name'))
    return list(result)[0][0]

@tobes
Copy link
Contributor

tobes commented Jun 27, 2013

@wardi The first change is now in master and I've requested it is accepted in 2.1 but It may have missed the merge window we will see.

@kindly What's your view of the raw SQL approach?

@amercader
Copy link
Member

@tobes what is in master and what is that should go into 2.1? Is it the current 4 commits or just the first one?

@wardi Not a big deal, but if you can prepend the issue number on the commit message it makes life easier when managing different releases, see https://github.com/okfn/ckan/blob/master/CONTRIBUTING.rst#commit-messages

@tobes
Copy link
Contributor

tobes commented Jun 28, 2013

@amercader it is the 4 commits here - thanks

@joetsoi
Copy link
Contributor

joetsoi commented Jun 30, 2013

@tobes a quick grep looks like we use offset/limit in the other get action functions, the datastore and bunch of other places.

@wardi
Copy link
Contributor Author

wardi commented Jul 1, 2013

@joetsoi no question. My argument is that for new interfaces we should build something that's easier to use correctly.

offset/limit has an assumption that the data isn't changing from one call to the next. That's an assumption that often isn't true, however, and can lead to code with bugs that are subtle and hard to reproduce. a parameter like "following" makes the caller tell the API which was the last record returned, so that the API can resume sending entries in a sensible way.

For this API imagine trying to write code that will list all the package IDs while only retrieving 100 at a time using offset/limit parameters. If you want it to be reliable (not missing any ids that existed prior to the first call) then you need logic that will intentionally overlap previous results and rewind if the records it gets back don't cover ones it has already seen -- not something I would expect most users to program. With limit/following there's just one obvious and simple way to get all the ids.

@ghost ghost assigned kindly Jul 23, 2013
kindly added a commit that referenced this pull request Aug 1, 2013
speed up package_list action by only retrieving column we need (not whole package) from db
@kindly kindly merged commit 7292794 into ckan:master Aug 1, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants