speed up package_list action by only retrieving column we need (not whole package) from db #1042
Conversation
```python
packages = query.all()
return [getattr(p, ref_package_by) for p in packages]
```

```python
from ckan.model.package import package_revision_table
```
This is frowned upon in CKAN:

a) we avoid `from ckan... import ...`, as it leads to circular imports, of which we have too many but are slowly improving;

b) imports should be at the head of the file (unless circular-import hell prevents this).

Additionally, logic functions get `model` from the context. Also, `package_revision_table` should be accessed via `ckan.model`, which has crazy imports due to Pylons. Something like this should work (`model` is already in the function on ln 70):

```python
model = context['model']
package_revision_table = model.package_revision_table
```

What sort of performance improvement did this give, and how much more did the array madness add? One thing to be aware of is our long-term goal of pushing the SQL side of things back into the models rather than the logic functions. What you've done here is good, but just so you know, that is where we want to end up eventually.

You have also got several failing tests, but I'm not sure why.

@kindly what is your view of this PR? It seems sensible to me.
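As a runnable illustration of the context pattern described above (the `FakeModel` class and `some_logic_function` name are hypothetical stand-ins for this sketch, not CKAN's actual code):

```python
class FakeModel:
    # stand-in for ckan.model so this sketch runs anywhere; in CKAN the
    # real attribute is a SQLAlchemy Table and the context carries the
    # real model object
    package_revision_table = 'package_revision'

def some_logic_function(context):
    # pull model from the context instead of a module-level
    # `from ckan.model.package import package_revision_table`,
    # which risks circular imports
    model = context['model']
    package_revision_table = model.package_revision_table
    return package_revision_table

result = some_logic_function({'model': FakeModel()})
print(result)  # -> package_revision
```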
@tobes Seems fine to me. If somebody cleans up the import stuff I can merge.
@wardi do you want to do this? If not, I'm happy to.
@tobes drat. Don't merge this yet; I forgot to cherry-pick the commit that actually makes this work (I think this code is creating an extra list, hence the failing tests). Let me clean this up with the changes you suggested and apply the fix. You mentioned wanting this sort of low-level thing in the model; should I create some methods on the package model instead? e.g. `active_package_ids()` + `active_package_names()`
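For illustration, a minimal runnable sketch of what such a model method might look like. The helper name `active_package_names` follows the suggestion above, but the sqlite3 stand-in table and its columns are hypothetical, not CKAN's actual schema:

```python
import sqlite3

# in-memory stand-in for the package_revision table (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE package_revision
                (name TEXT, state TEXT, current INTEGER)""")
conn.executemany("INSERT INTO package_revision VALUES (?, ?, ?)",
                 [("pkg-a", "active", 1),
                  ("pkg-b", "deleted", 1),   # deleted: excluded
                  ("pkg-c", "active", 0),    # not current: excluded
                  ("pkg-d", "active", 1)])

def active_package_names(conn):
    # the model method: fetch only the single column we need,
    # restricted to current, active revisions, instead of
    # materializing whole package objects
    rows = conn.execute(
        """SELECT name FROM package_revision
           WHERE state = 'active' AND current = 1
           ORDER BY name""")
    return [r[0] for r in rows]

print(active_package_names(conn))  # -> ['pkg-a', 'pkg-d']
```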
@wardi if you can push it back into the model, that would be great (otherwise we will end up doing it much later). Could you do something like…
@tobes this was 2 or 3x faster for us. I never implemented the `array()` trick because I think I would have to drop to raw SQL to do it. Would raw SQL be acceptable if it was hidden in a model method?

I'm not a big fan of limit+offset. How about limit and "following" (i.e. pass the last id/name that was retrieved)? That way ids/names won't be repeated or disappear while making repeated calls to collect all the ids/names.
> limit and "following"

hmm, this might be good. @seanh @kindly @amercader @johnglover @joetsoi @johnmartin any thoughts on this offset/following stuff? Consistency is king.
I think we would try not to use raw SQL unless there was a BIG performance gain; others may agree/disagree.
Here are some numbers. On a CKAN instance with 6k datasets, these are the best times I saw:

```
2013-06-26 13:29:24,005 INFO [ckan.lib.base] /api/action/package_list_original render time 0.479 seconds
```

So, about a 10x speed-up in this case.
Raw SQL with `SELECT ARRAY`, best time:

```
2013-06-26 14:02:01,016 INFO [ckan.lib.base] /api/action/package_list_array render time 0.027 seconds
```

```python
conn = model.Session.connection()
result = conn.execute("""SELECT ARRAY(
    SELECT package_revision.{0}
    FROM package_revision
    WHERE package_revision.state = 'active'
    AND package_revision.current = True
    ORDER BY package_revision.{0}
)
""".format('id' if api == 2 else 'name'))
return list(result)[0][0]
```
@tobes what is in master and what should go into 2.1? Is it the current 4 commits or just the first one?

@wardi Not a big deal, but if you can prepend the issue number to the commit message it makes life easier when managing different releases; see https://github.com/okfn/ckan/blob/master/CONTRIBUTING.rst#commit-messages
@amercader it is the 4 commits here - thanks
@tobes a quick grep suggests we use offset/limit in the other get action functions, the datastore, and a bunch of other places.
@joetsoi no question. My argument is that for new interfaces we should build something that's easier to use correctly.

offset/limit assumes the data isn't changing from one call to the next. That assumption often isn't true, however, and it can lead to code with bugs that are subtle and hard to reproduce. A parameter like "following" makes the caller tell the API which record was returned last, so that the API can resume sending entries in a sensible way.

For this API, imagine trying to write code that lists all the package IDs while only retrieving 100 at a time using offset/limit parameters. If you want it to be reliable (not missing any IDs that existed prior to the first call), you need logic that intentionally overlaps previous results and rewinds if the records that come back don't cover ones it has already seen -- not something I would expect most users to program. With limit/following there's just one obvious and simple way to get all the IDs.
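The limit + "following" (keyset) approach described above can be sketched with an in-memory sqlite3 table standing in for the package table. The function name `fetch_page` and the schema are illustrative only, not CKAN's API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO package VALUES (?)",
                 [(n,) for n in ("alpha", "beta", "delta", "gamma")])

def fetch_page(conn, limit, following=""):
    # resume strictly after the last name the caller saw; unlike
    # offset/limit, rows inserted between calls cannot shift the
    # window and cause repeats or gaps
    rows = conn.execute(
        "SELECT name FROM package WHERE name > ? ORDER BY name LIMIT ?",
        (following, limit))
    return [r[0] for r in rows]

page1 = fetch_page(conn, 2)                       # ['alpha', 'beta']
# a dataset added mid-iteration does not disturb the next page
conn.execute("INSERT INTO package VALUES ('aaa')")
page2 = fetch_page(conn, 2, following=page1[-1])  # ['delta', 'gamma']
print(page1, page2)
```

With offset=2 instead, the same insertion would shift the second page and re-deliver 'beta'; that is the subtle bug the comment above warns about.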
speed up package_list action by only retrieving column we need (not whole package) from db
importing `package_revision_table` here might be frowned on. What sorts of optimizations would you consider acceptable?

It would be even faster if we did some `select array(...)` magic in our query, but that might be taking it a little too far.

edit: performance numbers are in the comments below.