How to fetch a single 'page' of search results? #22

Open
inactivist opened this Issue Mar 31, 2013 · 2 comments

I'm sure this is a dumb question, but I'm trying to figure out how to enumerate only the first 'page' of results from a search -- I want to perform one API query (hitting the server only once) then enumerate the results.

For example, the search.py demo enumerates the full results of a query, even if I pass in page=1 and pagesize=30. I would have expected a maximum of 30 results. What am I missing?

I'm aware of the 'lazy lists' option, but haven't tried that. Is that the right solution?

Edit: Here's my modified search.py tweaked to search for tags rather than title text. Invoking it with the following command line:

$ python search.py python

results in over 100,000 results being enumerated, and thus hitting the API endpoint repeatedly. Time to fire up the debugger...

    #!/usr/bin/env python
    import sys
    sys.path.append('.')

    import stackexchange
    so = stackexchange.Site(stackexchange.StackOverflow)

    if len(sys.argv) < 2:
        print 'Usage: search.py TERM'
    else:
        term = ' '.join(sys.argv[1:])
        print 'Searching for %s...' % term,
        sys.stdout.flush()

        qs = so.search(tagged=term,
            page=1,
            pagesize=30)

        print '\r--- %d questions tagged with "%s" ---' % (qs.total, term)

        for q in qs:
            print '%8d %s' % (q.id, q.title)

lucjon (Owner) commented Mar 31, 2013

search, and similar methods, return a resultset object, which derives from the tuple class. It has its __iter__ method redefined to automatically fetch subsequent pages in a query. This makes perfect sense in some situations, but rather less in others.

You can access the underlying list of model objects via the .items field, which in this case gives you the first page of results. As the class is a tuple subclass, you can also use slices, but without a proper implementation of __getitem__ these cannot cross page boundaries; with some manipulation of page= and pagesize=, that will probably be manageable. To clarify, page= only specifies the initial page of results fetched, as it is passed directly to the StackExchange API.

I will look to writing a proper __getitem__ implementation to help fix the rather leaky abstraction the resultset class currently exposes, though if you have any opinions as far as a cleaner API is concerned, don't hesitate to respond.
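
To make the behaviour described above concrete, here is a rough, self-contained toy model of it. ToyResultset and fake_api_page are hypothetical stand-ins invented for illustration; they are not the real Py-StackExchange classes, and the real implementation may well differ in detail. The sketch only assumes what is stated above: a tuple subclass holding the fetched page, an .items field, and an __iter__ that goes on to fetch later pages.

```python
# Toy model only: ToyResultset and fake_api_page are hypothetical stand-ins
# for illustration, not the real Py-StackExchange classes or API calls.

REQUESTS = []  # pages requested from the simulated API


def fake_api_page(page, pagesize, total=95):
    """Simulate one API request for a 1-based page of integer 'questions'."""
    REQUESTS.append(page)
    start = (page - 1) * pagesize
    return list(range(start, min(start + pagesize, total)))


class ToyResultset(tuple):
    """A tuple of the fetched page whose iteration runs on into later pages."""

    def __new__(cls, items, page, pagesize):
        self = super().__new__(cls, items)
        self.items = list(items)  # the page that was actually fetched
        self.page = page
        self.pagesize = pagesize
        return self

    def __iter__(self):
        page, items = self.page, self.items
        while items:
            for item in items:
                yield item
            if len(items) < self.pagesize:
                break  # a short page means there is nothing left
            page += 1
            items = fake_api_page(page, self.pagesize)


qs = ToyResultset(fake_api_page(1, 30), page=1, pagesize=30)

first_page = qs.items            # no further requests
sliced = qs[:50]                 # tuple slicing stops at the page boundary
requests_before = list(REQUESTS)

everything = [q for q in qs]     # iteration walks every remaining page
```

In this model, reading .items or slicing touches only the page already in memory, while a plain for loop over the object triggers one simulated request per additional page. Applied to the script in the question, that would mean looping over qs.items rather than qs.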

Thanks for the pointers. I'm digging through StackExchangeResultset now -- told you I hadn't been through your code...

It's confusing (to me, anyway) to pass page and pagesize values to .search() (or other APIs) and yet have enumeration of the results span multiple pages -- it doesn't map directly to the API behavior, which fetches a single page of results. In the use case where a client app wants to hit the API once, additional logic is necessary (or, as you indicate, one can use the .items field).

I can't suggest improvements just yet, but at a minimum the current behavior should be called out in the docs: a FAQ or Wiki page describing the pagination model and how it works 'under the hood' (er... bonnet) could help here. (Yes, reading the code answers all questions, but that's not how most developers want to spend their time.)

I'll be only too happy to contribute to said docs as time (and my understanding) permits.

Edit: I'd suggest that automatic next page fetching during iteration could be an explicit option (explicit being better than implicit) but we don't want to go breaking existing code, do we?
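
One possible shape for such an option, as a sketch only: iter_results, fetch_page and auto_page are invented names, not part of Py-StackExchange, and the fetch function here is a simulated stand-in for an API request. The flag defaults to True so that existing callers would keep the current behaviour, and a caller who wants a single request passes auto_page=False.

```python
# Hypothetical sketch: iter_results, fetch_page and auto_page are invented
# names for illustration, not part of Py-StackExchange.

def iter_results(fetch_page, page=1, pagesize=30, auto_page=True):
    """Yield items starting at `page`; follow later pages only if auto_page.

    fetch_page(page, pagesize) must return the list of items on that page.
    auto_page defaults to True so existing callers keep today's behaviour.
    """
    while True:
        items = fetch_page(page, pagesize)
        for item in items:
            yield item
        if not auto_page or len(items) < pagesize:
            return  # single-page mode, or a short page: stop here
        page += 1


calls = []  # pages requested from the simulated API


def fake_fetch(page, pagesize, total=70):
    """Simulated API request that records which page was asked for."""
    calls.append(page)
    start = (page - 1) * pagesize
    return list(range(start, min(start + pagesize, total)))


one_page = list(iter_results(fake_fetch, auto_page=False))
calls_one_page = list(calls)

del calls[:]
all_pages = list(iter_results(fake_fetch))
```

With auto_page=False the simulated API is hit exactly once; with the default, iteration continues until a short page signals the end of the results.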
