
Paging improvements #23

Closed
cholmes opened this issue Dec 13, 2017 · 37 comments

@cholmes
Member

cholmes commented Dec 13, 2017

So I believe paging as it stands right now is a bit problematic. Hopefully a developer who has actually worked with paging on RESTful APIs with Elasticsearch backends will weigh in and explain more.

But in Planet's data API we don't have 'startIndex' or anything like that, since enabling clients to request arbitrary locations in an index adds a lot of complexity. Indeed, if the index changes because data is added or deleted, that can throw off the results.

Planet solves this by requiring users to create a saved search, or there's a shortcut 'quicksearch' to get results right away. But both create an index on the server side that is fixed in time, which the user can page through. DigitalGlobe actually doesn't enable paging on their Catalog results, and the underlying vector data store seems to require a similar creation of a paging ID - https://gbdxdocs.digitalglobe.com/v1/docs/vs-retrieve-page-of-vector-items

I don't think we need to specify that level of paging / search in the core of the spec (though possibly an extension / best practice). But I think we should remove 'startIndex', in favor of just having results supply their 'next' link, which can be generated by the service. Planet's 'next' links look like https://api.planet.com/data/v1/searches/f8abab5007a14b31b5ccfb8a3d3f02d1/results?_page=eyJxdWVyeV9wYXJhbXMiOiB7fSwgInNvcnRfcHJldiI6IGZhbHNlLCAicGFnZV9zaXplIjogMjUwLCAic29ydF9ieSI6ICJwdWJ

(with a string that is actually 3 times that long).

In the STAC spec we just have an optional nextPageToken, and then the 'links' section uses that token.

Services wouldn't be required to use an obscure string for paging through results. Though we should make a recommendation on whether services are expected to return consistent results, or whether it is OK for them to return fewer than the 'count'. See https://developers.facebook.com/blog/post/478/ for an API that doesn't guarantee the number of results.

Additional discussion on paging best practices at https://stackoverflow.com/questions/13872273/api-pagination-best-practices - we should perhaps recommend some default ordering of results that paging can be driven off of.

The other thing that seems a bit arbitrary is the 10000 maximum limit on the 'count' parameter. And I couldn't figure out from the spec whether that is something implementations can change, or whether it is hard-coded. At Planet we enable dynamic setting of the page size, but the limit is 250 results per page. And the operations team resists any increases to that, as it introduces more complexity to support.

@pvretano
Contributor

Review this from WFS 2.0: http://docs.opengeospatial.org/is/09-025r2/09-025r2.html#76 Sounds pretty close to what you are describing no? ... ignoring all the references to capabilities documents which would no longer be relevant in WFS 3.0 anyway.

WFS 2.0 supported two types of paging. The first used startIndex and count and was there for backward compatibility with WFS 1.X. The second method used count and prev/next links in the response.

I always preferred the latter, so I would be good with tossing the legacy and sticking with count/prev/next.

The draft says that the 10000 value is only an example and may be changed (see Req 17 in the core).

@cholmes
Member Author

cholmes commented Dec 13, 2017

Yeah, that's definitely right in line. Count & prev/next links. Would be happy to toss the legacy and do the count/prev/next.

Ah, and I missed that the 10000 is an example. I think that is the challenge with the current format: how to get across what can be tweaked and what must be there. Though I've got no great ideas on how to do it better, as the current spec does a decent job, particularly with the sample OpenAPI document.

@cportele
Member

Tossing the startIndex legacy would be fine with me, too. So we would "just" keep count and it would be up to the implementation how it creates the prev/next links (which could use a parameter like startIndex).

To make it clearer that the 10 and 10000 are just examples, maybe we should add an explicit sentence in the count requirement that default and maximum count values have to be declared in the API definition, but the values are determined by the server.

@pvretano
Contributor

That is correct. The prev/next links would be opaque. An implementation could generate whatever links it wants including using the startIndex parameter (which would now be reclassified as a vendor-specific parameter).
About the default and max count values, yes we should add some clarification text.

@cholmes
Member Author

cholmes commented Dec 15, 2017

Agreed. And yeah, an explicit sentence on the server determining values makes good sense.

@rcoup

rcoup commented Jan 19, 2018

About the default and max count values, yes we should add some clarification text.

I think count should be a suggestion to the server ("I only want 10", or "I'd like 1M"), but at the end of the day it can send back what it likes. You can ask for 1M but I'll only send 250, or you can ask for 10 but I might send you 100. As long as the next link is present you can eventually get the number you're after.

Should prev be optional?

@cholmes
Member Author

cholmes commented Jan 19, 2018

Agreed on both counts. Better to return records, even fewer than asked for, than to return an error.

And I'd missed 'prev' in there - definitely should make it optional. We don't do it in the Planet data API, as it's definitely more complex to support it, and we didn't see any win from doing it.

Maybe we should put that one in its own ticket?

@pvretano
Contributor

pvretano commented Jan 20, 2018

By "suggestion" do we mean a maximum per fetch? count=100 means I want 100, but the server can return any number up to 100 ... and provide a next link for the next set of features.
As for prev being optional ... hmmm ... I don't think that I am opposed to that.

@rcoup

rcoup commented Jan 24, 2018

I think it probably boils down to:

  • the client may request a count it's interested in
  • the server likely has a default value for count, and a maximum limit
  • if the server has any more results available than it returns (whether the number it returns is less than, equal to, or greater than either the requested/limited/default count) then the server should include a link to the next set of results.

So:

  • If you ask for 50, you might get 50 and a next link if there are more. (requested)
  • If you ask for 50000, you might get 1000 and a next link if there are more (server-limited)
  • If you ask for 1, you might get 100 and a next link if there are more. (ignored your request)
  • If you don't specify a count, you might get 100 and a next link if there are more. (default)
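The client-side consequence of these rules is that count is advisory and the client just follows next links until none remain. A minimal sketch, assuming a GeoJSON-style response with a "links" array; the `fetch` callable (GET plus JSON decode) is hypothetical:

```python
def collect_all(fetch, first_url):
    """Gather every feature by following next links until none remain."""
    features = []
    url = first_url
    while url is not None:
        page = fetch(url)  # hypothetical: performs the GET and decodes JSON
        features.extend(page["features"])
        # the server may have returned fewer (or a different number of) items
        # than requested; all that matters is whether a "next" link is present
        url = next((link["href"] for link in page.get("links", [])
                    if link.get("rel") == "next"), None)
    return features
```

The client never needs to know the server's default or maximum page size; it only inspects the link relations.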

@pvretano
Contributor

Don't think that the server should ever return more than requested ... if the client requests 1, it gets 1 or 0 and possibly a next link.

@cportele
Member

I agree with making prev optional and count as a hint with the constraint that the server must not return more entities than the count value.

@cportele cportele added this to the Part 1, First Draft milestone Jan 29, 2018
@cportele
Member

To summarize, the conclusion seems to be:

  • Remove startIndex (it could be one example for how to implement next links, but opaque links are fine, too)
  • Clarify that servers may provide prev, but it is always optional (remove from the requirement)
  • For count (reusing @rcoup's summary, with the change discussed above)
    • the client may request a count it is interested in
    • the server likely has a default value for count, and a maximum limit
    • if the server has any more results available than it returns (whether the number it returns is less than or equal to the requested/limited/default count) then the server should include a link to the next set of results.
    • So:
      • If you ask for 50, you will get 0 to 50 and a next link if there are more. (requested)
      • If you ask for 50000, you might get 1000 and a next link if there are more (server-limited)
      • If you don't specify a count, you might get 100 and a next link if there are more. (default)
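A minimal sketch of this server-side interpretation of count (the default and maximum here are examples, not values mandated by the spec, which leaves them to the server):

```python
# Example values; a real server declares its own default and maximum
# in its API definition.
DEFAULT_COUNT = 100
MAX_COUNT = 1000

def effective_count(requested=None):
    """Treat the client's count as a hint, never exceeding it or the maximum."""
    if requested is None:
        return DEFAULT_COUNT                   # no count given: server default
    return max(1, min(requested, MAX_COUNT))   # clamp to [1, MAX_COUNT]
```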

Is everyone ok with this or do we need more discussion?

@jvanulde
Contributor

Okay with me @cportele

@cmheazel
Contributor

cmheazel commented Feb 1, 2018

Okay for the core. But let's not lose sight of Chris's original issues. Some of the paging practices he referenced allow random access to the pages (give me page n). This is a ubiquitous web capability. It is also far more difficult than it looks. Can we leave this issue open, or open a new issue, to track this for an extension to the core?

@cportele
Member

cportele commented Feb 1, 2018

Agreement on 2018-02-01: Implement the proposal above. Random page access can be implemented in the future in an extension, if it turns out that it is a common need.

@aaime
Contributor

aaime commented Mar 7, 2018

@cholmes how is a "saved query" actually saved? I mean, an efficient implementation in my mind would require a versioning backend, otherwise we'd have to dump the whole result without knowing if the user will ever crawl it, which does not seem to be feasible on a public server where everybody can start one or more paging operations.

@rcoup

rcoup commented Mar 8, 2018

@aaime to clarify, there's absolutely nothing wrong with an implementation like next="https://wfs3.example.com/data/?offset=100&count=100". Though continually running that against a database for large resultsets is really inefficient — eg. &offset=500000&count=100 would mean the DB retrieves 500100 results and throws away 500000 of them, and if the dataset changes then results would be missed or repeated.

ElasticSearch implements the Scroll and Search After APIs which offer two alternatives.

For a DB, "Scroll" would map onto named cursors (but requires mapping to a consistent open DB connection, which has other issues).

A simple DB-backed implementation of "Search After" is

  1. /data/?category=trees&order_by=created_at becomes SELECT * FROM mydata WHERE category='trees' ORDER BY created_at, pk LIMIT 100; (primary key is a tie-breaker for sorting)
  2. get the max sort-key values from the returned records (eg. ["2018-03-08", 1554433])
  3. the next link returned is: /data/?query=[base64({query: "category=trees&order_by=created_at", start: ['2018-03-08', 1554433], pageSize: 100})]
  4. when the next link comes back it gets parsed and turned into the SQL: SELECT * FROM mydata WHERE category='trees' AND (created_at, pk) > ('2018-03-08', 1554433) ORDER BY created_at, pk LIMIT 100; (the row-value comparison avoids skipping rows that have a later created_at but a smaller pk)
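The steps above can be sketched roughly as follows (table and column names are illustrative; the base64-JSON token format is an implementation choice, not part of any spec):

```python
import base64
import json

def make_next_token(last_row, page_size):
    """Encode the last row's sort-key values into an opaque next token."""
    state = {"start": [last_row["created_at"], last_row["pk"]],
             "page_size": page_size}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

def sql_for_token(token):
    """Turn a next token back into a parameterized keyset query."""
    state = json.loads(base64.urlsafe_b64decode(token))
    created_at, pk = state["start"]
    # Row-value comparison is the correct keyset condition; a plain
    # "created_at >= X AND pk > Y" would skip rows with a later
    # created_at but a smaller pk.
    sql = ("SELECT * FROM mydata WHERE category = 'trees' "
           "AND (created_at, pk) > (%s, %s) "
           "ORDER BY created_at, pk LIMIT %s")
    return sql, (created_at, pk, state["page_size"])
```

The token is opaque to the client, so the server is free to change the encoding later without breaking anyone.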

https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/ has some other database-driven options too. And if you have a versioned backend then there are other possibilities as well.

@aaime
Contributor

aaime commented Mar 8, 2018

@rcoup thanks for the feedback. Those are great suggestions, easy to implement when one is focusing on publishing a particular data set. Doing the same in something as generic as GeoServer (any data source, any sorting attribute) seems challenging, but who knows, maybe one day ;-)

@cportele cportele mentioned this issue Mar 8, 2018
@cmheazel cmheazel reopened this Sep 25, 2018
@cportele
Member

I looked at range requests a year ago for "paging" (using features as units), but decided that it is not a good fit. I need to check whether I still have any notes from my analysis...

@jampukka
Contributor

Range: items=0-999 is the same as &startIndex=0&count=1000

&startIndex= was deemed inefficient for collections backed by a relational database (see @rcoup comment on March 8th) and problematic for other solutions (see original issue).

Having the paging information in HTTP headers is problematic, since it can't be embedded in links (e.g. the next link). And do we really want to deal with 206 Partial Content responses?
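For illustration, the equivalence noted above is mechanical; a hypothetical server could translate the custom range unit like this (the "items" unit and the parameter mapping are assumptions for this sketch, not part of HTTP or the draft spec):

```python
import re

def range_to_paging(header):
    """Map a hypothetical 'Range: items=first-last' value to paging params."""
    m = re.fullmatch(r"items=(\d+)-(\d+)", header.strip())
    if m is None:
        raise ValueError("unsupported range: " + header)
    first, last = int(m.group(1)), int(m.group(2))
    # an inclusive range of items maps directly onto startIndex/count
    return {"startIndex": first, "count": last - first + 1}
```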

@cportele
Member

Yes, there were a few reasons why I did not propose Range headers for "paging":

  • We need to support some non-header mechanism anyhow to be able to mint URIs (links) to pages / partial responses.
  • A request without a range header cannot return 206, so there is no way that a server could have a default limit. I.e. any request to a collection without a range header would have to return all features and it is important to enable servers to have a default page limit.
  • I do not see a real need for multipart responses, but servers would have to support requests that lead to 206 multipart responses.
  • Developers do not seem to expect such an approach and since it uses a custom range unit anyhow (i.e. not bytes), it is unclear how much value it brings. Probably consistent with this: I have not seen much of range headers in Web APIs for paging.

@cmheazel - What would be the practical benefits for using range headers in your view? And how to address the concerns above?

@aaime
Contributor

aaime commented Sep 25, 2018

The question came back up (among other things) because during CITE testing we found out that, in general, it's hard to test for proper implementation of paging. Since servers are not required to expose a specific data set, the test suite can only start from the first page and go all the way to the end to verify all possible link layouts. This may lead to tests taking hours or more.

<rant>
But I don't think this is the main issue, I'm more concerned about usefulness. This is a typical paging UI layout

To support it, one needs to have random paging access. In my experience, this is one of the most common WFS use cases, side by side with visual display of data (which is arguably better taken care of by vector tiles).
Some will (correctly) argue this could be taken care of by an extension, and they are probably right, but so far it seems that WFS3 is by design limited to the point that any of our current WFS2 usage will require a WFS3 with a few extensions. Maybe that's OK; one could draw a parallel between WFS3 and a base class in object-oriented programming, providing a solid base but not actually being used as-is.
</rant>

@rcoup

rcoup commented Sep 25, 2018

I agree with @cportele's assessment here. It's something that seems applicable until you dive deeper; then the abstraction falls apart quickly.

@rcoup

rcoup commented Sep 25, 2018

But I don't think this is the main issue, I'm more concerned about usefulness. This is a typical paging UI layout
To support it, one needs to have random paging access.

@aaime but there's zero UX benefit to the paging here? It's not like I know what's on Page 4, or that page 5 is the "right" page for my answer. Next/Previous buttons and maybe an (estimated) total result count would achieve the same thing IME.

@aaime
Contributor

aaime commented Sep 25, 2018

Next and Previous help with random looks at the data when you have no idea of the data layout, yes.
However, in the current draft of the standard the prev link is not required; see "Permission 4", which states "A response to a next link MAY include a prev link to the resource that included the next link."
And there is no link to the last page either. So imagine that table with only a "first page" and "next" links.

About "it's not like I know what's on Page 4", I have a different experience. I've been working with municipalities where these paging UIs are used a lot; they normally have a default sorting, and after a while people develop a sense of where the stuff they are looking for sits in the paging structure. We would use a filter to locate what we want, but not everybody does (and the WFS 3 basic filtering is not particularly useful to start with).

Again, I don't see WFS 3 being particularly useful without extensions, and as said, this may just be how it's meant to be: a base kit on which one puts extensions to make it useful for a particular use case.

@cportele
Member

For simpler use cases WFS 3 Core is and should be useful as it is.

For more advanced requirements / deployments extensions will be needed / essential. This includes several of the WFS 2 capabilities (which were often also in additional conformance classes).

@jampukka
Contributor

In WFS3 Core there's no way to specify the sorting order, so paging is only really useful for "streaming" through the response in count-sized chunks. Access to previous page(s) might be easier to implement in the UI application (you already have the information, as there's no way to skip pages with sequential forward-only next links).

When a sorting extension is added, paging becomes much more meaningful. Then you can access the last page by flipping the sorting order and accessing the first page, so I'd still vote no on a last link.

@cportele
Member

cportele commented Mar 5, 2019

Do we need to keep this issue "re-opened"? I do not see any proposal emerging for a change to the specification.

@cholmes
Member Author

cholmes commented Mar 5, 2019

I'd prefer to have the 'extension' for startIndex be published and blessed by the WFS group before we close this issue. I've liked the idea of a simple core, but I think we need to prioritize publishing the set of core extensions that gets WFS3 to rough 'parity' with the core of WFS2 as soon as possible, to cover many of its core use cases.

I've been trying to push WFS 3 in STAC, and we hit this paging issue. I tried to explain that the 'WFS 3 way' was to recommend startIndex and that we should adopt the same, but then I had no place to point except this issue. It looks like there's actually a folder for extensions now, with one in it for CRS (https://github.com/opengeospatial/WFS_FES/tree/master/extensions), which is a great improvement. Though there's a ton of boilerplate in that crs directory; is there a way to cut that down?

@cportele
Member

cportele commented Mar 5, 2019

@cholmes - For startIndex we have #75.

Regarding the extensions: yes, I agree, we need a short template for extensions. The current CRS draft has quite a bit of left-over text that does not belong in this extension.

@alpha-beta-soup

alpha-beta-soup commented Mar 21, 2019

Without explicit sorting, does pagination have much meaning? Pagination of an unsorted (or at least, not-explicitly-sorted) result seems like a useful way to break up a request that would otherwise be too large for client or server, but random page access to unsorted records doesn't seem like a use case for anything apart from achieving these small sequential payloads. A client cannot sort without access to the full collection.

Given that sorting is not mentioned in the Core (I think; I've only just read the specification), is there a contract that a particular page has coherent content with its linked adjacent rel pages? (For the moment assuming no race-condition changes to the underlying dataset — which itself makes pagination problematic due to the potential presence of duplicates, etc.) Hypothetically (ignoring a caching layer), if you could retrieve an entire collection in a single request, must the result be identically sorted each time?

Sorting features of a collection by time (if temporal) and then secondarily by primary key seems like a suitable implementation for some servers and datasets, but obviously primary keys are not necessarily in any kind of semantic and/or alpha-numeric order so ideally function only as tie-breakers when sorting on some other property. Is time a default sort condition? And if so, what of features without temporal information: is it determined by their name? What of features with durations?

Being able to bypass a server's maximum limit #152 would be one workaround to the problem of pages omitting/duplicating features on different pages: a result is at least internally consistent at time of request. But it also has its own issues.

Another solution might be the ability to return a list of all pages upfront, thereby giving a client the opportunity to request pages in parallel rather than sequentially. This has advantages to the client beyond reducing the probability of data in a collection changing while paginating: reduced client-side latency, and UI advantages when aiming to render a First Previous 2 3 Current 5 6 Next Last pagination UI.

Hypothetically, pages could include some information about their range (e.g. temporal extent if relevant), which would provide additional benefits to both human and computer interaction. For large collections this is still probably not feasible, since it may not be possible to efficiently compute where page boundaries lie in a sorted collection beyond previous/next relations. However, given that a collection's feature count is known at request time, as is the page limit, a simple list of all pages seems feasible, though it might fall into the realm of a client optimisation if pages can be reliably constructed using standard query parameters. (Which the existing spec says is not mandated.)
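As a rough sketch of that idea, assuming a server that does happen to expose startIndex/count-style parameters (the URL shape is hypothetical, not something the spec mandates):

```python
import math

def page_links(base_url, total, limit):
    """Compute every page link upfront from the matched count and page limit."""
    n_pages = max(1, math.ceil(total / limit))
    return [f"{base_url}?startIndex={i * limit}&count={limit}"
            for i in range(n_pages)]
```

A client with this list can fetch all pages in parallel instead of walking next links sequentially.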

Pages themselves could even take on an explicit spatial property, perhaps if pages are organised (features are sorted) into some kind of tessellated grid, like DGGS—at this point perhaps a client would prefer vector tiles.

These are more rambling and obvious thoughts than coherent contributions. My point is really the same as @jampukka's:

  • Without explicit sorting, pagination is equivalent to chunking a large dataset.

To this I'd add:

  • In an unsorted collection, the client knows that it needs to request all pages before being able to perform many consequent tasks.
  • In a sorted collection, where the sort condition is known, a client may discover that it can stop making requests when its implicit search condition is met.
  • "Random" page access is useful with a sorted collection.
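The second and third points can be sketched as an early-stopping client, assuming pages arrive sorted on a known key (the field names and simulated pages are illustrative):

```python
def features_until(pages, key, stop_value):
    """Yield features from sorted pages, stopping once key exceeds stop_value."""
    for page in pages:
        for feature in page:
            if feature[key] > stop_value:
                return  # sorted, so no later page can match; stop fetching
            yield feature

# simulated sequence of pages, sorted on "t"
pages = [[{"t": 1}, {"t": 2}], [{"t": 3}, {"t": 9}], [{"t": 10}]]
wanted = list(features_until(pages, "t", 3))  # never touches the third page
```

Without a known sort order, the client could never stop early like this; it would have to fetch every page.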

@cportele
Member

cportele commented May 6, 2019

My proposal is to close this issue and transfer the open aspects to other issues. For startIndex we already have #75. For a brief template for an extension we have #193. For sorting we would transfer the relevant input to a new issue.

@cportele
Member

cportele commented May 6, 2019

06-MAY-2019 Conference call: As proposed, close this issue and transfer the open aspects to other issues. For startIndex we already have #75. For a brief template for an extension we have #193. For sorting Clemens will transfer the relevant input to a new issue.

@cportele
Member

cportele commented May 8, 2019

The sorting discussion has been moved to #157. This issue can be closed, all open discussions have separate issues.

@pvgenuchten

pvgenuchten commented Feb 21, 2021

I was discussing this topic with @tomkralidis and wonder if we can reopen this discussion (here or in Common). Ah, sorry, I see the item has moved to #75.
