Add paging support for URL/HTML #464
To be discussed with @fsteeg and @blackwinter:
Identified with @TobiasNx at least 3 different types of pagination:
(1) and (2) are already required by OERSI. |
I don't quite understand how your example(s) would work. I think I'd need a concrete API / URL / input data (for the different kinds of pagination). Perhaps starting with the OERSI example that I used in my first / WIP approach to pagination support: https://gitlab.com/oersi/oersi-etl/-/commit/60ba8d70b1dd728006cd07485b632ecef961e98d Did you see the Swissbib approach: https://github.com/linked-swissbib/swissbib-metafacture-commands#open-multi-http And I had this general thought of using URL globbing for paging somehow: https://everything.curl.dev/cmdline/globbing |
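To make the globbing idea a bit more concrete, here is a minimal sketch of expanding a curl-style numeric range such as `[0-30:10]` into a list of page URLs. It is only an illustration: the function name `expand_numeric_range`, the `example.org` URL and the `from` parameter value are assumptions, it handles only the first numeric range in the template, and it ignores curl's other glob forms (alphabetic ranges, `{a,b}` sets, zero padding).

```python
import re

# Matches "[start-end]" or "[start-end:step]" with non-negative integers.
RANGE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_numeric_range(url_template):
    """Expand the first numeric range in url_template into concrete URLs.

    Simplified illustration of curl-style globbing, not a full implementation.
    """
    match = RANGE.search(url_template)
    if match is None:
        # No range present: the template is already a single URL.
        return [url_template]
    start, end = int(match.group(1)), int(match.group(2))
    step = int(match.group(3) or 1)
    prefix = url_template[:match.start()]
    suffix = url_template[match.end():]
    # The range is inclusive on both ends, as in curl.
    return [prefix + str(n) + suffix for n in range(start, end + 1, step)]


# Hypothetical paged endpoint, for illustration only.
urls = expand_numeric_range("https://example.org/api?from=[0-30:10]")
```

The obvious limitation is the one raised for the Swissbib approach: the upper bound has to be known before the first request.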
re swissbib: that is a rather sad way to do it, because you have to know how much data there is to begin with and set these values via parameters. My approach would parse the actual totalItems and pass them to the pager. Your approach is tied to JSON as the result format. Also, it seems to make use of sitemaps to get all resources - this would be another type "4" (as I understand it, we want to switch from sitemaps to using an API). re globbing: this is type (2). Concrete example for using an API (type (1)) (enabled by #463):
|
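As a rough illustration of the "parse totalItems and pass it to the pager" idea, the following sketch derives the offsets to request from a total and a page size. It is not the Flux example referred to above; the `totalItems` key in the sample response and the helper name `page_offsets` are assumptions made for this sketch.

```python
import math


def page_offsets(total_items, page_size):
    """Yield the offsets needed to cover total_items in pages of page_size."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # ceil() so that a final, partially filled page is still requested.
    for page in range(math.ceil(total_items / page_size)):
        yield page * page_size


# Hypothetical shape of a first API response; only the total is used here.
first_response = {"totalItems": 25}
offsets = list(page_offsets(first_response["totalItems"], 10))
```

Each offset would then be fed into whatever parameter the API uses for paging.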
one thing I think is difficult is, that the input within MF for |
I don't think I have much to add at this point. Except maybe that type 3 could also be useful for Elasticsearch pagination ( |
Reads sitemap from URL, sends each `loc` URL to the receiver, e.g. `"https://hoou.de/sitemap.xml" | read-sitemap | open-http ...` in a Flux workflow to process every document linked in the sitemap. Supports paging via `from=` query string parameter (see #464). See: https://en.wikipedia.org/wiki/Sitemaps https://gitlab.com/oersi/oersi-etl/-/issues/4 https://gitlab.com/oersi/oersi-etl/-/issues/17
You mean the WIP approach I mentioned above? That actually removes using the sitemap to get the list of resources. |
I'm not sure if it makes sense to approach this in a generic way. It all depends on what API we actually want to talk to. Maybe we should start with implementing paging for specific use cases instead, and then try to generalize that. |
Discussed in our planning meeting: I will try to implement the approach described by @dr0i in #464 (comment) for edusharing APIs (4 workflows) in OERSI (e.g. ZOERR, see https://gitlab.com/oersi/oersi-etl/-/issues/64). |
To recall, this was the sketched approach from above:
One problem is that conceptually, this part has to be called repeatedly:
With different To solve this in a generic way as envisioned here, I think we would need some kind of loop construct in the Flux. But that whole setup would become very complex. I think a specific module (or extension of HttpOpener) would be a better way to go. I was going to start with a very specific EdusharingReader, but gladly @TobiasNx and @acka47 stopped me there, asking for something a little more generic, which became a JsonApiReader, and seems to be a good balance of specific and generic, see https://gitlab.com/oersi/oersi-etl/-/merge_requests/227/diffs. For proceeding here, I suggest we close this issue and keep using that new module for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself. |
Just had a short glimpse on this - the idea is to get all items via invoking |
I'm not sure I understand. We can't get them all with a single call, that's why we need the paging. (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)
If we do the repeated calls in |
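The "stop when the API returns empty results" strategy could be sketched roughly as below. This is not the OERSI implementation: `fetch_all`, `fake_fetch`, the `nodes` key and the `max_pages` guard are all assumptions for illustration. The stop condition is passed in as a function because what counts as "empty" depends on the API.

```python
def fetch_all(fetch_page, page_size, is_done=lambda response: not response,
              max_pages=1000):
    """Request pages until is_done(response) is true; return all responses.

    fetch_page(offset, page_size) stands in for the actual HTTP call.
    max_pages is a safety guard against a stop condition that never fires.
    """
    responses = []
    offset = 0
    for _ in range(max_pages):
        response = fetch_page(offset, page_size)
        if is_done(response):
            break
        responses.append(response)
        offset += page_size
    return responses


# Stand-in for a real API: 25 records, returned under a hypothetical "nodes" key.
DATA = list(range(25))


def fake_fetch(offset, page_size):
    return {"nodes": DATA[offset:offset + page_size]}


# The last response is a non-empty dict with an empty list inside,
# so the stop condition has to look at the "nodes" list itself.
pages = fetch_all(fake_fetch, 10, is_done=lambda r: not r["nodes"])
```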
Yes, I mean the number. Relying on empty results may be a very good idea - but maybe not: there may be a non-empty result containing valid JSON that says "nothing here". Idk.
You could (re)use a variable: `baseUrlWithPaginationParameter='https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-' -d '{"criterias": [], "facettes": []}'`
The idea is to be more generic, independent of a JSON-API or XML or whatever (even http-headers, having the proper module). |
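Reusing a base URL that already carries the pagination parameter could look roughly like this: the sketch rewrites the `skipCount` value while leaving the rest of the query string untouched. The helper name `with_skip_count` is an assumption, and the POST body from the example above is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_skip_count(url, skip_count, param="skipCount"):
    """Return url with the value of the paging parameter replaced.

    If the parameter is not present in the URL, the URL is returned unchanged.
    """
    parts = urlsplit(url)
    query = [
        (key, str(skip_count) if key == param else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


base = ("https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/"
        "-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-")
next_url = with_skip_count(base, 10)
```

The same function works for any response format, since it only touches the request side.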
Right, thanks for explaining, now I see how that could be done in a reasonable way. We should keep that in mind for when we have other pagination use cases, this could help avoiding the need for new modules and duplication. |
Quoting myself from #464 (comment):
@TobiasNx, since you opened this issue, could you close if you agree? |
I am okay with that! |
We can only open single URLs but cannot page through multiple URLs.
This is connected to #460 since it is relevant for GET and POST with APIs.