Add paging support for URL/HTML #464

TobiasNx · 2022-08-29T10:00:14Z

We can only open single URLs but cannot page through multiple URLs.
This is connected to #460 since it is relevant for GET and POST with APIs.

dr0i · 2022-09-05T14:01:09Z

To be discussed with @fsteeg and @blackwinter:
we could enhance HttpOpener, but I think a module of its own would be better, like:

$urlToRetrieveAllTotalItemsAsJson
| open-http
| decode-json(recordpath="pagination.totalItems")
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

Together with @TobiasNx I identified at least 3 different types of pagination:

(1) with "total items" and a "size" (see example above)
(2) using $urlWithPageParameter, where the value of "page" is incremented as long as no 404 appears
(3) using a (next/resumption) token

(1) and (2) are already demanded by oersi.
The httpPager would be capable of handling all 3 types of paging. Type (1) would be independent of the serialisation of the API's response (JSON/XML/text). Type (3) could maybe also be used as a naive OAI-PMH harvester.
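The three paging types listed above can be sketched in Python as follows. This is only an illustration and not existing Metafacture code: the function names and the injected `fetch` callables are assumptions that stand in for the actual HTTP calls.

```python
from typing import Callable, Iterator, Optional, Tuple


def offsets_by_total(total_items: int, size: int) -> Iterator[int]:
    """Type (1): derive the page offsets from a known total and a page size."""
    return iter(range(0, total_items, size))


def pages_until_missing(fetch: Callable[[int], Optional[str]],
                        first_page: int = 1) -> Iterator[str]:
    """Type (2): increment the page number until a page is missing."""
    page = first_page
    while True:
        body = fetch(page)
        if body is None:  # assumption: None stands in for an HTTP 404
            return
        yield body
        page += 1


def pages_by_token(
        fetch: Callable[[Optional[str]], Tuple[str, Optional[str]]]
) -> Iterator[str]:
    """Type (3): follow a next/resumption token until none is returned."""
    token: Optional[str] = None
    while True:
        body, token = fetch(token)
        yield body
        if token is None:
            return
```

Only type (1) needs to know the total up front; types (2) and (3) discover the end of the result set while paging.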

fsteeg · 2022-09-06T07:23:02Z

I don't quite understand how your example(s) would work. I think I'd need a concrete API / URL / input data (for the different kinds of pagination).

Perhaps starting with the OERSI example that I used in my first / WIP approach to pagination support: https://gitlab.com/oersi/oersi-etl/-/commit/60ba8d70b1dd728006cd07485b632ecef961e98d

Did you see the Swissbib approach: https://github.com/linked-swissbib/swissbib-metafacture-commands#open-multi-http

And I had this general thought of using URL globbing for paging somehow: https://everything.curl.dev/cmdline/globbing

dr0i · 2022-09-06T08:37:02Z

re swissbib: that is just a very sad way to do it, because you have to know how much data there is to begin with and set these values via parameters. My approach would parse the actual totalItems and pass them to the pager.

Your approach is tied to JSON as the result format. It also seems to use sitemaps to get all resources, which would be another type (4) (as I understand it, we want to switch from sitemaps to using an API).

re globbing: this is type (2).

Concrete example for using an API (type (1) ) (enabled by #463):

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-", method="post")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

TobiasNx · 2022-09-06T08:40:49Z

One thing I think is difficult is that the input within MF for page-http would vary for every paging type. This would make the module complex in itself.

blackwinter · 2022-09-09T10:27:32Z

I don't think I have much to add at this point. Except maybe that type 3 could also be useful for Elasticsearch pagination (search_after and/or scroll).

Reads a sitemap from a URL and sends each `loc` URL to the receiver, e.g. `"https://hoou.de/sitemap.xml" | read-sitemap | open-http ...` in a Flux workflow to process every document linked in the sitemap. Supports paging via the `from=` query string parameter (see #464). See: https://en.wikipedia.org/wiki/Sitemaps https://gitlab.com/oersi/oersi-etl/-/issues/4 https://gitlab.com/oersi/oersi-etl/-/issues/17

fsteeg · 2022-09-22T15:55:14Z

> Your approach is tied to JSON as the result format. It also seems to use sitemaps to get all resources, which would be another type (4) (as I understand it, we want to switch from sitemaps to using an API).

You mean the WIP approach I mentioned above? That actually removes using the sitemap to get the list of resources.

fsteeg · 2022-09-22T15:59:18Z

I'm not sure if it makes sense to approach this in a generic way. It all depends on what API we actually want to talk to. Maybe we should start with implementing paging for specific use cases instead, and then try to generalize that.

fsteeg · 2022-09-23T11:07:27Z

Discussed in our planning meeting: I will try to implement the approach described by @dr0i in #464 (comment) for edusharing APIs (4 workflows) in OERSI (e.g. ZOERR, see https://gitlab.com/oersi/oersi-etl/-/issues/64).

fsteeg · 2022-10-28T15:20:56Z

> I will try to implement the approach described by @dr0i in #464 (comment) for edusharing APIs (4 workflows) in OERSI

To recall, this was the sketched approach from above:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

One problem is that conceptually, this part has to be called repeatedly:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")

With different skipCount= values, but always passing the same body to POST. So further down, when we want to do something like page-http, we no longer have the {"criterias": [], "facettes": []} body to POST. We also need to process the response of open-http to pass on each record on the page, which would work with decode-json and recordpath, but then we'd have to go back and call open-http again, with an incremented skipCount=.

To solve this in a generic way as envisioned here, I think we would need some kind of loop construct in the Flux. But that whole setup would become very complex. I think a specific module (or an extension of HttpOpener) would be a better way to go. I was going to start with a very specific EdusharingReader, but luckily @TobiasNx and @acka47 stopped me there, asking for something a little more generic. That became a JsonApiReader, which seems to be a good balance between specific and generic, see https://gitlab.com/oersi/oersi-etl/-/merge_requests/227/diffs.
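To illustrate the loop that is hard to express in Flux, here is a minimal Python sketch of a reader that keeps the POST body and repeats the call with an incremented skipCount. It is not the JsonApiReader code from the merge request: the `post` callable, the `records_key` name "nodes", and stopping on an empty page are assumptions made for the example.

```python
import json
from typing import Callable, Dict, Iterator


def read_all(post: Callable[[str, str], str], base_url: str, body: str,
             page_size: int, records_key: str = "nodes") -> Iterator[Dict]:
    """Yield every record of a paged JSON API that is queried via POST.

    `post(url, body)` is a hypothetical callable that returns the response text.
    """
    skip = 0
    while True:
        # The same request body is sent with every page request.
        url = f"{base_url}?maxItems={page_size}&skipCount={skip}"
        records = json.loads(post(url, body)).get(records_key, [])
        if not records:  # assumption: an empty page marks the end
            return
        yield from records
        skip += page_size
```

Because the body and the loop live in one place, the request body only has to be specified once.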

For proceeding here, I suggest we close this issue and keep using that new module for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself.

dr0i · 2022-10-28T15:43:00Z

I just had a quick look at this. The idea is to get all items via invoking open-http and decode-json (now around 1360) and then do it like swissbib: repeatedly getting the document in page-http(url="$baseUrlWithPaginationParmeter", size="100") up to this 1360. The skipCount and the do-loop are handled in page-http, which stops when totalItems is reached.
Does this not work?

fsteeg · 2022-11-02T09:38:35Z

> the idea is to get all items via invoking open-http and decode-json (now around 1360)

I'm not sure I understand. We can't get them all with a single call, that's why we need the paging. (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)

> The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.

If we do the repeated calls in page-http, we need the JSON body from the first line, since that is part of the API request. (At that position, we're getting the totalItems number in your example instead.) We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways). I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

dr0i · 2022-11-08T16:21:42Z

> (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)

Yes, I mean the number. Relying on empty results may be a very good idea, but maybe not: there may be a non-empty result containing valid JSON that says "nothing here". Idk.

> We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways).

You could (re)use a variable: `baseUrlWithPaginationParameter='https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-' -d '{"criterias": [], "facettes": []}'`

... | page-http(url="$baseUrlWithPaginationParameter", size="100", setSizeValueForParameter="skipCount")
The page-http module would use the Java class corresponding to open-http (if that's possible), flushing results to downstream modules until skipCount is greater than the input passed to | page-http.
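As a rough Python illustration of what such a setSizeValueForParameter option would have to do, the following sketch rewrites one query parameter of a base URL for each page. The function name `page_urls` and its parameters are hypothetical and not part of any existing module.

```python
from typing import Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url: str, total_items: int, size: int, param: str) -> Iterator[str]:
    """Yield one URL per page, with `param` set to the current offset."""
    parts = urlsplit(url)
    # Keep all other query parameters and their order untouched.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    for offset in range(0, total_items, size):
        query[param] = str(offset)
        yield urlunsplit(parts._replace(query=urlencode(query)))
```

The total passed in here corresponds to the number that decode-json would extract upstream; the loop ends before the offset reaches that total.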

> I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

The idea is to be more generic, independent of a JSON API or XML or whatever (even HTTP headers, given the proper module).
But maybe my thinking has a flaw and this generic approach is too complex in itself (commands piping into commands piping into ... + using a variable + setting proper parameters), so that it is not a viable approach (even if wrappers (new "commands") could be programmed as abbreviations for the different APIs).

fsteeg · 2022-11-09T11:18:35Z

Right, thanks for explaining, now I see how that could be done in a reasonable way. We should keep that in mind for when we have other pagination use cases; it could help us avoid the need for new modules and duplication.

fsteeg · 2022-11-15T15:31:31Z

Quoting myself from #464 (comment):

> For proceeding here, I suggest we close this issue and keep using that new module (JsonApiReader) for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself. Edit: Or by implementing generic paging as discussed above.

@TobiasNx, since you opened this issue, could you close if you agree?

TobiasNx · 2022-11-18T13:05:40Z

I am okay with that!

dr0i mentioned this issue Aug 29, 2022

Extend HttpOpener. #463

Merged

dr0i assigned blackwinter and fsteeg Sep 5, 2022

dr0i added Enhancement Stream modules labels Sep 5, 2022

fsteeg assigned dr0i and unassigned fsteeg Sep 6, 2022

blackwinter removed their assignment Sep 9, 2022

fsteeg mentioned this issue Sep 22, 2022

Add SitemapReader originally developed in OERSI #469

Draft

fsteeg assigned fsteeg and unassigned dr0i Sep 23, 2022

fsteeg assigned acka47, dr0i and TobiasNx and unassigned fsteeg and acka47 Oct 28, 2022

dr0i assigned fsteeg and unassigned dr0i Oct 28, 2022

fsteeg removed their assignment Nov 2, 2022

dr0i assigned dr0i and TobiasNx and unassigned TobiasNx and dr0i Nov 14, 2022

TobiasNx closed this as completed Nov 18, 2022
