Add paging support for URL/HTML #464

Closed
TobiasNx opened this issue Aug 29, 2022 · 15 comments
Comments

@TobiasNx
Contributor

We can only open single URLs but cannot page through multiple URLs.
This is connected to #460 since it is relevant for GET and POST with APIs.

@dr0i
Member

dr0i commented Sep 5, 2022

To be discussed with @fsteeg and @blackwinter:
we could enhance HttpOpener, but I think a module of its own would be better, like:

$urlToRetrieveAllTotalItemsAsJson
| open-http
| decode-json(recordpath="pagination.totalItems")
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;

Identified with @TobiasNx at least 3 different types of pagination:

  1. with "total items" and a "size" (see example above)
  2. using $urlWithPageParameter, where the value of "page" is incremented until a 404 appears
  3. using a (next/resumption) token

(1) and (2) are already demanded by oersi.
The httpPager would be capable of handling all 3 types of paging. Using (1) would be independent of the serialisation of the API response (JSON/XML/text). (3) could maybe also be used as a naive OAI-PMH harvester.
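
A minimal sketch of the three pagination types listed above, written in Python rather than Flux. All function names are hypothetical, and `fetch` is an assumed stand-in for whatever performs the actual HTTP request; this is not an existing Metafacture module.

```python
from typing import Any, Callable, Iterator, Optional, Tuple


def page_by_total(fetch: Callable[[int, int], Any],
                  total_items: int, size: int) -> Iterator[Any]:
    """Type 1: the total is known up front, so every offset can be computed."""
    for offset in range(0, total_items, size):
        yield fetch(offset, size)


def page_until_missing(fetch: Callable[[int], Optional[Any]],
                       first_page: int = 1) -> Iterator[Any]:
    """Type 2: increment the page number until the server reports 'not found'.

    Assumption: `fetch` returns None for a missing page (e.g. on HTTP 404).
    """
    page = first_page
    while True:
        result = fetch(page)
        if result is None:
            return
        yield result
        page += 1


def page_by_token(fetch: Callable[[Optional[str]], Tuple[Any, Optional[str]]]
                  ) -> Iterator[Any]:
    """Type 3: each response carries the token for the next request.

    Assumption: `fetch` returns (result, next_token); a None token ends the loop.
    """
    token = None
    while True:
        result, token = fetch(token)
        yield result
        if token is None:
            return
```

Only the stop condition and the way the next request is derived differ between the three; the downstream handling of each page could stay the same.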

@fsteeg
Member

fsteeg commented Sep 6, 2022

I don't quite understand how your example(s) would work. I think I'd need a concrete API / URL / input data (for the different kinds of pagination).

Perhaps starting with the OERSI example that I used in my first / WIP approach to pagination support: https://gitlab.com/oersi/oersi-etl/-/commit/60ba8d70b1dd728006cd07485b632ecef961e98d

Did you see the Swissbib approach: https://github.com/linked-swissbib/swissbib-metafacture-commands#open-multi-http

And I had this general thought of using URL globbing for paging somehow: https://everything.curl.dev/cmdline/globbing

@fsteeg fsteeg assigned dr0i and unassigned fsteeg Sep 6, 2022
@dr0i
Member

dr0i commented Sep 6, 2022

re swissbib: it's just a very sad way to do it, because you have to know how much data there is to begin with and set these values via parameters. My approach would parse the actual totalItems and pass them to the pager.

Your approach is stuck to Json as result. Also it seems to make use of sitemaps to get all resources - this would be another type "4" (as I understand we want to switch from sitemap to using an API)

re globbing: this is type (2).

Concrete example for using an API (type (1) ) (enabled by #463):

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-", method="post")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;
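
As a worked example of what type (1) implies for the numbers above (1348 total items, page size 100), the following sketch computes the skipCount values a pager would have to request. The helper name is hypothetical.

```python
from typing import List


def skip_counts(total_items: int, size: int) -> List[int]:
    """Offsets needed to cover total_items in pages of the given size."""
    return list(range(0, total_items, size))


total_items, size = 1348, 100
offsets = skip_counts(total_items, size)

print(len(offsets))               # 14 requests
print(offsets[-1])                # 1300 is the last skipCount
print(total_items - offsets[-1])  # 48 items on the last page
```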

@TobiasNx
Contributor Author

TobiasNx commented Sep 6, 2022

One thing I think is difficult is that the input within MF for page-http would vary for every paging type. This would make this module complex in itself.

@blackwinter
Member

I don't think I have much to add at this point. Except maybe that type 3 could also be useful for Elasticsearch pagination (search_after and/or scroll).

@blackwinter blackwinter removed their assignment Sep 9, 2022
fsteeg added a commit that referenced this issue Sep 22, 2022
Reads sitemap from URL, sends each `loc` URL to the receiver.

e.g. `"https://hoou.de/sitemap.xml" | read-sitemap | open-http ...`
in a Flux workflow to process every document linked in the sitemap.

Supports paging via `from=` query string parameter (see #464)

See:

https://en.wikipedia.org/wiki/Sitemaps
https://gitlab.com/oersi/oersi-etl/-/issues/4
https://gitlab.com/oersi/oersi-etl/-/issues/17
@fsteeg
Member

fsteeg commented Sep 22, 2022

Your approach is stuck to Json as result. Also it seems to make use of sitemaps to get all resources - this would be another type "4" (as I understand we want to switch from sitemap to using an API)

You mean the WIP approach I mentioned above? That actually removes using the sitemap to get the list of resources.

@fsteeg
Member

fsteeg commented Sep 22, 2022

I'm not sure if it makes sense to approach this in a generic way. It all depends on what API we actually want to talk to. Maybe we should start with implementing paging for specific use cases instead, and then try to generalize that.

@fsteeg
Member

fsteeg commented Sep 23, 2022

Discussed in our planning meeting: I will try to implement the approach described by @dr0i in #464 (comment) for edusharing APIs (4 workflows) in OERSI (e.g. ZOERR, see https://gitlab.com/oersi/oersi-etl/-/issues/64).

@fsteeg fsteeg assigned fsteeg and unassigned dr0i Sep 23, 2022
@fsteeg
Member

fsteeg commented Oct 28, 2022

I will try to implement the approach described by @dr0i in #464 (comment) for edusharing APIs (4 workflows) in OERSI

To recall, this was the sketched approach from above:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;

One problem is that conceptually, this part has to be called repeatedly:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")

With different skipCount= values, but always passing the body to POST. So further down, when we want to do something like page-http, we no longer have the {"criterias": [], "facettes": []} body to POST. We also need to process the response of open-http to pass each record on the page, which would work with decode-json and recordPath, but then we'd kind of have to go back to call again, with incremented skipCount=.

To solve this in a generic way as envisioned here, I think we would need some kind of loop construct in the Flux. But that whole setup would become very complex. I think a specific module (or extension of HttpOpener) would be a better way to go. I was going to start with a very specific EdusharingReader, but luckily @TobiasNx and @acka47 stopped me there, asking for something a little more generic, which became a JsonApiReader, and seems to be a good balance of specific and generic, see https://gitlab.com/oersi/oersi-etl/-/merge_requests/227/diffs.

For proceeding here, I suggest we close this issue and keep using that new module for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself.
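
To illustrate the loop described above (same POST body on every call, incremented skipCount, records passed on page by page), here is a minimal Python sketch. It is not the JsonApiReader from the linked merge request. The function name, the `post` callable, the URL template placeholders and the `record_path` key are all assumptions made for illustration, and the loop stops when a page comes back empty.

```python
import json
from typing import Callable, Iterator


def read_paged(post: Callable[[str, str], str], url_template: str, body: str,
               page_size: int, record_path: str) -> Iterator[dict]:
    """Yield every record of a paged JSON API.

    `post(url, body)` is an assumed helper returning the response text.
    `url_template` is assumed to contain {skip} and {size} placeholders.
    `record_path` is assumed to be a top-level key holding the page's records.
    """
    skip = 0
    while True:
        url = url_template.format(skip=skip, size=page_size)
        # The same request body is sent on every call; only the URL changes.
        response = json.loads(post(url, body))
        records = response.get(record_path, [])
        if not records:
            return  # an empty page ends the paging
        yield from records
        skip += page_size
```

Keeping the body and the loop in one place is what avoids having to specify the request body twice in the workflow.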

@fsteeg fsteeg assigned acka47, dr0i and TobiasNx and unassigned fsteeg and acka47 Oct 28, 2022
@dr0i
Member

dr0i commented Oct 28, 2022

Just had a quick glance at this - the idea is to get all items via invoking open-http and decode-json (now around 1360) and then do it like swissbib: repeatedly getting the document in page-http(url="$baseUrlWithPaginationParameter", size="100") up to this 1360. The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.
Does this not work?

@dr0i dr0i assigned fsteeg and unassigned dr0i Oct 28, 2022
@fsteeg
Member

fsteeg commented Nov 2, 2022

the idea is to get all items via invoking open-http and decode-json (now around 1360)

I'm not sure I understand. We can't get them all with a single call, that's why we need the paging. (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)

The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.

If we do the repeated calls in page-http, we need the JSON body from the first line, since that is part of the API request. (At that position, we're getting the totalItems number in your example instead.) We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways). I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

@fsteeg fsteeg removed their assignment Nov 2, 2022
@dr0i
Member

dr0i commented Nov 8, 2022

(Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)

Yes, I mean the number. Relying on empty results may be a very good idea - but maybe not, there may be a non-empty result containing valid JSON saying "nothing here". Idk.

We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways).

You could (re)use a variable: `baseUrlWithPaginationParameter='https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-' -d '{"criterias": [], "facettes": []}'`

... | page-http(url="$baseUrlWithPaginationParameter", size="100", setSizeValueForParameter="skipCount")
The page-http would use the Java class corresponding to open-http (if that's possible), flushing results to downstream modules until skipCount is greater than the input passed to | page-http.

I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

The idea is to be more generic, independent of a JSON-API or XML or whatever (even http-headers, having the proper module).
BUT maybe my thinking has a flaw and this generic approach is too complex in itself (commands piping into commands piping into ... + using a variable + setting proper parameters) so this is not a viable approach (even if wrappers (new "commands") could be programmed as an abbreviation for the different APIs).
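
A minimal Python sketch of the parameter handling such a page-http would need: take the base URL, overwrite one named query parameter (the role of the proposed setSizeValueForParameter option) and produce one URL per page until the total is reached. Function names and the example host are hypothetical; only Python standard library calls are used.

```python
from typing import Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_value(url: str, name: str, value: int) -> str:
    """Return url with query parameter `name` set to `value`, keeping the others."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == name for key, _ in pairs):
        # Overwrite in place so the parameter order stays unchanged.
        pairs = [(k, str(value) if k == name else v) for k, v in pairs]
    else:
        pairs.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def page_urls(base_url: str, parameter: str,
              total_items: int, size: int) -> Iterator[str]:
    """One URL per page; stops before the offset reaches total_items."""
    for offset in range(0, total_items, size):
        yield with_query_value(base_url, parameter, offset)
```

The response format plays no role here, which is the point of the generic approach: only the URL and the number passed in from upstream are needed.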

@fsteeg
Member

fsteeg commented Nov 9, 2022

Right, thanks for explaining, now I see how that could be done in a reasonable way. We should keep that in mind for when we have other pagination use cases; this could help avoid the need for new modules and duplication.

@dr0i dr0i assigned dr0i and TobiasNx and unassigned TobiasNx and dr0i Nov 14, 2022
@fsteeg
Member

fsteeg commented Nov 15, 2022

Quoting myself from #464 (comment):

For proceeding here, I suggest we close this issue and keep using that new module (JsonApiReader) for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself. Edit: Or by implementing generic paging as discussed above.

@TobiasNx, since you opened this issue, could you close if you agree?

@TobiasNx
Contributor Author

I am okay with that!
