POST end-points for synchronous requests #124

Closed
jerstlouis opened this issue Feb 8, 2021 · 22 comments

jerstlouis commented Feb 8, 2021

Currently, process execution requests are always POSTed to /jobs.
If /jobs becomes a generic OGC API - Common end-point for asynchronous requests, would it make sense to reserve it for asynchronous processing?
While POSTing to /jobs could potentially still be supported for submitting async requests, I propose supporting POST to /processes/{processId} (also currently the end-point used by the Workflows extension).

Either only /processes/{processId} could be supported, or, if the mode is omitted, the end-point you POST to could determine the default mode.
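
To illustrate the two options (a sketch only: the process id, the input, and the exact shape of the execute-request body are hypothetical, and how the target process would be identified when POSTing to /jobs is left out):

    # Current draft: the execute request is POSTed to the jobs end-point
    POST /jobs HTTP/1.1
    Content-Type: application/json

    { "mode": "sync", "inputs": { "message": "hello" } }

    # Proposed: POST the same execute request directly to the process;
    # if "mode" were omitted, the end-point POSTed to could supply the default
    POST /processes/EchoProcess HTTP/1.1
    Content-Type: application/json

    { "mode": "sync", "inputs": { "message": "hello" } }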

jerstlouis changed the title from "POST for synchronous end-points" to "POST end-points for synchronous requests" on Feb 8, 2021
pvretano commented Feb 8, 2021

@jerstlouis why are you under the impression that /jobs is only for asynchronous requests? A job is not asynchronously executed just because you POST it to /jobs.

jerstlouis commented Feb 8, 2021

@pvretano What was discussed at the last Members Meeting Open OAB session, and which I also brought up at the Common SWG meeting, was that /jobs could be a general asynchronous processing end-point, not only for Processes but for other OGC API specifications as well.

So it's from that line of thought that I am suggesting that perhaps /jobs could be reserved for querying and managing job status, and potentially also for submitting jobs as async by default, if POST requests to /jobs are supported as well.

It makes very little sense to POST a synchronous execute-request to /jobs when you are not expecting to create a job but just to receive results right away.

I also think that having /processes/{processId} as the general end-point to POST to, i.e. POSTing directly to the process, would facilitate harmonization with process aliases, e.g. with the Routes API: if your routing process end-point /processes/routingEngine has an alias at /route, then you can POST your execute-request to /route.
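
e.g. (illustrative paths; the same execute-request body would be sent in both cases):

    # POST to the process itself ...
    POST /processes/routingEngine HTTP/1.1
    Content-Type: application/json

    # ... or to its alias
    POST /route HTTP/1.1
    Content-Type: application/json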

pvretano commented Feb 8, 2021

@jerstlouis when we discussed the move from /processes/{processId}/jobs to /jobs in the SWG, asynchronous execution was not part of the discussion. It was simply an example that I presented as one reason why a non-Processes API might want to use a /jobs endpoint.
Why did it make sense to POST a synchronous job to /processes/{processId}/jobs (like we did in Processes up until very recently), but now that we have moved jobs to /jobs it no longer makes sense? What does not make sense is POSTing to two different endpoints (with potentially two different execute schemas) just because you want to execute a process synchronously or asynchronously.
I would also point out that the "mode" key accepts a value of "auto" that lets the server decide whether the job should be executed synchronously or asynchronously, so having two different execution endpoints makes that implementation a little messy.
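
For example (a sketch only; the request body is simplified and the job URI is hypothetical), with "mode" : "auto" the very same POST may be answered either way:

    POST /jobs HTTP/1.1
    Content-Type: application/json

    { "mode": "auto", "response": "document", "inputs": { "message": "hello" } }

    # the server may answer synchronously, returning the results right away ...
    HTTP/1.1 200 OK
    Content-Type: application/json

    # ... or asynchronously, by creating a job to be polled
    HTTP/1.1 201 Created
    Location: https://example.org/jobs/81574318-1eb1-4d7c-af61-4b3fbcf33c4f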

jerstlouis commented Feb 8, 2021

@pvretano I understand it was not addressed specifically in the discussion, but now that it lives outside of the /processes tree I think it screams at you even more. I personally always thought it odd that you'd POST to /processes/{processId}/jobs for a synchronous request when you are not creating a new job (and I find it very confusing from a resource name / methods perspective), and that's something that was inherited from WPS being traditionally very rooted in async processing.

What does not make sense is posting to two different endpoints

Always POST to /processes/{processId} then, and if async, the jobs can be managed from /jobs?

Two different execute schemas

Definitely not, the same one schema :)

Another potential advantage of this proposal is that the async job you POST to /processes/{processId} could possibly create a job that runs and is managed on another server elsewhere, one which implements the /jobs job management functionality.

I would also point out that the "mode" key accepts a value of "auto"

I think that could have worked quite well, meaning "depending on where you POST it to", if we were to decide to support POSTing to both places... But that's a different use case of auto than currently specified, the current one expecting the server to make the decision based on something like results size, right? So I guess not... (side note: since there is an auto, I wish mode could be optional and default to auto -- less is more for clients trying to craft minimal valid execution requests).

pvretano commented Feb 8, 2021

@jerstlouis just because you execute a job synchronously, that does not mean that you don't want the results to persist. This is especially true if the transmission mode is set to reference. Having a jobs endpoint allows the server to consistently and conveniently persist execution results at /jobs/{jobId}/results/{outputId}.
In my implementation, every execution gets a job id (sync or async) and the results can age out or persist until removed (but that is just an extension in my server).
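
For instance (job id and output id hypothetical), a client could later fetch a persisted output of a synchronous execution with:

    GET /jobs/81574318-1eb1-4d7c-af61-4b3fbcf33c4f/results/route HTTP/1.1
    Accept: application/geo+json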

jerstlouis commented Feb 8, 2021

@pvretano ah, well that's a good point.
If I understand the current specs correctly, there is no requirement for the results to persist when executing synchronously with "response" : "raw" and "transmissionMode" : "value". So even if the server can decide to offer the synchronous results at /jobs/{jobId}/results/{outputId}, per the specs a client cannot rely on that being the case.
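
In that mode the exchange is simply (a sketch; the input and the output media type are just examples):

    POST /jobs HTTP/1.1
    Content-Type: application/json

    { "mode": "sync", "response": "raw", "inputs": { "bbox": [ -71.6, 45.3, -71.5, 45.4 ] } }

    HTTP/1.1 200 OK
    Content-Type: image/png

    (the raw output itself -- no job URI is returned)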

In our implementation, for synchronous execution we would not want to expose the persistence of the results, even if internally we could manage some persistence / caching to accelerate future identical requests. This would actually allow pointing to the exact same output for different requests producing the same results. In general, those sync results are likely to be much more numerous and short-lived than typical async requests.

Especially if synchronous execute requests are available to non-authenticated users, it's convenient not to expose them, avoiding any privacy issue by not providing a list of jobs / results (as well as avoiding too many results piling up quickly, without any special effort on the server's part).

But I am not suggesting getting rid of the jobs end-point or altering anything concerning the results; just either replacing its POST method with, or adding as an alternative, a POST method on /processes/{processId}.

pvretano commented Feb 8, 2021

@jerstlouis what you seem to be suggesting is that very late in the game we change the well established execution endpoint from the jobs endpoint to the endpoint that is used by Workflows. Does Workflows not work if it changes its execution endpoint to what is specified in the processes specification?

jerstlouis commented Feb 8, 2021

@pvretano Not exactly...

The well established execution end-point was just changed recently from /processes/{processId}/jobs to /jobs.

The Workflows extension was using /processes/{processId} as:
A) that felt like the logical thing to me (when originally proposing this over a year ago I actually assumed that's what Processes used)
B) when I realized that's not what Processes used, I thought, well, it then conveniently avoids any overlap / confusion between the two approaches.

But considering the recent move to /jobs, the concern with POSTing to /jobs without really creating a job for synchronous execution (the main reason for this suggestion), the fact that there is a mode to distinguish between sync / async, and, for the Workflows extension, maybe a "mode" : "collection" for collection description results (and on-demand processing/triggering based on OGC API requests), or maybe later, for multi-results, a "mode" : "dataset" for OGC API landing page triggers/results, I am now making this proposal to bring it all to /processes/{processId}

because:

What does not make sense is posting to two different endpoints

:)

And yes, Workflows does not work with /jobs, because the client will POST the execute request directly to whatever the value of "process" is (attempting any kind of URL parsing / replacing to figure out what to strip and where to add /jobs would be problematic).

To some extent this argument also works for non-Workflows synchronous or asynchronous requests... being able to POST the execute request directly to whatever the "process" URL is, I think, is quite convenient if request documents are shared (e.g. for Reusability / Reproducibility -- the R in FAIR!).

pvretano commented Feb 8, 2021

@jerstlouis the move to /jobs is NOT recent. Process execution was always to a "jobs" endpoint (previously /processes/{processId}/jobs and now /jobs) ... and that was the point ... process execution always involved a job (whether implicitly or explicitly created).
With regard to the requirements for Workflows, perhaps "process" is not the correct key and "executionEndpoint" is the key that should be in the execute request.

jerstlouis commented Feb 8, 2021

Fair point about always having a .../jobs.
But the implicit job with "mode" : "sync", "response" : "raw", "transmissionMode" : "value" is not a resource the client can interact with per the specs (in that mode the job is just as abstract as with a request to /map or to /tiles/{tmsID}/{level}/{row}/{col}).

@pvretano Well, what I liked very much about the "process" key is that it's also what identifies the type of input.
i.e. in Core you have, for example:

  • "href"
  • "value"

and Workflows adds:

  • "collection"
  • "process"

So the "process" input in Workflows allowing to specify an external process would neatly map to the process identifier / execute end-point in Core, while allowing the "process" object to be recursive/nested in Workflows.

I especially liked the process & collection duality, as it illustrates very well the idea of bringing together data and algorithms/computing resources from any available OGC API service.
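
A sketch of what such a nested execute request could look like (all URLs, process ids, and input names here are hypothetical):

    {
      "process": "https://maps.example.com/ogcapi/processes/RenderMap",
      "inputs": {
        "layers": [
          {
            "process": "https://compute.example.com/ogcapi/processes/ContourGenerator",
            "inputs": {
              "data": { "collection": "https://data.example.com/ogcapi/collections/SRTM" }
            }
          }
        ]
      }
    }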

christophenoel commented:

It makes very little sense to POST a synchronous execute-request to /jobs when you are not expecting to be creating a job but just receiving results right away.

I expect that a WPS client application can discover the jobs and results independently of the execution mode (sync/async). Therefore, I hope that, even if the execution was sync, the client can still access the result at /jobs/{jobId}/result.

"mode" : "sync", "response" : "raw", "transmissionMode" : "value"

Those properties are old fashioned, not in line with RESTful services, and complicate the implementation for no added value. I expect the API to support two modes:

  • Request includes a callback endpoint --> the response is sent async
  • Request does not include a callback endpoint --> the client needs to poll the jobs endpoint.

response (raw/document) and transmissionMode (value, reference) act on the API itself, which is not a good idea at all.
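
A sketch of the two modes proposed above (the "callback" key and all URIs are hypothetical, not something defined in the current draft):

    # request includes a callback endpoint -> the server pushes the result when done
    POST /jobs HTTP/1.1
    Content-Type: application/json

    { "inputs": { "message": "hello" }, "callback": "https://client.example.com/hooks/result" }

    # request does not include a callback endpoint -> the client polls the created job
    GET /jobs/81574318-1eb1-4d7c-af61-4b3fbcf33c4f HTTP/1.1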

jerstlouis commented:

@spacebel With "mode" : "sync", "response" : "raw", "transmissionMode" : "value" the client was never given a jobId.
The result is just directly the data, as is the case for a GET to /map or to a tile...

While this may not seem practical for batch processing taking a lot of time, it makes tons of sense for processes / small requests expected to complete almost immediately, which is what I expect the synchronous mode would be most useful for.
A perfect example of this is the Routes API which we are trying to harmonize with OGC API - Processes, which supports a synchronous mode.
Another excellent example is rendering a map as a process.

Please don't take away that mode, it's the simplest one by far :)

pvretano commented Feb 9, 2021

@spacebel about the async processing ... agreed! see: https://github.com/opengeospatial/ogcapi-common/tree/master/proposals/simple-async

@jerstlouis how exactly is staying with the status quo "taking away that mode"?? OGC API Processes supports synchronous processing and always has. I thought that the current debate is about whether the execution endpoint should be /jobs or /processes/{processesId}. Am I wrong about that? In my opinion, whatever endpoint is chosen for execution it should be a single endpoint regardless of the execution mode (sync or async). Having one execution endpoint for sync and another for async seems unnecessarily complicated.

jerstlouis commented:

@pvretano At the moment OGC API - Processes - Core supports
"mode" : "sync", "response" : "raw", "transmissionMode" : "value"

But I was getting the impression that @spacebel would like to remove that functionality.

You are correct, this issue is strictly about the execution end-point.
I agree with a single end-point -- but I do feel there are many advantages (even putting the Workflows extension aside) with /processes/{processId} being that one end-point :)

christophenoel commented:

To my knowledge the "sync" mode (from the XML bindings) was not migrated to the REST/JSON API. So the question of how the sync mode is performed seems a good one. Note that the sync mode is neither the async callback nor the async polling.

For the 'sync' mode, the approach should be friendly to the REST philosophy (which you probably know better than me). I see two options:

  1. Same REST endpoint path for both:
     • 201 response (as usual) for async mode
     • 200 response with result document for sync mode
  2. Dedicated endpoint path for each.

In any case, I suppose returning two possible response types (status or result) with code 200 on the same endpoint would not be good practice? Such a pattern was more in line with our good old SOAP/XML...

PS: Even in 'sync' mode, you could return a jobId and that would make sense.

jerstlouis commented Feb 9, 2021

@spacebel

For sync mode, the current details are specified at: http://docs.opengeospatial.org/DRAFTS/18-062.html#_response_6

  • A successful execution of the operation SHALL be reported as a response with a HTTP status code 200.
  • If the "response" attribute of the execute request was set to "document", the content of the response SHALL be based upon the OpenAPI 3.0 schema result.yaml
  • If the "response" attribute of the execute request was set to "raw", the content of the response SHALL only include the one output selected by the execute request body.

There is no jobId returned for sync mode, whether response is set to raw or document, or whether transmissionMode is set to value or reference (if I understand correctly, transmissionMode only applies to "response" : "document"). The server can have an internal jobId if it likes and implement a vendor extension exposing that, but as far as I understand, the specifications do not specify anything about jobId for sync mode (and in my opinion, that is exactly how it should be).

For async, the status query is at /jobs/{jobId}, while the results are found at /jobs/{jobId}/results.
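
i.e., for async (job id hypothetical):

    # poll the status ...
    GET /jobs/81574318-1eb1-4d7c-af61-4b3fbcf33c4f HTTP/1.1
    Accept: application/json

    # ... and once the reported status is "successful", fetch the results
    GET /jobs/81574318-1eb1-4d7c-af61-4b3fbcf33c4f/results HTTP/1.1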

christophenoel commented:

There is no jobId returned for sync mode,

I would rather say: there is no jobId returned for sync mode in the current spec (which simply translated the XML bindings instead of rethinking the standard as a modern, RESTful API).

OK fine, but my opinion is still that such options (e.g. response, mode) are absolutely not RESTful. All those modes and options acting on the operation response are exactly what a RESTful microservice should NOT do.

My humble personal point of view :) :)

ghobona commented Feb 10, 2021

All, let's keep in mind that the focus at the moment is on Part 1: Core.

mode and other options are well understood and have been proven to work.

Future extensions may include support for additional approaches such as those discussed in the Pub/Sub whitepaper edited by @sarasaeedi.

fmigneault commented:

I have just taken note of this proposal through other channels, and I am very much against it.

I agree with the comment of @spacebel: POST on /processes/{id} to execute a job is NOT RESTful.
The execution is not done ON the process, but BY the process (or some async worker). There is no way to know that an execution gets created with that path, and it gives the wrong impression that something modifying the process was accomplished (e.g. the way PATCH|PUT /processes/{id} would override/modify the process). The path /processes/{id} should be used only for operations ON the process.

Since nothing is created on /processes/{id}, the process execution, or job, must be created with another path such as /processes/{id}/jobs and retrieved by /processes/{id}/jobs/{id} (or with the corresponding /jobs and /jobs/{id}).
Whether or not the sync mode creates an actual job item is an implementation detail that should not affect the API definition. The actual name job could even be replaced by execution, and the same principle would still apply. No execution item is necessarily created, but the API definition indicates that a job/execution of the process was created, without modifying that process definition.

I also agree with @pvretano that sync vs async executions should be done on the same endpoint. Having different ones unnecessarily complicates things, and since jobs are mandatory for async to retrieve the status, the jobs path should also be used for sync mode.

I again agree with @pvretano that sync does not necessarily mean that there is no job, but instead just enforces that the status is directly returned only when the full execution is completed. The result of that execution could well be persisted in a job, and fetched at a later time.

I would argue also that even a sync execution could return a jobId where further details about the execution, such as processing logs, could be retrieved at a later time. For example, our implementation (https://github.com/crim-ca/weaver) provides /jobs/{id} for status, /jobs/{id}/results for obtained results, /jobs/{id}/inputs for submitted inputs, /jobs/{id}/logs for logging steps and /jobs/{id}/exceptions for details in case of errors. It wouldn't make any sense to query GET /processes/{id}/logs to obtain the logs of one specific execution, as we would be missing a parameter, and that path seems to indicate that we are fetching logs of the process itself rather than of one of its executions. Even when running in sync, being able to retrieve execution logs after the fact is often the only way we can actually debug a failing process.

Finally, POST /processes/{id} could be requested more than once at the same time, even when running both in sync (e.g. when parallel load-balanced workers run the API). There would be no way at all to distinguish between both executions without a jobId in that case.
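
For example, with the job sub-resources listed above (job id hypothetical), a specific execution can still be inspected afterwards:

    # processing log for that one execution
    GET /jobs/0d5a8e6d/logs HTTP/1.1

    # error details if that execution failed
    GET /jobs/0d5a8e6d/exceptions HTTP/1.1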

bpross-52n commented:

Thanks @fmigneault, we should take this into consideration. I also invite you to our next SWG meeting on March 17, 9 a.m. EDT, to take part in the discussion.

jerstlouis commented Mar 17, 2021

2021-03-17 SWG (Members) Meeting: We agreed to /processes/{processId}/execution.
One argument was that this would facilitate the DAPA API integration where execution via GET is supported.
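
With that resolution, execution in either mode goes to a single end-point, e.g. (process id and body illustrative):

    POST /processes/EchoProcess/execution HTTP/1.1
    Content-Type: application/json

    { "inputs": { "message": "hello" } }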

bpross-52n commented:

Should be fixed by #159
