Configure HTTP WebServer Transaction timeouts#5047
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev-2.x #5047 +/- ##
=============================================
- Coverage 64.60% 64.49% -0.11%
- Complexity 13886 13895 +9
=============================================
Files 1710 1720 +10
Lines 67054 67237 +183
Branches 7206 7213 +7
=============================================
+ Hits 43319 43364 +45
- Misses 21313 21445 +132
- Partials 2422 2428 +6
☔ View full report in Codecov by Sentry. |
|
@leonardehrenfried and @optionsome Can you try to test this with the Legacy GraphQL API and we should decide what we need to do for best supporting timeouts in the API. To make the call time-out, just insert a Thread.sleep in the code. |
| case OUTSIDE_BOUNDS -> LegacyGraphQLRoutingErrorCode.OUTSIDE_BOUNDS; | ||
| case OUTSIDE_SERVICE_PERIOD -> LegacyGraphQLRoutingErrorCode.OUTSIDE_SERVICE_PERIOD; | ||
| case SYSTEM_ERROR -> LegacyGraphQLRoutingErrorCode.SYSTEM_ERROR; | ||
| case SYSTEM_ERROR, PROCESSING_TIMEOUT -> LegacyGraphQLRoutingErrorCode.SYSTEM_ERROR; |
There was a problem hiding this comment.
I will map this to a proper error code afterwards.
| try { | ||
| CompletableFuture | ||
| .allOf( | ||
| CompletableFuture.runAsync(() -> routeDirectStreet(itineraries, routingErrors)), |
There was a problem hiding this comment.
Do you happen to know which executors the CompletableFuture uses by default, @vpaturet ?
There was a problem hiding this comment.
They use ForkJoinPool.commonPool()
https://dzone.com/articles/be-aware-of-forkjoinpoolcommonpool
|
|
||
| @Override | ||
| public Collection<Station> getStations() { | ||
| OTPRequestTimeoutException.checkForTimeout(); |
There was a problem hiding this comment.
What is the reason that the checks for the timeouts happen before the actual work takes place?
There was a problem hiding this comment.
Both options are valid (before and after). If you check before you ensure that you will not waste time and CPU doing something that will be rejected anyway.
|
I tested this with a |
Applied review suggestions Co-authored-by: Leonard Ehrenfried <mail@leonard.io>
…onfig The 'streetRoutingTimeout' is moved into the 'routingDefaults' as well.
Co-authored-by: Thomas Gran <t2gran@gmail.com>
Summary
As detailed in #5037, protecting the server from resource-consuming queries can be achieved by setting a HTTP transaction timeout at the HTTP server level.
This PR provides support for configuring the timeout and add timeout control points in the code before time-consuming operations. It also ensures that the timeout is propagated to any async tasks spawned by the request.
The new parameter "apiProcessingTimeout" is configurable in the router configuration. By default the timeout is deactivated.
In case of timeout, OTP returns a GraphQL response with HTTP status "422 Unprocessable entity" that contains the following error field:
The REST API returns the following response, with HTTP status "422 Unprocessable entity":
{"requestParameters":{"date":"04-25-2023","mode":"TRANSIT,WALK","arriveBy":"false","wheelchair":"false","debugItineraryFilter":"false","showIntermediateStops":"true","fromPlace":"61.1147189610296,10.461744368076326","baseLayer":"OSM Standard Tiles","toPlace":"60.79163495097939,11.077045798301697","time":"2:28pm","locale":"en"},"error":{"id":422,"msg":"The trip planner is taking too long to process your request.","message":"PROCESSING_TIMEOUT"}}Implementation notes:
The new timeout parameter is named "apiProcessingTimeout" to prevent any confusion with other usual causes of latency (slow network, dead socket connections, tasks waiting in a thread pool queue, ...). The countdown before timing out starts when the request is picked up by a thread in the web server thread pool.
This PR uses thread interruption to propagate the request cancellation to async tasks spawned by the request (parallel routing feature, parallel heuristic). Since interruption is not supported by the CompletableFuture framework, CompletableFutures have been refactored as Futures. A dedicated, application-wide, thread pool, equivalent to the common pool used by CompletableFutures, is used for executing these async tasks.
The actual cancellation of a task relies on cooperative multi-threading: a thread that is executing a long-running task should regularly check the interruption flag and stop execution if the flag is set. "Control points" have been added in the code before the most computation-intensive code blocks.
The HTTP code 408 (Request Timeout) means that the server forcibly closed a client connection because the client took too long to send a request, It is not appropriate to send this code when the request takes too long to process on the server side. On the other hand, the HTTP code 422 means that the request is syntactically correct but semantically invalid, which makes sense when the client is asking for a request that is beyond the capabilities of the server.
The rationale for using a 4XX status code over a 5XX status code is that 5XX denotes a server-side issue and is implicitly retryable, while 4XX means that the client should change its request (non-retryable error).
The HTTP code 422 is used in the REST API and the Transmodel API, but no in the Legacy GraphQL API (possible breaking change). In the Legacy GraphQL API, timeouts are mapped to the generic "system error" message.
Deprecation of the batch request feature (Transmodel API, Legacy GraphQL API) should be considered in a separate PR.
Issue
partially implement #5037
Unit tests
✅
Documentation
Updated router configuration