Skip to content

Configure HTTP WebServer Transaction timeouts#5047

Merged
vpaturet merged 16 commits into
opentripplanner:dev-2.xfrom
entur:otp2_timeout
May 2, 2023
Merged

Configure HTTP WebServer Transaction timeouts#5047
vpaturet merged 16 commits into
opentripplanner:dev-2.xfrom
entur:otp2_timeout

Conversation

@vpaturet

@vpaturet vpaturet commented Apr 13, 2023

Copy link
Copy Markdown
Contributor

Summary

As detailed in #5037, protecting the server from resource-consuming queries can be achieved by setting a HTTP transaction timeout at the HTTP server level.
This PR provides support for configuring the timeout and add timeout control points in the code before time-consuming operations. It also ensures that the timeout is propagated to any async tasks spawned by the request.
The new parameter "apiProcessingTimeout" is configurable in the router configuration. By default the timeout is deactivated.

In case of timeout, OTP returns a GraphQL response with HTTP status "422 Unprocessable entity" that contains the following error field:

{
    "errors": [
        {
            "message": "PROCESSING_TIMEOUT",
            "extensions": {
                "classification": "ExecutionAborted"
            }
        }
    ],
    "data": {
        "serverInfo": {
            "version": "2.3.0-SNAPSHOT"
        },
        "trip": {
            "tripPatterns": []
        }
    }
}

The REST API returns the following response, with HTTP status "422 Unprocessable entity":

{"requestParameters":{"date":"04-25-2023","mode":"TRANSIT,WALK","arriveBy":"false","wheelchair":"false","debugItineraryFilter":"false","showIntermediateStops":"true","fromPlace":"61.1147189610296,10.461744368076326","baseLayer":"OSM Standard Tiles","toPlace":"60.79163495097939,11.077045798301697","time":"2:28pm","locale":"en"},"error":{"id":422,"msg":"The trip planner is taking too long to process your request.","message":"PROCESSING_TIMEOUT"}}

Implementation notes:

  • The new timeout parameter is named "apiProcessingTimeout" to prevent any confusion with other usual causes of latency (slow network, dead socket connections, tasks waiting in a thread pool queue, ...). The countdown before timing out starts when the request is picked up by a thread in the web server thread pool.

  • This PR uses thread interruption to propagate the request cancellation to async tasks spawned by the request (parallel routing feature, parallel heuristic). Since interruption is not supported by the CompletableFuture framework, CompletableFutures have been refactored as Futures. A dedicated, application-wide, thread pool, equivalent to the common pool used by CompletableFutures, is used for executing these async tasks.

  • The actual cancellation of a task relies on cooperative multi-threading: a thread that is executing a long-running task should regularly check the interruption flag and stop execution if the flag is set. "Control points" have been added in the code before the most computation-intensive code blocks.

  • The HTTP code 408 (Request Timeout) means that the server forcibly closed a client connection because the client took too long to send a request, It is not appropriate to send this code when the request takes too long to process on the server side. On the other hand, the HTTP code 422 means that the request is syntactically correct but semantically invalid, which makes sense when the client is asking for a request that is beyond the capabilities of the server.

  • The rationale for using a 4XX status code over a 5XX status code is that 5XX denotes a server-side issue and is implicitly retryable, while 4XX means that the client should change its request (non-retryable error).

  • The HTTP code 422 is used in the REST API and the Transmodel API, but no in the Legacy GraphQL API (possible breaking change). In the Legacy GraphQL API, timeouts are mapped to the generic "system error" message.

  • Deprecation of the batch request feature (Transmodel API, Legacy GraphQL API) should be considered in a separate PR.

Issue

partially implement #5037

Unit tests

Documentation

Updated router configuration

@codecov

codecov Bot commented Apr 13, 2023

Copy link
Copy Markdown

Codecov Report

Patch coverage: 35.10% and project coverage change: -0.11 ⚠️

Comparison is base (78815c3) 64.60% compared to head (d93a24a) 64.49%.

Additional details and impacted files
@@              Coverage Diff              @@
##             dev-2.x    #5047      +/-   ##
=============================================
- Coverage      64.60%   64.49%   -0.11%     
- Complexity     13886    13895       +9     
=============================================
  Files           1710     1720      +10     
  Lines          67054    67237     +183     
  Branches        7206     7213       +7     
=============================================
+ Hits           43319    43364      +45     
- Misses         21313    21445     +132     
- Partials        2422     2428       +6     
Impacted Files Coverage Δ
...anner/ext/legacygraphqlapi/LegacyGraphQLUtils.java 24.56% <0.00%> (ø)
...tripplanner/ext/transmodelapi/TransmodelGraph.java 0.00% <0.00%> (ø)
...er/ext/transmodelapi/TransmodelGraphQLPlanner.java 0.00% <0.00%> (ø)
.../support/OTPProcessingTimeoutGraphQLException.java 0.00% <0.00%> (ø)
...opentripplanner/api/common/OTPExceptionMapper.java 0.00% <0.00%> (ø)
.../opentripplanner/api/resource/PlannerResource.java 3.27% <0.00%> (-0.43%) ⬇️
...er/framework/concurrent/InterruptibleExecutor.java 0.00% <0.00%> (ø)
.../opentripplanner/framework/http/OtpHttpStatus.java 0.00% <0.00%> (ø)
...anner/raptor/service/RangeRaptorDynamicSearch.java 67.54% <0.00%> (-3.10%) ⬇️
.../algorithm/raptoradapter/router/TransitRouter.java 75.67% <0.00%> (ø)
... and 17 more

... and 39 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@t2gran t2gran added this to the 2.4 milestone Apr 24, 2023
@vpaturet vpaturet marked this pull request as ready for review April 24, 2023 13:10
@vpaturet vpaturet requested a review from a team as a code owner April 24, 2023 13:10
@t2gran

t2gran commented Apr 25, 2023

Copy link
Copy Markdown
Member

@leonardehrenfried and @optionsome Can you try to test this with the Legacy GraphQL API and we should decide what we need to do for best supporting timeouts in the API. To make the call time-out, just insert a Thread.sleep in the code.

case OUTSIDE_BOUNDS -> LegacyGraphQLRoutingErrorCode.OUTSIDE_BOUNDS;
case OUTSIDE_SERVICE_PERIOD -> LegacyGraphQLRoutingErrorCode.OUTSIDE_SERVICE_PERIOD;
case SYSTEM_ERROR -> LegacyGraphQLRoutingErrorCode.SYSTEM_ERROR;
case SYSTEM_ERROR, PROCESSING_TIMEOUT -> LegacyGraphQLRoutingErrorCode.SYSTEM_ERROR;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will map this to a proper error code afterwards.

try {
CompletableFuture
.allOf(
CompletableFuture.runAsync(() -> routeDirectStreet(itineraries, routingErrors)),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know which executors the CompletableFuture uses by default, @vpaturet ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/main/java/org/opentripplanner/standalone/config/RouterConfig.java Outdated
Comment thread src/main/java/org/opentripplanner/standalone/config/RouterConfig.java Outdated

@Override
public Collection<Station> getStations() {
OTPRequestTimeoutException.checkForTimeout();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason that the checks for the timeouts happen before the actual work takes place?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both options are valid (before and after). If you check before you ensure that you will not waste time and CPU doing something that will be rejected anyway.

@leonardehrenfried

leonardehrenfried commented Apr 25, 2023

Copy link
Copy Markdown
Member

@leonardehrenfried

Copy link
Copy Markdown
Member

I tested this with a Thread.sleep on the legacy GraphQL API and it does what it says.

Applied review suggestions

Co-authored-by: Leonard Ehrenfried <mail@leonard.io>
Comment thread docs/RouterConfiguration.md
@t2gran t2gran mentioned this pull request Apr 27, 2023
@t2gran t2gran added !New Feature A functional feature targeting the end user. !Technical Debt Improve code quality, no functional changes. labels Apr 28, 2023
Comment thread src/main/java/org/opentripplanner/framework/http/OtpHttpStatus.java Outdated
Comment thread src/main/java/org/opentripplanner/framework/http/OtpHttpStatus.java Outdated
Comment thread src/main/java/org/opentripplanner/standalone/config/RouterConfig.java Outdated
Comment thread src/main/java/org/opentripplanner/standalone/config/RouterConfig.java Outdated
Co-authored-by: Thomas Gran <t2gran@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

!New Feature A functional feature targeting the end user. !Technical Debt Improve code quality, no functional changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants