Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic retry for rust crypto outgoing requests #4061

Merged
merged 10 commits into from Feb 26, 2024

Conversation

BillCarsonFr
Copy link
Member

@BillCarsonFr BillCarsonFr commented Feb 9, 2024

Add basic retry for outgoing requests supported by the rust SDK.
It will retry 3 times if:

  • Http status is 429, 500, 502, 503, 504, 525 (a bit arbitrary what would you suggest?)
  • If it's a connection error (timeout, connection went down ...)

It's also looking for the retry_after_ms of DEFAULT_RETRY_DELAY_MS to wait for the appropriate time.

Fixes element-hq/element-web#26755 as it ultimatly fails on 500 and will work if retried

Checklist

  • Tests written for new code (and old code if feasible)
  • Linter and other CI checks pass
  • Sign-off given on the changes (see CONTRIBUTING.md)

Here's what your changelog entry will look like:

馃悰 Bug Fixes

@BillCarsonFr BillCarsonFr requested a review from a team as a code owner February 9, 2024 18:15
@BillCarsonFr BillCarsonFr added the T-Defect Bugs, crashes, hangs, vulnerabilities, or other reported problems label Feb 10, 2024
Copy link
Member

@t3chguy t3chguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying 504 isn't necessarily safe as the action you requested may have already been performed but the response from the backend was dropped. It can only be retried on idempotent APIs and there a resulting error may need to be caught accordingly so the caller isn't told that the API failed due to e.g. M_NOT_FOUND or whatever.

@BillCarsonFr
Copy link
Member Author

Retrying 504 isn't necessarily safe as the action you requested may have already been performed but the response from the backend was dropped. It can only be retried on idempotent APIs and there a resulting error may need to be caught accordingly so the caller isn't told that the API failed due to e.g. M_NOT_FOUND or whatever.

Ok i removed 504 from the list. It's true that some of the concerned requests are idempotent but I'll keep this PR simple for now

@richvdh
Copy link
Member

richvdh commented Feb 12, 2024

Virtually (?) all matrix APIs are idempotent, deliberately so that they can be safely retried by clients. I would argue that, for simplicity, all 5xx response codes should be retried.

We should maybe also think about timeouts. Should we ever retry when the request times out?

Copy link
Member

@t3chguy t3chguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the PR title, this implies it applies to outgoing requests generally, not specifically to a part of rust crypto.

@BillCarsonFr BillCarsonFr changed the title Add basic retry for outgoing requests Add basic retry for rust crypto outgoing requests Feb 13, 2024
@BillCarsonFr
Copy link
Member Author

Virtually (?) all matrix APIs are idempotent, deliberately so that they can be safely retried by clients. I would argue that, for simplicity, all 5xx response codes should be retried.

We should maybe also think about timeouts. Should we ever retry when the request times out?

I updated given that all requests are indempotent, now retrying any 5xx requests.
Also client timeouts are not currently retried (Will be AbortError and not ConnectionError). There is an option to set the local timeout on MatrixClient, but it's not used and defaults to a very very high timing, so asm you said we should probably not retry them


currentRetryCount++;

const maybeRetryAfter = this.shouldWaitBeforeRetryingMillis(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a class that implements the retry logic. It would probably be better to use that instead of writing something else here. https://github.com/matrix-org/matrix-js-sdk/blob/develop/src/scheduler.ts#L54

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I have seen it. Indeed I think that there should be a retry mecanism inside the FetchHttpApi maybe with a new flag in IRequestOpts retriable (default true as most mx request are indempotent).
Then the event scheduler will benefit from it too (unless there needs to be special logic for events).
It will help on other issues like element-hq/element-web#26967

The thing is that it would impact the application a bit everywhere, I'd like to have support for the web team if we want to go that way.

(on side note I noticed that the scheduler has blocked retry on 502/M_TOO_LARGE, might not be common for rust outgoing request, but might want to exclude it too)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a two different suggestions in this thread:

  • MatrixScheduler and OutgoingRequestManager both implement logic to decide whether to retry a request, and if so when. We could extract that logic and share it, rather than having two copies. (Presumably, we would make a new method somewhere, and have MatrixScheduler.RETRY_BACKOFF_RATELIMIT call it.) That seems like quite a good idea to me. @BillCarsonFr: any reason we can't do this?
  • Move the retry mechanism down to FetchHttpApi, so that everything (including MatrixScheduler and OutgoingRequestManager) can benefit from it. This seems like a much more invasive change, and one I don't think we should do right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we can't do this?

The RETRY_BACKOFF_RATELIMIT function is a bit strange, it has an used parameter, it is trying to cast a MatrixError into a ConnectionError (this can't work? or maybe at runtime?)
It doesn't want to retry on ConnectionError (like if the client internet connection is down), and we want that.

So I think we can't use it directly as it doesn't seem to be the same logic, and touching it will be also quite invasive. The only thing we could take easily is the exponential backoff, but it's just one line.

It's still bad that the logic doesn't match, but we might want to unify that in a second step?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so after discussion I extracted a common retry alg between the scheduler and the outgoing request processor.
2 slight changes on the previous state:

  • Outgoing request processor try 5 times in total instead of 4
  • The matrix scheduer is not retrying on client timeouts anymore (as this timeout is already > 1mn depending on browser)

Here is the refactoring commit c2126a6

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible but a few suggestions for cleanups

src/rust-crypto/OutgoingRequestProcessor.ts Outdated Show resolved Hide resolved
src/rust-crypto/OutgoingRequestProcessor.ts Show resolved Hide resolved
src/rust-crypto/OutgoingRequestProcessor.ts Outdated Show resolved Hide resolved
src/rust-crypto/OutgoingRequestProcessor.ts Outdated Show resolved Hide resolved
Comment on lines 254 to 255
// All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
// several keys but won't cause harm.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is hard to understand.

Suggested change
// All keys/signatures uploads are, message and to device are because of txn_id, keys claim in worst case will claim
// several keys but won't cause harm.
// * All key/signature uploads are idempotent.
// * Room message and to-device send requests are idempotent because of txn_id.
// * Keys claim in worst case will claim several keys but won't cause harm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, updated

* is resolved or when the rate limit is reset.
* @param httpStatus - the HTTP status code of the response
*/
private canRetry(httpStatus: number): boolean {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, this should probably be after shouldWaitBeforeRetryingMillis for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, changed

spec/unit/rust-crypto/OutgoingRequestProcessor.spec.ts Outdated Show resolved Hide resolved
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks sensible otherwise

src/request-retry-utils.ts Outdated Show resolved Hide resolved
src/request-retry-utils.ts Outdated Show resolved Hide resolved
src/request-retry-utils.ts Outdated Show resolved Hide resolved
src/request-retry-utils.ts Outdated Show resolved Hide resolved
src/request-retry-utils.ts Outdated Show resolved Hide resolved
src/rust-crypto/OutgoingRequestProcessor.ts Outdated Show resolved Hide resolved
src/request-retry-utils.ts Outdated Show resolved Hide resolved
@BillCarsonFr BillCarsonFr added this pull request to the merge queue Feb 26, 2024
Merged via the queue into develop with commit d3dfcd9 Feb 26, 2024
23 checks passed
@BillCarsonFr BillCarsonFr deleted the valere/element-r/retry_outgoing_requests branch February 26, 2024 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-Defect Bugs, crashes, hangs, vulnerabilities, or other reported problems
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Element-R | "Unable to set up keys" error on register
4 participants