Change ActivityPub::DeliveryWorker retries to be spread out more #21956

Merged — 1 commit merged into mastodon:main on Mar 3, 2023

Conversation

ClearlyClaire (Contributor) commented on Dec 2, 2022

Currently, `ActivityPub::DeliveryWorker` jobs are retried up to 16 times using Sidekiq's default delay and jitter, which means retries are spaced according to the following formula:
`(count**4) + 15 + rand(10) * (count + 1)`

In some cases (e.g. account migrations), large numbers of jobs are queued at the same time, so failing ones also end up being retried at approximately the same time: while the 16th retry is spaced from the 15th retry by about 14 hours, the `rand(10) * (count + 1)` jitter only splits them into 10 buckets spread over less than 3 minutes in total! The real-life situation is more nuanced, but it is still within the same ballpark.
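To make those numbers concrete, here is a quick back-of-the-envelope check (my own illustration, not part of the PR), using the default formula for the final retry (`count = 15`):

```ruby
count = 15                    # the 16th and final retry
base  = (count**4) + 15       # => 50_640 seconds, roughly 14 hours after the 15th retry
jitter_max = 9 * (count + 1)  # rand(10) yields 0..9, so at most 144 seconds of spread
puts "base: #{base}s (~#{base / 3600}h), jitter spread: up to #{jitter_max}s (~#{jitter_max / 60} min)"
```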

This PR changes it so that retries are spaced out much more evenly, by adding another jitter component of `rand(0.5 * (count ** 4))`, as sketched below. This does not replace Sidekiq's default jitter, because there is no option to control it.
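A minimal sketch of that approach (my own reconstruction; the exact merged diff may differ) is to define a `sidekiq_retry_in` block that returns the default base delay plus the extra jitter, knowing that Sidekiq will still add its own `rand(10) * (count + 1)` on top:

```ruby
class ActivityPub::DeliveryWorker
  include Sidekiq::Worker

  sidekiq_options retry: 16

  sidekiq_retry_in do |count|
    # Sidekiq's default base delay...
    base = (count**4) + 15
    # ...plus an extra jitter proportional to the base delay, so jobs that
    # failed at the same time stop retrying in lockstep
    base + rand(0.5 * (count**4))
  end
end
```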

We already do something similar in the `ExponentialBackoff` concern, but the backoff is much more aggressive than I think is necessary here.

ClearlyClaire (Author) added:
The specific issue this PR is trying to address is migrating to a new account when you have a large number of followers: in this case, the server you're migrating to would receive a lot of follow requests in a short amount of time. Retries help, but they would mostly come in waves. Spreading retries out more gives the receiving server a better chance to process those requests adequately.

Gargron (Member) commented on Dec 15, 2022

> We already do something similar in the `ExponentialBackoff` concern, but the backoff is much more aggressive than I think is necessary here.

Are you sure we can't just reuse it?

ClearlyClaire (Author) replied:

We could, but jobs that are currently retried over a span of about 2 days would then be retried over more than 3 weeks.
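For reference, a rough calculation (mine, not from the PR) of how the "about 2 days" figure falls out of Sidekiq's default schedule, ignoring the small jitter terms:

```ruby
# Total delay accumulated across all 16 retries with the default base delay
total = (0...16).sum { |count| (count**4) + 15 }
puts "#{total} seconds ≈ #{(total / 86_400.0).round(1)} days"  # => "178552 seconds ≈ 2.1 days"
```

The `ExponentialBackoff` concern's more aggressive schedule stretches that same window to over three weeks, hence the reluctance to reuse it here.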

simonft commented on Feb 15, 2023

I just migrated a 10k-follower account to an account on its own server and can speak to this change being helpful. It seems the server needs to be heavily over-provisioned at first, to make sure it can accept at least 1/16th of the followers before the requests time out in each of the stampedes that hit it, and can then be scaled down substantially.


Gargron merged commit ddde4e0 into mastodon:main on Mar 3, 2023
arachnist pushed a commit to arachnist/mastodon that referenced this pull request Apr 4, 2023
Roboron3042 pushed a commit to Roboron3042/mastodon that referenced this pull request Apr 16, 2023
skerit pushed a commit to 11ways/mastodon that referenced this pull request Jul 7, 2023