Change ActivityPub::DeliveryWorker retries to be spread out more #21956

Merged — 1 commit merged into mastodon:main on Mar 3, 2023

Conversation

ClearlyClaire (Contributor) commented on Dec 2, 2022

Currently, `ActivityPub::DeliveryWorker` jobs are retried up to 16 times using Sidekiq's default delay and jitter, which means retries are spaced according to the following formula:
`(count**4) + 15 + rand(10) * (count + 1)`

In some cases (e.g. account migrations), large numbers of jobs are queued at the same time, so failing ones also end up being retried at approximately the same time: while the 16th retry is spaced from the 15th retry by about 14 hours, the `rand(10) * (count + 1)` jitter only splits them into 10 buckets spread over less than 3 minutes in total! The real-life situation is more nuanced, but it is still within the same ballpark.
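To make those numbers concrete, here is a quick back-of-the-envelope check (my own illustration, not part of the PR), using the default formula for the final retry (`count = 15`):

```ruby
count = 15                    # the 16th and final retry
base  = (count**4) + 15       # => 50_640 seconds, roughly 14 hours after the 15th retry
jitter_max = 9 * (count + 1)  # rand(10) yields 0..9, so at most 144 seconds of spread
puts "base: #{base}s (~#{base / 3600}h), jitter spread: up to #{jitter_max}s (~#{jitter_max / 60} min)"
```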

This PR changes it so that retries are spaced out much more evenly, by adding another jitter component of `rand(0.5 * (count ** 4))`, as sketched below. This does not replace Sidekiq's default jitter, because there is no option to control it.
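A minimal sketch of that approach (my own reconstruction; the exact merged diff may differ) is to define a `sidekiq_retry_in` block that returns the default base delay plus the extra jitter, knowing that Sidekiq will still add its own `rand(10) * (count + 1)` on top:

```ruby
class ActivityPub::DeliveryWorker
  include Sidekiq::Worker

  sidekiq_options retry: 16

  sidekiq_retry_in do |count|
    # Sidekiq's default base delay...
    base = (count**4) + 15
    # ...plus an extra jitter proportional to the base delay, so jobs that
    # failed at the same time stop retrying in lockstep
    base + rand(0.5 * (count**4))
  end
end
```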

We already do something similar in the `ExponentialBackoff` concern, but the backoff is much more aggressive than I think is necessary here.

ClearlyClaire (Author) added:
The specific issue this PR is trying to address is migrating to a new account when you have a large number of followers: in this case, the server you're migrating to would receive a lot of follow requests in a short amount of time. Retries help, but they would mostly come in waves. Spreading retries out more gives the receiving server a better chance to process those requests adequately.

Gargron (Member) commented on Dec 15, 2022

> We already do something similar in the `ExponentialBackoff` concern, but the backoff is much more aggressive than I think is necessary here.

Are you sure we can't just reuse it?

ClearlyClaire (Author) replied:

We could, but jobs that are currently retried over a span of about 2 days would then be retried over more than 3 weeks.
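For reference, a rough calculation (mine, not from the PR) of how the "about 2 days" figure falls out of Sidekiq's default schedule, ignoring the small jitter terms:

```ruby
# Total delay accumulated across all 16 retries with the default base delay
total = (0...16).sum { |count| (count**4) + 15 }
puts "#{total} seconds ≈ #{(total / 86_400.0).round(1)} days"  # => "178552 seconds ≈ 2.1 days"
```

The `ExponentialBackoff` concern's more aggressive schedule stretches that same window to over three weeks, hence the reluctance to reuse it here.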

simonft commented on Feb 15, 2023

I just migrated a 10k-follower account to an account on its own server and can speak to this change being helpful. It seems the server needs to be heavily over-provisioned at first, to make sure it can accept at least 1/16th of the followers before the requests time out in each of the stampedes that hit it, and can then be scaled down substantially.


Gargron merged commit ddde4e0 into mastodon:main on Mar 3, 2023
arachnist pushed a commit to arachnist/mastodon that referenced this pull request Apr 4, 2023
Roboron3042 pushed a commit to Roboron3042/mastodon that referenced this pull request Apr 16, 2023
skerit pushed a commit to 11ways/mastodon that referenced this pull request Jul 7, 2023