New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change ActivityPub::DeliveryWorker
retries to be spread out more
#21956
Conversation
The specific issue this PR is trying to address is migrating to a new account when you have a large amount of followers: in this case, the server you're migrating to would receive a lot of follow requests in a short amount of time. Retries help, but they would mostly come in waves. Spreading out more gives more chances to the receiving server to adequately process those requests. |
Currently, `ActivityPub::DeliveryWorker` jobs are retried up to 16 times using Sidekiq's default delay and jitter, which means retries are spaced by the following formula: `(count**4) + 15 + rand(10) * (count + 1)` In some cases (e.g. account migrations), large amounts of jobs are queued at the same time, resulting in failing ones also being retried at approximately the same time: while the 16th retry is spaced from the 15th retry by about 14 hours, the `rand(10) * (count + 1)` splits them in 10 buckets spreading over less than 3 minutes in total! While the real life situation is more nuanced, it's still within the same ballpark. This PR changes it so that retries are spaced out much more evenly, by adding another jitter component of `rand(0.5 * (count ** 4))`. This does not replace Sidekiq's default one because there is no option to control it.
71b0d93
to
4f6a8c8
Compare
Are you sure we can't just reuse it? |
We could, but jobs currently retried for about 2 days would be retried for over 3 weeks. |
I just migrated a 10k follower account to an account on its own server and can speak to this change being helpful. It seems like the server needs to be way over provisioned at first, to make sure it can accept at least 1/16th of the followers before the requests time out in each of the stampedes that hit it, and then can be scaled down a bunch. |
Currently,
ActivityPub::DeliveryWorker
jobs are retried up to 16 times using Sidekiq's default delay and jitter, which means retries are spaced by the following formula:(count**4) + 15 + rand(10) * (count + 1)
In some cases (e.g. account migrations), large amounts of jobs are queued at the same time, resulting in failing ones also being retried at approximately the same time: while the 16th retry is spaced from the 15th retry by about 14 hours, the
rand(10) * (count + 1)
splits them in 10 buckets spreading over less than 3 minutes in total! While the real life situation is more nuanced, it's still within the same ballpark.This PR changes it so that retries are spaced out much more evenly, by adding another jitter component of
rand(0.5 * (count ** 4))
. This does not replace Sidekiq's default one because there is no option to control it.We already do something similar in the
ExponentialBackoff
concern, but the backoff is much more aggressive than I think is necessary here.