Mastodon does not respect ntfy.sh's 429 responses and gets temporarily IP-blacklisted #26078
Comments
I can recommend looking into something like sidekiq-rate-limiter. Mastodon could then do dynamic throttling based on how many errors per hour there are, which means it would eventually settle into a rhythm where it occasionally hits the rate limit but still works. This would also help overburdened self-hosted notifiers, which would likewise be backed off once they start returning errors.
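A minimal, self-contained sketch of the error-driven throttling idea described above (illustrative only, not the sidekiq-rate-limiter API; the class name, window size, and threshold are invented for the example):

```ruby
# Illustrative sketch of dynamic throttling based on recent delivery errors.
# All names and thresholds here are assumptions, not Mastodon code.
class ErrorThrottle
  def initialize(window: 3600, max_errors: 100)
    @window = window          # look-back window in seconds
    @max_errors = max_errors  # errors per window before backing off
    @errors = []              # timestamps of recent failures
  end

  def record_error(now = Time.now.to_i)
    @errors << now
  end

  def throttled?(now = Time.now.to_i)
    @errors.reject! { |t| t < now - @window } # drop entries older than the window
    @errors.size >= @max_errors
  end
end
```

A worker would call `record_error` on each 429 and delay (or skip) deliveries to that endpoint while `throttled?` is true.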
Experiencing the same issue on rheinneckar.social, where the ntfy.sh timeouts are blocking the whole push queue and thus the delivery of posts to other instances.
Last night, meow.social was also affected: a large share of our system resources was consumed, and push processing fell more than an hour behind, with the delay constantly increasing. This is a problem because it effectively causes an internal DoS. As a workaround, we redirected the traffic normally destined for ntfy.sh to 127.0.0.1 at the DNS level on the server that handles push. We're eagerly awaiting a fix in a mainline Mastodon release.
I'm honestly surprised that mastodon.social hasn't run into this issue itself yet. If it did, this issue would probably get a lot more attention.
Social.coop (~2k users) ran into this issue a few weeks back. We had to manually remove the ntfy.sh registrations to get out of a stuck-queue situation. Could this be prioritized? Can the community help? Thank you!
We ran into this just today on our instance. How have people dealt with it? By running their own ntfy instance? We blocked outbound traffic to their service, and they seem to drop their block after some time, but I assume we'll hit the unhandled rate limit again as soon as I let the traffic pass from our side...
Ran into this today as well, even though there are literally only two users on my instance at the moment. And, as previous commenters noted, this not only breaks push notifications for apps but also breaks federation entirely: no posts from my instance make it to other instances (at least none made it in the last 6 hours), because the queue is filled with 30 retried connection attempts to ntfy per minute (for just two ntfy tokens). There also seems to be no clear or easy workaround for a Docker setup, because updating the /etc/hosts file on the host does not prevent the sidekiq container from trying to connect to the actual ntfy.sh.
Right now I'm not sure how non-expert admins of small instances can recover from this and restore federation. I've added an override (just a random valid IPv6 address that is unreachable from inside the sidekiq container, because it has no IPv6 connectivity) to the sidekiq config in docker-compose.yml. But I still have 23k entries in my queue, and deleting these push tasks manually from the "Retries" page requires a lot of attention to avoid accidentally deleting unrelated jobs.
TIP: you can parallelise the push queue massively, as most of it is HTTP connect-and-wait. Where we run a concurrency of 8 per container for CPU-intensive queues (like ingress and default), we run 32 for our push queues. So I suggest spinning up some extra workers just for clearing the push queue and churning through the retries; they should eventually clear.
@ShadowJonathan I'm not sure how to do this with docker-based setup? |
If you're using the default docker-compose file, I can recommend adding a new container with (something like) the following:
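The original snippet was not preserved in this copy of the thread; a hedged sketch of what such a service could look like (service name, image tag, and concurrency value are assumptions — match them to your existing sidekiq service):

```yaml
# Extra sidekiq worker dedicated to the push queue (names/values are illustrative).
sidekiq-push:
  image: ghcr.io/mastodon/mastodon:v4.1.4     # match your main sidekiq image
  restart: always
  env_file: .env.production
  command: bundle exec sidekiq -q push -c 32  # 32 threads, push queue only
  depends_on:
    - db
    - redis
  networks:
    - external_network
    - internal_network
```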
@ShadowJonathan thank you! In my case the problem was caused not by too many notifications but by too many (duplicate) subscriptions, so I updated the old subscriptions in the database to use an "invalid://" endpoint to prevent this from happening again (until new duplicate subscriptions get created). Even with default settings, the push queue is now gradually shrinking (by about 200 tasks per minute). But your advice should help other admins :)
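For reference, the database change described above could look something like this (hedged sketch; the table and column names assume Mastodon's `web_push_subscriptions` schema, and the filter assumes the stale subscriptions all point at ntfy.sh — back up before running anything like it):

```sql
-- Rewrite stale ntfy.sh subscriptions so deliveries fail fast instead of retrying.
-- Table/column names are assumptions based on Mastodon's schema.
UPDATE web_push_subscriptions
SET endpoint = 'invalid://'
WHERE endpoint LIKE 'https://ntfy.sh/%';
```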
Oh, I just remembered: deleting the subscriptions entirely should short-cut the jobs and return them as complete, churning through them as quickly as possible without throwing them onto the retry queue. :) See mastodon/app/workers/web/push_notification_worker.rb, lines 47 to 49 at d4e0949.
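The referenced worker lines suggest that a job whose subscription no longer exists completes immediately instead of retrying. A self-contained sketch of that pattern (hypothetical data store and names, not the actual worker code):

```ruby
# Hedged, self-contained sketch of the short-circuit described above:
# a deleted subscription makes the job complete instead of landing on the retry queue.
SUBSCRIPTIONS = { 1 => 'https://ntfy.sh/abc' } # stand-in for the subscriptions table

class MissingSubscription < StandardError; end

def perform(subscription_id)
  endpoint = SUBSCRIPTIONS.fetch(subscription_id) { raise MissingSubscription }
  "delivered to #{endpoint}"
rescue MissingSubscription
  true # row already deleted: treat the job as done, do not retry
end
```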
Steps to reproduce the problem
Expected behaviour
Notification jobs should work, or at least not die unnecessarily after many retries.
Actual behaviour
Jobs fail with 429 responses at first, but eventually fail with connect timeouts.
Detailed description
ntfy.sh has a rate limit on how many requests it will accept per second. It tries to back servers off with 429 responses, so that they throttle themselves and retry later.
However, Mastodon's queuing system does not throttle itself in this situation. Instead, it keeps trying while the server keeps returning 429, until the server eventually bans it via fail2ban.
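One fix direction is to honour the `Retry-After` header on a 429 before re-enqueueing the job. A minimal sketch of that logic (the function name, fallback, and 300-second cap are assumptions, not Mastodon code):

```ruby
# Hedged sketch: how long to wait before retrying a push delivery.
# Honours Retry-After on 429; otherwise falls back to capped exponential backoff.
def retry_delay(status, retry_after: nil, attempt: 1)
  return 0 unless status == 429        # only 429 triggers a delay here
  secs = retry_after.to_i              # header value, if the server sent one
  secs.positive? ? secs : [2**attempt, 300].min
end
```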
Specifications
Mastodon v4.1.4