
Mastodon does not respect ntfy.sh's 429 responses and gets temporarily ip-blacklisted #26078

Open
ShadowJonathan opened this issue Jul 19, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@ShadowJonathan
Contributor

Steps to reproduce the problem

  1. Have an active mastodon server
  2. Use a few queue workers on one IP address
  3. Send a lot of notification push requests to ntfy (more than 60 requests per 5 seconds)

Expected behaviour

Notification jobs should succeed, or at least not fail unnecessarily after many retries.

Actual behaviour

Jobs fail with 429 responses at first, but eventually fail with connect timeouts.

Detailed description

ntfy.sh has a rate limit on how many requests it will accept per second. It asks clients to back off by returning HTTP 429 responses, expecting them to throttle and retry later.

However, Mastodon's queuing system does not throttle itself in this situation; instead, it keeps retrying while the server keeps returning 429, until the Mastodon server's IP is eventually banned via fail2ban.
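
For illustration, a minimal sketch of what honouring the 429 could look like with Sidekiq's retry hook (not Mastodon's actual worker; RespectfulPushWorker and RateLimitedError are made-up names):

  # Sketch only: raise a dedicated error on HTTP 429 and let Sidekiq's
  # retry hook honour the Retry-After header instead of hammering the server.
  require "sidekiq"
  require "net/http"

  class RespectfulPushWorker
    include Sidekiq::Worker

    class RateLimitedError < StandardError
      attr_reader :retry_after

      def initialize(retry_after)
        @retry_after = retry_after
        super("rate limited, retry in #{retry_after}s")
      end
    end

    # Sidekiq asks this block how long to wait before the next attempt;
    # returning nil falls back to the default exponential backoff.
    sidekiq_retry_in do |_count, exception|
      exception.retry_after if exception.is_a?(RateLimitedError)
    end

    def perform(endpoint, payload)
      response = Net::HTTP.post(URI(endpoint), payload, "Content-Type" => "application/octet-stream")
      return unless response.code == "429"

      raise RateLimitedError.new(Integer(response["Retry-After"] || 30))
    end
  end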

Specifications

Mastodon v4.1.4

@ShadowJonathan added the bug label on Jul 19, 2023
@ShadowJonathan
Contributor Author

I can recommend looking into something like sidekiq-rate-limiter. Mastodon could then do some form of dynamic throttling based on, say, the number of errors per hour, so that it eventually settles into a rhythm where it still hits the rate limit occasionally but keeps working; a rough sketch of the idea follows below.

This would also help with overburdened self-hosted notifiers, which would be backed off in the same way when they start returning errors.
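
As a rough illustration only (not a concrete patch; the names and the Redis key scheme are made up, and the 60-requests-per-5-seconds budget comes from the report above), a per-host fixed-window budget could look something like this:

  # Sketch: a fixed-window request budget per push host. A real
  # implementation would reuse Mastodon's existing Redis pool and derive
  # the host from the subscription's endpoint URL.
  require "sidekiq"
  require "redis"

  class ThrottledPushWorker
    include Sidekiq::Worker

    WINDOW = 5    # seconds
    BUDGET = 60   # requests allowed per host in each window
    REDIS  = Redis.new

    def perform(subscription_id, notification_id)
      host = "ntfy.sh" # placeholder; would come from the subscription endpoint
      key  = "push-budget:#{host}:#{Time.now.to_i / WINDOW}"

      used = REDIS.incr(key)
      REDIS.expire(key, WINDOW * 2)

      if used > BUDGET
        # Budget exhausted: come back in the next window instead of hammering
        # the endpoint until fail2ban bans us.
        self.class.perform_in(WINDOW, subscription_id, notification_id)
        return
      end

      # ... deliver the web push payload here ...
    end
  end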

@niklaskorz

Experiencing the same issue on rheinneckar.social, where the ntfy.sh timeouts are blocking the whole push queue and thus delivery of posts to other instances.

@Tiwy57
Contributor

Tiwy57 commented Nov 28, 2023

Last night, meow.social was also affected and a large proportion of our system resources were impacted, with a delay in push processing of over an hour that kept growing. This is a problem because it effectively causes an internal DoS.

As a workaround, we have redirected the traffic that is usually destined for ntfy.sh to 127.0.0.1 at DNS level, on the server that manages push. We're eagerly awaiting a fix in the mastodon main release.

@niklaskorz

> Last night, meow.social was also affected and a large proportion of our system resources were impacted, with a delay in push processing of over an hour that kept growing. This is a problem because it effectively causes an internal DoS.
>
> As a workaround, we have redirected the traffic that is usually destined for ntfy.sh to 127.0.0.1 at DNS level, on the server that manages push. We're eagerly awaiting a fix in the mastodon main release.

I'm honestly surprised that mastodon.social hasn't run into this issue itself yet. If it did, this issue would probably get a lot more attention.

@flancian

Social.coop (~2k users) ran into this issue a few weeks back. We had to manually remove registrations on ntfy.sh to get out of a 'queue is stuck' situation. Could this be prioritized? Can the community help? Thank you!

@abochmann

We ran into this just today on our instance.

How did people deal with this? Install their own ntfy.sh instance?

We blocked outbound traffic to their service, and it seems they drop their block after some time - but I assume we'll run into the unhandled rate limit again as soon as I let the traffic pass from our side...

@inga-lovinde

inga-lovinde commented Jun 4, 2024

Ran into this today as well, even though there are literally only two users on my instance at the moment.

And, as previous commenters noted, this not only breaks push notifications for apps, it also breaks federation entirely; no posts from my instance make it to other instances (at least none made it in the last 6 hours), because the queue is filled with 30 retried attempts per minute to connect to ntfy (just for two ntfy tokens).

Also, there seems to be no clear or easy workaround for a Docker setup, because updating the /etc/hosts file on the host does not prevent the Sidekiq container from trying to connect to the actual ntfy.sh.

@inga-lovinde

inga-lovinde commented Jun 4, 2024

Right now I'm not sure how non-professional admins of small instances can recover from this and restore federation.

I've added

  extra_hosts:
    - ntfy.sh:2001:4860:4860::8888

to the sidekiq config in docker-compose.yml (an arbitrary valid IPv6 address that will be unreachable from inside the sidekiq container, because the container has no IPv6 connectivity).
So at least now all these retries don't connect to the actual ntfy.

But still, I have 23k entries in my push queue, almost all of them Web::PushNotificationWorker, but with enough ActivityPub::DeliveryWorker jobs that I don't want to just clear the entire queue (because then, as I understand it, all the posts made today would never federate).

I also tried deleting these push tasks manually from the "Retries" page (where I have to pay close attention not to accidentally delete an ActivityPub::DeliveryWorker). I have deleted what feels like a couple of thousand, and they just keep coming; at least the size of the push queue decreased by a couple of thousand (not sure if that is related). When I do not delete them manually, the size of the push queue does not decrease.
It's as if the only way to recover is to delete all 20k tasks manually, one by one, while taking care not to delete anything except Web::PushNotificationWorker. Or maybe there is some way to automate it?
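
(Edit: for reference, this can be scripted from the Rails console with Sidekiq's API instead of clicking through the web UI; a sketch only, assuming the only jobs to drop are Web::PushNotificationWorker, so double-check before running anything like it.)

  # Run from `bin/rails console` (for Docker, inside the web or sidekiq container).
  require "sidekiq/api"

  klass = "Web::PushNotificationWorker"

  # Jobs waiting to be retried.
  Sidekiq::RetrySet.new.each { |job| job.delete if job.klass == klass }

  # Jobs scheduled for a later attempt.
  Sidekiq::ScheduledSet.new.each { |job| job.delete if job.klass == klass }

  # Jobs still sitting in the push queue; ActivityPub::DeliveryWorker and
  # everything else is left untouched.
  Sidekiq::Queue.new("push").each { |job| job.delete if job.klass == klass }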

@ShadowJonathan
Contributor Author

TIP: you can parallelise the push queue massively, as most of its work is just HTTP connect-and-wait. Where we run a concurrency of 8 per container for most CPU-intensive queues (like ingress and default), we run 32 for our push queues. So I suggest spinning up some extra workers just for clearing the push queue and churning through the retries; they should eventually clear.

@inga-lovinde

@ShadowJonathan I'm not sure how to do this with a Docker-based setup?
And the default seems to be 5 threads.

@ShadowJonathan
Contributor Author

If you're using the default docker-compose file, I can recommend adding a new container with (something like) the following:

  sidekiq_push:
    # adjust these two as required
    build: .
    image: ghcr.io/mastodon/mastodon:v4.2.9
    
    restart: always
    env_file: .env.production
    command: bundle exec sidekiq -q push -t 600 -c 32
    depends_on:
      - db
      - redis
    networks:
      - external_network
      - internal_network
    volumes:
      - ./public/system:/mastodon/public/system
    healthcheck:
      test: ['CMD-SHELL', "ps aux | grep '[s]idekiq\ 6' || false"]

@inga-lovinde

@ShadowJonathan thank you! In my case the problem was caused not by too many notifications but by too many (duplicate) subscriptions, so I updated the old subscriptions in the database to use an "invalid://" endpoint to prevent this from happening again (at least until new duplicate subscriptions get created). Even with default settings, the push queue is now gradually getting smaller (by about 200 tasks per minute).

But your advice should help other admins :)
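
(Edit: for anyone in the same spot, a rough sketch of that kind of clean-up from the Rails console; the LIKE pattern is an assumption, so inspect the matches before updating or deleting anything.)

  # Sketch, run from `bin/rails console`.
  stale = Web::PushSubscription.where("endpoint LIKE ?", "https://ntfy.sh/%")
  stale.count                               # sanity-check how many rows match
  stale.update_all(endpoint: "invalid://")  # or stale.destroy_all to drop them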


@ShadowJonathan
Contributor Author

ShadowJonathan commented Jun 4, 2024

Oh, I just remembered: deleting the subscriptions entirely should short-circuit the jobs and let them complete immediately, basically churning through them as quickly as possible without putting them back on the retry queue. :)

  rescue ActiveRecord::RecordNotFound
    true
  end
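
For context, that rescue is the tail of the push worker's perform method; abridged and from memory, it does roughly this:

  # Simplified reconstruction, not the verbatim source:
  def perform(subscription_id, notification_id)
    @subscription = Web::PushSubscription.find(subscription_id)
    @notification = Notification.find(notification_id)
    # ... build and send the web push request ...
  rescue ActiveRecord::RecordNotFound
    true # the subscription (or notification) is gone: finish without retrying
  end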
