
Mastodon does not respect ntfy.sh's 429 responses and gets temporarily ip-blacklisted #26078

Open
ShadowJonathan opened this issue Jul 19, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@ShadowJonathan
Contributor

Steps to reproduce the problem

  1. Have an active mastodon server
  2. Use a few queue workers on one IP address
  3. Send a lot of notification push requests to ntfy (more than 60 requests per 5 seconds)

Expected behaviour

Notification jobs should succeed, or at least not fail unnecessarily after many retries.

Actual behaviour

Jobs fail with 429 responses at first, but eventually fail with connect timeouts.

Detailed description

ntfy.sh has a rate limit on how many requests it will accept per second. It asks clients to back off by returning HTTP 429 responses, expecting them to throttle and retry later.

However, Mastodon's queuing system does not throttle itself in this situation; instead, it keeps retrying while the server keeps returning 429, until the Mastodon server's IP is eventually banned via fail2ban.
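
For illustration, a minimal sketch of what honouring the 429 could look like with Sidekiq's retry hook (not Mastodon's actual worker; RespectfulPushWorker and RateLimitedError are made-up names):

  # Sketch only: raise a dedicated error on HTTP 429 and let Sidekiq's
  # retry hook honour the Retry-After header instead of hammering the server.
  require "sidekiq"
  require "net/http"

  class RespectfulPushWorker
    include Sidekiq::Worker

    class RateLimitedError < StandardError
      attr_reader :retry_after

      def initialize(retry_after)
        @retry_after = retry_after
        super("rate limited, retry in #{retry_after}s")
      end
    end

    # Sidekiq asks this block how long to wait before the next attempt;
    # returning nil falls back to the default exponential backoff.
    sidekiq_retry_in do |_count, exception|
      exception.retry_after if exception.is_a?(RateLimitedError)
    end

    def perform(endpoint, payload)
      response = Net::HTTP.post(URI(endpoint), payload, "Content-Type" => "application/octet-stream")
      return unless response.code == "429"

      raise RateLimitedError.new(Integer(response["Retry-After"] || 30))
    end
  end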

Specifications

Mastodon v4.1.4

@ShadowJonathan added the bug label on Jul 19, 2023
@ShadowJonathan
Contributor Author

I can recommend looking into something like sidekiq-rate-limiter. Mastodon could then do some form of dynamic throttling based on, say, the number of errors per hour, so that it eventually settles into a rhythm where it still hits the rate limit occasionally but keeps working; a rough sketch of the idea follows below.

This would also help with overburdened self-hosted notifiers, which would be backed off in the same way when they start returning errors.
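
As a rough illustration only (not a concrete patch; the names and the Redis key scheme are made up, and the 60-requests-per-5-seconds budget comes from the report above), a per-host fixed-window budget could look something like this:

  # Sketch: a fixed-window request budget per push host. A real
  # implementation would reuse Mastodon's existing Redis pool and derive
  # the host from the subscription's endpoint URL.
  require "sidekiq"
  require "redis"

  class ThrottledPushWorker
    include Sidekiq::Worker

    WINDOW = 5    # seconds
    BUDGET = 60   # requests allowed per host in each window
    REDIS  = Redis.new

    def perform(subscription_id, notification_id)
      host = "ntfy.sh" # placeholder; would come from the subscription endpoint
      key  = "push-budget:#{host}:#{Time.now.to_i / WINDOW}"

      used = REDIS.incr(key)
      REDIS.expire(key, WINDOW * 2)

      if used > BUDGET
        # Budget exhausted: come back in the next window instead of hammering
        # the endpoint until fail2ban bans us.
        self.class.perform_in(WINDOW, subscription_id, notification_id)
        return
      end

      # ... deliver the web push payload here ...
    end
  end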

@niklaskorz

Experiencing the same issue on rheinneckar.social, where the ntfy.sh timeouts are blocking the whole push queue and thus delivery of posts to other instances.

@Tiwy57
Contributor

Tiwy57 commented Nov 28, 2023

Last night, meow.social was also affected and a large proportion of our system resources were impacted, with a delay in push processing of over an hour that kept growing. This is a problem because it effectively causes an internal DoS.

As a workaround, we have redirected the traffic that is usually destined for ntfy.sh to 127.0.0.1 at DNS level, on the server that manages push. We're eagerly awaiting a fix in the mastodon main release.

@niklaskorz

> Last night, meow.social was also affected and a large proportion of our system resources were impacted, with a delay in push processing of over an hour that kept growing. This is a problem because it effectively causes an internal DoS.
>
> As a workaround, we have redirected the traffic that is usually destined for ntfy.sh to 127.0.0.1 at DNS level, on the server that manages push. We're eagerly awaiting a fix in the mastodon main release.

I'm honestly surprised that mastodon.social hasn't run into this issue itself yet. If it did, this issue would probably get a lot more attention.

@flancian

Social.coop (~2k users) ran into this issue a few weeks back. We had to manually remove registrations on ntfy.sh to get out of a 'queue is stuck' situation. Could this be prioritized? Can the community help? Thank you!

@abochmann

We ran into this just today on our instance.

How did people deal with this? Install their own ntfy.sh instance?

We blocked outbound traffic to their service, and it seems they drop their block after some time - but I assume we'll run into the unhandled rate limit again as soon as I let the traffic pass from our side...

@inga-lovinde

inga-lovinde commented Jun 4, 2024

Ran into this today as well, even though there are literally only two users on my instance at the moment.

And, as previous commenters noted, this not only breaks push notifications for apps, it also breaks federation entirely; no posts from my instance make it to other instances (at least none made it in the last 6 hours), because the queue is filled with 30 retried attempts per minute to connect to ntfy (just for two ntfy tokens).

Also, there seems to be no clear or easy workaround for a Docker setup, because updating the /etc/hosts file on the host does not prevent the Sidekiq container from trying to connect to the actual ntfy.sh.

@inga-lovinde

inga-lovinde commented Jun 4, 2024

Right now I'm not sure how non-professional admins of small instances can recover from this and restore federation.

I've added

  extra_hosts:
    - ntfy.sh:2001:4860:4860::8888

to the sidekiq config in docker-compose.yml (an arbitrary valid IPv6 address that will be unreachable from inside the sidekiq container, because the container has no IPv6 connectivity).
So at least now all these retries don't connect to the actual ntfy.

But still, I have 23k entries in my push queue, almost all of them Web::PushNotificationWorker, but with enough ActivityPub::DeliveryWorker jobs that I don't want to just clear the entire queue (because then, as I understand it, all the posts made today would never federate).

I also tried deleting these push tasks manually from the "Retries" page (where I have to pay close attention not to accidentally delete an ActivityPub::DeliveryWorker). I have deleted what feels like a couple of thousand, and they just keep coming; at least the size of the push queue decreased by a couple of thousand (not sure if that is related). When I do not delete them manually, the size of the push queue does not decrease.
It's as if the only way to recover is to delete all 20k tasks manually, one by one, while taking care not to delete anything except Web::PushNotificationWorker. Or maybe there is some way to automate it?
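
(Edit: for reference, this can be scripted from the Rails console with Sidekiq's API instead of clicking through the web UI; a sketch only, assuming the only jobs to drop are Web::PushNotificationWorker, so double-check before running anything like it.)

  # Run from `bin/rails console` (for Docker, inside the web or sidekiq container).
  require "sidekiq/api"

  klass = "Web::PushNotificationWorker"

  # Jobs waiting to be retried.
  Sidekiq::RetrySet.new.each { |job| job.delete if job.klass == klass }

  # Jobs scheduled for a later attempt.
  Sidekiq::ScheduledSet.new.each { |job| job.delete if job.klass == klass }

  # Jobs still sitting in the push queue; ActivityPub::DeliveryWorker and
  # everything else is left untouched.
  Sidekiq::Queue.new("push").each { |job| job.delete if job.klass == klass }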

@ShadowJonathan
Contributor Author

TIP: you can parallelise the push queue massively, as most of its work is just HTTP connect-and-wait. Where we run a concurrency of 8 per container for most CPU-intensive queues (like ingress and default), we run 32 for our push queues. So I suggest spinning up some extra workers just for clearing the push queue and churning through the retries; they should eventually clear.

@inga-lovinde

@ShadowJonathan I'm not sure how to do this with a Docker-based setup?
And the default seems to be 5 threads.

@ShadowJonathan
Contributor Author

If you're using the default docker-compose file, I can recommend adding a new container with (something like) the following:

  sidekiq_push:
    # adjust these two as required
    build: .
    image: ghcr.io/mastodon/mastodon:v4.2.9
    
    restart: always
    env_file: .env.production
    command: bundle exec sidekiq -q push -t 600 -c 32
    depends_on:
      - db
      - redis
    networks:
      - external_network
      - internal_network
    volumes:
      - ./public/system:/mastodon/public/system
    healthcheck:
      test: ['CMD-SHELL', "ps aux | grep '[s]idekiq\ 6' || false"]

@inga-lovinde

@ShadowJonathan thank you! In my case the problem was caused not by too many notifications but by too many (duplicate) subscriptions, so I updated the old subscriptions in the database to use an "invalid://" endpoint to prevent this from happening again (at least until new duplicate subscriptions get created). Even with default settings, the push queue is now gradually getting smaller (by about 200 tasks per minute).

But your advice should help other admins :)
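
(Edit: for anyone in the same spot, a rough sketch of that kind of clean-up from the Rails console; the LIKE pattern is an assumption, so inspect the matches before updating or deleting anything.)

  # Sketch, run from `bin/rails console`.
  stale = Web::PushSubscription.where("endpoint LIKE ?", "https://ntfy.sh/%")
  stale.count                               # sanity-check how many rows match
  stale.update_all(endpoint: "invalid://")  # or stale.destroy_all to drop them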


@ShadowJonathan
Contributor Author

ShadowJonathan commented Jun 4, 2024

Oh, I just remembered: deleting the subscriptions entirely should short-circuit the jobs and let them complete immediately, basically churning through them as quickly as possible without putting them back on the retry queue. :)

  rescue ActiveRecord::RecordNotFound
    true
  end
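
For context, that rescue is the tail of the push worker's perform method; abridged and from memory, it does roughly this:

  # Simplified reconstruction, not the verbatim source:
  def perform(subscription_id, notification_id)
    @subscription = Web::PushSubscription.find(subscription_id)
    @notification = Notification.find(notification_id)
    # ... build and send the web push request ...
  rescue ActiveRecord::RecordNotFound
    true # the subscription (or notification) is gone: finish without retrying
  end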
