Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4.1.5 Sidekiq freezes processing PushNotificationWorker #26115

Closed
hugogameiro opened this issue Jul 22, 2023 · 4 comments · Fixed by #26116
Closed

v4.1.5 Sidekiq freezes processing PushNotificationWorker #26115

hugogameiro opened this issue Jul 22, 2023 · 4 comments · Fixed by #26116
Labels
bug Something isn't working status/confirmed This bug has been confirmed

Comments

@hugogameiro
Copy link
Contributor

hugogameiro commented Jul 22, 2023

Steps to reproduce the problem

Use ntfy.sh for push notifications causes Sidekiq to freeze before failing and continuing.

Expected behaviour

fast or non blocking time out

Actual behaviour

time out taking too long

Detailed description

I have seen on multiple Mastodon installations that Sidekiq freezes while processing PushNotificationWorker.

On all cases that I have seen this happening there are timeouts on requests to ntfy.sh.

Example:

2023-07-22T09:03:40.102Z pid=7 tid=2hmv class=Web::PushNotificationWorker jid=60b27c61419b1a71ff0787f1 elapsed=129.454 INFO: fail
2023-07-22T09:03:40.102Z pid=7 tid=2hmv WARN: {"context":"Job raised exception","job":{"retry":5,"queue":"push","class":"Web::PushNotificationWorker","args":[68395,1147646],"jid":"60b27c61419b1a71ff0787f1","created_at":1690012751.5607605,"enqueued_at":1690015901.6819582,"error_message":"failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1","error_class":"HTTP::ConnectionError","failed_at":1690013410.8812373,"retry_count":2,"retried_at":1690015845.9492471}}
2023-07-22T09:03:40.102Z pid=7 tid=2hmv WARN: HTTP::ConnectionError: failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1
2023-07-22T09:03:40.102Z pid=7 tid=2hmv WARN: /opt/ruby/lib/ruby/3.0.0/socket.rb:1214:in `__connect_nonblock'
2023-07-22T09:03:40.103Z pid=7 tid=2hbr class=Web::PushNotificationWorker jid=3684ce8ee1f12b85b2116b52 elapsed=130.103 INFO: fail
2023-07-22T09:03:40.103Z pid=7 tid=2hbr WARN: {"context":"Job raised exception","job":{"retry":5,"queue":"push","class":"Web::PushNotificationWorker","args":[68825,1147652],"jid":"3684ce8ee1f12b85b2116b52","created_at":1690012800.725376,"enqueued_at":1690015753.0426877,"error_message":"failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1","error_class":"HTTP::ConnectionError","failed_at":1690013486.65336,"retry_count":2,"retried_at":1690015696.449503}}
2023-07-22T09:03:40.107Z pid=7 tid=2hbr WARN: HTTP::ConnectionError: failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1
2023-07-22T09:03:40.108Z pid=7 tid=2hbr WARN: /opt/ruby/lib/ruby/3.0.0/socket.rb:1214:in `__connect_nonblock'
2023-07-22T09:03:40.095Z pid=7 tid=2heb class=Web::PushNotificationWorker jid=26d65f50ad26f5937a1928e6 INFO: Adding dead Web::PushNotificationWorker job 26d65f50ad26f5937a1928e6
2023-07-22T09:03:40.105Z pid=7 tid=2hhn class=Web::PushNotificationWorker jid=d378fd3313d12ac62d651220 elapsed=129.618 INFO: fail
2023-07-22T09:03:40.109Z pid=7 tid=2hhn WARN: {"context":"Job raised exception","job":{"retry":5,"queue":"push","class":"Web::PushNotificationWorker","args":[68035,1147646],"jid":"d378fd3313d12ac62d651220","created_at":1690012751.5607605,"enqueued_at":1690015864.5601733,"error_message":"failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1","error_class":"HTTP::ConnectionError","failed_at":1690013347.3894298,"retry_count":2,"retried_at":1690015811.1333215}}
2023-07-22T09:03:40.109Z pid=7 tid=2heb class=Web::PushNotificationWorker jid=26d65f50ad26f5937a1928e6 elapsed=129.636 INFO: fail
2023-07-22T09:03:40.109Z pid=7 tid=2heb WARN: {"context":"Job raised exception","job":{"retry":5,"queue":"push","class":"Web::PushNotificationWorker","args":[56687,1147628],"jid":"26d65f50ad26f5937a1928e6","created_at":1690011829.8810835,"enqueued_at":1690015847.922677,"error_message":"failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_DIFFERENT_HASH_EXAMPLE_?up=1","error_class":"HTTP::ConnectionError","failed_at":1690012098.1093197,"retry_count":4,"retried_at":1690015548.9941778}}
2023-07-22T09:03:40.109Z pid=7 tid=2heb WARN: HTTP::ConnectionError: failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_DIFFERENT_HASH_EXAMPLE_?up=1
2023-07-22T09:03:40.109Z pid=7 tid=2heb WARN: /opt/ruby/lib/ruby/3.0.0/socket.rb:1214:in `__connect_nonblock'
2023-07-22T09:03:40.105Z pid=7 tid=2hhb class=Web::PushNotificationWorker jid=ce33d8556b563cacfd24e8f2 elapsed=129.628 INFO: fail
2023-07-22T09:03:40.110Z pid=7 tid=2hhb WARN: {"context":"Job raised exception","job":{"retry":5,"queue":"push","class":"Web::PushNotificationWorker","args":[68825,1147632],"jid":"ce33d8556b563cacfd24e8f2","created_at":1690011841.54834,"enqueued_at":1690015864.5594342,"error_message":"failed to connect: Connection timed out - connect(2) for 159.203.148.75:443 on https://ntfy.sh/_ONE_HASH_EXAMPLE_?up=1","error_class":"HTTP::ConnectionError","failed_at":1690012104.2540042,"retry_count":4,"retried_at":1690015583.805272}}

Not sure if the freezes really block all other processing or if there are simply multiple requests waiting for the timeout but reverting #26055 solves the problem.

Mastodon version

v4.1.5

@hugogameiro hugogameiro added bug Something isn't working status/to triage This issue needs to be triaged labels Jul 22, 2023
@ClearlyClaire ClearlyClaire added status/confirmed This bug has been confirmed and removed status/to triage This issue needs to be triaged labels Jul 22, 2023
@ClearlyClaire
Copy link
Contributor

I have confirmed and identified the issue, sorry about that.

@wiegelmann
Copy link

In my case around 15% of customer instances are affected. You can check the queue:

Web::PushSubscription.where('endpoint LIKE ?', 'https://ntfy.sh/%').size
Sidekiq::DeadSet.new.select {_1.display_class == "Web::PushNotificationWorker"}.size

Cleanup workaround until the next release:

Web::PushSubscription.where('endpoint LIKE ?', 'https://ntfy.sh/%').in_batches.delete_all
Sidekiq::DeadSet.new.select {_1.display_class == "Web::PushNotificationWorker"}.map(&:delete)

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Jul 22, 2023

Shining the problem from another side, i'd like to highlight this: #26078

Ntfy bans mastodon instances on the IP-level (no ACK on SYN) when it detects a huge volume of requests that dont respect 429 throttling, it might be worth looking into that as well.

Joshix-1 pushed a commit to Joshix-1/mastodon that referenced this issue Jul 22, 2023
@jcrabapple
Copy link

How do I install this patch on my instance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working status/confirmed This bug has been confirmed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants