-
Notifications
You must be signed in to change notification settings - Fork 55
2025-01-20 Outage Report #609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ThisIsMissEm
wants to merge
6
commits into
main
Choose a base branch
from
feat/2025-01-20-outage-report
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
90977c3
Fetch hugo version from .tool-versions
ThisIsMissEm 1b97b5c
Add missing actions/setup-node for publish workflow
ThisIsMissEm fede30b
Always run the build step of the publish workflow, but only deploy on…
ThisIsMissEm 0f3b1fe
Prevent branches from clobbering each other
ThisIsMissEm 1940384
Add blog incident report for week starting 2025-01-20
ThisIsMissEm 7a5a75a
Add blog incident report for week starting 2025-01-20
ThisIsMissEm File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| hugo extended_0.139.4 | ||
| node 22 |
Binary file added
BIN
+147 KB
...hyderm.io/content/en/blog/incidents/2025-01-24-elevated-500-errors/image-01.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+131 KB
...hyderm.io/content/en/blog/incidents/2025-01-24-elevated-500-errors/image-02.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
161 changes: 161 additions & 0 deletions
161
....hachyderm.io/content/en/blog/incidents/2025-01-24-elevated-500-errors/index.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,161 @@ | ||||||||||
| --- | ||||||||||
| date: 2025-01-24 | ||||||||||
| title: "Global Outage: Elevated 500 error levels" | ||||||||||
| linkTitle: "Global Outage: Elevated 500 error levels" | ||||||||||
| description: "Hachyderm experienced elevated 500 error levels in mid-January 2025. Here is the postmortem, analysis, and writeup on the incident." | ||||||||||
| author: Emelia Smith ([@thisismissem](https://hachyderm.io/@thisismissem)) | ||||||||||
| resources: | ||||||||||
| - src: "**.{png,jpg}" | ||||||||||
| --- | ||||||||||
|
|
||||||||||
| On Monday, January 20th, 2025 at roughly 17:00 UTC Hachyderm started to experience elevated 500 error levels during routine system maintenance. | ||||||||||
|
|
||||||||||
| On Tuesday, January 21st, 2025 at roughly 17:20 UTC Hachyderm started to again experience elevated 500 error levels. Hachyderm appeared to still be up, but there were reports of outages. | ||||||||||
|
|
||||||||||
| Impacted users experienced 500 internal server error responses from `https://hachyderm.io` in all regions of the world. | ||||||||||
|
|
||||||||||
| ## Context | ||||||||||
|
|
||||||||||
| As part of routine updates, we went through each of our machines and ran regular system updates. | ||||||||||
|
|
||||||||||
| Two of the updated packages included `libicu` and `libvips`. The affected machines involved were the `mastodon-web` machines, known as `fritz` and `buttons`, and `mastodon-sidekiq` machines, known as `franz` and `freud`. `nietzsche` is our database server, which we are in the process of replacing. | ||||||||||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
|
|
||||||||||
| ## Timeline | ||||||||||
|
|
||||||||||
| All events are documented in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) time. | ||||||||||
|
|
||||||||||
| ### 2025-01-20 | ||||||||||
| - **15:38** `@esk` Started doing routine system updates | ||||||||||
| - **16:35** `@esk` Ran system updates on `buttons` | ||||||||||
| - **16:43** `@esk` Rebooted `buttons`, and questioned if we'd have issues due to the `libicu` update, since this is compiled into ruby dependencies for Mastodon | ||||||||||
| - **16:44** `@esk` Noted that `buttons` seemed okay | ||||||||||
| - **16:45** `@esk` Started system updates on `fritz` | ||||||||||
| - **16:57** `@esk` Rebooted `fritz` | ||||||||||
| - **16:59** `@esk` Noted that whilst `fritz` was down for maintenance, requests did not correctly failover to `buttons` | ||||||||||
| - **17:00** `@ThisIsMissEm` Responded saying that failover should have happened based on our CDN configuration, but for some reason the CDN didn't see `fritz` as being down | ||||||||||
| - **17:18** `@esk` Noted that something didn't look good whilst attempting to run system updates on `franz`. Logs were showing a database connection error: | ||||||||||
| ``` | ||||||||||
| [novix@franz]: ~$ sudo journalctl -xeu mastodon-sidekiq-ingress@1 | ||||||||||
| Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw class=ActivityPub::ProcessingWorker jid=b29a74a7be10bcbcdba588be elapsed=30.137 INFO: fail | ||||||||||
| Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: {"context":"Job raised exception","job":{"retry":8,"queue":"ingress","backtrace":true,"args":[1125751588830572xx,"{\"@context\":[\"https://www.w3 Jan 20 17:18:41 franz bundle [600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche. Jan 20 17:18:41 franz bundle [600]: Please check your database configuration and ensure there is a valid connection to your database. | ||||||||||
| Jan 20 17:18:41 franz bundle [600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:7 Jan 20 17:18:41 franz bundle [600]: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:68:in | ||||||||||
| `new_client' | ||||||||||
| Jan 20 17:18:41 franz bundle[600]: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:980:in `connect' Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.919Z pid=600 tid=650 INFO: Shutting down | ||||||||||
| Jan 20 17:18:41 franz systemd[1]: Stopping mastodon-sidekiq... | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| The specific part to note is: | ||||||||||
| ``` | ||||||||||
| WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche. | ||||||||||
| Jan 20 17:18:41 franz bundle [600]: Please check your database configuration and ensure there is a valid connection to your database. | ||||||||||
| ``` | ||||||||||
| - **17:19** An incident call was started between `@esk` and `@ThisIsMissEm` | ||||||||||
| - **17:20** `@esk` Tested the connectivity to nietzsche, our database server, and connectivity was fine. | ||||||||||
| - **17:20** `@esk` Noted a suspicion it might be because of ruby/icu update. | ||||||||||
| - **17:20** `@esk` Performed a reinstall of ruby on the affected machine (`franz`) | ||||||||||
| - **17:25** After reinstall `franz` appeared to start up sidekiq fine. | ||||||||||
| - **17:30** Some investigation happened into a warning that was being generated: | ||||||||||
| ``` | ||||||||||
| (process:24889): VIPS-WARNING **: 17:30:23.963: unable to load "/usr/lib/vips-modules-8.16/vips-poppler.so" -- libpoppler-glib.so.8: cannot open shared object file: No such file or directory | ||||||||||
| (process:24889): VIPS-WARNING **: 17:30:23.963: unable to load "/usr/lib/vips-modules-8.16/vips-openslide.so" -- libopenslide.so.1: cannot open shared object file: No such file or directory | ||||||||||
| ``` | ||||||||||
| - **17:35** `@esk` Installed the missing packages for `libpoppler` and `libopenslide` resolving the warnings. | ||||||||||
| - **17:43** `@ThisIsMissEm` responded that these errors had actually always been present, as we didn't use poppler or openslide with libvips, so hadn't installed them. | ||||||||||
| - **17:59** Discussion between `@ThisIsMissEm` and `@esk` about the failover not happening correctly. | ||||||||||
| - Something was making the CDN nodes not detect that the `mastodon-web` nodes were down | ||||||||||
| - Was identified to be because of caching configuration existing on `fritz` and `buttons` nginx, which had existed when `fritz` was still exposed directly to the internet (we'd moved everything behind CDN back in November/December as part of our [Global Resiliency Plan]({{< ref "../posts/2024-11-infrastructure-updates" >}})) | ||||||||||
| - This caching on `fritz` and `buttons` meant that even though they were down, some percentage of requests were still responding from the cache as 200 responses, so CDN nodes didn't detect they were down. | ||||||||||
| - **18:00** Initial incident call ended | ||||||||||
| - **18:30** First confirmation, and first report of 500 responses globally | ||||||||||
| - **18:40** `@esk` Decided to reinstall ruby on all mastodon-web machines. | ||||||||||
| - **19:00** `@esk` Noted that `buttons` wasn't starting up correctly after reinstalling ruby, reporting the same database connection error as the `franz` earlier. | ||||||||||
| - **19:00** New incident response call started | ||||||||||
| - **19:40** After reinstalling ruby on `buttons` and `fritz`, the logs looked fine, and the alert stopped firing. | ||||||||||
|
|
||||||||||
| ### 2025-01-21: | ||||||||||
| - **15:15** Alerts started firing for elevated 500 error levels, `@ThisIsMissEm` was the only person about at the time, but hadn't been receiving notifications for the alerts. | ||||||||||
| - **17:20** Community member reported seeing 500 errors when attempting to use Hachyderm, which `@ThisIsMissEm` saw. | ||||||||||
| - **17:21** `@ThisIsMissEm` confirmed outage via grafana: | ||||||||||
|  | ||||||||||
| <br /><br /> | ||||||||||
| - **17:23** `@ThisIsMissEm` noticed that there were database connection errors on `buttons` too: | ||||||||||
| ``` | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: E, [2025-01-21T15:51:55.658538 #807963] ERROR -- : [65517bcd-64a2-4cd8-922e-440466a8e499] | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] ActiveRecord::DatabaseConnectionError (There is an issue connecting with your hostname: nietzsche. | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: Please check your database configuration and ensure there is a valid connection to your database. | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: ): | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/models/domain_block.rb:73:in `rule_for' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/models/domain_block.rb:49:in `suspend?' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/helpers/domain_control_helper.rb:16:in `domain_not_allowed?' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:223:in `actor_from_key_id' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:56:in `signed_request_actor' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:20:in `require_actor_signature!' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] lib/mastodon/rack_middleware.rb:9:in `call' | ||||||||||
| Jan 21 15:51:55 buttons bundle[807848]: I, [2025-01-21T15:51:55.675758 #807848] INFO -- : [a4191d7b-c2c6-4233-9ef0-e97b43886729] method=GET path=/api/v2/notifications/policy format=*/* | ||||||||||
| ``` | ||||||||||
| - **17:23** `@ThisIsMissEm` started investigating if we actually did have a database outage, since it couldn't be related to the updates performed the day before as we'd "fixed" that issue. | ||||||||||
| - **17:25** `@ThisIsMissEm` reached out to other Mastodon admins to see if they had encountered this error, since the database appeared to be reachable. That led to the suggestion to check the error logs on the database servers just in case. | ||||||||||
| - **17:45** `@ThisIsMissEm` discovered that `pgbouncer` on `nietzsche` was actually logging errors, but no alerts were firing. | ||||||||||
| ``` | ||||||||||
| Jan 21 15:51:56 host pgbouncer[8342]: 2025-01-21 15:51:56.202 UTC [8342] ERROR accept() failed: Too many open files | ||||||||||
| ``` | ||||||||||
| This was the cause of the error from the previous day: | ||||||||||
| ``` | ||||||||||
| WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche. | ||||||||||
| Jan 20 17:18:41 franz bundle [600]: Please check your database configuration and ensure there is a valid connection to your database. | ||||||||||
| ``` | ||||||||||
| The database was indeed actually down and unreachable, because mastodon-web and mastodon-sidekiq couldn't acquire a connection via pgbouncer. | ||||||||||
| - **17:45-18:45** Team discussed the appropriate solution to increase available file handles to pgbouncer, this wasn't something `@ThisIsMissEm` knew how to resolve on her own, and documentation for how to increase these limits in systemd was confusing. | ||||||||||
|
|
||||||||||
| In the systemd file, you need to do: | ||||||||||
| ``` | ||||||||||
| LimitNOFILE=3000:524288 | ||||||||||
| ``` | ||||||||||
| which then gets printed as: | ||||||||||
| ``` | ||||||||||
| LimitNOFILE=524288 | ||||||||||
| LimitNOFILESoft=3000 | ||||||||||
| ``` | ||||||||||
| - **17:50** Outage confirmed to be resolved | ||||||||||
|
|
||||||||||
| ## Root Cause | ||||||||||
|
|
||||||||||
| An error that first appeared during system updates was misinterpreted to be caused by the system updates, rather than a separate system actually being in a failure mode. The database load balancer, pgbouncer had been configured for 2000 connections, but only 1024 max file handles. Hachyderm's recent infrastructure changes had resulted in the maximum potential database connections raising to 1428. | ||||||||||
|
|
||||||||||
| The 500 errors were actually database errors, and not related to the system updates. It just so happened that when we were performing the upgrades, we also experienced elevated traffic levels, which pushed us over the 1024 connections that pgbouncer could support. | ||||||||||
|
|
||||||||||
|  | ||||||||||
|
|
||||||||||
| ### Things that went well | ||||||||||
|
|
||||||||||
| TODO | ||||||||||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
|
|
||||||||||
| ### Things that went poorly | ||||||||||
|
|
||||||||||
| Multiple different issues were happening simultaneously: | ||||||||||
| - Increased HTTP requests resulting in database connection exhaustion | ||||||||||
| - System updates to packages known to have caused issues in the past | ||||||||||
| - CDN failover not happening correctly | ||||||||||
|
|
||||||||||
| There was a lack of alerts on pgbouncer errors which resulted in the database issue not being discovered in a timely fashion. | ||||||||||
|
|
||||||||||
| We didn't check the grafana dashboards for 500 errors, which would have indicated that the outage actually started slightly before the updates were performed. | ||||||||||
|
|
||||||||||
| ### Opportunities | ||||||||||
|
|
||||||||||
| We need to improve our monitoring and alerting on critical systems. | ||||||||||
|
|
||||||||||
| We need to improve our nginx configuration to ensure failover works correctly | ||||||||||
|
|
||||||||||
| We need to review our pgbouncer and postgresql configuration | ||||||||||
|
|
||||||||||
| ## Resulting Action | ||||||||||
|
|
||||||||||
| #### 1) Alerts added for pgbouncer errors by `@ThisIsMissEm` | ||||||||||
|
|
||||||||||
| Alerts for error loglines on pgbouncer were added. | ||||||||||
|
|
||||||||||
| #### 2) Spreadsheet created to correctly calculate database connections required `@ThisIsMissEm` | ||||||||||
|
|
||||||||||
| This will be published after review from the Hachyderm team. We'd previously manually run the mathematics set out in `@hazelweakly`'s fantastic [Scaling Mastodon](https://hazelweakly.me/blog/scaling-mastodon/#db_pool-notes-from-nora's-blog) blog post. | ||||||||||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.