
Ceph OSD crashes cause IO stall at HKG #155

Closed
7 of 8 tasks
darkk opened this issue Sep 1, 2017 · 1 comment

darkk commented Sep 1, 2017

Impact: some in-memory data at jupyter was lost; some soft deadlines may have been missed

Detection: @hellais was unable to access https://jupyter.ooni.io/

Timeline UTC:
01 Sep 03:25: LA1 (1-minute load average) starts growing without corresponding CPU load, which likely means an IO stall; node_disk_io_time_ms and node_disk_io_now confirm that, but most nodes do not export these metrics
01 Sep 04:16: some nodes are not up anymore
01 Sep 09:46: <@hellais🐙> @darkk not sure if you experienced this with HKG machines too, but I had to reboot the jupyter server because I suspect it was locking on IO in a way that made it unusable.
01 Sep 09:53: mail to support@
01 Sep 10:35: LA1 starts going down
01 Sep 12:45: jupyter boots, collector="time" is OK
01 Sep 14:15: incident published

What went well:

  • it seems node_exporter was not significantly affected by disk issues, despite calls to statfs() and the enabled collector.textfile.directory (the list of files to check is built during startup), and was able to report the bad LA1

What went wrong:

  • there were no reachability alerts although some nodes had obvious issues like up being zero (a rule sketch for such an alert follows this list):
    • measurements-beta.ooni.io
    • test-lists.openobservatory.org
    • hkgmetadb.infra.ooni.io
    • *.test.ooni.io
  • most nodes did not export disk IO stats
  • hkgsuperset.ooni.io is lost
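
As a follow-up on the missing reachability alerts, here is a minimal Prometheus alerting-rule sketch for the `up == 0` case. This is only an illustration, assuming the Prometheus 2.x YAML rule format: the group name, the 5m duration and the labels are placeholders, not the rule that was actually deployed.

```yaml
# Sketch only: names, duration and severity label are illustrative placeholders,
# not taken from the real OONI Prometheus configuration.
groups:
  - name: reachability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```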

What is still unclear:

  • node_scrape_collector_success turned 0 for diskstats and filesystem, which makes sense. But why did the time and loadavg collectors also report 0? It looks like an undesirable side effect of up=0, as avg_over_time is 1 (see the query sketch after this list).
  • what was the reason for nodes to go down? Was it nginx, node_exporter itself or something else?
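
To pin down the first open question, the following PromQL queries separate "the whole scrape failed" from "a single collector failed inside an otherwise successful scrape". This is a sketch: the job selector, collector names and the 1h window are placeholders for illustration.

```promql
# Whole scrape failed: node_exporter produced no collector metrics at all.
up{job="node"} == 0

# A single collector failed while the scrape itself succeeded.
node_scrape_collector_success{collector="time"} == 0 and on(instance) up == 1

# Fraction of scrapes where the collector reported success over the window.
avg_over_time(node_scrape_collector_success{collector="loadavg"}[1h])
```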

What could be done to prevent relapse and decrease impact:

  • bring hkgsuperset.ooni.io back
  • clean up kernel versions (kernel version without moddep at ssdams #122)
  • verify that IO stats are exported
  • LA alerts: 10*NCPU is a safe alert threshold :) (a rule sketch for this and the disk IO alerts follows this list)
  • disk IO alerts: 100% utilisation for a long time, non-zero queue for a long time
  • basic active checks for all nodes: ping, ssh
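
A sketch of what the proposed LA and disk IO alerts could look like as Prometheus rules. Thresholds, durations and selectors are assumptions for illustration, using the pre-0.16 node_exporter metric names (node_cpu, node_disk_io_time_ms, node_disk_io_now) that match the metrics mentioned in the timeline.

```yaml
# Sketch only: thresholds and durations are placeholders, not a deployed config.
groups:
  - name: io-stall
    rules:
      - alert: HighLoadAverage
        # "10*NCPU" threshold from the item above; NCPU is derived by counting
        # the per-CPU idle time series exported by node_exporter.
        expr: node_load1 > 10 * count without (cpu, mode) (node_cpu{mode="idle"})
        for: 15m
      - alert: DiskFullyUtilised
        # node_disk_io_time_ms is a counter of milliseconds spent doing IO, so a
        # rate near 1000 ms/s means ~100% utilisation; 900 is an example cut-off.
        expr: rate(node_disk_io_time_ms[5m]) > 900
        for: 30m
      - alert: DiskQueueNotEmpty
        # node_disk_io_now is a gauge of IOs currently in flight; a queue that
        # never drains over 30 minutes points at a stall.
        expr: min_over_time(node_disk_io_now[30m]) > 0
```

The ping/ssh item falls outside Prometheus' pull model; blackbox_exporter probes (ICMP, plus TCP to port 22) are the usual way to cover it.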
@darkk darkk added the incident label Sep 1, 2017
darkk added a commit that referenced this issue Oct 14, 2017
SSL client cert is checked instead of auth_basic, as nginx reads the basic auth
file on every single request and that may be bad in case of IO issues; see
#155
darkk added a commit that referenced this issue Oct 31, 2018
It helps to distinguish bad network from bad userspace and bad service
at the affected host. See also #155
darkk added a commit that referenced this issue Oct 31, 2018
That should highlight resource exhaustion and possible malicious
activity. See #101, #135 and #155, umbrelled under #226.
darkk added a commit that referenced this issue Oct 31, 2018
This also fixes hkgsuperset.ooni.io missing from `dom0` and `hkg`
inventory groups. See #155 and #226

darkk commented Oct 31, 2018

Everything is cleared up; the diverged kernel versions are part of #122.

@darkk darkk closed this as completed Oct 31, 2018