
Ceph OSD crashes cause IO stall at HKG #155

Closed
7 of 8 tasks
darkk opened this issue Sep 1, 2017 · 1 comment

darkk commented Sep 1, 2017

Impact: some in-memory data at jupyter was lost; some soft deadlines may have been missed

Detection: @hellais was unable to access https://jupyter.ooni.io/

Timeline UTC:
01 Sep 03:25: LA1 (1-minute load average) starts growing without corresponding CPU load, which likely means an IO stall; node_disk_io_time_ms and node_disk_io_now confirm that, but most nodes do not export these metrics
01 Sep 04:16: some nodes are not up anymore
01 Sep 09:46: <@hellais🐙> @darkk not sure if you experienced this with HKG machines too, but I had to reboot the jupyter server because I suspect it was locking on IO in a way that made it unusable.
01 Sep 09:53: mail to support@
01 Sep 10:35: LA1 starts going down
01 Sep 12:45: jupyter boots, collector="time" is OK
01 Sep 14:15: incident published

What went well:

  • it seems node_exporter was not significantly affected by disk issues, despite calls to statfs() and the enabled collector.textfile.directory (the list of files to check is built during startup), and was able to report the bad LA1

What went wrong:

  • there were no reachability alerts although some nodes had obvious issues like up being zero (a rule sketch for such an alert follows this list):
    • measurements-beta.ooni.io
    • test-lists.openobservatory.org
    • hkgmetadb.infra.ooni.io
    • *.test.ooni.io
  • most nodes did not export disk IO stats
  • hkgsuperset.ooni.io is lost
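
As a follow-up on the missing reachability alerts, here is a minimal Prometheus alerting-rule sketch for the `up == 0` case. This is only an illustration, assuming the Prometheus 2.x YAML rule format: the group name, the 5m duration and the labels are placeholders, not the rule that was actually deployed.

```yaml
# Sketch only: names, duration and severity label are illustrative placeholders,
# not taken from the real OONI Prometheus configuration.
groups:
  - name: reachability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```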

What is still unclear:

  • node_scrape_collector_success turned 0 for diskstats and filesystem, which makes sense. But why did the time and loadavg collectors also report 0? It looks like an undesirable side effect of up=0, as avg_over_time is 1 (see the query sketch after this list).
  • what was the reason for nodes to go down? Was it nginx, node_exporter itself or something else?
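
To pin down the first open question, the following PromQL queries separate "the whole scrape failed" from "a single collector failed inside an otherwise successful scrape". This is a sketch: the job selector, collector names and the 1h window are placeholders for illustration.

```promql
# Whole scrape failed: node_exporter produced no collector metrics at all.
up{job="node"} == 0

# A single collector failed while the scrape itself succeeded.
node_scrape_collector_success{collector="time"} == 0 and on(instance) up == 1

# Fraction of scrapes where the collector reported success over the window.
avg_over_time(node_scrape_collector_success{collector="loadavg"}[1h])
```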

What could be done to prevent relapse and decrease impact:

  • bring hkgsuperset.ooni.io back
  • clean up kernel versions (kernel version without moddep at ssdams #122)
  • verify that IO stats are exported
  • LA alerts: 10*NCPU is a safe alert threshold :) (a rule sketch for this and the disk IO alerts follows this list)
  • disk IO alerts: 100% utilisation for a long time, non-zero queue for a long time
  • basic active checks for all nodes: ping, ssh
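
A sketch of what the proposed LA and disk IO alerts could look like as Prometheus rules. Thresholds, durations and selectors are assumptions for illustration, using the pre-0.16 node_exporter metric names (node_cpu, node_disk_io_time_ms, node_disk_io_now) that match the metrics mentioned in the timeline.

```yaml
# Sketch only: thresholds and durations are placeholders, not a deployed config.
groups:
  - name: io-stall
    rules:
      - alert: HighLoadAverage
        # "10*NCPU" threshold from the item above; NCPU is derived by counting
        # the per-CPU idle time series exported by node_exporter.
        expr: node_load1 > 10 * count without (cpu, mode) (node_cpu{mode="idle"})
        for: 15m
      - alert: DiskFullyUtilised
        # node_disk_io_time_ms is a counter of milliseconds spent doing IO, so a
        # rate near 1000 ms/s means ~100% utilisation; 900 is an example cut-off.
        expr: rate(node_disk_io_time_ms[5m]) > 900
        for: 30m
      - alert: DiskQueueNotEmpty
        # node_disk_io_now is a gauge of IOs currently in flight; a queue that
        # never drains over 30 minutes points at a stall.
        expr: min_over_time(node_disk_io_now[30m]) > 0
```

The ping/ssh item falls outside Prometheus' pull model; blackbox_exporter probes (ICMP, plus TCP to port 22) are the usual way to cover it.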
@darkk darkk added the incident label Sep 1, 2017
darkk added a commit that referenced this issue Oct 14, 2017
SSL client cert is checked instead of auth_basic, as nginx reads the basic auth
file on every single request and that may be bad in case of IO issues; see
#155
darkk added a commit that referenced this issue Oct 31, 2018
It helps to distinguish bad network from bad userspace and bad service
at the affected host. See also #155
darkk added a commit that referenced this issue Oct 31, 2018
That should highlight resource exhaustion and possible malicious
activity. See #101, #135 and #155, umbrelled under #226.
darkk added a commit that referenced this issue Oct 31, 2018
This also fixes hkgsuperset.ooni.io missing from `dom0` and `hkg`
inventory groups. See #155 and #226

darkk commented Oct 31, 2018

Everything is cleared up; the diverged kernel versions are part of #122.

@darkk darkk closed this as completed Oct 31, 2018