Ceph OSD crashes cause IO stall at HKG #155
This was referenced Sep 6, 2017

darkk added a commit that referenced this issue on Oct 14, 2017:
"SSL client cert is checked instead of auth_basic as nginx reads basic auth file on every single request and that may be bad in case of IO issues, see #155"

darkk added a commit that referenced this issue on Oct 31, 2018:
"It helps to distinguish bad network from bad userspace and bad service at the affected host. See also #155"
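The first commit above swaps nginx `auth_basic` for TLS client-certificate verification, so no htpasswd file has to be read on every request. A minimal sketch of that nginx configuration (all paths are illustrative assumptions, not from the repository):

```nginx
server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/tls/server.pem;   # assumed paths
    ssl_certificate_key /etc/nginx/tls/server.key;
    # Verify client certs against a local CA instead of auth_basic:
    # the CA bundle is loaded at startup, so a per-request disk read
    # of the auth file cannot block request processing during an IO stall.
    ssl_client_certificate /etc/nginx/tls/client-ca.pem;
    ssl_verify_client on;
}
```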
Everything is cleared up; diverged kernel versions are part of #122.
Impact: some in-memory data lost at jupyter; some soft deadlines may have been missed
Detection: @hellais unable to access https://jupyter.ooni.io/
Timeline UTC:
01 Sep 03:25: LA1 starts growing without corresponding CPU load, which likely means an IO stall; node_disk_io_time_ms and node_disk_io_now confirm that, but most nodes do not export these metrics
01 Sep 04:16: some nodes are not up anymore
01 Sep 09:46: <@hellais🐙> @darkk not sure if you experienced this with HKG machines too, but I had to reboot the jupyter server because I suspect it was locking on IO in a way that made it unusable.
01 Sep 09:53: mail to support@
01 Sep 10:35: LA1 starts going down
01 Sep 12:45: jupyter boots, collector="time" is OK
01 Sep 14:15: incident published
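The LA1-without-CPU-load signal from the timeline can also be checked locally on a host. A minimal Python sketch, assuming Linux `/proc`; the thresholds are illustrative assumptions, not values from the incident:

```python
#!/usr/bin/env python3
"""Flag a likely IO stall: load average is high while CPU utilization
stays low (processes stuck in uninterruptible IO wait inflate LA1
without burning CPU). Thresholds below are illustrative only."""
import time


def cpu_busy_fraction(interval=1.0):
    """Sample /proc/stat twice and return the busy (non-idle) CPU fraction."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        # fields: user nice system idle iowait irq softirq steal ...
        idle = fields[3] + fields[4]  # idle + iowait both count as "not busy"
        return idle, sum(fields)

    idle1, total1 = snapshot()
    time.sleep(interval)
    idle2, total2 = snapshot()
    d_total = (total2 - total1) or 1
    return 1.0 - (idle2 - idle1) / d_total


def likely_io_stall(la1_threshold=4.0, cpu_threshold=0.2):
    """High 1-minute load average combined with low CPU busy fraction."""
    with open("/proc/loadavg") as f:
        la1 = float(f.read().split()[0])
    return la1 > la1_threshold and cpu_busy_fraction() < cpu_threshold
```

The same idea expressed over node_exporter metrics would compare `node_load1` against CPU time rates; the local sketch just reads the kernel counters directly.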
What went well:
- … `statfs()` and enabled `collector.textfile.directory` (list of files to check is made during startup) and was able to report bad LA1

What went wrong:
- … `up` being zero

What is still unclear:
- … `diskstats` and `filesystem`, that makes sense. But what was the reason for collectors `time` and `loadavg` to report `0`? It looks like an undesirable side effect of up=0. avg_over_time is 1.
- … `down`? Was it nginx, node_exporter itself, or something else?

What could be done to prevent relapse and decrease impact:
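The `collector.textfile.directory` mentioned under "what went well" works by having node_exporter scrape `*.prom` files from a fixed directory on each request. A minimal sketch of feeding it a custom metric; the directory, metric name, and values are illustrative assumptions, not from this deployment:

```shell
# Demo uses a temp dir; a real deployment points node_exporter's
# --collector.textfile.directory flag at a persistent path instead.
TEXTFILE_DIR="$(mktemp -d)"

# Write atomically: create under a temp name, then rename, so a scrape
# never sees a half-written file.
cat > "$TEXTFILE_DIR/raid_status.prom.$$" <<'EOF'
# HELP node_raid_degraded Whether any md array is degraded.
# TYPE node_raid_degraded gauge
node_raid_degraded 0
EOF
mv "$TEXTFILE_DIR/raid_status.prom.$$" "$TEXTFILE_DIR/raid_status.prom"
```

Because the file is produced out-of-band (e.g. by cron), a stalled producer leaves a stale file behind rather than blocking the exporter itself.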