From 5a705959d2b746f983e644839720a9cef499f59d Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Fri, 20 Dec 2019 15:26:46 +0100
Subject: [PATCH 1/6] Add description of metrics exported by Machine Config Daemon

---
 _topic_map.yml                                |  2 +
 modules/machine-config-daemon-metrics.adoc    | 77 +++++++++++++++++++
 ...s-nodes-machine-config-daemon-metrics.adoc | 13 ++++
 3 files changed, 92 insertions(+)
 create mode 100644 modules/machine-config-daemon-metrics.adoc
 create mode 100644 nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc

diff --git a/_topic_map.yml b/_topic_map.yml
index 288ca48f6bcf..463ba47cf678 100644
--- a/_topic_map.yml
+++ b/_topic_map.yml
@@ -861,6 +861,8 @@ Topics:
 # File: nodes-nodes-problem-detector
   - Name: Viewing node audit logs
     File: nodes-nodes-audit-log
+  - Name: Machine Config Daemon metrics
+    File: nodes-nodes-machine-config-daemon-metrics
 - Name: Working with containers
   Dir: containers
   Topics:
diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
new file mode 100644
index 000000000000..6e9861a33ae8
--- /dev/null
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -0,0 +1,77 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc
+
+[id="machine-config-daemon-metrics_{context}"]
+= Machine Config Daemon metrics
+
+Beginning {product-title} 4.3, the Machine Config Daemon provides a set of metrics. These metrics can be accessed using the Prometheus Cluster Monitoring stack.
+
+The following table describes this set of metrics. Note that:
+
+* Metrics marked with `*` in the `Name` and `Description` columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.
+* While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the `oc adm must-gather` command.
+
+[cols="1,1,2,2", options="header"]
+.MCO metrics
+|===
+|Name
+|Format
+|Description
+|Notes
+
+|mcd_host_os_and_version
+|[]string{"os", "version"}
+|Shows the OS that MCD is running on, such as RHCOS or RHEL. In case of RHCOS, the version is provided.
+|
+
+|ssh_accessed
+|counter
+|Shows the number of successful SSH authentications into the node.
+|The non-zero value shows that someone might have made manual changes to the node. Such changes might cause irreconcilable errors due to the differences between the state on the disk and the state defined in the machine configuration.
+
+|mcd_drain*
+|{"drain_time", "err"}
+|Logs errors received during failed drain. *
+|While drains might need multiple tries to succeed, terminal failed drains prevent updates from proceeding. The `drain_time` metric, which shows how much time the drain took, might help with troubleshooting. For further investigation, see the logs by running:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+
+|mcd_pivot_err*
+|[]string{"pivot_target", "err"}
+|Logs errors encountered during pivot. *
+|Pivot errors might prevent OS upgrades from proceeding. For further investigation, see the logs by running these commands:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+
+`$ journalctl -u pivot.service`
+
+|mcd_state
+|[]string{"state", "reason"}
+|State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included.
+|For further investigation, see the logs by running:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+
+|mcd_kubelet_state*
+|[]string{"err"}
+|Logs kubelet health failures. *
+|This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, see the logs by running:
+
+`$ journalctl -u kubelet`
+
+|mcd_reboot_err*
+|[]string{"message", "err"}
+|Logs the failed reboots and the corresponding errors. *
+|This is expected to be empty, which indicates a successful reboot. For further investigation, see the logs by running:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+
+|mcd_update_state
+|[]string{"config", "err"}
+|Logs success or failure of configuration updates and the corresponding errors.
+|The expected value is `rendered-master/rendered-worker-XXXX`. If the update fails, an error is present. For further investigation, see the logs by running:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+|===
+
diff --git a/nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc b/nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc
new file mode 100644
index 000000000000..f5cef467616f
--- /dev/null
+++ b/nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc
@@ -0,0 +1,13 @@
+[id="machine-config-daemon-metrics"]
+= Machine Config Daemon metrics
+include::modules/common-attributes.adoc[]
+:context: machine-config-operator
+
+The Machine Config Daemon is a part of the Machine Config Operator. It runs on every node in the cluster. Machine Config Daemon's purpose is managing configuration changes and updates on each of the nodes.
+
+include::modules/machine-config-daemon-metrics.adoc[leveloffset=+1]
+
+.Additional resources
+
+* See xref:../../monitoring/cluster-monitoring/about-cluster-monitoring.adoc[the documentation on the Prometheus Cluster Monitoring stack].
+* See link:https://github.com/openshift/must-gather[the repository of the must-gather tool].
From 95e33de53170dfe89a20582cc908b14a6f68ea47 Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Thu, 9 Jan 2020 15:57:03 +0100
Subject: [PATCH 2/6] Clarify that actions are done on nodes

---
 modules/machine-config-daemon-metrics.adoc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
index 6e9861a33ae8..c67a725e209f 100644
--- a/modules/machine-config-daemon-metrics.adoc
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -40,7 +40,9 @@ The following table describes this set of metrics. Note that:
 |mcd_pivot_err*
 |[]string{"pivot_target", "err"}
 |Logs errors encountered during pivot. *
-|Pivot errors might prevent OS upgrades from proceeding. For further investigation, see the logs by running these commands:
+|Pivot errors might prevent OS upgrades from proceeding. For further investigation, run these commands to access the node and see its logs:
+
+`$ oc debug node `
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 
@@ -56,7 +58,9 @@ The following table describes this set of metrics. Note that:
 |mcd_kubelet_state*
 |[]string{"err"}
 |Logs kubelet health failures. *
-|This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, see the logs by running:
+|This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, run these commands to access the node and see its logs:
+
+`$ oc debug node `
 
 `$ journalctl -u kubelet`
 
From 735502667befb2bde65da26e7ed439e5680af015 Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Fri, 10 Jan 2020 09:48:16 +0100
Subject: [PATCH 3/6] Put information into [NOTE]s

---
 modules/machine-config-daemon-metrics.adoc | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
index c67a725e209f..32b78ad1252d 100644
--- a/modules/machine-config-daemon-metrics.adoc
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -7,10 +7,17 @@
 
 Beginning {product-title} 4.3, the Machine Config Daemon provides a set of metrics. These metrics can be accessed using the Prometheus Cluster Monitoring stack.
 
-The following table describes this set of metrics. Note that:
+The following table describes this set of metrics.
 
-* Metrics marked with `*` in the `Name` and `Description` columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.
-* While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the `oc adm must-gather` command.
+[NOTE]
+====
+Metrics marked with `*` in the `Name` and `Description` columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.
+====
+
+[NOTE]
+====
+While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the `oc adm must-gather` command.
+====
 
 [cols="1,1,2,2", options="header"]
 .MCO metrics
From a657dbc6825e1408ba61b05267fd1312e5507024 Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Fri, 10 Jan 2020 10:07:11 +0100
Subject: [PATCH 4/6] Fixed the instructions for accessing logs

---
 modules/machine-config-daemon-metrics.adoc | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
index 32b78ad1252d..d54f863f0dc5 100644
--- a/modules/machine-config-daemon-metrics.adoc
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -49,12 +49,16 @@ While some entries contain commands for getting specific logs, the most comprehe
 |Logs errors encountered during pivot. *
 |Pivot errors might prevent OS upgrades from proceeding. For further investigation, run these commands to access the node and see its logs:
 
-`$ oc debug node `
+`$ oc debug node/`
 
-`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+`$ chroot /host`
 
 `$ journalctl -u pivot.service`
 
+You can also run:
+
+`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
+
 |mcd_state
 |[]string{"state", "reason"}
 |State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included.
@@ -67,7 +71,9 @@ While some entries contain commands for getting specific logs, the most comprehe
 |Logs kubelet health failures. *
 |This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, run these commands to access the node and see its logs:
 
-`$ oc debug node `
+`$ oc debug node/`
+
+`$ chroot /host`
 
 `$ journalctl -u kubelet`
 
From 1f0af5d3d9d49addeea711b9b226eeb51e3d0a95 Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Tue, 14 Jan 2020 20:43:25 +0100
Subject: [PATCH 5/6] Add info on difference between two commands & structure improvements

---
 modules/machine-config-daemon-metrics.adoc | 32 ++++++++++++++++++----
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
index d54f863f0dc5..c4c5442f160c 100644
--- a/modules/machine-config-daemon-metrics.adoc
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -40,25 +40,35 @@ While some entries contain commands for getting specific logs, the most comprehe
 |mcd_drain*
 |{"drain_time", "err"}
 |Logs errors received during failed drain. *
-|While drains might need multiple tries to succeed, terminal failed drains prevent updates from proceeding. The `drain_time` metric, which shows how much time the drain took, might help with troubleshooting. For further investigation, see the logs by running:
+|While drains might need multiple tries to succeed, terminal failed drains prevent updates from proceeding. The `drain_time` metric, which shows how much time the drain took, might help with troubleshooting.
+
+For further investigation, see the logs by running:
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 
 |mcd_pivot_err*
 |[]string{"pivot_target", "err"}
 |Logs errors encountered during pivot. *
-|Pivot errors might prevent OS upgrades from proceeding. For further investigation, run these commands to access the node and see its logs:
+|Pivot errors might prevent OS upgrades from proceeding.
+
+For further investigation, run the following commands to access the node and see its logs.
+
+To access the node:
 
 `$ oc debug node/`
 
+Once on the node, show the logs:
+
 `$ chroot /host`
 
 `$ journalctl -u pivot.service`
 
-You can also run:
+Alternatively, you can run:
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 
+Unlike `journalctl`, this command only shows the logs from the `machine-config-daemon` container and not all the logs from the node.
+
 |mcd_state
 |[]string{"state", "reason"}
 |State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included.
@@ -69,10 +79,16 @@ You can also run:
 |mcd_kubelet_state*
 |[]string{"err"}
 |Logs kubelet health failures. *
-|This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, run these commands to access the node and see its logs:
+|This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet.
+
+For further investigation, run the following commands to access the node and see its logs.
+
+To access the node:
 
 `$ oc debug node/`
 
+Once on the node, show the logs:
+
 `$ chroot /host`
 
 `$ journalctl -u kubelet`
@@ -80,14 +96,18 @@ You can also run:
 |mcd_reboot_err*
 |[]string{"message", "err"}
 |Logs the failed reboots and the corresponding errors. *
-|This is expected to be empty, which indicates a successful reboot. For further investigation, see the logs by running:
+|This is expected to be empty, which indicates a successful reboot.
+
+For further investigation, see the logs by running:
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 
 |mcd_update_state
 |[]string{"config", "err"}
 |Logs success or failure of configuration updates and the corresponding errors.
-|The expected value is `rendered-master/rendered-worker-XXXX`. If the update fails, an error is present. For further investigation, see the logs by running:
+|The expected value is `rendered-master/rendered-worker-XXXX`. If the update fails, an error is present.
+
+For further investigation, see the logs by running:
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 |===
From e5d64b043c5717301bbbabd3264196cba2040ede Mon Sep 17 00:00:00 2001
From: Maxim Svistunov
Date: Mon, 20 Jan 2020 12:41:47 +0100
Subject: [PATCH 6/6] Simplify instructions and wording around them

---
 modules/machine-config-daemon-metrics.adoc | 28 ++++------------------
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/modules/machine-config-daemon-metrics.adoc b/modules/machine-config-daemon-metrics.adoc
index c4c5442f160c..f0cb90b1898f 100644
--- a/modules/machine-config-daemon-metrics.adoc
+++ b/modules/machine-config-daemon-metrics.adoc
@@ -51,24 +51,14 @@ For further investigation, see the logs by running:
 |Logs errors encountered during pivot. *
 |Pivot errors might prevent OS upgrades from proceeding.
 
-For further investigation, run the following commands to access the node and see its logs.
+For further investigation, run this command to access the node and see all its logs:
 
-To access the node:
+`$ oc debug node/ -- chroot /host journalctl -u pivot.service`
 
-`$ oc debug node/`
-
-Once on the node, show the logs:
-
-`$ chroot /host`
-
-`$ journalctl -u pivot.service`
-
-Alternatively, you can run:
+Alternatively, you can run this command to only see the logs from the `machine-config-daemon` container:
 
 `$ oc logs -f -n openshift-machine-config-operator machine-config-daemon- -c machine-config-daemon`
 
-Unlike `journalctl`, this command only shows the logs from the `machine-config-daemon` container and not all the logs from the node.
-
 |mcd_state
 |[]string{"state", "reason"}
 |State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included.
@@ -81,17 +71,9 @@ Unlike `journalctl`, this command only shows the logs from the `machine-config-d
 |Logs kubelet health failures. *
 |This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet.
 
-For further investigation, run the following commands to access the node and see its logs.
-
-To access the node:
-
-`$ oc debug node/`
-
-Once on the node, show the logs:
-
-`$ chroot /host`
+For further investigation, run this command to access the node and see all its logs:
 
-`$ journalctl -u kubelet`
+`$ oc debug node/ -- chroot /host journalctl -u kubelet`
 
 |mcd_reboot_err*
 |[]string{"message", "err"}