2 changes: 2 additions & 0 deletions _topic_map.yml
@@ -861,6 +861,8 @@ Topics:
# File: nodes-nodes-problem-detector
- Name: Viewing node audit logs
File: nodes-nodes-audit-log
- Name: Machine Config Daemon metrics
File: nodes-nodes-machine-config-daemon-metrics
- Name: Working with containers
Dir: containers
Topics:
96 changes: 96 additions & 0 deletions modules/machine-config-daemon-metrics.adoc
@@ -0,0 +1,96 @@
// Module included in the following assemblies:
//
// * nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc

[id="machine-config-daemon-metrics_{context}"]
= Machine Config Daemon metrics

Beginning with {product-title} 4.3, the Machine Config Daemon provides a set of metrics. You can access these metrics by using the Prometheus Cluster Monitoring stack.
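
As a rough illustration of reading one of these metrics programmatically, the following Python sketch builds an instant-query URL for the standard Prometheus HTTP API endpoint `/api/v1/query`. The base URL `https://prometheus.example.com` is a hypothetical placeholder, and authentication is deliberately omitted. How the Prometheus endpoint is exposed and secured on a given cluster is an assumption that this sketch does not cover.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, metric: str) -> str:
    """Return an instant-query URL for the Prometheus HTTP API.

    base_url is a hypothetical Prometheus address; metric is a metric
    name or any PromQL expression.
    """
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': metric})}"


# Hypothetical host; replace with the address of your Prometheus instance.
print(build_query_url("https://prometheus.example.com/", "mcd_state"))
```

The resulting URL can be requested with any HTTP client that supplies the credentials your cluster requires.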

The following table describes this set of metrics.

[NOTE]
====
Metrics marked with `*` in the `Name` and `Description` columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.
====

[NOTE]
====
While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the `oc adm must-gather` command.
====

[cols="1,1,2,2", options="header"]
.MCO metrics
|===
|Name
|Format
|Description
|Notes

|mcd_host_os_and_version
|[]string{"os", "version"}
|Shows the OS that the MCD is running on, such as RHCOS or RHEL. In the case of RHCOS, the version is also provided.
|

|ssh_accessed
|counter
|Shows the number of successful SSH authentications into the node.
|A non-zero value indicates that someone might have made manual changes to the node. Such changes might cause irreconcilable errors because the state on the disk then differs from the state defined in the machine configuration.

|mcd_drain*
|{"drain_time", "err"}
|Logs errors received during a failed drain. *
|Although a drain might need multiple attempts to succeed, a drain that fails terminally prevents updates from proceeding. The `drain_time` value, which shows how long the drain took, might help with troubleshooting.

For further investigation, see the logs by running:

`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon`

|mcd_pivot_err*
|[]string{"pivot_target", "err"}
|Logs errors encountered during pivot. *
|Pivot errors might prevent OS upgrades from proceeding.

For further investigation, run this command to access the node and see the logs of the pivot service:

`$ oc debug node/<node> -- chroot /host journalctl -u pivot.service`

Alternatively, you can run this command to only see the logs from the `machine-config-daemon` container:

`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon`

|mcd_state
|[]string{"state", "reason"}
|Shows the state of the Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In the case of "Degraded", the reason is included.
|For further investigation, see the logs by running:

`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon`

|mcd_kubelet_state*
|[]string{"err"}
|Logs kubelet health failures. *
|This is expected to be empty, with a failure count of 0. If the failure count exceeds 2, an error is reported indicating that the threshold is exceeded. This indicates a possible issue with the health of the kubelet.

For further investigation, run this command to access the node and see the kubelet logs:

`$ oc debug node/<node> -- chroot /host journalctl -u kubelet`

|mcd_reboot_err*
|[]string{"message", "err"}
|Logs the failed reboots and the corresponding errors. *
|This is expected to be empty, which indicates a successful reboot.

For further investigation, see the logs by running:

`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon`

|mcd_update_state
|[]string{"config", "err"}
|Logs success or failure of configuration updates and the corresponding errors.
|The expected value is `rendered-master/rendered-worker-XXXX`. If the update fails, an error is present.

For further investigation, see the logs by running:

`$ oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon`
|===
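
The check that the table implies for the metrics marked with `*` can be sketched in Python as follows. This is a minimal sketch: it assumes that the metrics are exported under exactly the names listed in the table and that you already have a scrape in the Prometheus text exposition format. The `SAMPLE` text is hypothetical data written for illustration, not real cluster output.

```python
import re

# Metrics marked with * in the table (names assumed to match the export).
SERIOUS = {"mcd_drain", "mcd_pivot_err", "mcd_kubelet_state", "mcd_reboot_err"}

# One sample line: metric name, optional {labels}, then the value.
LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def serious_nonzero(text):
    """Return (name, labels, value) for starred metrics with a non-zero value."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE.match(line)
        if not match:
            continue
        name, labels, raw = match.group(1), match.group(2) or "", match.group(3)
        try:
            value = float(raw)
        except ValueError:
            continue
        if name in SERIOUS and value != 0:
            hits.append((name, labels, value))
    return hits


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE mcd_kubelet_state gauge
mcd_kubelet_state{err=""} 0
mcd_reboot_err{message="",err=""} 0
mcd_pivot_err{pivot_target="",err="example failure"} 1
ssh_accessed 2
"""

for name, labels, value in serious_nonzero(SAMPLE):
    print(name, labels, value)
```

A non-empty result points to the metric to investigate first with the log commands listed in the table. `ssh_accessed` is excluded on purpose because the table does not mark it with `*`.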

13 changes: 13 additions & 0 deletions nodes/nodes/nodes-nodes-machine-config-daemon-metrics.adoc
@@ -0,0 +1,13 @@
[id="machine-config-daemon-metrics"]
= Machine Config Daemon metrics
include::modules/common-attributes.adoc[]
:context: machine-config-operator

The Machine Config Daemon is a part of the Machine Config Operator. It runs on every node in the cluster and manages configuration changes and updates on each node.

include::modules/machine-config-daemon-metrics.adoc[leveloffset=+1]

.Additional resources

* See xref:../../monitoring/cluster-monitoring/about-cluster-monitoring.adoc[the documentation on the Prometheus Cluster Monitoring stack].
* See link:https://github.com/openshift/must-gather[the repository of the must-gather tool].