enable hwmon for sensor collection for bare metal clusters #971

Merged
merged 1 commit into openshift:master on Nov 17, 2020

@dhellmann dhellmann commented Oct 27, 2020

Enable hwmon data collection so that hardware telemetry, such as CPU
temperature and fan speeds, is available for bare metal clusters.

  • I added a CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@@ -23,7 +23,6 @@ spec:
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
- --no-collector.wifi
- --no-collector.hwmon
Contributor

This modifies the generated assets directly. You need to change the jsonnet source to remove the flag and then regenerate the assets; see the contributing guide for how to do that: https://github.com/openshift/cluster-monitoring-operator/blob/master/CONTRIBUTING.md

Contributor Author

Hmm, this was the only file in this repository that had that string in it. Where is this file coming from upstream?

Contributor

This is the PR that removed hwmon from the enabled collectors in the upstream repository: prometheus-operator/kube-prometheus#381

@@ -7,6 +7,7 @@
- Adjusted NodeClockNotSynchronising, NodeNetworkReceiveErrs, and NodeNetworkTransmitErrs alerts.
- [#962](https://github.com/openshift/cluster-monitoring-operator/pull/962) Enable namespace by pod and pod total networking Grafana dashboards.
- [#959](https://github.com/openshift/cluster-monitoring-operator/pull/959) Remove memory limits from prometheus-config-reloader in user workload monitoring
- [#971](https://github.com/openshift/cluster-monitoring-operator/pull/971) Enable `hwmon` in node-exporter for hardware sensor data collection
Contributor

I remember us disabling non-essential collectors because of high cardinality. How many new series does this bring in on each cluster?

Contributor Author

The number of series depends on the number of sensors visible on the node, so it's hard to pin that down. On my dev system, I only get some temperature sensors. On other hosts, we should see more.
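One rough way to answer the cardinality question for a particular node is to count the `node_hwmon_` samples in a node-exporter scrape. The sketch below is illustrative only: it assumes the metrics text has already been fetched from the exporter, and the function name `count_hwmon_series` and the sample payload are hypothetical, not taken from this PR.

```python
from collections import Counter


def count_hwmon_series(metrics_text: str, prefix: str = "node_hwmon_") -> Counter:
    """Count samples per metric name for metrics whose name starts with `prefix`.

    `metrics_text` is Prometheus text exposition format, one sample per line.
    """
    counts: Counter = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            counts[name] += 1
    return counts


# Hypothetical scrape excerpt, made up for illustration.
SAMPLE = """\
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="platform_coretemp_0",sensor="temp1"} 41
node_hwmon_temp_celsius{chip="platform_coretemp_0",sensor="temp2"} 39
node_hwmon_temp_max_celsius{chip="platform_coretemp_0",sensor="temp1"} 100
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

per_metric = count_hwmon_series(SAMPLE)
total_series = sum(per_metric.values())
```

Running something like this against scrapes from a few representative bare metal hosts would give a per-node estimate; the cluster-wide figure is then roughly the sum over nodes.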

Contributor

@dhellmann Is this enablement part of a concrete initiative?

Contributor Author

Here's some example output from one host: https://paste.centos.org/view/e2d24856

That includes temperature data, but that host is apparently not exposing fan speed or other sensors.
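To check which sensor types a host exposes before looking at exporter output, one option is to inspect the kernel's hwmon sysfs tree, where readings follow the `<type><N>_input` naming convention (for example `temp1_input` or `fan1_input`). The following is a minimal sketch under that assumption; `sensor_types` is a hypothetical helper, and it does not handle chips that keep their files in a `device/` subdirectory. Inside the node-exporter container shown in the diff, the root would presumably be under `/host/sys` rather than `/sys`.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# hwmon reading files are named like temp1_input, fan2_input, in0_input.
_INPUT_RE = re.compile(r"^([a-z]+)\d+_input$")


def sensor_types(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Map each sensor type (e.g. 'temp', 'fan') to the number of inputs found."""
    found = defaultdict(int)
    root = Path(hwmon_root)
    if not root.is_dir():
        return {}
    for chip_dir in sorted(root.glob("hwmon*")):
        if not chip_dir.is_dir():
            continue
        for entry in chip_dir.iterdir():
            match = _INPUT_RE.match(entry.name)
            if match:
                found[match.group(1)] += 1
    return dict(found)


# Demo against a fake sysfs tree with two temperature inputs and no fans.
with tempfile.TemporaryDirectory() as tmp:
    chip = Path(tmp) / "hwmon0"
    chip.mkdir()
    for name in ("name", "temp1_input", "temp2_input", "temp1_max"):
        (chip / name).write_text("0\n")
    demo = sensor_types(tmp)
```

A result containing only a `temp` key would match the host described above, which reports temperatures but no fan speeds.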

@dhellmann
Contributor Author

/retest

@dhellmann
Contributor Author

I've updated the PR with a jsonnet expression to remove the flag from the upstream version of the deployment settings.

Regenerating the assets also modified manifests/0000_50_cluster-monitoring-operator_02-role.yaml, but that looks like the result of something not sorting its output consistently. I can remove that change manually, but I thought I should include all of the output of make generate to start with.
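The jsonnet expression itself is not quoted in this conversation. As a rough illustration of the idea only, written in Python rather than jsonnet, removing the flag amounts to filtering the upstream argument list instead of editing the generated YAML. The helper `drop_flag` is hypothetical; the argument values are the ones visible in the diff above.

```python
def drop_flag(args, flag="--no-collector.hwmon"):
    """Return a copy of `args` without `flag`, preserving the original order."""
    return [arg for arg in args if arg != flag]


# Arguments as they appear in the upstream-derived manifest.
upstream_args = [
    "--path.sysfs=/host/sys",
    "--path.rootfs=/host/root",
    "--no-collector.wifi",
    "--no-collector.hwmon",
]

# With the disabling flag gone, node-exporter falls back to its default,
# in which the hwmon collector is enabled.
patched_args = drop_flag(upstream_args)
```

Filtering rather than redefining the whole list keeps the override small, so later upstream changes to the other arguments still flow through when the assets are regenerated.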

@dhellmann
Contributor Author

/retest

@dhellmann dhellmann force-pushed the enable-hwmon branch 2 times, most recently from 2a3a858 to d745180 Compare October 28, 2020 12:42
@dhellmann
Contributor Author

/test generate

@dhellmann dhellmann mentioned this pull request Oct 28, 2020
2 tasks
@s-urbaniak
Contributor

@dhellmann this lgtm, but could you please rebase?

@openshift-ci-robot openshift-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Nov 1, 2020
Enable the hwmon data collection so that hardware telemetry like CPU
temperature and fan speeds are available for bare metal clusters.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Nov 10, 2020
@dhellmann
Contributor Author

Rebased to resolve the changelog merge conflict.

@s-urbaniak
Contributor

/lgtm

@s-urbaniak
Contributor

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Nov 10, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Nov 10, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, s-urbaniak

@openshift-ci-robot openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Nov 10, 2020
@s-urbaniak
Contributor

just holding to verify that @lilic's comments have been addressed.

@dhellmann
Contributor Author

/retest

@lilic
Contributor

lilic commented Nov 13, 2020

/retest

@lilic
Contributor

lilic commented Nov 13, 2020

/hold cancel

lgtm 👍

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Nov 13, 2020
@dhellmann
Contributor Author

@lilic The test failures look real, but I don't know how they're related to this change. I see similar failures on #965. Is it possible something else broke that test job?

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit a872330 into openshift:master Nov 17, 2020