Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MON-2903: add nodeExporter.collectors.systemd settings. #1892

Merged
merged 1 commit into from Jul 10, 2023

Conversation

raptorsun
Copy link
Contributor

@raptorsun raptorsun commented Feb 14, 2023

This PR is held for the moment, systems collector requires more settings, it will come together later in this PR.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

This PR is based on top of #1876 , we should merge that before this one.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 14, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Feb 14, 2023

@raptorsun: This pull request references MON-2903 which is a valid jira issue.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

This PR is based on top of #1876 , we should merge that before this one.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2023
@simonpasquier
Copy link
Contributor

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Feb 14, 2023

@raptorsun: This pull request references MON-2903 which is a valid jira issue.

In response to this:

This PR is held for the moment, systems collector requires more settings, it will come together later in this PR.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

This PR is based on top of #1876 , we should merge that before this one.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2023
@raptorsun
Copy link
Contributor Author

/retest

@raptorsun
Copy link
Contributor Author

/retest-required

@raptorsun
Copy link
Contributor Author

when activating systemd collector 8 new metrics are collected:

name cardinality per node
node_systemd_socket_accepted_connections_total, ~ 15 sockets
node_systemd_socket_current_connections, ~ 15 sockets
node_systemd_socket_refused_connections_total, ~ 15 sockets
node_systemd_system_running, 1 running state
node_systemd_timer_last_trigger_seconds, 3 timers
node_systemd_unit_state, ~ 220 service x 4 states
node_systemd_units, 4 states
node_systemd_version 1 version

The metric node_systemd_unit_state will generate many series in proportion to the number of systemd service per node.

@simonpasquier
Copy link
Contributor

My take is that enabling the systemd collector should only expose metrics which have a fixed cardinality: node_systemd_system_running, node_systemd_unit_state, node_systemd_units and node_systemd_version.

By default CMO should set --collector.systemd.unit-include="" meaning that no node_systemd_unit_state metric is exposed. Users can configure which units they want to collect metrics for. Maybe via a list of regexps?

nodeExporter:
  collectors:
    systemd:
      enabled: true
      units:
      - iscsi-init.*
      - sshd.service

node_systemd_socket_* and node_systemd_timer_* are automatically exposed for units which are included.

@Tai-RedHat
Copy link

@simonpasquier, I guess Haoyu is not back yet, is this PR ready for testing?

@Tai-RedHat
Copy link

test PR with cluster-bot, new features implemented
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 17, 2023
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case you've missed, I had a comment about metrics cardinality and how to limit it.

// Among them, the `node_systemd_unit_state` metric is the most useful show the state of each systemd unit. So its cardinality cound be high.
// If you enable this collector, watch the prometheus-k8s deployment closely for excessive memory usage.
type NodeExporterCollectorSystemdConfig struct {
// A Boolean flag that enables or disables the `systemd` colletor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A Boolean flag that enables or disables the `systemd` colletor.
// A Boolean flag that enables or disables the `systemd` collector.

@raptorsun
Copy link
Contributor Author

in case you've missed, I had a comment about metrics cardinality and how to limit it.

Thanks for reminding me of limiting cardinality, I will later push a new version with these filters.

@@ -297,6 +297,9 @@ type NodeExporterCollectorConfig struct {
// Defines the configuration of the `buddyinfo` collector, which collects statistics about memory fragmentation from the `node_buddyinfo_blocks` metric. This metric collects data from `/proc/buddyinfo`.
// Disabled by default.
BuddyInfo NodeExporterCollectorBuddyInfoConfig `json:"buddyinfo,omitempty"`
// Defines the configuration of the `systemd` collector, which collects statistics on systemd daemon and its managed services.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Defines the configuration of the `systemd` collector, which collects statistics on systemd daemon and its managed services.
// Defines the configuration of the `systemd` collector, which collects statistics on the `systemd` daemon and its managed services.

// `node_systemd_unit_state`,
// `node_systemd_units`,
// `node_systemd_version`.
// Among them, the `node_systemd_unit_state` metric is the most useful show the state of each systemd unit. So its cardinality cound be high.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Among them, the `node_systemd_unit_state` metric is the most useful show the state of each systemd unit. So its cardinality cound be high.
// Of these metrics, the `node_systemd_unit_state` metric is the most useful because it shows the state of each `systemd` unit. However, note that the cardinality for this metric might be high.

// `node_systemd_units`,
// `node_systemd_version`.
// Among them, the `node_systemd_unit_state` metric is the most useful show the state of each systemd unit. So its cardinality cound be high.
// If you enable this collector, watch the prometheus-k8s deployment closely for excessive memory usage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If you enable this collector, watch the prometheus-k8s deployment closely for excessive memory usage.
// If you enable this collector, closely monitor the `prometheus-k8s` deployment for excessive memory usage.

@raptorsun raptorsun changed the title MON-2903: add nodeExporter.collectors.systemd settings. [WIP] MON-2903: add nodeExporter.collectors.systemd settings. Jun 19, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2023
@raptorsun
Copy link
Contributor Author

/hold
adding systemd unit selection in the config.

@raptorsun
Copy link
Contributor Author

@bburt-rh @simonpasquier Thank you for the timely review :D

All points has been addressed. The PR is ready for another review now.

pkg/manifests/types.go Outdated Show resolved Hide resolved
@raptorsun
Copy link
Contributor Author

ready to review again.
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 27, 2023
Comment on lines 889 to 890
units := f.config.ClusterMonitoringConfiguration.NodeExporterConfig.Collectors.Systemd.Units
for idx, unit := range units {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't modify the config slice.

Suggested change
units := f.config.ClusterMonitoringConfiguration.NodeExporterConfig.Collectors.Systemd.Units
for idx, unit := range units {
patternUnits := make([]string, len(f.config.ClusterMonitoringConfiguration.NodeExporterConfig.Collectors.Systemd.Units))
for i, unit := range f.config.ClusterMonitoringConfiguration.NodeExporterConfig.Collectors.Systemd.Units {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after removing the addition of brackets, this will not be needed, the unit list will be read only in this function.

pkg/manifests/manifests.go Outdated Show resolved Hide resolved
if err != nil {
return nil, fmt.Errorf("invalid regexp for systemd unit: %s", unit)
}
units[idx] = fmt.Sprintf("(%s)", unit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the enclosing brackets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a second thought, any pattern that needs brackets to limit cannot pass the regex check just before this line. No bracket will be added.

test/e2e/node_exporter_test.go Outdated Show resolved Hide resolved
@raptorsun raptorsun force-pushed the feature/MON-2903 branch 2 times, most recently from bc71bc7 to 882073c Compare June 28, 2023 10:11
@bburt-rh
Copy link
Contributor

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Jun 28, 2023
@raptorsun
Copy link
Contributor Author

/retest-required

@jan--f
Copy link
Contributor

jan--f commented Jul 5, 2023

This lgtm but will let @simonpasquier apply the label.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 5, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 5, 2023
@jan--f
Copy link
Contributor

jan--f commented Jul 10, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 10, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bburt-rh, jan--f, raptorsun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 10, 2023

@raptorsun: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit b886eb2 into openshift:master Jul 10, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. docs-approved Signifies that Docs has signed off on this PR jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants