MON-3544: Adjust NodeClock* alerting rules to work with PTP operator #2182

Merged
merged 1 commit into from Dec 7, 2023

simonpasquier
Contributor

@simonpasquier simonpasquier commented Dec 5, 2023

This commit adapts the upstream NodeClockNotSynchronising and NodeClockSkewDetected rules to be always inactive when the PTP operator is installed. The PTP operator ships a more robust rule to detect unsynchronised clocks and the default rules are redundant in this case.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
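
As a rough illustration of the muting described above, the patched NodeClockNotSynchronising rule could look like the sketch below. The left-hand side of the expression is assumed to match the upstream node-exporter mixin, and the group name, `for` and `severity` values are placeholders; only the `absent(up{job="ptp-monitor-service"})` clause is taken from this PR. `absent()` returns a single sample only when no `up` series exists for the PTP job, so `and on()` keeps the original result while the PTP operator is not installed and returns nothing once it is.

```yaml
# Sketch only: the upstream expression and the metadata are assumptions,
# not copied from the PR diff.
groups:
  - name: node-exporter
    rules:
      - alert: NodeClockNotSynchronising
        expr: |
          (
            min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
          and
            node_timex_maxerror_seconds{job="node-exporter"} >= 16
          ) and on() absent(up{job="ptp-monitor-service"})
        for: 10m
        labels:
          severity: warning
```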

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Dec 5, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 5, 2023

@simonpasquier: This pull request references MON-3544 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

This commit adapts the upstream NodeClockNotSynchronising alerting rule to "mute" itself when the PTP operator is installed. The PTP operator ships a more robust rule to detect unsynchronised clocks and the default NodeClockNotSynchronising rule is redundant in this case.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@simonpasquier changed the title from "MON-3544: Adjust NodeClockNotSynchronising to work with PTP operator [WIP]" to "[WIP] MON-3544: Adjust NodeClockNotSynchronising to work with PTP operator" on Dec 5, 2023
@openshift-ci (bot) added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 5, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 5, 2023

@simonpasquier: This pull request references MON-3544 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

This commit adapts the upstream NodeClockNotSynchronising alerting rule to "mute" itself when the PTP operator is installed. The PTP operator ships a more robust rule to detect unsynchronised clocks and the default NodeClockNotSynchronising rule is redundant in this case.

I need to add unit rule tests.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci (bot) added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Dec 5, 2023
@simonpasquier
Contributor Author

/cc @midu16

@openshift-ci openshift-ci bot requested a review from midu16 December 6, 2023 08:21
This commit adapts the upstream NodeClockNotSynchronising and
NodeClockSkewDetected rules to be always inactive when the PTP operator
is installed. The PTP operator ships a more robust rule to detect
unsynchronised clocks and the default rules are redundant in this case.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 6, 2023

@simonpasquier: This pull request references MON-3544 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.

In response to this:

This commit adapts the upstream NodeClockNotSynchronising and NodeClockSkewDetected rules to be always inactive when the PTP operator is installed. The PTP operator ships a more robust rule to detect unsynchronised clocks and the default rules are redundant in this case.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@simonpasquier changed the title from "[WIP] MON-3544: Adjust NodeClockNotSynchronising to work with PTP operator" to "[WIP] MON-3544: Adjust NodeClock* alerting rules to work with PTP operator" on Dec 6, 2023
@@ -185,6 +186,7 @@ spec:
and
deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
)
) and on() absent(up{job="ptp-monitor-service"})
Contributor Author

@midu16 can you validate the job label value here? Looking at the telemetry data, I see that the NodeOutOfPtpSync alert fires with this value.

@simonpasquier overall everything looks good. I will be able to do deeper testing and confirm on a live cluster by the end of the week.

Thank you
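
For readers following the hunk above, which lost its indentation, here is a sketch of how the whole NodeClockSkewDetected expression might read after the change. The `0.05` thresholds and the first branch are assumed from the upstream node-exporter mixin and the group name is a placeholder; the second branch and the final `and on() absent(...)` line are the parts visible in the diff. The extra outer parentheses matter because `and` binds tighter than `or` in PromQL, so without them the PTP check would apply only to the second branch.

```yaml
# Sketch only: thresholds and the first branch are assumptions based on the
# upstream rule; only the last five expression lines appear in the hunk.
groups:
  - name: node-exporter
    rules:
      - alert: NodeClockSkewDetected
        expr: |
          (
            (
              node_timex_offset_seconds{job="node-exporter"} > 0.05
            and
              deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
            )
            or
            (
              node_timex_offset_seconds{job="node-exporter"} < -0.05
            and
              deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
            )
          ) and on() absent(up{job="ptp-monitor-service"})
```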

@simonpasquier changed the title from "[WIP] MON-3544: Adjust NodeClock* alerting rules to work with PTP operator" to "MON-3544: Adjust NodeClock* alerting rules to work with PTP operator" on Dec 6, 2023
@openshift-ci (bot) removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 6, 2023
@simonpasquier
Contributor Author

/skip

@simonpasquier
Contributor Author

@midu16 this PR also patches the NodeClockSkewDetected alerting rule, which wasn't mentioned in the original ticket, but I think it makes sense. Can you validate?

@simonpasquier
Contributor Author

/retest-required
/hold waiting for answers from @midu16

@openshift-ci (bot) added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Dec 6, 2023
Contributor

openshift-ci bot commented Dec 6, 2023

@simonpasquier: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-single-node | dc2c689 | link | false | /test e2e-aws-ovn-single-node
ci/prow/versions | dc2c689 | link | false | /test versions

@simonpasquier
Contributor Author

/retest-required

@simonpasquier
Contributor Author

/skip

@simonpasquier
Contributor Author

/hold cancel

@openshift-ci (bot) removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Dec 7, 2023
@jan--f
Contributor

jan--f commented Dec 7, 2023

/lgtm

@@ -1,5 +1,5 @@
rule_files:
- ocpbugs-1453.yaml
- rules.yaml
Contributor

oops, for a long time we skipped this test 😶
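
A promtool unit test covering the new behaviour might look roughly like the sketch below. It assumes `rules.yaml` contains the patched NodeClockNotSynchronising rule with a `for` duration shorter than 20 minutes; the `instance` label and all series values are made up for illustration. With the PTP `up` series present, the alert is expected to stay inactive even though the clock looks unsynchronised.

```yaml
# Sketch only: series names follow the rule expression, the values and the
# instance label are hypothetical.
rule_files:
  - rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Clock reported as unsynchronised with a large maximum error.
      - series: 'node_timex_sync_status{job="node-exporter",instance="node-0"}'
        values: '0+0x30'
      - series: 'node_timex_maxerror_seconds{job="node-exporter",instance="node-0"}'
        values: '16+0x30'
      # The PTP operator is installed, so its scrape target exists.
      - series: 'up{job="ptp-monitor-service"}'
        values: '1+0x30'
    alert_rule_test:
      - eval_time: 20m
        alertname: NodeClockNotSynchronising
        exp_alerts: []
```

A companion test without the `up{job="ptp-monitor-service"}` series would then list the expected firing alert under `exp_alerts`, with the labels and annotations defined in the actual rule.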

@raptorsun
Copy link
Contributor

/lgtm

@openshift-ci (bot) added the lgtm label (indicates that a PR is ready to be merged) on Dec 7, 2023
@machine424
Contributor

/lgtm
suggestion: how about making the PTP operator use an AlertRelabelConfig to disable/adjust those alerts?

Contributor

openshift-ci bot commented Dec 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, machine424, midu16, raptorsun, simonpasquier

Needs approval from an approver in each of these files:
  • OWNERS [jan--f,machine424,raptorsun,simonpasquier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@raptorsun
Contributor

/lgtm suggestion: how about making the PTP operator use an AlertRelabelConfig to disable/adjust those alerts?

It looks like the PTP operator and CMO are watching each other to decide their next move :P

CMO absent(up{job="ptp-monitor-service"})
PTP ALERTS{alertname="NodeClockNotSynchronising"

@simonpasquier
Contributor Author

suggestion: how about making the PTP operator use an AlertRelabelConfig to disable/adjust those alerts?

The change to the upstream alert definition looked minimal enough to me that it was OK, and easier, to tweak the PromQL expression in CMO. Also, with an AlertRelabelConfig the NodeClock* alerts would still be visible in the OCP console, which might create confusion.
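
For context, the alternative suggested above (and not adopted in this PR) could be sketched as an AlertRelabelConfig shipped by the PTP operator. The resource name below is hypothetical and the regex simply lists the two alerts discussed in this thread. As noted in the reply, relabelling of this kind drops the alerts before they reach Alertmanager but leaves the rules themselves in place.

```yaml
# Sketch only: this resource is not part of the PR; the name is hypothetical.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: ptp-drop-node-clock-alerts
  namespace: openshift-monitoring
spec:
  configs:
    # Drop the default clock alerts when this resource is installed.
    - sourceLabels: [alertname]
      regex: "NodeClockNotSynchronising|NodeClockSkewDetected"
      action: Drop
```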

@openshift-merge-bot openshift-merge-bot bot merged commit fb487ff into openshift:master Dec 7, 2023
17 checks passed
@simonpasquier simonpasquier deleted the MON-3544 branch December 7, 2023 15:41
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-monitoring-operator-container-v4.16.0-202312071732.p0.gfb487ff.assembly.stream for distgit cluster-monitoring-operator.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
7 participants