MON-3544: Adjust NodeClock* alerting rules to work with PTP operator #2182
@simonpasquier: This pull request references MON-3544 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.15.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @midu16
This commit adapts the upstream NodeClockNotSynchronising and NodeClockSkewDetected rules to be always inactive when the PTP operator is installed. The PTP operator ships a more robust rule to detect unsynchronised clocks and the default rules are redundant in this case.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
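As a rough illustration of the change described in the commit message, the patched NodeClockNotSynchronising rule could look like the sketch below. Only the trailing `and on() absent(up{job="ptp-monitor-service"})` clause comes from this PR's diff. The base expression, the thresholds, the `for` duration, the severity label, and the group name are assumptions modelled on the upstream node-exporter mixin and may not match what CMO actually ships.

```yaml
# Sketch only. Everything except the final "and on() absent(...)" clause is an
# assumption based on the upstream node-exporter mixin, not copied from this PR.
groups:
  - name: node-exporter
    rules:
      - alert: NodeClockNotSynchronising
        # absent() returns a single sample only while no series matches
        # up{job="ptp-monitor-service"}. "and on()" matches on an empty label
        # set, so the left-hand side is kept only while that sample exists,
        # i.e. only while no PTP monitoring target is being scraped.
        expr: |
          (
            min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
            and
            node_timex_maxerror_seconds{job="node-exporter"} >= 16
          ) and on() absent(up{job="ptp-monitor-service"})
        for: 10m
        labels:
          severity: warning
```

Because `absent()` only checks for the existence of the series, the rule stays inactive whenever an `up` series with that job label exists, regardless of its value.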
5b0b93d to dc2c689
@@ -185,6 +186,7 @@ spec:
and
deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
)
) and on() absent(up{job="ptp-monitor-service"})
@midu16 can you validate the job label value here? Looking at the telemetry data, I see that the NodeOutOfPtpSync alert fires with this value.
@simonpasquier overall everything looks good. I should be able to do deeper testing/confirmation in a live cluster by the end of the week.
Thank you
/skip
@midu16 this PR also patches the NodeClockSkewDetected alerting rule, which wasn't mentioned in the original ticket, but I think that it makes sense. Can you validate?
/retest-required
@simonpasquier: The following tests failed, say
/retest-required
/skip
/hold cancel
/lgtm
@@ -1,5 +1,5 @@
rule_files:
- ocpbugs-1453.yaml
- rules.yaml
oops, for a long time we skipped this test 😶
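For reference, a promtool unit test covering the PTP case might look like the sketch below. It is not the test file from this PR: the `rules.yaml` name is taken from the diff above, while the `instance` label, the sample values, and the evaluation time are hypothetical and chosen only so that the unpatched rule would have fired.

```yaml
# Hypothetical promtool unit test; series, values and timings are assumptions.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Clock reported as unsynchronised with a large maximum error.
      - series: 'node_timex_sync_status{job="node-exporter", instance="node-0"}'
        values: '0+0x20'
      - series: 'node_timex_maxerror_seconds{job="node-exporter", instance="node-0"}'
        values: '16+0x20'
      # A PTP monitoring target exists, so absent(up{job="ptp-monitor-service"})
      # returns nothing and the patched expression evaluates to an empty result.
      - series: 'up{job="ptp-monitor-service"}'
        values: '1+0x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: NodeClockNotSynchronising
        exp_alerts: []
```

Such a file would be run with `promtool test rules <file>`. Dropping the `up{job="ptp-monitor-service"}` series from `input_series` should make the alert fire again, provided the rule in `rules.yaml` matches these input series.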
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jan--f, machine424, midu16, raptorsun, simonpasquier
The change to the upstream alert definition looked minimal enough to me that it was OK (and easier) to tweak the PromQL expression in CMO. Also, the NodeClock* alerts would still be visible in the OCP console, which might create confusion.
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-monitoring-operator-container-v4.16.0-202312071732.p0.gfb487ff.assembly.stream for distgit cluster-monitoring-operator.