OVN: Fix control plane metrics alert job name #742
Conversation
@tssurya I would actually use "ovnkube-db", since that's the service that actually fronts the master pods. If we use the metrics service, then technically we're just watching to see if the metrics proxy pods are running, I think. And the ovnkube master pods and the metrics proxy pods are independent DaemonSets.
```diff
@@ -17,7 +17,7 @@ spec:
       message: |
         there is no running ovn-kubernetes master
       expr: |
-        absent(up{job="ovnkube-master",namespace="openshift-ovn-kubernetes"} ==1)
+        absent(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} ==1)
```
If the job name is the name of the service, should it be `ovn-kubernetes-master-metrics` instead, as per https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/monitor.yaml#L33?
The job name is the name of a Kube Service.
I think we should target the "ovnkube-db" service, though, per #742 (comment).
I tried changing it to `ovnkube-db`, but the alert still fires. From what I understand, I might have to add an additional ServiceMonitor for us to use this service (https://coreos.com/blog/the-prometheus-operator.html).

Also, I think it's OK to leave it as `ovnkube-master-metrics`, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 21902; it should be checking for the master pods' endpoint while scraping, right?
Also, I think the job name could basically be the label defined in the `jobLabel` field of the respective ServiceMonitor (https://coreos.com/operators/prometheus/docs/latest/api.html#servicemonitorspec).
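To illustrate the `jobLabel` mechanics, here is a minimal ServiceMonitor sketch; the object name, port name, scrape interval, and label values are assumptions for illustration, not taken from this repo:

```yaml
# Sketch: the Prometheus Operator takes the scrape job name from the value of
# the Service label named in jobLabel; if jobLabel is unset, the Service name
# is used instead.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: monitor-ovn-master-metrics   # hypothetical name
  namespace: openshift-ovn-kubernetes
spec:
  jobLabel: app                      # job name becomes the value of the "app" label
  endpoints:
    - port: metrics                  # assumed port name on the Service
      interval: 30s
  selector:
    matchLabels:
      app: ovnkube-master-metrics
```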
> Also, I think it's OK to leave it as `ovnkube-master-metrics`, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 21902; it should be checking for the master pods' endpoint while scraping, right?

This assumption is right, as in the alert gets fired even if the master pods go down.
/retest
Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be
An update: we should really alert on both conditions: a failure to scrape and no ovnkube master leader.
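A rough sketch of what the leader half of that could look like, assuming the masters export a leader gauge; the metric name `ovnkube_master_leader` and the rule details are assumptions, not confirmed in this thread:

```yaml
# Sketch: fires when no master instance reports itself as leader.
# The metric name is an assumption. Note that if the metric is absent
# entirely this expression returns no samples, which is why the absent()
# rule above is still needed to cover the failure-to-scrape case.
- alert: NoOvnMasterLeader
  annotations:
    message: |
      there is no ovn-kubernetes master leader
  expr: |
    max(ovnkube_master_leader) != 1
  for: 10m
```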
this metric
Ah OK, I guess it's better to add this as well. From what I understand, this alert will be fired if the leader election fails? Err, actually I don't know if there is leader election in OVN like in SDN. Let me check.
Yeah, there is. That is why the GCP job alert was firing in the first place. In GCP there is some temporary outage of the kube-apiserver, and the leader fails to renew its leadership lease. When that happens it shuts down, restarts, and becomes leader again. The entire process takes about a second. This causes the alert to fire. However, the alert clause, IIUC, says `for: 10m`, which means that as long as the pod comes back up within a 10-minute period the alert should not fire. The question is: why does the alert fire if we were only down for a second? Does Prometheus correctly detect that the pod is back up?
Pending alerts will show up in the interface (but not actually fire), so maybe that's what's happening?
I think it is firing.
Yes, Prometheus does go back to the pending state from firing if the pod is back up, or at least it did in my nightly cluster.
So the current alert as it stands should just pick up
IMO this can be 3 different things, in priority order. Each one can be a separate alert and thus its own PR. #1 is the most pressing need, to get our CI flakes taken care of. So let's do that first, and then file Jira cards for the others?
```diff
 ---
 apiVersion: v1
 kind: Service
 metadata:
   labels:
-    app: ovnkube-master-metrics
+    app: ovnkube-master
```
This makes sense. This needs to change to match `app: ovnkube-master`.
After our call yesterday, I tried it, but it looks like the ports are not orthogonal. I haven't yet found clear documentation on Prometheus that explains this; I have asked the monitoring team to confirm. So basically the scraping works only if I add a `metrics` port to the `ovnkube-db` service as well.

The way I see it, we now have two options:

1. We go back to my previous solution, where we sort of umbrella all 6 pods under the `ovn-kubernetes-master-metrics` service and don't use the `ovnkube-db` service at all when it comes to metrics scraping.
2. We use both `ovnkube-db` and `ovn-kubernetes-master-metrics`, but then define/expose the 9102 port on both.

I don't know if this will have an (undoing) effect on the TLS work done and all the http/https scraping stuff.
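For option 2, a hedged sketch of what exposing the metrics port on `ovnkube-db` might look like; the selector, namespace, and port numbers are assumptions pieced together from this thread, not the actual manifest:

```yaml
# Option 2 sketch: add the metrics port to the ovnkube-db Service as well,
# so the ServiceMonitor can scrape through it. Selector and ports are
# assumptions.
apiVersion: v1
kind: Service
metadata:
  name: ovnkube-db
  namespace: openshift-ovn-kubernetes
spec:
  selector:
    app: ovnkube-master
  ports:
    - name: metrics
      port: 9102
      targetPort: 9102
```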
Following the change in PR openshift#581 the job name in the alert rule should be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
So we tested this. Even if we use the metrics service, when the master pods go down it will be detected, since the 9102 port used for scraping is proxying the 29102 localhost port.
/assign @abhat

/test e2e-gcp-ovn

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhat, tssurya

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/retest Please review the full test history for this PR and help us cut down flakes.
/test e2e-gcp-ovn

/retest Please review the full test history for this PR and help us cut down flakes.
/override ci/prow/e2e-gcp-ovn last 4 gcp test failures have either been
@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-gcp-ovn

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest Please review the full test history for this PR and help us cut down flakes.
@tssurya: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.
Following the change in PR #581 the job name should be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>