
OVN: Fix control plane metrics alert job name #742

Conversation

@tssurya (Contributor) commented Aug 3, 2020

Following the change in PR #581, the job name should be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman suryaseetharaman.9@gmail.com

@dcbw (Member) commented Aug 3, 2020

@tssurya I would actually use "ovnkube-db", since that's the service that actually fronts the master pods. If we use the metrics service, then technically we're just watching to see if the metrics proxy pods are running, I think. And the ovnkube master pods and the metrics proxy pods are independent daemonsets.

@@ -17,7 +17,7 @@ spec:
       message: |
         there is no running ovn-kubernetes master
       expr: |
-        absent(up{job="ovnkube-master",namespace="openshift-ovn-kubernetes"} ==1)
+        absent(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} ==1)
Contributor:

If the job name is the name of the service, should it be ovn-kubernetes-master-metrics instead? As per https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/monitor.yaml#L33

Member:

The job name is the name of a Kube Service.

Member:

I think we should target the "ovnkube-db" service though, per #742 (comment).

Contributor Author:

I tried changing it to ovnkube-db, but the alert still fires. From what I understand, I might have to add an additional ServiceMonitor for us to use this service (https://coreos.com/blog/the-prometheus-operator.html).

Also, I think it's OK to leave it as ovnkube-master-metrics, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 29102; it should be checking the master pods' endpoint while scraping, right?

Contributor Author:

Also, I think the job name could basically be the label defined in the jobLabel field of the respective ServiceMonitor (https://coreos.com/operators/prometheus/docs/latest/api.html#servicemonitorspec).
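
A rough sketch of what that could look like in a ServiceMonitor, assuming the metrics Service carries an app: ovnkube-master-metrics label and a named metrics port; the name, interval, scheme, and port below are assumptions, not copied from the actual monitor.yaml. With jobLabel set, scraped series get their job label from the value of that Service label; if jobLabel is unset, the Prometheus Operator falls back to the Service name.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: monitor-ovn-master-metrics   # illustrative name
      namespace: openshift-ovn-kubernetes
    spec:
      jobLabel: app                      # job label taken from the Service's "app" label
      endpoints:
      - interval: 30s
        port: metrics                    # must match a named port on the Service
        scheme: https
      selector:
        matchLabels:
          app: ovnkube-master-metrics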

@tssurya (Contributor Author), Aug 3, 2020:

> Also, I think it's OK to leave it as ovnkube-master-metrics, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 29102; it should be checking the master pods' endpoint while scraping, right?

This assumption is right: the alert gets fired even when the master pods go down.

@dcbw (Member) commented Aug 4, 2020

/retest

1 similar comment
@tssurya (Contributor Author) commented Aug 4, 2020

/retest

@squeed (Contributor) commented Aug 4, 2020

Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.
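
A rough sketch of such a rule, assuming ovnkube_master_leader is exposed as a 0/1 gauge; the alert name, message, and for/severity values here are illustrative, and note that the PromQL comparison operator is ==:

        - alert: NoOvnMasterLeader       # illustrative name
          annotations:
            message: |
              there is no ovn-kubernetes master leader
          expr: |
            max(ovnkube_master_leader) == 0
          for: 10m
          labels:
            severity: warning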

@squeed (Contributor) commented Aug 4, 2020

An update: we should really alert on both conditions: a failure to scrape and no ovnkube master leader.

@tssurya (Contributor Author) commented Aug 4, 2020

> An update: we should really alert on both conditions: a failure to scrape and no ovnkube master leader.

This metric, up, basically fails for both: if the endpoint is not configured properly and scraping fails, and if there is no ovnkube master pod, which again results in not being able to reach the pod.

@tssurya (Contributor Author) commented Aug 4, 2020

> Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.

Ah OK, I guess it's better to add this as well. From what I understand, this alert will be fired if the leader election fails? Err, actually I don't know if there is leader election in OVN like in SDN. Let me check.

@trozet (Contributor) commented Aug 4, 2020

> Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.
>
> Ah OK, I guess it's better to add this as well. From what I understand, this alert will be fired if the leader election fails? Err, actually I don't know if there is leader election in OVN like in SDN. Let me check.

Yeah, there is. That is why the GCP job alert was firing in the first place. In GCP there is some temporary outage of the kapi server and the leader fails to renew its leadership lease. When that happens it shuts down, restarts, and becomes leader again. The entire process takes about a second. This causes the alert to fire. However, the alert clause IIUC says "for 10m", which means as long as the pod comes back up within a 10-minute period the alert should not fire. The question is why the alert fires if we were only down for a second. Does Prometheus correctly detect that the pod is back up?

@squeed (Contributor) commented Aug 4, 2020

Pending alerts will show up in the interface (but not actually fire), so maybe that's what's happening?

@trozet (Contributor) commented Aug 4, 2020

> Pending alerts will show up in the interface (but not actually fire), so maybe that's what's happening?

I think it is firing:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1289934434367180800

fail [github.com/openshift/origin@/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"NoRunningOvnMaster\",\"alertstate\":\"firing\",\"severity\":\"warning\"},\"value\":[1596381630.6,\"1\"]}]",
        },
    }
to be empty

@tssurya force-pushed the change-ovn-master-metrics-jobname branch from 6062777 to a9ea052 on August 4, 2020 13:24
@tssurya (Contributor Author) commented Aug 4, 2020

> Yeah, there is. That is why the GCP job alert was firing in the first place. In GCP there is some temporary outage of the kapi server and the leader fails to renew its leadership lease. When that happens it shuts down, restarts, and becomes leader again. The entire process takes about a second. This causes the alert to fire. However, the alert clause IIUC says "for 10m", which means as long as the pod comes back up within a 10-minute period the alert should not fire. The question is why the alert fires if we were only down for a second. Does Prometheus correctly detect that the pod is back up?

Yes, Prometheus does come back to the pending state from firing if the pod is back up, or at least it did in my nightly cluster.
Btw, this particular alert, the NoRunningOVNKubeMaster, was always firing because it was incorrectly configured.

@abhat (Contributor) commented Aug 4, 2020

So the current alert as it stands should just pick up ovnkube-master pods being down. Like @squeed mentioned, we need a separate alert for when there is no leader. And I am not so sure we need to alert if the metrics scraper pods are missing. I mean, it's good to know if the ovnkube-master-metrics pods are down, but the real deal is the ovnkube-master pods being down. Thoughts?

@dcbw (Member) commented Aug 4, 2020

> So the current alert as it stands should just pick up ovnkube-master pods being down. Like @squeed mentioned, we need a separate alert for when there is no leader. And I am not so sure we need to alert if the metrics scraper pods are missing. I mean, it's good to know if the ovnkube-master-metrics pods are down, but the real deal is the ovnkube-master pods being down. Thoughts?

IMO this can be 3 different things, in priority order:

  1. alert when no ovnkube master pod is running in the cluster
  2. alert when there is no ovnkube master leader
  3. metrics pods are down

Each one can be a separate alert and thus PR. #1 is the most pressing need to get our CI flakes taken care of. So let's do that first, and then file Jira cards for the others?
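
Sketching the three as one-line expressions, just to make the split concrete (the expressions for #2 and #3 are illustrative guesses for the follow-up work, and the expected replica count of 3 is an assumption):

    # 1. no ovnkube master pod running (this PR)
    absent(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} == 1)

    # 2. no ovnkube master leader (follow-up)
    max(ovnkube_master_leader) == 0

    # 3. metrics/scrape targets partially down (follow-up)
    count(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} == 1) < 3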

@tssurya force-pushed the change-ovn-master-metrics-jobname branch from a9ea052 to 95f303f on August 4, 2020 18:23
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
-       app: ovnkube-master-metrics
+       app: ovnkube-master
@abhat (Contributor), Aug 4, 2020:

This makes sense. This needs to change to match the app: ovnkube-master label.

Contributor Author:

After our call yesterday, I tried it, but it looks like the ports are not orthogonal. I haven't yet found clear documentation on Prometheus that explains this; I have asked the monitoring team to confirm. So basically, the scraping works only if I add a metrics port to the ovnkube-db service as well.

So the way I see it, now we have two options:

  1. Either we go back to my previous solution, where we sort of umbrella all 6 pods under the ovn-kubernetes-master-metrics service and do not use the ovnkube-db service at all when it comes to metrics scraping.
  2. We use both ovnkube-db and ovn-kubernetes-master-metrics, but then define/expose the 9102 port on both (rough sketch below). I don't know if this will have an (undoing) effect on the TLS work done and all the http/https scraping stuff.
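
For option 2, the change would be roughly along these lines; this is only a sketch, the selector and the existing port list on ovnkube-db are assumptions, and it deliberately ignores the TLS/https question above:

    apiVersion: v1
    kind: Service
    metadata:
      name: ovnkube-db
      namespace: openshift-ovn-kubernetes
    spec:
      selector:
        app: ovnkube-master        # assumed selector; existing DB ports omitted
      ports:
      # ...existing nb/sb DB ports...
      - name: metrics              # the additional scrape port option 2 would add
        port: 9102
        targetPort: 9102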

Following the change in PR openshift#581 the job name in the alert rule should
be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@tssurya force-pushed the change-ovn-master-metrics-jobname branch from 95f303f to ea72c82 on August 5, 2020 20:03
@tssurya (Contributor Author) commented Aug 6, 2020

> @tssurya I would actually use "ovnkube-db", since that's the service that actually fronts the master pods. If we use the metrics service, then technically we're just watching to see if the metrics proxy pods are running, I think. And the ovnkube master pods and the metrics proxy pods are independent daemonsets.

So we tested this. Even if we use the metrics service, when the master pods go down it will be detected, since the 9102 port used for scraping proxies the master's 29102 localhost port.

@tssurya (Contributor Author) commented Aug 6, 2020

/assign @abhat

@tssurya (Contributor Author) commented Aug 6, 2020

/test e2e-gcp-ovn

@abhat (Contributor) commented Aug 6, 2020

/lgtm
/retest

@openshift-ci-robot added the lgtm label on Aug 6, 2020
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhat, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@tssurya (Contributor Author) commented Aug 10, 2020

/test e2e-gcp-ovn

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

8 similar comments

@dcbw (Member) commented Aug 10, 2020

/override ci/prow/e2e-gcp-ovn

The last 4 GCP test failures have either been [sig-storage] PersistentVolumes [Feature:vsphere][Feature:ReclaimPolicy] [sig-storage] persistentvolumereclaim:vsphere [Feature:vsphere] should retain persistent volume when reclaimPolicy set to retain when associated claim is deleted [Suite:openshift/conformance/parallel] [Suite:k8s] or a few storage-related ones about local volumes.

@openshift-ci-robot (Contributor):

@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-gcp-ovn

In response to this:

> /override ci/prow/e2e-gcp-ovn
>
> The last 4 GCP test failures have either been [sig-storage] PersistentVolumes [Feature:vsphere][Feature:ReclaimPolicy] [sig-storage] persistentvolumereclaim:vsphere [Feature:vsphere] should retain persistent volume when reclaimPolicy set to retain when associated claim is deleted [Suite:openshift/conformance/parallel] [Suite:k8s] or a few storage-related ones about local volumes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments

@openshift-ci-robot (Contributor) commented Aug 10, 2020

@tssurya: The following tests failed, say /retest to rerun all failed tests:

Test name                              Commit    Rerun command
ci/prow/e2e-ovn-hybrid-step-registry   ea72c82   /test e2e-ovn-hybrid-step-registry
ci/prow/e2e-ovn-step-registry          ea72c82   /test e2e-ovn-step-registry
ci/prow/e2e-vsphere                    ea72c82   /test e2e-vsphere

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit c2e5f8a into openshift:master on Aug 10, 2020
Labels: approved, lgtm

8 participants