
OVN: Fix control plane metrics alert job name #742

Conversation

@tssurya (Contributor) commented Aug 3, 2020

Following the change in PR #581, the job name should be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman suryaseetharaman.9@gmail.com

@dcbw (Member) commented Aug 3, 2020

@tssurya I would actually use "ovnkube-db", since that's the service that actually fronts the master pods. If we use the metrics service, then technically we're just watching to see if the metrics proxy pods are running, I think. And the ovnkube master pods and the metrics proxy pods are independent daemonsets.

@@ -17,7 +17,7 @@ spec:
       message: |
         there is no running ovn-kubernetes master
       expr: |
-        absent(up{job="ovnkube-master",namespace="openshift-ovn-kubernetes"} ==1)
+        absent(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} ==1)
Contributor:

If the job name is the name of the service, should it be ovn-kubernetes-master-metrics instead? As per https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/monitor.yaml#L33

Member:

The job name is the name of a Kube Service.

Member:

I think we should target the "ovnkube-db" service though, per #742 (comment).

Contributor Author:

I tried changing it to ovnkube-db, but the alert still fires. From what I understand, I might have to add an additional ServiceMonitor for us to use this service (https://coreos.com/blog/the-prometheus-operator.html).

Also, I think it's OK to leave it as ovnkube-master-metrics, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 29102; it should be checking the master pods' endpoint while scraping, right?

Contributor Author:

Also, I think the job name could basically be the label defined in the jobLabel field of the respective ServiceMonitor (https://coreos.com/operators/prometheus/docs/latest/api.html#servicemonitorspec).
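
A rough sketch of what that could look like in a ServiceMonitor, assuming the metrics Service carries an app: ovnkube-master-metrics label and a named metrics port; the name, interval, scheme, and port below are assumptions, not copied from the actual monitor.yaml. With jobLabel set, scraped series get their job label from the value of that Service label; if jobLabel is unset, the Prometheus Operator falls back to the Service name.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: monitor-ovn-master-metrics   # illustrative name
      namespace: openshift-ovn-kubernetes
    spec:
      jobLabel: app                      # job label taken from the Service's "app" label
      endpoints:
      - interval: 30s
        port: metrics                    # must match a named port on the Service
        scheme: https
      selector:
        matchLabels:
          app: ovnkube-master-metrics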

@tssurya (Contributor Author), Aug 3, 2020:

> Also, I think it's OK to leave it as ovnkube-master-metrics, since in this instance the metrics pod's endpoint at port 9102 is indirectly connected to the master container's 29102; it should be checking the master pods' endpoint while scraping, right?

This assumption is right: the alert gets fired even when the master pods go down.

@dcbw (Member) commented Aug 4, 2020

/retest

1 similar comment
@tssurya (Contributor Author) commented Aug 4, 2020

/retest

@squeed (Contributor) commented Aug 4, 2020

Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.
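
A rough sketch of such a rule, assuming ovnkube_master_leader is exposed as a 0/1 gauge; the alert name, message, and for/severity values here are illustrative, and note that the PromQL comparison operator is ==:

        - alert: NoOvnMasterLeader       # illustrative name
          annotations:
            message: |
              there is no ovn-kubernetes master leader
          expr: |
            max(ovnkube_master_leader) == 0
          for: 10m
          labels:
            severity: warning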

@squeed (Contributor) commented Aug 4, 2020

An update: we should really alert on both conditions: a failure to scrape and no ovnkube master leader.

@tssurya (Contributor Author) commented Aug 4, 2020

> An update: we should really alert on both conditions: a failure to scrape and no ovnkube master leader.

This metric, up, basically fails for both: if the endpoint is not configured properly and scraping fails, and if there is no ovnkube master pod, which again results in not being able to reach the pod.

@tssurya (Contributor Author) commented Aug 4, 2020

> Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.

Ah OK, I guess it's better to add this as well. From what I understand, this alert will be fired if the leader election fails? Err, actually I don't know if there is leader election in OVN like in SDN. Let me check.

@trozet (Contributor) commented Aug 4, 2020

> Hang on, I'm not sure this is the best way to write this alert. We should instead be alerting on whether or not there is a leader. The query can then just be max(ovnkube_master_leader) = 0.
>
> Ah OK, I guess it's better to add this as well. From what I understand, this alert will be fired if the leader election fails? Err, actually I don't know if there is leader election in OVN like in SDN. Let me check.

Yeah, there is. That is why the GCP job alert was firing in the first place. In GCP there is some temporary outage of the kapi server and the leader fails to renew its leadership lease. When that happens it shuts down, restarts, and becomes leader again. The entire process takes about a second. This causes the alert to fire. However, the alert clause IIUC says "for 10m", which means as long as the pod comes back up within a 10-minute period the alert should not fire. The question is why the alert fires if we were only down for a second. Does Prometheus correctly detect that the pod is back up?

@squeed (Contributor) commented Aug 4, 2020

Pending alerts will show up in the interface (but not actually fire), so maybe that's what's happening?

@trozet (Contributor) commented Aug 4, 2020

> Pending alerts will show up in the interface (but not actually fire), so maybe that's what's happening?

I think it is firing:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1289934434367180800

fail [github.com/openshift/origin@/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"NoRunningOvnMaster\",\"alertstate\":\"firing\",\"severity\":\"warning\"},\"value\":[1596381630.6,\"1\"]}]",
        },
    }
to be empty

@tssurya force-pushed the change-ovn-master-metrics-jobname branch from 6062777 to a9ea052 on August 4, 2020 13:24
@tssurya (Contributor Author) commented Aug 4, 2020

> Yeah, there is. That is why the GCP job alert was firing in the first place. In GCP there is some temporary outage of the kapi server and the leader fails to renew its leadership lease. When that happens it shuts down, restarts, and becomes leader again. The entire process takes about a second. This causes the alert to fire. However, the alert clause IIUC says "for 10m", which means as long as the pod comes back up within a 10-minute period the alert should not fire. The question is why the alert fires if we were only down for a second. Does Prometheus correctly detect that the pod is back up?

Yes, Prometheus does come back to the pending state from firing if the pod is back up, or at least it did in my nightly cluster.
Btw, this particular alert, the NoRunningOVNKubeMaster, was always firing because it was incorrectly configured.

@abhat (Contributor) commented Aug 4, 2020

So the current alert as it stands should just pick up ovnkube-master pods being down. Like @squeed mentioned, we need a separate alert for when there is no leader. And I am not so sure we need to alert if the metrics scraper pods are missing. I mean, it's good to know if the ovnkube-master-metrics pods are down, but the real deal is the ovnkube-master pods being down. Thoughts?

@dcbw (Member) commented Aug 4, 2020

> So the current alert as it stands should just pick up ovnkube-master pods being down. Like @squeed mentioned, we need a separate alert for when there is no leader. And I am not so sure we need to alert if the metrics scraper pods are missing. I mean, it's good to know if the ovnkube-master-metrics pods are down, but the real deal is the ovnkube-master pods being down. Thoughts?

IMO this can be 3 different things, in priority order:

  1. alert when no ovnkube master pod is running in the cluster
  2. alert when there is no ovnkube master leader
  3. metrics pods are down

Each one can be a separate alert and thus PR. #1 is the most pressing need to get our CI flakes taken care of. So let's do that first, and then file Jira cards for the others?
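
Sketching the three as one-line expressions, just to make the split concrete (the expressions for #2 and #3 are illustrative guesses for the follow-up work, and the expected replica count of 3 is an assumption):

    # 1. no ovnkube master pod running (this PR)
    absent(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} == 1)

    # 2. no ovnkube master leader (follow-up)
    max(ovnkube_master_leader) == 0

    # 3. metrics/scrape targets partially down (follow-up)
    count(up{job="ovnkube-master-metrics",namespace="openshift-ovn-kubernetes"} == 1) < 3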

@tssurya force-pushed the change-ovn-master-metrics-jobname branch from a9ea052 to 95f303f on August 4, 2020 18:23
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
-       app: ovnkube-master-metrics
+       app: ovnkube-master
@abhat (Contributor), Aug 4, 2020:

This makes sense. This needs to change to match the app: ovnkube-master label.

Contributor Author:

After our call yesterday, I tried it, but it looks like the ports are not orthogonal. I haven't yet found clear documentation on Prometheus that explains this; I have asked the monitoring team to confirm. So basically, the scraping works only if I add a metrics port to the ovnkube-db service as well.

So the way I see it, now we have two options:

  1. Either we go back to my previous solution, where we sort of umbrella all 6 pods under the ovn-kubernetes-master-metrics service and do not use the ovnkube-db service at all when it comes to metrics scraping.
  2. We use both ovnkube-db and ovn-kubernetes-master-metrics, but then define/expose the 9102 port on both (rough sketch below). I don't know if this will have an (undoing) effect on the TLS work done and all the http/https scraping stuff.
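
For option 2, the change would be roughly along these lines; this is only a sketch, the selector and the existing port list on ovnkube-db are assumptions, and it deliberately ignores the TLS/https question above:

    apiVersion: v1
    kind: Service
    metadata:
      name: ovnkube-db
      namespace: openshift-ovn-kubernetes
    spec:
      selector:
        app: ovnkube-master        # assumed selector; existing DB ports omitted
      ports:
      # ...existing nb/sb DB ports...
      - name: metrics              # the additional scrape port option 2 would add
        port: 9102
        targetPort: 9102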

Following the change in PR openshift#581 the job name in the alert rule should
be renamed to "ovnkube-master-metrics".

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@tssurya force-pushed the change-ovn-master-metrics-jobname branch from 95f303f to ea72c82 on August 5, 2020 20:03
@tssurya (Contributor Author) commented Aug 6, 2020

> @tssurya I would actually use "ovnkube-db", since that's the service that actually fronts the master pods. If we use the metrics service, then technically we're just watching to see if the metrics proxy pods are running, I think. And the ovnkube master pods and the metrics proxy pods are independent daemonsets.

So we tested this. Even if we use the metrics service, when the master pods go down it will be detected, since the 9102 port used for scraping proxies the master's 29102 localhost port.

@tssurya (Contributor Author) commented Aug 6, 2020

/assign @abhat

@tssurya (Contributor Author) commented Aug 6, 2020

/test e2e-gcp-ovn

@abhat (Contributor) commented Aug 6, 2020

/lgtm
/retest

@openshift-ci-robot added the lgtm label on Aug 6, 2020
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhat, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@tssurya (Contributor Author) commented Aug 10, 2020

/test e2e-gcp-ovn

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

8 similar comments

@dcbw (Member) commented Aug 10, 2020

/override ci/prow/e2e-gcp-ovn

The last 4 GCP test failures have either been [sig-storage] PersistentVolumes [Feature:vsphere][Feature:ReclaimPolicy] [sig-storage] persistentvolumereclaim:vsphere [Feature:vsphere] should retain persistent volume when reclaimPolicy set to retain when associated claim is deleted [Suite:openshift/conformance/parallel] [Suite:k8s] or a few storage-related ones about local volumes.

@openshift-ci-robot (Contributor):

@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-gcp-ovn

In response to this:

> /override ci/prow/e2e-gcp-ovn
>
> The last 4 GCP test failures have either been [sig-storage] PersistentVolumes [Feature:vsphere][Feature:ReclaimPolicy] [sig-storage] persistentvolumereclaim:vsphere [Feature:vsphere] should retain persistent volume when reclaimPolicy set to retain when associated claim is deleted [Suite:openshift/conformance/parallel] [Suite:k8s] or a few storage-related ones about local volumes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments

@openshift-ci-robot (Contributor) commented Aug 10, 2020

@tssurya: The following tests failed, say /retest to rerun all failed tests:

Test name                              Commit    Rerun command
ci/prow/e2e-ovn-hybrid-step-registry   ea72c82   /test e2e-ovn-hybrid-step-registry
ci/prow/e2e-ovn-step-registry          ea72c82   /test e2e-ovn-step-registry
ci/prow/e2e-vsphere                    ea72c82   /test e2e-vsphere

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit c2e5f8a into openshift:master on Aug 10, 2020
Labels: approved, lgtm

8 participants