node-exporter does not come up on openshift e2e runs #85

Closed · smarterclayton opened this issue Sep 3, 2018 · 17 comments

@smarterclayton (Contributor)

I switched our prometheus e2e tests to use the cluster monitoring operator and I'm seeing failures in about one in four runs. The most noticeable is that one run didn't have node-exporter installed (no pods created).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/

/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus.go:49
Expected
    <[]error | len:1, cap:1>: [
        {
            s: "no match for map[job:node-exporter] with health up and scrape URL ^https://.*/metrics$",
        },
    ]
to be empty
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus.go:123

In this run the e2e tests start at 13:37, but the prometheus test isn't run until 13:45, which should be more than enough time for node-exporter to come up. I see no pods created, which implies that either the daemonset wasn't created or the daemonset failed massively. I see no events for the daemonset in
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json, which implies it wasn't created.
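One way to double-check that from the CI artifacts (a rough sketch: it assumes jq is available, that events.json is the usual List object with an items array, and that the daemonset is named node-exporter):

# Print any events recorded for the node-exporter daemonset; no output suggests it was never created.
jq '.items[] | select(.involvedObject.kind == "DaemonSet" and .involvedObject.name == "node-exporter") | .message' events.json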

I see the following in the logs for the prometheus operator (which seems bad), but nothing excessive in the cluster monitoring operator logs.

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_prometheus-operator-6c9fddd47f-mb4br_prometheus-operator.log.gz

W0903 14:01:42.210505       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:01:43.178075004Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:01:43.178176       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.178307       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.196450       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.222385       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.222448       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:01:43.222295876Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:01:43.240970       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:03.033696       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:03.033607297Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:02:03.033767       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:03.048325       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:19.767518749Z caller=operator.go:396 component=alertmanageroperator msg="sync alertmanager" key=openshift-monitoring/main
W0903 14:02:45.489186       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:45.489268       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:45.489057156Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:02:45.504357       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
@smarterclayton (Contributor, Author)

@brancz this is a significant source of flakes in the 3.11 e2e runs.

@brancz (Contributor) commented Sep 3, 2018

@mxinden those warnings were recently fixed, no? Did we update the Prometheus Operator version?

@brancz (Contributor) commented Sep 3, 2018

Something I can think of off the top of my head: can you check whether the annotation that keeps pods from being forced onto the worker nodes is set?

// edit

This one:

oc annotate ns/openshift-monitoring openshift.io/node-selector=

but it looks like we already do that in the ansible roles:

https://github.com/openshift/openshift-ansible/blob/b4443ae1bf649c61a1a1ca04076e82ce69d45520/roles/openshift_cluster_monitoring_operator/tasks/install.yaml#L44
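For reference, a quick way to verify the annotation is actually present on a live cluster (a sketch; it assumes a standard oc client):

# Show the node-selector annotation on the namespace, if any.
oc get namespace openshift-monitoring -o yaml | grep 'openshift.io/node-selector'

A line like openshift.io/node-selector: "" means the default project node selector is overridden, so the daemonset pods are not restricted to worker nodes.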

@smarterclayton (Contributor, Author)

We'd be getting events if we couldn't schedule.
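For completeness, scheduling failures would also show up in the events.json artifact linked above; a quick way to confirm there are none (again assuming jq and the standard List format):

# List messages of any FailedScheduling events; empty output means scheduling was not the problem.
jq '.items[] | select(.reason == "FailedScheduling") | .message' events.json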

@brancz (Contributor) commented Sep 3, 2018

Ah I actually found something in the logs of the cluster-monitoring-operator:

E0903 13:43:24.094244       1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed                                                                           
E0903 13:43:24.187314       1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition

What do the Alertmanager logs say, and do the Pods from its StatefulSet even get created/scheduled?
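For anyone reproducing this, a rough sketch of those checks (object, label, and container names are assumed to be the defaults created by the operator):

# Does the StatefulSet exist and how many replicas are ready?
oc -n openshift-monitoring get statefulset alertmanager-main
# Were its pods created and scheduled onto nodes?
oc -n openshift-monitoring get pods -l alertmanager=main -o wide
# What does Alertmanager itself log?
oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager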

@mxinden (Contributor) commented Sep 3, 2018

For the record, "Index with name namespace does not exist" was fixed in prometheus-operator/prometheus-operator#1706. It should just be ignored, as client-go falls back to simply iterating the given store.

s-urbaniak self-assigned this Sep 5, 2018
@smarterclayton (Contributor, Author)

@s-urbaniak (Contributor)

@smarterclayton Thanks a lot for the logs and pod traces, this helps tremendously! 👍

I think we can narrow it down to the cluster monitoring operator never reaching the point where it creates the node-exporter daemonset at all.

In fact it also doesn't create the kube-state-metrics deployment (both are missing from the pods overview).

While looking at the operator logic:

tasks.NewTaskSpec("Updating Prometheus Operator", tasks.NewPrometheusOperatorTask(o.client, factory)),
tasks.NewTaskSpec("Updating Cluster Monitoring Operator", tasks.NewClusterMonitoringOperatorTask(o.client, factory)),
tasks.NewTaskSpec("Updating Grafana", tasks.NewGrafanaTask(o.client, factory)),
tasks.NewTaskSpec("Updating Prometheus-k8s", tasks.NewPrometheusTask(o.client, factory, config)),
tasks.NewTaskSpec("Updating Alertmanager", tasks.NewAlertmanagerTask(o.client, factory)),
tasks.NewTaskSpec("Updating node-exporter", tasks.NewNodeExporterTask(o.client, factory)),
tasks.NewTaskSpec("Updating kube-state-metrics", tasks.NewKubeStateMetricsTask(o.client, factory)),

I see the following log entries from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz:

I0903 13:55:04.687436       1 tasks.go:37] running task Updating Prometheus Operator
...
I0903 13:55:23.094065       1 tasks.go:37] running task Updating Grafana
...
I0903 13:55:30.198715       1 tasks.go:37] running task Updating Prometheus-k8s
...
I0903 14:01:30.888512       1 tasks.go:37] running task Updating Alertmanager
E0903 13:49:12.972447       1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0903 13:49:12.972481       1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition
...
(retries the above a couple of times)
...
I0903 14:01:30.888512       1 tasks.go:37] running task Updating Alertmanager
I0903 14:01:30.888611       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.090741       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.187570       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.287753       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.304566       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.487412       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.490592       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.493077       1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.587637       1 decoder.go:224] decoding stream as YAML
<EOF>

This might be a simple flake where the e2e test simply gives up "too fast", though admittedly the deployment already takes a long time (~15 minutes).

@smarterclayton @brancz: are the monitoring (and also Alertmanager) images downloaded from the internet, or are they cached internally by OpenShift?
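One way to get a rough answer from the same events.json artifact is to look at the image pull events (a sketch assuming jq and the standard List format; the gap between a Pulling event and the matching Pulled event approximates download time, and "already present on machine" messages indicate a cached image):

# Show image pull activity per pod with timestamps.
jq '.items[] | select(.reason == "Pulling" or .reason == "Pulled") | {pod: .involvedObject.name, reason: .reason, time: .firstTimestamp, message: .message}' events.json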

@brancz (Contributor) commented Sep 5, 2018

It would be pretty invasive, but it is possible to parallelize some of these things. Right now the main ordering requirement is that the Prometheus Operator is set up first; after that, the only remaining dependency is between Prometheus and Grafana, since the Grafana task sets up some resources that the Prometheus task depends on.

@s-urbaniak (Contributor)

Just some more data points from the event log on why Alertmanager was slow to start, especially alertmanager-main-2.

It seems the GCP persistent disks have quite a few hiccups when mounting. From https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json:

            "firstTimestamp": "2018-09-03T13:39:44Z",
            "lastTimestamp": "2018-09-03T13:39:46Z",
            "message": "pod has unbound PersistentVolumeClaims (repeated 3 times)",
...
            "firstTimestamp": "2018-09-03T13:39:46Z",
            "lastTimestamp": "2018-09-03T13:39:46Z",
            "message": "Successfully assigned openshift-monitoring/alertmanager-main-2 to ci-op-3nmhd3lm-eb354-ig-n-w1lk",
...
            "firstTimestamp": "2018-09-03T13:39:54Z",
            "lastTimestamp": "2018-09-03T13:39:54Z",
            "message": "AttachVolume.Attach succeeded for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" ",
...
            "firstTimestamp": "2018-09-03T13:41:49Z",
            "lastTimestamp": "2018-09-03T14:02:13Z",
            "message": "Unable to mount volumes for pod \"alertmanager-main-2_openshift-monitoring(d0f871a5-af7e-11e8-8cd1-42010a8e0005)\": timeout expired waiting for volumes to attach or mount for pod \"openshift-monitoring\"/\"alertmanager-main-2\". list of unmounted volumes=[alertmanager-main-db]. list of unattached volumes=[alertmanager-main-db config-volume secret-alertmanager-main-tls secret-alertmanager-main-proxy alertmanager-main-token-f24kl]",
...
            "firstTimestamp": "2018-09-03T13:50:03Z",
            "lastTimestamp": "2018-09-03T14:00:03Z",
            "message": "MountVolume.WaitForAttach failed for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" : Could not find attached GCE PD \"kubernetes-dynamic-pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\". Timeout waiting for mount paths to be created.",

@brancz (Contributor) commented Sep 5, 2018

@smarterclayton just to see if our suspicion is correct, would it be possible to increase the timeout to see if it eventually deploys?

@smarterclayton (Contributor, Author) commented Sep 5, 2018 via email

@brancz (Contributor) commented Sep 5, 2018

Yes good point.

@s-urbaniak (Contributor)

@smarterclayton do you have references to anyone on the storage team we could ping here? It seems there is not much we can do in the cluster-monitoring-operator itself.

@smarterclayton (Contributor, Author)

@openshift/sig-storage

@smarterclayton (Contributor, Author)

@s-urbaniak (Contributor)

Closing this out here, as it does not seem to be related to the cluster monitoring operator.
