node-exporter does not come up on openshift e2e runs #85
@brancz this is a significant source of flakes in the 3.11 e2es |
@mxinden those warnings were recently fixed, no? Did we update the Prometheus Operator version? |
Something I can think of off the top of my head: can you check whether that annotation is set so that pods are not forced onto the worker nodes?
Edit: it looks like we do set that in the Ansible roles, though. |
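(For reference, a minimal client-go sketch of the check suggested above. It assumes the annotation in question is the project node selector, `openshift.io/node-selector`, on the `openshift-monitoring` namespace; both names are assumptions for illustration, not confirmed from this thread.)

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Namespace and annotation key are assumptions; adjust to the cluster under test.
	ns, err := client.CoreV1().Namespaces().Get(context.TODO(), "openshift-monitoring", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if sel, ok := ns.Annotations["openshift.io/node-selector"]; ok {
		fmt.Printf("node-selector annotation: %q\n", sel)
	} else {
		fmt.Println("no node-selector annotation set on the namespace")
	}
}
```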
We'd be getting events if we couldn't schedule. |
Ah, I actually found something in the logs of the cluster-monitoring-operator:
What do the Alertmanager logs say, and are the Pods from its StatefulSet even getting created/scheduled? |
For the record |
I see alertmanager pods here, so they get created at some point: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/ |
@smarterclayton Thanks a lot for the logs and pods traces, this helps tremendously! 👍 I think we can narrow it down to the cluster-monitoring-operator not reaching the point of creating the node-exporter DaemonSet at all. In fact, it also doesn't create the kube-state-metrics Deployment (both artifacts are missing in the pods overview). While looking at the operator logic (cluster-monitoring-operator/pkg/operator/operator.go, lines 238 to 244 at d6d5b11),
I see the following log entries from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz:
This might be a simple flake, where the e2e test simply gives up "too fast"; admittedly it takes a long time (~15 minutes). @smarterclayton @brancz: are the monitoring (and also alertmanager) images downloaded from the internet, or are they cached internally by OpenShift? |
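(To make the suspected failure mode concrete: if the operator runs its tasks as an ordered list and stops at the first error, which is what the operator.go reference above suggests, then a stuck Alertmanager task means the node-exporter and kube-state-metrics tasks never run. A minimal sketch of that behavior; the `task` type and task bodies here are illustrative, not the operator's actual code.)

```go
package main

import (
	"fmt"
	"log"
)

// task is an illustrative stand-in for the operator's reconciliation tasks.
type task struct {
	name string
	run  func() error
}

// runSequential executes tasks in order and aborts on the first failure,
// so everything after the failing task is never created.
func runSequential(tasks []task) error {
	for _, t := range tasks {
		if err := t.run(); err != nil {
			return fmt.Errorf("task %q failed: %v", t.name, err)
		}
	}
	return nil
}

func main() {
	tasks := []task{
		{"PrometheusOperator", func() error { return nil }},
		{"Alertmanager", func() error { return fmt.Errorf("waiting for Alertmanager StatefulSet timed out") }},
		// With a sequential runner these two never execute once Alertmanager fails,
		// which would explain the missing DaemonSet and Deployment:
		{"NodeExporter", func() error { return nil }},
		{"KubeStateMetrics", func() error { return nil }},
	}
	if err := runSequential(tasks); err != nil {
		log.Println(err)
	}
}
```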
It would be pretty invasive, but possible, to parallelize some of these things. Right now the big dependency is that the Prometheus Operator is set up first; after that, the only remaining dependency is between Prometheus and Grafana, as the Grafana task sets up some resources that the Prometheus task depends on. |
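(A sketch of what that parallelization could look like using golang.org/x/sync/errgroup, given the dependencies described above: Prometheus Operator strictly first, Grafana before Prometheus, everything else concurrent. The task functions are placeholders standing in for the operator's real tasks.)

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Placeholder task functions; in the real operator these would be the
// existing reconciliation tasks.
func setupPrometheusOperator() error { return nil }
func setupGrafana() error            { return nil }
func setupPrometheus() error         { return nil }
func setupAlertmanager() error       { return nil }
func setupNodeExporter() error       { return nil }
func setupKubeStateMetrics() error   { return nil }

func runParallel() error {
	// The Prometheus Operator must be set up before anything else.
	if err := setupPrometheusOperator(); err != nil {
		return err
	}

	var g errgroup.Group
	// Grafana creates resources Prometheus depends on, so keep those two
	// sequential within a single goroutine.
	g.Go(func() error {
		if err := setupGrafana(); err != nil {
			return err
		}
		return setupPrometheus()
	})
	// The remaining tasks have no ordering dependency on each other.
	g.Go(setupAlertmanager)
	g.Go(setupNodeExporter)
	g.Go(setupKubeStateMetrics)
	return g.Wait()
}

func main() {
	if err := runParallel(); err != nil {
		fmt.Println("setup failed:", err)
	}
}
```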
Just some more data points from the event log on why alertmanager was slow to start, especially alertmanager-main-2. It seems the GCP persistent disks have quite some hiccups when mounting. From https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json:
|
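(For anyone digging through dumps like that events.json, a small sketch that surfaces the mount-related failures. It assumes the file decodes as a standard v1 EventList and that reasons prefixed with "Failed", e.g. FailedMount/FailedAttachVolume, are the interesting ones.)

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: filter-events <events.json>")
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var list corev1.EventList
	if err := json.NewDecoder(f).Decode(&list); err != nil {
		log.Fatal(err)
	}

	for _, e := range list.Items {
		// Reasons like FailedMount and FailedAttachVolume point at PD attach/mount trouble.
		if !strings.HasPrefix(e.Reason, "Failed") {
			continue
		}
		if !strings.Contains(e.InvolvedObject.Name, "alertmanager") {
			continue
		}
		fmt.Printf("%s %s/%s: %s\n", e.Reason, e.InvolvedObject.Namespace, e.InvolvedObject.Name, e.Message)
	}
}
```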
@smarterclayton just to see if our suspicion is correct, would it be possible to increase the timeout to see if it eventually deploys? |
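(If the e2e side wanted to test that suspicion, a sketch of a longer wait using wait.Poll from apimachinery; the namespace, DaemonSet name, and the 15-minute deadline are assumptions based on the discussion above.)

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForDaemonSet polls until the DaemonSet reports all desired pods ready,
// or the timeout expires.
func waitForDaemonSet(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.Poll(10*time.Second, timeout, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			// Not created yet (or a transient error): keep polling instead of failing hard.
			return false, nil
		}
		return ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// A deliberately generous deadline, to see whether the rollout eventually lands.
	if err := waitForDaemonSet(client, "openshift-monitoring", "node-exporter", 15*time.Minute); err != nil {
		log.Fatalf("node-exporter never became ready: %v", err)
	}
}
```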
Which timeout? If we think this is a GCE PD problem we should copy the storage team in. I've never noticed this before, but if PVs are lagging we'd want to know.
|
Yes, good point. |
@smarterclayton do you have any references to people in the storage team we can ping here? It seems there is not much we can do in the cluster-monitoring-operator itself. |
@openshift/sig-storage |
Closing this out here, as it does not seem to be related to the cluster-monitoring-operator. |
I switched our prometheus e2e tests to use the cluster monitoring operator and I'm seeing failures in roughly 1 in 4 runs. The most noticeable is that one run didn't have the node exporter installed (no pods created).
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/
In this run the e2e tests start at 13:37, but the prometheus test isn't run until 13:45, which should be more than enough time for node-exporter to come up. I see no pods created, which implies either the daemonset wasn't created or the daemonset failed massively. I see no events for the daemonset in
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json, which implies it didn't get created.
I see the following in the logs for the prometheus operator (which seems bad), but nothing excessive in the cluster monitoring operator:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_prometheus-operator-6c9fddd47f-mb4br_prometheus-operator.log.gz