
Clusteralerts and podalerts don't get recreated in icinga when searchlight-operator restarts #400

Open
mmta opened this issue Jul 19, 2018 · 1 comment


mmta commented Jul 19, 2018

I've only tested this with the following (on v7.0.0):

  • NodeAlert: node-status and node-volume checks.
  • ClusterAlert: component-status, event, and webhook checks.
  • PodAlert: pod-status check.

Whenever the searchlight-operator pod restarts, NodeAlert hosts and services are automatically registered back in Icinga, but ClusterAlert and PodAlert objects are not. I had to kubectl delete and kubectl apply them again to force their registration.

Looking at the logs after a restart, I noticed that only plugin.go and nodes.go keep retrying to connect, like so:

icinga/searchlight-operator-84878b6df-r95z7[operator]: I0719 18:04:58.386974       1 plugin.go:60] Sync/Add/Update for SearchlightPlugin cert
icinga/searchlight-operator-84878b6df-r95z7[operator]: E0719 18:04:58.452500       1 worker.go:76] Failed to process key cert. Reason: command terminated with exit code 1
icinga/searchlight-operator-84878b6df-r95z7[operator]: E0719 18:04:58.742345       1 nodes.go:64] failed to reconcile alert for node k8sworker1c. reason: [: Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused, : Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused, : Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused]
icinga/searchlight-operator-84878b6df-r95z7[operator]: E0719 18:04:58.742381       1 worker.go:76] Failed to process key k8sworker1c. Reason: [: Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused, : Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused, : Put https://127.0.0.1:5665/v1/objects/hosts/icinga@node@k8sworker1c: dial tcp 127.0.0.1:5665: getsockopt: connection refused]

So I tried removing the `return` from this line:
https://github.com/appscode/searchlight/blob/a90f4bc264099230a02ca329df1c2a30a9e28d51/pkg/operator/pods.go#L28

And also removing this `if` condition:
https://github.com/appscode/searchlight/blob/a90f4bc264099230a02ca329df1c2a30a9e28d51/pkg/operator/cluster_alerts.go#L21

And now all my ClusterAlerts and PodAlerts are registered back automatically after a restart.

But I'm not sure what the negative impact of skipping those checks would be?


mmta commented Jul 21, 2018

OK, removing op.isValid(alert) usually results in an invalid memory address or nil pointer dereference error in op.clusterHost.Apply(alert). Previously I wasn't testing with enough alerts to see this.

I still want all ClusterAlert objects to be enqueued and retried during pod start-up, though, so I ended up moving op.isValid(alert) from initClusterAlertWatcher() to reconcileClusterAlert().
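A minimal sketch of that change (hypothetical names and types, not the actual searchlight code): validation happens inside the reconcile step rather than in the watcher, so every alert still gets enqueued and retried after a restart, while invalid alerts are rejected with an error instead of hitting a nil dereference in Apply:

```go
package main

import (
	"errors"
	"fmt"
)

// Alert is a hypothetical stand-in for a ClusterAlert object.
type Alert struct {
	Name  string
	Valid bool
}

// isValid mirrors the kind of check op.isValid(alert) performs (assumption).
func isValid(a Alert) error {
	if !a.Valid {
		return errors.New("invalid alert spec")
	}
	return nil
}

// reconcile validates first, then applies. Invalid alerts are rejected
// here instead of being silently dropped by the watcher, so on restart
// every alert is still enqueued and gets a chance to reconcile.
func reconcile(a Alert) error {
	if err := isValid(a); err != nil {
		return fmt.Errorf("skipping %s: %v", a.Name, err)
	}
	// apply to Icinga here (omitted in this sketch)
	return nil
}

func main() {
	alerts := []Alert{{Name: "good", Valid: true}, {Name: "bad", Valid: false}}
	for _, a := range alerts {
		if err := reconcile(a); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println("applied:", a.Name)
		}
	}
}
```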

I also increased the default MaxNumRequeues from 5 to 20, just to be sure.
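The effect of that cap can be sketched like this (a simplified simulation with hypothetical helpers, not the real worker loop): each failing key is requeued until either it succeeds or MaxNumRequeues is exhausted, so a higher cap gives Icinga more time to come up during pod start-up:

```go
package main

import (
	"errors"
	"fmt"
)

// maxNumRequeues caps how many times a failing key is retried; raised
// from 5 to 20 so transient "connection refused" errors while Icinga is
// still starting don't exhaust the retries.
const maxNumRequeues = 20

// processKey simulates one reconcile attempt that fails until Icinga
// becomes reachable after icingaReadyAfter attempts (hypothetical helper).
func processKey(key string, icingaReadyAfter, attempt int) error {
	if attempt < icingaReadyAfter {
		return errors.New("dial tcp 127.0.0.1:5665: connection refused")
	}
	return nil
}

// retryUntilDone mimics the worker loop: requeue on error up to the cap.
// It returns the number of attempts made and whether the key succeeded.
func retryUntilDone(key string, icingaReadyAfter int) (int, bool) {
	for attempt := 0; attempt < maxNumRequeues; attempt++ {
		if err := processKey(key, icingaReadyAfter, attempt); err == nil {
			return attempt + 1, true
		}
	}
	return maxNumRequeues, false
}

func main() {
	// Icinga comes up after 8 failed attempts: this succeeds with a cap
	// of 20, but would have been dropped with the old cap of 5.
	attempts, ok := retryUntilDone("demo/cluster-alert", 8)
	fmt.Printf("attempts=%d ok=%v\n", attempts, ok)
}
```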

Now everything registers back automatically after a restart.
