kube SD errors leading to loss of connectivity to alertmanager #5345

Closed
vsliouniaev opened this Issue Mar 12, 2019 · 2 comments

vsliouniaev commented Mar 12, 2019

Bug Report

What did you do?
Deployed Prometheus and Alertmanager on Kubernetes.
Deleted the Alertmanager pods, which were then re-created.

What did you expect to see?
Prometheus should reconnect to the new Alertmanager pods through service discovery and send alerts to the new instances.

What did you see instead? Under which circumstances?
Prometheus fails to discover new pods and continues trying to send alerts to the old alertmanager instances.

I have tried deleting the pods a few more times, but the behaviour does not reoccur.
Some googling led me to an investigation done for projectcalico/typha#227 that appears to be quite similar.
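For reference, the mechanism involved is the Kubernetes list-then-watch pattern: when a watch ends with "too old resource version" (410 Gone), the client has to re-list to obtain a fresh resourceVersion and start a new watch from it. A minimal client-go sketch of that pattern (illustrative only, not Prometheus's actual discovery code; namespace and names are placeholders) looks roughly like this:

```go
// Illustrative sketch only (not Prometheus's implementation): the
// list-then-watch loop that Kubernetes service discovery relies on.
// When a watch ends with "too old resource version" (410 Gone), the
// client must re-list to obtain a fresh resourceVersion and re-watch.
package main

import (
	"context"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func watchEndpoints(ctx context.Context, client kubernetes.Interface, ns string) {
	for ctx.Err() == nil {
		// List first to get a current resourceVersion to watch from.
		list, err := client.CoreV1().Endpoints(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}

		w, err := client.CoreV1().Endpoints(ns).Watch(ctx, metav1.ListOptions{
			ResourceVersion: list.ResourceVersion,
		})
		if err != nil {
			log.Printf("watch failed: %v", err)
			time.Sleep(time.Second)
			continue
		}

		for ev := range w.ResultChan() {
			if ev.Type == watch.Error {
				// "too old resource version" arrives as an error event;
				// breaking out makes the outer loop re-list and re-watch.
				log.Printf("watch ended: %v", apierrors.FromObject(ev.Object))
				break
			}
			// Added/Modified/Deleted events would update the target groups here.
		}
		w.Stop()
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	watchEndpoints(context.Background(), kubernetes.NewForConfigOrDie(cfg), "monitoring")
}
```

If the re-list/re-watch step never completes, discovery keeps serving the last known endpoints, which would match the stale targets described above.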

Environment

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:35:51Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.5", GitCommit:"51dd616cdd25d6ee22c83a858773b607328a18ec", GitTreeState:"clean", BuildDate:"2019-01-16T18:14:49Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"}
  • System information:

docker

  • Prometheus version:

docker prometheus:v2.7.1

  • Alertmanager version:

docker alertmanager:v0.16.1

  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    prometheus: monitoring/prom-op-prometheus-operato-prometheus
    prometheus_replica: prometheus-prom-op-prometheus-operato-prometheus-1
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: prom-op-prometheus-operato-alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
  • Logs:
    Prometheus continuously repeats the errors about sending alerts, and will sometimes also log the watch failure. It does not recover from this state and has to be restarted.
level=warn ts=2019-03-12T14:05:49.004134567Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 14694876 (14697911)"
level=error ts=2019-03-12T14:05:49.015991523Z caller=notifier.go:481 component=notifier alertmanager=http://10.26.102.143:9093/api/v1/alerts count=2 msg="Error sending alert" err="Post http://10.26.102.143:9093/api/v1/alerts: context deadline exceeded"
level=error ts=2019-03-12T14:05:49.016072523Z caller=notifier.go:481 component=notifier alertmanager=http://10.26.28.57:9093/api/v1/alerts count=2 msg="Error sending alert" err="Post http://10.26.28.57:9093/api/v1/alerts: context deadline exceeded"
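For comparison with the stale IPs in the errors above, a small hypothetical helper (assuming a working kubeconfig; service name and namespace are taken from the configuration above) can print the addresses currently on the alertmanager Endpoints object:

```go
// Hypothetical diagnostic helper: print the ready addresses on the
// alertmanager Endpoints object so they can be compared with the IPs
// Prometheus is still trying to send alerts to.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Uses the default kubeconfig (~/.kube/config); adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ep, err := client.CoreV1().Endpoints("monitoring").Get(context.Background(),
		"prom-op-prometheus-operato-alertmanager", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ss := range ep.Subsets {
		for _, addr := range ss.Addresses {
			fmt.Println("ready address:", addr.IP)
		}
	}
}
```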

simonpasquier commented Mar 13, 2019

Could you try to reproduce with --log.level=debug and share the logs?


vsliouniaev commented Apr 18, 2019

I have been trying to reproduce this for some time but was not able to. I suspect there may have been an issue with the cluster itself. Closing this, since I cannot keep debug logs enabled indefinitely and am no longer actively investigating.
