
Prometheus scrape manager gets stuck after deleting ServiceMonitor #4992

Closed
skbly7 opened this Issue Dec 12, 2018 · 1 comment

skbly7 commented Dec 12, 2018

Bug Report

What did you do?
Deleted a ServiceMonitor via kubectl delete servicemonitors name-of-service-monitor
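
For reference, the deleted ServiceMonitors had roughly this shape — a sketch reconstructed from the generated scrape config shown below; the name, port name, and labels are illustrative, not the exact manifests used:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: websites-www-5        # illustrative; matches job default/websites-www-5/0
  namespace: default
spec:
  namespaceSelector:
    matchNames:
    - websites
  selector:
    matchLabels:
      app: www                # keep-relabel on __meta_kubernetes_service_label_app
  endpoints:
  - path: /api/metrics
    interval: 360s
    targetPort: 8080          # keep-relabel on container port 8080
    params:
      api-key:
      - unique-key-1
```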

What did you expect to see?
After the deletion succeeds, the scrape manager component should trigger a successful reload of the Prometheus configuration.

What did you see instead? Under which circumstances?
I tried to replicate this multiple times, and it hits roughly 9 out of 10 times when the number of ServiceMonitors is approximately 20 or higher.
This is followed by a complete deadlock of the scrape manager and of related Prometheus APIs such as /service-discovery.

Environment

  • System information:
# Prometheus container
$ uname -srm
Linux 3.10.0-693.el7.x86_64 x86_64
  • Prometheus version:
prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
  build user:       root@578ab108d0b9
  build date:       20181106-11:40:44
  go version:       go1.11.1
  • Prometheus configuration file:
    NOTE: This was reproducible for me with any 20 different jobs; they do not need to look like the ones shared below. The config below is from a test environment where I duplicated the same job 20 times, and I was able to reproduce the issue there as well.
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: default/web-k8s
    prometheus_replica: prometheus-web-k8s-0
rule_files:
- /etc/prometheus/rules/prometheus-web-k8s-rulefiles-0/*.yaml
scrape_configs:
- job_name: default/websites-www-1/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - websites
  scrape_interval: 360s
  metrics_path: /api/metrics
  params:
    api-key:
    - unique-key-1
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: www
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_port_number
    regex: "8080"
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - source_labels:
    - __meta_kubernetes_service_label_prometheusJob
    target_label: job
    regex: (.+)
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
- job_name: default/websites-www-10/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - websites
  scrape_interval: 360s
  metrics_path: /api/metrics
  params:
    api-key:
    - unique-key-1
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app
    regex: www
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_port_number
    regex: "8080"
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
[x20 jobs, same as above]
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers: []
  • Logs:
    For example, after deleting the ServiceMonitor websites-www-5:
level=info ts=2018-12-12T07:40:15.428486665Z caller=main.go:632 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2018-12-12T07:40:15.441384313Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-12T07:40:15.442736454Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-12T07:40:15.444124294Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-12T07:40:15.445309059Z caller=main.go:658 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2018-12-12T07:40:16.128431634Z caller=manager.go:112 component="scrape manager" msg="error reloading target set" err="invalid config id:default/websites-www-5/0"
simonpasquier commented Dec 12, 2018

This is fixed by #4894 and will be available in the upcoming v2.6.0.
