
Scraping of metrics stops #4736

Closed
smelchior opened this issue Oct 13, 2018 · 32 comments

smelchior commented Oct 13, 2018

Proposal

The Prometheus process silently stops collecting metrics after a while.

Bug Report

What did you do?
Ran Prometheus in our Kubernetes cluster; scraping stops after some days of runtime.
What did you expect to see?
Prometheus to collect my metrics :-)
What did you see instead? Under which circumstances?
Prometheus stops collecting metrics after a while. The process still runs normally and I can access the web UI. There are no errors in the log file:

level=info ts=2018-09-10T15:00:02.683163067Z caller=head.go:348 component=tsdb msg="head GC completed" duration=135.644083ms
level=info ts=2018-09-10T15:00:03.357779834Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=674.543186ms
level=info ts=2018-09-10T15:00:05.40285549Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1536559200000 maxt=1536580800000 ulid=01CQ1ZY9VYFBA0TXW1GBBBKQQF sources="[01CQ1BB28SA780693DMQKNV40C 01CQ1J6TQJBVQHJ29RHW9T61RZ 01CQ1S2FHWQ3MCN2KCX0945BZW]"
level=info ts=2018-09-10T17:00:02.269244263Z caller=compact.go:398 component=tsdb msg="write block" mint=1536588000000 maxt=1536595200000 ulid=01CQ26SY1SECXT7R4WWMZAGV78
level=info ts=2018-09-10T17:00:02.632895764Z caller=head.go:348 component=tsdb msg="head GC completed" duration=134.117106ms
level=info ts=2018-09-10T17:00:03.157658829Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=524.698287ms

(scraping stopped around 2018-09-10T17:18)
The /targets page shows a last scrape time of about 65h ago for all targets, although they should be scraped every minute.
I already posted this on the mailing list (since then we updated to 2.4.2, but this still happens the same way):
https://groups.google.com/d/topic/prometheus-users/bUFa24XKup8/discussion

I collected the Go profiles the last time it happened. Who can I send these to?
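For reference, such profiles can be pulled straight from the pprof endpoints Prometheus exposes; a minimal sketch, assuming the default localhost:9090 listen address (the output file names are only illustrative):

# goroutine dump in text form (shows how long each goroutine has been blocked)
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt
# heap and 30-second CPU profiles in pprof format
curl -s 'http://localhost:9090/debug/pprof/heap' > heap.pprof
curl -s 'http://localhost:9090/debug/pprof/profile?seconds=30' > cpu.pprof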

Environment

  • System information:
    Linux 4.4.0-1065-aws x86_64

  • Prometheus version:
    prometheus, version 2.4.2 (branch: HEAD, revision: c305ffaa092e94e9d2dbbddf8226c4813b1190a0)
      build user: root@dcde2b74c858
      build date: 20180921-07:22:29
      go version: go1.10.3
    (we use the prom/prometheus:v2.4.2 container image in our kubernetes cluster installed via the helm chart)

  • Prometheus configuration file:

global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s

rule_files:
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-apiservers
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: default;kubernetes;https
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_service_name
    - __meta_kubernetes_endpoint_port_name
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-nodes
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/${1}/proxy/metrics
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: kubernetes-nodes-cadvisor
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - replacement: kubernetes.default.svc:443
    target_label: __address__
  - regex: (.+)
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    source_labels:
    - __meta_kubernetes_node_name
    target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
- honor_labels: true
  job_name: prometheus-pushgateway
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - action: keep
    regex: pushgateway
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe
- job_name: kubernetes-services
  kubernetes_sd_configs:
  - role: service
  metrics_path: /probe
  params:
    module:
    - http_2xx
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_probe
  - source_labels:
    - __address__
    target_label: __param_target
  - replacement: blackbox
    target_label: __address__
  - source_labels:
    - __param_target
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name

alerting:
  alertmanagers:
  - kubernetes_sd_configs:
      - role: pod
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    path_prefix: /alertmanager
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      regex: monitoring
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: prometheus
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: alertmanager
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex:
      action: drop

gouthamve (Member) commented Oct 14, 2018

Hi, could you share a screenshot of the targets page (with the URLs redacted)? This is a serious issue if reproducible, but you're the only one to report it, which makes me wonder if it's a config issue.

You could send the profiles to gouthamve [at] gmail.com and I'll make sure to forward them to the other maintainers.

smelchior (Author) commented Oct 14, 2018

Hi,
sorry, I do not have a screenshot available, but I will create one if this happens again.
The profiles should be in your inbox, thank you very much!

smelchior (Author) commented Nov 12, 2018

The issue occurred again today and I was able to get the screenshot. I also sent you another email with today's traces. I would really appreciate it if someone could look into this. I will update to the latest version now to see if that fixes it, but I did not see anything in the changelog that suggests it would.
[screenshot attachment: target_list]

simonpasquier (Member) commented Nov 12, 2018

@smelchior I'd be interested to look at the debugging info. You can send me the output of the promtool debug all command by email (address on my GitHub profile).
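A minimal sketch of that command, assuming promtool from the same release and the default listen address; the archive name is whatever the tool chooses (typically debug.tar.gz), not something configured here:

# bundles metrics, pprof profiles and runtime info into a single archive
promtool debug all http://localhost:9090/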

smelchior (Author) commented Nov 12, 2018

@simonpasquier I sent you an email with the details I have; this includes the pprof debug info I retrieved via the /debug/.. endpoints. Thanks!

simonpasquier (Member) commented Nov 12, 2018

I've had a quick look at the pprof data shared by @smelchior and I suspect that one scrape appender is stuck and is blocking the other appenders. @krasi-georgiev thoughts?

Complete graph of goroutines:

[image attachment: fullprofile]

Graph of goroutines matching on scrape:

[image attachment: scrapeprofile]
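For context, graphs like these can be rendered from the goroutine endpoint with go tool pprof; a rough sketch, assuming Graphviz is installed and the server is reachable on localhost:9090 (the -focus regex and output names are illustrative):

# full goroutine graph
go tool pprof -png http://localhost:9090/debug/pprof/goroutine > fullprofile.png
# only goroutines whose stacks match "scrape"
go tool pprof -focus=scrape -png http://localhost:9090/debug/pprof/goroutine > scrapeprofile.png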

smelchior (Author) commented Nov 15, 2018

The issue occurred again, also after the upgrade to 2.5.0:

level=info ts=2018-11-14T05:00:01.872821754Z caller=compact.go:398 component=tsdb msg="write block" mint=1542160800000 maxt=1542168000000 ulid=01CW89E9Q08HB9BDXEMCV23WWF
level=info ts=2018-11-14T05:00:02.142369994Z caller=head.go:488 component=tsdb msg="head GC completed" duration=92.005722ms
level=info ts=2018-11-14T05:00:03.376933172Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=846 high=847 duration=1.234497198s
level=info ts=2018-11-14T07:00:01.910559942Z caller=compact.go:398 component=tsdb msg="write block" mint=1542168000000 maxt=1542175200000 ulid=01CW8GA0Z32SWRVA985BBVG85Z
level=info ts=2018-11-14T07:00:02.167353789Z caller=head.go:488 component=tsdb msg="head GC completed" duration=93.100468ms
level=info ts=2018-11-14T09:00:01.897886077Z caller=compact.go:398 component=tsdb msg="write block" mint=1542175200000 maxt=1542182400000 ulid=01CW8Q5R746P2BEVR9EWG54GDK
level=info ts=2018-11-14T09:00:02.158764781Z caller=head.go:488 component=tsdb msg="head GC completed" duration=94.209575ms
level=info ts=2018-11-14T09:00:03.462023342Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=848 high=849 duration=1.303185873s
level=info ts=2018-11-14T09:00:04.958486681Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1542153600000 maxt=1542175200000 ulid=01CW8Q5V5N7NCGRG2B7Z5QE82T sources="[01CW82JJF1FJZMYSD5ZGV92YY5 01CW89E9Q08HB9BDXEMCV23WWF 01CW8GA0Z32SWRVA985BBVG85Z]"
level=info ts=2018-11-14T11:00:01.923770198Z caller=compact.go:398 component=tsdb msg="write block" mint=1542182400000 maxt=1542189600000 ulid=01CW8Y1FF2740XWXBAB7EXWDR2
level=info ts=2018-11-14T11:00:02.20106374Z caller=head.go:488 component=tsdb msg="head GC completed" duration=90.254357ms


prometheus --version
prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
  build user:       root@578ab108d0b9
  build date:       20181106-11:40:44
  go version:       go1.11.1

uname -a
Linux prometheus-server-7797485495-psjtz 4.4.0-1070-aws #80-Ubuntu SMP Thu Oct 4 13:56:07 UTC 2018 x86_64 GNU/Linux

The log messages above are the last ones; sometime after that the scraping started to hang. I emailed the debug details to @simonpasquier.
The promtool debug all output is included, but the tool hung at the end, so the output is probably not complete.

krasi-georgiev (Member) commented Nov 28, 2018

Hi @smelchior, I can now start looking into this. Can you ping me on the prometheus-dev channel to see if you can help me replicate this?

krasi-georgiev (Member) commented Nov 29, 2018

ping @smelchior

smelchior (Author) commented Nov 30, 2018

Sorry, I was busy the last few days. I am not sure if I can help you in this regard, as I have no way to reliably reproduce this. It just happens in our environment from time to time. We have now been without an issue for over 2 weeks, but I guess it might happen again anytime :-(

krasi-georgiev (Member) commented Nov 30, 2018

@smelchior thanks for the update. Would you mind still pinging me on IRC, as I might ask for some additional details and clues to try and replicate it myself?

The profiles show that it locks when writing to the WAL file, so the first logical question is: what storage type is used for the WAL files?
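A rough way to answer that from inside the pod; the namespace, container name, and /data mount path below are assumptions based on a typical Helm-chart deployment, not details taken from this report:

# adjust namespace, pod and container names to your deployment
kubectl -n monitoring exec -it prometheus-server-7797485495-psjtz -c prometheus-server -- sh
# inside the container: which device/filesystem backs the TSDB directory and its wal/ subdirectory?
grep /data /proc/mounts
df -h /data
ls -l /data/wal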

krasi-georgiev (Member) commented Dec 5, 2018

ping @smelchior

smelchior (Author) commented Dec 5, 2018

I tried to find you on IRC on Tuesday but had no luck; what is your username there?

krasi-georgiev (Member) commented Dec 5, 2018

The same as here, @krasi-georgiev, in #prometheus-dev.

By the way, what is the storage type used for the WAL files?

smelchior (Author) commented Dec 5, 2018

OK, I will get back to you tomorrow on IRC. The instances are running on K8s on AWS, and the PV for the Prometheus data is an SSD EBS volume.

krasi-georgiev (Member) commented Dec 5, 2018

Also, if possible, try master, as there have been quite a few fixes in the last few weeks.

krasi-georgiev (Member) commented Dec 12, 2018

@smelchior can you still replicate it with the master branch? 2.6 will be out in a few days, by the way.

krasi-georgiev (Member) commented Dec 20, 2018

@smelchior 2.6 is out, could you try it?

smelchior (Author) commented Dec 20, 2018

I have updated now; I will close this for now and reopen it should it happen again.
Thanks for your help!

smelchior closed this Dec 20, 2018

krasi-georgiev (Member) commented Dec 20, 2018

thanks, appreciated!

smelchior (Author) commented Jan 10, 2019

Unfortunately the issue happened again with this version:

prometheus, version 2.6.0 (branch: HEAD, revision: dbd1d58c894775c0788470944b818cc724f550fb)
  build user:       root@bf5760470f13
  build date:       20181217-15:14:46
  go version:       go1.11.3

Also, this time I was not able to access the /targets page anymore; it just never responded. The other pages did still respond. Unfortunately I was not able to get the debug info this time either. The volume was still writable though; I checked in the container.

smelchior reopened this Jan 10, 2019

krasi-georgiev (Member) commented Jan 18, 2019

Does this happen on a single machine only, or in different setups?

krasi-georgiev (Member) commented Jan 18, 2019

We haven't had any other reports of such behaviour, which makes me think it is something specific to your setup.

smelchior (Author) commented Jan 18, 2019

As discussed via IRC, the issue occurred again and I sent you the debug info via email. One other note: the log of the first startup after Prometheus has been restarted looks like this:

level=info ts=2019-01-18T14:45:56.737266103Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2019-01-18T14:45:56.737402364Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2019-01-18T14:45:56.737518945Z caller=main.go:245 host_details="(Linux 4.4.0-1072-aws #82-Ubuntu SMP Fri Nov 2 15:00:21 UTC 2018 x86_64 prometheus-server-56d4ffd7cd-q2n6x (none))"
level=info ts=2019-01-18T14:45:56.737579532Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-18T14:45:56.737666098Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-01-18T14:45:56.739227123Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2019-01-18T14:45:56.739732937Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-01-18T14:45:56.740714707Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547272800000 maxt=1547294400000 ulid=01D1197EY16Q4E3TQ7KDXMF7N2
level=info ts=2019-01-18T14:45:56.741412541Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547294400000 maxt=1547316000000 ulid=01D11XTK495TVN7HV0031H6DNQ
level=info ts=2019-01-18T14:45:56.742609994Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547316000000 maxt=1547337600000 ulid=01D12JDTF1H5Q2EAH1101V7GH6
level=info ts=2019-01-18T14:45:56.744390244Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547337600000 maxt=1547359200000 ulid=01D1371089T4WK34BFY12PVKNG
level=info ts=2019-01-18T14:45:56.746814053Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547359200000 maxt=1547380800000 ulid=01D13VM63R490DS9TCA61ZZHG4
level=info ts=2019-01-18T14:45:56.747292783Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547380800000 maxt=1547402400000 ulid=01D14G7A523T9TMEQ8NAQYDJ6A
level=info ts=2019-01-18T14:45:56.747858797Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547402400000 maxt=1547424000000 ulid=01D154THFZQ9Q9X7PQ3JFS0K42
level=info ts=2019-01-18T14:45:56.748342377Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547424000000 maxt=1547445600000 ulid=01D15SDQ9FMM91VNNWYEFPQ9X4
level=info ts=2019-01-18T14:45:56.749727319Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547445600000 maxt=1547467200000 ulid=01D16E0X1MBBKQB0N5K7MZ50MV
level=info ts=2019-01-18T14:45:56.750195808Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547467200000 maxt=1547488800000 ulid=01D172M16QBRN4X9SJ9WF7S22V
level=info ts=2019-01-18T14:45:56.75069195Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547488800000 maxt=1547510400000 ulid=01D17Q78J3SS5HRGP7X38MVTHE
level=info ts=2019-01-18T14:45:56.751197549Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547510400000 maxt=1547532000000 ulid=01D18BTED195CFA0PQ5F1ZZBH2
level=info ts=2019-01-18T14:45:56.751734403Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547532000000 maxt=1547553600000 ulid=01D190DM4DZCT8BNE95RPZCYE1
level=info ts=2019-01-18T14:45:56.752204163Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547553600000 maxt=1547575200000 ulid=01D19N0RCR8E5WCF2WDCDC7JFX
level=info ts=2019-01-18T14:45:56.752661925Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547575200000 maxt=1547596800000 ulid=01D1A9KZJN5XJNF08MJZMSMYTQ
level=info ts=2019-01-18T14:45:56.753096304Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547618400000 maxt=1547625600000 ulid=01D1AY71913PDEBASMERC5NYAC
level=info ts=2019-01-18T14:45:56.753582971Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547596800000 maxt=1547618400000 ulid=01D1AY75D9GWKY7EB1M54N4WX1
level=info ts=2019-01-18T14:45:56.754198362Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547625600000 maxt=1547632800000 ulid=01D1GPD3Q48YZG8J9BFZXS4G5W
level=warn ts=2019-01-18T14:46:03.978761276Z caller=head.go:434 component=tsdb msg="unknown series references" count=24
level=info ts=2019-01-18T14:46:04.160915523Z caller=main.go:571 msg="TSDB started"
level=info ts=2019-01-18T14:46:04.160987494Z caller=main.go:631 msg="Loading configuration file" filename=/etc/config/prometheus.yml
level=info ts=2019-01-18T14:46:04.162707813Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-18T14:46:04.163257814Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-18T14:46:04.163735476Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-18T14:46:04.16417839Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-18T14:46:04.164712579Z caller=kubernetes.go:201 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-18T14:46:04.168358187Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
level=info ts=2019-01-18T14:46:04.168381758Z caller=main.go:530 msg="Server is ready to receive web requests."

krasi-georgiev (Member) commented Jan 30, 2019

Waiting for the profiles with the block/mutex data included.

smelchior (Author) commented Jan 30, 2019

I am waiting for the next crash to happen :)
If it helps, we can close this until then and I can reopen it once it occurs again.

krasi-georgiev (Member) commented Jan 30, 2019

No need, there is no rush.
That was more of a reminder to myself so I don't forget the status of the issue.

nickbp commented Feb 10, 2019

Hi, I've been debugging a couple of Prometheus instances that were failing to fetch pods today, and stumbled across this bug. In my case I have a total of three Prometheus instances in different namespaces on the same k8s cluster. Two of the instances are stuck while the third is looking fine. All instances seem totally responsive when accessing the Prometheus frontend site, etc.; the only real symptom on the stuck instances was that they stopped reporting the pod metrics.

One thing that stuck out was prometheus' /metrics endpoint showing the following:

- Working instance: net_conntrack_dialer_conn_attempted_total{dialer_name="kubernetes-pods"} 1700
- Stuck instances: net_conntrack_dialer_conn_attempted_total{dialer_name="kubernetes-pods"} 0

I've left debug dumps etc. in IRC.

Edit: Never mind, it turns out the issue I had been seeing was 100% user error: the two bad instances had an incorrect list of namespaces to query (in kubernetes_sd_configs.namespaces.names), while the working instance had the correct list. After fixing those entries in the bad instances, they suddenly started fetching/producing metrics again. So if anyone else sees something that looks like this issue, maybe do a quick check of your filter rules in prometheus.yml!
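As an illustrative follow-up to that check (not part of the original comment), the per-dialer counters can be listed via the query API to spot scrape pools that never attempt a connection; the address and the use of jq are assumptions:

# a dialer_name stuck at 0 matches the symptom described above
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=net_conntrack_dialer_conn_attempted_total' \
  | jq -r '.data.result[] | "\(.metric.dialer_name) \(.value[1])"'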

vears91 commented Feb 18, 2019

Maybe related to #4249? I also saw targets not being scraped while the web UI was still accessible.

krasi-georgiev (Member) commented Feb 27, 2019

@vears91 I doubt it; when we checked the profiles, they indicated some blocking when writing to the database.

@smelchior any more info since we last chatted?

smelchior (Author) commented Feb 27, 2019

No, it did not happen again; maybe the DEBUG=1 "fixes" it ;-)
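For anyone who wants the block/mutex profiles mentioned earlier, a rough sketch of setting that variable on a Helm-style deployment; the namespace and deployment name are assumptions, and whether DEBUG enables those profiles depends on the Prometheus build:

# hypothetical names; the pod restarts with DEBUG=1 in its environment
kubectl -n monitoring set env deployment/prometheus-server DEBUG=1
# afterwards /debug/pprof/block and /debug/pprof/mutex should start reporting data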

krasi-georgiev (Member) commented Feb 27, 2019

So weird; maybe it was some glitch in the storage.

In that case I will close it for now. Feel free to reopen if it happens again or if you have more info.
