Prometheus is restarting again and again #5016

Open
inyee786 opened this Issue Dec 19, 2018 · 22 comments

inyee786 commented Dec 19, 2018

Proposal

Use case. Why is this important?
Using Prometheus with an OpenEBS volume: it works fine for 1 to 3 hours, but after some time Prometheus keeps restarting and the config file cannot be loaded.


Bug Report

What did you do?

What did you expect to see?
It should not keep restarting.

What did you see instead? Under which circumstances?

Environment
GKE 8 node cluster

  • System information:

    Linux 4.15.0-1017-gcp x86_64

  • Prometheus version:

    prom/prometheus:v2.6.0

  • Prometheus configuration file:

kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prometheus-cstor 
apiVersion: v1
data:
  prometheus.yml: |-
    global:
      external_labels:
        slave: slave1
      scrape_interval: 5s
      evaluation_interval: 5s
    rule_files:
    # alert rules passed as argument in prometheus-deployment at given path
    - '/etc/prometheus-rules/alert.rules.yaml'
    scrape_configs: 
    - job_name : 'kubelets'
      scheme: http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /metrics
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:10255'
        target_label: __address__    
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-node-exporter'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
      - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
        target_label: __instance__
      - source_labels: [job]
        regex: 'kubernetes-(.*)'
        replacement: '${1}'
        target_label: name
    - job_name: 'maya-volume-exporter'
      scheme: http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_lable_monitoring]
        regex: volume_exporter_prometheus
        action: keep
    - job_name: 'prometheus'
      scheme: http
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: prometheus-server
        action: keep
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'openebs-volumes'
      scheme: http
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_monitoring]
        regex: volume_exporter_prometheus
        action: keep
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_vsm]
        action: replace
        target_label: openebs_pv
      - source_labels: [__meta_kubernetes_pod_label_openebs_pv]
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: drop
        regex: '(.*)9501'
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: drop
        regex: '(.*)3260'
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: drop
        regex: '(.*)80'        


  • Logs:
kubectl log -f prometheus-deployment-5d94d7c787-bmfqv -n prometheus-cstor 
log is DEPRECATED and will be removed in a future version. Use logs instead.
level=info ts=2018-12-19T07:25:54.313576245Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2018-12-19T07:25:54.31369562Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2018-12-19T07:25:54.313733234Z caller=main.go:245 host_details="(Linux 4.15.0-1017-gcp #18~16.04.1-Ubuntu SMP Fri Aug 10 13:26:07 UTC 2018 x86_64 prometheus-deployment-5d94d7c787-bmfqv (none))"
level=info ts=2018-12-19T07:25:54.313775827Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-12-19T07:25:54.313806527Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-12-19T07:25:54.319759309Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2018-12-19T07:25:54.321270216Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=warn ts=2018-12-19T07:25:54.348133331Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/prometheus/wal/00000054

simonpasquier commented Dec 19, 2018

I suspect that the Prometheus container gets OOM-killed by the system. Please check whether there's anything about this in the Kubernetes logs. Also, what are the memory limits of the pod?
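
For reference, a minimal sketch of how an OOM kill can be confirmed from the Kubernetes side (pod and namespace names are taken from the report above; adjust to your deployment):

# An OOM kill shows "Reason: OOMKilled" (exit code 137) under Last State in:
kubectl describe pod prometheus-deployment-5d94d7c787-bmfqv -n prometheus-cstor

# Recent namespace events (restarts, evictions, failed liveness probes):
kubectl get events -n prometheus-cstor --sort-by=.metadata.creationTimestamp

# Current memory usage of the pod (requires metrics-server):
kubectl top pod -n prometheus-cstor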

inyee786 commented Dec 19, 2018

Resource limits:

          resources:
            requests:
              memory: "250M"
              cpu: "250m"
            limits:
              memory: "5G"
              cpu: "1200m"

inyee786 commented Dec 19, 2018

@simonpasquier I have checked the kubelet logs and can't see any problem there.

simonpasquier commented Dec 20, 2018

Can you get any information from Kubernetes about whether it killed the pod or whether the application crashed? Maybe by looking at the events...

aixeshunter commented Dec 21, 2018

I had the same issue before; the Prometheus server restarted again and again. I deleted the WAL files and then it went back to normal.
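
For reference, the WAL segments named in the logs above live under the TSDB data directory. A minimal sketch, assuming the default --storage.tsdb.path=/prometheus used by this deployment (deleting the WAL discards any samples not yet compacted into a block):

# Inspect the WAL segments from inside the running container:
kubectl exec -n prometheus-cstor prometheus-deployment-5d94d7c787-bmfqv -- ls -lh /prometheus/wal

# As a last resort, stop Prometheus, clear the WAL on the persistent volume, and start it again:
kubectl scale deployment prometheus-deployment -n prometheus-cstor --replicas=0
# ... remove /prometheus/wal on the underlying volume (e.g. from a debug pod mounting the same PVC) ...
kubectl scale deployment prometheus-deployment -n prometheus-cstor --replicas=1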

inyee786 commented Dec 21, 2018

@aixeshunter where can I find the WAL files?

inyee786 commented Dec 21, 2018

@simonpasquier, from the logs I think the Prometheus pod is trying to load the config file (/etc/prometheus/conf/prometheus.yml), and when it can't load it the container restarts.

The pod itself was still there, but the Prometheus container kept restarting.
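
A broken configuration would normally show up as an explicit error in the Prometheus logs rather than a silent restart. One way to rule it out offline is to validate the file with promtool from the same release; a sketch assuming the ConfigMap content is saved locally as prometheus.yml:

# promtool ships inside the prom/prometheus image and reports YAML/field errors without starting the server:
docker run --rm -v "$PWD:/cfg" --entrypoint promtool prom/prometheus:v2.6.0 check config /cfg/prometheus.yml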

inyee786 changed the title from "prmetheus is restating again and again" to "Prometheus is restating again and again" on Dec 23, 2018

inyee786 commented Dec 25, 2018

@simonpasquier, the Prometheus container restarted after the log output below:

kubectl log -f prometheus-deployment-d4576878c-r9bsc -n prometheus-cstor
log is DEPRECATED and will be removed in a future version. Use logs instead.
level=info ts=2018-12-25T08:44:54.483987435Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2018-12-25T08:44:54.484143155Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2018-12-25T08:44:54.484191509Z caller=main.go:245 host_details="(Linux 4.15.0-1017-gcp #18~16.04.1-Ubuntu SMP Fri Aug 10 13:26:07 UTC 2018 x86_64 prometheus-deployment-d4576878c-r9bsc (none))"
level=info ts=2018-12-25T08:44:54.484265138Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-12-25T08:44:54.484386962Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-12-25T08:44:54.485817825Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2018-12-25T08:44:54.486651891Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=warn ts=2018-12-25T08:44:54.507230646Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/prometheus/wal/00000043
level=warn ts=2018-12-25T08:59:21.177362158Z caller=head.go:434 component=tsdb msg="unknown series references" count=32172
level=info ts=2018-12-25T08:59:23.507068573Z caller=main.go:571 msg="TSDB started"
level=info ts=2018-12-25T08:59:23.510001875Z caller=main.go:631 msg="Loading configuration file" filename=/etc/prometheus/conf/prometheus.yml
level=info ts=2018-12-25T08:59:23.744625645Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-25T08:59:23.77952836Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-25T08:59:23.781436297Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-12-25T08:59:23.783280368Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/prometheus/conf/prometheus.yml
level=info ts=2018-12-25T08:59:23.783313293Z caller=main.go:530 msg="Server is ready to receive web requests."

zrbcool commented Dec 26, 2018

We also have the same issue with prometheus:v2.6.0.

level=info ts=2018-12-26T12:33:49.087814226Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1545811200000 maxt=1545818400000 ulid=01CZN2QNWGA67JM88PA6QV9BNV
level=warn ts=2018-12-26T12:33:49.090865065Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=data/wal/00012483
level=warn ts=2018-12-26T12:34:25.186080245Z caller=head.go:434 component=tsdb msg="unknown series references" count=46
level=info ts=2018-12-26T12:34:26.063665629Z caller=main.go:571 msg="TSDB started"

Lots of memory usage before the crash:

(memory usage graph attached)

The Zabbix timezone is UTC+8 (China time zone).

zrbcool commented Dec 26, 2018

Also pasting the dmesg output from the OOM kill:

[10224513.627281] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[10224513.628838] java cpuset=/ mems_allowed=0
[10224513.630021] CPU: 6 PID: 9698 Comm: java Not tainted 3.10.0-514.26.2.el7.x86_64 #1
[10224513.631469] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[10224513.634259]  ffff88022bb85e20 00000000078798ca ffff880035f73910 ffffffff81687133
[10224513.635952]  ffff880035f739a0 ffffffff816820de ffffffff810eb0dc ffff8802313fc830
[10224513.637679]  ffff8802313fc848 0000000000000202 ffff88022bb85e20 ffff880035f73990
[10224513.639346] Call Trace:
[10224513.640485]  [<ffffffff81687133>] dump_stack+0x19/0x1b
[10224513.641886]  [<ffffffff816820de>] dump_header+0x8e/0x225
[10224513.643307]  [<ffffffff810eb0dc>] ? ktime_get_ts64+0x4c/0xf0
[10224513.644729]  [<ffffffff8113d22f>] ? delayacct_end+0x8f/0xb0
[10224513.646133]  [<ffffffff81184d0e>] oom_kill_process+0x24e/0x3c0
[10224513.647601]  [<ffffffff811847ad>] ? oom_unkillable_task+0xcd/0x120
[10224513.649057]  [<ffffffff81184856>] ? find_lock_task_mm+0x56/0xc0
[10224513.650440]  [<ffffffff81185546>] out_of_memory+0x4b6/0x4f0
[10224513.651801]  [<ffffffff81682be7>] __alloc_pages_slowpath+0x5d7/0x725
[10224513.653255]  [<ffffffff8118b655>] __alloc_pages_nodemask+0x405/0x420
[10224513.654690]  [<ffffffff811cf9ca>] alloc_pages_current+0xaa/0x170
[10224513.656092]  [<ffffffff81180be7>] __page_cache_alloc+0x97/0xb0
[10224513.657484]  [<ffffffff81183760>] filemap_fault+0x170/0x410
[10224513.658945]  [<ffffffffa01de016>] ext4_filemap_fault+0x36/0x50 [ext4]
[10224513.660452]  [<ffffffff811ac83c>] __do_fault+0x4c/0xc0
[10224513.661840]  [<ffffffff811accd3>] do_read_fault.isra.42+0x43/0x130
[10224513.663337]  [<ffffffff811b1461>] handle_mm_fault+0x6b1/0x1000
[10224513.664781]  [<ffffffff812feb58>] ? disk_seqf_stop+0x28/0x40
[10224513.666221]  [<ffffffff81325ef9>] ? copy_user_enhanced_fast_string+0x9/0x20
[10224513.667765]  [<ffffffff81692cc4>] __do_page_fault+0x154/0x450
[10224513.669219]  [<ffffffff816930a6>] trace_do_page_fault+0x56/0x150
[10224513.670669]  [<ffffffff8169274b>] do_async_page_fault+0x1b/0xd0
[10224513.672104]  [<ffffffff8168f238>] async_page_fault+0x28/0x30
[10224513.673583] Mem-Info:
[10224513.674754] active_anon:1915020 inactive_anon:164 isolated_anon:0
 active_file:827 inactive_file:4565 isolated_file:64
 unevictable:0 dirty:24 writeback:0 unstable:0
 slab_reclaimable:24059 slab_unreclaimable:6333
 mapped:547 shmem:231 pagetables:5495 bounce:0
 free:25678 free_pcp:126 free_cma:0
[10224513.683328] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[10224513.691233] lowmem_reserve[]: 0 2815 7804 7804
[10224513.692859] Node 0 DMA32 free:43724kB min:24328kB low:30408kB high:36492kB active_anon:2777056kB inactive_anon:140kB active_file:1828kB inactive_file:6812kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2884460kB mlocked:0kB dirty:52kB writeback:0kB mapped:1568kB shmem:300kB slab_reclaimable:31796kB slab_unreclaimable:6956kB kernel_stack:1344kB pagetables:6032kB unstable:0kB bounce:0kB free_pcp:436kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:17132 all_unreclaimable? yes
[10224513.702731] lowmem_reserve[]: 0 0 4989 4989
[10224513.704388] Node 0 Normal free:44892kB min:43120kB low:53900kB high:64680kB active_anon:4883024kB inactive_anon:516kB active_file:1708kB inactive_file:8924kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5242880kB managed:5109528kB mlocked:0kB dirty:44kB writeback:0kB mapped:620kB shmem:624kB slab_reclaimable:64440kB slab_unreclaimable:18376kB kernel_stack:4144kB pagetables:15948kB unstable:0kB bounce:0kB free_pcp:1024kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12201 all_unreclaimable? yes
[10224513.714422] lowmem_reserve[]: 0 0 0 0
[10224513.716062] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[10224513.720096] Node 0 DMA32: 303*4kB (UE) 283*8kB (UE) 796*16kB (UE) 434*32kB (UE) 142*64kB (UEM) 21*128kB (UE) 4*256kB (UEM) 2*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 44948kB
[10224513.724628] Node 0 Normal: 536*4kB (UEM) 389*8kB (UEM) 1100*16kB (UEM) 368*32kB (UE) 103*64kB (UEM) 3*128kB (UM) 0*256kB 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 42632kB
[10224513.729452] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[10224513.731961] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[10224513.735635] 4951 total pagecache pages
[10224513.738696] 0 pages in swap cache
[10224513.741736] Swap cache stats: add 0, delete 0, find 0/0
[10224513.745149] Free swap  = 0kB
[10224513.748179] Total swap = 0kB
[10224513.750935] 2097022 pages RAM
[10224513.753878] 0 pages HighMem/MovableOnly
[10224513.756894] 94548 pages reserved
[10224513.759841] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[10224513.763406] [  379]     0   379    13396       98      26        0             0 systemd-journal
[10224513.767063] [  404]     0   404    10880      135      21        0         -1000 systemd-udevd
[10224513.770685] [  418]     0   418    13854      114      26        0         -1000 auditd
[10224513.774262] [  442]    81   442     6131      121      16        0          -900 dbus-daemon
[10224513.777649] [  494]   998   494   131905     1388      55        0             0 polkitd
[10224513.781278] [  496]     0   496     4824       74      14        0             0 irqbalance
[10224513.783802] [  499]     0   499     6048       91      15        0             0 systemd-logind
[10224513.786063] [  507]     0   507     6461       53      18        0             0 atd
[10224513.788230] [  509]     0   509    31556      156      19        0             0 crond
[10224513.790394] [  514]     0   514    27508       32      10        0             0 agetty
[10224513.792557] [  515]     0   515    27508       32       9        0             0 agetty
[10224513.794711] [  804]     0   804    33731     8533      67        0             0 dhclient
[10224513.796877] [  865]     0   865   138411     2673      90        0             0 tuned
[10224513.799029] [  866]     0   866   121957      696     105        0             0 rsyslogd
[10224513.801146] [ 1024]    38  1024     7473      159      18        0             0 ntpd
[10224513.803193] [ 4001]     0  4001    26489      248      53        0         -1000 sshd
[10224513.805293] [ 4468]     0  4468    16855      681      16        0             0 aliyun-service
[10224513.807399] [ 9047]     0  9047    30508      153      12        0             0 wrapper
[10224513.809386] [ 9440]     0  9440   629576    18420      93        0             0 java
[10224513.811326] [10376]  1000 10376   227128     3610      48        0             0 node_exporter
[10224513.813328] [31010]  1000 31010   749249    15961     132        0             0 grafana-server
[10224513.815300] [31632]  1000 31632     4998     1507      14        0             0 prometheus-webh
[10224513.817238] [31706]  1000 31706     6078     1521      17        0             0 alertmanager
[10224513.819125] [18076]     0 18076     7917      187      20        0             0 AliYunDunUpdate
[10224513.821033] [18125]     0 18125    32957      468      60        0             0 AliYunDun
[10224513.822864] [29011]   996 29011    20195      203      39        0             0 zabbix_agentd
[10224513.824695] [29012]   996 29012    20195      339      39        0             0 zabbix_agentd
[10224513.826467] [29013]   996 29013    20301      254      39        0             0 zabbix_agentd
[10224513.828264] [29014]   996 29014    20301      251      39        0             0 zabbix_agentd
[10224513.830047] [29015]   996 29015    20301      254      39        0             0 zabbix_agentd
[10224513.831830] [29016]   996 29016    20195      241      39        0             0 zabbix_agentd
[10224513.833591] [23583]     0 23583    48144     4961      57        0             0 consul
[10224513.835247] [17720]     0 17720    36544      324      72        0             0 sshd
[10224513.836859] [17722]  1000 17722    36544      320      70        0             0 sshd
[10224513.838536] [17723]  1000 17723    28878      129      14        0             0 bash
[10224513.840090] [18121]  1000 18121 39599911  1851018    4078        0             0 prometheus
[10224513.841682] Out of memory: Kill process 18121 (prometheus) score 925 or sacrifice child
[10224513.843136] Killed process 18121 (prometheus) total-vm:158399644kB, anon-rss:7403716kB, file-rss:356kB, shmem-rss:0kB

inyee786 commented Dec 30, 2018

@zrbcool how many workloads/applications are you running in the cluster? Did you add a node selector for the Prometheus deployment?

inyee786 commented Dec 30, 2018

@simonpasquier

Kubelet logs while Prometheus was starting:

2340 mount_linux.go:520] Disk successfully formatted (mkfs): ext4 - /dev/disk/by-path/ip-10.39.242.18:3260-iscsi-iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b-lun-0 /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.39.242.18:3260-iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b-lun-0
2340 operation_generator.go:495] MountVolume.WaitForAttach succeeded for volume "pvc-ef41119c-0c41-11e9-8a82-42010af0015b" (UniqueName: "kubernetes.io/iscsi/10.39.242.18:3260:iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b:0") pod "prometheus-deployment-7788f7f987-sr2g5" (UID: "eb6806c6-0c41-11e9-8a82-42010af0015b") DevicePath "/dev/disk/by-path/ip-10.39.242.18:3260-iscsi-iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b-lun-0"
2340 operation_generator.go:514] MountVolume.MountDevice succeeded for volume "pvc-ef41119c-0c41-11e9-8a82-42010af0015b" (UniqueName: "kubernetes.io/iscsi/10.39.242.18:3260:iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b:0") pod "prometheus-deployment-7788f7f987-sr2g5" (UID: "eb6806c6-0c41-11e9-8a82-42010af0015b") device mount path "/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.39.242.18:3260-iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b-lun-0"
2340 server.go:796] GET /pods/: (1.445177ms) 200 [[Go-http-client/1.1] 127.0.0.1:51274]
2340 operation_generator.go:557] MountVolume.SetUp succeeded for volume "pvc-ef41119c-0c41-11e9-8a82-42010af0015b" (UniqueName: "kubernetes.io/iscsi/10.39.242.18:3260:iqn.2016-09.com.openebs.cstor:pvc-ef41119c-0c41-11e9-8a82-42010af0015b:0") pod "prometheus-deployment-7788f7f987-sr2g5" (UID: "eb6806c6-0c41-11e9-8a82-42010af0015b")
2340 kuberuntime_manager.go:385] No sandbox for pod "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)" can be found. Need to start a new one
2340 server.go:796] GET /pods/: (1.408174ms) 200 [[Go-http-client/1.1] 127.0.0.1:51278]
2340 kubelet.go:1910] SyncLoop (PLEG): "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)", event: &pleg.PodLifecycleEvent{ID:"eb6806c6-0c41-11e9-8a82-42010af0015b", Type:"ContainerStarted", Data:"fc1422e637d60d0d86fad9f25537593e5c2a4f883276551468f1fc3fc3623663"}
2340 provider.go:119] Refreshing cache for provider: *gcp_credentials.dockerConfigUrlKeyProvider
2340 config.go:191] body of failing http response: &{0x6f0d10 0xc424302e00 0x6fb050}
2340 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
2340 provider.go:119] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
2340 provider.go:119] Refreshing cache for provider: *gcp_credentials.dockerConfigKeyProvider
2340 config.go:191] body of failing http response: &{0x6f0d10 0xc42632a2c0 0x6fb050}
2340 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg
2340 server.go:796] GET /pods/: (1.409225ms) 200 [[Go-http-client/1.1] 127.0.0.1:51306]
2340 server.go:796] GET /healthz: (25.73µs) 200 [[curl/7.47.0] 127.0.0.1:51326]
2340 server.go:796] GET /pods/: (1.487285ms) 200 [[Go-http-client/1.1] 127.0.0.1:51328]
2340 server.go:796] GET /pods/: (1.369574ms) 200 [[Go-http-client/1.1] 127.0.0.1:51344]
2340 server.go:796] GET /pods/: (1.44487ms) 200 [[Go-http-client/1.1] 127.0.0.1:51348]
2340 server.go:796] GET /pods/: (16.705302ms) 200 [[Go-http-client/1.1] 127.0.0.1:51358]
2340 server.go:796] GET /pods/: (1.412461ms) 200 [[Go-http-client/1.1] 127.0.0.1:51362]
2340 server.go:796] GET /pods/: (1.241505ms) 200 [[Go-http-client/1.1] 127.0.0.1:51366]
2340 kube_docker_client.go:348] Stop pulling image "prom/prometheus:v2.6.0": "Status: Downloaded newer image for prom/prometheus:v2.6.0"
2340 server.go:796] GET /pods/: (1.446758ms) 200 [[Go-http-client/1.1] 127.0.0.1:51376]
2340 server.go:796] GET /stats/summary/: (362.07652ms) 200 [[Go-http-client/1.1] 10.36.2.249:57372]
2340 server.go:796] GET /stats/summary/: (461.436102ms) 200 [[Go-http-client/1.1] 10.36.2.127:39236]
2340 kubelet.go:1910] SyncLoop (PLEG): "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)", event: &pleg.PodLifecycleEvent{ID:"eb6806c6-0c41-11e9-8a82-42010af0015b", Type:"ContainerStarted", Data:"191fc4e86b53460a4f31cab31b45590cbf51afe475b410541efacc3ec054791d"}
2340 server.go:796] GET /pods/: (1.113902ms) 200 [[Go-http-client/1.1] 127.0.0.1:51392]
2340 server.go:796] GET /pods/: (1.396775ms) 200 [[Go-http-client/1.1] 127.0.0.1:51396]
2340 server.go:796] GET /pods/: (1.424777ms) 200 [[Go-http-client/1.1] 127.0.0.1:51400]
2340 server.go:796] GET /pods/: (5.07432ms) 200 [[Go-http-client/1.1] 127.0.0.1:51404]
2340 server.go:796] GET /healthz: (54.621µs) 200 [[curl/7.47.0] 127.0.0.1:51406]
2340 server.go:796] GET /pods/: (1.0011ms) 200 [[Go-http-client/1.1] 127.0.0.1:51410]
2340 server.go:796] GET /pods/: (1.426049ms) 200 [[Go-http-client/1.1] 127.0.0.1:51452]
2340 server.go:796] GET /pods/: (1.420852ms) 200 [[Go-http-client/1.1] 127.0.0.1:51482]
2340 server.go:796] GET /pods/: (1.33687ms) 200 [[Go-http-client/1.1] 127.0.0.1:51498]
2340 server.go:796] GET /metrics/cadvisor: (2.35306293s) 200 [[Prometheus/2.6.0] 10.128.0.8:43152]
2340 server.go:796] GET /pods/: (1.574998ms) 200 [[Go-http-client/1.1] 127.0.0.1:51522]
2340 container_manager_linux.go:427] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
2340 server.go:796] GET /metrics: (20.232171ms) 200 [[Prometheus/2.6.0] 10.36.8.120:44464]
2340 server.go:796] GET /pods/: (1.617695ms) 200 [[Go-http-client/1.1] 127.0.0.1:51530]
2340 server.go:796] GET /pods/: (1.080512ms) 200 [[Go-http-client/1.1] 127.0.0.1:51548]
2340 server.go:796] GET /pods/: (1.515745ms) 200 [[Go-http-client/1.1] 127.0.0.1:51552]
2340 server.go:796] GET /pods/: (1.516076ms) 200 [[Go-http-client/1.1] 127.0.0.1:51558]
2340 server.go:796] GET /pods/: (1.831936ms) 200 [[Go-http-client/1.1] 127.0.0.1:51564]
2340 server.go:796] GET /healthz: (43.163µs) 200 [[curl/7.47.0] 127.0.0.1:51566]

Kubelet logs at the time Prometheus stopped:

2340 logs.go:383] Container "191fc4e86b53460a4f31cab31b45590cbf51afe475b410541efacc3ec054791d" is not running (state="CONTAINER_EXITED")
2340 server.go:796] GET /containerLogs/prometheus-cstor/prometheus-deployment-7788f7f987-sr2g5/prometheus?follow=true: (2h17m27.999789701s) 200 [[Go-http-client/1.1] 10.128.0.8:49676]
2340 server.go:796] GET /metrics/cadvisor: (4.275543863s) 200 [[Prometheus/2.6.0] 10.128.0.8:42126]
2340 server.go:796] GET /healthz: (24.369µs) 200 [[Go-http-client/1.1] 127.0.0.1:50104]
2340 kubelet.go:1910] SyncLoop (PLEG): "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)", event: &pleg.PodLifecycleEvent{ID:"eb6806c6-0c41-11e9-8a82-42010af0015b", Type:"ContainerDied", Data:"191fc4e86b53460a4f31cab31b45590cbf51afe475b410541efacc3ec054791d"}
2340 kuberuntime_manager.go:513] Container {Name:prometheus Image:prom/prometheus:v2.6.0 Command:[] Args:[--config.file=/etc/prometheus/conf/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention=$(STORAGE_RETENTION)] WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:9090 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:STORAGE_RETENTION Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:openebs-prometheus-tunables,},Key:storage-retention,Optional:nil,},SecretKeyRef:nil,}}] Resources:{Limits:map[cpu:{i:{value:700 scale:-3} d:{Dec:<nil>} s:700m Format:DecimalSI} memory:{i:{value:4294967296 scale:0} d:{Dec:<nil>} s:4Gi Format:BinarySI}] Requests:map[cpu:{i:{value:250 scale:-3} d:{Dec:<nil>} s:250m Format:DecimalSI} memory:{i:{value:262144000 scale:0} d:{Dec:<nil>} s:250Mi Format:BinarySI}]} VolumeMounts:[{Name:prometheus-storage-volume ReadOnly:false MountPath:/prometheus SubPath: MountPropagation:<nil>} {Name:prometheus-server-volume ReadOnly:false MountPath:/etc/prometheus/conf SubPath: MountPropagation:<nil>} {Name:prometheus-token-9qplc ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
2340 kuberuntime_manager.go:757] checking backoff for container "prometheus" in pod "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)"
2340 server.go:796] GET /pods/: (1.439817ms) 200 [[Go-http-client/1.1] 127.0.0.1:50108]
2340 kubelet.go:1910] SyncLoop (PLEG): "prometheus-deployment-7788f7f987-sr2g5_prometheus-cstor(eb6806c6-0c41-11e9-8a82-42010af0015b)", event: &pleg.PodLifecycleEvent{ID:"eb6806c6-0c41-11e9-8a82-42010af0015b", Type:"ContainerStarted", Data:"21dd3ba546487d39e19426f432d4677412cde76f1de679687e10109277ee9b39"}
2340 server.go:796] GET /pods/: (1.215252ms) 200 [[Go-http-client/1.1] 127.0.0.1:50114]
2340 server.go:796] GET /pods/: (2.22793ms) 200 [[Go-http-client/1.1] 127.0.0.1:50118]
2340 logs.go:383] Container "191fc4e86b53460a4f31cab31b45590cbf51afe475b410541efacc3ec054791d" is not running (state="CONTAINER_EXITED")
2340 server.go:796] GET /containerLogs/prometheus-cstor/prometheus-deployment-7788f7f987-sr2g5/prometheus?follow=true: (6m52.299545209s) 200 [[Go-http-client/1.1] 10.128.0.8:42010]

inyee786 commented Dec 30, 2018

@aixeshunter did you create a Docker image of Prometheus without the WAL files?

simonpasquier commented Jan 2, 2019

The memory requirements depend mostly on the number of scraped time series (check the prometheus_tsdb_head_series metric) and heavy queries.

@inyee786 you could increase the memory limits of the Prometheus pod.

@zrbcool IIUC you're not running Prometheus with cgroup limits so you'll have to increase the amount of RAM or reduce the number of scrape targets.
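
For the pod case, a minimal sketch of raising the limits in place (deployment, namespace and container names are taken from this thread; the values are only an example):

# Raise the memory limit on the prometheus container and let the deployment roll:
kubectl set resources deployment prometheus-deployment -n prometheus-cstor \
  --containers=prometheus --requests=memory=2Gi --limits=memory=8Gi

# Watch the rollout and verify the new limits:
kubectl rollout status deployment prometheus-deployment -n prometheus-cstor
kubectl get pod -n prometheus-cstor -o jsonpath='{.items[*].spec.containers[*].resources}'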

inyee786 commented Jan 5, 2019

@simonpasquier
I have seen that Prometheus uses less memory during the first ~2 hours, but after that memory usage climbs to the maximum limit, so there seems to be a problem somewhere.
I have already given it 5 GB of RAM; how much more do I have to add?

simonpasquier commented Jan 7, 2019

A rough estimate is that you need at least 8kB per time series in the head (check the prometheus_tsdb_head_series metric). In addition you need to account for block compaction, recording rules and running queries.
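
As a worked example of that rule of thumb (the 2,000,000-series figure below is purely illustrative):

# Read the current number of head series via the HTTP API:
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'

# With ~2,000,000 head series the floor for the head block alone is roughly
#   2,000,000 series * 8 kB/series ≈ 16 GB
# plus headroom for compaction, rule evaluation and queries; conversely, a 5 GB
# limit only leaves room for a few hundred thousand series.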

inyee786 changed the title from "Prometheus is restating again and again" to "Prometheus is restarting again and again" on Jan 9, 2019

inyee786 commented Jan 14, 2019

@simonpasquier
I got the below value of prometheus_tsdb_head_series:
(screenshot from 2019-01-14 18-06-11 attached)

I also tried version 2.0.0 and it is working.

simonpasquier commented Jan 17, 2019

@inyee786 can you increase the memory limits and see if it helps? Getting the logs from the crashed pod would also be useful.

dcvtruong commented Mar 11, 2019

Hi @simonpasquier,

I'm also getting this error in prometheus-server (v2.6.1 + k8s 1.13). I've increased the RAM but prometheus-server never recovers. Is there a remedy or workaround?

	$ kubectl logs prometheus-server-7f4c577d49-lq2z6 -n monitoring prometheus-server
	level=info ts=2019-03-11T13:04:34.063372989Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
	level=info ts=2019-03-11T13:04:34.063483589Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
	level=info ts=2019-03-11T13:04:34.063516589Z caller=main.go:245 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-server-7f4c577d49-lq2z6 (none))"
	level=info ts=2019-03-11T13:04:34.06354359Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
	level=info ts=2019-03-11T13:04:34.06356819Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
	level=info ts=2019-03-11T13:04:34.064543596Z caller=main.go:561 msg="Starting TSDB ..."
	level=info ts=2019-03-11T13:04:34.064602196Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
	level=warn ts=2019-03-11T13:04:34.298504993Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/data/wal/00000266

nickychow commented Mar 17, 2019

I got the exact same issue. The prometheus-server is running on 16 GB RAM worker nodes without resource limits.

The kubelet logs look totally normal.

Here are the prometheus-server logs:

level=warn ts=2019-03-17T13:50:05.657332763Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/data/wal/00000105
level=info ts=2019-03-17T13:50:13.064868189Z caller=main.go:509 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-17T13:50:13.064926199Z caller=main.go:523 msg="Stopping notify discovery manager..."
level=info ts=2019-03-17T13:50:13.06494207Z caller=main.go:545 msg="Stopping scrape manager..."
level=info ts=2019-03-17T13:50:13.064980223Z caller=main.go:519 msg="Notify discovery manager stopped"
level=info ts=2019-03-17T13:50:13.065007435Z caller=main.go:505 msg="Scrape discovery manager stopped"
level=info ts=2019-03-17T13:50:13.064996356Z caller=main.go:539 msg="Scrape manager stopped"
level=info ts=2019-03-17T13:50:13.065054343Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-17T13:50:13.065083549Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-17T13:50:13.065096144Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-17T13:50:13.065105591Z caller=main.go:708 msg="Notifier manager stopped"
level=error ts=2019-03-17T13:50:13.065335304Z caller=main.go:717 err="opening storage failed: invalid block sequence: block time ranges overlap: [mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s, blocks: 44]: <ulid: 01D65WS5TSKJVSN2FE9F78WT53, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65W6TYMA5889TMEKDFQTW7F, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65W8NN2GG07TX42D553VS70, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WAGA6RG36077G7TZ7V0MQ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WCB1R5TJ90VSX0XXBGQ81, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WE5RCG732MASVKHYAGPKX, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WG0CPF3YFFKY091T5MFAA, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WHV1ZHJFQZHKJKCBK72NJ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WKNP52HZ6J8RS6TACGWM5, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WNGDWQAJ9W6KB5S20V812, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WQB4AN358F3QAYDWTQCXB, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65W6N7A9493F8X7QJSA2VE2, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WV0GDG56GV6SYHEK8KCQ9, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WWV6XKEVKCGFG638B1AH3, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65WYNVPVMSMB0HAS7FRA4NB, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X0GG5KM3KF15ZDXFTNZBE, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X2B5WF1G18FZ32N9Y2ZN3, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X45TE40979K7DDYGCK6DY, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X60GQGDHBFBX02F68Y9FS, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X7V6PJERDP0YBZGC49Y0F, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65X9NX94JX6S3D43S9Q0MWK, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YKW8H4BTBW8XEKADCJGYN, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XDB9VKGX7WBNYG01HV4ZS, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XF5ZNW3BP89EK5SNJ9Y0M, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XH0NY4J2EM6SH3CWTHYYS, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XJVC6T32WNQYCZ76991BT, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XMP2NBCV1FDVVY2E3ZYWJ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XPGS4MBFWDMXMD43S8QZR, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XRBE4SYNJ709TQF87CDPT, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XT64T185JV4V29SDVCDR2, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XW0STE3DG0J63NF28F3BQ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XXVGTZGPH9JMX35VPMNR6, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XZP7B6W8PRK32VYWPPZHQ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65Y1GZ2X5AF3P4S8MAB2STZ, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65Y3BNMV6VFQBP28NRGPMWM, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65Y56CH4D431S1AQCDYHBWM, mint: 1552816800000, maxt: 1552824000000, range: 
2h0m0s>, <ulid: 01D65Y713MMJQBCY0XXEQD0MSY, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65Y8VTXQMTX7MQTDAWQHWHD, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YAPJ4R5PDNS5SC7MFYX57, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YCH9CZWM8ZYGTVHVTD2BN, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YEC1CYEY0N1HZV7ZQAFJD, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YG6SXV5VAQWJ8JQMFEP7G, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65YJ1K10MAT876RGTTCG6NS, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>, <ulid: 01D65XBGK8MCRP8J7T0QHRJRSR, mint: 1552816800000, maxt: 1552824000000, range: 2h0m0s>"

simonpasquier commented Mar 18, 2019

@dcvtruong @nickychow your issues don't seem to be related to the original one. Data on disk seems to be corrupted somehow and you'll have to delete the data directory. Also make sure that you're running the latest stable version of Prometheus as recent versions include many stability improvements.
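
A minimal sketch of that recovery, assuming the data directory sits on a persistent volume mounted at /data as in the logs above (deployment and namespace names follow dcvtruong's output and may differ; this discards all stored metrics):

# Stop Prometheus so nothing writes to the TSDB while it is being cleared:
kubectl scale deployment prometheus-server -n monitoring --replicas=0

# Remove (or move aside) the corrupted blocks and WAL on the persistent volume,
# e.g. from a temporary debug pod that mounts the same PVC:
#   rm -rf /data/*

# Start Prometheus again; it creates a fresh, empty TSDB on startup:
kubectl scale deployment prometheus-server -n monitoring --replicas=1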

vtomasr5 commented Mar 21, 2019

Hi,

We have the same problem. We increased the memory but it doesn't solve the problem.

We use Consul to auto-discover the services that expose the metrics. In our case, we've discovered that the Consul queries used for checking the services to scrape take too long and hit the timeout limit.

http://consul.service.cluster00-pro.consul:8500/v1/catalog/service/gs-sysops-management-web-node?index=4646&stale=&wait=30000ms

As you can see, the index parameter in the URL turns it into a blocking query, as described in the Consul documentation: https://www.consul.io/api/index.html#blocking-queries

Running some curl commands, the answer is immediate when the index= parameter is omitted; otherwise it takes 30s.
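
A sketch of that comparison, using the example URL from above (any Consul catalog endpoint behaves the same way):

# Blocking query: waits up to the requested 30s for the index to move past 4646:
time curl -s 'http://consul.service.cluster00-pro.consul:8500/v1/catalog/service/gs-sysops-management-web-node?index=4646&stale=&wait=30000ms' > /dev/null

# Non-blocking query: returns immediately with the current catalog state:
time curl -s 'http://consul.service.cluster00-pro.consul:8500/v1/catalog/service/gs-sysops-management-web-node?stale=' > /dev/null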

Is there any configuration that we can tune or change in order to improve the service checking using consul?

EDIT: We use prometheus 2.7.1 and consul 1.4.3

Thanks.
