Prometheus 2.0: remote storage being full caused Prometheus to behave incorrectly #3796

Closed
tangyong opened this Issue Feb 5, 2018 · 2 comments

tangyong commented Feb 5, 2018

What did you do?

We are doing Prometheus monitoring with Grafana, Consul, and remote storage. The CrateDB disk filled up, and after we freed space on the CrateDB disk we found that Prometheus behaved incorrectly.

The setup is: Grafana <- Prometheus -> CrateDB adapter -> CrateDB, and Prometheus uses Consul service discovery.
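For reference, the remote storage path is wired up through remote_write/remote_read sections pointing at the CrateDB adapter. A minimal sketch of what that part of the configuration looks like (the adapter endpoint and port below are illustrative placeholders, not our actual values):

remote_write:
- url: http://xx.yy.zz.kk:9268/write   # CrateDB adapter write endpoint (port is an assumption)
remote_read:
- url: http://xx.yy.zz.kk:9268/read    # CrateDB adapter read endpoint (port is an assumption)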

What did you expect to see?

We expected that, after the CrateDB disk filled up and we then freed space on it:

  1. Prometheus/Consul should not be affected
  2. Prometheus remote write/read should recover
  3. Prometheus local storage should not be affected
  4. Grafana should not be affected

What did you see instead? Under which circumstances?

Instead, we saw the following:

  1. Prometheus/Consul produced the following error (see the logs below for details):
    ... msg="Error refreshing service" ...
  2. Prometheus remote write/read did not recover
  3. Prometheus local storage was affected
  4. Grafana was affected and could not display data

Environment

  • System information:

      Linux 2.6.32-279.el6.x86_64 x86_64
    
  • Prometheus version:

Version: 2.0.0
Revision: 0a74f98
Branch: HEAD
BuildUser: root@615b82cb36b6
BuildDate: 20171108-07:11:59
GoVersion: go1.9.2

  • Alertmanager version:

version: 0.12.0

  • Prometheus configuration file:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    prometheusName: NJXZ-P1
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - xx.yy.zz.kk:9093
    scheme: http
    timeout: 10s
rule_files:
- /opt/prometheus-2.0.0.linux-amd64/rules.yml
scrape_configs:
- job_name: consul_sd_configs
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: xx.yy.zz.kk:9996
    tag_separator: ','
    scheme: http
    services:
    - promether-exporter
  relabel_configs:
  - source_labels: [__meta_consul_service]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_node]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: appId
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ldc
    replacement: $2
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: env
    replacement: $3
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ip
    replacement: $4
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: software
    replacement: $5
    action: replace
- job_name: CTDSA_NJXZ_SIT_ xx.yy.zz.kk_JBOSSSERVER
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9100
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
- job_name: CTDSA_NJXZ_SIT_ xx.yy.zz.kk_pushgateway
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9091
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway
  • Alertmanager configuration file:
  • Alertmanager configuration file:

I feel Alertmanager is not relevant to this issue, so please allow me to omit it.

  • Logs:

level=info ts=2018-02-05T02:09:14.673576977Z caller=queue_manager.go:338 component=remote msg="Currently resharding, skipping."
level=warn ts=2018-02-05T02:09:15.828729838Z caller=queue_manager.go:225 component=remote msg="Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed."
level=info ts=2018-02-05T02:09:22.132293894Z caller=queue_manager.go:253 component=remote msg="Stopping remote storage..."
level=error ts=2018-02-05T02:09:35.918461623Z caller=consul.go:283 component="target manager" discovery=consul msg="Error refreshing service" service=promether-exporter err="Get http://10.27.136.227:9996/v1/catalog/service/promether-exporter?index=116543&wait=30000ms: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
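
For context on the "Remote storage queue full, discarding sample" line: the remote write path buffers samples in an in-memory queue per shard, and once that queue is saturated (here because CrateDB stopped accepting writes) new samples are discarded rather than spooled to disk. In later 2.x releases the relevant knobs sit under remote_write.queue_config; a rough sketch with illustrative values (parameter names from the upstream remote_write documentation; availability and defaults in 2.0.0 may differ):

remote_write:
- url: http://xx.yy.zz.kk:9268/write   # hypothetical adapter endpoint, as above
  queue_config:
    capacity: 10000            # samples buffered per shard before samples are dropped
    max_shards: 30             # upper bound used when resharding
    max_samples_per_send: 100  # batch size per request to the adapter
    batch_send_deadline: 5s    # flush a batch even if it is not full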

gouthamve commented Jun 22, 2018

Hi, could you update to the latest Prometheus and see if that fixes the issue?

tomwilkie commented Mar 4, 2019

This should be fixed by the new WAL-based remote_write code in 2.8.

tomwilkie closed this Mar 4, 2019
