
Prometheus 2.0: after killing remote storage adapter, data is lost on remote storage #3800

Closed
tangyong opened this Issue Feb 5, 2018 · 7 comments


tangyong commented Feb 5, 2018

What did you do?

We are monitoring with Prometheus, using Grafana, Consul, and remote storage. After we killed the CrateDB adapter and later recovered it, we found that the data from the outage period is lost on the remote storage (that is to say, the samples collected during that period were not resent to the remote storage).

The setup is: Grafana <- Prometheus -> CrateDB adapter -> CrateDB, and Prometheus uses Consul service discovery.
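
For reference, the remote-storage wiring in such a setup looks roughly like the snippet below. The /write URL is taken from the error logs further down in this report; the remote_read block is an assumption, since the configuration excerpt below does not include the remote sections.

    remote_write:
    - url: http://10.37.149.76:9268/write   # CrateDB adapter write endpoint (from the logs)
    remote_read:
    - url: http://10.37.149.76:9268/read    # assumed read endpoint; not shown in the posted config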

What did you expect to see?

We expected the data from the outage period to be resent to the remote storage.

What did you see instead? Under which circumstances?

The data from the outage period was not resent to the remote storage.

Environment

  • System information:

    Linux 2.6.32-279.el6.x86_64 x86_64

  • Prometheus version:

Version: 2.0.0
Revision: 0a74f98
Branch: HEAD
BuildUser: root@615b82cb36b6
BuildDate: 20171108-07:11:59
GoVersion: go1.9.2

  • Alertmanager version:

    version: 0.12.0

  • Prometheus configuration file:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    prometheusName: NJXZ-P1
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - xx.yy.zz.kk:9093
    scheme: http
    timeout: 10s
rule_files:
- /opt/prometheus-2.0.0.linux-amd64/rules.yml
scrape_configs:
- job_name: consul_sd_configs
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: xx.yy.zz.kk:9996
    tag_separator: ','
    scheme: http
    services:
    - promether-exporter
  relabel_configs:
  - source_labels: [__meta_consul_service]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_node]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: appId
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ldc
    replacement: $2
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: env
    replacement: $3
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ip
    replacement: $4
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: software
    replacement: $5
    action: replace
- job_name: CTDSA_NJXZ_SIT_xx.yy.zz.kk_JBOSSSERVER
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9100
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
- job_name: CTDSA_NJXZ_SIT_xx.yy.zz.kk_pushgateway
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9091
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway

  • Alertmanager configuration file:

I feel Alertmanager is not relevant to this issue, so please allow me to omit it.

  • Logs:
    level=warn ts=2018-02-02T01:47:25.921298237Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=4 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
    level=warn ts=2018-02-02T01:47:26.022207822Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=4 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"

brian-brazil commented Feb 5, 2018

How long was it down for? There are already retries, but buffering indefinitely would threaten the reliability of the Prometheus server's core functionality.
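
(For reference, the amount of buffering and retrying on the remote-write path is bounded by the queue settings. In 2.x they can be tuned roughly as in the sketch below; the values shown are illustrative rather than the defaults, and whether queue_config is available in exactly 2.0.0 should be checked against the documentation for that version.)

    remote_write:
    - url: http://10.37.149.76:9268/write
      queue_config:            # per-remote-write queue tuning (availability depends on version)
        capacity: 10000        # illustrative: samples buffered per shard
        max_retries: 10        # illustrative: retries per batch before the batch is dropped
        min_backoff: 30ms
        max_backoff: 100ms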


tangyong commented Feb 6, 2018

@brian-brazil I will arrange a re-test to confirm how long the adapter was down, and then send the test results.


tangyong commented Feb 6, 2018

@brian-brazil We have done a more detailed test, as follows:

[Test Process]

  1) Kill the cratedb adapter process.
  2) Wait half an hour (30 min), then recover the cratedb adapter process.

[Results between killing and recovering]
1) Prometheus
status: a query error occurred in the console; however, the data is still present in the local TSDB.
[screenshot]

The log is as follows ("http://10.37.149.76:9268" is the cratedb adapter endpoint):

level=warn ts=2018-02-06T07:54:43.342766278Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
level=warn ts=2018-02-06T07:54:43.444068027Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
level=warn ts=2018-02-06T07:54:43.545798144Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
level=warn ts=2018-02-06T07:54:43.647194118Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
level=warn ts=2018-02-06T07:54:43.748752122Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"

My question:
Here, querying an instant vector from the console should not need to contact the remote storage, so I am a little curious why this error occurs.

2) Grafana
status: cannot display data normally.
[screenshot]

3) CrateDB
status: not affected; only the Prometheus -> CrateDB remote write does not work, which is expected.

[Results after recovering]
1) Prometheus
status: queries work normally in the console, including for data from the period between killing and recovering the adapter.

2) Grafana
status: displays normally, including the data from the period between killing and recovering the adapter.

3) CrateDB
status: the data from the period between killing and recovering the adapter is missing.

End of testing.

brian-brazil commented Feb 6, 2018

This is expected behaviour; the retry logic is only meant to handle brief blips, not extended outages.


tangyong commented Feb 6, 2018

@brian-brazil Thanks for the quick reply!

However, I have a question:

Between killing and recovering the adapter, querying an instant vector from the console should not need to contact the remote storage, so I am a little curious about the error. Instead, Prometheus should first query the local TSDB, and at that point the local TSDB was working fine.


brian-brazil commented Feb 6, 2018

That's #2573
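
(Editorial note for anyone hitting the same symptom: the console error is consistent with a remote_read block being configured for the adapter, something along the lines of the sketch below, since at the time a failing remote read endpoint could fail the whole query, which is presumably what #2573 addresses. The URL here is an assumption and is not part of the configuration excerpt above.)

    remote_read:
    - url: http://10.37.149.76:9268/read   # assumed adapter read endpoint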


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
