Prometheus 2.0: after killing remote storage adapter, data is lost on remote storage #3800
Comments
How long was it down for? There are already retries, but buffering indefinitely would threaten the reliability of the Prometheus server's core functionality.
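For reference, the size of that retry buffer is tunable per remote_write queue. A minimal sketch, assuming a 2.0-era queue_config (field names and defaults vary by release; the values below are illustrative, not recommendations):

remote_write:
- url: http://10.37.149.76:9268/write
  queue_config:
    capacity: 10000            # samples buffered per shard while the endpoint is unreachable
    max_shards: 1000           # upper bound on parallel send shards
    max_samples_per_send: 100  # batch size per request
    batch_send_deadline: 5s    # flush a partial batch after this long
    max_retries: 3             # attempts per batch before its samples are discarded

Raising capacity and max_retries only widens the outage window that can be absorbed in memory; once the queue is full or retries are exhausted, samples are dropped, which is the data loss described in this issue.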
@brian-brazil I will arrange a re-test to confirm how long the adapter was down, then re-send the test results.
@brian-brazil We have run a more detailed test, as follows:
[Test Process]
[Results between killing and recovering] The log is as follows ("http://10.37.149.76:9268" is the CrateDB adapter endpoint):
level=warn ts=2018-02-06T07:54:43.342766278Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
My question: 2) Grafana 3) CrateDB
[Results after recovering] 2) Grafana 3) CrateDB
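As a side note for anyone debugging this: samples that fail or are dropped by the remote-write queue show up in Prometheus's own metrics, so an alert can be attached to them. A minimal rules.yml sketch, assuming Prometheus scrapes its own /metrics endpoint and assuming the 2.0-era counter name prometheus_remote_storage_failed_samples_total (later releases renamed these counters):

groups:
- name: remote_storage
  rules:
  - alert: RemoteWriteFailing
    expr: rate(prometheus_remote_storage_failed_samples_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Prometheus {{ $labels.instance }} is failing to send samples to remote storage'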
This is expected behaviour; the retry behaviour is only meant to handle brief blips, not extended outages.
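To put "brief blips" into rough numbers: the in-memory buffer holds roughly shards × per-shard capacity samples, so the outage it can absorb is about (shards × capacity) / ingest rate. With illustrative figures of 10 shards, 10,000 samples per shard, and 20,000 samples/s ingested, that is only about five seconds of coverage; anything longer ends in dropped samples.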
@brian-brazil thanks for the quick reply! However, I have a question: between killing and recovering the adapter, querying an instant vector from the console should not need to contact the remote storage, which I find a little curious. Instead, Prometheus should first query the local TSDB, and at that point the local TSDB was working fine.
That's #2573
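For anyone hitting the same query failures: #2573 tracks queries erroring out when a configured remote read endpoint is unreachable. A minimal remote_read sketch, assuming the CrateDB adapter also serves a /read endpoint and that the running version supports the read_recent flag (both are assumptions, not taken from this issue):

remote_read:
- url: http://10.37.149.76:9268/read
  # With read_recent left at false, Prometheus avoids the remote endpoint for
  # time ranges the local TSDB already covers, so recent instant queries can be
  # answered from local data even while the adapter is down.
  read_recent: false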
brian-brazil closed this Feb 6, 2018
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


tangyong commented Feb 5, 2018
What did you do?
We are monitoring with Prometheus using Grafana, Consul, and remote storage. After we killed the CrateDB adapter, we found that once it was recovered, the data from the outage period was lost on the remote storage (that is to say, the data from that period was not resent to the remote storage).
Grafana <- Prometheus -> CrateDB adapter -> CrateDB, and Prometheus uses Consul service discovery.
What did you expect to see?
We expected the data from the outage period to be resent to the remote storage.
What did you see instead? Under which circumstances?
The data from the outage period was not resent to the remote storage.
Environment
System information:
Linux 2.6.32-279.el6.x86_64 x86_64
Prometheus version:
Version: 2.0.0
Revision: 0a74f98
Branch: HEAD
BuildUser: root@615b82cb36b6
BuildDate: 20171108-07:11:59
GoVersion: go1.9.2
Alertmanager version:
version: 0.12.0
Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    prometheusName: NJXZ-P1
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - xx.yy.zz.kk:9093
    scheme: http
    timeout: 10s
rule_files:
- /opt/prometheus-2.0.0.linux-amd64/rules.yml
scrape_configs:
- job_name: consul_sd_configs
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: xx.yy.zz.kk:9996
    tag_separator: ','
    scheme: http
    services:
    - promether-exporter
  relabel_configs:
  - source_labels: [__meta_consul_service]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_node]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: appId
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ldc
    replacement: $2
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: env
    replacement: $3
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: ip
    replacement: $4
    action: replace
  - source_labels: [__meta_consul_tags]
    separator: ;
    regex: ',(.*),(.*),(.*),(.*),(.*),'
    target_label: software
    replacement: $5
    action: replace
- job_name: CTDSA_NJXZ_SIT xx.yy.zz.kk_JBOSSSERVER
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9100
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: JBOSSSERVER
- job_name: CTDSA_NJXZ_SIT xx.yy.zz.kk_pushgateway
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - xx.yy.zz.kk:9091
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway
  - targets: []
    labels:
      appId: CTDSA
      env: SIT
      ip: xx.yy.zz.kk
      ldc: NJXZ
      software: pushgateway
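(For completeness: the configuration as posted does not show the remote_write / remote_read section that points at the CrateDB adapter. A minimal sketch of what it presumably looks like, with the /write URL taken from the error logs below and the /read path assumed:)

remote_write:
- url: http://10.37.149.76:9268/write
remote_read:
- url: http://10.37.149.76:9268/read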
I feel the Alertmanager configuration is not relevant to this issue, so please allow me to omit it.
level=warn ts=2018-02-02T01:47:25.921298237Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=4 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"
level=warn ts=2018-02-02T01:47:26.022207822Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=4 err="Post http://10.37.149.76:9268/write: dial tcp 10.37.149.76:9268: getsockopt: connection refused"