
Prometheus remote read from InfluxDB becomes invalid after a period of time #4739

Open
cxhuawei opened this Issue Oct 15, 2018 · 20 comments


cxhuawei commented Oct 15, 2018


Bug Report

What did you do?
Two Prometheus servers write data to InfluxDB. Another Prometheus server reads data from InfluxDB via InfluxDB's API, and Grafana generates charts from it.

What did you expect to see?
Prometheus can keep reading data from InfluxDB.

What did you see instead? Under which circumstances?
It worked properly at first, but after some hours it could no longer get newly added data. If you restart the reading Prometheus, it is OK again.
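
The write side isn't shown in this report. As a hypothetical sketch, a write-side prometheus.yml using InfluxDB's native Prometheus endpoint (host and database assumed from the remote_read URL in the configuration below) would contain something like:

remote_write:
  - url: "http://172.16.68.224:8086/api/v1/prom/write?db=prometheus"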

Environment
CentOS 7

  • System information:
    Linux 3.10.0-693.2.2.el7.x86_64 x86_64

  • Prometheus version:
    2.4.3

  • Prometheus configuration file:

global:
  scrape_interval:     30s # Set the scrape interval to every 30 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["172.16.68.221:9093"]
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/alert_rules/*.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
#scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
#  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

#    static_configs:
#    - targets: ['localhost:9090']

#scrape_configs:
#    # Override the global default and scrape targets from this job every 5 seconds.
#  - job_name: 'pushgateway'
#    scrape_interval: 30s
#    static_configs:
#      - targets: ['172.16.68.221:9091']
#        labels:
#          group: 'pushgateway'
#  - job_name: 'ecs_group'
#    scrape_interval: 30s
#    file_sd_configs:
#    - refresh_interval: 1m
#      files:
#      - ./conf.d/*.json
#  - job_name: 'speech_gw'
#    metrics_path: '/debug/metrics'
#    scrape_interval: 1m
#    file_sd_configs:
#    - refresh_interval: 1m
#      files:
#      - ./conf.d/sc-speech-gw.yml
#

#remote_write:
#  - url: "http://influxdb_adapter:9201/write"
remote_read:
  - url: "http://172.16.68.224:8086/api/v1/prom/read?db=prometheus&epoch=ms&rp=autogen"

simonpasquier commented Oct 15, 2018

Please share the logs of the Prometheus server. Anything relevant in the InfluxDB logs?

cxhuawei commented Oct 15, 2018

@simonpasquier Hi Simon, there are no obvious error logs, and the Grafana chart is below:
[screenshot: Grafana chart]
As the picture shows, Prometheus cannot get the data after 20:00. If I restart it, the chart is OK again.

simonpasquier commented Oct 15, 2018

Try running with --log.level=debug. You can also take a look at the net_conntrack*{dialer_name="remote_storage"} metrics.

cxhuawei commented Oct 15, 2018

@simonpasquier
logs:
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T08:26:15.762344509Z caller=main.go:523 msg="Server is ready to receive web requests."
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=debug ts=2018-10-15T08:26:15.762769295Z caller=manager.go:183 component="discovery manager notify" msg="discoverer exited" provider=string/0
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T11:27:14.608373524Z caller=compact.go:398 component=tsdb msg="write block" mint=1539590400000 maxt=1539597600000 ulid=01CSVQNS4D6G07DPAJNW4VCJE4
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T11:27:14.613941143Z caller=head.go:446 component=tsdb msg="head GC completed" duration=1.73509ms

net_conntrack*{dialer_name="remote_storage"} returns no data

simonpasquier commented Oct 15, 2018

net_conntrack*{dialer_name="remote_storage"} returns no data

Try {__name__=~"net_conntrack.+",dialer_name="remote_storage"} instead.

cxhuawei commented Oct 15, 2018

cxhuawei commented Oct 15, 2018

Hi Simon, I checked the InfluxDB logs:

172.18.0.12,172.16.68.221 - - [15/Oct/2018:21:30:52 +0800] "POST /query?db=prometheus&epoch=ms&params=%7B%7D&q=SELECT+value+FROM+%22autogen%22.%2F%5Enet_conntrack.%2B%24%2F+WHERE+%22dialer_name%22+%3D+%27remote_storage%27+AND+time+%3E%3D+1539566700000ms+AND+time+%3C%3D+1539604800000ms+GROUP+BY+%2A HTTP/1.1" 200 4285 "-" "InfluxDBClient" 8950769b-d07e-11e8-ba06-000000000000 22427

I queried the data at 21:30:52, but Prometheus filtered for data before 20:00 (1539604800000). There are other similar logs; the last query time stops at 20:00... @simonpasquier
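
For readability, the URL-decoded InfluxQL query in that log line is:

SELECT value FROM "autogen"./^net_conntrack.+$/ WHERE "dialer_name" = 'remote_storage' AND time >= 1539566700000ms AND time <= 1539604800000ms GROUP BY *

The upper bound 1539604800000 ms is 2018-10-15 12:00 UTC, i.e. 20:00 in UTC+8, which matches where the chart stops.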

simonpasquier commented Oct 15, 2018

cxhuawei commented Oct 15, 2018

The parameter means to query data from remote storage each time, but I don't have local storage. What is the impact of this parameter? I am a bit confused. Thanks :)

simonpasquier commented Oct 15, 2018

Which flags do you use to start Prometheus?

cxhuawei commented Oct 15, 2018

@simonpasquier

  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--web.enable-lifecycle'
  - '--log.level=debug'

simonpasquier commented Oct 15, 2018

but I don't have local storage

There's always local storage.

cxhuawei commented Oct 15, 2018

You mean Prometheus gets data and caches it in memory? But when I refresh Grafana, I always see a read request to InfluxDB, and the time range used to fetch the data is incorrect, just like the log above. @simonpasquier

simonpasquier commented Oct 16, 2018

Can you confirm that you use the native InfluxDB remote read endpoint?
Can you check that all clocks are synchronized?
Have you tried setting read_recent to true?
When I say that there's always local storage, it means that Prometheus will always write the samples to its local storage even when remote write/read is used.
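
With the remote_read entry from the report, that would look like:

remote_read:
  - url: "http://172.16.68.224:8086/api/v1/prom/read?db=prometheus&epoch=ms&rp=autogen"
    read_recent: true

Per the Prometheus configuration docs, read_recent controls whether reads are also made for time ranges that the local storage should have complete data for; with the default of false, those ranges are served from local storage only.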

cxhuawei commented Oct 16, 2018

I set read_recent to true and the problem has been solved, thank you. But I am wondering why this problem does not occur when remote_read and remote_write are on the same machine. @simonpasquier

simonpasquier commented Oct 16, 2018

Can you check that all clocks are synchronized?

cxhuawei commented Oct 16, 2018

All clocks are synchronized, but they are in different time zones. One writing Prometheus is in the UTC time zone and the others are in the UTC+8 time zone. @simonpasquier

simonpasquier commented Oct 16, 2018

It shouldn't matter for Prometheus as all times are converted to UTC. I can't say for InfluxDB.

cxhuawei commented Oct 16, 2018

It shouldn't matter for InfluxDB, because writing to InfluxDB is totally OK. The error is that the time period for fetching data is incorrect when remote_read and remote_write are on different machines.

liuzhi1986 commented Jan 23, 2019

I have the same problem.
