Federated Prometheus server doesn't update on metric drop #3882

Closed
lucapalano opened this Issue Feb 23, 2018 · 3 comments


lucapalano commented Feb 23, 2018

What did you do?
I configured several Prometheus servers and one federated Prometheus server which scrapes from the others (host1, host2, host3), with the attached configuration prometheus.txt.

The federated server reports the following federated metrics in its graph:
[screenshot: prometheus-kafka-federated-metrics]

There are 3 metrics/lines. They were scraped from a Prometheus server (the 1st level Prometheus server) which is in charge of collecting all the metrics for customer "test". Each metric is reported by a separate agent and is constantly monitored in order to trigger an alert in case the metric drops (in other words, in case of service failure).
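
For reference, the kind of alerting rule I mean is sketched below. This is only an illustration; the metric and job names (fed_kafka_messages_total, agents) are placeholders, not the real ones from my setup:

# Hypothetical rules file loaded on the 1st level Prometheus server:
# fire when a federated metric stops being reported by its agent.
groups:
  - name: service-failure
    rules:
      - alert: FederatedMetricMissing
        # absent() returns 1 when no series matches, i.e. the agent stopped reporting
        expr: absent(fed_kafka_messages_total{job="agents"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Reporting agent has stopped sending fed_kafka_messages_total"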

So, I did some tests in order to simulate a service failure by stopping the reporting agent.

What did you expect to see?
I expected to see the metric drop on both Prometheus servers (the federated one and the 1st level Prometheus server) after stopping the agent.

What did you see instead? Under which circumstances?
I was able to see the metric drop only on the "1st level Prometheus server" (the one which collects metrics from the agents), as shown by the red line in this graph:
[screenshot: 1st-level-prometheus-server]
The problem was with the federated server: the interruption wasn't immediate:
[screenshot: prometheus-federated-metric-dropped]
I waited about 8 minutes and 4 seconds before seeing an interruption in the graph line. During those minutes, the line was stuck at 40 without fluctuation.

I wasn't able to find any tuning that fixes this behaviour, so this looks like a bug to me (IMHO).

Environment

  • System information:
    Linux 3.10.0-327.36.3.el7.x86_64 x86_64 (where the agent and the prometheus servers run)

  • Prometheus version:
    prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
    build user: root@615b82cb36b6
    build date: 20171108-07:11:59
    go version: go1.9.2

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      datacenter: 'global'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: datacenter_federation
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"fed.*"}'
    static_configs:
      - targets:
        - host1:9090
        - host2:9090
        - host3:9090

Thanks in advance for your replies and your help! :-)

Luca


brian-brazil commented Feb 23, 2018

This is the expected staleness behaviour for data with timestamps, such as samples coming from federation. Federation is not meant to be used this way, for this and other reasons; see https://www.robustperception.io/federation-what-is-it-good-for/
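
To make that concrete: the /federate endpoint exposes samples with their original timestamps, so the federating server ingests explicitly timestamped data and does not write staleness markers for it when the series disappears. Instant queries then keep returning the last value for up to the lookback window (5 minutes by default), which is why the line stays flat instead of breaking immediately. If you still want to notice quickly that a first-level server has stopped answering, a minimal sketch is to alert on the federation scrape itself (this reuses the datacenter_federation job name from the configuration above):

groups:
  - name: federation-health
    rules:
      - alert: FederationTargetDown
        # 'up' is recorded by the federating server itself for each scrape,
        # so it is not affected by the staleness behaviour of federated samples.
        expr: up{job="datacenter_federation"} == 0
        for: 2m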


lucapalano commented Feb 23, 2018

Thank you @brian-brazil, I see your point. As a general behaviour of a federated Prometheus node, I would prefer that it report no data at all when no data is pulled from the scraped node, instead of keeping the value stuck at 40 for about 8 minutes:
[screenshot: prometheus-federated-metric-dropped]

Luca


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
