Throttled ingestion sets the up metric to 0 #2117

Closed
jjneely opened this Issue Oct 25, 2016 · 7 comments

jjneely commented Oct 25, 2016

What did you do?

For various reasons, one of the Prometheus instances that scrapes the node exporter for a region hit throttled ingestion mode. When a scrape is skipped, the resulting up metric appears to be set to 0.

What did you expect to see?

I expected that the up metric would not be updated and that Prometheus would not fire false alerts. Brian appears to agree:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/prometheus-users/hXlvClKp_8w/m9xxcK-8AAAJ

What did you see instead? Under which circumstances?

What happened was that this Prometheus instance began issuing alerts that hundreds of machines were down.

Subject: [FIRING:199] HostDown (ALLOCATED node node page)
Subject: [FIRING:239] HostDown (ALLOCATED node node page)
ALERT HostDown
  IF up{cmdb_status="ALLOCATED", job="node"} == 0
  FOR 3m
  LABELS {
    severity="page",
  }
  ANNOTATIONS {
    summary = "Host {{ $labels.instance }} down",
    description = "Host {{ $labels.instance }} node exporter unresponsive for 3 minutes",
    runbook = "Missing",
  }

Environment

  • System information:

    Linux 3.13.0-85-generic x86_64

  • Prometheus version:

    prometheus, version 1.2.1 (branch: XXX, revision: 1.2.1-1+JENKINS~trusty.20161012205023)
    build user: jjneely@42lines.net
    build date: 2016-10-12T20:50:55Z
    go version: go1.7.1

brian-brazil added this to the 2.x milestone Oct 25, 2016

brian-brazil commented Oct 25, 2016

This would arguably be a breaking change, so goes with 2.0.

jjneely commented Oct 25, 2016

The 2.x milestone implies that there are cases where folks depend on a job's up metric to transition to zero to notify them of performance problems with the Prometheus instance itself.

Is that the case?

brian-brazil commented Oct 25, 2016

It's a semantic change, so it's breaking even if no one is depending on it.

I would hope no one is monitoring Prometheus that way; meta-monitoring is the way to go here.
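
As a rough illustration of that meta-monitoring approach (a sketch only, not from this thread; it assumes a second Prometheus server scrapes this one under a hypothetical job="prometheus" scrape config, written in the same 1.x rule syntax as the HostDown rule above):

# Illustrative meta-monitoring rule, evaluated on a *separate* Prometheus
# server that scrapes this one; job="prometheus" is an assumed scrape config.
ALERT PrometheusDown
  IF up{job="prometheus"} == 0
  FOR 3m
  LABELS {
    severity="page",
  }
  ANNOTATIONS {
    summary = "Prometheus {{ $labels.instance }} down",
    description = "Prometheus server {{ $labels.instance }} has been unreachable for 3 minutes",
  }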

jjneely commented Oct 25, 2016

For completeness, I've done the following to attempt to work around this issue in my alerting:

up{job="node"} == 0 and on() (sum(rate(prometheus_target_skipped_scrapes_total[5m])) or vector(0)) == 0

beorn7 commented Oct 25, 2016

Just as a side note: If your Prometheus server regularly throttles ingestion, you have a big problem. Throttling ingestion is a last resort of the server to keep itself alive. It severely impedes your ability to monitor. It must be rare, and some alert should definitely wake somebody up if it ever happens.
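
A hedged sketch of what such an alert might look like, built on the prometheus_target_skipped_scrapes_total counter used in the workaround above (the FOR duration is an arbitrary choice, not from this thread):

# Illustrative only: page if this Prometheus server keeps skipping scrapes.
ALERT IngestionThrottled
  IF sum(rate(prometheus_target_skipped_scrapes_total[5m])) > 0
  FOR 5m
  LABELS {
    severity="page",
  }
  ANNOTATIONS {
    summary = "Prometheus is skipping scrapes",
    description = "This Prometheus server has been throttling ingestion and skipping scrapes for at least 5 minutes",
  }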

brian-brazil commented May 15, 2017

This is no longer possible in 2.0, as there's no throttling.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
