An alert is not fired again after it has been resolved once #1372

Closed
alon-argus opened this Issue Feb 4, 2016 · 5 comments

alon-argus commented Feb 4, 2016

This looks like a critical bug in the new 0.17.0rc1 pre-release.

Reproduction steps:

  1. Make an alert fire (Alertmanager receives the alert)
  2. Resolve it so that it stops firing (the alert is cleared in Alertmanager)
  3. Make the same alert fire again

Expected result:
The alert should appear as 'firing' on the 'Alerts' page and be sent to Alertmanager.

Actual result:
The alert does not show as firing on the 'Alerts' page (it reports "(0 active)"), and Alertmanager does not receive it either.
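
For reference, the expected behaviour in the steps above can be sketched as a small state model. This is a simplified illustration, not Prometheus code: `expected_states`, its parameters, and the `State` names are hypothetical, and it assumes an alert becomes firing once its condition has held for at least the FOR duration.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


def expected_states(condition, interval_s, for_s):
    """Return the expected alert state at each rule evaluation.

    condition  -- list of booleans, one per evaluation (True = expression matches)
    interval_s -- seconds between evaluations
    for_s      -- the rule's FOR duration in seconds
    """
    states = []
    active_since = None  # time at which the condition last became true
    for i, matches in enumerate(condition):
        now = i * interval_s
        if not matches:
            # Condition cleared: the alert resolves and the FOR timer resets.
            active_since = None
            states.append(State.INACTIVE)
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_s:
            states.append(State.FIRING)
        else:
            states.append(State.PENDING)
    return states


# Fire, resolve, fire again: 60s on, 60s off, 60s on, evaluated every 5s.
samples = [True] * 12 + [False] * 12 + [True] * 12
states = expected_states(samples, interval_s=5, for_s=40)
print([s.value for s in states])
```

In this model the second occurrence goes through pending and reaches firing again, which is the behaviour step 3 expects and the one that is missing in 0.17.0rc1.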

Additional information:
My alert:

ALERT vehicleServer_response_err
  IF (sum(rate(server_requests_processing_time_millis_total{status!="200"}[15s])) BY (status)) > 0
  FOR 40s
  ANNOTATIONS {
    description="server returns non-200 status code, for more than 40 seconds (status: {{$labels.status}}).",
    summary="server returned non-200 status code"
  }

When the rule triggers again for a different status code, though, the alert does fire.

I can't see any errors in stdout.
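
A note on why a different status code behaves differently: the rule aggregates BY (status), so each status value yields its own alert instance with its own label set. A minimal sketch of that identity, where `alert_key` is a hypothetical helper and not a Prometheus function:

```python
def alert_key(name, labels):
    """Build a hashable identity for one alert instance from its name and labels."""
    return (name, tuple(sorted(labels.items())))


k500 = alert_key("vehicleServer_response_err", {"status": "500"})
k404 = alert_key("vehicleServer_response_err", {"status": "404"})

# The two status codes are tracked as separate instances of the same rule.
print(k500 == k404)
```

Under this reading, a previously resolved status="500" instance has no bearing on a fresh status="404" instance, which would match the observation that a new status code still fires.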

@brian-brazil brian-brazil added the bug label Feb 4, 2016

Member

grobie commented Feb 5, 2016

I can reproduce.

# prometheus.yml
global:
  scrape_interval:     5s
  evaluation_interval: 5s

rule_files:
  - 1372.rules

scrape_configs:
  - job_name: 'node_exporter'
    target_groups:
      - targets: ['localhost:9100']

# 1372.rules
ALERT vehicleServer_response_err
  IF foobar{status!="200"} > 0
  FOR 40s
  ANNOTATIONS {
    description="server returns non-200 status code, for more than 40 seconds (status: {{$labels.status}}).",
    summary="server returned non-200 status code"
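  }

# Start Prometheus and node_exporter (textfile collector reads from data/)
./prometheus &
./node_exporter --collector.textfile.directory=data/ &

# Flap the status="500" series between 0 and 100 every 60 seconds.
# printf is used so that \n is expanded regardless of the shell.
while true; do
  printf 'foobar{status="200"} 42\nfoobar{status="500"} 0\n' > data/1372.prom
  sleep 60
  printf 'foobar{status="200"} 2\nfoobar{status="500"} 100\n' > data/1372.prom
  sleep 60
done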

[Screenshots attached: 2016-02-04 22-14-18 and 2016-02-04 22-14-37]

@grobie grobie added the Critical label Feb 5, 2016

Member

grobie commented Feb 5, 2016

Note that this only affects flapping alerts with a flap cycle shorter than 15 minutes.
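
The thread does not say why 15 minutes is the threshold. One model that is consistent with it, offered purely as an assumption, is that a resolved alert instance is retained for about 15 minutes and that a new occurrence inside that window is shadowed by the stale entry. The toy sketch below illustrates that model only; `Tracker`, `RETENTION_S`, and `refires` are hypothetical names, the FOR duration is ignored for brevity, and none of this is taken from the Prometheus source.

```python
RETENTION_S = 15 * 60  # assumption: resolved entries are kept for 15 minutes


class Tracker:
    """Toy tracker for a single alert instance (not Prometheus code)."""

    def __init__(self, reset_on_refire):
        self.reset_on_refire = reset_on_refire
        self.resolved_at = None  # set while a resolved entry is retained
        self.active = False

    def evaluate(self, now, matches):
        # Drop a retained resolved entry once the retention window has passed.
        if self.resolved_at is not None and now - self.resolved_at > RETENTION_S:
            self.resolved_at = None
        if matches:
            if self.resolved_at is not None and not self.reset_on_refire:
                # Modeled bug: the stale resolved entry shadows the new occurrence.
                return self.active
            self.resolved_at = None
            self.active = True
        elif self.active:
            self.active = False
            self.resolved_at = now
        return self.active


def refires(reset_on_refire, gap_s):
    """Fire at t=0, resolve at t=60, then match again gap_s seconds later."""
    t = Tracker(reset_on_refire)
    t.evaluate(0, True)
    t.evaluate(60, False)
    return t.evaluate(60 + gap_s, True)


print(refires(reset_on_refire=False, gap_s=60))      # short flap cycle, modeled bug
print(refires(reset_on_refire=False, gap_s=16 * 60))  # gap longer than retention
print(refires(reset_on_refire=True, gap_s=60))        # entry reset on re-fire
```

In this model only re-fires inside the retention window are lost, which lines up with the "<15m" observation; whether the actual fix works this way is not stated in the thread.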


fabxc commented Feb 5, 2016

Thanks a lot for catching that one!

@fabxc fabxc closed this in #1373 Feb 5, 2016

Author

alon-argus commented Feb 5, 2016

Cool!


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
