
Upgrade from 2.4.3 to 2.5.0 fails #4839

Closed
davidkarlsen opened this Issue Nov 8, 2018 · 3 comments

davidkarlsen commented Nov 8, 2018

Bug Report

What did you do?
Upgrade from 2.4.3 to 2.5.0

What did you expect to see?
Startup OK as before.

What did you see instead? Under which circumstances?
Prometheus answers with HTTP 500.

Environment
Docker, using your image from Docker Hub.

  • System information:

uname -srm
Linux 3.10.0-862.14.4.el7.x86_64 x86_64

  • Prometheus version:

2.5.0

  • Prometheus configuration file:
# Managed by salt /platforms/ccm/prometheus/files/prometheus.yml.jinja

# my global config
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  scrape_timeout: 10s
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'CCM'
    finodsenv: preprod-global

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - '/etc/prometheus/dynarules/*.yml'

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: /prometheus/metrics
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'dynaconf'
    metrics_path: '/finods/metrics'
    file_sd_configs:
     - files: [ '/etc/prometheus/dynaconf/*.yml', '/etc/prometheus/dynaconf-pci/*.yml' ]
  - job_name: 'consul'
    consul_sd_configs:
      - server: '10.246.89.52:8500'
        datacenter: 'global'
    metric_relabel_configs:
      - source_labels: [container_label_container_group]
        target_label: container_group
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*prom_monitored.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job 
      - source_labels: [__meta_consul_tags]
        regex: .*,alias-([^,]+),.*
        replacement: '${1}'
        target_label: alias
      - source_labels: [__meta_consul_tags]
        regex: .*,metrics_path=([^,]+),.*
        replacement: '${1}'
        target_label: __metrics_path__
      - source_labels: [__meta_consul_tags]
        regex: .*,finodsgroup=([^,]+),.*
        replacement: '${1}'
        target_label: finodsgroup
      - source_labels: [__meta_consul_tags]
        regex: .*,container_group=([^,]+),.*
        replacement: '${1}'
        target_label: container_group


From the log, note the undefined function rate.

The rule it barfs at is:

less alert_gc.rules.yml 
groups:
- name: GC-rules
  rules:
  - alert: highMarkSweepVsScavenge
    expr: (rate(jvm_gc_collection_seconds_count{gc="PS MarkSweep"}[5m]) > IGNORING(gc)
      rate(jvm_gc_collection_seconds_count{gc="PS Scavenge"}[5m]) and rate(jvm_gc_collection_seconds_count{gc="PS
      Scavenge"}[5m]) > 0.01) or (rate(jvm_gc_collection_seconds_count{gc="MarkSweepCompact"}[5m])
      > IGNORING(gc) rate(jvm_gc_collection_seconds_count{gc="Copy"}[5m]) and rate(jvm_gc_collection_seconds_count{gc="Copy"}[5m])
      > 0.01)
    for: 5m
    annotations:
      description: 'http://{{ $labels.instance }} has 5m rate: {{ rate(jvm_gc_collection_seconds_count[5m])
        }} for the last 5m'
      summary: High GC MarkSweep activity vs Copy/Scavenge on instance http://{{ $labels.instance
        }} - this indicates low memory condition and/or a memory-leak
  - alert: highGcPercentage
    expr: ((
      sum(rate(jvm_gc_collection_seconds_sum{job="$jobs",instance=~"$instances",finodsgroup="$finodsgroup"}[5m])) without(gc)
      /
      rate(process_cpu_seconds_total{job="$jobs",instance=~"$instances",finodsgroup="$finodsgroup"}[5m])
      )*100) > 7
    for: 5m
    annotations:
      description: High GC percentage on http://{{ $labels.instance }}
      summary: High GC percentage - See https://fswiki.evry.com/display/architecture/Analyzing+a+malfunctioning+application and possibly increase heap
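For reference, a minimal sketch of a fixed description annotation for the first rule: rate() is a PromQL function and is not available inside the Go template, so the inline value has to be obtained through the template's query pipeline instead (or simply dropped). printf, query, first, value and humanize are standard Prometheus template functions; the exact expression is illustrative and assumes the intent was the overall 5m GC rate for the alerting instance:

    annotations:
      # rate() cannot be called from a template; run a PromQL query via the
      # query template function and extract the first sample's value instead
      description: 'http://{{ $labels.instance }} has 5m rate: {{ with printf "rate(jvm_gc_collection_seconds_count{instance=%q}[5m])" $labels.instance | query }}{{ . | first | value | humanize }}{{ end }} for the last 5m'
      summary: High GC MarkSweep activity vs Copy/Scavenge on instance http://{{ $labels.instance }} - this indicates low memory condition and/or a memory-leak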

simonpasquier (Member) commented Nov 8, 2018

You've got a rule that is invalid:

Nov  8 13:39:27 alt-aot-g-fou01 docker/aa6b680d10ea[1366]: level=error ts=2018-11-08T12:39:27.413053751Z caller=manager.go:675 component="rule manager" msg="loading groups failed" err="group \"GC-rules\", rule 0, \"highMarkSweepVsScavenge\": msg=template: __alert_highMarkSweepVsScavenge:1: function \"rate\" not defined"

Since v2.5.0, this is a hard failure for Prometheus. It is noted in the changelog: "Rules: Error out at load time for invalid templates, rather than at evaluation time. #4537".
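
For what it's worth, this kind of error can be caught before an upgrade by running the files through promtool, which ships alongside the prometheus binary (paths below are the ones from the config pasted above):

promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/dynarules/*.yml

check config validates the configuration and the rule files it references; check rules can be pointed at rule files directly. Since the same rule parsing is used, this should surface the invalid template at check time rather than at server startup.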

davidkarlsen (Author) commented Nov 9, 2018

You're right - the failing rule was the cause.

roidelapluie (Contributor) commented Nov 9, 2018

Can we reopen this?

It was a problem for us too. That should be checked BEFORE loading the TSDB, so that it errors out nicely right away and not after 5 minutes.
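
Until that ordering changes, a possible workaround from the outside is a fail-fast wrapper, sketched here as a hypothetical Docker entrypoint (binary paths as in the official prom/prometheus image; flags reduced to the config file for brevity):

#!/bin/sh
# Validate the config and the rule files it references before the server
# (and therefore the TSDB) is started at all; abort on the first error.
promtool check config /etc/prometheus/prometheus.yml || exit 1
exec /bin/prometheus --config.file=/etc/prometheus/prometheus.yml "$@"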
