Metric exports failed while prometheus is up #4320

Open
oded-dd opened this Issue Jun 28, 2018 · 4 comments

oded-dd commented Jun 28, 2018

Bug Report

We are experiencing metric data loss with Prometheus. It seems that after a couple of days Prometheus stops scraping from our exporters. The service itself does not fail, and the exporters are still reachable (verified by running curl http://<remote_host>:<port>/metrics, which returns data), but no new data shows up in Prometheus.

We are running on Azure and use both static_configs and azure_sd_configs for scraping.
Prometheus runs from binaries on a virtual machine, not in Docker.
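
For context, a couple of PromQL checks that can be run against the built-in up metric while the problem is active (standard queries, nothing specific to our setup):

# Targets that Prometheus currently considers down
up == 0

# Seconds since the last sample Prometheus ingested for each target
time() - timestamp(up)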

  • System information:

    Linux 4.13.0-1011-azure x86_64

  • Prometheus version:

    prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c)
    build user: root@149e5b3f0829
    build date: 20180314-14:15:45
    go version: go1.10

  • Alertmanager version:

    alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)
    build user: root@37b6a49ebba9
    build date: 20180213-08:16:42
    go version: go1.9.2

  • Prometheus configuration file:
global:
  scrape_interval: "15s"

  external_labels:
    monitor: "alertmanager"

rule_files:
  - alerts/*.rules
  - recordings/*.rules

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kafka-brokers'
    scrape_interval: 30s

    static_configs:
      - targets: ['prod-kafka-001:7071', 'prod-kafka-002:7071', 'prod-kafka-003:7071', 'prod-kafka-004:7071', 'prod-kafka-005:7071', 'prod-kafka-006:7071']

  - job_name: 'redis'
    scrape_interval: 30s

    static_configs:
      - targets: ['redis1-prod:9121']
        labels:
          name: redis1-prod

  - job_name: 'kafka-offsets'
    scrape_interval: 30s

    static_configs:
      - targets: ['prod-kafka-001:9308', 'prod-kafka-002:9308', 'prod-kafka-003:9308', 'prod-kafka-004:9308', 'prod-kafka-005:9308', 'prod-kafka-006:9308']

  - job_name: 'nodes'
    scrape_interval: 30s

    azure_sd_configs:
    - subscription_id: ""
      tenant_id: ""
      client_id: ""
      client_secret: ""

      refresh_interval: 120s
      port: 9100

    relabel_configs:
    - action: "keep"
      regex: "prod"
      source_labels: ["__meta_azure_machine_tag_env"]
    - source_labels: ["__meta_azure_machine_name"]
      target_label: "name"
    - source_labels: ["__meta_azure_machine_tag_role"]
      target_label: "role"

  - job_name: 'traffic-stream'
    scrape_interval: 30s

    static_configs:
      - targets: ['generalservices2-prod:5001']
        labels:
          name: generalservices2-prod

  - job_name: 'elasticsearch-nodes'
    scrape_interval: 30s

    azure_sd_configs:
    - subscription_id: ""
      tenant_id: ""
      client_id: ""
      client_secret: ""

      refresh_interval: 120s
      port: 9108

    relabel_configs:
    - action: "keep"
      regex: "prod"
      source_labels: ["__meta_azure_machine_tag_env"]
    - action: "keep"
      regex: "elasticsearch-(.+)"
      source_labels: ["__meta_azure_machine_tag_role"]
    - source_labels: ["__meta_azure_machine_name"]
      target_label: "name"
    - source_labels: ["__meta_azure_machine_tag_role"]
      target_label: "role"
    - source_labels: ["__meta_azure_machine_tag_cluster"]
      target_label: "cluster"
    - source_labels: ["__meta_azure_machine_tag_type"]
      target_label: "cluster_type"

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
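
For completeness, the file above can be validated with promtool, and Azure service discovery health can be watched from Prometheus's own metrics (the SD metric name below is our understanding for this Prometheus version, so treat it as an assumption):

# Validate the configuration file before reloading (adjust the path as needed)
promtool check config prometheus.yml

# PromQL: Azure SD refresh failures; a rising counter would point at discovery rather than scraping
rate(prometheus_sd_azure_refresh_failures_total[5m])
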
  • Alertmanager configuration file:
global:
  resolve_timeout: 5m

  # The smarthost and SMTP sender used for mail notifications
  smtp_smarthost: 'smtp-relay.gmail.com:25'
  smtp_from: 'operations@coralogix.com'

  # The API URL to use for Slack notifications
  slack_api_url: ''

templates:
 - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'role']
  receiver: 'incidents'
  group_interval: 5m

  routes:
  - match:
      action: 'pagerduty'
    group_wait: 10s
    receiver: 'pagerduty'

receivers:
- name: 'incidents'
  email_configs:
  - send_resolved: true
    to:  'monitoring@coralogix.com'
  slack_configs:
  - send_resolved: true
    channel: '#incidents'
    title: '{{ template "custom_title" . }}'
    text: '{{ template "custom_slack_message" . }}'

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: ''
  • Logs:

brian-brazil commented Jun 28, 2018

Can you share a screenshot of the target status page?

oded-dd commented Jun 28, 2018

I've restarted the server, so it might not help, but see the attached screenshot.

[screenshot of the targets page]

brian-brazil commented Jun 28, 2018

To debug an issue like this we need information from the broken process, as there's little left lying around after a restart. If it happens again let us know.

The expanded view on the targets page is what's of interest, in particular when the last scrape was.
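
If a screenshot is awkward to grab, roughly the same information (lastScrape, lastError and health per target) can be pulled from the HTTP API while the process is still in its broken state; the commands below assume the default listen address on port 9090, and the output file names are just placeholders:

# Per-target status as Prometheus sees it, including when each target was last scraped
curl -s http://localhost:9090/api/v1/targets > targets.json

# Goroutine dump from the running process, useful if scrape loops appear stuck
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=1' > goroutines.txt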

oded-dd commented Jun 28, 2018

I will share it with you once it happens again. Thank you.
