
Scrapes time out with no error in logs and no apparent problem with scrape endpoint #5383

Open
chrisfw opened this Issue Mar 18, 2019 · 2 comments


chrisfw commented Mar 18, 2019

Bug Report

What did you do?
Configured Prometheus to scrape the primary target at 15s intervals.

What did you expect to see?

No gaps in metrics loaded via scrape

What did you see instead? Under which circumstances?

On two separate occasions, after Prometheus had been running for a while, I noticed gaps in the samples, and the scrape_duration_seconds metric indicates that scrapes were timing out at the 10s default. The first time it occurred, roughly every other scrape was affected. The machine running Prometheus was under some load at the time, but I confirmed with curl that the scrape endpoint was responding and that its response time was consistently well under 10 seconds, so the endpoint does not appear to be the cause. Neither the Prometheus log nor the scrape endpoint's log reported any errors. On both occasions the only viable remedy seemed to be restarting Prometheus, so after the second occurrence I thought a bug report was in order.
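For context, the checks were along these lines (a sketch based on the job definitions below, not a transcript of the exact commands I ran):

# PromQL: spot scrapes hitting the 10s default timeout, and targets reported down
scrape_duration_seconds{job="appserver-pcp"} >= 10
up{job="appserver-pcp"} == 0

# Shell: time the scrape endpoint directly, bypassing Prometheus
# (Prometheus expands the list-valued 'target' param into repeated URL parameters)
curl -s -o /dev/null -w '%{time_total}\n' \
  'http://127.0.0.1:44323/pmapi/1/metrics?target=filesys&target=kernel&target=mem&target=disk&target=network&target=proc'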

Environment

  • System information:

    Red Hat Enterprise Linux Server release 7.6 (Maipo)
    Linux 4.14.35-1844.1.3.el7uek.x86_64 x86_64

  • Prometheus version:
    prometheus, version 2.7.1 (branch: HEAD, revision: 62e591f)
    build user: root@f9f82868fc43
    build date: 20190131-11:16:59
    go version: go1.11.5

  • Prometheus configuration file:

# my global config
global:

  external_labels:
    origin_prometheus: irv_prometheus

  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'appserver-pcp'
    metrics_path: '/pmapi/1/metrics'
    params:
       target: ['filesys', 'kernel', 'mem', 'disk', 'network', 'proc']
    scrape_interval: 10s
    static_configs:
      - targets: ['127.0.0.1:44323']


  - job_name: 'appserver-pcp-hotproc'
    metrics_path: '/pmapi/1/metrics'
    params:
       target: ['hotproc']
    scrape_interval: 120s
    static_configs:
      - targets: ['127.0.0.1:44323']

scrape_timeout is left at the global default of 10s for all jobs.
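For reference, overriding that default would look roughly like the following (the 14s value is purely illustrative and not part of my actual config; scrape_timeout must not exceed the corresponding scrape_interval):

global:
  scrape_interval: 15s
  scrape_timeout:  14s  # illustrative value only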

simonpasquier commented Mar 20, 2019

We would need profile data to troubleshoot this issue. Next time it happens, you need to run the promtool debug all ... command and attach the output here.
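For completeness, the invocation is roughly the following, assuming the Prometheus server is listening on its default address; it collects pprof profiles and runtime metrics from the running server into a debug.tar.gz archive:

# run against the affected Prometheus instance while the problem is occurring
promtool debug all http://localhost:9090/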

chrisfw commented Mar 23, 2019

Thank you. Will do.
