
Prometheus 1.2.0 suddenly stops scraping targets #2068

Closed
svenmueller opened this Issue Oct 9, 2016 · 22 comments

@svenmueller

svenmueller commented Oct 9, 2016

What did you do?
Since upgrading from version 1.1.3 to version 1.2.0, Prometheus stops scraping all targets (node, prometheus) after some time. After restarting Prometheus, it works properly for some time, but then it stops scraping all targets again.

What did you expect to see?
All targets should be in state "UP" and the last scrape should have been less than 5 seconds ago (for node targets).

What did you see instead? Under which circumstances?
Targets show "UNKNOWN" instead of "UP".

Environment

  • System information:

Linux 3.13.0-95-generic x86_64

  • Prometheus version:

prometheus, version 1.2.0 (branch: master, revision: 522c933)
build user: root@c8088ddaf2a8
build date: 20161007-12:53:55
go version: go1.6.3

  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      hostname: 'my-hostname'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "/etc/prometheus/rules/*.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']


  # Scrape the Node Exporter every 5 seconds.
  - job_name: 'node'
    scrape_interval: 5s
    file_sd_configs:
      - files:
        - /etc/prometheus/targets/*.yaml

    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):9100
        replacement: ${1}
        target_label: instance
      - source_labels: [instance]
        regex: .*\.([\w,-]+)\.ct-app\.com
        replacement: ${1}
        target_label: region
      - source_labels: [instance]
        regex: (\w+).*
        replacement: ${1}
        target_label: customer
      - source_labels: [instance]
        regex: .*-(dev|stage|prod)-.*
        replacement: ${1}
        target_label: environment
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for an HTTP 200 response.
    static_configs:
      - targets: # Target to probe
        - www.domain.de
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: ${1}
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: blackbox:9115  # Blackbox exporter.
@grobie

Member

grobie commented Oct 9, 2016

Anything in the log files?


@brian-brazil

Member

brian-brazil commented Oct 9, 2016

Can you grab a goroutine dump when this happens?

http://localhost:9090/debug/pprof/goroutine?debug=2
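
If it helps, here is a minimal sketch for saving that dump to a file (curl or a browser against the same URL works just as well; the output filename is arbitrary):

package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    // Fetch the goroutine dump from the pprof endpoint above and write it to a file.
    resp, err := http.Get("http://localhost:9090/debug/pprof/goroutine?debug=2")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    f, err := os.Create("goroutine-dump.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    if _, err := io.Copy(f, resp.Body); err != nil {
        panic(err)
    }
}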

@svenmueller

Author

svenmueller commented Oct 9, 2016

Hi,

The Prometheus instance just stopped scraping all targets again. Here are the requested resources:

Full log: https://gist.github.com/svenmueller/f96bece4f7852e6d5e20be87d858b552
Goroutine dump: https://gist.github.com/svenmueller/223e903ac7703354364d9cef824f6541

--Sven

@juliusv

Member

juliusv commented Oct 9, 2016

First of all, I see multiple problems in your log file.

Most relevantly, however, it looks like your storage is sometimes being throttled:

https://gist.github.com/svenmueller/f96bece4f7852e6d5e20be87d858b552#file-gistfile1-txt-L11

This means that targets will not get their samples stored anymore (at least intermittently), as the storage applies backpressure. If it's a new target, that will also cause it to stay in UNKNOWN state.

Looking at this part of the goroutine dump, the fact that it just says [select] and not [select, xxx minutes] means that Prometheus had just logged about being throttled, so that seems to have been the problem at the time you took the stack trace:

https://gist.github.com/svenmueller/223e903ac7703354364d9cef824f6541#file-gistfile1-txt-L293-L297

You can also see that scrapers for your targets are in principle running:

https://gist.github.com/svenmueller/223e903ac7703354364d9cef824f6541#file-gistfile1-txt-L407-L429

But because the storage cannot keep up (probably disk IO problems?), it would hit this branch and not count the targets as scraped at all:

https://github.com/prometheus/prometheus/blob/master/retrieval/scrape.go#L430
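
To make that concrete, here is a minimal sketch of the described behaviour (not the actual Prometheus scrape code; Storage, NeedsThrottling, runScrapes and the skippedScrapes counter are illustrative stand-ins for the real storage interface and the prometheus_target_skipped_scrapes_total metric): when the storage reports that it needs throttling, the scrape cycle is skipped entirely and only the skip counter is incremented, so the target is never recorded as scraped.

package main

import (
    "fmt"
    "time"
)

// Storage stands in for the local storage; NeedsThrottling is the
// (simplified) backpressure signal described above.
type Storage interface {
    NeedsThrottling() bool
}

type fakeStorage struct{ throttled bool }

func (s *fakeStorage) NeedsThrottling() bool { return s.throttled }

// skippedScrapes plays the role of prometheus_target_skipped_scrapes_total.
var skippedScrapes int

// runScrapes performs n scrape cycles; any cycle during which the storage
// reports that it needs throttling is skipped, so the target is never
// recorded as having been scraped.
func runScrapes(st Storage, n int, interval time.Duration, scrape func() error) {
    for i := 0; i < n; i++ {
        if st.NeedsThrottling() {
            skippedScrapes++
            time.Sleep(interval)
            continue
        }
        if err := scrape(); err != nil {
            fmt.Println("scrape failed:", err)
        }
        time.Sleep(interval)
    }
}

func main() {
    st := &fakeStorage{throttled: true}
    runScrapes(st, 3, 10*time.Millisecond, func() error { return nil })
    fmt.Println("skipped scrapes:", skippedScrapes) // prints: skipped scrapes: 3
}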

Are you seeing the prometheus_target_skipped_scrapes_total counter metric going up?

I'm not sure if 1.2.0 changed relevant storage stuff, or whether it just so happens that something caused your storage to get overloaded around the same time...

@marcbradshaw


marcbradshaw commented Oct 10, 2016

I am seeing similar issues. I see a bunch of these errors in the log:

@4000000057fb08421a02beac time="2016-10-09T23:17:12-04:00" level=error msg="Error refreshing service list: Unexpected response code: 500 (rpc error: rpc error: failed to get conn: rpc error: lead thread didn't get connection)" source="consul.go:127"
@4000000057fb08421a09ac1c time="2016-10-09T23:17:12-04:00" level=error msg="Error refreshing service prom_node_exporter: Unexpected response code: 500 (rpc error: rpc error: failed to get conn: rpc error: lead thread didn't get connection)" source="consul.go:218"

followed by

@4000000057fb087b3a033fec time="2016-10-09T23:18:09-04:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=402607 maxChunksToPersist=524288 maxMemoryChunks=1048576 memoryChunks=1132516 source="storage.go:1587" urgencyScore=0.8005142211914062

@svenmueller

Author

svenmueller commented Oct 10, 2016

Hi,

After switching back to version 1.1.3, Prometheus scrapes the targets properly again.

@beorn7

Member

beorn7 commented Oct 10, 2016

I think I found the issue.

PR imminent.

@klausenbusk


klausenbusk commented Oct 10, 2016

I think I have the same issue.
Prometheus 1.2.0 hasn't scraped anything for 8 hours after throttling:

Oct 10 06:32:04 foo docker[24097]: time="2016-10-10T06:32:04Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=27008 maxChunksToPersist=524288 maxToleratedMemChunks=165000 memoryChunks=165042 source="storage.go:888"

Metrics from Prometheus: http://sprunge.us/ZVLI
Log file: http://sprunge.us/iMOM (sorry for all the JSON errors, I need to fix that).
prometheus.yml

scrape_configs:
  - job_name: "prometheus"

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'spconnect'
    scrape_interval: 30s
    file_sd_configs:
      - files: ["/etc/prometheus/foo.json"]

  - job_name: "coreos"
    file_sd_configs:
      - files: ["/etc/prometheus/node_exporter.json"]

Prometheus is started as:

/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/var/lib/prometheus -web.console.libraries=/etc/prometheus/console_libraries -web.console.templates=/etc/prometheus/consoles -storage.local.memory-chunks 150000

Please say if you need more info.
Prometheus is running in a Debian Jessie Docker (1.10.3) container on CoreOS 1122.2.0 stable on a DigitalOcean 2 GB RAM node.
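
As a side note, the numbers in the throttling line above look consistent with the startup flags: maxToleratedMemChunks=165000 appears to be 1.1 × the configured -storage.local.memory-chunks value (1.1 × 150000 = 165000), and memoryChunks=165042 is just above that tolerance, which would explain why the "Storage needs throttling" message fires.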

@klausenbusk


klausenbusk commented Oct 10, 2016

Goroutine dump: http://sprunge.us/YGbY

@beorn7

Member

beorn7 commented Oct 10, 2016

I'm working on releasing 1.2.1.

@beorn7

Member

beorn7 commented Oct 10, 2016

Release is there: https://github.com/prometheus/prometheus/releases/tag/v1.2.1

Binaries are being built as I'm speaking…

@beorn7 beorn7 closed this Oct 10, 2016

@commarla


commarla commented Oct 10, 2016

Thanks a lot, I encountered the same issue twice this weekend.

@raypettersen


raypettersen commented Oct 13, 2016

Same here. Our dev-prometheus stopped scraping twice in a short time. Thanks for the fix!

@metral


metral commented Oct 29, 2016

I'm still seeing this issue on v1.2.1. I'm using quay.io/prometheus/prometheus:v1.2.1 on k8s v1.4.0.

@beorn7

Member

beorn7 commented Oct 30, 2016

@metral that must be a different issue then. Could you file a new issue using the template and provide your diagnostics so that we have a chance to find out what's going on?

@sirhopcount


sirhopcount commented Nov 18, 2016

We also see this issue on version 1.3.0. Prometheus stops scraping several (but not all) endpoints.

The endpoints that are no longer being scraped all stopped at the same time. We verified that Prometheus can reach the endpoints, and we couldn't find any relevant errors in the logs.

Version info:

Version    1.3.0
Revision   18254a172b1e981ed593442b2259bd63617d6aca
Branch     master
BuildUser  root@d363f050a0e0
BuildDate  20161101-17:06:27
GoVersion  go1.7.3

@fabxc

Member

fabxc commented Nov 18, 2016

What service discovery are you using? Any configuration reloads in-between?
Configuration file and logs would be ideal.

@strzelecki-maciek


strzelecki-maciek commented Nov 28, 2016

Sorry, my bad! Removed!

@beorn7

Member

beorn7 commented Nov 28, 2016

@strzelecki-maciek You are describing a different issue. Posting it as a follow-up to a different and long fixed issue will not raise any attention. Please file a fresh issue using the template to make sure you are providing the information we need to investigate. Thank you.

@gvenka008c


gvenka008c commented Jun 8, 2017

Has anyone here seen the error below? We are seeing a similar issue where Prometheus stops scraping after a certain period of time (say 24 hrs).

goroutine 5907380 [IO wait]:
net.runtime_pollWait(0x7f8d0c6b0a18, 0x72, 0x21)
	/usr/local/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc468d1c378, 0x72, 0x262a980, 0x261f5d0)
	/usr/local/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc468d1c378, 0xc4657204d1, 0x1)
	/usr/local/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc468d1c310, 0xc4657204d1, 0x1, 0x1, 0x0, 0x262a980, 0x261f5d0)
	/usr/local/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42e020b38, 0xc4657204d1, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:181 +0x70
github.com/prometheus/prometheus/vendor/golang.org/x/net/netutil.(*limitListenerConn).Read(0xc442fa9680, 0xc4657204d1, 0x1, 0x1, 0xc464304078, 0xc4614b6fd0, 0x4d38e7)
	<autogenerated>:6 +0x6b
net/http.(*connReader).backgroundRead(0xc4657204c0)
	/usr/local/go/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/go/src/net/http/server.go:652 +0xdf

@brian-brazil brian-brazil added kind/bug and removed kind/bug labels Jul 14, 2017

@lock


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
