
Prometheus 2.0 reload configuration issues #3707

Closed
Chevron94 opened this Issue Jan 19, 2018 · 17 comments

Chevron94 commented Jan 19, 2018

What did you do?
Prometheus sends samples to our remote storage. At one point the remote storage went down, and we saw this in the logs:
level=warn ts=2018-01-19T08:29:08.516631616Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=1 err="Post <storage_url>: getsockopt: no route to host"
If we try to reload the Prometheus configuration by sending a POST to <prometheus_url>/-/reload while the remote storage is down, the request never finishes. It looks like Prometheus keeps trying to send data to the storage over and over before it reloads the configuration.
What did you expect to see?
Prometheus should reload the configuration.
What did you see instead? Under which circumstances?
The configuration wasn't reloaded; there was no response to the POST <prometheus_url>/-/reload request.
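
For reference, a minimal sketch of triggering the reload with a client-side timeout so the hang is at least bounded; this is illustrative only, and localhost:9090 stands in for <prometheus_url>:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder; substitute the real <prometheus_url>.
	const reloadURL = "http://localhost:9090/-/reload"

	// Without a client-side timeout the request just hangs while Prometheus
	// keeps retrying sends to the unreachable remote storage.
	client := &http.Client{Timeout: 30 * time.Second}

	resp, err := client.Post(reloadURL, "", nil)
	if err != nil {
		fmt.Println("reload did not complete:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("reload returned:", resp.Status)
}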
Environment

  • System information:

    Linux 3.10.0-514.26.2.el7.x86_64 x86_64

  • Prometheus version:

    2.0.0

  • Prometheus configuration file:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: default
...
remote_write:
- url: <remote_storage_url>
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels: [indicatorName]
    separator: ;
    regex: (.+)
    replacement: $1
    action: keep
  queue_config:
    capacity: 100000
    max_shards: 1000
    max_samples_per_send: 100
    batch_send_deadline: 5s
    max_retries: 10
    min_backoff: 30ms
    max_backoff: 100ms
  • Logs:
2018/01/19 08:25:22 Redirected: /-/reload
level=info ts=2018-01-19T08:25:22.313752379Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus/default.yml
level=info ts=2018-01-19T08:25:22.319062591Z caller=queue_manager.go:253 component=remote msg="Stopping remote storage..."
level=warn ts=2018-01-19T08:25:23.406480826Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:26.412528632Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:29.418470232Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
Chevron94 (Author) commented Jan 23, 2018

One more case of low performance with Prometheus 2.*: our Prometheus was deployed in OpenShift with about 60 targets monitored via the snmp, blackbox and other exporters. When we tried to reload the Prometheus configuration from the external network, we received a 504 error (30-second timeout). When we executed the reload from the internal network, we received a response about a successful update after 90-120 seconds.

Prometheus started on a local PC:
we have 2000 targets monitored via snmp. Some of them are reachable, some are not.
Prometheus 2.* handles a configuration reload roughly 100 times more slowly than 1.7.1. Why could that be? It looks like a performance bug.

Prometheus 2.*
$ time curl -X POST http://localhost:9090/-/reload
real    0m 10.007s
user    0m 0.004s
sys     0m 0.007s
Prometheus 1.7.1
$ time curl -X POST http://localhost:9090/-/reload
real    0m 0.122s
user    0m 0.005s
sys     0m 0.002s

prometheus configuration:

global:
  scrape_interval:     60s
  evaluation_interval: 60s
  external_labels:
      monitor: 'codelab-monitor'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 5s 
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'snmpjob'
    metrics_path: /snmp
    params:
      module: [base]
    static_configs:
      - targets:
        - 192.168.56.1:20000
        - 192.168.56.1:20001
        - 192.168.56.1:20002
        - 192.168.56.1:20003
...
        - 192.168.56.1:21997
        - 192.168.56.1:21998
        - 192.168.56.1:21999
        - 192.168.56.1:22000
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116

Chevron94 changed the title from "Prometheus 2.0 unable to reload configuration if remote storage is down" to "Prometheus 2.0 reload configuration issues" Jan 23, 2018

Chevron94 (Author) commented Jan 29, 2018

For the case with 2000 targets, goroutine counts from a trace (Prometheus 2.0):

github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run N=4702
net/http.(*Transport).getConn.func4 N=508
main.main.func2 N=1
github.com/prometheus/prometheus/retrieval.(*TargetManager).reload.func1 N=3
net/http.(*persistConn).readLoop N=523
net/http.(*conn).serve N=2
internal/singleflight.(*Group).doCall N=3
net/http.(*persistConn).writeLoop N=523
main.main.func4 N=1
runtime.timerproc N=1
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*SegmentWAL).run N=1
runtime/trace.Start.func1 N=1
github.com/prometheus/prometheus/retrieval.(*scrapePool).reload.func1 N=2003
context.WithDeadline.func2 N=31
github.com/prometheus/prometheus/vendor/github.com/cockroachdb/cmux.(*cMux).serve N=1
github.com/prometheus/prometheus/web.(*Handler).Run.func5 N=1
net.(*netFD).connect.func2 N=511
github.com/prometheus/prometheus/discovery.(*TargetSet).updateProviders.func1 N=3
net/http.(*connReader).backgroundRead N=1
github.com/prometheus/prometheus/discovery.(*StaticProvider).Run N=3
N=1136
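
(For anyone who wants a similar per-function goroutine count, one option, not necessarily how the numbers above were produced, is to pull the goroutine dump from Prometheus's /debug/pprof endpoint and group by the creating function. A rough sketch, assuming the default pprof handlers are reachable on localhost:9090:)

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"sort"
	"strings"
)

func main() {
	// debug=2 returns a full text dump with one block per goroutine.
	resp, err := http.Get("http://localhost:9090/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Group goroutines by the function that created them.
	counts := map[string]int{}
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "created by ") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, "created by "))
		if len(fields) > 0 {
			counts[fields[0]]++
		}
	}

	// Print in descending order, roughly matching the "func N=count" format above.
	type entry struct {
		fn string
		n  int
	}
	var out []entry
	for fn, n := range counts {
		out = append(out, entry{fn, n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].n > out[j].n })
	for _, e := range out {
		fmt.Printf("%s N=%d\n", e.fn, e.n)
	}
}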

Chevron94 closed this Jan 29, 2018

Chevron94 reopened this Jan 29, 2018

Chevron94 (Author) commented Jan 29, 2018

block
sched

krasi-georgiev (Member) commented Jan 29, 2018

tomwilkie (Member) commented Jan 29, 2018

I think this is a dupe of #2972

tomwilkie (Member) commented Jan 29, 2018

Sorry, I just saw the latest update - it looks more retrieval-related.

krasi-georgiev (Member) commented Jan 31, 2018

@tomwilkie I did a bit of tracing as well, and the blocking happens in QueueManager.Stop():

func (t *QueueManager) Stop() {
	log.Infof("Stopping remote storage...")
	close(t.quit)
	t.wg.Wait()
	t.shardsMtx.Lock()
	defer t.shardsMtx.Unlock()
	t.shards.stop()
	log.Info("Remote storage stopped.")
}

I will continue digging, but some more input might be useful.

block
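
For illustration only, here is a stripped-down model of why t.wg.Wait() can block for a long time while sends keep failing; the manager/sendSamples names are made up and this is not the actual queue manager code:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// sendSamples stands in for the remote-write HTTP POST; here it always
// fails, as when the remote storage is unreachable.
func sendSamples() error { return errors.New("no route to host") }

// manager is a toy stand-in for QueueManager, not the real type.
type manager struct {
	quit chan struct{}
	wg   sync.WaitGroup
}

func (m *manager) run() {
	defer m.wg.Done()
	for {
		select {
		case <-m.quit:
			return
		default:
		}
		// While a shard is inside this retry loop it is not watching
		// m.quit, so Stop() below has to wait the retries out.
		for i := 0; i < 10; i++ {
			if err := sendSamples(); err == nil {
				break
			}
			time.Sleep(100 * time.Millisecond)
		}
	}
}

func (m *manager) Stop() {
	close(m.quit)
	m.wg.Wait() // blocks until run() finishes its current batch of retries
	fmt.Println("remote storage stopped")
}

func main() {
	m := &manager{quit: make(chan struct{})}
	m.wg.Add(1)
	go m.run()

	time.Sleep(50 * time.Millisecond)
	start := time.Now()
	m.Stop()
	fmt.Println("Stop() took", time.Since(start))
}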

tomwilkie (Member) commented Jan 31, 2018

It's definitely the case that remote write being down can block shutdown. This issue seemed to imply that retrieval could too. If that's not the case, close this as a dupe of #2972 and I'll do a fix.

krasi-georgiev (Member) commented Jan 31, 2018

OK, I will wait for your fix and will continue digging if it doesn't resolve the issue.

vitaly-m commented Jan 31, 2018

Hi, unfortunately we have this issue not only when the remote storage is unavailable, but also when there are many targets (configured using file_sd configs).

krasi-georgiev (Member) commented Jan 31, 2018

Yes, that is also a problem; it should be addressed in #3762.

Chevron94 (Author) commented May 23, 2018

@tomwilkie, @krasi-georgiev
Why do we need to wait in this function in queue_manager.go? During a config reload it can get us stuck for a long time:

func (s *shards) stop() {
	for _, shard := range s.queues {
		close(shard)
	}
	s.wg.Wait()
}

If we remove it, reloading would work a little faster. Also, if we change the remote storage URL during retries, it could help us not lose samples and send them to the new storage faster.

tomwilkie (Member) commented May 23, 2018

The guarantee we offer upstream remote write endpoints is in-order sample delivery; without the wait group, an old shard can continue flushing after a new shard starts, delivering samples out of order.
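
A toy illustration of that ordering argument (not Prometheus code, all names made up): if the wait is skipped, the new shard starts sending while the old one is still flushing, and the endpoint observes interleaved timestamps:

package main

import (
	"fmt"
	"sync"
	"time"
)

// received is what the remote endpoint sees; it expects timestamps in order.
var (
	mu       sync.Mutex
	received []int
)

// flush pretends to send buffered samples (timestamps) one by one.
func flush(samples []int) {
	for _, ts := range samples {
		time.Sleep(10 * time.Millisecond) // each send takes a little while
		mu.Lock()
		received = append(received, ts)
		mu.Unlock()
	}
}

func main() {
	var oldShard, newShard sync.WaitGroup

	// The old shard still has buffered samples 1..3 to flush.
	oldShard.Add(1)
	go func() {
		defer oldShard.Done()
		flush([]int{1, 2, 3})
	}()

	// Skipping oldShard.Wait() here is what removing the wait amounts to:
	// the new shard starts sending 4..6 while the old one is still flushing.
	newShard.Add(1)
	go func() {
		defer newShard.Done()
		flush([]int{4, 5, 6})
	}()

	oldShard.Wait()
	newShard.Wait()
	fmt.Println(received) // e.g. [1 4 2 5 3 6] - out of order for the endpoint
}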

Chevron94 (Author) commented May 23, 2018

What if we make stopping the shards "forced"?
In the resharding case we do everything as is (stop the old shards, then start the new ones); when reloading the configuration we force the stop (without waiting). That way we could keep in-order sample delivery.

func (s *shards) stop(force bool) {
	for _, shard := range s.queues {
		close(shard)
	}
	if !force {
		s.wg.Wait()
	}
}
tomwilkie (Member) commented May 23, 2018

If you skip the wait, samples will be sent out of order. If you want to force the shards to stop, you need #3773

Chevron94 (Author) commented May 23, 2018

Thanks, will look at it.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019