Fix for the slow updates of targets changes #4526
Conversation
/benchmark
@krasi-georgiev: Welcome to the Prometheus Benchmarking Tool. The two Prometheus versions that will be compared are pr-4526 and master. The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation. After successful deployment, the benchmarking metrics can be viewed at:
To stop the benchmark process, comment /benchmark cancel. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@grobie I'm still polishing a few details, but you can already test this version or wait a few days until all the small issues are cleared. The last major problem is with the http2 Go client update in golang/net@1c05540; as long as your k8s cluster doesn't use the default connection stream limit of 1000, you should be fine.
More details about the http2 client issue.
The benchmark is looking good. When I scale the targets up and down, the CPU and memory usage is higher, but this is to be expected as all processing is now done in parallel.
Since the bug was slow target updates, this will be tested when we implement the e2e SD tests.
this also speeds up the prometheus's shutting down.
I don't see anything in this PR related to that. Am I missing something?
scrapeConfig, ok := m.scrapeConfigs[setName]
if !ok {
	level.Error(m.logger).Log("msg", "error reloading target set", "err", fmt.Sprintf("invalid config id:%v", setName))
	return
/benchmark cancel
Hi @krasi-georgiev , I'm back at work and would like to canary all those SD changes. What's the state of this PR? Should this still be included? You said you wanted to rebase this before further checking.
@beorn7 IIUC, #4124 that you reported on targets not being updated should be fixed now on master by #4523 and #4556. This PR is more about optimizations that speed up the recreation of the scrape loops. Also #3912 (which merges identical SD configurations into a single provider and thus avoids hammering the K8S API) has been merged. Finally, Krasi has found a couple of issues with the K8S Go client (#4518 and #4528), but I'm not sure that they apply to your environment. In summary, you're more than welcome to test the current master on your clusters 😏
OK, I'll try current master first and leave this PR alone for now. |
Force-pushed from 15e9ec9 to 6b36538
I have rebased it.
When reloading the targets and stopping old scrapes, Prometheus will shut down only when the reloading is over and all old scrapes are stopped. Processing these in parallel allows Prometheus to shut down a lot quicker.
review please 😝
LGTM, but I think it would be worth having unit tests for this.
@simonpasquier what unit tests did you have in mind? This PR reduces the time it takes to reload the scrape loops.
I can run this on one of our servers at my next convenience.
@beorn7 appreciated!
I was thinking about something similar to #4582, although looking at the manager's code it would require some adjustments to be able to skip the actual launch of the scrapers.
Running this now (with the current state of master merged in). Disclaimer: I'm not using the race detector (it would not work on the heavily loaded test machine), but I recommend doing so before merging this.
I guess this change should result in faster reloads of the config file. I sighup'd the test server and the 2.4.0 production server a few times and couldn't detect a significant improvement, mostly because reload times vary widely from case to case, from 5s to 30s, more or less the same on both servers.
Restart time is quite long on this server (around 12m (!)) and isn't significantly different with this PR merged in. (Perhaps the new WAL implementation is slower? I don't remember such long startup times from earlier 2.x versions. But perhaps our load has just increased so much in the meantime...)
Altogether, the two servers seem to behave mostly the same.
I tested this mainly with the k8s discovery: I think 100 jobs with 20-30 targets per job, then randomly scaling the number of targets up and down between 1 and 20. Before this PR it takes minutes to reflect the changes in the /targets web GUI, and it can never catch up in an environment of constantly changing targets.
When I tested locally it also improved the shutdown time, as now old scrape loops are stopped in parallel.
@krasi-georgiev have you retried your tests with the latest master? I suspect that the various improvements to the SD manager may have changed the situation.
@krasi-georgiev I see. I guess our test server doesn't see quickly changing targets that often. I might be able to identify one of those, and then run the canary binary there.
Looking around, with 2.4.0, all our practical uses of K8s SD seem to update fast enough so that any further improvements would not stick out of the noise.
Maybe @brancz can also give it a try, since it is very easy to replicate using his configs, even with minikube. @simonpasquier this change is in the scrape manager, which is completely independent from the SD manager.
Mostly #3912. It might be that optimizing the discovery part (basically far fewer requests to the K8S API) reduced the contention and made the problem less acute on the scrape side.
Aah yes, that definitely helped, but processing the loops in parallel solves a completely different problem.
After letting it run overnight, I can just say that it works at least as well as the 2.4.0 one.
Scaling from 0 to 2000 targets back and forth, I definitely see an improvement when down-scaling. WDYT of adding a few tests?
The only useful test that I can see is making sure the scrape manager never blocks the receiving channel. Is this what you had in mind as well?
Yes. Also making sure that data received on the channel triggers scrape updates, and that everything is torn down properly on shutdown. I know that it wasn't tested before, but it looks like a good opportunity to fix it.
@simonpasquier yes good idea, I added some tests. Just not sure what you had in mind for:
Force-pushed from 9b0e803 to fc7bf2b
scrape/manager.go
Outdated
}

wg.Add(1)
// Run the sync in paralel as these take a while and at high load can't catch up.
s/paralel/parallel/
👍 It would be even better to test the logic including
That calling
Yeah, I didn't want to take it that far, as this would probably end up being too high-maintenance.
The scrape manager receiver channel now just saves the target sets, and another background runner updates the scrape loops every 5 seconds. This is so that the scrape manager doesn't block the receiving channel when it does the long background reloading of the scrape loops.
Active and dropped targets are now saved in each scrape pool instead of the scrape manager. This is mainly to avoid races when getting the targets via the web API.
Reloading the scrape loops now happens in parallel to speed up reaching the final desired state, and this also speeds up Prometheus's shutdown.
Also updated some func signatures in the web package for consistency.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
Force-pushed from eb1ecf8 to eda48a2
fixes: #4124
fixes: #4301
I will run a benchmark once #4523 is merged and I rebase this PR on top of it.
Signed-off-by: Krasi Georgiev kgeorgie@redhat.com