
Fix for the slow updates of targets changes #4526

Merged
merged 1 commit into from Sep 26, 2018

Conversation

krasi-georgiev
Contributor

@krasi-georgiev krasi-georgiev commented Aug 22, 2018

The scrape manager receiver channel now just saves the target sets,
and another background runner updates the scrape loops every 5 seconds.
This is so that the scrape manager doesn't block the receiving channel
while it does the long background reloading of the scrape loops.

Active and dropped targets are now saved in each scrape pool instead of
the scrape manager. This is mainly to avoid races when getting the
targets via the web API.

Reloading the scrape loops now happens in parallel, which speeds up
reaching the final desired state and also speeds up Prometheus's
shutdown.
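
For illustration, the decoupled receive/reload pattern described above could be sketched roughly like this (simplified, with illustrative names rather than the actual scrape manager code; the reload body is left as a stub):

import (
	"sync"
	"time"

	"github.com/prometheus/prometheus/discovery/targetgroup"
)

type manager struct {
	mtx           sync.Mutex
	targetSets    map[string][]*targetgroup.Group
	triggerReload chan struct{}
}

func newManager() *manager {
	return &manager{triggerReload: make(chan struct{}, 1)}
}

// run only stores the latest target sets and signals the background
// reloader, so the sender on tsets is never blocked by a slow reload.
func (m *manager) run(tsets <-chan map[string][]*targetgroup.Group, stop <-chan struct{}) {
	go m.reloader(stop)
	for {
		select {
		case ts := <-tsets:
			m.mtx.Lock()
			m.targetSets = ts
			m.mtx.Unlock()
			select {
			case m.triggerReload <- struct{}{}:
			default: // A reload is already pending.
			}
		case <-stop:
			return
		}
	}
}

// reloader applies the stored target sets at most once every 5 seconds.
func (m *manager) reloader(stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			select {
			case <-m.triggerReload:
				m.reload()
			default:
			}
		case <-stop:
			return
		}
	}
}

// reload would recreate the scrape loops from m.targetSets; in the real
// manager this is the long-running work that used to block the channel.
func (m *manager) reload() {}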

Also updated some function signatures in the web package for consistency.

fixes: #4124
fixes: #4301

I will run a benchmark once #4523 is merged and I rebase this PR on top of it.

Signed-off-by: Krasi Georgiev kgeorgie@redhat.com

@krasi-georgiev krasi-georgiev changed the title [WIP] fix for the slow updates of targets changes Fix for the slow updates of targets changes Aug 28, 2018
@krasi-georgiev
Contributor Author

/benchmark

@prombot
Contributor

prombot commented Aug 28, 2018

@krasi-georgiev: Welcome to Prometheus Benchmarking Tool.

The two prometheus versions that will be compared are pr-4526 and master

The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation

After successful deployment, the benchmarking metrics can be viewed at:

To stop the benchmark process, comment /benchmark cancel.

In response to this:

/benchmark

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krasi-georgiev
Contributor Author

krasi-georgiev commented Aug 28, 2018

@grobie still polishing a few details, but you can already test this version or wait a few days until all the small issues are cleared up.

The last major problem is with the http2 Go client update in golang/net@1c05540; as long as your k8s cluster doesn't use the default connection stream limit of 1000, you should be fine.
Alternatively, you can disable HTTP2 by running Prometheus with

DISABLE_HTTP2=true ./prometheus ....

More details about the http2 client issue:
kubernetes/client-go#456

@krasi-georgiev
Contributor Author

krasi-georgiev commented Aug 29, 2018

The benchmark is looking good. When I scale the targets up and down, the CPU and memory usage is higher, but this is to be expected as all processing is now done in parallel.

@krasi-georgiev
Contributor Author

Since the bug was slow target updates, this will be tested when we implement the e2e SD tests.

Member

@simonpasquier simonpasquier left a comment

this also speeds up Prometheus's shutdown.

I don't see anything in this PR related to that. Am I missing something?

scrapeConfig, ok := m.scrapeConfigs[setName]
if !ok {
	level.Error(m.logger).Log("msg", "error reloading target set", "err", fmt.Sprintf("invalid config id:%v", setName))
	return
}

@krasi-georgiev
Contributor Author

/benchmark cancel

@beorn7
Member

beorn7 commented Sep 5, 2018

Hi @krasi-georgiev , I'm back at work and would like to canary all those SD changes. What's the state of this PR? Should this still be included? You said you wanted to rebase this before further checking.

@simonpasquier
Member

@beorn7 IIUC #4124 that you reported (targets not being updated) should be fixed now on master by #4523 and #4556. This PR is more about optimizations that speed up the recreation of the scrape loops. Also #3912 (which merges identical SD configurations into a single provider and thus avoids hammering the K8S API) has been merged. Finally, Krasi has found a couple of issues with the K8S Go client (#4518 and #4528), but I'm not sure they apply to your environment.

In summary, you're more than welcome to test the current master on your clusters 😏

@beorn7
Member

beorn7 commented Sep 5, 2018

OK, I'll try current master first and leave this PR alone for now.

@krasi-georgiev
Contributor Author

I have rebased it.

@krasi-georgiev
Contributor Author

this also speeds up Prometheus's shutdown.

I don't see anything in this PR related to that. Am I missing something?

When reloading the targets and stopping the old scrape loops, Prometheus will shut down only after the reloading is over and all old scrape loops are stopped. Processing these in parallel allows Prometheus to shut down a lot quicker.
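
As a rough sketch of the idea (illustrative only, assuming a loop type with a blocking stop method; not the exact code in this PR):

import "sync"

// loop stands for a scrape loop whose stop method blocks until the
// loop's in-flight scrape has finished.
type loop interface {
	stop()
}

// stopAll stops all old scrape loops in parallel and waits for them,
// instead of stopping them one at a time.
func stopAll(loops []loop) {
	var wg sync.WaitGroup
	for _, l := range loops {
		wg.Add(1)
		go func(l loop) {
			defer wg.Done()
			l.stop()
		}(l)
	}
	wg.Wait()
}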

@krasi-georgiev
Contributor Author

review please 😝

@krasi-georgiev
Contributor Author

@beorn7 , @grobie would you mind testing this?

@simonpasquier
Member

LGTM, but I think it would be worth having unit tests for this.

@krasi-georgiev
Contributor Author

@simonpasquier what unit tests did you have in mind?

This PR addresses the time it takes to reload the scrape loops.
An e2e test with timings would be flaky, but suggestions are welcome.

@beorn7
Member

beorn7 commented Sep 19, 2018

I can run this on one of our servers at my next convenience.

@krasi-georgiev
Contributor Author

@beorn7 appreciated!

@simonpasquier
Member

I was thinking about something similar to #4582, although looking at the manager's code it would require some adjustments to be able to skip the actual launch of the scrapers.

@beorn7
Member

beorn7 commented Sep 19, 2018

Running this now (with current state of master merged in).

Disclaimer: I'm not using the race detector (it would not work on the heavily loaded test machine), but I recommend doing so before merging this.

I guess this change should result in faster reloads of the config file. I SIGHUP'd the test server and the 2.4.0 production server a few times and couldn't detect a significant improvement, mostly because reload times vary vastly from case to case, from 5s to 30s, more or less the same on both servers.

Restart time is quite long on this server (around 12m (!)) and isn't significantly different with this PR merged in. (Perhaps the new WAL implementation is slower? I don't remember such long startup times from earlier 2.x versions. But perhaps our load has just increased so much in the meantime...)

Altogether, the two servers seem to behave mostly the same.

@krasi-georgiev
Contributor Author

krasi-georgiev commented Sep 19, 2018

I tested this mainly with the k8s discovery.
Using the config from: #4124 (comment)

I think 100 jobs with 20-30 targets per jobs.

Than randomly scaling up and down the number of targets between 1 and 20 .

Before this pr it takes minutes to reflect the changes in the /targets web gui and it can never catch up in an environment of constantly changing targets.

@krasi-georgiev
Contributor Author

When I tested locally, it also improved the shutdown time, as the old scrape loops are now stopped in parallel.

@simonpasquier
Member

@krasi-georgiev have you retried your tests with the latest master? I suspect that the various improvements to the SD manager may have changed the situation.

@beorn7
Member

beorn7 commented Sep 19, 2018

@krasi-georgiev I see. I guess our test server doesn't see quickly changing targets that often. I might be able to identify one of those, and then run the canary binary there.

@beorn7
Member

beorn7 commented Sep 19, 2018

Looking around, with 2.4.0, all our practical uses of K8s SD seem to update fast enough so that any further improvements would not stick out of the noise.

@krasi-georgiev
Contributor Author

Maybe @brancz can also give it a try since it is very easy to replicate using his configs, even using minikube.

@simonpasquier this change is in the scrape manager, which is completely independent from the SD manager.
Which changes do you think might be relevant here?

@simonpasquier
Member

Which changes do you think might be relevant here

Mostly #3912. It might be that optimizing the discovery part (basically far fewer requests to the K8S API) reduced the contention and made the problem less acute on the scrape side.

@krasi-georgiev
Contributor Author

krasi-georgiev commented Sep 19, 2018

Aah yes, that definitely helped, but processing the loops in parallel solves a completely different problem.

@beorn7
Member

beorn7 commented Sep 20, 2018

After letting it run overnight, I can just say that it works at least as well as the 2.4.0 one.

@simonpasquier
Member

Scaling from 0 to 2000 targets back and forth, I definitely see an improvement when down-scaling. WDYT of adding a few tests?

@krasi-georgiev
Contributor Author

The only useful test that I can see is making sure the scrape manager never blocks on the tsets <-chan map[string][]*targetgroup.Group channel.

Is this what you had in mind as well?
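
Roughly, such a test could look like the sketch below. It is written against the illustrative manager type from the sketch in the PR description above, not the real scrape.Manager API:

import (
	"testing"
	"time"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/discovery/targetgroup"
)

func TestTargetUpdatesNeverBlock(t *testing.T) {
	tsets := make(chan map[string][]*targetgroup.Group)
	stop := make(chan struct{})
	defer close(stop)
	m := newManager()
	go m.run(tsets, stop)
	update := map[string][]*targetgroup.Group{
		"job1": {{Targets: []model.LabelSet{{"__address__": "localhost:9090"}}}},
	}
	// Every send must be accepted promptly, even while a reload is pending.
	for i := 0; i < 100; i++ {
		select {
		case tsets <- update:
		case <-time.After(5 * time.Second):
			t.Fatal("scrape manager blocked on the target set channel")
		}
	}
}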

@simonpasquier
Member

Yes. Also making sure that data received on the channel triggers scrape updates and that everything is torn down properly on shutdown. I know that it wasn't tested before, but it looks like a good opportunity to fix it.

@krasi-georgiev
Contributor Author

@simonpasquier yes, good idea, I added some tests.

Just not sure what you had in mind for:

everything is torn down properly on shutdown.

}

wg.Add(1)
// Run the sync in paralel as these take a while and at high load can't catch up.
Member

s/paralel/parallel/

@simonpasquier
Member

👍 It would be even better to test the logic including reload(), but it is more involved and wasn't covered before either...

everything is torn down properly on shutdown.

That calling Stop() does the right thing.
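
For example, a shutdown check against the same illustrative manager sketched in the PR description (the real manager exposes this via its Stop() method), placed in the same test file as the sketch above:

func TestManagerShutdown(t *testing.T) {
	tsets := make(chan map[string][]*targetgroup.Group)
	stop := make(chan struct{})
	m := newManager()
	done := make(chan struct{})
	go func() {
		m.run(tsets, stop)
		close(done)
	}()
	close(stop) // Corresponds to calling Stop() on the real manager.
	select {
	case <-done:
		// run returned: the manager shut down cleanly.
	case <-time.After(5 * time.Second):
		t.Fatal("manager did not shut down after stop was closed")
	}
}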

@krasi-georgiev
Contributor Author

Yeah, I didn't want to take it that far as this would probably end up being too high maintenance.
In general it is not a bad idea, but it is outside the scope of this PR.
Even if someone decides to implement it, let's leave it for another PR so we keep this one on topic and don't block it any longer than needed.
The same for the teardown.
