Improve blocking queries on services that do not exist #4810

ShimmerGlass · 2018-10-17T09:28:28Z

Background

When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.

Implementation

While reducing the number of unnecessary updates we still want :

Clients to be notified as soon as when the last instance of a service disapears.
Clients to be notified whenever there's there is an update for the service.
Clients to be notified as soon as the first instance of the requested service is added.

To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :

Client client1 makes a query for service foo, gets back a node and X-Consul-Index 42
foo is unregistered
client1 makes a query for foo with index=42 -> foo does not exist, the query blocks and client1 is not notified of the change on foo

We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly.
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist.
When a service srv is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients while srv was up, allowing us to immediatly notify them.

Client client1 makes a query for service foo, gets back a node and X-Consul-Index: 42
foo is unregistered, we set the "missing service index" to 43
client1 makes a blocking query for foo with index=42 -> foo does not exist, we check against the "missing service index" and return immediatly with X-Consul-Index: 43
client1 makes a blocking query for foo with index=43 -> we block
Other changes happen in the cluster, but foo still doesn't exist and "missing service index" hasn't changed, the query is still blocked
foo is registered again on index 62 -> foo exists and its index is greater than 43, we unblock the query

pierresouchay

LGTM

This looks like a great enhancement to reduce furthermore the load on a cluster where clients are watching a non-existing service without any kind of rate limit. It also ensure that any client observing a soon-to-exist service will be notified quite less and will, because the full disappearance of a service happens far less than updates on services.

We had example (such as prometheus with Consul support) of implementation literally DOSing our clusters with this kind of behaviour. On another topic, the cache implementation will benefit from it as well.

ShimmerGlass · 2018-11-26T09:57:14Z

Did you have time to take a look at this ? During our last Consul incident several of our services died, resulting in lots of requests because the blocking queries watching them did not block anymore. This put an additional load on the servers which were in bad shape already and has a worrying snowballing effect.

banks

Thanks @Aestek.

This is simple and I understand the case it solves thanks!

My one concern is that although it might be manageable in practice, this does result in potentially unbounded server memory growth over time.

In a workload for example where services are registered with fairly dynamic names so they come and go a lot, this will leave behind index records for every service name that ever existed in the cluster with no way to purge those.

It would have to be a really pathological case to cause a major problem, but it's still not ideal to be adding something which by-design "leaks" state in this way with no bounds or way to recover the space.

In the near future we are likely to add a concept of global config for a service rather than only the registrations of individual nodes. This feature is pre-design so not a firm commitment and no details decided yet, but it's likely to appear in the next months. With that, there is a simple answer - you allow explicitly registering the (logical) service as "persistent" in it's global config regardless of if any actual instances are up and then we can enable the storage of these "deregistration" indexes only for "persistent" services. If the service goes away for good, the operator can remove it's global config or set persistent to false, and at that point servers can go delete all the indexes.

So I'm not quite sure whether to leave this a while until we have a firmer idea on that stuff and do it "properly", or build some other mechanism that does the same but will potentially be replaced soon? I'd be reluctant to make any public config or API changes to support this that we then have to keep forever even though they'll possibly be obsolete soon.

We could though, if you think it's worth the effort for the interim, add a fixed TTL like a week and then introduce a reaper goroutine on the leader that checks every hour for deregistered_index entries that have not had any service registered for a week and remove them?

Do you have opinions on that?

banks · 2018-11-26T13:04:27Z

agent/consul/state/catalog.go

@@ -12,7 +12,8 @@ import (
 )

 const (
-	servicesTableName = "services"
+	servicesTableName       = "services"
+	serviceMissingIndexName = "service_missing"


Maybe naming it "service_deregistered" would be clearer?

ShimmerGlass · 2018-11-26T14:12:21Z

Maybe I'm missing something but this should not cause a memory leak : we only keep track of the last index any service was deregistered at. This is a single entry in the radix regardless of the number of the number of service.

banks · 2018-11-26T14:53:26Z

You are right - my apologies I assumed this was per-service like the other index in use in the same function.

That does remove that concern and is a lot simpler which is great.

I guess that does make a pathalogical case where services are being registered and de-registered (so that zero instances of the service any more) every second no better than now though since that index will be incrementing all the time, but that seems pretty rare in practice so I think it's a good solution for now.

Could we rename that index and references in this PR to be more reflective of it's actual semantic, I suggested something before but given that its global maybe last_service_extinction and then define in the doc comment for it that we consider a service to be "extinct" when the last instance of it deregisters but that any service extinction will update the same single index?

Then I think this would be great to merge as it is!

Thanks for all the thought here.

ShimmerGlass · 2018-11-26T15:21:53Z

@banks done

agent/consul/state/catalog.go

banks

LGTM!

banks · 2019-01-10T16:03:26Z

agent/consul/state/catalog.go

-		}
+// If the index is not set for the service, it will return the missing
+// service index.
+// The service_last_extinction is set to the last raf index when a service


Suggested change

// The service_last_extinction is set to the last raf index when a service

// The service_last_extinction is set to the last raft index when a service

Once this typo is fixed I will merge this.

When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated. On clusters with frequent updates these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests. To remedy that we track the last index when a service was last unregistered and instead block on it when a query targets a missing service. This reduces the number of unnecessary updates and still correctly notify client when a change do happen.

ShimmerGlass · 2019-01-11T09:05:03Z

@mkeeler woops, fixed

mkeeler · 2019-01-11T14:26:12Z

Thanks @Aestek

pierresouchay · 2019-01-11T14:28:52Z

@mkeeler @banks Thank you very much!

pierresouchay approved these changes Oct 17, 2018

View reviewed changes

ShimmerGlass mentioned this pull request Nov 20, 2018

Add minDelayOnEmptyResult in cache config rickfast/consul-client#360

Merged

banks added this to the Upcoming milestone Nov 26, 2018

banks requested changes Nov 26, 2018

View reviewed changes

ShimmerGlass force-pushed the missing-service-idx branch from 24fa561 to a8c9ac3 Compare November 26, 2018 15:21

pearkes modified the milestones: Upcoming, 1.4.1 Nov 30, 2018

pearkes assigned mkeeler Jan 7, 2019

banks reviewed Jan 10, 2019

View reviewed changes

agent/consul/state/catalog.go Outdated Show resolved Hide resolved

banks approved these changes Jan 10, 2019

View reviewed changes

ShimmerGlass force-pushed the missing-service-idx branch from a8c9ac3 to 0f4a0a9 Compare January 11, 2019 09:04

mkeeler merged commit 4afbe79 into hashicorp:master Jan 11, 2019

ShimmerGlass deleted the missing-service-idx branch January 11, 2019 14:42

banks mentioned this pull request Jan 30, 2019

Consul 1.4.x high CPU #5295

Closed

pierresouchay mentioned this pull request Aug 30, 2019

Added rate_limit support to avoid sending too many requests to Consul Agent hashicorp/consul-template#1066

Closed

pierresouchay mentioned this pull request Sep 23, 2019

Blocking Queries on ACL resources and Config Entries Unblock Too Often #6530

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve blocking queries on services that do not exist #4810

Improve blocking queries on services that do not exist #4810

ShimmerGlass commented Oct 17, 2018 •

edited

Loading

pierresouchay left a comment

ShimmerGlass commented Nov 26, 2018

banks left a comment

banks Nov 26, 2018

ShimmerGlass commented Nov 26, 2018

banks commented Nov 26, 2018

ShimmerGlass commented Nov 26, 2018

banks left a comment

banks Jan 10, 2019

mkeeler Jan 10, 2019

ShimmerGlass commented Jan 11, 2019

mkeeler commented Jan 11, 2019

pierresouchay commented Jan 11, 2019

	// The service_last_extinction is set to the last raf index when a service
	// The service_last_extinction is set to the last raft index when a service

Improve blocking queries on services that do not exist #4810

Improve blocking queries on services that do not exist #4810

Conversation

ShimmerGlass commented Oct 17, 2018 • edited Loading

Background

Implementation

pierresouchay left a comment

Choose a reason for hiding this comment

ShimmerGlass commented Nov 26, 2018

banks left a comment

Choose a reason for hiding this comment

banks Nov 26, 2018

Choose a reason for hiding this comment

ShimmerGlass commented Nov 26, 2018

banks commented Nov 26, 2018

ShimmerGlass commented Nov 26, 2018

banks left a comment

Choose a reason for hiding this comment

banks Jan 10, 2019

Choose a reason for hiding this comment

mkeeler Jan 10, 2019

Choose a reason for hiding this comment

ShimmerGlass commented Jan 11, 2019

mkeeler commented Jan 11, 2019

pierresouchay commented Jan 11, 2019

ShimmerGlass commented Oct 17, 2018 •

edited

Loading