Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve blocking queries on services that do not exist #4810

Merged
merged 1 commit into from
Jan 11, 2019

Conversation

ShimmerGlass
Copy link
Contributor

@ShimmerGlass ShimmerGlass commented Oct 17, 2018

Background

When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.

Implementation

While reducing the number of unnecessary updates we still want :

  • Clients to be notified as soon as when the last instance of a service disapears.
  • Clients to be notified whenever there's there is an update for the service.
  • Clients to be notified as soon as the first instance of the requested service is added.

To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :

  1. Client client1 makes a query for service foo, gets back a node and X-Consul-Index 42
  2. foo is unregistered
  3. client1  makes a query for foo with index=42 -> foo does not exist, the query blocks and client1 is not notified of the change on foo

We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly.
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist.
When a service srv is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients while srv was up, allowing us to immediatly notify them.

  1. Client client1 makes a query for service foo, gets back a node and X-Consul-Index: 42
  2. foo is unregistered, we set the "missing service index" to 43
  3. client1 makes a blocking query for foo with index=42 -> foo does not exist, we check against the "missing service index" and return immediatly with X-Consul-Index: 43
  4. client1 makes a blocking query for foo with index=43 -> we block
  5. Other changes happen in the cluster, but foo still doesn't exist and "missing service index" hasn't changed, the query is still blocked
  6. foo is registered again on index 62 -> foo exists and its index is greater than 43, we unblock the query

Copy link
Contributor

@pierresouchay pierresouchay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This looks like a great enhancement to reduce furthermore the load on a cluster where clients are watching a non-existing service without any kind of rate limit. It also ensure that any client observing a soon-to-exist service will be notified quite less and will, because the full disappearance of a service happens far less than updates on services.

We had example (such as prometheus with Consul support) of implementation literally DOSing our clusters with this kind of behaviour. On another topic, the cache implementation will benefit from it as well.

@ShimmerGlass
Copy link
Contributor Author

Did you have time to take a look at this ? During our last Consul incident several of our services died, resulting in lots of requests because the blocking queries watching them did not block anymore. This put an additional load on the servers which were in bad shape already and has a worrying snowballing effect.

@banks banks added this to the Upcoming milestone Nov 26, 2018
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Aestek.

This is simple and I understand the case it solves thanks!

My one concern is that although it might be manageable in practice, this does result in potentially unbounded server memory growth over time.

In a workload for example where services are registered with fairly dynamic names so they come and go a lot, this will leave behind index records for every service name that ever existed in the cluster with no way to purge those.

It would have to be a really pathological case to cause a major problem, but it's still not ideal to be adding something which by-design "leaks" state in this way with no bounds or way to recover the space.

In the near future we are likely to add a concept of global config for a service rather than only the registrations of individual nodes. This feature is pre-design so not a firm commitment and no details decided yet, but it's likely to appear in the next months. With that, there is a simple answer - you allow explicitly registering the (logical) service as "persistent" in it's global config regardless of if any actual instances are up and then we can enable the storage of these "deregistration" indexes only for "persistent" services. If the service goes away for good, the operator can remove it's global config or set persistent to false, and at that point servers can go delete all the indexes.

So I'm not quite sure whether to leave this a while until we have a firmer idea on that stuff and do it "properly", or build some other mechanism that does the same but will potentially be replaced soon? I'd be reluctant to make any public config or API changes to support this that we then have to keep forever even though they'll possibly be obsolete soon.

We could though, if you think it's worth the effort for the interim, add a fixed TTL like a week and then introduce a reaper goroutine on the leader that checks every hour for deregistered_index entries that have not had any service registered for a week and remove them?

Do you have opinions on that?

@@ -12,7 +12,8 @@ import (
)

const (
servicesTableName = "services"
servicesTableName = "services"
serviceMissingIndexName = "service_missing"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe naming it "service_deregistered" would be clearer?

@ShimmerGlass
Copy link
Contributor Author

Maybe I'm missing something but this should not cause a memory leak : we only keep track of the last index any service was deregistered at. This is a single entry in the radix regardless of the number of the number of service.

@banks
Copy link
Member

banks commented Nov 26, 2018

You are right - my apologies I assumed this was per-service like the other index in use in the same function.

That does remove that concern and is a lot simpler which is great.

I guess that does make a pathalogical case where services are being registered and de-registered (so that zero instances of the service any more) every second no better than now though since that index will be incrementing all the time, but that seems pretty rare in practice so I think it's a good solution for now.

Could we rename that index and references in this PR to be more reflective of it's actual semantic, I suggested something before but given that its global maybe last_service_extinction and then define in the doc comment for it that we consider a service to be "extinct" when the last instance of it deregisters but that any service extinction will update the same single index?

Then I think this would be great to merge as it is!

Thanks for all the thought here.

@ShimmerGlass
Copy link
Contributor Author

@banks done

@pearkes pearkes modified the milestones: Upcoming, 1.4.1 Nov 30, 2018
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}
// If the index is not set for the service, it will return the missing
// service index.
// The service_last_extinction is set to the last raf index when a service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The service_last_extinction is set to the last raf index when a service
// The service_last_extinction is set to the last raft index when a service

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this typo is fixed I will merge this.

When making a blocking query on a missing service (was never registered,
 or is not registered anymore) the query returns as soon as any service
is updated.
On clusters with frequent updates these queries virtually do not block,
and clients with no protections againt this waste ressources on the agent
and server side. Clients that do protect against this get updates later
than they should because of the backoff time they implement between
requests.
To remedy that we track the last index when a service was last
unregistered and instead block on it when a query targets a missing
service. This reduces the number of unnecessary updates and still
correctly notify client when a change do happen.
@ShimmerGlass
Copy link
Contributor Author

@mkeeler woops, fixed

@mkeeler
Copy link
Member

mkeeler commented Jan 11, 2019

Thanks @Aestek

@mkeeler mkeeler merged commit 4afbe79 into hashicorp:master Jan 11, 2019
@pierresouchay
Copy link
Contributor

@mkeeler @banks Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants