New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

X-Consul-Index reflects global service churn when watching a single service causing issues with sleeping between blocking calls #3890

pierresouchay opened this Issue Feb 15, 2018 · 21 comments


None yet
4 participants

pierresouchay commented Feb 15, 2018

After #3712 (which is not really a big issue for us) we found another weird issue with X-Consul-Index while trying to improve Consul discovery within Prometheus: prometheus/prometheus#3814

When watching a service using a X-Consul-Index, if waiting more than 3 seconds before watching again the same path, the next call won't block and will return immediately the same data (which has not changed)

Is there a reason for that or is it a bug ?

In order to reproduce, here is a script:

if test ${1:-none} = "none"
  echo "USAGE: $0 Consul_URL"
  echo "       Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1

while true;
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { "echo Failed to query ${url}"; exit 1; }
  if test $index -ne 0
    diff -u "$content" "$" && echo " diff: No Differences found in service"
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  if test "${index:-not_found}" = "not_found"
    # We are in a part of Consul that does not output X-Consul-Index in headers, switch to poll
    sleep 5
  if test ${CONSUL_SLEEP_DELAY:-0} -gt 0
  mv "$" "$content"
  printf "X-Consul-Index: $index at $(date) \b"

When launching at the same time on our cluster the same script with a different delay between watches, here is what we observe:

With delay < 3 : the calls do block

export CONSUL_SLEEP_DELAY=1; my-consul-agent:8500/v1/catalog/service/graphite-relay
X-Consul-Index: 851481857 at Thu 15 Feb 2018 10:18:20 CET

With delay > 3, the calls do not block, and return immediately

X-Consul-Index: 851481857 at Thu 15 Feb 2018 10:18:24 CET diff: No Differences found in service
X-Consul-Index: 851482009 at Thu 15 Feb 2018 10:18:28 CET diff: No Differences found in service
X-Consul-Index: 851482140 at Thu 15 Feb 2018 10:18:32 CET diff: No Differences found in service
X-Consul-Index: 851482260 at Thu 15 Feb 2018 10:18:36 CET diff: No Differences found in service
X-Consul-Index: 851482374 at Thu 15 Feb 2018 10:18:40 CET diff: No Differences found in service

This comment has been minimized.


banks commented Feb 15, 2018

Hi Pierre,

The docs on blocking queries say:

A critical note is that the return of a blocking request is no guarantee of a change. It is possible that the timeout was reached or that there was an idempotent write that does not affect the result of the query.

So it's not a bug that you get same result necessarily although not blocking at all is unexpected.

That said I can't reproduce that behaviour using your script locally, even with 4 second sleeps it still blocks as expected: (note I changed the wait time to 20 seconds to generate the output below because I got bored or waiting but same result with your original 5m:

$ env CONSUL_SLEEP_DELAY=4 ~/Downloads/ 'http://localhost:8500/v1/health/node/mynode'
X-Consul-Index: 5 at Thu 15 Feb 2018 13:35:17 GMT diff: No Differences found in service
X-Consul-Index: 5 at Thu 15 Feb 2018 13:35:42 GMT diff: No Differences found in service
X-Consul-Index: 5 at Thu 15 Feb 2018 13:36:06 GMT

This comment has been minimized.


banks commented Feb 15, 2018

Ah, the index increases every time in your second output I think are the clue we need. When you hit a blocking endpoint with an index param we will immediately return the current response if the index is stale since we don't know if you missed updates (unless we traversed the whole raft log between your index and current one to see if the thing you are watching changed...)

So the reason your script works with no sleep is that you are starting the next watch immediately from where you left off. in the case of 4 seconds, you clearly have enough write activity on your cluster to mean that if you want a few seconds the index will have changed therefore you are always calling with a stale index and so always getting an immediate return.

In my case it was a local dev cluster with no other writes occurring so even if I waited 4 seconds, the index I used wasn't stale so blocking happens.

I don't know that anything else can work - if you want to use blocking queries to stay in sync with state you kind of have to "keep up" with current index state since Consul has no way to know in between blocking calls that you are still interested in that thing, who you are or what index you last saw that thing at.

In other words, having a delay between blocking calls is working as expected and generally is not what you want to do on a real cluster with any amount of activity.

Let me know if I've missed something and that explanation doesn't make sense to you!

@banks banks closed this Feb 15, 2018


This comment has been minimized.


pierresouchay commented Feb 15, 2018

Ok, I think I understand the problem:

Basically, we have updates very frequently (since our cluster is large enough) -> X-Consul-Index is updated every 2-3 seconds.

We have an issue because when we watch (for instance /v1/catalog/services ) -> we are downloading continuously data (as soon as your are receiving data, another update comes and you re-download the whole catalog).

So, the idea is to add a minimum delay (see my PR here for instance in consul-template: hashicorp/consul-template#1066 )

The problem is that we expected such behavior :

/catalog/services/redis index=2
/catalog/services/mysql index = 3

Someone watches /services/redis with X-Consul-Index = 2 => wait until /services/redis is updated... fine

But if in the meantime, a new service is updated, for instance a new service my_new_service, we have:

/catalog/services/redis index=2
/catalog/services/mysql index = 3
/catalog/services/my_new_service index = 4

Then, at that time, if a watch is performed on /catalog/services/redis with index=2, then the data is immediatly returned (while the first watcher is still waiting for /catalog/services/redis)

This is a bit a problem for us because:

  • watching with the same parameters does not lead to the same result whenever a watch has been sent to 't' to 't+epsilon'
  • we cannot implement rate limit effectively

It also means that on a cluster with enough updates/secs, all clients watching are contiuously downloading the watch... => Problem for us.

Here is a full test for you you can simulate ton your laptop:

Reproduce the issue locally

Launch Consul agent in dev mode:

consul agent -dev

Register a service named redis:

curl --request PUT -d '{"name": "redis", "port":8000}' localhost:8500/v1/agent/service/register

Watch the changes with the script, lets call this output OUTPUT1

export CONSUL_SLEEP_DELAY=1; localhost:8500/v1/catalog/service/redis
X-Consul-Index: 2277 at Thu 15 Feb 2018 16:43:03 CET

In another window, launch this script:

while true; do curl --request PUT -d '{"name": "hello-'"$(date +%s)"'", "port":8000}' localhost:8500/v1/agent/service/register; sleep 1; done

=> The output of OUTPUT1 is not changing at all => Exactly what we expect

Ok, now, launch in parallel the same exact script as OUTPUT1 in a new window, but lets call this OUTPUT2:

Watch the changes with the script, lets call this output OUTPUT2 (OUTPUT1 is still running)

here is the output:

export CONSUL_SLEEP_DELAY=1; localhost:8500/v1/catalog/service/redis
X-Consul-Index: 2351 at Thu 15 Feb 2018 16:44:50 CET diff: No Differences found in service
X-Consul-Index: 2352 at Thu 15 Feb 2018 16:44:51 CET diff: No Differences found in service
X-Consul-Index: 2353 at Thu 15 Feb 2018 16:44:52 CET diff: No Differences found in service
X-Consul-Index: 2354 at Thu 15 Feb 2018 16:44:53 CET diff: No Differences found in service

=> oh no ! while we have the same parameters as output1, we are continuously downloading content


So it means that:

The same exact program does not output the same data depending whether it is running while the cluster is loaded or not.

It is not possible to efficiently watch changes in a large Consul Cluster.

That's a big deal for us.

h1. How to fix this Consul behavior ?

To me, what should be the behavior:

for each endpoint, ie: /services/my_service -> Consul store the Index of last update, let's call it my_service_index

When someone is watching /services/my_service if X-Consul-Index provided < my_service_index return immediately, if not watch as usual.


This comment has been minimized.


pierresouchay commented Feb 15, 2018

@banks I think it might make sense to re-open this issue


This comment has been minimized.


pierresouchay commented Feb 16, 2018

@banks Do you think we might re-open this ?
This is a real issue on large scale clusters


This comment has been minimized.


banks commented Feb 16, 2018

Your suggestion could work but it to do it robustly amounts to a pretty major rearchitecture of how watches work I think - for one thing it would make the HTTP API stateful or require explicitly registering "persistent" watches so Consul can separately log changes that affect only the subject of that watch.

I'm not familiar with the other issue you linked but I'll read through properly and talk to the team. It maybe as @slackpad said in the other issue that there are a few people who have run into these effects and we should consider this case and the options open (even if they are just to document this behaviour better for now). I'll reopen or merge as appropriate.

That said it seems to me that the sleep between blocking calls is the main cause of issue for you. With no sleep you might still get unlucky due to concurrent updates within the RTT of the previous call but unless you are doing hundreds of updates a second or have extremely poor latencies most of the time you will get efficient blocking behaviour you want. Note that once a blocking query is made with a non-stale index, it won't return on any other write except for one that actually touches the entity you are watching. But if you sleep externally at a time longer than you average update rate, you never give your blocking client the chance to "catch up".

I think you mention "rate limiting" is the reason you want to sleep but it seems to fundamentally be at odds with how stateless long polls work. Is there a real issue with not sleeping between blocking calls?

Thanks for you help and patience :)


This comment has been minimized.


banks commented Feb 16, 2018

I re-read and looked at your consul-template PR and I think I understand now.

It seems the rate limiting is needed because you have some endpoints that legitimately change really often (like watching for all service definitions with lots of service churn) and downloading the full response in a busy loop is too expensive.

In that case then yes HTTP long-poll simply can't be an efficient solution and your best option is to explicitly switch to regular polls that don't block (which work fine and let you trade staleness and resource usage). Even if you pass wait and index this is exactly the behaviour you get anyway since they will always be stale and return right away.

Where I'm not so clear is why you still want to use blocking queries and sleeps for endpoints that don't change very often. Could you elaborate on that part please?

I'm trying to understand your use-case fully because I fear the "real" solution is pretty involved :)


This comment has been minimized.


pierresouchay commented Feb 16, 2018

Hello Paul, thank you very much for your response

Basically, I am not convinced the sleep is the reason, the reason is simply that as long as X-Consul-Index changes fast enough, a watcher cannot submit its index before it changes, and so the watch won't be effective and return immediately data. So basically, this would happen on a non-sleeping client slow enough (or not using the local consul agent but using a remote one with enough latency)

On a API side, all people have been talking to (several of our engineers did implement watches in various languages) do interpret the X-Consul-Index as: the last time I have seen this resource, its version was with this version number, notify me when the version on this endpoint do change. Which looks like a reasonable contract.

But: as my tests shows it, the contact is not the following, the contract is:

  • if you have the last version of X-Consul-Index (a global index for the whole cluster), then, you can wait for your changes, otherwise, Consul does return the value immediately.
  • but there is no real reason for the client to being able to reach the Consul Agent fast enough for this condition to be true

Why all of this is a concern?

For several endpoints, X-Consul-Index is updated very often: /v1/catalog/services for instance, but also any endpoint not existing, for instance see #3712 ) in our large cluster is updated every second (or around that). When watching this endpoint, we then spend our time doing requests to Consul Agent which forwards it to Consul Server => on our clusters several hundreds of MBits/sec between the Consul Agent and the Consul Server for a single client watching for changes (see hashicorp/consul-template#1065 and the corresponding PR: hashicorp/consul-template#1066 ).

So, basically, the idea is: the client does not care to have a change every few milliseconds, we want changes in a reasonable time, for instance 5 seconds, so, no need to re-watch a endpoint as soon as we received the data. Let's wait 5 seconds, and re-watch again. BUT the behavior of Consul as described in this bug defeats it, since data will be returned immediately. It means we are basically not doing watches, be are doing active polling (thus X-Consul-Index has no interest at all for large clusters).

In order to build libraries using those mechanisms, you want as a developer some reasonable expectations:

  • as a developer am protecting the cluster, so to avoid issues, I want to avoid DOSing the cluster, so I add a guard which means that if I am doing too many requests, I am gonna wait for a few time and get the result. But since I am watching, I want the result immediately only if I really need to, otherwise I am doing active polling
  • I don't want to handle specific cases for /v1/catalog/services and /v1/health/some-unexisting-service (see #3712), I want the same mechanics for all endpoints

This can not be achieved with the current Consul implementation easily, thus, I would suggest to do the following:

  • for each endpoint being watched on a Consul Agent, when a update is being sent, store the index associated with this endpoint
  • when a watch is performed on the endpoint, compare the provided X-Consul-Index sent by the client, if the value of the index is lower than the last update (or last update is not set), do send the update immediately, otherwise, wait for the next update.

It would fix efficiently the issue and be a clearer contract, it would fix lots of bug and would avoid DOSing the large Consul Cluster.

Just for some experience. In our company, due to this kind of issues, we have been DOSed by:

All these apps have been written by reasonable, clever and different people, so the current implementation of Consul does not look satisfying

I am OK to try implementing this if you want.



This comment has been minimized.


pierresouchay commented Feb 17, 2018

Hello @banks ,

First POC

I've been working on a patch with very good results. Basically, when retrieving the index of changes of a service, the current implementation does this in catalog.go:

idx := maxIndexTxn(tx, "nodes", "services")

Which basically retrieves the max consul index for both nodes and services.

I have been doing a patch that computes the maximum index of services when retrieving the list of services by patching the line 777 by the following:

        idx := 0
	for service := services.Next(); service != nil; service = services.Next() {
		serviceNode := service.(*structs.ServiceNode)
		if serviceNode.ModifyIndex > idx {
			idx = serviceNode.ModifyIndex
		results = append(results, serviceNode)
        if idx < 1 {
		idx = maxIndexTxn(tx, "nodes", "services")
	return idx, results, nil

This works quite well (I don't have the issue anymore), it suffers from a drawback however: since I am working on existing services to compute the X-Consul-Index, there might be a race condition when removing services.

Consider this example:

Max Index := 79 (aka global X-Consul-Index is 79)
/v1/catalog/services/service1 [(instance1 v=12), (instance2 v=43), (instance3 v=50)] => v=50

A user retrieve the result and the X-Consul-Index (whenever we return 50 or 79 in HTTP headers does not change anything)

At index=80, instance2 is removed

At Index=85 the user come back with its value X-Consul-Index = 79

Computation of Index is now:
Max Index := 85
/v1/catalog/services/service1 [(instance1 v=12), (instance3 v=50)] => v=50

So the user will block and won't be notified of the deletion of instance2, so, my approach does not work for removal of services, this is a dead end.

Possible solution

So, what I propose is the following:

  1. When creating a service, we also create a new index called service.<serviceName> in the store
  2. everytime data related to this service is updated (healthchecks, instance added, instance removed, node metadata of a node having this service...) => we increase the idx for service.<serviceName>
  3. when a service is removed IF the service has no more instance, we removed the index service.<serviceName>
  4. When querying health, catalog or anything related to a service, we compute the index the following: If index service.<serviceName>, use it as current X-Consul-Index for this service
    Otherwise, return the same as today (aka maxIndexTxn(tx, "nodes", "services"))

I think such a patch is not that much intrusive and will really help all tools performing watches.

If I submit such patch, would you consider its inclusion in upstream?

@banks banks reopened this Feb 19, 2018


This comment has been minimized.


banks commented Feb 19, 2018

Hi Pierre - thanks for the follow up.

I've been trying to fully understand the issue here. From you're last message I think you fully understanding this, but I'm going to spell out the subtleties and my own bad assumptions here for others who are following too.

  1. X-Consul-Index is NOT the global raft index for the whole cluster. (My mistake from earlier). As the docs say:

    This is a unique identifier representing the current state of the requested resource.
    (Emphasis mine)

  2. X-Consul-Index is the same thing as ModifyIndex in the response for many cases (e.g. get a key-value or single service. and represents the index at which that resource last changed.)
  3. The issue you are having is specific to service registrations. I've not checked every other API so it might not be limited to that but it doesn't for example affect simple KV watches, node fetches of a single node etc.
  4. For service watches, X-Consul-Index reflects the raft index at which the global service/node index was updated.
    // Get the table index.
    idx := maxIndexTxn(tx, "nodes", "services")
    . This is "correct" because that catalog is the data structure all services modify on register so any change to it might potentially affect the result of a service endpoint response. It's not optimal because any change to the nodes registered for any service (which will alter that table) will cause new blocking queries for any specific service to be considered "stale". (existing queries already "blocking" won't spuriously return as they are watching the actual in-memory state only new ones coming in with a now-stale index param).

So for vast majority of users, blocking semantics work perfectly well and actually match the docs and your original expectations for the index parameter's semantics.

The issue you have is a combination of factors:

  1. Very high churn, specifically on service registration/deregistration on order of one every few seconds
  2. Desire to limit bandwidth by waiting between blocking queries which never allows catch up to the rapidly changing index. (If you don't wait even with your high church blocking works almost all of the time although with some pretty extreme combination of high local DC network latency and service registration rate it might never catch up).
  3. The limitation in the current implementation that we check the index of the entire index of service nodes not just the one specific service when looking at an individual service.

Possible Work-arounds

I'm sure you already looked into this thoroughly but as far as I can see it still works fine currently even at your registration rate, if you don't sleep between blocking calls.

I understand for results with a large payload and regular change (e.g. all services in DC) you might want to limit bandwidth with sleep. It seems that is easy to do buy switching to short polls and a delay of your choice though.

So a solution with no changes:

  1. For cases where the payload is large and genuinely changes often (e.g. /v1/catalog/services, don't try to block at all - just hit it every X seconds and pick X that matches your desired balance of bandwidth and staleness.

  2. For cases where you want updates for a single service in face of other services churning, Keep using blocking API without any delay. This should still work and be essentially optimal unless you have extremely high DC network latency or hundreds of (de-)registrations per second - neither of those conditions are really within the current design boundaries of Consul anyway as far as I can see.

Is there some specific reason that you can't take this approach? I realise it would be nice if you never had to think about this and blocking just worked for you every time, but I'm trying to work out if that's a "nice to have" of if there is some fundamental limitation that means you can't take this approach and really do need to have single-service watches that sleep between calls?

Your Solution

Your proposed solution is reasonable although it's not a totally trivial change.

I initially assumed there was a simpler option just like you suggested but as you found de-registrations would not be caught.

I think such a patch is not that much intrusive and will really help all tools performing watches.

I don't think it's super intrusive but it's not a no-brainer either.

As far as I can see the vast majority of users don't run into this problem for the reasons I mentioned so while it would be a good addition I'm not sure if it would be immediately useful to a lot of Consul users.

If you think it's important enough to contribute despite the workarounds I mentioned (or because they really won't work for you for some reason) then I think your approach sounds reasonable :)

Thanks for all your effort chasing this down. I've changed the issue title to reflect the scope/feature request a little more specifically.

@banks banks changed the title from weird behaviour of X-Consul-Index when waiting more than 3 seconds between blocking calls to X-Consul-Index reflects global service churn when watching a single service causing issues with sleeping between blocking calls Feb 19, 2018

@banks banks added the enhancement label Feb 19, 2018

@banks banks added this to the Unplanned milestone Feb 19, 2018


This comment has been minimized.


pierresouchay commented Feb 19, 2018

@banks Thank you very much for your comments as well as your explanations.

Actually, we already have the issues without sleeping. On some of our code, the watchers are performing operations on the services, since we have some service with many many machines, the time needed to process the changes + the number of changes per second (around 3 X-Consul-Index per seconds on the biggest cluster) do create this issue (while mitigated by the patch on my colleague #3642 ): some of our apps spent their time garbage collecting since those apps are downloading around 50Mbyte/sec on a single service.

I suspect it will improve a lot the performance on tools such as consul-template as well (we have around 1.3k different kind of services with around 5k machines per cluster), I'll try to give you measurements soon if I can.

Thank you



This comment has been minimized.


banks commented Mar 2, 2018

Thanks @pierresouchay for your work getting this improvement in!


This comment has been minimized.


pierresouchay commented Mar 7, 2018

@banks For the record, here is the effect of the patch on our cluster: this graph shows the number of blocking queries the servers are receiving per second.

  • red server: consul server follower patched
  • blue server: consul server follower not patched

blue and red servers have been restarted around the same time.

The inflexion point of the red line is when we applied the patch on one of our 3 servers.

screen shot 2018-03-07 at 18 33 30

On network bandwidth side, a traffic went from an average of 720Mbps (peaks around 1.5 Gpbs) to 361Mbps (peak around 450Mbps) ...

screen shot 2018-03-07 at 18 44 07

This patch really changes the performance on our side. Thank you for Merging it!

kamaradclimber pushed a commit to criteo/consul-templaterb that referenced this issue Mar 12, 2018

Tune retry on non-diff to 3 seconds by default.
Linked to hashicorp/consul#3890

Having duration of more than 3 seconds might actually increase the

Change-Id: Iff9104af5f1ce52cadc449a10de657d5c1b2a968

@pearkes pearkes modified the milestones: Unplanned, 1.0.7 Mar 13, 2018


This comment has been minimized.

takeda commented Mar 14, 2018

Just a note, this change probably should be labeled as a breaking change in the changelog. There's a possibility that code might rely on the old behavior, to reduce number of open connections to the consul agent (e.g. when monitoring 10 services it opens a single connection and when X-Consul-Index changes it will check status of other 9 services)


This comment has been minimized.


pierresouchay commented Mar 14, 2018

@takeda this kind of usage is unlikely: when timing was right, the call was still blocking (see the description here: #3890 (comment)), so having this pattern did need: a very hi write rate in cluster AND to delay a call. But you had no confidence it would block or not. It means that for small clusters it does not change anything, for large, it might, but very hard to get it right, so unlikely someone uses this behavior for its own good.

What we are seeing in our clusters is a huge increase of apps waiting for watches and a very significant decrease of load, CPU and bandwidth on all Consul agent/servers


This comment has been minimized.


banks commented Mar 14, 2018

@pierresouchay thanks for the follow-up info - really good to see.

@takeda we are not considering it a "breaking change" which would imply being held until a major version release. You are right there are potential edge-case clients that might rely on undocumented side-effects of the old implementation and we've been careful to word that adequately in the upgrade guide as a clear notice. Hopefully that is adequate.

See the changes in PR:

I've also made the CHANGELOG entry more explicit (good catch thanks!).


This comment has been minimized.

takeda commented Mar 14, 2018

The reason why I mentioned is is that my code will be broken by this change, and if I did it, chances are that other people did it too.

I personally think this type of change shouldn't be included when only patchlevel number is increased (i.e. the last number of the version). If you have to include upgrade instructions like these: that's a sign that the change should be included at least with the minor version increase (i.e. the second digit). Most projects follow this convention.

Having said that I already know that I need to make changes in my code before upgrading to 1.0.7 so will be careful around that version, thanks also adding the extra note.


This comment has been minimized.


pierresouchay commented Mar 14, 2018

Oh, ok, sorry about that :(

Just curious, how did it break the code? What is the assumption broken?


This comment has been minimized.

takeda commented Mar 14, 2018

The assumption that X-Consul-Index is global for all health checks, undocumented but still.

Basically your issue is that with a large cluster servers are bombarded with requests, so this was a measure to mitigate that. Since previously all health checks were sharing the same index my code was just tried to minimize requests.

If there are 10 services to check, the code only was checking state of the first service, if the version was changed it then queried other 9 services for changes.

I think overall this change is for the better, but I wouldn't call it a small changes that warrants to be in 1.0.7 unless there wont be 1.0.7 and 1.1.0 will be released instead.

To be frank I really wish Consul would come with a similar protocol to Zookeeper, where clients were using long standing single connections. And upon connecting they just specify all paths that they watch and ZK would only send notification about what changed.

No need to make multiple connections, reconnecting after wait timer expires and no X-Consul-Index to track.


This comment has been minimized.


banks commented Mar 14, 2018

Thanks @takeda,

It's a fine line to walk - we originally were extremely cautious about this patch for reasons like this and requested those upgrade docs even though it wasn't changing a documented guarantee of the API. We also verified that official tools will all be OK with it.

But we'll review again as we decide on 1.0.7 cut if the risk seems too great. The fact that it will effect at least one person who managed to spot it in time is interesting.


This comment has been minimized.


pierresouchay commented Mar 19, 2018

@banks after a weeks of testing on all of our clusters, we found a little bug in the implementation (When deleting healthchecks on a node): #3970

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment