Reloading Consul re-runs all watch commands every time. #571

Open
darron opened this Issue Jan 2, 2015 · 36 comments

@darron
Contributor
darron commented Jan 2, 2015

I'm reloading Consul to load in new config files as the containers are built - but it appears as if it's re-running each defined watch command every time Consul is reloaded.

Note all the newly created Docker containers that are being stopped and started (from the watch command):

http://shared.froese.org/2014/Screen_Shot_2015-01-02_at_4.34.32_PM.png

Example watch command is here:

https://gist.github.com/darron/481604459ccfde4d401a

Is this expected behavior? I had thought that if the config had changed, it should obviously reload and probably re-run, but this is surprising to me.

I have a few watch commands:

https://gist.github.com/darron/38af49ad1352a913d360

Looking through the API, I don't see another way to register a watch command.

In the end, I was trying to launch approximately 50 containers; 4 hours later it was still going, launching and re-launching 2300+ containers and counting:

http://shared.froese.org/2014/v87sd-17-50.jpg

Am I "doing it wrong"?

@darron
Contributor
darron commented Jan 3, 2015

A watch command seems to be run:

  1. When a config file is loaded.
  2. When Consul is HUP'ed and it rechecks them all.

I don't think those commands should be run unless the watch is actually "triggered" - but that's what appears to be the behavior.

@darron
Contributor
darron commented Jan 4, 2015

Just found issue #342, which describes the fire-on-create behavior.

@armon
Member
armon commented Jan 5, 2015

The fire-on-create one is a weird semantic thing. To me it seemed that most use cases would need the initial data plus any deltas (e.g. initial HAProxy config, plus updates), so we always fire on the first run. I guess there are cases where you may not care. I want to make that a flag on the watchers.

The re-fire is just caused by us doing the dumbest reload possible, e.g. drop everything and rebuild everything, instead of complex change-detection logic (was the watcher added/removed/modified?).

@sean-
Contributor
sean- commented Jan 5, 2015

Assuming that the executable called by the watch is capable of making an idempotent change seems reasonable (or at the very worst, an identically generated config file + 1x SIGHUP should not be problematic in the common case). FWIW, we're designing the tools called by consul-template around the assumption that watches will fire haphazardly for many different and unknown reasons.
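For illustration, a minimal sketch of that idempotency pattern in Go, assuming a hypothetical handler that regenerates an HAProxy config and only signals the service when the bytes actually changed (the paths and reload command are placeholders):

```go
package main

import (
	"bytes"
	"io/ioutil"
	"log"
	"os/exec"
)

const confPath = "/etc/haproxy/haproxy.cfg" // hypothetical target

// renderConfig stands in for templating a config from the watch payload.
func renderConfig() []byte {
	return []byte("# generated haproxy.cfg\n")
}

func main() {
	newConf := renderConfig()

	// If the regenerated config is byte-identical, a spurious watch
	// fire (e.g. from a Consul reload) becomes a no-op.
	oldConf, _ := ioutil.ReadFile(confPath)
	if bytes.Equal(oldConf, newConf) {
		return
	}

	if err := ioutil.WriteFile(confPath, newConf, 0644); err != nil {
		log.Fatal(err)
	}
	// One SIGHUP, and only when something really changed.
	if err := exec.Command("pkill", "-HUP", "haproxy").Run(); err != nil {
		log.Fatal(err)
	}
}
```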

@darron
Contributor
darron commented Jan 5, 2015

All of the commands I have been using are idempotent; the problem is that as you add watches, they get run and re-run, over and over.

If I have 50 containers with 2 watches per container (service and KV) then if ANY container is added or removed, all 50 stop and start each time there's a change to any Consul config file.

I have looked at the shell environment and the JSON that's passed to the handler, and there's no obvious difference that I can see, so there's no real way outside of Consul to know whether anything changed.

I was attempting to start an AWS box that very comfortably runs 50+ small web containers. I was never able to finish and killed the box after 2300 container restarts; it could never catch up, because by the time one container had been loaded, the next one had finished.

darron added a commit to octohost/octohost that referenced this issue Jan 6, 2015: "Disabling due to hashicorp/consul#571" (fd11592)
@armon
Member
armon commented Jan 6, 2015

@darron oh are you saying that the watchers compound over time? e.g. a new set of 50 watchers is added on each reload?

@darron
Contributor
darron commented Jan 6, 2015

Here's how it is working with the current state of Consul:

  1. There's already 49 containers with a kv and service watch each.
  2. Add a new site to Docker.
  3. Add a KV watch. (Hup Consul - as a result, all 98 watches fire again - which start to reload 49 containers and rebuild 49 nginx config files).
  4. Add a service watch. (Hup Consul - as a result, all 99 watches fire again - which start to reload 50 containers and rebuild 49 nginx config files).

So approximately 200 watch events fire because Consul is HUPed, not because anything actually changed.

I could cut the reloads by 50% by only reloading Consul once, but that didn't really help much; containers that didn't need to be restarted were still being restarted over and over.

I ended up disabling the KV container restart watch - the first one described here:

http://blog.froese.org/2014/12/30/how-octohost-uses-consul-watches/

It's just not lightweight enough to justify including, given the way Consul fires them each time.

@dgshep
dgshep commented May 27, 2015

Hey guys. I also ran into this issue, and there are some potentially nasty behaviors if any of the handlers invoke consul reload, either directly or indirectly. Essentially the watch will continue to re-fire until all file descriptors are exhausted:

2015/05/27 18:55:16 [ERR] agent: Failed to invoke watch handler 'consul reload': fork/exec /bin/sh: too many open files

Here is the watch config:

{
  "watches": [
    {
      "handler": "consul reload",
      "type": "event",
      "name": "break-consul"
    }
  ]
}

This was with Consul 0.5.0.

@jsok
jsok commented Jul 2, 2015

I think a possible solution is to use the LTime field in the events and store it in a file each time your handler runs. That way, if you receive the same event with an LTime less than or equal to the one stored in the file, you can ignore it (see the sketch below).

Can anyone confirm that LTime is reliable for this?
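For what it's worth, here is a minimal sketch of that bookkeeping, assuming a Go handler reading the event array Consul pipes to stdin; the state-file path is a hypothetical placeholder:

```go
package main

import (
	"encoding/json"
	"io/ioutil"
	"log"
	"os"
	"strconv"
)

// Consul pipes the event array to the handler's stdin.
type event struct {
	Name  string `json:"Name"`
	LTime uint64 `json:"LTime"`
}

const stateFile = "/var/run/watch-ltime" // hypothetical state file

func main() {
	var events []event
	if err := json.NewDecoder(os.Stdin).Decode(&events); err != nil {
		log.Fatal(err)
	}

	// Load the last LTime this handler acted on, defaulting to zero.
	var last uint64
	if b, err := ioutil.ReadFile(stateFile); err == nil {
		last, _ = strconv.ParseUint(string(b), 10, 64)
	}

	for _, e := range events {
		if e.LTime <= last {
			continue // already handled, or a reload replay
		}
		log.Printf("handling %s (LTime %d)", e.Name, e.LTime)
		// ... do the real work here ...
		last = e.LTime
	}

	if err := ioutil.WriteFile(stateFile, []byte(strconv.FormatUint(last, 10)), 0644); err != nil {
		log.Fatal(err)
	}
}
```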

@ryanuber
Member

@jsok event watches are a bit of a snowflake, as they are powered by the gossip layer. LTime should be safe to use in the way you describe for event watches since it is a monotonic value, but it will not apply to other watch types and will not be part of the payload returned for those watches.

@joelmoss

Ok, so I find this all very weird...

Triggering an event that is being watched gives me the following output from the watch (a single event):

[{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

But reloading with consul reload also triggers this event (as this issue confirms); however, the output from the watch is a list of all past events:

[{"ID":"95a7f085-ac1d-d916-6ba4-749e8d102a5e","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":5},{"ID":"10681ffe-b7b6-a0a0-d3a3-fa802d997258","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":7},{"ID":"5b373801-54f2-2bfa-bcb0-a4bcd3aacacf","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":8},{"ID":"afe6d0d3-c8f7-bea6-0311-2b3672cdb3fd","Name":"testevent","Payload":null,"NodeFilter":"","ServiceFilter":"","TagFilter":"","Version":1,"LTime":9}]

Why is that? It makes very little sense to me.

@joelmoss

@ryanuber what is LTime?

@jsok
jsok commented Jul 29, 2015

LTime is Serf's implementation of a Lamport clock. The docs go into more depth: https://www.serfdom.io/docs/internals/gossip.html

I believe you get all the events on reload because the other nodes do not know which gossiped event your local agent last received, so to synchronise your local agent they re-send all previous events.

@joelmoss

OK, but right now I have only one node for testing.

@BjRo
BjRo commented Aug 10, 2015

Can I work around this by checking and remembering the CONSUL_INDEX env var in my handler (if I have a watch of type key or keyprefix)?
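A rough sketch of that check, assuming the agent exports CONSUL_INDEX to the handler for key/keyprefix watches; the state-file path is a hypothetical placeholder:

```go
package main

import (
	"io/ioutil"
	"log"
	"os"
)

const indexFile = "/var/run/watch-index" // hypothetical state file

func main() {
	idx := os.Getenv("CONSUL_INDEX")

	// If the index is identical to the last invocation, nothing in the
	// watched key/keyprefix actually changed; skip the real work.
	if prev, err := ioutil.ReadFile(indexFile); err == nil && string(prev) == idx {
		return
	}

	log.Printf("index changed to %s, running handler", idx)
	// ... real handler work goes here ...

	if err := ioutil.WriteFile(indexFile, []byte(idx), 0644); err != nil {
		log.Fatal(err)
	}
}
```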

@BjRo
BjRo commented Aug 10, 2015

@armon Do you still plan to add the flag to the watches (to turn off fire-on-create)? If so, any plans when this is going to be available?

@armon
Member
armon commented Aug 11, 2015

@BjRo We are tracking it here, but no concrete plans. Lots of more pressing things.

@ssenaria

I think I ran into this issue as well. This is stopping us from adopting Consul since it would fire a deploy if we have to restart Consul.

@ssenaria

Any updates on this?

@pl1ght
pl1ght commented Sep 29, 2015

This is critical for us as part of our one-to-many plans. I can't have my events all firing any time there is an edit or reload. I'd consider this a pressing issue.

@vkhatri
vkhatri commented Jan 2, 2016

I am also planning to use Consul watch event triggers, whether for a deployment or for an HTTP call fired by a monitoring system to fix a check (e.g. restart ntpd if an NTP peer check fails), along with a bunch of other event handlers. This is still a blocker for me, and I am not happy with the workaround: I had to put all the commands in a script handler instead, which verifies the payload and only restarts the service if the payload matches (see the sketch below).

https://gist.github.com/vkhatri/1c3d9b287338ed0288c0
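A hypothetical sketch of that payload check (the payload string and service name are placeholders; on its own it filters only by payload, so pairing it with LTime bookkeeping like the sketch earlier in this thread avoids replays of the same payload):

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"os/exec"
)

// Payloads arrive base64-encoded in the event JSON; encoding/json
// decodes them automatically for []byte fields.
type event struct {
	Name    string `json:"Name"`
	Payload []byte `json:"Payload"`
}

func main() {
	var events []event
	if err := json.NewDecoder(os.Stdin).Decode(&events); err != nil {
		log.Fatal(err)
	}
	for _, e := range events {
		// Only act on the payload this handler owns; anything else,
		// including replays with other payloads, is ignored.
		if string(e.Payload) != "restart-ntpd" { // hypothetical payload
			continue
		}
		if err := exec.Command("service", "ntpd", "restart").Run(); err != nil {
			log.Printf("ntpd restart failed: %v", err)
		}
	}
}
```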

It would be great to have a configuration parameter to prevent event triggers on service start/reload.

@darron
Contributor
darron commented Jan 2, 2016

Because I was tired of launching processes with scripts that kept duplicating functionality, I built a small Go-based tool to help with this limitation:

https://github.com/darron/sifter

It handles event and key watches, and doesn't allow the watch to fire if:

  1. Event: it has already seen that LTime value.
  2. Key: the hash of the payload is the same as before (sketched below).

We've been running it in production to protect event watches for a few weeks now; key watches are less tested but have worked in my local testing.
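For the key-watch case, the guard boils down to a payload-hash compare; here is a rough sketch under assumed paths (sifter itself wraps the real command rather than replacing your handler):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io/ioutil"
	"log"
	"os"
)

const hashFile = "/var/run/watch-payload-sha" // hypothetical state file

func main() {
	payload, err := ioutil.ReadAll(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	h := sha256.Sum256(payload)
	sum := hex.EncodeToString(h[:])

	// An identical payload to the previous run means a re-fired watch
	// with no real change: do nothing.
	if prev, err := ioutil.ReadFile(hashFile); err == nil && string(prev) == sum {
		return
	}

	log.Printf("payload changed (sha256 %s...), running wrapped command", sum[:12])
	// ... exec the wrapped command here ...

	if err := ioutil.WriteFile(hashFile, []byte(sum), 0644); err != nil {
		log.Fatal(err)
	}
}
```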

@sc0rp10
sc0rp10 commented Mar 1, 2016

+1
We also need a way to disable "fire-on-create".

@kaelumania

We have the same issues with event watches. We want to use events to manually initialize containers or schedule tasks (like deploys), but the watches get executed at startup, which is not our desired behaviour. This is also a serious blocker for us in using Consul. We are thinking about using consul exec instead, but that has serious security issues, allowing arbitrary commands to be executed. +1

@ssenaria
ssenaria commented Apr 7, 2016

Any ETA on when a fix or enhancement will be scheduled?

@bfgoodrich

Also wondering if there is any update on this; I'm running into the exact same issue.

@zpeterg
zpeterg commented May 24, 2016

I also am running into this problem and would appreciate a solution.

@jippi
Contributor
jippi commented May 25, 2016

👍

@ssenaria

I know this isn't a fix but we've been using Sifter and it's been great.

@davidneudorfer

@ssenaria can you expand on how you use Sifter?

@highlyunavailable
Contributor

https://github.com/darron/sifter looks pretty straightforward: it just wraps your commands and does some pre-checking.

@bfgoodrich

Has this been looked at recently? This is still a huge pain, and I would rather not have to keep a history of watches that have already been run.

@alisade
alisade commented Aug 29, 2016

This is so needed. We are saving the ModifyIndex inside Consul itself and checking against it, and only if we see a change do we actually run the handler.

@taharah
taharah commented Dec 27, 2016

👍 to getting this looked at...

@gamefundas

Consul 0.7.2 and still waiting...

@ccomb
ccomb commented Feb 1, 2017

I'm currently learning Consul, and while thinking about the question, I believe it is the event handler's job to remember which events have actually been processed. I actually find it rather safe that all the recent events are fired again; it makes it possible to be sure the handler doesn't miss an event.
