
Persistent cluster membership #61

Closed
derpston opened this issue Nov 10, 2013 · 17 comments
@derpston

I was surprised by the agent's behaviour when I sent it a SIGINT - it gracefully leaves the cluster by notifying other nodes before it shuts down. When the agent is restarted, it is no longer a member of the cluster. It doesn't attempt to rejoin, and the other nodes don't attempt to contact it to tell it to rejoin.

When the agent is killed with a SIGTERM or SIGKILL, it makes no attempt to leave the cluster gracefully. Other agents eventually notice that it disappeared, and make regular attempts to contact it. When the agent comes back up, other agents will tell it to rejoin the cluster, and it does.

I found this behaviour surprising because I expected cluster persistence to be less fragile. If I issue a serf join foo.example.com, I expect that this action won't be undone unless I later issue a serf leave on that agent, or a force-leave on another agent.

So, I propose:

  • Having serf agents not attempt to leave the cluster when given a SIGINT.
  • Adding a serf leave command, to complement force-leave, that performs an orderly leave of the local agent from the cluster.
  • Having serf agents persist enough cluster state locally (/var/lib/serf/ring?) to be able to bootstrap that node back into the cluster immediately on startup, even if many other agents are down.

This would have some benefits:

  • Better operational predictability, no side-effects.
  • Faster agent recovery time - not waiting for other agents to attempt a reconnect.
  • Resilience against whole-cluster state loss, such as during a power failure that affects a cluster whose agents are all in the same rack.

This is basically the same ring/cluster persistence model used by Riak, I believe. I've used Riak in production for over a year and I've grown to love the resilience. Nodes go up, nodes go down, and the operator never has to do a thing to maintain cluster state.

In terms of implementation, we have a few options:

  1. The operator could store a member list in the config.
  2. Serf could periodically (or more likely, on change events) write a cluster state file to local disk.
  3. An init script could, before shutting down serf, query the member list, store it locally, and write this into the config so it is ready for the next startup.

Of these options, I think (1) is poor because it requires the operator to stay on top of cluster membership and write it into every agent's config file. This seems like error-prone busywork to me.

Option (3) feels fragile to me - it seems like a hack. I feel like cluster membership persistence should be a first-class feature, so I would be in favour of option (2) and having serf do this by default.
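For illustration, a minimal sketch of what option (2) might look like: on a membership change event, atomically rewrite a small state file with the current member addresses. The file path, helper name, and one-address-per-line format are assumptions for the sake of the example, not Serf's actual behaviour:

```go
package main

import (
	"os"
	"strings"
)

// persistMembers atomically rewrites the local cluster-state file with the
// current member addresses, one per line. Writing to a temp file and then
// renaming means a crash mid-write never leaves a truncated file behind.
func persistMembers(path string, addrs []string) error {
	tmp := path + ".tmp"
	data := strings.Join(addrs, "\n") + "\n"
	if err := os.WriteFile(tmp, []byte(data), 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	// Hypothetical call site: this would run from a member-change handler.
	_ = persistMembers("/var/lib/serf/ring",
		[]string{"10.0.1.10:7946", "10.0.1.11:7946"})
}
```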

It was suggested that this could be implemented as a plugin for use with the eventual plugin system. This feels like just a slightly tidier version of option (3), so I'm not keen on it.

While I'm obviously not in a position to dictate project goals, I feel like serf (and every other piece of software that might run in production) should strive for:

  • Robustness in the face of broad network, system, and power failures.
  • Operational predictability - don't surprise the operator, don't have side effects.
  • Minimal configuration.

I think Riak's cluster membership model is perfect in this regard, and it should be a model for distributed system membership generally.

Opinions, anyone? :)

Thanks for reading!

@armon
Member

armon commented Nov 10, 2013

I want to separate the behavior of Serf from the fragility of the current model.

Currently, sending a SIGINT to Serf is logically the same as what a serf leave command would do. It may be unexpected that a SIGINT does a graceful leave. We can definitely add a serf leave command, as well as configuration to change the signal handling of INT/TERM. There is a good opportunity for a discussion here on what the most expected behavior is, to clarify what Serf will do.

In terms of fragility, if a node does a graceful leave, it would be very strange in my mind if it automatically rejoined the cluster later. Under a failure condition, the node will rejoin the cluster after a few seconds. I think this behavior is nice and does have the robustness you described. I agree that if we store the ring state locally, we can accelerate the bootstrap back into the cluster, but it is a very minor saving for the complexity of the on-disk state. Especially since the reconnect timer is tunable, it can be lowered as desired to get the node back into the cluster within seconds.
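For reference, a hedged sketch of tuning that reconnect timer through the serf library's config; the field names used here (ReconnectInterval, ReconnectTimeout) are my assumption about the relevant knobs and should be checked against the current API:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/serf/serf"
)

func main() {
	conf := serf.DefaultConfig()
	// Attempt to reconnect to failed nodes more frequently, so a restarted
	// agent gets pulled back into the cluster within seconds.
	conf.ReconnectInterval = 10 * time.Second
	// Keep trying a failed node for a full day before giving up on it.
	conf.ReconnectTimeout = 24 * time.Hour

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()
}
```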

@derpston
Author

I think we're pretty much on the same page. :)

I agree that a node rejoining the cluster after a graceful leave would be strange; that should never happen.

I like the current failure handling/rejoining behaviour. It could be sped up, but you're right, I wouldn't implement the local state storage just for that. Not worth it. If we already have local state, however, we may as well use it.

The real benefit of storing cluster state locally is avoiding the possibility that the entire cluster state will be lost. That could be caused by a big power outage, or perhaps more likely, something like a Puppet misconfiguration that stops every serf process, a bad package upgrade, a typo in a config file, a mistake with a remote mass-execution tool like Salt, etc.

Regardless of how it happens or how likely it is, the effect of losing the entire cluster state is quite serious - it requires the operator to manually go and rejoin each node. If this happens just once, the irritated operator will be tempted to configure the list of all nodes in the config file, which I think is completely the wrong way to solve the problem. We should avoid encouraging that kind of workaround, so if it is possible to make serf just magically handle this, it should.

Playing devil's advocate, are there cases in which persistent membership in a production environment would be undesirable?

@armon
Member

armon commented Nov 10, 2013

I see what you are saying. I agree, it would be nice to have a safeguard against total cluster loss. I don't think there is a downside to persistent membership information; it's just a balance between adding complexity to the code and keeping the user interaction sensible.

What we could do is add new -snapshot and -bootstrap flags. Serf could periodically dump known hosts to the snapshot, and if the -bootstrap option is present, it will attempt to connect to every node until one succeeds. It is a very minimal solution, but it does allow for fast local recovery as well as recovery from total cluster failure. The snapshot interval can also be exposed as a configurable value, but we will just default it to something sane, and automatically do a snapshot on SIGINT/SIGTERM.
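A minimal sketch (not Serf's implementation) of the -bootstrap behavior described above, assuming the snapshot stores one known address per line and treating the join operation as a caller-supplied function:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// bootstrapFromSnapshot tries each previously known member address from the
// snapshot file until one join succeeds.
func bootstrapFromSnapshot(path string, join func(addr string) error) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open snapshot: %w", err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		addr := scanner.Text()
		if addr == "" {
			continue
		}
		if join(addr) == nil {
			return nil // back in the cluster via one surviving peer
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return fmt.Errorf("no member in %s could be reached", path)
}

func main() {
	// Hypothetical join function standing in for the real agent API.
	err := bootstrapFromSnapshot("/var/lib/serf/snapshot", func(addr string) error {
		fmt.Println("attempting join:", addr)
		return fmt.Errorf("unreachable")
	})
	if err != nil {
		fmt.Println(err)
	}
}
```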

@armon
Member

armon commented Nov 10, 2013

Additionally, I think we can add serf leave as a command. In the configs, we can let the user tune leave_on_sigint and leave_on_sigterm to control the signal handling behavior.

I think treating SIGINT as graceful and SIGTERM as non-graceful is a sane set of defaults, but it may not be the expected behavior, and feedback in that case would be good.
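To make the proposal concrete, a sketch of what the agent's JSON config might look like with these options; the key names are the ones proposed above and may not match what ultimately ships:

```json
{
  "leave_on_sigint": true,
  "leave_on_sigterm": false
}
```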

@derpston
Author

Sounds good! If it were up to me (I don't know any golang yet...) I'd avoid exposing this to the user at all and make it automatic and unconfigurable - in the absence of any use cases where persistence is undesirable, that is. I can't think of one, but hopefully others will chime in with something I haven't thought of!

I'm having some trouble coming up with a reason why the user (other than a serf developer) would want to adjust signal handling behaviour. I also don't think it's useful/sane to do a graceful cluster leave on receiving any signal; it's a relatively rare/special case and should never be accidentally triggered. I think it ought to happen only when executing serf leave. Is that reasonable?

In general I prefer opinionated software over configurable software, and I feel like that preference could be applied here. If we don't really need those config options, we could make persistence automatic and mostly unconfigurable. (Of course, the user can always just delete the persistence file, but I think that's pretty reasonable - I imagine it would be an extreme and rare action!)

Anyway - sounds like we're agreed on the general ideas here, and thanks for being open to it. Some more points of view from other users who want to run serf in production would be appreciated. :)

@armon
Member

armon commented Nov 10, 2013

The signal handling for leave is important for integration with tools like systemd, upstart, etc. It also makes it really easy, if you are running Serf as a sub-process, to choose between killing and leaving just by sending a different signal. A SIGINT can only really be sent accidentally if you are running Serf manually on a CLI and hit Ctrl-C, which is not a sane production setup.

In terms of configuring the snapshot/bootstrap, it's hard to predict user environments and needs. For example, internally we can rely on our service discovery mechanism to do the bootstrap, and so we can avoid the node-local state. Also, having the user explicitly provide a path to the file avoids any issues with permissions or assumptions we might make as the serf developers.

I agree strongly that the software should be opinionated, but having the configuration options is great when you really need to change something to work within your environment. In the default case, as with most serf settings, everything can just be left at the defaults.

Anyways, I will split this into a few sub-tickets so it can be tracked individually. Thanks for the feedback.

@mitchellh
Contributor

I've just caught up on everything. I agree with everything said here. I'll break it down by point, each of which can be its own ticket.

  • serf leave - Agreed, let's do this.
  • snapshot - I agree with this. I also like opinionated software, but as @armon said, asking the user to provide a path eliminates a lot of edge cases. Also, implementation detail: I think a -snapshot= flag is all you need. You don't need a bootstrap flag because the bootstrap could just read in that same snapshot file that was specified (see the usage sketch after this list)... And specifically for our case we don't need this feature since, as @armon said, we use internal service discovery to auto-join.
  • Signal behavior - Configurable is good. We upgrade Serf by putting the new binary in place, sending a SIGKILL, then starting the new agent. I think most service managers like systemd and upstart maybe send a SIGINT on stop... having this configurable would allow you to keep this behavior.
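Assuming the flag lands roughly as described - a single -snapshot= option pointing at a user-chosen path, with bootstrap reading the same file back on restart - usage might look like:

```sh
serf agent -snapshot=/var/lib/serf/snapshot
```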

@thedrow
Contributor

thedrow commented Nov 12, 2013

@mitchellh Shouldn't it be --snapshot?

@mitchellh
Contributor

@thedrow The way Go's standard command-line parsing library works, it is just one dash. I think this is some oddity inherited from Plan 9. I've gotten used to it.
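For reference, a minimal example of the flag package in question; its generated help text prints the single-dash form, although the parser itself actually accepts one or two dashes interchangeably:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Both -snapshot=/path and --snapshot=/path parse identically; only
	// the help output shows the single-dash spelling.
	snapshot := flag.String("snapshot", "", "path to the snapshot file")
	flag.Parse()
	fmt.Println("snapshot path:", *snapshot)
}
```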

@thedrow
Contributor

thedrow commented Nov 12, 2013

@mitchellh But most people would find it strange, no?

@derpston
Author

Probably the wrong issue for discussing this, but I agree that single-dash arg prefixing is weird for those used to more GNU-style arg parsing.

http://superuser.com/questions/372203/whats-the-difference-between-one-dash-and-two-dashes-for-command-prompt-paramet

If possible, I'd say the double-dash format for long args is preferable, but in the context of this issue it's not relevant. I don't think the implementation language ought to "leak" like this into the operational side. New issue for GNU-style arg parsing instead of Go's standard one?

@thedrow
Contributor

thedrow commented Nov 12, 2013

@derpston +1

@mitchellh
Contributor

@derpston I have two projects with this single-dash, non-GNU format and I haven't had any real complaints. If you use double dashes then the help will be shown and you'll see. I really enjoy using the standard library's arg parsing and would rather not move away from it.

@derpston
Author

No complaint here, just a mild preference for the GNU style for consistency reasons. Not a problem.

@armon
Member

armon commented Dec 1, 2013

Closing this ticket, as I've replaced it with a few sub-tickets.

@armon closed this as completed Dec 1, 2013
@kamilion

Just stating my own preferences here; I hope this doesn't reopen the bug.

Currently I rely on the SIGINT graceful leave behavior -- I have serf deployed on machines that boot from Intel Z-130 2GB industrial USB units containing Ubuntu-derived custom LiveISOs with TORAM=Yes in the boot parameters. They come and go in varying states and when they shut down, they're gone, and so is all of their configuration.

If that physical machine comes back (via a separate IPMI management VLAN), it's welcomed as a completely new node to the cluster, as its role may have changed due to a load requirement that triggered the IPMI machine start-up sequence. If we need more databases, it will be a database for a while until the load subsides.

It may even have a new IP address due to DHCP assignment. In practice, our DHCPD tends to assign the same addresses over and over by hashing the MAC address somehow. But at least two of our machines have a collision and fight over 10.0.10.116 when they're both asked to join the database subnet.

Recently, due to the Debian decision to standardize on systemd, we've started moving from an Ubuntu base over to a Debian+systemd infrastructure that actually lets us sanely shut down an entire role and all of its processes, and then perform a REST call to see whether we're more useful immediately adopting a new role and rapidly pivoting to a completely new configuration without rebooting, or choosing to save power and just shut down.

Serf currently does its job and cares very little about who gets what role, when, or why.

That said, I'm currently taking a look at the result of #86 to see if dumping that to the USB stick on shutdown is better than our current method of joining whatever the results of a REST call on bootup say to join.

Either way - just pointing out that serf can get used in a lot of very weird ways.

@armon
Member

armon commented Feb 18, 2014

@kamilion Glad to hear it is working for your use case. We tried to make sure that the snapshots do not change the existing behavior if you choose not to use them. The snapshot feature is mostly great for guarding against agent failure and allowing agents to auto-rejoin (after a bug, power loss, etc.). But if you are running the OS in memory only, the snapshot is likely useless since there is no real "durable" state.
