Only extend cluster database when reaching three members #6230

Open
stgraber opened this issue Sep 23, 2019 · 5 comments

Comments

@stgraber
Member

commented Sep 23, 2019

As it currently stands, when clustering is brought up, the first three members all immediately become database members. This isn't ideal because many people only bring up a two-server cluster, ending up in the worst situation, where losing either server leads to a broken cluster.

I believe it would instead be preferable to not have the second member act as a database member until a third member has joined, at which point both the second and third members should be promoted to database members at the same time.

As part of this, we should also update our clustering documentation to explain more strongly why two-member clusters should be avoided and to recommend more definitely that our users run clusters of at least three members.

@stgraber stgraber added this to the later milestone Sep 23, 2019
@stgraber

Member Author

commented Sep 23, 2019

@freeekanayaka this can all be done in LXD itself, correct?

We only need to modify the joining logic so we don't bring up the database unless there are at least three members in the cluster, at which point we should promote whatever members are needed to have a database backed by three of them.

@freeekanayaka

Member

commented Sep 24, 2019

Yes, in principle that would indeed work, and we already have high-level promotion logic that should be possible to reuse.

@stgraber stgraber added the Easy label Oct 3, 2019
@jackstenglein


commented Oct 19, 2019

@stgraber

Hello, another student from UT Austin and I would like to take on this issue for our virtualization class.

@stgraber

Member Author

commented Oct 21, 2019

Thanks!

For this one, there is no API extension or much in the way of a user-visible change.
What we're looking for is this: turning on clustering (initial member) brings that member up with the database role. The first joining member (second cluster member) does not get that role and relies entirely on the initial member for the database. Then, when a third member joins, both the second and third members get the database role.

This is because the Raft database model we use requires consensus, and consensus with just two members is problematic: the loss of either one causes a stuck database. Since the loss of either member would cost us the database in that scenario, keeping a single database member until a third is added actually improves the odds of recovery from 0% to 50% (the initial member can deal with the loss of the second).
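
To make the consensus arithmetic concrete, here is a minimal sketch assuming a plain majority quorum, which is how Raft voting works in general. The `quorum` and `tolerated` helpers are hypothetical names used for illustration only; they are not taken from LXD or its database library.

```go
package main

import "fmt"

// quorum returns how many voters must be reachable for a majority-based
// consensus group of the given size to keep making progress.
// Hypothetical helper for illustration, not an LXD function.
func quorum(voters int) int {
	return voters/2 + 1
}

// tolerated returns how many voters can be lost while quorum still holds.
func tolerated(voters int) int {
	return voters - quorum(voters)
}

func main() {
	for _, n := range []int{1, 2, 3} {
		fmt.Printf("voters=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

With two voters the quorum is also two, so a second database member adds a second point of failure without adding any fault tolerance. Only at three voters can one of them be lost.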

The changes here are likely to all happen in lxd/cluster and lxd/api_cluster.go, which is where the logic managing joining and leaving cluster members resides. We effectively need to change the logic so that a joining member doesn't get promoted to a database member unless we now have a total of 3 members, in which case the second member also needs to be told to become a database member.
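
The rule described above could be sketched roughly as follows. This is only an illustration of the intended behaviour, not the actual code in lxd/cluster: `databaseMembers` and `minDatabaseMembers` are hypothetical names, and members are assumed to be listed in join order.

```go
package main

import "fmt"

// minDatabaseMembers is the cluster size at which the database is extended
// beyond the initial member (the threshold proposed in this issue).
const minDatabaseMembers = 3

// databaseMembers is a hypothetical helper, not an LXD function. Given the
// cluster members in join order, it returns the ones that should hold the
// database role under the proposed rule.
func databaseMembers(members []string) []string {
	if len(members) == 0 {
		return nil
	}
	if len(members) < minDatabaseMembers {
		// One or two members: only the initial member runs the database.
		return members[:1]
	}
	// Three or more members: the database is backed by three of them.
	return members[:minDatabaseMembers]
}

func main() {
	cluster := []string{}
	for _, name := range []string{"node1", "node2", "node3", "node4"} {
		cluster = append(cluster, name)
		fmt.Printf("%d member(s): database on %v\n",
			len(cluster), databaseMembers(cluster))
	}
}
```

The second member only shows up as a database member once the third one joins, which is the point where both get promoted together.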

Testing this should be straightforward enough: run a test cluster in containers or VMs and keep joining members, checking lxc cluster list each time to see which members are running the database.

As for expected commits, I'm mainly expecting two for this case:

  • "lxd/cluster: Only promote to database role if >= 3 members"
  • "doc/clustering: Document database role during cluster scaling"

@freeekanayaka and I should be able to help you with any questions you have.
The cluster logic can take a little while to wrap your head around. A good first step is likely to set up a simple 3 to 5 member cluster, get a feel for how things work, and see the difference between a member that's running the database and one that isn't (the main difference is the content of database/global).

@freeekanayaka

Member

commented Oct 21, 2019

On top of what @stgraber said, I'll add that you might need to modify the tests in test/suites/clustering.sh in case anything there assumes that a 2-member cluster has 2 database members (I don't think any test assumes that, but just a heads-up).
