Design & RFC: orchestrator on raft #175
Comments
Consensus implementation discussion: there are multiple ways to implement this, all of which have pros and cons. I'll list only the few that I deem likely:
A discussion of the merits of the various approaches follows, with some considerations and examples of changes that need to be advertised.

Some perceptions and idle thoughts: I have successfully worked with the hashicorp/raft library while authoring freno. Change advertising can be a bit of a hassle, but otherwise this library supports what we need. Notably, Consul also uses the hashicorp/raft library.

Assuming stable leadership, we may change the way we advertise changes. A further idle thought: it should be easy enough to implement tracking of ops applied via raft.
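To make "change advertising" concrete, here is a minimal sketch of how a change such as a downtime request could flow through the hashicorp/raft library: the leader appends it to the replicated log via Apply(), and every node's FSM applies it to its own private backend. The command and FSM types are made up for illustration; none of this is orchestrator code.

```go
package raftsketch

import (
	"encoding/json"
	"io"
	"sync"
	"time"

	"github.com/hashicorp/raft"
)

// command is a hypothetical envelope for a change advertised to the group.
type command struct {
	Op       string `json:"op"`       // e.g. "begin-downtime"
	Instance string `json:"instance"` // e.g. "db-1234:3306"
}

// fsm applies committed changes to this node's own private backend.
type fsm struct {
	mu        sync.Mutex
	downtimed map[string]bool
}

// Apply is invoked on every node once a log entry is committed.
func (f *fsm) Apply(l *raft.Log) interface{} {
	var c command
	if err := json.Unmarshal(l.Data, &c); err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	switch c.Op {
	case "begin-downtime":
		f.downtimed[c.Instance] = true
	case "end-downtime":
		delete(f.downtimed, c.Instance)
	}
	return nil
}

// Snapshot and Restore are required by raft.FSM; omitted in this sketch.
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *fsm) Restore(io.ReadCloser) error         { return nil }

// advertise runs on the leader and publishes a change to all group members.
func advertise(r *raft.Raft, c command) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return r.Apply(data, 10*time.Second).Error()
}
```

Calling Apply on a follower fails with raft.ErrNotLeader, which is one reason user-initiated commands would need to be routed or forwarded to the leader.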
@shlomi-noach I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.
@leeparayno it is one of the major catalysts for this development; credit @grypyrg for first bringing this to my attention over a year ago.
etcd raft is not really complicated if you want to look into it. It is designed to be flexible and portable. It powers quite a few active and notable distributed systems: https://github.com/coreos/etcd/tree/master/raft#notable-users.
@xiang90 thank you :) it did seem to me to be more complicated to set up; right now I'm working with the hashicorp/raft library. From the not-so-deep look I took into the etcd raft library, it seemed to require more setup on my part. However I don't want to make myself appear that lazy. It was easier to pick up on the hashicorp/raft library.
Check out: https://github.com/coreos/etcd/tree/master/contrib/raftexample. Most etcd/raft users started with it. But, yea, etcd/raft is a lower level thing than most other raft implementations for the reason I mentioned above. There is some effort to make it look like other raft libraries (https://github.com/compose/canoe) without losing its flexibility.
Objective
A cross DC orchestrator deployment with consensus for failovers and mitigating fencing scenarios.
Secondary (optional) objective: remove the MySQL backend dependency.
Current status
At this time (release ) orchestrator nodes use a shared backend MySQL server for communication & leader election. The high availability of the orchestrator setup is composed of:
- the orchestrator service nodes
- the backend MySQL database
The former is easily achieved by running more nodes, all of which connect to the backend database.
The latter is achieved via:
1. Circular Master-Master replication
- one MySQL master and one orchestrator service in dc1, one MySQL master and one orchestrator service in dc2
- a network partition between dc1 and dc2 may cause both orchestrator services to consider themselves as leaders. Such a network partition is likely to also break the cross DC replication that orchestrator is meant to tackle in the first place, leading to independent failovers by both orchestrator services, leading to potential chaos.
2. Galera/XtraDB/InnoDB Cluster setup
- 3 Galera nodes, in multi-writer mode (each node is writable)
- 3 orchestrator nodes
- each orchestrator node connects to a specific Galera node
- a network partition (e.g. of dc1) still leaves 2 connected Galera nodes, making a quorum. By nature of orchestrator leader election, the orchestrator service in dc1 cannot become a leader. One of the other two will.

A different use case; issues with current design
With the existing design, one orchestrator node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing orchestrator nodes because they all use and write to the same backend DB.

By virtue of this design, only one orchestrator is running failure detection. There is no sense in having multiple orchestrator nodes run failure detection because they all rely on the exact same dataset.

orchestrator uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.

If the orchestrator leader runs on dc1, and dc1 happens to be partitioned away, the leader cannot handle failover to servers in dc2.

The cross-DC Galera layout, suggested above, can solve this case, since the isolated orchestrator node will never be the active node.

We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.
Our specific use case lights up a new design offered in this Issue, bottom-to-top; but let's now observe the offered design top-to-bottom.
orchestrator/raft design
The orchestrator/raft design suggests:
- orchestrator nodes to communicate via raft
- each orchestrator node to have its own private database backend server
- each orchestrator node is to handle its own private DB. There is no replication. There is no DB High Availability setup.
- orchestrator nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
- orchestrator nodes to communicate between themselves (via raft) changes that are not detected by probing the MySQL servers.
- all orchestrator nodes run failure detection. They each have their own dataset to analyze.
- only one orchestrator is the leader (decided by raft consensus)
- the leader may consult with the other orchestrator nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"

Noteworthy is that cross-orchestrator communication is sparse; health-messages will run once per second, and other than that the messages will be mostly user-initiated input, such as begin-downtime or recovery steps etc. See breakdown further below.
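As a rough illustration of the detection/recovery split described above, each node analyzes its own dataset once per second, but only the current raft leader, and only with peer agreement, acts on a failure. The names below are hypothetical placeholders, not orchestrator's actual code:

```go
package raftsketch

import (
	"time"

	"github.com/hashicorp/raft"
)

// analysis is a hypothetical summary of one node's independent failure detection.
type analysis struct {
	ClusterAlias string
	MasterFailed bool
}

// node wires together the pieces this design calls for; all fields are
// placeholders standing in for real orchestrator machinery.
type node struct {
	raft     *raft.Raft
	analyze  func() analysis     // independent, per-node detection against the private DB
	agree    func(analysis) bool // ask peers: "do you all agree there's a failure case?"
	failover func(analysis)      // kick off the recovery
}

// run performs detection on every node once per second; only the raft leader,
// and only with quorum agreement, proceeds to recovery.
func (n *node) run() {
	for range time.Tick(time.Second) {
		a := n.analyze()
		if !a.MasterFailed {
			continue
		}
		if n.raft.State() != raft.Leader {
			continue // followers detect, but never act
		}
		if n.agree(a) {
			n.failover(a)
		}
	}
}
```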
Implications

Since each orchestrator node has its own private backend DB, there's no need to sync the databases. There is no replication. Each orchestrator node is responsible for maintaining its own DB.

There is no specific requirement for MySQL. In fact, there's a POC running on SQLite, embedded within the orchestrator binary.

We get failure detection quorum. As illustrated above, multiple independent orchestrator nodes will each run failure analysis. If we place each orchestrator node in its own DC, then we have fought fencing:
- if a DC gets network partitioned, the orchestrator node in that DC is isolated, hence cannot be the leader
- the remaining orchestrator nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, which will kick in the failover.

Is this a simpler or a more complex setup?

An orchestrator/raft/sqlite setup would be a simpler setup, which does not involve provisioning MySQL servers. One would need to config orchestrator with the raft nodes' identities, and orchestrator will take it from there.

An orchestrator/raft/mysql setup is naturally more complex than orchestrator/raft/sqlite; however, there is no replication and no need for a highly available orchestrator backend: each node maintains its own private orchestrator backend.

Implementation notes
All group members will run independent discovery, and so general discovery information doesn't need to be passed between orchestrator nodes.

The following information will need to pass between orchestrator nodes as group messages (a sketch follows the list):
- begin-downtime (but can be discarded if end of downtime is already in the past)
- end-downtime
- begin-maintenance (but can be discarded if end of maintenance is already in the past)
- end-maintenance
- forget
- discover, so that completely new instances can be shared with all orchestrator nodes
- submit-pool-instances, user generated info mapping instances to a pool
- register-candidate - a user-based instruction to flag instances with promotion rules
- ack-cluster-recoveries
- register-hostname-unresolve
- deregister-hostname-unresolve
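As a small illustration of the discard rule mentioned for begin-downtime and begin-maintenance above, a node applying a replayed group message can simply drop it if the window has already expired. The types below are hypothetical, for illustration only:

```go
package raftsketch

import "time"

// beginDowntime is a hypothetical payload for the begin-downtime group message.
type beginDowntime struct {
	Instance string
	Owner    string
	Reason   string
	EndsAt   time.Time
}

// applyBeginDowntime stores the downtime in this node's private backend,
// unless the end of the downtime is already in the past, in which case the
// message is simply discarded.
func applyBeginDowntime(msg beginDowntime, store map[string]beginDowntime) {
	if time.Now().After(msg.EndsAt) {
		return // already expired; discard
	}
	store[msg.Instance] = msg
}
```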
Easiest setup would be to load balance orchestrator nodes behind a proxy (e.g. haproxy), such that the proxy would only direct traffic to the active (leader) node. isAuthorizedForAction can be extended to reject requests on a follower.

Updates cannot be made via the orchestrator CLI, because that would directly talk to the DB instead of going through raft. Even while the orchestrator service is running, it is OK to open the SQLite DB file from another process (the orchestrator CLI) and read from it.

Add /api/leader-check, to be used by load-balancers. Leader will return 200 OK and followers will return 404 Not Found.
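A sketch of what the /api/leader-check handler could look like; using hashicorp/raft's leadership state for the test is an assumption made for illustration, not a settled implementation choice:

```go
package raftsketch

import (
	"net/http"

	"github.com/hashicorp/raft"
)

// leaderCheckHandler answers 200 OK on the leader and 404 Not Found on
// followers, so a proxy health check keeps only the leader in rotation.
func leaderCheckHandler(r *raft.Raft) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		if r.State() == raft.Leader {
			w.WriteHeader(http.StatusOK)
			w.Write([]byte("leader"))
			return
		}
		w.WriteHeader(http.StatusNotFound)
		w.Write([]byte("not leader"))
	}
}

// Registration, e.g.: http.Handle("/api/leader-check", leaderCheckHandler(raftNode))
```

An haproxy backend would then mark followers as down via an HTTP check on this endpoint, leaving only the leader to receive traffic.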
We'd need to be able to bootstrap an empty DB node joining a quorum. For example, if one node out of 3 completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data.
- SQLite: copy+paste DB file
- MySQL: dump + import the orchestrator schema

cc @github/database-infrastructure @dbussink