question: disaster recovery strategy with federation v2 #450
Comments
/sub |
The control plane of fedv2 is tied to a single cluster. A cluster can span availability zones as long as the latency between nodes across the zones stays within the bounds required by etcd (<10ms), which may offer some increase in resiliency. If that cluster goes down, though, the fedv2 control plane goes offline. Because fedv2 is implemented as a kube api, the options for DR are limited to those available for kube itself.
That said, workloads will continue to run in member clusters if the fedv2 control plane goes offline or loses connectivity to any of the member clusters. This is similar to how kubelets continue to run established pods even if the kube api becomes inaccessible.
Similarly, as you say, checking the health of backend endpoints for a given frontend endpoint is a key feature of most load balancers. This maximizes the chances of sending traffic to healthy endpoints, and once programmed, a load balancer can continue to serve traffic even with the fedv2 control plane offline. I think this is one of the main advantages of directing traffic with load balancers instead of dns.
Another possibility for reducing the cost of fedv2 control plane downtime is pull propagation. In this mode, rather than the fedv2 control plane being responsible for applying changes to member clusters, the control plane would generate the desired configuration for member clusters and write it to a distributed store (e.g. s3/gcs/etc). Controllers in member clusters would be responsible for reading from the distributed store and applying the configuration. This is only a partial solution: it ensures that cluster configuration is maintained, but resilient traffic coordination (i.e. load balancing) would still be required.
I'll leave commentary on the multicluster dns to @shashidharatd and @font. |
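To make the pull propagation idea above more concrete, here is a minimal sketch, not anything that exists in fedv2. It assumes a local directory standing in for the distributed store (s3/gcs), and the names `publish_desired_state`, `MemberClusterAgent` and `sync_once` are hypothetical. The control plane side writes the desired configuration per member cluster; an agent in each member cluster reads it and applies it only when it has changed.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def publish_desired_state(store: Path, cluster: str, resources: list) -> str:
    """Control plane side (hypothetical): write desired config for one cluster."""
    payload = json.dumps(resources, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    target = store / f"{cluster}.json"
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps({"digest": digest, "resources": resources}))
    tmp.replace(target)  # rename so readers never see a half-written file
    return digest


class MemberClusterAgent:
    """Member cluster side (hypothetical): pull config and apply it locally."""

    def __init__(self, store: Path, cluster: str, apply):
        self.store = store
        self.cluster = cluster
        self.apply = apply  # callable that applies one resource to the cluster
        self.last_applied = None

    def sync_once(self) -> bool:
        """Return True if new configuration was applied."""
        path = self.store / f"{self.cluster}.json"
        if not path.exists():
            return False
        doc = json.loads(path.read_text())
        if doc["digest"] == self.last_applied:
            return False  # nothing changed since the last sync
        for resource in doc["resources"]:
            self.apply(resource)
        self.last_applied = doc["digest"]
        return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = Path(d)
        publish_desired_state(store, "cluster-a", [{"kind": "Deployment", "name": "web"}])
        applied = []
        agent = MemberClusterAgent(store, "cluster-a", applied.append)
        print(agent.sync_once(), applied)
```

The point of the design is that the agent only depends on the store being readable, so it can keep reconciling against the last published state while the control plane is offline. As noted above, this covers configuration only, not traffic coordination.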
@marun |
@raffaelespazzoli What are you intending 'global' to mean in this context? Most load balancers I've worked with are pointed to by DNS and are then programmed with health-checkable backends for a given frontend (a configured combination of ip/host/path). The scope of DNS can of course vary, from local network up to the internet. |
Even when the federation control plane is not running in the cloud, one of the hosted DNS services (AWS Route53, Google CloudDNS, etc.) could be chosen as the global DNS server, which would solve the problem described above.
Yes, you are right. Multiple dns server instances need to run in every cluster (or at least in more than one cluster) for resilience. I am not sure whether CoreDNS provides such a feature.
The ServiceDNS controller in the federation control plane does a basic health check and updates the DNS records accordingly (removing entries for services missing backing endpoints). But this mechanism has flaws.
|
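As a rough illustration of the basic check described above (dropping DNS entries for services that have no backing endpoints), here is a small sketch. It is not the ServiceDNS controller's actual code; the function name `healthy_dns_targets` and the shape of the per-cluster status dictionary are assumptions made for the example.

```python
def healthy_dns_targets(service_status: dict) -> list:
    """Return the IPs that should stay in the DNS record.

    service_status maps a cluster name to a dict with two assumed keys:
      "lb_ips": load balancer IPs exposed by the service in that cluster
      "endpoint_count": number of backing endpoints reported for the service
    """
    targets = []
    for cluster in sorted(service_status):
        status = service_status[cluster]
        # A cluster with no backing endpoints is left out of the record.
        if status["endpoint_count"] > 0:
            targets.extend(status["lb_ips"])
    return targets


if __name__ == "__main__":
    status = {
        "cluster-a": {"lb_ips": ["192.0.2.10"], "endpoint_count": 3},
        "cluster-b": {"lb_ips": ["198.51.100.20"], "endpoint_count": 0},
    }
    print(healthy_dns_targets(status))
```

Note that a check like this only looks at what is reported to the control plane. It does not probe the endpoints itself, and it stops updating records when the control plane is down.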
@marun Global Load Balancer (GLB) and Global Traffic Manager (GTM) are terms used by different vendors for global DNS servers. Usually these tools can be deployed in cluster mode, where there is an instance in each data center and the instances are all synchronized. They can make routing decisions (which IP to return) based on IP geolocation and can run health checks on the configured endpoints. F5 and Cisco also have the same capabilities. Here is an example from the open source space: https://www.yourserver.se/blog/20-powergslb-powerdns-backend-with-monitoring |
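For readers unfamiliar with these tools, the routing decision described above (which IP to return, based on client location and endpoint health) can be sketched roughly as follows. This is an illustration only, not how any particular GLB/GTM product works; `Site`, `resolve` and the region labels are hypothetical, and the addresses are documentation-range placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    region: str    # assumed geo label derived from the site's location
    ip: str
    healthy: bool  # result of the tool's own health check


def resolve(client_region: str, sites: list) -> list:
    """Return IPs to answer with: healthy sites near the client first,
    falling back to any healthy site, or nothing if all are down."""
    healthy = [s for s in sites if s.healthy]
    local = [s for s in healthy if s.region == client_region]
    chosen = local or healthy
    return [s.ip for s in chosen]


if __name__ == "__main__":
    sites = [
        Site("dc-eu", "eu", "192.0.2.10", healthy=False),
        Site("dc-us", "us", "198.51.100.20", healthy=True),
    ]
    # The EU site is failing its health check, so EU clients get the US IP.
    print(resolve("eu", sites))
```

Because each instance holds the full site list and runs its own health checks, any one of them can answer queries without a central control plane being reachable.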
@raffaelespazzoli So 'global' as I'm hearing you describe it sounds like a mechanism to coordinate configuration across a distributed set of load balancers. Below that coordination mechanism, load balancing is pretty much the same regardless of the scope of the network it serves. |
@fejta-bot: Closing this issue. |
One of the reasons for building multiple clusters, potentially in multiple data centers, is to implement a disaster recovery strategy.
What is the recommended disaster recovery strategy for federation V2? Obviously the fact that the federation control plane exists in only one cluster can be an issue.
One more question, more specifically on the management of dnsrecord objects. Assuming that we are not running in the cloud, or that we cannot use the cloud-provided DNS services (because, for example, we are federating clusters from multiple cloud providers), and that we want to self-host a global load balancer (implemented by, for example, coredns), the fact that the dnsrecord objects exist in only one cluster can again be a limitation.
In fact, for such an implementation we would need the dns server instances to exist in every cluster for resilience to a disaster in any of the clusters, but only the cluster with the federation control plane has the information.
Moreover, many enterprise-grade global load balancers have the ability to independently health-check endpoints and to remove them from the DNS records if the health check fails. Should this feature be added to the dnsrecord CRD as an option? I think it would make it possible to build more stable architectures.