
question: disaster recovery strategy with federation v2 #450

Closed
raffaelespazzoli opened this issue Nov 26, 2018 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@raffaelespazzoli

One of the reasons for building multiple clusters, potentially in multiple data centers, is to implement a disaster recovery strategy.
What is the recommended disaster recovery strategy for federation V2? Obviously, the fact that the federation control plane exists in only one cluster can be an issue.

One more question, more specifically on the management of dnsrecord objects. Assuming that we are not running in the cloud, or that we cannot use the cloud-provided DNS services (because, for example, we are federating clusters from multiple cloud providers), and that we want to self-host a global load balancer (implemented, for example, with CoreDNS), the fact that the dnsrecord objects exist in only one cluster can again be a limitation.
In fact, for such an implementation we would need the DNS server instances to exist in every cluster to be resilient to a disaster in any of the clusters, but only the cluster with the federation control plane has the information.
Moreover, many enterprise-grade global load balancers have the ability to independently health-check endpoints and to remove them from the DNS records if the health check fails. Should this feature be added to the dnsrecord CRD as an option? I think it would allow building more stable architectures.
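As a rough illustration of what such an option might look like, here is a minimal Go sketch of a hypothetical DNSRecord type carrying an optional health-check section. None of these field names exist in the actual kubefed API; they are assumptions for discussion only.

```go
// A minimal sketch (not the actual kubefed API) of how an optional,
// independently evaluated health check could be expressed on a DNS
// record type. All field names here are hypothetical.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DNSRecord is a hypothetical record object distributed to every cluster.
type DNSRecord struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec DNSRecordSpec `json:"spec"`
}

type DNSRecordSpec struct {
	// DNSName is the FQDN the record answers for.
	DNSName string `json:"dnsName"`
	// Targets are the candidate endpoint IPs.
	Targets []string `json:"targets"`
	// HealthCheck, when set, lets the serving DNS/GSLB instance probe
	// each target and drop unhealthy ones from its answers.
	HealthCheck *HealthCheck `json:"healthCheck,omitempty"`
}

type HealthCheck struct {
	// Path probed over HTTP on each target, e.g. "/healthz".
	Path string `json:"path"`
	// PeriodSeconds is the interval between probes.
	PeriodSeconds int32 `json:"periodSeconds"`
	// FailureThreshold is the number of failed probes before a target
	// is removed from answers.
	FailureThreshold int32 `json:"failureThreshold"`
}
```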

@sdminonne
Contributor

/sub

@marun
Contributor

marun commented Nov 26, 2018

The control plane of fedv2 is tied to a single cluster. It's possible for a cluster to span availability zones where the latency between nodes across the zones is within the bounds required by etcd (<10ms), which may offer an increase in resiliency. If the cluster goes down, though, the fedv2 control plane goes offline. Because fedv2 is implemented as a kube API, the options for DR are limited to those available for kube itself.

That said, workloads will continue to run in member clusters if the fedv2 control plane goes offline or loses connectivity to any of the member clusters. This is similar to how kubelets will continue to run established pods even if the kube API becomes inaccessible. Similarly, as you say, checking the health of backend endpoints for a given frontend endpoint is a key feature of most load balancers. This maximizes the chances of sending traffic to healthy endpoints, and once programmed, a load balancer could continue to serve traffic even with the fedv2 control plane offline. I think this is one of the main advantages of directing traffic with load balancers instead of DNS.

Another possibility to reduce the cost of fedv2 control plane downtime is pull propagation. In this mode, rather than the fedv2 control plane being responsible for applying changes to member clusters, the control plane would generate the desired configuration for member clusters and write it to a distributed store (e.g. s3/gcs/etc). Controllers in member clusters would be responsible for reading from the distributed store and applying the configuration. This would only be a partial solution, since it only ensures that cluster configuration is maintained; resilient traffic coordination (i.e. load balancing) would still be required.
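For discussion, here is a minimal Go sketch of the pull-propagation idea above, assuming a hypothetical per-cluster object-store URL and a kubectl binary available to the in-cluster agent; this is not part of fedv2.

```go
// A minimal sketch of the pull-propagation idea: the control plane writes
// per-cluster manifests to an object store at a well-known (hypothetical)
// URL, and an in-cluster agent periodically applies them.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os/exec"
	"strings"
	"time"
)

// manifestURL is hypothetical; in practice this would be an S3/GCS object
// keyed by cluster name and protected by appropriate credentials.
const manifestURL = "https://example-bucket.storage.example.com/clusters/cluster-1/desired.yaml"

func syncOnce() error {
	resp, err := http.Get(manifestURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetch manifests: %s", resp.Status)
	}
	manifests, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	// Apply the desired configuration locally; the member cluster keeps
	// converging even if the federation control plane is offline, as long
	// as the last-written bundle remains readable.
	cmd := exec.Command("kubectl", "apply", "-f", "-")
	cmd.Stdin = strings.NewReader(string(manifests))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("kubectl apply: %v: %s", err, out)
	}
	return nil
}

func main() {
	for {
		if err := syncOnce(); err != nil {
			fmt.Println("sync failed:", err)
		}
		time.Sleep(60 * time.Second)
	}
}
```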

I'll leave commentary on the multicluster dns to @shashidharatd and @font.

@raffaelespazzoli
Author

@marun Regarding "Similarly, as you say, checking the health of backend endpoints for a given frontend endpoint is a key feature of most load balancers": here I was referring to global load balancers (i.e. DNS servers), not local load balancers. Feature-rich global load balancers can check the health of the endpoints behind an FQDN.

@marun
Contributor

marun commented Nov 27, 2018

@raffaelespazzoli What do you intend 'global' to mean in this context? Most load balancers I've worked with are pointed to by DNS and are then programmed with health-checkable backends for a given frontend (a configured combination of ip/host/path). The scope of DNS can of course vary, from a local network up to the internet.

@shashidharatd
Contributor

shashidharatd commented Nov 27, 2018

One more question, more specifically on the management of dnsrecord objects. Assuming that we are not running in the cloud, or that we cannot use the cloud-provided DNS services (because, for example, we are federating clusters from multiple cloud providers), and that we want to self-host a global load balancer (implemented, for example, with CoreDNS), the fact that the dnsrecord objects exist in only one cluster can again be a limitation.

Even when the federation control plane is not running in the cloud, one of the hosted DNS services (AWS Route 53, Google Cloud DNS, etc.) could be chosen as the global DNS server, which would solve the problem described above.

In fact, for such an implementation we would need the DNS server instances to exist in every cluster to be resilient to a disaster in any of the clusters, but only the cluster with the federation control plane has the information.

Yes, you are right. Multiple DNS server instances would need to run in every cluster (or at least in more than one cluster) for resilience. I am not sure whether CoreDNS provides such a feature.

Moreover, many enterprise-grade global load balancers have the ability to independently health-check endpoints and to remove them from the DNS records if the health check fails. Should this feature be added to the dnsrecord CRD as an option? I think it would allow building more stable architectures.

The ServiceDNS controller in the federation control plane does a basic health check and updates the DNS records accordingly (removing entries for services that have no backing endpoints). But this mechanism has flaws (see the sketch after the list below):

  1. The health check is done from the federation control plane (currently running in one of the clusters), so it does not exercise the actual data path; the path from client to service instance may be intact even when the check fails.
  2. There is no sure way to guarantee the data path from client to service instance.
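To make the contrast concrete, here is a minimal Go sketch of the kind of check a GSLB/global DNS instance could run from its own vantage point, probing candidate targets and answering only with those that pass. The IPs, probe path, and service name are illustrative assumptions, not kubefed behavior.

```go
// A minimal sketch of an endpoint-side health check as a GSLB/global DNS
// instance might run it: probe each candidate target from the serving
// location and answer only with targets that pass.
package main

import (
	"fmt"
	"net/http"
	"time"
)

var candidates = []string{"203.0.113.10", "198.51.100.20"} // example IPs (RFC 5737)

// healthy probes a hypothetical /healthz endpoint on the target.
func healthy(ip string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + ip + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Each DNS/GSLB instance would run this near its own clients, so the
	// probe path approximates the real data path better than a check made
	// only from the federation control plane.
	var answers []string
	for _, ip := range candidates {
		if healthy(ip) {
			answers = append(answers, ip)
		}
	}
	fmt.Println("records to serve for myservice.example.com:", answers)
}
```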

@raffaelespazzoli
Author

@marun Global Load Balancer (GLB) and Global Traffic Manager (GTM) are terms used by different vendors to refer to global DNS servers. Usually these tools can be deployed in cluster mode, with an instance in each data center and all the instances kept synchronized. They can make routing decisions (which IP to return) based on IP geolocation and can run health checks on the configured endpoints. Here is an example from the open-source space: https://www.yourserver.se/blog/20-powergslb-powerdns-backend-with-monitoring

F5 and Cisco products also have the same capabilities.
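For illustration only, here is a minimal Go sketch of the geo-based routing decision such a GLB/GTM makes, assuming a synchronized, health-filtered map of endpoints per region; the region names and IPs are made up.

```go
// A minimal sketch of geo-based answer selection: given a resolver's
// region, prefer the in-region endpoints and fall back to any others.
package main

import "fmt"

// endpointsByRegion would normally be synchronized across all GLB/GTM
// instances (one per data center) and already filtered by health checks.
var endpointsByRegion = map[string][]string{
	"eu": {"203.0.113.10"},
	"us": {"198.51.100.20"},
}

func answerFor(clientRegion string) []string {
	if ips, ok := endpointsByRegion[clientRegion]; ok && len(ips) > 0 {
		return ips // serve the closest healthy endpoints
	}
	// Fall back to endpoints from all other regions.
	var all []string
	for _, ips := range endpointsByRegion {
		all = append(all, ips...)
	}
	return all
}

func main() {
	fmt.Println(answerFor("eu")) // in-region answer
	fmt.Println(answerFor("ap")) // no local endpoints: fall back to all regions
}
```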

@marun
Contributor

marun commented Nov 27, 2018

@raffaelespazzoli So 'global', as I'm hearing you describe it, sounds like a mechanism to coordinate configuration across a distributed set of load balancers. Below that coordination mechanism, load balancing is pretty much the same regardless of the scope of the network it serves.

@marun marun added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 10, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 8, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
