
question: disaster recovery strategy with federation v2 #450

Closed
raffaelespazzoli opened this issue Nov 26, 2018 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@raffaelespazzoli

One of the reasons for building multiple clusters, potentially in multiple data centers, is to implement a disaster recovery strategy.
What is the recommended disaster recovery strategy for federation V2? Obviously, the fact that the federation control plane exists in only one cluster can be an issue.

One more question, more specifically on the management of dnsrecord objects. Assuming that we are not running in the cloud, or that we cannot use the cloud-provided DNS services (because, for example, we are federating clusters from multiple cloud providers), and that we want to self-host a global load balancer (implemented, for example, with CoreDNS), the fact that the dnsrecord objects exist in only one cluster can again be a limitation.
In fact, for such an implementation we would need the DNS server instances to exist in every cluster to be resilient to a disaster in any of the clusters, but only the cluster with the federation control plane has the information.
Moreover, many enterprise-grade global load balancers have the ability to independently health-check endpoints and to remove them from the DNS records if the health check fails. Should this feature be added to the dnsrecord CRD as an option? I think it would allow building more stable architectures.
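As a rough illustration of what such an option might look like, here is a minimal Go sketch of a hypothetical DNSRecord type carrying an optional health-check section. None of these field names exist in the actual kubefed API; they are assumptions for discussion only.

```go
// A minimal sketch (not the actual kubefed API) of how an optional,
// independently evaluated health check could be expressed on a DNS
// record type. All field names here are hypothetical.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DNSRecord is a hypothetical record object distributed to every cluster.
type DNSRecord struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec DNSRecordSpec `json:"spec"`
}

type DNSRecordSpec struct {
	// DNSName is the FQDN the record answers for.
	DNSName string `json:"dnsName"`
	// Targets are the candidate endpoint IPs.
	Targets []string `json:"targets"`
	// HealthCheck, when set, lets the serving DNS/GSLB instance probe
	// each target and drop unhealthy ones from its answers.
	HealthCheck *HealthCheck `json:"healthCheck,omitempty"`
}

type HealthCheck struct {
	// Path probed over HTTP on each target, e.g. "/healthz".
	Path string `json:"path"`
	// PeriodSeconds is the interval between probes.
	PeriodSeconds int32 `json:"periodSeconds"`
	// FailureThreshold is the number of failed probes before a target
	// is removed from answers.
	FailureThreshold int32 `json:"failureThreshold"`
}
```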

@sdminonne
Contributor

/sub

@marun
Contributor

marun commented Nov 26, 2018

The control plane of fedv2 is tied to a single cluster. It's possible for a cluster to span availability zones where the latency between nodes across the zones is within the bounds required by etcd (<10ms), which may offer an increase in resiliency. If the cluster goes down, though, the fedv2 control plane goes offline. Because fedv2 is implemented as a kube API, the options for DR are limited to those available for kube itself.

That said, workloads will continue to run in member clusters if the fedv2 control plane goes offline or loses connectivity to any of the member clusters. This is similar to how kubelets will continue to run established pods even if the kube API becomes inaccessible. Similarly, as you say, checking the health of backend endpoints for a given frontend endpoint is a key feature of most load balancers. This maximizes the chances of sending traffic to healthy endpoints, and once programmed, a load balancer could continue to serve traffic even with the fedv2 control plane offline. I think this is one of the main advantages of directing traffic with load balancers instead of DNS.

Another possibility to reduce the cost of fedv2 control plane downtime is pull propagation. In this mode, rather than the fedv2 control plane being responsible for applying changes to member clusters, the control plane would generate the desired configuration for member clusters and write it to a distributed store (e.g. s3/gcs/etc). Controllers in member clusters would be responsible for reading from the distributed store and applying the configuration. This would only be a partial solution, since it only ensures that cluster configuration is maintained; resilient traffic coordination (i.e. load balancing) would still be required.
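For discussion, here is a minimal Go sketch of the pull-propagation idea above, assuming a hypothetical per-cluster object-store URL and a kubectl binary available to the in-cluster agent; this is not part of fedv2.

```go
// A minimal sketch of the pull-propagation idea: the control plane writes
// per-cluster manifests to an object store at a well-known (hypothetical)
// URL, and an in-cluster agent periodically applies them.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os/exec"
	"strings"
	"time"
)

// manifestURL is hypothetical; in practice this would be an S3/GCS object
// keyed by cluster name and protected by appropriate credentials.
const manifestURL = "https://example-bucket.storage.example.com/clusters/cluster-1/desired.yaml"

func syncOnce() error {
	resp, err := http.Get(manifestURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetch manifests: %s", resp.Status)
	}
	manifests, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	// Apply the desired configuration locally; the member cluster keeps
	// converging even if the federation control plane is offline, as long
	// as the last-written bundle remains readable.
	cmd := exec.Command("kubectl", "apply", "-f", "-")
	cmd.Stdin = strings.NewReader(string(manifests))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("kubectl apply: %v: %s", err, out)
	}
	return nil
}

func main() {
	for {
		if err := syncOnce(); err != nil {
			fmt.Println("sync failed:", err)
		}
		time.Sleep(60 * time.Second)
	}
}
```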

I'll leave commentary on the multicluster dns to @shashidharatd and @font.

@raffaelespazzoli
Author

@marun Regarding "Similarly, as you say, checking the health of backend endpoints for a given frontend endpoint is a key feature of most load balancers": here I was referring to global load balancers (i.e. DNS servers), not local load balancers. Feature-rich global load balancers can check the health of the endpoints behind an FQDN.

@marun
Contributor

marun commented Nov 27, 2018

@raffaelespazzoli What do you intend 'global' to mean in this context? Most load balancers I've worked with are pointed to by DNS and are then programmed with health-checkable backends for a given frontend (a configured combination of ip/host/path). The scope of DNS can of course vary, from a local network up to the internet.

@shashidharatd
Contributor

shashidharatd commented Nov 27, 2018

One more question, more specifically on the management of dnsrecord objects. Assuming that we are not running in the cloud, or that we cannot use the cloud-provided DNS services (because, for example, we are federating clusters from multiple cloud providers), and that we want to self-host a global load balancer (implemented, for example, with CoreDNS), the fact that the dnsrecord objects exist in only one cluster can again be a limitation.

Even when the federation control plane is not running in the cloud, one of the hosted DNS services (AWS Route 53, Google Cloud DNS, etc.) could be chosen as the global DNS server, which would solve the problem described above.

In fact, for such an implementation we would need the DNS server instances to exist in every cluster to be resilient to a disaster in any of the clusters, but only the cluster with the federation control plane has the information.

Yes, you are right. Multiple DNS server instances would need to run in every cluster (or at least in more than one cluster) for resilience. I am not sure whether CoreDNS provides such a feature.

Moreover, many enterprise-grade global load balancers have the ability to independently health-check endpoints and to remove them from the DNS records if the health check fails. Should this feature be added to the dnsrecord CRD as an option? I think it would allow building more stable architectures.

The ServiceDNS controller in the federation control plane does a basic health check and updates the DNS records accordingly (removing entries for services that have no backing endpoints). But this mechanism has flaws (see the sketch after the list below):

  1. The health check is done from the federation control plane (currently running in one of the clusters), so it does not exercise the actual data path; the path from client to service instance may be intact even when the check fails.
  2. There is no sure way to guarantee the data path from client to service instance.
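To make the contrast concrete, here is a minimal Go sketch of the kind of check a GSLB/global DNS instance could run from its own vantage point, probing candidate targets and answering only with those that pass. The IPs, probe path, and service name are illustrative assumptions, not kubefed behavior.

```go
// A minimal sketch of an endpoint-side health check as a GSLB/global DNS
// instance might run it: probe each candidate target from the serving
// location and answer only with targets that pass.
package main

import (
	"fmt"
	"net/http"
	"time"
)

var candidates = []string{"203.0.113.10", "198.51.100.20"} // example IPs (RFC 5737)

// healthy probes a hypothetical /healthz endpoint on the target.
func healthy(ip string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + ip + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Each DNS/GSLB instance would run this near its own clients, so the
	// probe path approximates the real data path better than a check made
	// only from the federation control plane.
	var answers []string
	for _, ip := range candidates {
		if healthy(ip) {
			answers = append(answers, ip)
		}
	}
	fmt.Println("records to serve for myservice.example.com:", answers)
}
```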

@raffaelespazzoli
Author

@marun Global Load Balancer (GLB) and Global Traffic Manager (GTM) are terms used by different vendors to refer to global DNS servers. Usually these tools can be deployed in cluster mode, with an instance in each data center and all the instances kept synchronized. They can make routing decisions (which IP to return) based on IP geolocation and can run health checks on the configured endpoints. Here is an example from the open-source space: https://www.yourserver.se/blog/20-powergslb-powerdns-backend-with-monitoring

F5 and Cisco products also have the same capabilities.
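For illustration only, here is a minimal Go sketch of the geo-based routing decision such a GLB/GTM makes, assuming a synchronized, health-filtered map of endpoints per region; the region names and IPs are made up.

```go
// A minimal sketch of geo-based answer selection: given a resolver's
// region, prefer the in-region endpoints and fall back to any others.
package main

import "fmt"

// endpointsByRegion would normally be synchronized across all GLB/GTM
// instances (one per data center) and already filtered by health checks.
var endpointsByRegion = map[string][]string{
	"eu": {"203.0.113.10"},
	"us": {"198.51.100.20"},
}

func answerFor(clientRegion string) []string {
	if ips, ok := endpointsByRegion[clientRegion]; ok && len(ips) > 0 {
		return ips // serve the closest healthy endpoints
	}
	// Fall back to endpoints from all other regions.
	var all []string
	for _, ips := range endpointsByRegion {
		all = append(all, ips...)
	}
	return all
}

func main() {
	fmt.Println(answerFor("eu")) // in-region answer
	fmt.Println(answerFor("ap")) // no local endpoints: fall back to all regions
}
```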

@marun
Contributor

marun commented Nov 27, 2018

@raffaelespazzoli So 'global', as I'm hearing you describe it, sounds like a mechanism to coordinate configuration across a distributed set of load balancers. Below that coordination mechanism, load balancing is pretty much the same regardless of the scope of the network it serves.

@marun marun added priority/backlog Higher priority than priority/awaiting-more-evidence. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 10, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 8, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
