Failover time very high in layer2 mode #298

@ghaering

Is this a bug report or a feature request?:

Both, probably.

What happened:

I tested layer 2 mode and simulated a node failure by shutting down the node that held the load balancer IP. It then took approx. 5 minutes for the IP to be switched to the other node in the cluster (1 master, 2 nodes). After much experimentation I came to the conclusion that the node being down and "NotReady" did not initiate the switch of the IP address. The 5 minute delay seems to be caused by the default pod eviction timeout of Kubernetes, which is 5 minutes. That means it takes 5 minutes for a pod on an unavailable node to be deleted. The default node monitor grace period is 40 seconds, btw. So with the default configuration it currently takes almost 6 minutes for an IP address to be switched.
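
The "almost 6 minutes" figure follows from adding the two defaults mentioned above. A minimal sketch of that arithmetic, assuming the two delays simply add up (the function name `failover_estimate` is made up for illustration and is not part of any tool):

```shell
#!/bin/sh
# Rough worst-case failover estimate from the two kube-controller-manager
# defaults quoted in this report. Assumption: the delays are sequential.
node_monitor_grace_period=40   # seconds until the node is marked NotReady
pod_eviction_timeout=300       # seconds (5 minutes) until its pods are evicted

failover_estimate() {
    # $1 = grace period in seconds, $2 = eviction timeout in seconds
    echo $(( $1 + $2 ))
}

total=$(failover_estimate "$node_monitor_grace_period" "$pod_eviction_timeout")
echo "default: ${total}s ($(( total / 60 ))m $(( total % 60 ))s)"
# prints "default: 340s (5m 40s)"
```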

I made things a lot better by decreasing both settings like this:
- --pod-eviction-timeout=20s
- --node-monitor-grace-period=20s
in /etc/kubernetes/manifests/kube-controller-manager.yaml

This makes MetalLB switch the IP in case of node failure in the sub-minute range.
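
To double-check that both flags really ended up in the manifest, something like the following could be used. It is only a sketch: the helper name `timeout_flags` is hypothetical, and the default path is the one given above, which may differ on clusters not set up with kubeadm.

```shell
#!/bin/sh
# Print the two timeout flags found in a kube-controller-manager manifest.
# Default path is the one named in this report; pass another path as $1.
manifest=${1:-/etc/kubernetes/manifests/kube-controller-manager.yaml}

timeout_flags() {
    # $1 = path to the manifest.
    # Prints each matching flag with the YAML list marker ("- ") stripped.
    grep -E -- '--(pod-eviction-timeout|node-monitor-grace-period)=' "$1" \
        | sed -E 's/^[[:space:]]*-[[:space:]]+//'
}

if [ -r "$manifest" ]; then
    timeout_flags "$manifest"
else
    echo "manifest not readable: $manifest" >&2
fi
```

With the settings above in place, this should list `--pod-eviction-timeout=20s` and `--node-monitor-grace-period=20s`.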

What you expected to happen:

To be honest, I would expect the whole process to take maybe 5 seconds at most.

How to reproduce it (as minimally and precisely as possible):

Create a Kubernetes 1.11.1 cluster with kubeadm (single master, two nodes). Calico networking.

kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.2/manifests/metallb.yaml
kubectl apply -f metallb-cfg.yml
kubectl apply -f tutorial-2.yaml

➜ metallb-test cat metallb-cfg.yml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.115.195.206-10.115.195.208

Then

watch curl --connect-timeout 1 http://10.115.195.206

to see if the nginx app is reachable.

Then

kubectl logs -f --namespace metallb-system speaker-xxxxxxxxx

to see which node has the IP address assigned at the moment.
Then ssh into that machine and run "poweroff".

Measure how long it takes until the "watch curl" is successful again.
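
Instead of eyeballing the "watch curl" output, the outage could be timed with a small polling loop. This is a sketch, not what was used for the numbers above: the function names `probe` and `measure_outage` are made up, the resolution is limited by the 1 second poll interval plus the curl timeouts, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# Time how long the service stays unreachable during a failover test.

probe() {
    # $1 = URL. Succeeds if the service answers in time
    # (same 1 second connect timeout as the watch command above).
    curl --silent --output /dev/null --connect-timeout 1 --max-time 2 "$1"
}

measure_outage() {
    # $1 = URL. Waits for the first failed probe, then times until
    # the first successful probe and prints the elapsed seconds.
    while probe "$1"; do sleep 1; done
    start=$(date +%s)
    echo "down at $(date)"
    until probe "$1"; do sleep 1; done
    end=$(date +%s)
    echo "back after $(( end - start ))s"
}

# Only start measuring when a URL is given on the command line.
if [ "$#" -gt 0 ]; then
    measure_outage "$1"
fi
```

Start it with the load balancer address as the argument, e.g. http://10.115.195.206, before powering off the node.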

Anything else we need to know?:

Environment:

  • MetalLB version: v0.7.2
  • Kubernetes version: v1.11.1
  • BGP router type/version: N/A
  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux cp-k8s-ghdev02-node-01.ewslab.eos.lcl 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
