Description
MetalLB Version: v0.8.1
Kubernetes version: v1.15.2
Network addon: Calico v3.7.3
Kube-proxy config: iptables
We have an application deployed onto our test Kubernetes cluster as a Deployment with two replicas, using RollingUpdate as the strategy. Every so often, when an update happens, the application becomes unavailable externally to the cluster (i.e. via MetalLB) for several minutes. The application remains available on the internal Service VIP the entire time.
This application takes a while to start up, so the rolling update can take a while too, which is why we began noticing this with this application but hadn't seen it previously with other things, which start up almost immediately.
Further digging suggested that what actually happens during an update is that a new ReplicaSet is created. It's scaled from 0 to 1 and the old ReplicaSet is scaled from 2 to 1. That seems fine. What I struggled with is that this sometimes caused the outage and sometimes didn't.
If I described the Service, I would see Endpoints listed; it would usually contain the IP:PORT of the one remaining pod in the old ReplicaSet. Which again makes a lot of sense: one of the pods is gone and the new pod created by the new ReplicaSet isn't ready yet.
I saw from the speaker logs messages like:
{"caller":"main.go:272","event":"serviceWithdrawn","ip":"","msg":"withdrawing service announcement","pool":"test-static","protocol":"layer2","reason":"notOwner","service":"namespace/app-name","ts":"2019-09-03T10:10:09.322434472Z"}
That led me to layer2_controller.go, specifically the ShouldAnnounce and usableNodes functions. They seem pretty simple to understand, so I added some extra debug logging, rolled my own containers, pushed them to our local registry, and updated the MetalLB speakers in our cluster.
I forced a couple of updates to the application to happen and I couldn't break anything, but eventually it broke and I think my debugging shows why. So here goes:
I logged the value of the 'subset' variable just inside the for loop of the usableNodes function and I see this (truncated to the most useful parts):
{"addresses":[{"ip":"10.10.97.59","nodeName":"server5","targetRef":{"kind":"Pod","namespace":"namespace","name":"app-name-6494589445-hcqs6","uid":"SOME_UUID","resourceVersion":"8854491"}}],
"notReadyAddresses":[{"ip":"10.10.97.57","nodeName":"server5","targetRef":{"kind":"Pod","namespace":"namespace","name":"app-name-55b56674b-tt5j7","uid":"ANOTHER_UUID","resourceVersion":"8863517"}}],
What I quickly realised is that the outage doesn't occur when nodeName differs between the endpoints in addresses and notReadyAddresses. But when it is the same, I get the outage.
I also logged what the 'usable' variable in the same function is set to. When the nodeNames above are different, I would see something like this:
{"server1":false,"server5":true},"caller":"layer2_controller.go:66","ip":"10.30.2.185","pool":"test-static","protocol":"layer2","service":"namespace/app-name"
When the outage was happening, I would see:
{"server5":false},"caller":"layer2_controller.go:66","ip":"10.30.2.185","pool":"test-static","protocol":"layer2","service":"namespace/app-name"
So I worked through the code to try to work out what was happening, and I believe it's this: if the new ReplicaSet creates its first pod on the same Kubernetes node as the old ReplicaSet's remaining pod, then 'subset' in usableNodes will look like my output above. Addresses within subset will contain the IP of the old remaining pod with nodeX as the nodeName, and NotReadyAddresses will contain the IP of the new, not-yet-ready pod with the same nodeName as the old pod.
Because the code in usableNodes will always set usable["nodeX"] to false if an endpoint's nodeName is listed in NotReadyAddresses, this one node can never become true, despite one of its endpoints actually being ready.
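To make the failure mode concrete, here's a minimal, self-contained sketch of the logic as I understand it. The types and function here are simplified stand-ins of my own, not MetalLB's actual code (the real function works on v1.Endpoints in layer2_controller.go), but the map-building behaviour is the part I'm describing:

```go
package main

import "fmt"

// Simplified stand-ins for the Endpoints subset structures (illustration
// only; the real code uses k8s.io/api/core/v1 types).
type Address struct {
	IP       string
	NodeName string
}

type Subset struct {
	Addresses         []Address
	NotReadyAddresses []Address
}

// usableNodesSketch mirrors the behaviour described above: a ready endpoint
// marks its node usable, but any not-ready endpoint on the same node
// unconditionally flips it back to false.
func usableNodesSketch(subsets []Subset) map[string]bool {
	usable := map[string]bool{}
	for _, subset := range subsets {
		for _, ep := range subset.Addresses {
			if _, ok := usable[ep.NodeName]; !ok {
				usable[ep.NodeName] = true
			}
		}
		for _, ep := range subset.NotReadyAddresses {
			usable[ep.NodeName] = false
		}
	}
	return usable
}

func main() {
	// The failing case from this report: the old ready pod and the new
	// not-ready pod both land on server5.
	subsets := []Subset{{
		Addresses:         []Address{{IP: "10.10.97.59", NodeName: "server5"}},
		NotReadyAddresses: []Address{{IP: "10.10.97.57", NodeName: "server5"}},
	}}
	fmt.Println(usableNodesSketch(subsets)) // prints: map[server5:false]
}
```

Run against the failing case, server5 comes out false even though one of its endpoints is ready, which matches the {"server5":false} log line above.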
This is obviously exacerbated by our small test environment (3 nodes), our Deployment's replicas being set so low (2), and our rollingUpdate strategy setting maxUnavailable to 1 on such a small ReplicaSet.
Regardless, I do not think this is the correct behaviour given that it works fine if the pods end up on different nodes.
The naive part of me thinks that the actual endpoint addresses should be taken into account in some way rather than just the node name. But despite this code being reasonably simple to understand, there's clearly a reason why it was written this way... and I don't know what that reason was. The comment in the code almost suggests that this behaviour was intentional, and I don't understand the possible repercussions of changing it.
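For what it's worth, one hypothetical variant of that idea would be to only count a node as unusable when none of its endpoints are ready, so a ready endpoint always wins over a not-ready one on the same node. Again, this is a sketch with my own simplified stand-in types, not a patch against MetalLB's real code, and I may be missing whatever the original design was protecting against:

```go
package main

import "fmt"

// Simplified stand-ins for the Endpoints subset structures (illustration only).
type Address struct {
	IP       string
	NodeName string
}

type Subset struct {
	Addresses         []Address
	NotReadyAddresses []Address
}

// usableNodesByReadiness marks a node unusable only if it has not-ready
// endpoints and no ready ones: any ready endpoint makes its node usable.
func usableNodesByReadiness(subsets []Subset) map[string]bool {
	usable := map[string]bool{}
	for _, subset := range subsets {
		// Record not-ready nodes first, but never overwrite a node that a
		// ready endpoint has already marked usable.
		for _, ep := range subset.NotReadyAddresses {
			if _, ok := usable[ep.NodeName]; !ok {
				usable[ep.NodeName] = false
			}
		}
		// A ready endpoint makes its node usable, unconditionally.
		for _, ep := range subset.Addresses {
			usable[ep.NodeName] = true
		}
	}
	return usable
}

func main() {
	// Same failing case as before: old ready pod and new not-ready pod
	// both on server5.
	subsets := []Subset{{
		Addresses:         []Address{{IP: "10.10.97.59", NodeName: "server5"}},
		NotReadyAddresses: []Address{{IP: "10.10.97.57", NodeName: "server5"}},
	}}
	fmt.Println(usableNodesByReadiness(subsets)) // prints: map[server5:true]
}
```

With this ordering, server5 would stay usable during the rolling update in our scenario, but I have no idea whether it breaks the case the original code was written for, which is exactly what I'd like to understand.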
If anyone can point me in the right direction, I'm happy to try to fix it myself.
Thanks