fix: check endpoint slice update after backend pool update for local service to prevent mismatch #4536
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: nilo19. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodes)))
nodeNames, err := getDeploymentPodsNodeNames(cs, ns.Name, testDeploymentName)
Expect(err).NotTo(HaveOccurred())
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodeNames)))
Suggested change:
- By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodeNames)))
+ By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d", len(nodeNames)))
return false, nil
}
utils.Logf("Pod %s is running on node %s", pod.Name, pod.Spec.NodeName)
res[pod.Spec.NodeName] = true
If there are Pods not from this deployment, will they be put into res as well?
There should not be other pods, but I didn't notice that we don't stop the process if we fail to delete the host exec pod, so there could be other pods.
Since the function is called getDeploymentPodsNodeNames, I think it is better to only consider Pods of the deployment.
yes
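The reviewer's point can be sketched without client-go: filter pods by the deployment's label selector so unrelated pods (such as a leftover host-exec pod) never land in `res`. The `Pod` struct, `matches` helper, and selector literal below are simplified stand-ins for the corev1 types, assumed purely for illustration.

```go
package main

import "fmt"

// Pod is a pared-down, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
}

// matches reports whether the pod labels satisfy every key/value in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// getDeploymentPodsNodeNames returns the set of node names hosting pods that
// carry the deployment's label selector, skipping unrelated pods.
func getDeploymentPodsNodeNames(pods []Pod, selector map[string]string) map[string]bool {
	res := map[string]bool{}
	for _, pod := range pods {
		if !matches(pod.Labels, selector) {
			continue // not owned by this deployment, e.g. a host-exec pod
		}
		res[pod.NodeName] = true
	}
	return res
}

func main() {
	pods := []Pod{
		{Name: "web-1", NodeName: "node-a", Labels: map[string]string{"app": "web"}},
		{Name: "host-exec", NodeName: "node-b", Labels: map[string]string{"app": "exec"}},
	}
	fmt.Println(getDeploymentPodsNodeNames(pods, map[string]string{"app": "web"}))
}
```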
Force-pushed 4a086ad to f3eabac
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
1 similar comment
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
Force-pushed e86eb8c to b985045
Force-pushed 467a678 to 6d3c662
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
1 similar comment
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
5 similar comments
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz
pkg/provider/azure.go (Outdated)
az.excludeLoadBalancerNodes.Insert(prevNode.ObjectMeta.Name)
az.nodesWithCorrectLoadBalancerByPrimaryVMSet.Delete(strings.ToLower(prevNode.ObjectMeta.Name))
}

// Remove from nodePrivateIPs cache.
for _, address := range getNodePrivateIPAddresses(prevNode) {
klog.V(4).Infof("removing IP address %s of the node %s", address, prevNode.Name)
nit: could we increase the log level for this line (e.g. to 6)? It may generate a large volume of logs.
done
// There are chances that the endpointslice changes after EnsureHostsInPool, so
// need to check endpointslice for a second time.
if err := az.checkAndApplyLocalServiceBackendPoolUpdates(*lb, service); err != nil {
return nil, err
Log an error before returning.
done

var err error
lb1, err = cloud.reconcileBackendPoolHosts(lb1, existingLBs, &svc, []*v1.Node{}, clusterName, "vmss", lbBackendPoolIDs)
assert.NoError(t, err)
could you add some negative test cases?
done
currentIPsInBackendPools[bpName] = currentIPs
}
}
az.applyIPChangesAmongLocalServiceBackendPoolsByIPFamily(*lb.Name, serviceName, currentIPsInBackendPools, expectedIPs)
How would the error be returned to the caller on failures here?
It will be in the service events.
if err != nil {
return err
}
var expectedIPs []string
How about using a Set here? That way, you could compare with Set.Equal before L573 and skip reconciling if the Sets have not changed.
In this case it will be skipped in L616.
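The reviewer's Set suggestion can be sketched with a tiny stand-in type; `stringSet` below plays the role of `sets.Set[string]` from k8s.io/apimachinery (an assumption of this sketch, not the PR's actual code), with an `Equal` method used to short-circuit reconciliation when the IP lists match.

```go
package main

import (
	"fmt"
	"sort"
)

// stringSet is a minimal stand-in for sets.Set[string].
type stringSet map[string]struct{}

func newStringSet(items ...string) stringSet {
	s := stringSet{}
	for _, i := range items {
		s[i] = struct{}{}
	}
	return s
}

// Equal reports whether both sets contain exactly the same elements.
func (s stringSet) Equal(other stringSet) bool {
	if len(s) != len(other) {
		return false
	}
	for k := range s {
		if _, ok := other[k]; !ok {
			return false
		}
	}
	return true
}

// List returns the elements in sorted order for stable logging.
func (s stringSet) List() []string {
	out := make([]string, 0, len(s))
	for k := range s {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	current := newStringSet("10.0.0.1", "10.0.0.2")
	expected := newStringSet("10.0.0.2", "10.0.0.1")
	if current.Equal(expected) {
		fmt.Println("backend pool unchanged, skipping reconcile")
		return
	}
	fmt.Println("reconciling to", expected.List())
}
```

Order does not matter for set equality, so two IP lists gathered in different orders still compare equal and skip the reconcile.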
}
currentIPsInBackendPools[bpName] = currentIPs
}
}
We should skip the reconciling if the IP list has not changed.
Will skip in L616.
ipv4 = append(ipv4, ip)
}
}
az.reconcileIPsInLocalServiceBackendPoolsAsync(lbName, serviceName, currentIPsInBackendPoolsIPv6, ipv6)
same here, how is the error propagated back to the caller?
It will be in the service events.
description string
}{
{
description: "test",
could you add some negative tests?
done
@@ -235,6 +238,36 @@ var _ = Describe("Ensure LoadBalancer", Label(utils.TestSuiteLabelMultiSLB), fun
})
})

func getDeploymentPodsNodeNames(kubeClient clientset.Interface, namespace, deploymentName string) (map[string]bool, error) {
nit: add some comments for the new function
done
…service to prevent mismatch
Force-pushed 6d3c662 to 3f73d58
/lgtm
/cherrypick release-1.28
@nilo19: new pull request created: #4659 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
/kind failing-test
What this PR does / why we need it:
There are cases where the endpoints in an EndpointSlice update more slowly than the actual state. This can cause a problem if the EndpointSlice updates after we update the backend pool but before the service reconciliation finishes: in that case, the EndpointSlice update would be ignored. This PR checks the EndpointSlice state after updating the backend pool and reconciles the IPs in the backend pool a second time, to make sure they stay aligned with the endpoints in the EndpointSlice.
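The race and its fix can be sketched as a small control-flow skeleton: push the current endpoint IPs into the pool, then re-read the EndpointSlice and re-apply if it drifted while the update was in flight. The function names below are hypothetical; the actual logic lives in `checkAndApplyLocalServiceBackendPoolUpdates` after `EnsureHostsInPool`.

```go
package main

import "fmt"

// equalStrings compares two IP lists element by element (order-sensitive,
// which is enough for this sketch).
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// reconcileLocalServicePool applies the current endpoint IPs to the backend
// pool, then checks the EndpointSlice a second time in case it changed after
// the first read.
func reconcileLocalServicePool(getEndpointIPs func() []string, applyToPool func([]string) error) error {
	ips := getEndpointIPs()
	if err := applyToPool(ips); err != nil { // first pool update (EnsureHostsInPool)
		return err
	}
	// Second check: the EndpointSlice may have updated while the pool write
	// was in flight; re-apply so the pool matches the latest endpoints.
	if latest := getEndpointIPs(); !equalStrings(latest, ips) {
		return applyToPool(latest)
	}
	return nil
}

func main() {
	// Simulate an EndpointSlice that gains an endpoint between the two reads.
	reads := [][]string{{"10.0.0.1"}, {"10.0.0.1", "10.0.0.2"}}
	i := 0
	get := func() []string {
		ips := reads[i]
		if i < len(reads)-1 {
			i++
		}
		return ips
	}
	applied := 0
	apply := func(ips []string) error {
		applied++
		fmt.Println("applying", ips)
		return nil
	}
	_ = reconcileLocalServicePool(get, apply)
	fmt.Println("pool updates:", applied)
}
```

When the second read returns the same list, the extra `applyToPool` call is skipped, so the steady-state cost of the fix is one additional cache read.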
Which issue(s) this PR fixes:
Fixes #
Related #4013
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: