Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: check endpoint slice update after backend pool update for local service to prevent mismatch #4536

Merged
merged 1 commit into from Sep 18, 2023

Conversation

nilo19
Copy link
Contributor

@nilo19 nilo19 commented Aug 31, 2023

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

There are chances that the endpoints in an endpointslice updates slower than the actual state. This may bring an issue if the endpointslice updates after we update backend pool but before the service reconciliation finishes. In this case, the endpointslice update would be ignored. This PR checks the endpointslice state after updating backend pool, and reconciles the IPs in the backend pool for a second time to make sure they keep aligned with the endpoints in the endpointslice.

Which issue(s) this PR fixes:

Fixes #
Related #4013

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: check endpoint slice update after backend pool update for local service to prevent mismatch

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 31, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 31, 2023
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodes)))
nodeNames, err := getDeploymentPodsNodeNames(cs, ns.Name, testDeploymentName)
Expect(err).NotTo(HaveOccurred())
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodeNames)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d)", len(nodeNames)))
By(fmt.Sprintf("Checking the node count in the local service backend pool to equal %d", len(nodeNames)))

return false, nil
}
utils.Logf("Pod %s is running on node %s", pod.Name, pod.Spec.NodeName)
res[pod.Spec.NodeName] = true
Copy link
Contributor

@lzhecheng lzhecheng Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there're Pods not from this deployment, will they be put into res as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be other pods, but I didn't notice that we don't stop the process if failed to delete host exec pod, so there could be other pods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the function is called getDeploymentPodsNodeNames, I think it is better to only consider Pods of the deployment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@nilo19 nilo19 force-pushed the chore/node-count branch 3 times, most recently from 4a086ad to f3eabac Compare August 31, 2023 05:21
@nilo19 nilo19 changed the title chore: the expected node count in the backend pool should be measured… [WIP] chore: the expected node count in the backend pool should be measured… Aug 31, 2023
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 31, 2023
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 4, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

1 similar comment
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 4, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19 nilo19 force-pushed the chore/node-count branch 2 times, most recently from e86eb8c to b985045 Compare September 5, 2023 01:11
@nilo19 nilo19 changed the title [WIP] chore: the expected node count in the backend pool should be measured… [WIP] fix: check endpoint slice update after backend pool update for local service to prevent mismatch Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 5, 2023
@nilo19 nilo19 force-pushed the chore/node-count branch 2 times, most recently from 467a678 to 6d3c662 Compare September 5, 2023 01:50
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 5, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19 nilo19 changed the title [WIP] fix: check endpoint slice update after backend pool update for local service to prevent mismatch fix: check endpoint slice update after backend pool update for local service to prevent mismatch Sep 5, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2023
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 6, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

1 similar comment
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 6, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19
Copy link
Contributor Author

nilo19 commented Sep 6, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

5 similar comments
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 6, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19
Copy link
Contributor Author

nilo19 commented Sep 6, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19
Copy link
Contributor Author

nilo19 commented Sep 7, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19
Copy link
Contributor Author

nilo19 commented Sep 7, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19
Copy link
Contributor Author

nilo19 commented Sep 7, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

az.excludeLoadBalancerNodes.Insert(prevNode.ObjectMeta.Name)
az.nodesWithCorrectLoadBalancerByPrimaryVMSet.Delete(strings.ToLower(prevNode.ObjectMeta.Name))
}

// Remove from nodePrivateIPs cache.
for _, address := range getNodePrivateIPAddresses(prevNode) {
klog.V(4).Infof("removing IP address %s of the node %s", address, prevNode.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we increase the log level for this line (e.g. to 6)? it may generate large volume of logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// There are chances that the endpointslice changes after EnsureHostsInPool, so
// need to check endpointslice for a second time.
if err := az.checkAndApplyLocalServiceBackendPoolUpdates(*lb, service); err != nil {
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log an error before return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


var err error
lb1, err = cloud.reconcileBackendPoolHosts(lb1, existingLBs, &svc, []*v1.Node{}, clusterName, "vmss", lbBackendPoolIDs)
assert.NoError(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add some negative test cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

currentIPsInBackendPools[bpName] = currentIPs
}
}
az.applyIPChangesAmongLocalServiceBackendPoolsByIPFamily(*lb.Name, serviceName, currentIPsInBackendPools, expectedIPs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would the error be returned back on failures here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be in the service events.

if err != nil {
return err
}
var expectedIPs []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using Set here? in that way, you could compare by Set.Equal before L573 and skip reconciling if the Sets are not changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it will be skipped in L616.

}
currentIPsInBackendPools[bpName] = currentIPs
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should skip the reconcling if the IP list are not changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will skip in L616.

ipv4 = append(ipv4, ip)
}
}
az.reconcileIPsInLocalServiceBackendPoolsAsync(lbName, serviceName, currentIPsInBackendPoolsIPv6, ipv6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, how is the error propagated back to the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be in the service events.

description string
}{
{
description: "test",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add some negative tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -235,6 +238,36 @@ var _ = Describe("Ensure LoadBalancer", Label(utils.TestSuiteLabelMultiSLB), fun
})
})

func getDeploymentPodsNodeNames(kubeClient clientset.Interface, namespace, deploymentName string) (map[string]bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add some comments for the new function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 10, 2023
@lzhecheng
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit e14df09 into kubernetes-sigs:master Sep 18, 2023
17 checks passed
@nilo19 nilo19 deleted the chore/node-count branch September 19, 2023 00:45
@nilo19
Copy link
Contributor Author

nilo19 commented Sep 20, 2023

/cherrypick release-1.28

@k8s-infra-cherrypick-robot

@nilo19: new pull request created: #4659

In response to this:

/cherrypick release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants