Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support local service in multiple standard load balancer mode #4450

Merged

Conversation

nilo19
Copy link
Contributor

@nilo19 nilo19 commented Aug 15, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

In multiple slb mode, each local service owns a backend pool named after the service name. The backend pool will be created in the service reconciliation loop when the service is created or updated from external traffic policy cluster. It will be deleted in the service reconciliation loop when: 1, the service is deleted; 2, the service is changed to etp cluster; 3, the cluster is migrated from multi-slb to single-slb; and 4, the service is moved to another load balancer.
Besides the service reconciliation loop, an endpointslice informer is added in this pr. It watches all endpointslices of local services, monitors any updating events, and updates the corresponding backend pool. Considering local services may churn quickly, the informer will send backend pool updating operations to a buffer queue. The queue merges operations targeting to the same backend pool, and updates them every 30s.

Which issue(s) this PR fixes:

Fixes #
Related: #4013

Special notes for your reviewer:

Doc: #4451

Does this PR introduce a user-facing change?

feat: support local service in multiple standard load balancer mode

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 15, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 15, 2023
@nilo19
Copy link
Contributor Author

nilo19 commented Aug 15, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19 nilo19 force-pushed the feat/multi-slb/local-svc/main branch from ef79b2e to 68fa2cd Compare August 15, 2023 08:29
@nilo19
Copy link
Contributor Author

nilo19 commented Aug 15, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

@nilo19 nilo19 force-pushed the feat/multi-slb/local-svc/main branch from 68fa2cd to ccb8c6e Compare August 15, 2023 14:37
@nilo19
Copy link
Contributor Author

nilo19 commented Aug 16, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-multi-slb-capz

if strings.HasPrefix(strings.ToLower(pointer.StringDeref(bp.Name, "")), strings.ToLower(bpName)) {
found = true
}
if len(*bp.LoadBalancerBackendAddresses) != expectedCount {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean all bps will have the same amount of nodes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, only one bp should be created.

_, err = cs.AppsV1().Deployments(ns.Name).Update(context.Background(), deployment, metav1.UpdateOptions{})
Expect(err).NotTo(HaveOccurred())
err = wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 5*time.Minute, false, func(ctx context.Context) (bool, error) {
if err := checkNodeCountInBackendPoolByServiceIPs(tc, clusterName, expectedBPName, ips, len(nodes)); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so len(nodes) <= 5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be any number. I have a wrong log in L192.

@nilo19 nilo19 force-pushed the feat/multi-slb/local-svc/main branch from ccb8c6e to 8513bdf Compare August 16, 2023 06:15
@@ -264,6 +264,11 @@ type Config struct {

// DisableAPICallCache disables the cache for Azure API calls. It is for ARG support and not all resources will be disabled.
DisableAPICallCache bool `json:"disableAPICallCache,omitempty" yaml:"disableAPICallCache,omitempty"`

// RouteUpdateIntervalInSeconds is the interval for updating routes. Default is 30 seconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks making those intervals configurable.

@@ -247,10 +238,77 @@ func TestEnsureHostsInPoolNodeIP(t *testing.T) {
},
},
},
{
desc: "local service",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you make the desc for those tests more clear? e.g. something like local service should xxxx.

}

func (op *loadBalancerBackendPoolUpdateOperation) wait() batchOperationResult {
return <-op.result
return batchOperationResult{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used somewhere? seems it should be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is used in route updater.

"service name", op.serviceName,
"load balancer name", op.loadBalancerName,
"backend pool name", op.backendPoolName,
"related IPs", strings.Join(op.nodeIPs, ","))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/related IPs/node IPs/

"service name", op.serviceName,
"load balancer name", op.loadBalancerName,
"backend pool name", op.backendPoolName,
"related IPs", strings.Join(op.nodeIPs, ","))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node IPs


// getLocalServiceEndpointsNodeNames gets the node names that host all endpoints of the local service.
func (az *Cloud) getLocalServiceEndpointsNodeNames(service *v1.Service) (sets.Set[string], error) {
eps, err := az.KubeClient.DiscoveryV1().EndpointSlices(service.Namespace).List(context.Background(), metav1.ListOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we get this from local watch cache instead of a list API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add a cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a cache, please check the latest commit.
BTW, there is a problem theoretically: if the endpointslice doesn't finish updating when we ensure hosts in pool, there could be a discrepancy between the # of node IPs in the pool and the actual state. This is because the endpoint slice update may be slower than the pods update. After testing multiple times, I haven't found this behavior but I think this could happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's a valid issue on large scale of workloads. Could we also trigger the backend pool updates on the endpointslice add/delete events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline. There is no need to support backend pool updates for add/delete events. The latency issue will be fixed in a patch version of v1.28.

@nilo19 nilo19 force-pushed the feat/multi-slb/local-svc/main branch 5 times, most recently from 745c706 to 4e3536d Compare August 17, 2023 05:46
@nilo19 nilo19 force-pushed the feat/multi-slb/local-svc/main branch from 4e3536d to 9f9d85c Compare August 18, 2023 03:02

// setUpEndpointSlicesInformer creates an informer for EndpointSlices of local services.
// It watches the update events and send backend pool update operations to the batch updater.
// TODO (niqi): the update of endpointslice may be slower than tue update of endpoint pods. Need to fix this.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a todo

var ep *discovery_v1.EndpointSlice
az.endpointSlicesCache.Range(func(key, value interface{}) bool {
endpointSlice := value.(*discovery_v1.EndpointSlice)
if strings.EqualFold(getServiceNameOfEndpointSlice(endpointSlice), service.Name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check service namespace also

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching, please check the new commit.

@jwtty
Copy link
Member

jwtty commented Aug 18, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmss-capz

} else {
key := strings.ToLower(getServiceName(service))
si, found := bi.getLocalServiceInfo(key)
if found && !strings.EqualFold(si.lbName, lbName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if originally cluster was using single LB and a local service was created (so this map does not contain the service) and then cluster is updated to multiple LB (and service may need to be changed to another LB)? Want to make sure it still works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reconcile all managed LB's backendpools in the outer caller of ensureHostsInPool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok if the map doesn't contain the service here, we just block the case where the lb name in the map is incorrect.

az.endpointSlicesCache.Store(strings.ToLower(fmt.Sprintf("%s/%s", newES.Namespace, newES.Name)), newES)

key := strings.ToLower(fmt.Sprintf("%s/%s", newES.Namespace, svcName))
si, found := az.getLocalServiceInfo(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When user migrates from single LB to multiple LBs, a local svc may be not be found right? Will that cause any problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it won't, in this case the main loop will take over the process.

@jwtty
Copy link
Member

jwtty commented Aug 18, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2023
@jwtty
Copy link
Member

jwtty commented Aug 19, 2023

/test pull-cloud-provider-azure-e2e-ccm-vmssflex-capz

@k8s-ci-robot k8s-ci-robot merged commit 58f4150 into kubernetes-sigs:master Aug 19, 2023
17 checks passed
@nilo19 nilo19 deleted the feat/multi-slb/local-svc/main branch August 21, 2023 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants