feat(load-balancer): ignore nodes that don't use known provider IDs #530

boedy · 2023-10-04T10:55:31Z

I'm running a cluster with cloud and baremetel nodes. I've disabled networking, but still want to use support for load balancers. Currently if an node doesn't have a valid hcloud:// provider-id the EnsureLoadBalancer method early returns on this line:

https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/51b83ecf12a1673f35b7d8f35ee0659c3e2d59bb/internal/hcops/load_balancer.go#L599C11-L599C18

Instead of early returning couldn't we just continue and exclude the node from selection?

The text was updated successfully, but these errors were encountered:

apricote · 2023-10-04T11:09:11Z

This is not supported right now, I would actually expect HCCM to remove any non-Hetzner Cloud nodes from your cluster.

We plan to add support for Robot servers, please subscribe to #523 if you are interested in this.

boedy · 2023-10-04T12:49:17Z

Next to Hetzner's root servers we also include (non hetzner) edge servers in our cluster that run in different datacenters. Naturally we wouldn't want HCCM to remove these nodes. HCCM in it's current state it works fine, but we currently require to add the load-balancer.hetzner.cloud/node-selector to all our load balanced services.

Would be easier if HHCM could just ignore nodes that don't have a valid hcloud:// provider-id.

apricote · 2023-10-05T06:29:33Z

What version of HCCM are you running? HCCM does remove any nodes that it can not associate with known servers. Or rather, kubernetes/cloud-provider does this, we only implement a few interfaces and that library actually talks to Kubernetes and decides what to do.

In general, I would recommend you talk to sig-cloud-provider with these requirement, as they are different from the way kubernetes/cloud-provider is currently built. k/c-p expects that all Nodes in the cluster belong to ONE cloud-provider implementation.

boedy · 2023-10-14T10:48:27Z

I'm running the latest version (v1.18). What part of the codebase is responsible for removing the unassociated nodes?

apricote · 2023-10-16T05:39:11Z

That would be the CloudNodeLifecycleController in kubernetes/cloud-provider: https://github.com/kubernetes/cloud-provider/blob/8fe710fc4036e9992a88a5c89fdf9eaf4987b56f/controllers/nodelifecycle/node_lifecycle_controller.go#L153-L179

This in turn calls InstanceV2.InstanceExists() in HCCM

hcloud-cloud-controller-manager/hcloud/instances.go

Lines 74 to 84 in 8775196

    
           func (i *instances) InstanceExists(ctx context.Context, node *corev1.Node) (bool, error) { 
        
           	const op = "hcloud/instancesv2.InstanceExists" 
        
           	metrics.OperationCalled.WithLabelValues(op).Inc() 
        
           	server, err := i.lookupServer(ctx, node) 
        
           	if err != nil { 
        
           		return false, err 
        
           	} 
        
           	return server != nil, nil 
        
           }

Which tries to lookup the Server in Hetzner Cloud API by ID or by Name:

hcloud-cloud-controller-manager/hcloud/instances.go

Lines 49 to 72 in 8775196

    
           // lookupServer attempts to locate the corresponding hcloud.Server for a given corev1.Node 
        
           // It returns an error if the Node has an invalid provider ID or if API requests failed. 
        
           // It can return a nil [*hcloud.Server] if neither the ProviderID nor the Name matches an existing server. 
        
           func (i *instances) lookupServer(ctx context.Context, node *corev1.Node) (*hcloud.Server, error) { 
        
           	var server *hcloud.Server 
        
           	if node.Spec.ProviderID != "" { 
        
           		serverID, err := providerid.ToServerID(node.Spec.ProviderID) 
        
           		if err != nil { 
        
           			return nil, fmt.Errorf("failed to convert provider id to server id: %w", err) 
        
           		} 
        
           		server, _, err = i.client.Server.GetByID(ctx, serverID) 
        
           		if err != nil { 
        
           			return nil, fmt.Errorf("failed to lookup server \"%d\": %w", serverID, err) 
        
           		} 
        
           	} else { 
        
           		var err error 
        
           		server, _, err = i.client.Server.GetByName(ctx, node.Name) 
        
           		if err != nil { 
        
           			return nil, fmt.Errorf("failed to lookup server \"%s\": %w", node.Name, err) 
        
           		} 
        
           	} 
        
           	return server, nil 
        
           }

boedy · 2023-10-16T09:22:20Z

I see. Nodes are only removed when the lookup process returns no errors. Given that all our servers are configured with a ProviderID (if non is provided K3S uses a default btw), the error message failed to convert provider id to server id: %w is generated. This, in turn, stops the node from being deleted. This behavior, in my opinion, is not just preferable but also logical. The HCCM should primarily focus on nodes within its jurisdiction. To illustrate, allowing HCCM to delete nodes identified by other providers, like digitalocean://12345678, would be overstepping its intended boundaries.

Now, circling back to the initial issue: we aim for a seamless integration with the load balancer without the need for the load-balancer.hetzner.cloud/node-selector annotation. A minor tweak in the code can achieve this without compromising the integrity of the HCCM:

// Extract HC server IDs of all K8S nodes assigned to the K8S cluster.
for _, node := range nodes {
    id, err := providerIDToServerID(node.Spec.ProviderID)
    if err != nil {
	// return changed, fmt.Errorf("%s: %w", op, err)
	continue
    }
    k8sNodeIDs[id] = true
    k8sNodeNames[id] = node.Name
}

This change ensures that nodes without a valid Hetzner Cloud ID are simply skipped, rather than causing the entire process to fail. It's a more graceful way to handle such scenarios and provides better flexibility for mixed-node environments like ours.

I genuinely believe this adjustment will not only benefit us but also other users who might have similar hybrid setups. It's about enhancing the adaptability of HCCM without compromising its core principles.

github-actions · 2024-01-14T12:46:07Z

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

apricote · 2024-01-15T07:05:14Z

v1.19.0 added support for Robot servers in HCCM, so users that only want to have Hetzner Cloud & Robot servers in the same cluster, check out Clusters with Robot Servers.

The HCCM should primarily focus on nodes within its jurisdiction. To illustrate, allowing HCCM to delete nodes identified by other providers, like digitalocean://12345678, would be overstepping its intended boundaries.

Please talk to sig-cloud-provider about this. The current behavior in regards to non-Hetzner servers is not guaranteed and comes from an upstream Kubernetes library.

Making the changes you suggested sounds good for two reasons:

The Load Balancer reconciler is more resilient to unexpected cluster conditions (ie. hybrid cloud)
Your use case would be possible

Especially now that we have Events in place, this will be properly communicate to cluster operators.

The code around this changed a bit, I would suggest the following changes (as well as tests to verify them):

 	// Extract HC server IDs of all K8S nodes assigned to the K8S cluster.
 	for _, node := range nodes {
 		id, isCloudServer, err := providerid.ToServerID(node.Spec.ProviderID)
 		if err != nil {
+			if errors.Is(providerid.UnknownPrefixError) {
+				// ProviderID has unknown prefix, cluster might have non-hccm nodes that can not be added to the
+				// Load Balancer. Emitting an event and ignoring that Node in this reconciliation loop.
+				l.Recorder.Eventf(node, corev1.EventTypeWarning, "UnknownProviderIDPrefix", "Node could not be added to Load Balancer for service %s because the provider ID does not match any known format", svc.Name))
+				continue
+			}
 			return changed, fmt.Errorf("%s: %w", op, err)
 		}
 		if isCloudServer {
 			k8sNodeIDsHCloud[id] = true
 		} else {
 			k8sNodeIDsRobot[int(id)] = true
 		}
 		k8sNodes[id] = node
 	}

In addition we need to modify internal/providerid to use a struct for the error:

package providerid

struct UnknownPrefixError {
  ProviderID string
}

func (e *UnknownPrefixError) Error() string {
  return fmt.Sprintf("providerID does not have one of the the expected prefixes (%s, %s, %s): %s", prefixCloud, prefixRobot, prefixRobotLegacy, e.ProviderID)
}

And update providerid.ToServerID():

 	default:
+ 		return 0, false, &UnknownPrefixError{ providerID }
- 		return 0, false, fmt.Errorf("providerID does not have one of the the expected prefixes (%s, %s, %s): %s", prefixCloud, prefixRobot, prefixRobotLegacy, providerID)
 	}

github-actions · 2024-04-14T14:54:24Z

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

boedy · 2024-04-16T20:36:35Z

@apricote Sorry, I seemed to have missed your reply.

Please talk to sig-cloud-provider about this. The current behavior in regards to non-Hetzner servers is not guaranteed and comes from an upstream Kubernetes library.

I'm a bit confused; the changes in question are not in the upstream library right? I'm willing to work on a PR to get this in.

apricote · 2024-04-17T05:47:52Z

There are two issues here:

`kubernetes/cloud-provider` assumes single (cloud) provider

This is an official Kubernetes library that we rely on to talk to Kubernetes. We "only" implement a few interfaces to talk to the Hetzner Cloud API. Almost all of the functionality is dictated by this library.

This library assumes that every single node in the cluster is managed by a single cloud-controller-manager. This assumption is being made at every single point in its codebase. This is also the reason why HCCM tries to delete any Nodes that it does not manage.

Right now the deletion "does not happen" because of an error check we have, but this is incidental and I will not guarantee that we keep this forever in the current way.

If you want to run clusters with nodes from different providers (ie. Hetzner & your edge servers), then please talk to sig-cloud-provider about this requirement. They are the ones that can make changes to the library to reliably allow for such use cases.

Load Balancer Controller breaking

The Load Balancer code on our side could be improved, I am totally fine with making these changes as outlined in my last comment. Feel free to open a PR, if you have any questions we are available to help you through the process :)

boedy · 2024-04-17T08:53:59Z

Thanks for clarifying. I did some digging and it seems this is already being discussed (kubernetes/cloud-provider#35) upstream.

I'll try to make some time in the weekend / evening hours to work on a PR for the LB part.

github-actions bot added the stale label Jan 14, 2024

apricote added enhancement New feature or request and removed stale labels Jan 15, 2024

apricote changed the title ~~EnsureLoadBalancer: Ignore nodes that don't us hcloud id~~ EnsureLoadBalancer: Ignore nodes that don't use known provider IDs Jan 15, 2024

apricote changed the title ~~EnsureLoadBalancer: Ignore nodes that don't use known provider IDs~~ feat(load-balancer): ignore nodes that don't use known provider IDs Jan 15, 2024

github-actions bot added the stale label Apr 14, 2024

apricote removed the stale label Apr 15, 2024

jooola added the pinned label Apr 15, 2024

apricote mentioned this issue May 3, 2024

Node Addresses won't get updated when using Wireguard for Cluster Creation #635

Open

apricote added the help wanted Extra attention is needed label Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(load-balancer): ignore nodes that don't use known provider IDs #530

feat(load-balancer): ignore nodes that don't use known provider IDs #530

boedy commented Oct 4, 2023

apricote commented Oct 4, 2023

boedy commented Oct 4, 2023

apricote commented Oct 5, 2023

boedy commented Oct 14, 2023

apricote commented Oct 16, 2023

boedy commented Oct 16, 2023

github-actions bot commented Jan 14, 2024

apricote commented Jan 15, 2024 •

edited

Loading

github-actions bot commented Apr 14, 2024

boedy commented Apr 16, 2024

apricote commented Apr 17, 2024

boedy commented Apr 17, 2024

feat(load-balancer): ignore nodes that don't use known provider IDs #530

feat(load-balancer): ignore nodes that don't use known provider IDs #530

Comments

boedy commented Oct 4, 2023

apricote commented Oct 4, 2023

boedy commented Oct 4, 2023

apricote commented Oct 5, 2023

boedy commented Oct 14, 2023

apricote commented Oct 16, 2023

boedy commented Oct 16, 2023

github-actions bot commented Jan 14, 2024

apricote commented Jan 15, 2024 • edited Loading

v1.19.0 added support for Robot servers in HCCM, so users that only want to have Hetzner Cloud & Robot servers in the same cluster, check out Clusters with Robot Servers.

github-actions bot commented Apr 14, 2024

boedy commented Apr 16, 2024

apricote commented Apr 17, 2024

kubernetes/cloud-provider assumes single (cloud) provider

Load Balancer Controller breaking

boedy commented Apr 17, 2024

apricote commented Jan 15, 2024 •

edited

Loading

`kubernetes/cloud-provider` assumes single (cloud) provider