
Unrecoverable error when joining node attempts to retrieve etcd member list from itself #9661

Closed
brandond opened this issue Mar 6, 2024 · 9 comments


brandond commented Mar 6, 2024

Tracking issue for the sequence of events discussed in rancher/rke2#5557 (comment)

If users deploy an external load-balancer or DNS round-robin address list to provide the fixed registration endpoint, and add server nodes to the target pool before they have finished joining the cluster, nodes may attempt to join themselves. This can leave the joining node in a permanently broken state that requires manual cleanup to resolve.

We should enhance the cluster join process to allow detecting cases where a server is attempting to join itself, and either retry or return an error, rather than continuing on with partial information.
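To illustrate the failure mode from the operator side (this is not part of k3s itself; the hostname and commands below are placeholders for a hypothetical pre-flight check), something along these lines could catch the DNS round-robin variant of the problem before the join is started:

# Hypothetical pre-flight check run on a joining server before starting k3s/rke2.
# It refuses to proceed if the fixed registration endpoint currently resolves to
# one of this node's own addresses, i.e. the node would be joining itself.
REGISTRATION_HOST=k3s.example.com   # placeholder for your fixed registration endpoint

local_ips=$(hostname -I)
for ip in $(getent ahostsv4 "$REGISTRATION_HOST" | awk '{print $1}' | sort -u); do
  if echo "$local_ips" | tr ' ' '\n' | grep -qxF "$ip"; then
    echo "ERROR: $REGISTRATION_HOST resolves to this node ($ip); it would be joining itself" >&2
    exit 1
  fi
done
echo "OK: $REGISTRATION_HOST does not currently resolve to this node"

A check like this only covers the DNS case; for a load-balancer, the equivalent safeguard is keeping a server out of the target pool until it has finished joining, which is what the health-probe discussion below is about.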

@brandond brandond added this to the Backlog milestone Mar 6, 2024

rajivml commented Mar 8, 2024

Is it possible to get this prioritised and backported to 1.26, please? We are running into this issue on all our LTS releases, and right now it is being masked by the rke2-server retry loop that we have put in place.

Ideally, the node join should succeed on the first attempt, but we noticed it can sometimes take 5 to 10 restarts (40-60 minutes).


brandond commented Mar 8, 2024

I don't have a timeline for when this will be resolved. Certainly not for the March releases, as code freeze is today. I'm not sure what you mean by LTS; we do not offer LTS releases for k3s or rke2.

The only way I'm aware of to reproduce this requires an incorrectly configured environment where a server is sent its own requests when joining the cluster. If you are seeing this on a regular basis, you need to change the way you are configuring your fixed registration endpoints and ensure that you do not send registration traffic to servers before they are ready.


rajivml commented Mar 8, 2024

I'm confused what you mean by LTS, we do not offer LTS releases for k3s or rke2.
Sorry for the confusion. By LTS I mean our own LTS release cadence, in which we bundle RKE2 and periodically provide Kubernetes updates to our customers.

We do have a health probe defined on port 9345 at the load-balancer to route traffic destined for port 9345. It seems the load-balancer is forwarding requests back to the same node, since rke2 brings up this port the moment the rke2 service starts. Should we change the health probe to use a different port, i.e. either the etcd port or the API server port? Would that solve the issue, and are there any repercussions to using a different port for the LB health probe?


brandond commented Mar 8, 2024

Same answer as I gave at rancher/rke2#5557 (comment), except that for k3s the health check would look like:

curl -ksf https://node:token@172.17.0.8:6443/v1-k3s/readyz

@caroline-suse-rancher caroline-suse-rancher removed this from the Backlog milestone Mar 8, 2024
@caroline-suse-rancher caroline-suse-rancher added the kind/enhancement An improvement to existing functionality label Mar 8, 2024

rajivml commented Mar 11, 2024

@brandond Most load balancers only support a health probe on a single port, so it is not practically possible to have health probes on multiple ports.

Also, the curl-based check you suggested is not feasible via the LB, as we won't know the token at the time the health probe is configured on the backend pool.

Is it possible to have a health probe on a single port, and if so, which port/component should it be?

@brandond
Contributor Author

You can set the token manually; you don't have to let the server generate a random one for you.

If all of your servers are control-plane+etcd, I would health check the apiserver on 6443.
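As a rough sketch of both suggestions (the token value, file path, and address below are placeholders, and the exact probe type depends on what your load balancer supports):

# Pre-set the cluster token on every server (placeholder value and k3s paths
# shown) so it is known in advance instead of being randomly generated.
mkdir -p /etc/rancher/k3s
cat > /etc/rancher/k3s/config.yaml <<'EOF'
token: my-shared-secret
EOF

# With every server running control-plane + etcd, the LB health probe for the
# fixed registration endpoint's backend pool can target the apiserver on 6443,
# e.g. an HTTPS GET of /readyz; if anonymous access to /readyz is disabled in
# your setup, a plain TCP check on 6443 also works.
curl -ksf https://10.0.0.11:6443/readyz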


rajivml commented Mar 12, 2024

Yeah, all our servers are control-plane + etcd. We will try with 6443 and report back with the outcome. Thanks Brandon


mynktl commented Mar 28, 2024

@brandond With 6443 as the health probe for the rke2 port, we are not seeing any issues. Thanks for your help 🙏

@mdrahman-suse

Unable to replicate/validate in k3s, but was able to replicate with rke2. Validation was done on the rke2 commit with the k3s pull-through and is tracked here: rancher/rke2#5804 (comment)
