Cannot join etcd-only node to an all-role initial server #4784

Closed
rancher-max opened this issue Dec 17, 2021 · 12 comments

@rancher-max
Contributor

rancher-max commented Dec 17, 2021

The current state of etcd-only nodes requires a specific order:

NODE1 etcd-only 
curl -fL https://get.k3s.io | INSTALL_K3S_VERSION=v1.22.5-rc1+k3s1 sh -s - server --cluster-init --token demo --disable-apiserver --disable-controller-manager --disable-scheduler 

NODE2 no etcd 
curl -fL https://get.k3s.io | INSTALL_K3S_VERSION=v1.22.5-rc1+k3s1 sh -s - server --token demo --disable-etcd --server https://<NODE1 IP>:6443 

NODE3 ALL ROLES 
curl -fL https://get.k3s.io | INSTALL_K3S_VERSION=v1.22.5-rc1+k3s1 sh -s - server --token demo --server https://<NODE1 IP>:6443

If node3 is brought up first (with the --cluster-init flag) and node1 then tries to join it (without --cluster-init but with --server), node1 fails to join and does not start correctly. Instead, it loops with:

level=info msg="Waiting for API server to become available"
level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
level=info msg="Failed to set etcd role label: failed to register CRDs: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
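For reference, the failing order described above can be written out roughly as follows. This is only a sketch: it prints the two install commands instead of running them, it reuses the version, token, and flags from the working example, and `NODE3_IP` is a placeholder (192.0.2.10 is a documentation address, not a real host).

```shell
#!/bin/sh
# Sketch: print the install commands for the failing join order.
# Nothing is run against a real host; NODE3_IP is a placeholder value.
K3S_VERSION="v1.22.5-rc1+k3s1"
TOKEN="demo"
NODE3_IP="${NODE3_IP:-192.0.2.10}"

# Flags that make a server etcd-only (taken from the NODE1 command above).
ETCD_ONLY_FLAGS="--disable-apiserver --disable-controller-manager --disable-scheduler"

install_cmd() {
    # $*: arguments handed to "k3s server" through the install script
    printf '%s\n' "curl -fL https://get.k3s.io | INSTALL_K3S_VERSION=${K3S_VERSION} sh -s - server $*"
}

# Step 1: the all-roles node comes up first and initializes the cluster.
first_cmd=$(install_cmd --cluster-init --token "$TOKEN")
# Step 2: the etcd-only node tries to join it; this is the step that loops.
# ETCD_ONLY_FLAGS is left unquoted on purpose so it splits into separate flags.
second_cmd=$(install_cmd --token "$TOKEN" $ETCD_ONLY_FLAGS --server "https://${NODE3_IP}:6443")

printf 'node3 (all roles): %s\n' "$first_cmd"
printf 'node1 (etcd-only): %s\n' "$second_cmd"
```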

We should enhance this functionality to allow any configuration to work. Some valid scenarios would be:

  1. node1=etcd-only, node2=cp, node3=worker
  2. node1=cp, node2=etcd-only, node3=worker
  3. node1=etcd+cp, node2=all roles
  4. node1=all roles, node2=etcd-only
  5. node1=all roles, node2=cp

Edit to add more valid scenarios:
6. node1=etcd-only, node2=etcd-only, node3=etcd-only, node4=cp-only, node5=worker
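To make the role names in these scenarios concrete, here is a small sketch that maps each role to the k3s arguments used earlier in this issue. The role labels (`all`, `etcd-only`, `cp-only`, `worker`) and the `role_args` helper are hypothetical names for illustration. "worker" is assumed to mean a plain `k3s agent`, and the "etcd+cp" combination is left out because the issue does not show which flags it uses.

```shell
#!/bin/sh
# Sketch: map a role label from the scenario list to k3s arguments.
# The labels and this helper are illustrative, not part of k3s itself.
role_args() {
    case "$1" in
        all)       echo "server" ;;
        etcd-only) echo "server --disable-apiserver --disable-controller-manager --disable-scheduler" ;;
        cp-only)   echo "server --disable-etcd" ;;
        worker)    echo "agent" ;;
        *)         echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Example: print the arguments for scenario 1 (etcd-only, cp, worker).
for role in etcd-only cp-only worker; do
    printf '%s -> k3s %s\n' "$role" "$(role_args "$role")"
done
```

Joining nodes would additionally need `--token` and `--server`, as in the commands at the top of the issue.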

rancher-max added the kind/enhancement (An improvement to existing functionality) label Dec 17, 2021
rancher-max added this to the v1.22.6+k3s1 milestone Dec 17, 2021
rancher-max added this to To Triage in Development [DEPRECATED] via automation Dec 17, 2021
@brandond
Contributor

brandond commented Dec 17, 2021

When we fix this, let's add an ADR to capture some requirements for supported role combinations (--disable-x flags) and ordering. I believe the only order we've tested so far is

  1. node1=etcd-only, node2=cp, node3=worker

@rancher-max
Contributor Author

I believe this issue is also affecting cluster creation in Rancher when using single-role (-only) nodes. I saw the following configs fail in Rancher:

  • 3 nodes all roles, 2 nodes etcd-only, 1 node cp-only, 1 node worker-only.
    • Result here was:
    • 3 nodes all roles are up and running, but no other nodes are
    • Same logs as shown in this issue description show up on the etcd-only nodes.
    • Other nodes have not started the k3s process at all and are stuck at "Waiting for etcd to be available"
  • 3 nodes etcd-only, 2 nodes cp-only, 3 nodes worker-only
    • Result here was:
    • No nodes are up and running. etcd-only nodes have two different repeating logs. One had:
    Jan 11 23:11:17 maxk3split-maxk3splitetcd-ac4f3f1a-x8rxp k3s[2121]: time="2022-01-11T23:11:17.744386898Z" level=warning msg="servers addresses are not yet set"
    Jan 11 23:11:22 maxk3split-maxk3splitetcd-ac4f3f1a-x8rxp k3s[2121]: time="2022-01-11T23:11:22.739308259Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:11:22 maxk3split-maxk3splitetcd-ac4f3f1a-x8rxp k3s[2121]: time="2022-01-11T23:11:22.740784566Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    
    And the other two had:
    Jan 11 23:27:06 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:06.819345817Z" level=info msg="Waiting for API server to become available"
    Jan 11 23:27:11 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:11.696957728Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:11 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:11.697868398Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    Jan 11 23:27:16 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:16.696126485Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:16 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:16.697405484Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    Jan 11 23:27:17 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: {"level":"warn","ts":"2022-01-11T23:27:17.060Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
    Jan 11 23:27:17 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:17.060482368Z" level=info msg="Failed to test data store connection: context deadline exceeded"
    Jan 11 23:27:21 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:21.696204907Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:21 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:21.696774378Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    Jan 11 23:27:26 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:26.697017648Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:26 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:26.698441406Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    Jan 11 23:27:31 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:31.696149687Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:31 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:31.696940670Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    Jan 11 23:27:32 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: {"level":"warn","ts":"2022-01-11T23:27:32.063Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
    Jan 11 23:27:32 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:32.064082348Z" level=info msg="Failed to test data store connection: context deadline exceeded"
    Jan 11 23:27:36 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:36.697351901Z" level=warning msg="Deploy controller node name is empty or too long, and will not be tracked via server side apply field management"
    Jan 11 23:27:36 maxk3split-maxk3splitetcd-ac4f3f1a-qvs4s k3s[2117]: time="2022-01-11T23:27:36.698639554Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
    
    • No other nodes had started the k3s process; they were stuck at "Waiting for etcd to be available"

I believe it is all related to this issue, so this will become critical for GA of k3s in Rancher. Note these same issues do NOT show up with the same configurations for rke2.
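When comparing nodes, it can help to count how often each of the repeating messages above shows up. A minimal sketch, assuming the logs are piped in on stdin (for example from `journalctl -u k3s --no-pager`); `count_signatures` and the counter names are made up for illustration, and the patterns are copied from the log lines in this comment.

```shell
#!/bin/sh
# Sketch: count the repeating failure messages from this thread in k3s logs
# read from stdin. Patterns are taken from the log excerpts above.
count_signatures() {
    awk '
        /Failed to set etcd role label/ && /127\.0\.0\.1:6444: connect: connection refused/ { api++ }
        /Failed to test data store connection/ { ds++ }
        /servers addresses are not yet set/ { addr++ }
        END { printf "apiserver_refused=%d datastore_timeout=%d no_server_addresses=%d\n", api, ds, addr }
    '
}

# Example run against three lines copied from the logs above (prefix trimmed).
count_signatures <<'EOF'
time="2022-01-11T23:11:17.744386898Z" level=warning msg="servers addresses are not yet set"
time="2022-01-11T23:27:11.697868398Z" level=info msg="Failed to set etcd role label: Get \"https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions\": dial tcp 127.0.0.1:6444: connect: connection refused"
time="2022-01-11T23:27:17.060482368Z" level=info msg="Failed to test data store connection: context deadline exceeded"
EOF
# prints: apiserver_refused=1 datastore_timeout=1 no_server_addresses=1
```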

@brandond
Contributor

brandond commented Jan 11, 2022

@ShylajaDevadiga, @mdrahman-suse, and I just bisected a similar issue (same repeating dial tcp 127.0.0.1:6444: connect: connection refused message) with cluster-reset restore to 05f1bc6 - I'm investigating a fix right now.

cwayne18 moved this from To Triage to Next Up in Development [DEPRECATED] Jan 12, 2022
@brandond
Contributor

@rancher-max can you see what you get with the most recent RC?

@rancher-max
Contributor Author

It looks like it's still failing for me through rancher. Using config:
3 servers all roles, 2 nodes etcd-only, 1 node cp-only, 1 node worker only

What's interesting is that in this case, sometimes the 3 all-roles servers come up and the other nodes do not, and sometimes not even the 3 all-roles servers come up (so nothing is up and running).

@Oats87
Member

Oats87 commented Jan 14, 2022

Yep, this is why Rancher CI has not been passing since we bumped K3s/RKE2 versions.

Do we know what version of K3s this broke in?

@Oats87
Member

Oats87 commented Jan 14, 2022

I'm particularly suspicious of #4246

Rancher provisioning tests stop passing after crossing the threshold of the inclusion of that version i.e.
v1.21.5+k3s2 works, v1.21.6+k3s1 breaks.

@rancher-max
Contributor Author

> Rancher provisioning tests stop passing after crossing the threshold of the inclusion of that version i.e.
> v1.21.5+k3s2 works, v1.21.6+k3s1 breaks.

This is consistent with the behavior I see in Rancher 2.6.3 with k3s provisioning, so I believe @Oats87 is likely correct about the commit that broke this.

brandond self-assigned this Jan 18, 2022
brandond moved this from Next Up to Working in Development [DEPRECATED] Jan 18, 2022
@brandond
Contributor

I can take this on for the next cycle, since it seems related to the agent ready channel stuff that I added.

katran001 modified the milestones: v1.22.6+k3s1, v1.22.7+k3s1 Jan 25, 2022
@katran001

katran001 commented Jan 25, 2022

Moving this to the next milestone since the team didn't have a chance to work on it for v1.22.6+k3s1.

@brandond
Contributor

This did not get worked on for the February release; let's reschedule to March.

@rancher-max
Contributor Author

Validated using v1.23.5-rc1+k3s1

Performed the following 5 scenarios:

  • Scenario 1: node1=etcd-only
  • Scenario 2: node1=all roles
  • Scenario 3: double etcd-only, node1=etcd-only
  • Scenario 4: double etcd-only, node1=all roles
  • Scenario 5: node1=cp only

Note that on scenario 5, there is an expected failure in starting k3s. It receives: level=fatal msg="invalid flag use; --server is required with --disable-etcd"
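The scenario 5 failure can be caught before k3s is started. Below is a rough sketch of such a pre-check, not the actual k3s validation code: `check_server_args` is a hypothetical helper, the error text is copied from the message above, and it assumes the flag is passed either as `--server URL` or `--server=URL`.

```shell
#!/bin/sh
# Sketch: reject a "k3s server" argument list that disables etcd without
# pointing at an existing server, mirroring the fatal error quoted above.
check_server_args() {
    has_disable_etcd=0
    has_server=0
    for arg in "$@"; do
        case "$arg" in
            --disable-etcd)      has_disable_etcd=1 ;;
            --server|--server=*) has_server=1 ;;
        esac
    done
    if [ "$has_disable_etcd" -eq 1 ] && [ "$has_server" -eq 0 ]; then
        echo "invalid flag use; --server is required with --disable-etcd" >&2
        return 1
    fi
    return 0
}

# A cp-only first node is rejected; the same node joining a server is accepted.
# 192.0.2.10 is a placeholder documentation address.
check_server_args --token demo --disable-etcd 2>/dev/null || echo "rejected: cp-only as first node"
check_server_args --token demo --disable-etcd --server "https://192.0.2.10:6443" && echo "accepted: cp-only joining a server"
```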

Development [DEPRECATED] automation moved this from To Test to Done Issue / Merged PR Mar 22, 2022