Issue: cluster - component - konnectivity #1113
Do you have an LB set up in front of the controllers? As stated in the docs, an HA control plane needs an LB so that e.g. konnectivity-agents can establish proper connections with ALL controllers.
It was not clear that this was a requirement. I am now requesting a load-balancer and will then change the configuration of the controllers. I will update the config file used during the installation process.
I have executed the following installation for the cluster setup:
I will update the /opt/docker/common/docker-configuration/k0s/k0s-cluster-config.yaml file with the load-balancer reference. Then I need to stop/start the controllers. This is clear for controller 1, where the config file was used, but what about controllers 2 and 3?
Currently you need to copy the same config bits over to all controllers so they get booted up the same way, and naturally a restart of all of them is needed if the config changes. We're building a more "robust" way to handle the config as a Kubernetes object, but that work is still ongoing (#1008).
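A minimal sketch of what that could look like, assuming the config lives at the path mentioned above and k0s runs as the systemd service `k0scontroller` (the hostnames are placeholders; adjust to your environment):

```sh
# Hypothetical hostnames; copy the same k0s config to the other controllers
for host in controller2 controller3; do
  scp /opt/docker/common/docker-configuration/k0s/k0s-cluster-config.yaml \
      "$host":/opt/docker/common/docker-configuration/k0s/k0s-cluster-config.yaml
done

# Restart the k0s controller service on every controller so the new config is picked up
for host in controller1 controller2 controller3; do
  ssh "$host" 'systemctl restart k0scontroller'
done
```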
The requirement for an LB basically stems from the way konnectivity is used to "tunnel" the api --> worker communications via konnectivity-agents. As we see from the logs, each API instance is trying to connect to the workers and pods (e.g. the metrics stuff), thus each API server needs to have a konnectivity-agent connected so that it can actually reach the worker services. When there's no LB and no …
We have an HA F5 LB in place, but I still see the errors.
We continue to see these errors; I have checked several scenarios but still have no solution.
Linked to this issue, I see that kubectl commands on two of the three controllers are slow.
This seems to be related to a running metrics pod. When it is not running, I see that "/usr/local/bin/k0s kubectl -n kube-system get deploy" is fast on all my controllers, and at the same time I don't see those errors:
Once the metrics server is started, the above errors are back and the kubectl command is very slow on two of the three controllers. I think the one that is fast is the leader.
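To confirm that correlation, the metrics server can be toggled off and on while timing kubectl; a hedged sketch, assuming the deployment is named `metrics-server` as in the pod names later in this thread:

```sh
# Stop the metrics server and check whether kubectl becomes fast on all controllers
/usr/local/bin/k0s kubectl -n kube-system scale deploy metrics-server --replicas=0

# Bring it back and check whether the konnectivity errors return
/usr/local/bin/k0s kubectl -n kube-system scale deploy metrics-server --replicas=1
```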
Logs from the metrics pod:
I have picked up the work on k0s again. Does anybody have a clue what the issue may be?
Are you able to access metrics through …? What do you see in …? What we have seen is that metrics being "not ready" slows down API discovery (the process where kubectl and other clients discover which APIs are available on the kube-apiserver), as it reports some 5xx responses for the discovery. Maybe check the kubectl command in verbose mode (maybe with …).
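The exact flags were lost from the comment above; a hedged sketch of what a verbose discovery check could look like, using kubectl's standard `-v` log-level flag:

```sh
# -v=8 prints every HTTP request/response, so 5xx discovery errors
# (e.g. against v1beta1.metrics.k8s.io) become visible
/usr/local/bin/k0s kubectl -n kube-system get deploy -v=8
```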
```sh
/usr/local/bin/k0s kubectl get apiservices v1beta1.metrics.k8s.io -o yaml
```
Same on all controllers. Out of the three controllers there is always one that is fast, while the other two reply much more slowly.
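For reference, the discovery-slowdown theory can be checked in the output of that command: a healthy metrics APIService reports an `Available` condition of `"True"`, while a broken konnectivity tunnel typically shows something like the following (illustrative values, not taken from this cluster):

```yaml
status:
  conditions:
  - type: Available
    status: "False"
    reason: FailedDiscoveryCheck
    message: 'failing or missing response from https://10.97.148.61:443/apis/metrics.k8s.io/v1beta1'
```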
On the ones that are slow I see:

```
Sep 27 16:44:16 lxdocapa20 k0s: time="2021-09-27 16:44:16" level=info msg="E0927 16:44:16.778876 10295 server.go:413] "Failed to get a backend" err="No backend available"" component=konnectivity
```

I don't see this on the node that replies fast to kubectl commands.
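For context, "No backend available" from konnectivity-server means no konnectivity-agent has a tunnel open to that particular server, so apiserver→pod calls routed through it fail. A hedged way to check which agents are running and what they reached, assuming the `konnectivity-agent` daemonset name used elsewhere in this thread:

```sh
# List the agent pods and the nodes they run on
/usr/local/bin/k0s kubectl -n kube-system get pods -o wide | grep konnectivity-agent

# Tail one agent's logs to see which proxy servers it managed to connect to
/usr/local/bin/k0s kubectl -n kube-system logs ds/konnectivity-agent --tail=50
```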
The metrics pod logs don't say much:

```sh
sh-4.2# /usr/local/bin/k0s kubectl -n kube-system logs -f metrics-server-6bd95db5f4-snvmw
```

Neither is there anything special in /var/log/messages on the host where the metrics pod is running. From time to time I see the following inside the konnectivity pods; it seems to be the connection to the metrics server:

```
E0927 14:54:20.745230 1 client.go:515] "conn write failure" err="write tcp 10.244.1.161:54652->10.97.148.61:443: use of closed network connection" connectionID=44253
```

```sh
/usr/local/bin/k0s kubectl -n kube-system get svc
```
Can this be related to #596? But when I check:

```sh
sh-4.2# while true; do lsof -p 12838 | wc -l && date; sleep 5; done   # API server
sh-4.2# while true; do lsof -p 119702 | wc -l && date; sleep 5; done  # konnectivity-server
```
Checked lsof on all controllers for /var/lib/k0s/bin/konnectivity-server:

```sh
# Controller 1
sh-4.2# while true; do lsof -p 119702 | wc -l && date; sleep 5; done
# Controller 2
sh-4.2# while true; do lsof -p 10295 | wc -l && date; sleep 5; done
# Controller 3
sh-4.2# while true; do lsof -p 11413 | wc -l && date; sleep 5; done
```
The lsof numbers for the apiserver look better:

```sh
# Controller 1
sh-4.2# while true; do lsof -p 12838 | wc -l && date; sleep 5; done
# Controller 2
sh-4.2# while true; do lsof -p 10250 | wc -l && date; sleep 5; done
# Controller 3
sh-4.2# while true; do lsof -p 11387 | wc -l && date; sleep 5; done
```
It does sound slightly similar at least, but in that issue the root cause was identified and fixed. Could you also check that the konnectivity-agent pods are properly configured for the proxy server address, i.e. where the agents connect to?
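A hedged way to inspect that configuration, assuming the agents run as the `konnectivity-agent` daemonset in `kube-system` as elsewhere in this thread:

```sh
# Show the agent container args, including --proxy-server-host
/usr/local/bin/k0s kubectl -n kube-system get ds konnectivity-agent \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
```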
When looking at the konnectivity agents:
10.100.114.99 --> this is the IP where the agent runs. This is what we use to access the cluster:
What do you mean by agent, the konnectivity-agent? If you've configured the controllers with …? What config do you have on the konnectivity-server processes on the controllers? You can check that e.g. with …
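The exact command was lost from the comment above; a hedged sketch of one way to see the flags the konnectivity-server was started with:

```sh
# Show the full command line of the running konnectivity-server process
# (the [k] trick keeps grep from matching itself)
ps -ef | grep [k]onnectivity-server
```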
I have changed '--proxy-server-host=10.100.114.99' to '--proxy-server-host=k8s-k0s-test.toyota-europe.com', and this solved the issue; kubectl commands now behave the same on all three managers. For a highly available solution I made the following changes, see the external address:

```yaml
apiVersion: k0s.k0sproject.io/v1beta1
```

But it seems I had to update the konnectivity-agent daemonset too:

```
...
```
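The config fragment above is truncated; for reference, a minimal sketch of what such a config could look like, assuming the `ClusterConfig` schema of recent k0s versions and the DNS name used in this thread:

```yaml
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
spec:
  api:
    # LB / DNS name that fronts all three controllers; konnectivity-agents
    # are then pointed at this address instead of a single node IP
    externalAddress: k8s-k0s-test.toyota-europe.com
```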
hmm, k0s should detect the change in the externalAddress and update the konnectivity-agent daemonset accordingly.
I haven't been able to reproduce a case where the externalAddress change would not be reflected in the konnectivity-agent daemonset.
I just did a complete reinstall of the k0s cluster.
But inside the konnectivity daemonset I still see:
So, still the IP address and not the external address? |
@wheestermans31 I am not able to reproduce this. After that, the new cluster has the following definition for the konnectivity agent (see below). Let's check: how do you start the k0s binary, and which version do you use? I have a feeling that your setup for some reason ignores the k0s.yaml; is it possible that you pass the wrong value for …?
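For comparison, a hedged sketch of a typical way to point a k0s controller at an explicit config file so it cannot be silently ignored (the path is the one from earlier in this thread; check `k0s controller --help` on your version):

```sh
# Run the controller with an explicit config file
k0s controller --config /opt/docker/common/docker-configuration/k0s/k0s-cluster-config.yaml

# Or, when installing it as a service:
k0s install controller --config /opt/docker/common/docker-configuration/k0s/k0s-cluster-config.yaml
systemctl start k0scontroller
```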
Same problem with the latest v1.22.2+k0s.2 (single controller+worker host); every 5 minutes in k0s.log:
@sacesare I am not sure if it is related to the issue we discussed here (which I think was originally caused by different configs for the controller nodes). The log you showed looks fishy, though my gut feeling says it is related to the connection timeout between agent and proxy and is generally just noise.
@Soider maybe... but why is it qualified as an "error", with no mention of a timeout? And why would there be a timeout on a single-node cluster (localhost)? @wheestermans31's log contains very similar lines.
Based on "out of GitHub" discussions, the original issue was solved, so closing.
I have an up-and-running cluster with three controllers and two workers.
kubectl is not giving me any issue; from that point of view it looks fine.
But the kubectl command is fast on one of the controllers and very slow on the other two; it is probably fast on the leader of the cluster.
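A simple way to quantify that difference is to time the same command on each controller; a minimal sketch:

```sh
# Run on each of the three controllers and compare wall-clock times
time /usr/local/bin/k0s kubectl -n kube-system get deploy
```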
The only thing I see is the following logging related to the konnectivity component.
Logs on the node that is fast:
But on the ones that are slow I see the following logs:
Probably related...