Cannot exec, logs, or top to pods on some nodes #3784
Comments
any help on this?
This sounds like the konnectivity-agent on some nodes is not able to connect to the control plane konnectivity-server. If you SSH into the problematic nodes, check the konnectivity-agent logs.
@jnummelin Looks like you are right. I can see in the logs:
Looks like when I removed this section from the cluster config I'm able to get logs and exec to all pods:

```yaml
spec:
  api:
    address: 135.125.75.8
    externalAddress: api1.my-devbox.cloud
    sans:
      - 135.125.75.8
```

Even if the konnectivity-agents are not producing any other errors, there still remains an issue with the metrics server (I can see a lot of errors about not being able to scrape a node), and the DNS service is not working properly (I guess it is related to the cluster networking issue).
Hi Vojbarzz, if they have connectivity we'll probably need to do a traceroute to see the way the traffic goes and where it gets lost. This can be tricky without actually acquiring tcpdumps…
Hi @vojbarzz, I'm keeping the communication just here rather than on the forum for simplicity of tracking. On the node hosting the pod, find the PID of the container you want to reach. In this case I'm being lazy and getting coredns and using
If you want to use another pod you can use

IMPORTANT: Do NOT use `tcpdump -i any`. `-i any` does some tricks to detect duplicate packets among several interfaces, which makes it unsuitable for detecting connectivity problems. Once you have the PID, we enter the pod's namespace with
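As a sketch of that step (the container name, file paths, and interface here are assumptions, not the exact commands from this thread): resolve the pod's container PID with crictl, then run tcpdump inside its network namespace with nsenter.

```shell
# Assumed setup: run as root on the worker node hosting the pod.
# Grab the first coredns container ID, then its PID from the runtime info.
CID=$(crictl ps --name coredns -q | head -n1)
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' "$CID")

# Enter only the network namespace of that PID and capture ICMP on eth0
# (a specific interface, never "-i any") into a pcap for later analysis.
nsenter --target "$PID" --net -- tcpdump -i eth0 -w /tmp/pod.pcap icmp
```

The `--net` flag keeps the rest of the host context (filesystem, tools) while seeing the pod's interfaces, which is why tcpdump does not need to exist inside the container image.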
We also want to run another tcpdump instance in the worker hosting the pod:
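Assuming the pod IP 10.244.0.4 used later in the thread and an eth0 uplink (both assumptions), that second capture could look like:

```shell
# On the worker node hosting the pod: capture traffic to/from the pod IP
# on the node's own interface (adjust eth0 and the IP to your environment).
tcpdump -i eth0 -w /tmp/node.pcap host 10.244.0.4 and icmp
```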
Finally we want to add a tcpdump in the host attempting to connect:
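A minimal sketch for the client side, again with an assumed interface name and the placeholder pod IP:

```shell
# On the client host that cannot reach the pod.
tcpdump -i eth0 -w /tmp/client.pcap host 10.244.0.4 and icmp
```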
Then we run a simple ping from the client host to the pod IP:
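For example (10.244.0.4 stands in for the real pod IP, not a value from your cluster):

```shell
# Send a handful of echo requests so each capture records both directions.
ping -c 10 10.244.0.4
```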
Let it run a few seconds and stop everything with Ctrl+C. Now we have to analyze the pcap files. Normally I'd use Wireshark, but to avoid using images I'll just use tshark, a CLI utility. I will refer to 10.244.0.4 a lot, so keep in mind it's always the pod IP. First let's see the expected values on the client:
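One way to do that check with tshark, assuming the client capture was written to /tmp/client.pcap (a file name I'm assuming, not one from the thread):

```shell
# Read the client-side capture and keep only ICMP involving the pod IP.
# In a healthy path, each echo request has a matching echo reply.
tshark -r /tmp/client.pcap -Y 'icmp && ip.addr == 10.244.0.4'
```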
In your case I expect you only see the echo request but not the reply. Now we have to check the pod host; this is a working scenario:
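The same kind of check on the pod host's capture (file name and IP are assumptions):

```shell
# If requests appear here but no replies, the pod side never answers;
# if nothing appears at all, traffic is lost before reaching this node.
tshark -r /tmp/node.pcap -Y 'icmp && ip.addr == 10.244.0.4'
```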
Now there are the following possible scenarios:
Now, I expect that your issue is the first one. If this is the case I can only think of two possible solutions:
This issue has been marked as stale since no activity has been recorded in 30 days.
Before creating an issue, make sure you've checked the following:
Platform
Version
v1.28.4+k0s.0
Sysinfo
`k0s sysinfo`
What happened?
```
$ kubectl top nodes
$ kubectl -n traefik get pods -o wide
$ kubectl -n traefik top pods
$ kubectl -n traefik logs --tail 1 traefik-0
{"ClientAddr":"127.0.0.1:56606","ClientHost":"127.0.0.1","ClientPort":"56606","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":41500,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":41500,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":69,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:15.586834968Z","StartUTC":"2023-12-03T18:27:15.586834968Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:15Z"}
$ kubectl -n traefik logs --tail 1 traefik-1
Error from server: Get "https://54.36.127.120:10250/containerLogs/traefik/traefik-1/traefik?tailLines=1": dial tcp 54.36.127.120:10250: i/o timeout
$ kubectl -n traefik logs --tail 1 traefik-2
{"ClientAddr":"127.0.0.1:38120","ClientHost":"127.0.0.1","ClientPort":"38120","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":33851,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":33851,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":54,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:17.489153837Z","StartUTC":"2023-12-03T18:27:17.489153837Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:17Z"}
$ kubectl -n traefik logs --tail 1 traefik-3
{"ClientAddr":"127.0.0.1:54082","ClientHost":"127.0.0.1","ClientPort":"54082","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":31521,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":31521,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":60,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:17.971224244Z","StartUTC":"2023-12-03T18:27:17.971224244Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:17Z"}
$ kubectl -n traefik exec -t traefik-0 -- hostname
fra1
$ kubectl -n traefik exec -t traefik-1 -- hostname
Error from server: error dialing backend: dial tcp 54.36.127.120:10250: i/o timeout
$ kubectl -n traefik exec -t traefik-2 -- hostname
gra2
$ kubectl -n traefik exec -t traefik-3 -- hostname
fra2
```
Steps to reproduce
Expected behavior
I cannot see any related errors in logs :(
Actual behavior
top nodes does not work for some nodes

Screenshots and logs
No response
Additional context
No response