
Cannot exec,logs/top to pods on some nodes #3784

Closed · 4 tasks done
vojbarzz opened this issue Dec 3, 2023 · 8 comments
Labels: bug (Something isn't working), Stale

vojbarzz commented Dec 3, 2023

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version

v1.28.4+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "28848d1a4bf60ae21d0f286f5ea7deb4d1dceeda179183f15a35afd250586075" (from machine) (pass)
Total memory: 62.7 GiB (pass)
Disk space available for /var/lib/k0s: 827.1 GiB (pass)
Name resolution: localhost: [127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 5.15.0-89-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: built-in (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

kubectl top nodes

NAME   CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
fra1   943m         5%          6193Mi          9%          
fra2   29m          0%          1872Mi          2%          
gra1   <unknown>    <unknown>   <unknown>       <unknown>   
gra2   <unknown>    <unknown>   <unknown>       <unknown> 

kubectl -n traefik get pods -o wide

NAME        READY   STATUS    RESTARTS   AGE   IP                NODE   NOMINATED NODE   READINESS GATES
traefik-0   1/1     Running   0          21m   135.125.189.83    fra1   <none>           <none>
traefik-1   1/1     Running   0          20m   54.36.127.120     gra1   <none>           <none>
traefik-2   1/1     Running   0          19m   54.36.127.37      gra2   <none>           <none>
traefik-3   1/1     Running   0          18m   135.125.188.239   fra2   <none>           <none>

kubectl -n traefik top pods

NAME        CPU(cores)   MEMORY(bytes)   
traefik-0   1m           25Mi            
traefik-3   2m           25Mi  

kubectl -n traefik logs --tail 1 traefik-0

{"ClientAddr":"127.0.0.1:56606","ClientHost":"127.0.0.1","ClientPort":"56606","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":41500,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":41500,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":69,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:15.586834968Z","StartUTC":"2023-12-03T18:27:15.586834968Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:15Z"}

kubectl -n traefik logs --tail 1 traefik-1

Error from server: Get "https://54.36.127.120:10250/containerLogs/traefik/traefik-1/traefik?tailLines=1": dial tcp 54.36.127.120:10250: i/o timeout

kubectl -n traefik logs --tail 1 traefik-2

{"ClientAddr":"127.0.0.1:38120","ClientHost":"127.0.0.1","ClientPort":"38120","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":33851,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":33851,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":54,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:17.489153837Z","StartUTC":"2023-12-03T18:27:17.489153837Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:17Z"}

kubectl -n traefik logs --tail 1 traefik-3

{"ClientAddr":"127.0.0.1:54082","ClientHost":"127.0.0.1","ClientPort":"54082","ClientUsername":"-","DownstreamContentSize":2,"DownstreamStatus":200,"Duration":31521,"GzipRatio":0,"OriginContentSize":0,"OriginDuration":0,"OriginStatus":0,"Overhead":31521,"RequestAddr":":8081","RequestContentSize":0,"RequestCount":60,"RequestHost":"","RequestMethod":"HEAD","RequestPath":"/ping","RequestPort":"8081","RequestProtocol":"HTTP/1.1","RequestScheme":"http","RetryAttempts":0,"RouterName":"ping@internal","StartLocal":"2023-12-03T18:27:17.971224244Z","StartUTC":"2023-12-03T18:27:17.971224244Z","entryPointName":"traefik","level":"info","msg":"","time":"2023-12-03T18:27:17Z"}

kubectl -n traefik exec -t traefik-0 -- hostname

fra1

kubectl -n traefik exec -t traefik-1 -- hostname

Error from server: error dialing backend: dial tcp 54.36.127.120:10250: i/o timeout

kubectl -n traefik exec -t traefik-2 -- hostname

gra2

kubectl -n traefik exec -t traefik-3 -- hostname

fra2

Steps to reproduce

  • use this file to install the cluster:
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-ovh
spec:
  hosts:
    - ssh:
        address: 135.125.75.8
        user: vojbarz
        port: 22
        keyPath: ~/.ssh/id_ed25519_sk.pub@YubiKey5C
      role: controller
    - ssh:
        address: 135.125.189.83
        user: vojbarz
        port: 22
        keyPath: ~/.ssh/id_ed25519_sk.pub@YubiKey5C
      role: worker
    - ssh:
        address: 135.125.188.239
        user: vojbarz
        port: 22
        keyPath: ~/.ssh/id_ed25519_sk.pub@YubiKey5C
      role: worker
    - ssh:
        address: 54.36.127.120
        user: vojbarz
        port: 22
        keyPath: ~/.ssh/id_ed25519_sk.pub@YubiKey5C
      role: worker
    - ssh:
        address: 54.36.127.37
        user: vojbarz
        port: 22
        keyPath: ~/.ssh/id_ed25519_sk.pub@YubiKey5C
      role: worker
  k0s:
    version: v1.28.4+k0s.0
    dynamicConfig: false
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        name: my-k0s-cluster
      spec:
        api:
          address: 135.125.75.8
          externalAddress: api1.my-devbox.cloud
          sans:
            - 135.125.75.8
          extraArgs:
            service-node-port-range: "80-32767"
        telemetry:
          enabled: true
        extensions:
          storage:
            type: openebs_local_storage
  • deploy some pods, spread across the cluster
  • try to get logs from or exec into a pod
  • try to run `kubectl top nodes` or `kubectl top pods`

Expected behavior

  • get logs of all pods
  • exec into all pods
  • see CPU & memory usage for all pods and nodes

I cannot see any related errors in the logs :(

Actual behavior

  • I'm not able to see `top nodes` output for some nodes
  • I'm not able to exec into / see logs for pods on those nodes (but not all of them)

Screenshots and logs

No response

Additional context

No response

@vojbarzz vojbarzz added the bug Something isn't working label Dec 3, 2023
vojbarzz commented Dec 5, 2023

any help on this?

jnummelin (Collaborator) commented:
This sounds like the konnectivity-agent on some nodes is not able to connect to the control plane konnectivity-server. If you SSH into the problematic nodes and check the konnectivity-agent logs under /var/log/containers/..., do you see any connectivity errors?
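As a quick way to act on this suggestion, the log check above can be sketched like so. This is a hedged helper, not part of k0s itself: the `dial_errors` function name is made up here, and the log glob is the typical containerd path — adjust it to your node.

```shell
# Hypothetical helper: filter konnectivity-agent container logs for dial
# failures. The glob in the usage note is the usual containerd log location;
# your node's paths may differ.
dial_errors() {
  grep -h 'error dialing backend' "$@" 2>/dev/null | tail -n 20
  return 0  # grep exits 1 on no match; treat "no errors found" as success
}
# On a problematic worker:
#   dial_errors /var/log/containers/konnectivity-agent-*.log
```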

vojbarzz commented Dec 25, 2023

@jnummelin Looks like you are right. I can see this in the logs:

2023-12-25T13:22:33.30604359Z stderr F I1225 13:22:33.306027       1 options.go:143] WarnOnChannelLimit set to false.
2023-12-25T13:22:33.30604631Z stderr F I1225 13:22:33.306031       1 options.go:144] SyncForever set to false.
2023-12-25T13:22:33.319847649Z stderr F I1225 13:22:33.319806       1 client.go:210] "Connect to server" serverID="4bdb44dcc6cc5d7dbd0556a518d0898c16cf2a0acfca78374def77abf7f07886"
2023-12-25T13:22:33.319860299Z stderr F I1225 13:22:33.319818       1 clientset.go:222] "sync added client connecting to proxy server" serverID="4bdb44dcc6cc5d7dbd0556a518d0898c16cf2a0acfca78374def77abf7f07886"
2023-12-25T13:22:33.319866839Z stderr F I1225 13:22:33.319831       1 client.go:321] "Start serving" serverID="4bdb44dcc6cc5d7dbd0556a518d0898c16cf2a0acfca78374def77abf7f07886" agentID="54.36.127.37"
2023-12-25T13:22:58.023922898Z stderr F I1225 13:22:58.023858       1 client.go:354] "Received DIAL_REQ" serverID="4bdb44dcc6cc5d7dbd0556a518d0898c16cf2a0acfca78374def77abf7f07886" agentID="54.36.127.37" dialID=2515941358220829674 dialAddress="10.99.120.98:443"
2023-12-25T13:22:58.025469416Z stderr F I1225 13:22:58.025427       1 client.go:354] "Received DIAL_REQ" serverID="4bdb44dcc6cc5d7dbd0556a518d0898c16cf2a0acfca78374def77abf7f07886" agentID="54.36.127.37" dialID=3735552473575831730 dialAddress="10.99.120.98:443"
2023-12-25T13:23:03.024711594Z stderr F I1225 13:23:03.024645       1 client.go:420] "error dialing backend" error="dial tcp 10.99.120.98:443: i/o timeout" dialID=2515941358220829674 connectionID=1 dialAddress="10.99.120.98:443"
2023-12-25T13:23:03.02558778Z stderr F I1225 13:23:03.025548       1 client.go:420] "error dialing backend" error="dial tcp 10.99.120.98:443: i/o timeout" dialID=3735552473575831730 connectionID=2 dialAddress="10.99.120.98:443"

and I'm really not able to connect to this port from 2 of the 4 nodes :(

Any idea how to get this resolved? There is no firewall, and all 4 nodes have exactly the same config…
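One way to verify this kind of reachability claim node-by-node is a small TCP probe. This is a hedged sketch using bash's built-in `/dev/tcp` redirection; the `check_port` function is invented here for illustration, and the IP in the usage note is one of the failing gra-zone workers from the issue.

```shell
# Hypothetical probe: report whether a TCP port accepts connections,
# using bash's /dev/tcp pseudo-device (no nc/nmap required).
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
# From another node, test the kubelet/konnectivity target port, e.g.:
#   check_port 54.36.127.120 10250
```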

vojbarzz (Author) commented:
Looks like when I removed this section from the cluster config, I'm able to get logs from and exec into all pods:

spec:
  api:
    address: 135.125.75.8
    externalAddress: api1.my-devbox.cloud
    sans:
      - 135.125.75.8

Even though the konnectivity-agents are no longer producing errors, there is still an issue with the metrics server (I can see a lot of errors about not being able to scrape a node), and the DNS service is not working properly (I guess it is related to the same in-cluster networking issue).

twz123 commented Jan 12, 2024

juanluisvaladas (Contributor) commented:
Hi Vojbarzz,
Can you please verify that nodes can communicate with nodes in the other zone on TCP port 179?

If they have connectivity, we'll probably need to run a traceroute to see which way the traffic goes and where it gets lost. This can be tricky without actually capturing tcpdumps…
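The TCP/179 check (the BGP port kube-router peers over) can be sketched as below. This is a hedged helper, not an official k0s tool: the `probe` function is invented here, and the commented-out loop lists the four worker IPs from this issue.

```shell
# Hypothetical probe: report open/closed for a host:port pair, again via
# bash's /dev/tcp. Run it from one worker toward the workers in the other
# zone to test BGP (TCP/179) reachability.
probe() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}
# From each worker (IPs are the issue's four workers):
#   for ip in 135.125.189.83 135.125.188.239 54.36.127.120 54.36.127.37; do
#     probe "$ip" 179
#   done
```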

juanluisvaladas commented Jan 16, 2024

Hi @vojbarzz, I'm keeping the communication just here rather than the forum for simplicity of tracking.

On the node hosting the pod, find the PID of the container you want to reach. In this case I'm being lazy: I grab coredns and use ps to find it:

worker0:/# ps aux | grep /coredns
 1660 root      3:21 /coredns -conf /etc/coredns/Corefile
<snip>

If you want to use another pod, you can use crictl pods --name <pod name> to find the pod ID and then crictl inspectp <pod id> to find the pod's PID. Note that crictl is not included in k0s.
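A small sketch of that PID lookup, assuming crictl is installed separately (k0s does not bundle it): the `extract_pid` helper is invented here and just pulls the first `"pid"` field out of `crictl inspectp`'s JSON without needing jq.

```shell
# Hypothetical helper: extract the first "pid" value from crictl's JSON
# output on stdin (a rough stand-in for `jq .info.pid`).
extract_pid() {
  grep -o '"pid": *[0-9]*' | head -n 1 | grep -o '[0-9]*$'
}
# Usage on the node (crictl installed separately):
#   POD_ID=$(crictl pods --name coredns -q | head -n 1)
#   crictl inspectp "$POD_ID" | extract_pid
```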

IMPORTANT: Do NOT use tcpdump -i any; -i any does some tricks to detect duplicate packets across several interfaces, which makes it unsuitable for diagnosing connectivity problems.

Once you have the PID, we enter the pod's network namespace with nsenter -n -t <pid> and then run tcpdump:

worker0:/# nsenter -n -t 1660 # now we are in the pod net namespace but we still have the node's filesystem and the previous shell's privileges.
worker0:/# tcpdump -i eth0 -w /tmp/pod.pcap

We also want to run another tcpdump instance in the worker hosting the pod:

worker0:/# tcpdump -i eth0 -w /tmp/podhost.pcap

Finally we want to add a tcpdump in the host attempting to connect:

worker1:/# tcpdump -i eth0 -w /tmp/client.pcap

Then we run a simple ping in the client host to the pod IP:

worker1:/# ping 10.244.0.4

Let it run for a few seconds and stop everything with Ctrl+C.

Now we have to analyze the pcap files. Normally I'd use Wireshark, but to avoid posting images I'll just use tshark, a CLI utility. I will refer to 10.244.0.4 a lot, so keep in mind it's always the pod IP.

First let's see the expected values in the client:

worker1:/tmp# tshark -r client.pcap  -Y 'icmp && ip.addr == 10.244.0.4'
    1   0.000000   172.17.0.4 → 10.244.0.4   ICMP 98 Echo (ping) request  id=0x31d7, seq=2/512, ttl=64
    2   0.000115   10.244.0.4 → 172.17.0.4   ICMP 98 Echo (ping) reply    id=0x31d7, seq=2/512, ttl=63 (request in 1)
<snip>

In your case I expect you only see the echo request but not the reply.

Now we have to check the pod host, this is a working scenario:

worker0:/tmp# tshark -r podhost.pcap -Y 'icmp && ip.addr == 10.244.0.4'
    4   0.430094   172.17.0.4 → 10.244.0.4   ICMP 98 Echo (ping) request  id=0x31d7, seq=3/768, ttl=64
    5   0.430202   10.244.0.4 → 172.17.0.4   ICMP 98 Echo (ping) reply    id=0x31d7, seq=3/768, ttl=63 (request in 4)

From here there are the following possible scenarios:

  1. You don't see either the echo request or the reply (what I expect), which means the packets are lost between the two hosts.
  2. You see the echo request but not the reply, which means the packets are lost inside the node.
  3. You see both request and reply, which means the packets are lost on their way back.

Now, I expect that your issue is the first one. If that's the case, I can only think of two possible solutions:

  1. Allow the traffic to flow between zones in OVH (which I don't know is possible to begin with).
  2. Use an overlay network (which has a big latency penalty and a reasonably small bandwidth penalty). k0s doesn't allow using an overlay with kube-router (even though kube-router itself has that capability), but you can use Calico instead: https://docs.k0sproject.io/v1.28.5+k0s.0/networking/#calico
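If you go the Calico route, a minimal sketch of the relevant ClusterConfig fragment might look like the following. This is an assumption based on the k0s ClusterConfig schema, not a drop-in fix; `vxlan` mode is chosen here as the overlay example — verify the field names and values against the linked docs before applying.

```yaml
# Hypothetical fragment: switch the CNI from the default kube-router to
# Calico running in VXLAN (overlay) mode.
spec:
  network:
    provider: calico
    calico:
      mode: vxlan
```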

github-actions bot commented:

The issue is marked as stale since no activity has been recorded in 30 days.

@github-actions github-actions bot added the Stale label Feb 15, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 22, 2024