[zfs] kind create cluster crashes #2987

Closed
jrwren opened this issue Oct 31, 2022 · 11 comments · Fixed by #2989


jrwren commented Oct 31, 2022

What happened:

I ran kind create cluster and got a Go stack trace. I ran it again with --retain to collect additional logs for this report.

 $ kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I1031 19:15:30.693671     150 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W1031 19:15:30.695745     150 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.25.3                                                                 
[certs] Using certificateDir folder "/etc/kubernetes/pki"                                                
I1031 19:15:30.702885     150 certs.go:112] creating a new certificate authority for ca                  
[certs] Generating "ca" certificate and key                                                              
I1031 19:15:30.813990     150 certs.go:522] validating certificate period for ca certificate             
[certs] Generating "apiserver" certificate and key                                                       
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.19.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key                                        
I1031 19:15:31.060575     150 certs.go:112] creating a new certificate authority for front-proxy-ca      
[certs] Generating "front-proxy-ca" certificate and key                                                  
I1031 19:15:31.193915     150 certs.go:522] validating certificate period for front-proxy-ca certificate 
[certs] Generating "front-proxy-client" certificate and key                                              
I1031 19:15:31.268075     150 certs.go:112] creating a new certificate authority for etcd-ca             
[certs] Generating "etcd/ca" certificate and key                                                         
I1031 19:15:31.404285     150 certs.go:522] validating certificate period for etcd/ca certificate        
[certs] Generating "etcd/server" certificate and key                                                     
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I1031 19:15:32.160580     150 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key                                                               
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"                                                   
I1031 19:15:32.270317     150 kubeconfig.go:103] creating kubeconfig file for admin.conf                 
[kubeconfig] Writing "admin.conf" kubeconfig file                                                        
I1031 19:15:32.605678     150 kubeconfig.go:103] creating kubeconfig file for kubelet.conf               
[kubeconfig] Writing "kubelet.conf" kubeconfig file                                                      
I1031 19:15:32.740927     150 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf    
[kubeconfig] Writing "controller-manager.conf" kubeconfig file                                           
I1031 19:15:32.878626     150 kubeconfig.go:103] creating kubeconfig file for scheduler.conf             
[kubeconfig] Writing "scheduler.conf" kubeconfig file                                                    
I1031 19:15:33.023983     150 kubelet.go:66] Stopping the kubelet                                        
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env" 
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"                     
[kubelet-start] Starting the kubelet                                                                     
[control-plane] Using manifest folder "/etc/kubernetes/manifests"                                        
[control-plane] Creating static Pod manifest for "kube-apiserver"                                        
I1031 19:15:33.161748     150 manifests.go:99] [control-plane] getting StaticPodSpecs                    
I1031 19:15:33.162189     150 certs.go:522] validating certificate period for CA certificate             
I1031 19:15:33.162269     150 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I1031 19:15:33.162280     150 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I1031 19:15:33.162287     150 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I1031 19:15:33.162292     150 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I1031 19:15:33.162298     150 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I1031 19:15:33.164991     150 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I1031 19:15:33.165031     150 manifests.go:99] [control-plane] getting StaticPodSpecs                    
[control-plane] Creating static Pod manifest for "kube-controller-manager"                               
I1031 19:15:33.165431     150 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I1031 19:15:33.165448     150 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I1031 19:15:33.165459     150 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I1031 19:15:33.165470     150 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I1031 19:15:33.165481     150 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I1031 19:15:33.165491     150 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I1031 19:15:33.165505     150 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I1031 19:15:33.166872     150 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I1031 19:15:33.166905     150 manifests.go:99] [control-plane] getting StaticPodSpecs                    
[control-plane] Creating static Pod manifest for "kube-scheduler"                                        
I1031 19:15:33.167386     150 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I1031 19:15:33.168381     150 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I1031 19:15:33.169891     150 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I1031 19:15:33.169948     150 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I1031 19:15:33.170878     150 loader.go:374] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I1031 19:15:33.175917     150 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 2 milliseconds
*** THIS MESSAGE REPEATED EVERY 500ms many times
[kubelet-check] Initial timeout of 40s passed.   
I1031 19:16:13.177579     150 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
*** THIS MESSAGE REPEATED EVERY 500ms many times
                                                                                                         
Unfortunately, an error has occurred:                                                                    
        timed out waiting for the condition                                                              
                                                                                                         
This error is likely caused by:                                                                          
        - The kubelet is not running                                                                     
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'                                                                     
        - 'journalctl -xeu kubelet'                                                                      
                                                                                                         
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.                        
Here is one example how you may list all running Kubernetes containers by using crictl:                  
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:                        
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'            
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108                                          
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1                                
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:234                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll                                 
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run                                      
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1                                                   
        cmd/kubeadm/app/cmd/init.go:154                                                                  
github.com/spf13/cobra.(*Command).execute                                                                
        vendor/github.com/spf13/cobra/command.go:856                                                     
github.com/spf13/cobra.(*Command).ExecuteC                                                               
        vendor/github.com/spf13/cobra/command.go:974                                                     
github.com/spf13/cobra.(*Command).Execute                                                                
        vendor/github.com/spf13/cobra/command.go:902                                                     
k8s.io/kubernetes/cmd/kubeadm/app.Run                                                                    
        cmd/kubeadm/app/kubeadm.go:50                                                                    
main.main                                                                                                
        cmd/kubeadm/kubeadm.go:25                                                                        
runtime.main                                                                                             
        /usr/local/go/src/runtime/proc.go:250                                                            
runtime.goexit                                                                                           
        /usr/local/go/src/runtime/asm_amd64.s:1594                                                       
error execution phase wait-control-plane                                                                 
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1                                
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:235                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll                                 
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run                                      
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207                                                
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1                                                   
        cmd/kubeadm/app/cmd/init.go:154                                                                  
github.com/spf13/cobra.(*Command).execute                                                                
        vendor/github.com/spf13/cobra/command.go:856 
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974 
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902 
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1594

What you expected to happen:

Running containers for both the control-plane and the node. I only get a running control-plane container, which fails to start the node container.

How to reproduce it (as minimally and precisely as possible):

I'm not sure, but it may be because I am using the zfs Docker storage driver.
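
For reference, the storage driver in use can be confirmed with the standard docker info query (the full docker info output is in the Environment section below):

 $ docker info --format '{{.Driver}}'
 zfs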

Anything else we need to know?:

tmp-3955055506.tar.gz

Environment:

  • kind version: (use kind version): kind v0.17.0 go1.19.2 linux/amd64
  • Kubernetes version: (use kubectl version):Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"434bfd82814af038ad94d62ebe59b133fcb50506", GitTreeState:"clean", BuildDate:"2022-10-12T10:57:26Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}
    Kustomize Version: v4.5.7
  • Docker version: (use docker info):
Client:            
 Context:    default                                
 Debug Mode: false
                                                    
Server:                                             
 Containers: 6                                      
  Running: 5        
  Paused: 0           
  Stopped: 1                                                                                             
 Images: 41           
 Server Version: 20.10.12
 Storage Driver: zfs 
  Zpool: lxd   
  Zpool Health: ONLINE
  Parent Dataset: lxd/docker
  Space Used By Parent: 14880817152
  Space Available: 132101271552
  Parent Quota: no 
  Compression: off
 Logging Driver: json-file        
 Cgroup Driver: systemd                             
 Cgroup Version: 2
 Plugins:            
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active                                                                                           
  NodeID: fcn77gjvcse8p8a7s0qy5jonu
  Is Manager: true
  ClusterID: l05x9y793vs5b8qs24toweqdo
  Managers: 1                                       
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24     
  Data Path Port: 4789
  Orchestration:                                    
   Task History Retention Limit: 5
  Raft:                                      
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false 
  Root Rotation In Progress: false
  Node Address: 76.214.139.191
  Manager Addresses:
   76.214.139.191:2377
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-48-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 13.58GiB
 Name: delays
 ID: A4VB:QA7L:BKYP:GWQM:MRN7:VKR3:D5N4:6ZUL:Y7UD:VL6F:ZMSN:2T2I
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: jrwren
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
jrwren added the kind/bug label on Oct 31, 2022

jrwren commented Oct 31, 2022

And... I should have searched first.

Duplicate of #2163.

Worked around it using:

cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "native"
EOF
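
To double-check the patch landed, one option (assuming the node keeps its rendered containerd config at /etc/containerd/config.toml, as kind nodes normally do) is to grep it inside the node container:

 $ docker exec kind-control-plane grep -n 'snapshotter' /etc/containerd/config.toml
 # should show: snapshotter = "native"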

I wonder if a better fix would be adding zfsutils-linux to the kind-control-plane container image.

jrwren closed this as completed on Oct 31, 2022
BenTheElder (Member) commented:

something is wrong if we're not automatically detecting and switching to the native snapshotter

BenTheElder reopened this on Oct 31, 2022
BenTheElder self-assigned this on Oct 31, 2022
BenTheElder changed the title from "kind create cluster crashes" to "[zfs] kind create cluster crashes" on Oct 31, 2022
BenTheElder (Member) commented:

INFO: changing snapshotter from "overlayfs" to "fuse-overlayfs"

(from serial.log for the node)

I think the issue here is specifically with zfs + userns remap, I think I see the bug.
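
(The serial.log here comes from kind export logs run against the retained cluster; it writes one directory per node container, e.g.:)

 $ kind export logs ./kind-logs
 $ less ./kind-logs/kind-control-plane/serial.log   # node entrypoint output, including the snapshotter INFO line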

BenTheElder (Member) commented:

Can you please try without the config file but with --image=kindest/node:v1.25.3@sha256:cb0a32bfb28db72c0c073c39c112bb84c1e4dda35135a03d0b9e49f5547d61a4? #2989
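
That is, the full invocation, mirroring the original repro command:

 $ kind create cluster --retain --image=kindest/node:v1.25.3@sha256:cb0a32bfb28db72c0c073c39c112bb84c1e4dda35135a03d0b9e49f5547d61a4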


jrwren commented Oct 31, 2022

Looks the same.

tmp-1059374674.tar.gz

BenTheElder (Member) commented:

INFO: changing snapshotter from "overlayfs" to "fuse-overlayfs"

Hmm, so it's still doing that, which should mean it has /dev/fuse available, and yet fuse-overlayfs will not work...? 😕

BenTheElder (Member) commented:

I guess we should just always do ZFS => native

Can you check what stat -f -c %T /kind returns in the running node container that kind created?


jrwren commented Oct 31, 2022

$ docker exec -it 4109 bash
root@test-control-plane:/# stat -f -c %T /kind
zfs

BenTheElder (Member) commented:

Trying again with kindest/node:v1.25.3@sha256:cd248d1438192f7814fbca8fede13cfe5b9918746dfa12583976158a834fd5c5 from the updated #2989.

This always forces native mode on ZFS; I think that is the safest bet.

Where possible we want to use overlayfs or fuse-overlayfs (better tested, better production performance), but native is a safe fallback. The other snapshotters don't appear to receive much testing upstream and may have compatibility issues (zfs in particular) in the odd containerd-in-container environment.
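
Roughly, the intended selection order is something like the sketch below. This is only an approximation of what the node image's entrypoint does (the real logic lives in the entrypoint script); the filesystem probe mirrors the stat -f check above:

 # sketch of the snapshotter choice, not the actual kind entrypoint code
 fs="$(stat -f -c %T /kind)"            # filesystem backing the node's container storage
 if [ "$fs" = "zfs" ]; then
   snapshotter="native"                 # ZFS: nested overlay variants are unreliable, force native
 elif [ "$fs" != "overlayfs" ]; then
   snapshotter="overlayfs"              # normal case: plain overlayfs works on ext4/xfs/etc.
 elif [ -e /dev/fuse ]; then
   snapshotter="fuse-overlayfs"         # overlay-on-overlay does not nest; use fuse-overlayfs if /dev/fuse is present
 else
   snapshotter="native"                 # last resort
 fi
 echo "using containerd snapshotter: ${snapshotter}"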


jrwren commented Nov 2, 2022

SUCCESS!

 $ kind create cluster -n test --retain --image=kindest/node:v1.25.3@sha256:cd248d1438192f7814fbca8fede13cfe5b9918746dfa12583976158a834fd5c5
Creating cluster "test" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-test"
You can now use your cluster with:

kubectl cluster-info --context kind-test

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

Thank you.

BenTheElder (Member) commented:

Thanks for confirming! The fix should merge shortly, and future node images will include it.

BenTheElder added this to the v0.18.0 milestone on Nov 2, 2022