etcd and kube-apiserver does not start after incorrect machine shutdown #88574

Closed
mcajkovs opened this issue Feb 26, 2020 · 51 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@mcajkovs

I was told that this is the correct issue tracker for my problem. Previously I posted this issue here.

Environment

mcajkovs@ubuntu:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

mcajkovs@ubuntu:~$ uname -a
Linux ubuntu 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

mcajkovs@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic

mcajkovs@ubuntu:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

Installation and setup

I've installed k8s on my virtual machine (VM) in VMware Workstation using the following steps:

swapoff -a
sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
curl https://docs.projectcalico.org/v3.10/manifests/calico.yaml -O
POD_CIDR="10.244.0.0/16"
sed -i -e "s?192.168.0.0/16?$POD_CIDR?g" calico.yaml
kubectl apply -f calico.yaml


cat << EOF >> /var/lib/kubelet/config.yaml
evictionHard:
  imagefs.available: 1%
  memory.available: 100Mi
  nodefs.available: 1%
  nodefs.inodesFree: 1%
EOF

systemctl daemon-reload
systemctl restart kubelet


cat << EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

systemctl daemon-reload
systemctl restart docker
docker info | grep -i driver
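
A quick sanity check after switching Docker's cgroup driver (a minimal sketch, assuming the kubeadm-default kubelet config locations): the kubelet's cgroup driver has to match Docker's, otherwise the kubelet will fail to run pods after the restart.

sudo grep -i cgroup /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env 2>/dev/null
docker info 2>/dev/null | grep -i "cgroup driver"
# the two drivers should agree (both systemd or both cgroupfs)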

Problem

k8s does not start after boot. As a result I cannot use kubectl to communicate with k8s. I think the main problem is that the apiserver keeps restarting and etcd does not start.

mcajkovs@ubuntu:~$ docker ps -a | grep k8s
f0ba79d60407        41ef50a5f06a                      "kube-apiserver --ad…"   9 seconds ago       Up 8 seconds                                                             k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_133
1f91362f5c91        303ce5db0e90                      "etcd --advertise-cl…"   23 seconds ago      Exited (2) 22 seconds ago                                                k8s_etcd_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_198
1c8c2f27bddb        41ef50a5f06a                      "kube-apiserver --ad…"   46 seconds ago      Exited (2) 24 seconds ago                                                k8s_kube-apiserver_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_132
4b02dcf082bf        f52d4c527ef2                      "kube-scheduler --au…"   About an hour ago   Up About an hour                                                         k8s_kube-scheduler_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
dd9c5e31d7c0        da5fd66c4068                      "kube-controller-man…"   About an hour ago   Up About an hour                                                         k8s_kube-controller-manager_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0
2aa1c151d65b        k8s.gcr.io/pause:3.1              "/pause"                 About an hour ago   Up About an hour                                                         k8s_POD_kube-apiserver-ubuntu_kube-system_3e49883d5c321b4236e7bed14c988ccb_0
98ecfd1e9825        k8s.gcr.io/pause:3.1              "/pause"                 About an hour ago   Up About an hour                                                         k8s_POD_etcd-ubuntu_kube-system_94d759dceed198fa6db05be9ea52a98a_0
284d3f50112a        k8s.gcr.io/pause:3.1              "/pause"                 About an hour ago   Up About an hour                                                         k8s_POD_kube-scheduler-ubuntu_kube-system_9c994ea62a2d8d6f1bb7498f10aa6fcf_0
56f57710e623        k8s.gcr.io/pause:3.1              "/pause"                 About an hour ago   Up About an hour                                                         k8s_POD_kube-controller-manager-ubuntu_kube-system_8482ef84d3b4e5e90f4462818c76a7e9_0

mcajkovs@ubuntu:~$ kubectl get all -A
The connection to the server 192.168.195.130:6443 was refused - did you specify the right host or port?

mcajkovs@ubuntu:~$ journalctl -xeu kubelet
Feb 26 13:19:14 ubuntu kubelet[125589]: E0226 13:19:14.131947  125589 kubelet.go:2263] node "ubuntu" not found
Feb 26 13:19:21 ubuntu kubelet[125589]: E0226 13:19:21.802070  125589 kubelet_node_status.go:92] Unable to register node "ubuntu" with API server: Post https://192.168.195.130:6443/api/v1/nodes: net/http: TLS handshake timeout
Feb 26 13:19:28 ubuntu kubelet[125589]: E0226 13:19:28.684914  125589 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "ubuntu" not found
Feb 26 13:19:31 ubuntu kubelet[125589]: E0226 13:19:31.546593  125589 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://192.168.195.130:6443/api/v1/nodes?fieldSelector=metadata.name%3Dubuntu&limit=500&resourceVersion=0: dial tcp 192.168.195.130:6443: connect: connection refused

I've tried also following with same result:

sudo systemctl stop kubelet
docker ps -a | grep k8s_ | less -S | awk '{print $1}' | while read i; do docker rm $i -f; done
sudo systemctl start kubelet

What did you expect to happen?

k8s should start after the VM boots.

How to reproduce it (as minimally and precisely as possible)?

Shut down the VM incorrectly (e.g. kill the VM process, power off the host machine, etc.) and start the VM again.

Anything else we need to know?

I have NOT observed this problem when I shut down the VM correctly. But if the VM is shut down incorrectly (killed process, etc.) then this happens. If I do kubeadm reset and set up k8s again according to the above steps, then it works.

Content of the /etc/kubernetes/manifests files

etcd.yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.195.130:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.195.130:2380
    - --initial-cluster=ubuntu=https://192.168.195.130:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.195.130:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.195.130:2380
    - --name=ubuntu
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.4.3-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}

kube-apiserver.yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.195.130
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-admission-plugins=NodeRestriction
    - --enable-bootstrap-token-auth=true
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    - --etcd-servers=https://127.0.0.1:2379
    - --insecure-port=0
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --secure-port=6443
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-cluster-ip-range=10.96.0.0/12
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    image: k8s.gcr.io/kube-apiserver:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 192.168.195.130
        path: /healthz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-apiserver
    resources:
      requests:
        cpu: 250m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}

kube-controller-manager.yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=10.244.0.0/16
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --node-cidr-mask-size=24
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/12
    - --use-service-account-credentials=true
    image: k8s.gcr.io/kube-controller-manager:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
status: {}

kube-scheduler.yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: k8s.gcr.io/kube-scheduler:v1.17.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
@mcajkovs mcajkovs added the kind/bug Categorizes issue or PR as related to a bug. label Feb 26, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 26, 2020
@mcajkovs
Author

@mcajkovs: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

* `/sig <group-name>`

* `/wg <group-name>`

* `/committee <group-name>`

Please see the group list for a listing of the SIGs, working groups, and committees available.

Sorry, but how?
I have no gear icon on the right side next to the labels.

@liggitt
Member

liggitt commented Feb 26, 2020

What is the content of the etcd container logs?

/cc @jpbetz

@neolit123
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2020
@mcajkovs
Author

What is the content of the etcd container logs?

/cc @jpbetz

Is this what you want?

mcajkovs@ubuntu:~$ docker ps -a | grep etcd | awk '{print $NF}' | while read i; do docker logs -t $i; done
2020-02-26T13:48:24.613993533Z [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-02-26T13:48:24.614022261Z 2020-02-26 13:48:24.613797 I | etcdmain: etcd Version: 3.4.3
2020-02-26T13:48:24.614026299Z 2020-02-26 13:48:24.613832 I | etcdmain: Git SHA: 3cf2f69b5
2020-02-26T13:48:24.614029070Z 2020-02-26 13:48:24.613835 I | etcdmain: Go Version: go1.12.12
2020-02-26T13:48:24.614031856Z 2020-02-26 13:48:24.613837 I | etcdmain: Go OS/Arch: linux/amd64
2020-02-26T13:48:24.614034475Z 2020-02-26 13:48:24.613840 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-02-26T13:48:24.614037246Z 2020-02-26 13:48:24.613890 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-02-26T13:48:24.614039948Z [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-02-26T13:48:24.614042652Z 2020-02-26 13:48:24.613914 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-02-26T13:48:24.615877713Z 2020-02-26 13:48:24.614428 I | embed: name = ubuntu
2020-02-26T13:48:24.615887178Z 2020-02-26 13:48:24.614451 I | embed: data dir = /var/lib/etcd
2020-02-26T13:48:24.615890051Z 2020-02-26 13:48:24.614481 I | embed: member dir = /var/lib/etcd/member
2020-02-26T13:48:24.615892716Z 2020-02-26 13:48:24.614503 I | embed: heartbeat = 100ms
2020-02-26T13:48:24.615895307Z 2020-02-26 13:48:24.614505 I | embed: election = 1000ms
2020-02-26T13:48:24.615897823Z 2020-02-26 13:48:24.614508 I | embed: snapshot count = 10000
2020-02-26T13:48:24.615900397Z 2020-02-26 13:48:24.614528 I | embed: advertise client URLs = https://192.168.195.130:2379
2020-02-26T13:48:24.615903073Z 2020-02-26 13:48:24.614531 I | embed: initial advertise peer URLs = https://192.168.195.130:2380
2020-02-26T13:48:24.615905720Z 2020-02-26 13:48:24.614536 I | embed: initial cluster =
2020-02-26T13:48:24.619906274Z 2020-02-26 13:48:24.619596 I | etcdserver: recovered store from snapshot at index 470047
2020-02-26T13:48:24.702625099Z 2020-02-26 13:48:24.702261 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
2020-02-26T13:48:24.705993632Z panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
2020-02-26T13:48:24.706007692Z  panic: runtime error: invalid memory address or nil pointer dereference
2020-02-26T13:48:24.706011346Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc2cc4e]
2020-02-26T13:48:24.706014180Z
2020-02-26T13:48:24.706016948Z goroutine 1 [running]:
2020-02-26T13:48:24.706176000Z go.etcd.io/etcd/etcdserver.NewServer.func1(0xc000244f50, 0xc000242f48)
2020-02-26T13:48:24.706183200Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:335 +0x3e
2020-02-26T13:48:24.706298544Z panic(0xed6960, 0xc0002ac360)
2020-02-26T13:48:24.706315784Z  /usr/local/go/src/runtime/panic.go:522 +0x1b5
2020-02-26T13:48:24.706319006Z github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001b8e40, 0x10aeaf5, 0x2a, 0xc000243018, 0x1, 0x1)
2020-02-26T13:48:24.706427145Z  /home/ec2-user/go/pkg/mod/github.com/coreos/pkg@v0.0.0-20160727233714-3ac0863d7acf/capnslog/pkg_logger.go:75 +0x135
2020-02-26T13:48:24.706562923Z go.etcd.io/etcd/etcdserver.NewServer(0x7ffd68579e7e, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc000129200, 0x1, 0x1, 0xc000129380, ...)
2020-02-26T13:48:24.706713866Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:456 +0x42f7
2020-02-26T13:48:24.706720809Z go.etcd.io/etcd/embed.StartEtcd(0xc00016f600, 0xc00016fb80, 0x0, 0x0)
2020-02-26T13:48:24.706913150Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/embed/etcd.go:211 +0x9d0
2020-02-26T13:48:24.706919869Z go.etcd.io/etcd/etcdmain.startEtcd(0xc00016f600, 0x108423e, 0x6, 0x1, 0xc0001d51d0)
2020-02-26T13:48:24.706922776Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
2020-02-26T13:48:24.707078785Z go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
2020-02-26T13:48:24.707085556Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
2020-02-26T13:48:24.707088542Z go.etcd.io/etcd/etcdmain.Main()
2020-02-26T13:48:24.707091151Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/main.go:46 +0x38
2020-02-26T13:48:24.707176060Z main.main()
2020-02-26T13:48:24.707182254Z  /tmp/etcd-release-3.4.3/etcd/release/etcd/main.go:28 +0x20

@neolit123
Member

Is the /var/lib/etcd/member directory present after you have rebooted?
Is there a side process that cleans it up?

@mcajkovs
Author

Is the /var/lib/etcd/member directory present after you have rebooted?
Is there a side process that cleans it up?

Yes, it is present. No, there is no side process that cleans it up. Should I clean it now?

mcajkovs@ubuntu:~$ sudo tree /var/lib/etcd/member
/var/lib/etcd/member
├── snap
│   ├── 0000000000000005-0000000000068fdb.snap
│   ├── 0000000000000005-000000000006b6ec.snap
│   ├── 0000000000000005-000000000006ddfd.snap
│   ├── 0000000000000005-000000000007050e.snap
│   ├── 0000000000000005-0000000000072c1f.snap
│   └── db
└── wal
    ├── 0000000000000000-0000000000000000.wal
    ├── 0000000000000001-000000000001ca85.wal
    ├── 0000000000000002-000000000003bc98.wal
    ├── 0000000000000003-00000000000593ad.wal
    └── 0.tmp

2 directories, 11 files
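
For context (a hedged note based on the standard etcd data-dir layout): wal/ holds the write-ahead log, the snap/*.snap files are periodic raft/v2-store snapshots (one every --snapshot-count=10000 entries with the flags above), and snap/db is the bolt backend holding the actual v3 keyspace. Listing the file sizes can show whether snap/db survived the crash at all:

sudo ls -lh /var/lib/etcd/member/snap /var/lib/etcd/member/wal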

@neolit123
Member

neolit123 commented Feb 26, 2020

Should I clean it now?

no.

Does this cluster have more etcd instances / control-plane nodes? If yes, are you seeing the same issue on those?

@mcajkovs
Author

mcajkovs commented Feb 26, 2020

Should I clean it now?

no.

Does this cluster have more etcd instances / control-plane nodes? If yes, are you seeing the same issue on those?

No, it is a single-node cluster.

@tedyu
Contributor

tedyu commented Feb 26, 2020

From etcd log:

2020-02-26T13:48:24.702625099Z 2020-02-26 13:48:24.702261 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)

It seems the panic is by design if the recovery fails (from server.go):

			if be, err = recoverSnapshotBackend(cfg, be, *snapshot); err != nil {
				cfg.Logger.Panic("failed to recover v3 backend from snapshot", zap.Error(err))
			}

@mcajkovs
Author

I've tried to rename /var/lib/etcd/member to /var/lib/etcd/member.bak and then issue sudo systemctl restart kubelet, but after those steps I get only one service running in the cluster:

mcajkovs@ubuntu:~$ kubectl get all -A
NAMESPACE   NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
default     service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   61s

If I understand correctly, the issue is due to a broken etcd database. Are there any best practices for periodically backing up the etcd database during cluster operation? Or is this type of issue normal when a cluster is incorrectly shut down? What is the best way to avoid this?

@neolit123
Member

I've tried to rename /var/lib/etcd/member to /var/lib/etcd/member.bak and then issue sudo systemctl restart kubelet, but after those steps I get only one service running in the cluster

The folder contains the data of your existing cluster; deleting it would mean data loss.

If I understand correctly, the issue is due to a broken etcd database.

This seems more like a bug in the etcd server, where it cannot restore its previous state.

Are there any best practices for periodically backing up the etcd database during cluster operation?

These docs provide some guidelines:
https://docs.openshift.com/container-platform/4.1/backup_and_restore/backing-up-etcd.html
https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/

Or is this type of issue normal when a cluster is incorrectly shut down? What is the best way to avoid this?

If you had multiple etcd members, it would be possible to restore the overall cluster state from a healthy member.
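
For a single-member kubeadm cluster like this one, a minimal manual backup along the lines of those docs might look as follows (a sketch; paths assume kubeadm defaults, and it assumes an etcdctl binary on the host - otherwise run the same command inside the etcd container with docker exec):

sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /opt/etcd_backups/etcd-snapshot-$(date +%F).db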

@neolit123
Member

Are there any best practices for periodically backing up the etcd database during cluster operation?

If you have used kubeadm upgrade apply on that node, you should have an etcd backup under /etc/kubernetes/.
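
A quick way to check whether such a backup exists (hedged; in recent kubeadm versions the upgrade backups land in timestamped directories under /etc/kubernetes/tmp/):

sudo find /etc/kubernetes -maxdepth 3 -type d -name 'kubeadm-backup-*'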

@fedebongio
Contributor

/assign @jingyih

@k8s-ci-robot
Contributor

@fedebongio: GitHub didn't allow me to assign the following users: jingyih.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @jingyih

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jingyih
Contributor

jingyih commented Feb 28, 2020

Or is this type of issue normal when a cluster is incorrectly shut down?

No, this is not normal. etcd is designed to be recoverable from an unexpected shutdown.

While I am trying to reproduce this issue, could you add "--logger=zap" to the etcd start command (in etcd.yaml) on your side? It will print more info, such as exactly which snapshot file it fails to find.
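
For reference, the change is a single extra flag in the static-pod manifest; the kubelet notices the file change and restarts the etcd pod automatically. A sketch, with the surrounding lines taken from the etcd.yaml posted earlier in this issue:

    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --logger=zap
    image: k8s.gcr.io/etcd:3.4.3-0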

@mcajkovs
Author

mcajkovs commented Mar 9, 2020

Or is this type of issue normal when a cluster is incorrectly shut down?

No, this is not normal. etcd is designed to be recoverable from an unexpected shutdown.

While I am trying to reproduce this issue, could you add "--logger=zap" to the etcd start command (in etcd.yaml) on your side? It will print more info, such as exactly which snapshot file it fails to find.

mcajkovs@ubuntu:~$ docker ps -a | grep etcd | awk '{print $NF}' | while read i; do docker logs -t $i; done

2020-03-09T23:46:27.073177097Z {"level":"warn","ts":"2020-03-09T23:46:27.072Z","caller":"etcdmain/etcd.go:577","msg":"found invalid file under data directory","filename":"member.bak","data-dir":"/var/lib/etcd"}
2020-03-09T23:46:27.073225911Z {"level":"info","ts":"2020-03-09T23:46:27.072Z","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
2020-03-09T23:46:27.074110346Z {"level":"info","ts":"2020-03-09T23:46:27.073Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.195.130:2380"]}
2020-03-09T23:46:27.074126548Z {"level":"info","ts":"2020-03-09T23:46:27.073Z","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2020-03-09T23:46:27.103141751Z {"level":"info","ts":"2020-03-09T23:46:27.078Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.195.130:2379"]}
2020-03-09T23:46:27.103162704Z {"level":"info","ts":"2020-03-09T23:46:27.078Z","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.3","git-sha":"3cf2f69b5","go-version":"go1.12.12","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"ubuntu","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.195.130:2380"],"listen-peer-urls":["https://192.168.195.130:2380"],"advertise-client-urls":["https://192.168.195.130:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.195.130:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
2020-03-09T23:46:27.103174589Z {"level":"info","ts":"2020-03-09T23:46:27.083Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"2.773245ms"}
2020-03-09T23:46:27.103178271Z {"level":"info","ts":"2020-03-09T23:46:27.101Z","caller":"etcdserver/server.go:443","msg":"recovered v2 store from snapshot","snapshot-index":20002,"snapshot-size":"9.7 kB"}
2020-03-09T23:46:27.103181846Z {"level":"info","ts":"2020-03-09T23:46:27.101Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":17478}
2020-03-09T23:46:27.122089841Z {"level":"info","ts":"2020-03-09T23:46:27.120Z","caller":"etcdserver/server.go:461","msg":"recovered v3 backend from snapshot","backend-size-bytes":1724416,"backend-size":"1.7 MB","backend-size-in-use-bytes":790528,"backend-size-in-use":"790 kB"}
2020-03-09T23:46:27.351631324Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"etcdserver/raft.go:506","msg":"restarting local member","cluster-id":"d5716bdb110239ac","local-member-id":"3eae9a550e2e3ec","commit-index":20962}
2020-03-09T23:46:27.351661644Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"raft/raft.go:1530","msg":"3eae9a550e2e3ec switched to configuration voters=(282294822899999724)"}
2020-03-09T23:46:27.351665955Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"raft/raft.go:700","msg":"3eae9a550e2e3ec became follower at term 9"}
2020-03-09T23:46:27.351669305Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"raft/raft.go:383","msg":"newRaft 3eae9a550e2e3ec [peers: [3eae9a550e2e3ec], term: 9, commit: 20962, applied: 20002, lastindex: 20962, lastterm: 9]"}
2020-03-09T23:46:27.351672525Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}
2020-03-09T23:46:27.351675790Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"membership/cluster.go:256","msg":"recovered/added member from store","cluster-id":"d5716bdb110239ac","local-member-id":"3eae9a550e2e3ec","recovered-remote-peer-id":"3eae9a550e2e3ec","recovered-remote-peer-urls":["https://192.168.195.130:2380"]}
2020-03-09T23:46:27.351679600Z {"level":"info","ts":"2020-03-09T23:46:27.350Z","caller":"membership/cluster.go:269","msg":"set cluster version from store","cluster-version":"3.4"}
2020-03-09T23:46:27.353126542Z {"level":"info","ts":"2020-03-09T23:46:27.352Z","caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":17478}
2020-03-09T23:46:27.363544251Z {"level":"warn","ts":"2020-03-09T23:46:27.363Z","caller":"auth/store.go:1317","msg":"simple token is not cryptographically signed"}
2020-03-09T23:46:27.363854919Z {"level":"info","ts":"2020-03-09T23:46:27.363Z","caller":"etcdserver/quota.go:98","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
2020-03-09T23:46:27.364766319Z {"level":"info","ts":"2020-03-09T23:46:27.364Z","caller":"etcdserver/server.go:779","msg":"starting etcd server","local-member-id":"3eae9a550e2e3ec","local-server-version":"3.4.3","cluster-id":"d5716bdb110239ac","cluster-version":"3.4"}
2020-03-09T23:46:27.367125665Z {"level":"info","ts":"2020-03-09T23:46:27.367Z","caller":"embed/etcd.go:708","msg":"starting with client TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/server.crt, key = /etc/kubernetes/pki/etcd/server.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2020-03-09T23:46:27.367321646Z {"level":"info","ts":"2020-03-09T23:46:27.367Z","caller":"embed/etcd.go:241","msg":"now serving peer/client/metrics","local-member-id":"3eae9a550e2e3ec","initial-advertise-peer-urls":["https://192.168.195.130:2380"],"listen-peer-urls":["https://192.168.195.130:2380"],"advertise-client-urls":["https://192.168.195.130:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.195.130:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"]}
2020-03-09T23:46:27.367397331Z {"level":"info","ts":"2020-03-09T23:46:27.367Z","caller":"embed/etcd.go:778","msg":"serving metrics","address":"http://127.0.0.1:2381"}
2020-03-09T23:46:27.367636795Z {"level":"info","ts":"2020-03-09T23:46:27.367Z","caller":"etcdserver/server.go:658","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"3eae9a550e2e3ec","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
2020-03-09T23:46:27.368184733Z {"level":"info","ts":"2020-03-09T23:46:27.368Z","caller":"embed/etcd.go:576","msg":"serving peer traffic","address":"192.168.195.130:2380"}
2020-03-09T23:46:27.452642633Z {"level":"info","ts":"2020-03-09T23:46:27.452Z","caller":"raft/raft.go:923","msg":"3eae9a550e2e3ec is starting a new election at term 9"}
2020-03-09T23:46:27.452806609Z {"level":"info","ts":"2020-03-09T23:46:27.452Z","caller":"raft/raft.go:713","msg":"3eae9a550e2e3ec became candidate at term 10"}
2020-03-09T23:46:27.452984857Z {"level":"info","ts":"2020-03-09T23:46:27.452Z","caller":"raft/raft.go:824","msg":"3eae9a550e2e3ec received MsgVoteResp from 3eae9a550e2e3ec at term 10"}
2020-03-09T23:46:27.453175924Z {"level":"info","ts":"2020-03-09T23:46:27.453Z","caller":"raft/raft.go:765","msg":"3eae9a550e2e3ec became leader at term 10"}
2020-03-09T23:46:27.453254323Z {"level":"info","ts":"2020-03-09T23:46:27.453Z","caller":"raft/node.go:325","msg":"raft.node: 3eae9a550e2e3ec elected leader 3eae9a550e2e3ec at term 10"}
2020-03-09T23:46:27.453822940Z {"level":"info","ts":"2020-03-09T23:46:27.453Z","caller":"etcdserver/server.go:2016","msg":"published local member to cluster through raft","local-member-id":"3eae9a550e2e3ec","local-member-attributes":"{Name:ubuntu ClientURLs:[https://192.168.195.130:2379]}","request-path":"/0/members/3eae9a550e2e3ec/attributes","cluster-id":"d5716bdb110239ac","publish-timeout":"7s"}
2020-03-09T23:46:27.456331369Z {"level":"info","ts":"2020-03-09T23:46:27.455Z","caller":"embed/serve.go:191","msg":"serving client traffic securely","address":"192.168.195.130:2379"}
2020-03-09T23:46:27.462130580Z {"level":"info","ts":"2020-03-09T23:46:27.461Z","caller":"embed/serve.go:191","msg":"serving client traffic securely","address":"127.0.0.1:2379"}

@jingyih
Contributor

jingyih commented Mar 10, 2020

@mcajkovs, the latest etcd log you provided suggests that the etcd server started successfully. The following log entry suggests that there was no snapshot file missing. Has anything changed since the last panic?

2020-03-09T23:46:27.122089841Z {"level":"info","ts":"2020-03-09T23:46:27.120Z","caller":"etcdserver/server.go:461","msg":"recovered v3 backend from snapshot","backend-size-bytes":1724416,"backend-size":"1.7 MB","backend-size-in-use-bytes":790528,"backend-size-in-use":"790 kB"}

@mcajkovs
Author

mcajkovs commented Mar 10, 2020

@mcajkovs, the latest etcd log you provided suggests that the etcd server started successfully. The following log entry suggests that there was no snapshot file missing. Has anything changed since the last panic?

2020-03-09T23:46:27.122089841Z {"level":"info","ts":"2020-03-09T23:46:27.120Z","caller":"etcdserver/server.go:461","msg":"recovered v3 backend from snapshot","backend-size-bytes":1724416,"backend-size":"1.7 MB","backend-size-in-use-bytes":790528,"backend-size-in-use":"790 kB"}

Oh, I'm sorry - I posted the etcd output that was created as a result of renaming /var/lib/etcd/member to /var/lib/etcd/member.bak. Here is the correct output:

2020-03-10T14:50:16.835863774Z {"level":"warn","ts":"2020-03-10T14:50:16.835Z","caller":"etcdmain/etcd.go:577","msg":"found invalid file under data directory","filename":"member.clean.working","data-dir":"/var/lib/etcd"}
2020-03-10T14:50:16.835899894Z {"level":"info","ts":"2020-03-10T14:50:16.835Z","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
2020-03-10T14:50:16.835905088Z {"level":"info","ts":"2020-03-10T14:50:16.835Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.195.130:2380"]}
2020-03-10T14:50:16.835908719Z {"level":"info","ts":"2020-03-10T14:50:16.835Z","caller":"embed/etcd.go:465","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2020-03-10T14:50:16.836537420Z {"level":"info","ts":"2020-03-10T14:50:16.836Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.195.130:2379"]}
2020-03-10T14:50:16.836789296Z {"level":"info","ts":"2020-03-10T14:50:16.836Z","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.3","git-sha":"3cf2f69b5","go-version":"go1.12.12","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"ubuntu","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.195.130:2380"],"listen-peer-urls":["https://192.168.195.130:2380"],"advertise-client-urls":["https://192.168.195.130:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.195.130:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
2020-03-10T14:50:16.837407618Z {"level":"info","ts":"2020-03-10T14:50:16.837Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"351.148µs"}
2020-03-10T14:50:16.838367951Z {"level":"info","ts":"2020-03-10T14:50:16.838Z","caller":"etcdserver/server.go:443","msg":"recovered v2 store from snapshot","snapshot-index":470047,"snapshot-size":"8.2 kB"}
2020-03-10T14:50:16.844626183Z {"level":"warn","ts":"2020-03-10T14:50:16.844Z","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":470047,"snapshot-file-path":"/var/lib/etcd/member/snap/0000000000072c1f.snap.db","error":"snap: snapshot file doesn't exist"}
2020-03-10T14:50:16.844728992Z {"level":"panic","ts":"2020-03-10T14:50:16.844Z","caller":"etcdserver/server.go:454","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:454\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
2020-03-10T14:50:16.848134227Z panic: failed to recover v3 backend from snapshot
2020-03-10T14:50:16.848145231Z 	panic: runtime error: invalid memory address or nil pointer dereference
2020-03-10T14:50:16.848148899Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc2cc4e]
2020-03-10T14:50:16.848151653Z 
2020-03-10T14:50:16.848154298Z goroutine 1 [running]:
2020-03-10T14:50:16.848156991Z go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0003acf50, 0xc0003aaf48)
2020-03-10T14:50:16.848159739Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:335 +0x3e
2020-03-10T14:50:16.848162471Z panic(0xed6960, 0xc000042050)
2020-03-10T14:50:16.848165123Z 	/usr/local/go/src/runtime/panic.go:522 +0x1b5
2020-03-10T14:50:16.848239190Z go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001438c0, 0xc000074180, 0x1, 0x1)
2020-03-10T14:50:16.848250218Z 	/home/ec2-user/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
2020-03-10T14:50:16.848253764Z go.uber.org/zap.(*Logger).Panic(0xc00022e1e0, 0x10ae243, 0x2a, 0xc000074180, 0x1, 0x1)
2020-03-10T14:50:16.848256671Z 	/home/ec2-user/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
2020-03-10T14:50:16.848259559Z go.etcd.io/etcd/etcdserver.NewServer(0x7ffce41d3e71, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc00017f080, 0x1, 0x1, 0xc00017f200, ...)
2020-03-10T14:50:16.848262325Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:454 +0x3c85
2020-03-10T14:50:16.848359067Z go.etcd.io/etcd/embed.StartEtcd(0xc000165080, 0xc000165600, 0x0, 0x0)
2020-03-10T14:50:16.848369650Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/embed/etcd.go:211 +0x9d0
2020-03-10T14:50:16.848372923Z go.etcd.io/etcd/etcdmain.startEtcd(0xc000165080, 0x108423e, 0x6, 0xc00017f901, 0x2)
2020-03-10T14:50:16.848463386Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
2020-03-10T14:50:16.848470509Z go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
2020-03-10T14:50:16.848473433Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:144 +0x2f71
2020-03-10T14:50:16.848476230Z go.etcd.io/etcd/etcdmain.Main()
2020-03-10T14:50:16.848478915Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/main.go:46 +0x38
2020-03-10T14:50:16.848481664Z main.main()
2020-03-10T14:50:16.848484272Z 	/tmp/etcd-release-3.4.3/etcd/release/etcd/main.go:28 +0x20

Here is the content of /etc/kubernetes/manifests/etcd.yaml:

mcajkovs@ubuntu:~$ sudo cat /etc/kubernetes/manifests/etcd.yaml | head -n 40
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.195.130:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.195.130:2380
    - --initial-cluster=ubuntu=https://192.168.195.130:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.195.130:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.195.130:2380
    - --name=ubuntu
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --logger=zap
    image: k8s.gcr.io/etcd:3.4.3-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP

@jingyih
Contributor

jingyih commented Mar 10, 2020

Got it. The relevant log entries are:

2020-03-10T14:50:16.844626183Z {"level":"warn","ts":"2020-03-10T14:50:16.844Z","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":470047,"snapshot-file-path":"/var/lib/etcd/member/snap/0000000000072c1f.snap.db","error":"snap: snapshot file doesn't exist"}
2020-03-10T14:50:16.844728992Z {"level":"panic","ts":"2020-03-10T14:50:16.844Z","caller":"etcdserver/server.go:454","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdserver/server.go:454\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/embed/etcd.go:211\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.3/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}

So etcd is trying to open /var/lib/etcd/member/snap/0000000000072c1f.snap.db but cannot find it. It should be related to the following file (which exists in your data dir): 0000000000000005-0000000000072c1f.snap.
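
A hedged note on where that filename comes from: the v2 snapshot files are named <term>-<index>.snap, while the v3 backend snapshot etcd is looking for here is named <index>.snap.db with the index in 16-digit hex. Snapshot index 470047 is 72c1f in hex, hence the missing file name:

printf '%016x.snap.db\n' 470047    # prints 0000000000072c1f.snap.db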

cc @gyuho Could you also help take a look?

@jingyih
Contributor

jingyih commented Mar 10, 2020

Not sure why it is looking for a snap.db file. This is a single-node etcd cluster.

@jingyih
Contributor

jingyih commented Mar 10, 2020

@mcajkovs, how reproducible is this issue on your VM? Let's say we create a fresh etcd (fresh k8s cluster), and then the VM is shut down unexpectedly (such as due to a power cut). Does it always come to this state where the restarting etcd panics?

@gyuho
Member

gyuho commented Mar 10, 2020

@mcajkovs Did you remove any files in the snap directory? Or they could be old snapshots that have already been discarded on the local node (but the old incoming-snapshot handler should not even reach this code path...).

@kidlj
Contributor

kidlj commented Jun 4, 2020

We encountered the same issue with etcd 3.4.3 after a power cut. Two of the three nodes in the cluster failed to recover, with the following message:

aller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"3.557338ms"}
aller":"etcdserver/server.go:443","msg":"recovered v2 store from snapshot","snapshot-index":3700038,"snapshot-size":"11 kB"}
aller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":3700038,"snapshot-file-path":"/var/lib/etcd/member/snap/0000000000387546.snap.db","error":"snap: snapshot file doesn't exist

The snap dir of the failed nodes:

[root@portal30 ~]# ls -lh /var/lib/etcd/member/snap/
total 5.7M
-rw-r--r-- 1 etcd users  11K Jun  2 12:20 0000000000000e6c-0000000000325ac2.snap
-rw-r--r-- 1 etcd users  11K Jun  2 21:48 0000000000000e8f-000000000033e163.snap
-rw-r--r-- 1 etcd users  11K Jun  3 07:15 0000000000000e8f-0000000000356804.snap
-rw-r--r-- 1 etcd users  11K Jun  3 16:46 0000000000000e92-000000000036eea5.snap
-rw-r--r-- 1 etcd users  11K Jun  4 02:22 0000000000000ead-0000000000387546.snap
-rw------- 1 etcd users 5.6M Jun  4 17:30 db

And one of them can recover, but it can't reach consensus without one more healthy node.

Here's its log:

aller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.3","git-sha":"3cf2f69b5","go-version":"go1.12.12","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member
aller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"3.755632ms"}
aller":"etcdserver/server.go:443","msg":"recovered v2 store from snapshot","snapshot-index":3700038,"snapshot-size":"11 kB"}
aller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3003230}
aller":"etcdserver/server.go:461","msg":"recovered v3 backend from snapshot","backend-size-bytes":5865472,"backend-size":"5.9 MB","backend-size-in-use-bytes":2039808,"backend-size-in-use":"2.0 MB"}
aller":"etcdserver/raft.go:602","msg":"forcing restart member","cluster-id":"8ed9f373da49715c","local-member-id":"18b8a647743bc74a","commit-index":3796121}
aller":"raft/raft.go:1530","msg":"18b8a647743bc74a switched to configuration voters=(1781356478447994698 7788609191025002950 11580190501588852091)"}
aller":"raft/raft.go:700","msg":"18b8a647743bc74a became follower at term 11928"}
aller":"raft/raft.go:383","msg":"newRaft 18b8a647743bc74a [peers: [18b8a647743bc74a,6c16b2cb1d6349c6,a0b519e81e8e5d7b], term: 11928, commit: 3796121, applied: 3700038, lastindex: 3796121, lastterm: 11928]"}
aller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}
aller":"membership/cluster.go:256","msg":"recovered/added member from store","cluster-id":"8ed9f373da49715c","local-member-id":"18b8a647743bc74a","recovered-remote-peer-id":"18b8a647743bc74a","recovered-remot
aller":"membership/cluster.go:256","msg":"recovered/added member from store","cluster-id":"8ed9f373da49715c","local-member-id":"18b8a647743bc74a","recovered-remote-peer-id":"6c16b2cb1d6349c6","recovered-remot
aller":"membership/cluster.go:256","msg":"recovered/added member from store","cluster-id":"8ed9f373da49715c","local-member-id":"18b8a647743bc74a","recovered-remote-peer-id":"a0b519e81e8e5d7b","recovered-remot
aller":"membership/cluster.go:269","msg":"set cluster version from store","cluster-version":"3.4"}

and its snap directory tree:

# ls -l /var/lib/etcd/member/snap/
total 5788
-rw-r--r-- 1 root root   11220 Jun  4 17:08 0000000000000e6c-0000000000325ac2.snap
-rw-r--r-- 1 root root   11219 Jun  4 17:08 0000000000000e8f-000000000033e163.snap
-rw-r--r-- 1 root root   11220 Jun  4 17:08 0000000000000e8f-0000000000356804.snap
-rw-r--r-- 1 root root   11220 Jun  4 17:08 0000000000000e92-000000000036eea5.snap
-rw-r--r-- 1 root root   11219 Jun  4 17:08 0000000000000ead-0000000000387546.snap
-rw------- 1 root root 5865472 Jun  4 17:08 db

@kidlj
Contributor

kidlj commented Jun 4, 2020

Since one of the nodes could recover, we added --force-new-cluster to it to form a new cluster from its old snaps, and then added the other two nodes back.

I'm wondering why it can recover with the same(?) snap db but the other two can't. The recovery process seems to be looking for a snap.db file which doesn't exist on any node.


func (s *Snapshotter) dbFilePath(id uint64) string {
	return filepath.Join(s.dir, fmt.Sprintf("%016x.snap.db", id))
}
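
For anyone else landing here, a rough sketch of the recovery path described above (assumptions: a systemd-managed etcd, placeholder host and member names, and a copy of every data dir taken before touching anything):

# on the one member that still starts:
sudo systemctl stop etcd
sudo cp -a /var/lib/etcd /var/lib/etcd.bak
# temporarily add --force-new-cluster to its etcd flags, start it, confirm it is healthy,
# then remove the flag again and restart normally.
# on each broken member: move the stale data dir aside and re-join the cluster:
ETCDCTL_API=3 etcdctl --endpoints=https://<healthy-node>:2379 \
  --cacert=<ca.crt> --cert=<client.crt> --key=<client.key> \
  member add <member-name> --peer-urls=https://<broken-node>:2380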

@kidlj
Contributor

kidlj commented Jun 4, 2020

Kindly ping @gyuho @jingyih

@n0rad

n0rad commented Aug 22, 2020

Same here after a power loss: one node of my 3-node cluster cannot start, with the same panic.
I still have the pod/data in case you want to extract some info.

@llhuii

llhuii commented Sep 25, 2020

Same problem here with a single-node cluster running etcd:3.3.10 and k8s v1.15.2.
After switching to etcd 3.4.3, a similar log appears:

{"level":"warn","ts":"2020-09-25T06:40:27.621Z","caller":"snap/db.go:92","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":55435591,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000034de147.snap.db","error":"snap: snapshot file doesn't exist"}

@llhuii

llhuii commented Sep 30, 2020

Can we recover part of the data? I really don't want to reinstall the environment, including other k8s-related projects, which took me many weeks.
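
One avenue that may be worth trying before a full reinstall (a hedged sketch; copy the data dir first, and note that --skip-hash-check is required because member/snap/db was not produced by "etcdctl snapshot save"):

sudo cp -a /var/lib/etcd /var/lib/etcd.bak
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd.bak/member/snap/db \
  --skip-hash-check \
  --name <node-name> \
  --initial-cluster <node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir /var/lib/etcd.restored
# if the restore succeeds, move /var/lib/etcd.restored into place as /var/lib/etcd
# (the --data-dir used by the etcd manifest) and restart the kubelet.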

@Gabisonfire

Gabisonfire commented Sep 30, 2020

I ended up doing a scheduled job as a workaround in the meantime, with something like this to back up every 6 hours and keep 30 days of backups:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: backup
  namespace: kube-system
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - args:
            - -c
            - etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
              snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db
            command:
            - /bin/sh
            env:
            - name: ETCDCTL_API
              value: "3"
            image: k8s.gcr.io/etcd:3.4.3-0
            imagePullPolicy: IfNotPresent
            name: backup
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          - args:
            - -c
            - find /backup -type f -mtime +30 -exec rm -f {} \;
            command:
            - /bin/sh
            env:
            - name: ETCDCTL_API
              value: "3"
            image: k8s.gcr.io/etcd:3.4.3-0
            imagePullPolicy: IfNotPresent
            name: cleanup
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /backup
              name: backup
          dnsPolicy: ClusterFirst
          hostNetwork: true
          nodeName: YOUR_MASTER_NODE_NAME
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
            name: etcd-certs
          - hostPath:
              path: /opt/etcd_backups
              type: DirectoryOrCreate
            name: backup
  schedule: 0 */6 * * *
  successfulJobsHistoryLimit: 3
  suspend: false
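
The matching restore step would be roughly (a sketch, assuming the kubeadm layout used earlier in this thread): move the manifests out of /etc/kubernetes/manifests to stop the static pods, restore the chosen snapshot into a fresh data dir using the same --name and peer URL values as in the etcd manifest, swap it into place, and move the manifests back:

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_backups/etcd-snapshot-<timestamp>.db \
  --name <node-name> \
  --initial-cluster <node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir /var/lib/etcd.restored
sudo mv /var/lib/etcd /var/lib/etcd.old && sudo mv /var/lib/etcd.restored /var/lib/etcd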

@evanrich

evanrich commented Oct 23, 2020

I ended up doing a scheduled job as a workaround in the meantime, with something like this to back up every 6 hours and keep 30 days of backups: (see the manifest above)

This is great! I tried running it but the output file name didn't have the date:
etcd-snapshot-.db

any ideas why?

Edit: just ran a test and watched the logs; it seems the date command is not found:

/bin/sh: date: command not found
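
One possible workaround (a sketch, not something verified in this thread): the /bin/sh in the k8s.gcr.io/etcd image apparently has no date binary, so the timestamp can be generated on the host instead, e.g. by running the backup from the master's own crontab. This assumes etcdctl is installed on the host and reuses the kubeadm certificate paths and backup directory from the manifest above:

#!/bin/sh
# take a timestamped snapshot using the host's date, then prune snapshots older than 30 days
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /opt/etcd_backups/etcd-snapshot-$(date +%Y-%m-%d_%H-%M-%S).db
find /opt/etcd_backups -type f -name 'etcd-snapshot-*.db' -mtime +30 -delete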

@mamiapatrick
Copy link

I installed Kubernetes through RKE and the 3 etcd nodes of my k8s cluster are OK, but the etcd server on the Rancher host (rancher/rancher:v2.3.2) has this problem. All kubectl commands work and I can connect to the API, but the Rancher interface is down.
Has anyone fixed this issue? There are many threads on this, but no real solutions!!!

@jar349
Copy link

jar349 commented Dec 13, 2020

This happened to me as well, and I had to go through these steps and then reconfigure to use the new, restored etcd cluster.

@dims
Copy link
Member

dims commented Dec 13, 2020

@mamiapatrick For Rancher problems, please contact Rancher; for etcd problems, please contact the etcd community. Thanks!

@chief93
Copy link

chief93 commented Dec 25, 2020

Encountered the same problem after a power-down of my host machine. It's weird: kube-apiserver does not come up on IPv4 (0.0.0.0:6443), only on IPv6 (:::6443), and then fails too.
[screenshot of the failing kube-apiserver attached in the original comment]

@chief93
Copy link

chief93 commented Dec 25, 2020

Seems like I've found the reason for this behavior: the apiserver cannot start because it can't connect to etcd, which in turn cannot start because its data was corrupted by the power loss. Here is a relevant issue response I found while googling the problem: etcd-io/etcd#10722 (comment).

The main problem is that the corrupted node was the master node, which made the suggested "remove the node" approach impossible to perform.
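
A quick way to confirm this chain on a kubeadm node (a rough sketch; the exact commands depend on your container runtime) is to read the etcd static pod's container logs directly, since kubectl is unusable while the apiserver is down:

docker ps -a | grep etcd                    # or: crictl ps -a | grep etcd
docker logs --tail 50 <etcd-container-id>   # or: crictl logs <container-id>
# a data directory corrupted by the power loss typically ends with the
# "snapshot file doesn't exist" panic quoted earlier in this thread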

@deeco
Copy link

deeco commented Jan 6, 2021

Same issue after a power-down. Can it be restored to a point in time using the .snap and .wal files, or should I just destroy the cluster?

etcd-io/etcd#11949 (comment)

@chief93
Copy link

chief93 commented Jan 6, 2021

@deeco Unfortunately, in my case I needed to set up my whole cluster from scratch, because I had only 1 etcd member running (set up by kubeadm) and therefore could not dump the etcd database from any node other than the master.

@deeco
Copy link

deeco commented Jan 6, 2021

Same here. Not ideal, and I will have to reimplement everything :(

@oldthreefeng
Copy link

Same issue with CentOS 7.7.1908. etcd and the apiserver restart many times; the logs look like:

 I | etcdserver: recovered store from snapshot at index 164056893
2021-02-07 06:32:51.520022 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xb8cb90]

As a temporary workaround:

$ mv /var/lib/etcd/member  /var/lib/etcd/member.bak
$ systemctl restart kubelet

After that, you'll find the cluster is effectively wiped clean; only the static pods remain:

$ kubectl get pod -A
NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   etcd-kube11                      1/1     Running   1759       83s
kube-system   kube-apiserver-kube11            1/1     Running   1653       94s
kube-system   kube-controller-manager-kube11   1/1     Running   2          97s
kube-system   kube-scheduler-kube11            1/1     Running   2          90s
kube-system   kube-sealyun-lvscare-kube12      1/1     Running   2          115s
kube-system   kube-sealyun-lvscare-kube13      1/1     Running   2          95s

$ cat /etc/redhat-release 
CentOS Linux release 7.7.1908 (Core)
$ uname -a
Linux kube11 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Tip: take backups while the cluster is still healthy!! Take backups while the cluster is still healthy!!
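
If a snapshot taken while the cluster was still healthy exists (for example from the backup CronJob earlier in this thread), it can be restored instead of starting from an empty member. A rough sketch for a kubeadm single-master setup; the file name and node placeholders are not from this issue, and the restore flags should match the etcd static pod manifest:

mv /var/lib/etcd /var/lib/etcd.broken
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_backups/etcd-snapshot-<TIMESTAMP>.db \
  --name=<NODE_NAME> \
  --initial-cluster=<NODE_NAME>=https://<NODE_IP>:2380 \
  --initial-advertise-peer-urls=https://<NODE_IP>:2380 \
  --data-dir=/var/lib/etcd
systemctl restart kubelet   # the static etcd pod picks up the restored data dir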

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2021
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 7, 2021
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MOHAMMEDSADIQ-infrrd
Copy link

Why is this issue closed? No concrete solution has been found yet. I am facing the same issue today.

@BedivereZero
Copy link
Contributor

Same question: how do you recover data when the etcd cluster has only one node?

@sniperking1234
Copy link

Same question. Is there a solution now?

@vivisidea
Copy link

It's been many years; here is my solution, I hope it can help someone (rough commands are sketched below):

  1. remove broken etcd node from etcd cluster
  2. remove data dir of the broken node
  3. re-add node as a new member to the etcd cluster

follow the instructions here: https://etcd.io/docs/v3.5/tutorials/how-to-deal-with-membership/
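
A rough command-level sketch of those steps (node names, the endpoint and the certificate paths are placeholders assuming a 3-member kubeadm cluster where the broken member is node3):

export ETCDCTL_API=3
ENDPOINT=https://node1:2379   # any healthy member
CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

# 1. from a healthy member, find and remove the broken one
etcdctl --endpoints=$ENDPOINT $CERTS member list          # note the broken member's ID
etcdctl --endpoints=$ENDPOINT $CERTS member remove <MEMBER_ID>

# 2. on the broken node, move its corrupted data directory out of the way
mv /var/lib/etcd/member /var/lib/etcd/member.bak

# 3. re-register the node, then start its etcd with --initial-cluster-state=existing
etcdctl --endpoints=$ENDPOINT $CERTS member add node3 --peer-urls=https://node3:2380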

@MuhaMMadHammD
Copy link

Still facing this issue on a RHEL 8.3 machine.
I had set up a single-node cluster using kubeadm.
Does anyone have a proper solution?
Thanks
