
kube-controller-manager and kube-scheduler restarting in a clean installation #88111

Open · cja000 opened this issue Feb 13, 2020 · 8 comments

@cja000 commented Feb 13, 2020

What happened:
Just after installation, the controller and scheduler begin to restart.

What you expected to happen:
I expect a stable Kubernetes installation without restarts.

How to reproduce it (as minimally and precisely as possible):
I only see this problem in our "production" environment. Testing the same procedure with VirtualBox, even on a laptop, I cannot reproduce it.

Anything else we need to know?:
etcd logs many "took too long" warnings, with values over 15 seconds (e.g. took too long (18.015145443s) to execute)

...
2020-02-13 09:25:04.688572 I | embed: listening for peers on 172.16.1.1:2380
2020-02-13 09:25:04.688761 I | embed: listening for metrics on http://127.0.0.1:2381
raft2020/02/13 09:25:04 INFO: a0e08efea825151f is starting a new election at term 1
raft2020/02/13 09:25:04 INFO: a0e08efea825151f became candidate at term 2
raft2020/02/13 09:25:04 INFO: a0e08efea825151f received MsgVoteResp from a0e08efea825151f at term 2
raft2020/02/13 09:25:04 INFO: a0e08efea825151f became leader at term 2
raft2020/02/13 09:25:04 INFO: raft.node: a0e08efea825151f elected leader a0e08efea825151f at term 2
2020-02-13 09:25:04.781152 I | etcdserver: setting up the initial cluster version to 3.4
2020-02-13 09:25:04.781527 N | etcdserver/membership: set the initial cluster version to 3.4
2020-02-13 09:25:04.781624 I | etcdserver: published {Name:ml-k8s-master ClientURLs:[https://172.16.1.1:2379]} to cluster 50f4e72624d098cd
2020-02-13 09:25:04.781657 I | embed: ready to serve client requests
2020-02-13 09:25:04.781729 I | etcdserver/api: enabled capabilities for version 3.4
2020-02-13 09:25:04.781814 I | embed: ready to serve client requests
2020-02-13 09:25:04.786124 I | embed: serving client requests on 127.0.0.1:2379
2020-02-13 09:25:04.786821 I | embed: serving client requests on 172.16.1.1:2379
2020-02-13 09:25:17.472409 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000024395s) to execute
WARNING: 2020/02/13 09:25:17 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:19.047993 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000018721s) to execute
WARNING: 2020/02/13 09:25:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:19.572311 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (1.999980608s) to execute
WARNING: 2020/02/13 09:25:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:21.547985 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000023755s) to execute
WARNING: 2020/02/13 09:25:21 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:21.672315 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.00004148s) to execute
WARNING: 2020/02/13 09:25:21 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:22.472452 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-13 09:25:23.772362 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000063609s) to execute
WARNING: 2020/02/13 09:25:23 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:24.048026 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000065812s) to execute
WARNING: 2020/02/13 09:25:24 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:25.872341 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000038869s) to execute
WARNING: 2020/02/13 09:25:25 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:26.547981 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000024442s) to execute
WARNING: 2020/02/13 09:25:26 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:27.972340 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000071169s) to execute
WARNING: 2020/02/13 09:25:27 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:29.047995 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000056661s) to execute
WARNING: 2020/02/13 09:25:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:29.287420 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (18.015145443s) to execute
2020-02-13 09:25:29.287479 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (19.7395452s) to execute
2020-02-13 09:25:29.287497 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (20.1151665s) to execute
2020-02-13 09:25:29.287523 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:0 size:4" took too long (16.314794334s) to execute
2020-02-13 09:25:29.287568 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (14.739618835s) to execute
2020-02-13 09:25:29.287615 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (15.915336418s) to execute
2020-02-13 09:25:29.287747 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (17.239620012s) to execute
...
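A quick way to pull the slow-request tail out of a saved copy of a log like the one above (the grep pattern matches the "took too long (…s)" wording shown; `etcd.log` is just an example file name):

```shell
# Save the etcd pod log once, e.g.:
#   kubectl logs -n kube-system etcd-ml-k8s-master > etcd.log
# Then list the five slowest requests, in seconds:
grep -o 'took too long ([0-9.]*s)' etcd.log \
  | tr -d 'a-z() ' \
  | sort -n \
  | tail -5
```

The `tr` strips everything except the digits and the decimal point, so the durations can be sorted numerically.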

Environment:

  • Kubernetes version (use kubectl version):
podml@ml-k8s-master:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    • Memory
podml@ml-k8s-master:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          125Gi       1.3Gi       119Gi       2.0Mi       4.4Gi       123Gi
Swap:            0B          0B          0B
  • CPU
podml@ml-k8s-master:~$ lscpu 
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              1
Core(s) per socket:              12
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:                        2
CPU MHz:                         1830.038
CPU max MHz:                     3300.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4999.56
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        60 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc a
                                 rch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dc
                                 a sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnm
                                 i flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
  • OS (e.g: cat /etc/os-release):
podml@ml-k8s-master:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="19.10 (Eoan Ermine)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.10"
VERSION_ID="19.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=eoan
UBUNTU_CODENAME=eoan
  • Kernel (e.g. uname -a):
podml@ml-k8s-master:~$ uname -a
Linux ml-k8s-master 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:27:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Installed with kubeadm
podml@ml-k8s-master:~$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=172.16.1.1
  • Network plugin and version (if this is a network-related bug):
    There are two interfaces: eno1 (external access) and enp129s0f0 (kubernetes network)
podml@ml-k8s-master:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.41.60.1      0.0.0.0         UG    0      0        0 eno1.1500
10.41.60.0      0.0.0.0         255.255.255.0   U     0      0        0 eno1.1500
172.16.1.0      0.0.0.0         255.255.255.0   U     0      0        0 enp129s0f0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
  • Others:
    I have used exactly the same kubeadm init command on the "production" master as on the Virtual Box master, but I get a different IP in the get pods output
podml@ml-k8s-master:~$ kubectl get pods -o wide -A
NAMESPACE     NAME                                    READY   STATUS              RESTARTS   AGE    IP           NODE            NOMINATED NODE   READINESS GATES
kube-system   coredns-6955765f44-2dx9q                0/1     ContainerCreating   0          112m   <none>       ml-k8s-master   <none>           <none>
kube-system   coredns-6955765f44-8fw48                0/1     ContainerCreating   0          112m   <none>       ml-k8s-master   <none>           <none>
kube-system   etcd-ml-k8s-master                      1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-apiserver-ml-k8s-master            1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-controller-manager-ml-k8s-master   0/1     CrashLoopBackOff    17         112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-proxy-lpnzh                        1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-scheduler-ml-k8s-master            0/1     CrashLoopBackOff    17         112m   172.16.1.1   ml-k8s-master   <none>           <none>

"Production" (restarting) IP --> 172.16.1.1 (kubernetes network)
Virtual Box (stable) IP --> 10.0.2.15 (external IP)

Disk performance
I have followed "Using Fio to Tell Whether Your Storage is Fast Enough for Etcd"

podml@ml-k8s-master:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=10421: Thu Feb 13 12:49:39 2020
  write: IOPS=8660, BW=18.0MiB/s (19.9MB/s)(21.0MiB/1158msec); 0 zone resets
    clat (nsec): min=1862, max=2905.0k, avg=31388.68, stdev=40146.89
     lat (nsec): min=1978, max=2905.2k, avg=31539.73, stdev=40153.10
    clat percentiles (nsec):
     |  1.00th=[  1928],  5.00th=[  1976], 10.00th=[  2008], 20.00th=[  2160],
     | 30.00th=[  2544], 40.00th=[  4832], 50.00th=[ 44288], 60.00th=[ 45824],
     | 70.00th=[ 47360], 80.00th=[ 50432], 90.00th=[ 61696], 95.00th=[ 67072],
     | 99.00th=[ 91648], 99.50th=[105984], 99.90th=[240640], 99.95th=[268288],
     | 99.99th=[522240]
   bw (  KiB/s): min=19284, max=19621, per=100.00%, avg=19452.50, stdev=238.29, samples=2
   iops        : min= 8586, max= 8736, avg=8661.00, stdev=106.07, samples=2
  lat (usec)   : 2=7.73%, 4=28.52%, 10=6.85%, 20=0.74%, 50=35.15%
  lat (usec)   : 100=20.33%, 250=0.61%, 500=0.06%, 750=0.01%
  lat (msec)   : 4=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=44, max=1411, avg=81.69, stdev=24.78
    sync percentiles (usec):
     |  1.00th=[   62],  5.00th=[   66], 10.00th=[   67], 20.00th=[   69],
     | 30.00th=[   72], 40.00th=[   77], 50.00th=[   80], 60.00th=[   82],
     | 70.00th=[   85], 80.00th=[   90], 90.00th=[  100], 95.00th=[  113],
     | 99.00th=[  143], 99.50th=[  159], 99.90th=[  239], 99.95th=[  277],
     | 99.99th=[ 1074]
  cpu          : usr=7.26%, sys=13.31%, ctx=21301, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=18.0MiB/s (19.9MB/s), 18.0MiB/s-18.0MiB/s (19.9MB/s-19.9MB/s), io=21.0MiB (23.1MB), run=1158-1158msec

Disk stats (read/write):
    dm-0: ios=5578/11648, merge=0/0, ticks=204/984, in_queue=1188, util=92.46%, aggrios=5632/11109, aggrmerge=0/680, aggrticks=244/960, aggrin_queue=0, aggrutil=91.58%
  sda: ios=5632/11109, merge=0/680, ticks=244/960, in_queue=0, util=91.58%
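For etcd, the number that matters most in the run above is the fdatasync tail: the commonly cited guideline is a 99th-percentile fdatasync latency under 10ms. A rough way to pull that value (here in usec) out of a saved copy of fio's human-readable output; treat this as a sketch, since fio's column layout is not a stable interface:

```shell
# Save the fio run once:
#   fio --rw=write --ioengine=sync --fdatasync=1 ... > fio.out
# Then grab the 99.00th value from the "sync percentiles" section.
awk '/sync percentiles/ {insync=1}
     insync && /99.00th/ {gsub(/[^0-9]/, "", $3); print $3 " usec"; exit}' fio.out
```

For the output above this gives 143 usec, far below the 10ms guideline, which is why the disk looks healthy despite the multi-second etcd warnings.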

I have another k8s cluster running on a similar machine with Ubuntu 16.04.4 and k8s 1.14.2 without problems, but I need to upgrade to Ubuntu 19.10

@cja000 (Author) commented Feb 13, 2020

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node and removed needs-sig labels Feb 13, 2020
@neolit123 (Member) commented Feb 14, 2020

looks like you logged a separate issue after our discussion here:
#74340 (comment)

this seems like an isolated case related to your setup (or just something rare).

here you mentioned that you have tried a number of different CNIs, but if that is true why are the CNI pods missing in your kubectl get po output?

2020-02-13 09:25:29.287747 W | etcdserver: read-only range request "key:"/registry/health" " with result "range_response_count:0 size:4" took too long (17.239620012s) to execute

tagging as api-machinery because they own the etcd implementation.

/remove-sig node
/sig api-machinery
/remove-kind bug
/triage unresolved

@cja000 (Author) commented Feb 14, 2020

Yes, I created this new issue.
I think it is something related to the environment, but I cannot get any clue from the pod logs.

podml@ml-k8s-master:~$ ./get_logs.sh 
Leader election lost: 08:59:02
Logs from 08:58 to 08:59
--------------------
POD: coredns-6955765f44-jrrl8
--------------------
--------------------
POD: coredns-6955765f44-q9bkm
--------------------
--------------------
POD: etcd-ml-k8s-master
--------------------
2020-02-14 08:58:58.181154 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000028245s) to execute
WARNING: 2020/02/14 08:58:58 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:02.300882 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "error:context canceled" took too long (4.363348979s) to execute
WARNING: 2020/02/14 08:59:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:03.181298 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-14 08:59:05.456452 W | etcdserver: failed to revoke 151f703f410e8b55 ("etcdserver: request timed out")
2020-02-14 08:59:07.579251 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/kube-apiserver-ml-k8s-master.15f30131f321ff36\" " with result "error:context canceled" took too long (9.396316674s) to execute
WARNING: 2020/02/14 08:59:07 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:08.181033 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000042121s) to execute
WARNING: 2020/02/14 08:59:08 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:10.181537 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-14 08:59:10.181647 W | etcdserver: read-only range request "key:\"/registry/poddisruptionbudgets\" range_end:\"/registry/poddisruptionbudgett\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.223919113s) to execute
2020-02-14 08:59:10.181688 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims\" range_end:\"/registry/persistentvolumeclaimt\" count_only:true " with result "error:etcdserver: request timed out" took too long (9.774878581s) to execute
2020-02-14 08:59:10.181735 W | etcdserver: read-only range request "key:\"/registry/services/specs\" range_end:\"/registry/services/spect\" count_only:true " with result "error:etcdserver: request timed out" took too long (11.453872193s) to execute
2020-02-14 08:59:10.181823 W | etcdserver: read-only range request "key:\"/registry/replicasets\" range_end:\"/registry/replicasett\" count_only:true " with result "error:etcdserver: request timed out" took too long (9.248831403s) to execute
2020-02-14 08:59:10.569762 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:173" took too long (17.164415984s) to execute
2020-02-14 08:59:10.569826 W | etcdserver: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:179" took too long (17.125888475s) to execute
2020-02-14 08:59:10.569858 W | etcdserver: read-only range request "key:\"/registry/ingress\" range_end:\"/registry/ingrest\" count_only:true " with result "range_response_count:0 size:6" took too long (17.087557233s) to execute
2020-02-14 08:59:10.569942 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:440" took too long (18.31016043s) to execute
2020-02-14 08:59:10.570036 W | etcdserver: request "header:<ID:1522058616051239826 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/leases/kube-node-lease/ml-k8s-master\" mod_revision:61114 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ml-k8s-master\" value_size:240 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ml-k8s-master\" > >>" with result "size:18" took too long (16.995753273s) to execute
2020-02-14 08:59:10.570194 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:458" took too long (18.315828398s) to execute
2020-02-14 08:59:10.570389 W | etcdserver: failed to revoke 151f703f410e8b55 ("lease not found")
2020-02-14 08:59:10.570494 W | etcdserver: read-only range request "key:\"/registry/networkpolicies\" range_end:\"/registry/networkpoliciet\" count_only:true " with result "range_response_count:0 size:6" took too long (5.893007021s) to execute
2020-02-14 08:59:10.570539 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/kube-apiserver-ml-k8s-master.15f30131f321ff36\" " with result "range_response_count:1 size:517" took too long (2.802178524s) to execute
2020-02-14 08:59:10.570643 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " with result "range_response_count:0 size:8" took too long (1.762697427s) to execute
2020-02-14 08:59:10.570687 W | etcdserver: read-only range request "key:\"/registry/mutatingwebhookconfigurations\" range_end:\"/registry/mutatingwebhookconfigurationt\" count_only:true " with result "range_response_count:0 size:6" took too long (1.729756006s) to execute
2020-02-14 08:59:10.570772 W | etcdserver: read-only range request "key:\"/registry/cronjobs\" range_end:\"/registry/cronjobt\" count_only:true " with result "range_response_count:0 size:6" took too long (4.58903663s) to execute
2020-02-14 08:59:10.570811 W | etcdserver: read-only range request "key:\"/registry/validatingwebhookconfigurations\" range_end:\"/registry/validatingwebhookconfigurationt\" count_only:true " with result "range_response_count:0 size:6" took too long (5.835117719s) to execute
2020-02-14 08:59:10.570924 W | etcdserver: read-only range request "key:\"/registry/deployments\" range_end:\"/registry/deploymentt\" count_only:true " with result "range_response_count:0 size:8" took too long (4.027499803s) to execute
2020-02-14 08:59:15.884380 I | mvcc: store.index: compact 60546
2020-02-14 08:59:15.885605 I | mvcc: finished scheduled compaction at 60546 (took 773.326µs)
--------------------
POD: kube-apiserver-ml-k8s-master
--------------------
I0214 08:59:00.574743       1 trace.go:116] Trace[1983671341]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:58:53.573418987 +0000 UTC m=+61541.537589897) (total time: 7.001283601s):
E0214 08:59:00.574798       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0214 08:59:00.575103       1 trace.go:116] Trace[1397886959]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:58:53.573269824 +0000 UTC m=+61541.537440733) (total time: 7.00179831s):
E0214 08:59:02.253200       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.253515       1 trace.go:116] Trace[1490658799]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.17.3 (linux/amd64) kubernetes/06ad960/leader-election,client:172.16.1.1 (started: 2020-02-14 08:58:52.253641269 +0000 UTC m=+61540.217812172) (total time: 9.99983788s):
E0214 08:59:02.259008       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.259280       1 trace.go:116] Trace[769769017]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.17.3 (linux/amd64) kubernetes/06ad960/leader-election,client:172.16.1.1 (started: 2020-02-14 08:58:52.259378942 +0000 UTC m=+61540.223549852) (total time: 9.999865213s):
E0214 08:59:02.272275       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"client disconnected"}
I0214 08:59:02.300452       1 trace.go:116] Trace[427744836]: "List etcd3" key:/jobs,resourceVersion:,limit:500,continue: (started: 2020-02-14 08:58:57.937112414 +0000 UTC m=+61545.901283317) (total time: 4.363288468s):
E0214 08:59:02.300509       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.300821       1 trace.go:116] Trace[1862767867]: "List" url:/apis/batch/v1/jobs,user-agent:kube-controller-manager/v1.17.3 (linux/amd64) kubernetes/06ad960/system:serviceaccount:kube-system:cronjob-controller,client:172.16.1.1 (started: 2020-02-14 08:58:57.93707678 +0000 UTC m=+61545.901247690) (total time: 4.363709696s):
I0214 08:59:07.577808       1 trace.go:116] Trace[2008566106]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:59:00.576455082 +0000 UTC m=+61548.540625988) (total time: 7.001310614s):
E0214 08:59:07.577870       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0214 08:59:07.578206       1 trace.go:116] Trace[1041096166]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:00.576316612 +0000 UTC m=+61548.540487514) (total time: 7.001825853s):
I0214 08:59:07.579050       1 trace.go:116] Trace[578086289]: "GuaranteedUpdate etcd3" type:*core.Event (started: 2020-02-14 08:58:58.182446016 +0000 UTC m=+61546.146616925) (total time: 9.396569596s):
E0214 08:59:07.579089       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:07.579364       1 trace.go:116] Trace[1435648382]: "Patch" url:/api/v1/namespaces/kube-system/events/kube-apiserver-ml-k8s-master.15f30131f321ff36,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:58:58.182343573 +0000 UTC m=+61546.146514479) (total time: 9.39698838s):
I0214 08:59:10.570347       1 trace.go:116] Trace[2043612436]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.17.3 (linux/amd64) kubernetes/06ad960,client:::1 (started: 2020-02-14 08:58:53.404936116 +0000 UTC m=+61541.369107029) (total time: 17.165367218s):
I0214 08:59:10.570655       1 trace.go:116] Trace[1323927304]: "Get" url:/api/v1/namespaces/kube-system,user-agent:kube-apiserver/v1.17.3 (linux/amd64) kubernetes/06ad960,client:::1 (started: 2020-02-14 08:58:53.44361207 +0000 UTC m=+61541.407782972) (total time: 17.127000316s):
I0214 08:59:10.570909       1 trace.go:116] Trace[908427642]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:59:07.593653061 +0000 UTC m=+61555.557823941) (total time: 2.977224368s):
I0214 08:59:10.571013       1 trace.go:116] Trace[1268955959]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:07.593556077 +0000 UTC m=+61555.557726949) (total time: 2.977426224s):
I0214 08:59:10.589288       1 trace.go:116] Trace[1096541851]: "GuaranteedUpdate etcd3" type:*core.Event (started: 2020-02-14 08:59:07.767883507 +0000 UTC m=+61555.732054416) (total time: 2.821369656s):
I0214 08:59:10.589411       1 trace.go:116] Trace[1078016545]: "Patch" url:/api/v1/namespaces/kube-system/events/kube-apiserver-ml-k8s-master.15f30131f321ff36,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:07.767786833 +0000 UTC m=+61555.731957739) (total time: 2.821592124s):
--------------------
POD: kube-controller-manager-ml-k8s-master
--------------------
I0214 08:58:03.481241       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:08.481548       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:13.481836       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:18.482133       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:23.482438       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:28.482786       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:33.483098       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:38.483407       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:43.483712       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:48.484934       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:53.485243       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:58.485563       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:59:02.253057       1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0214 08:59:02.253145       1 controllermanager.go:279] leaderelection lost
--------------------
POD: kube-flannel-ds-amd64-bh4mj
--------------------
--------------------
POD: kube-proxy-9mxbs
--------------------
--------------------
POD: kube-scheduler-ml-k8s-master
--------------------
I0214 08:59:02.258835       1 leaderelection.go:288] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
F0214 08:59:02.258870       1 server.go:257] leaderelection lost
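The two fatal exits above come from the client-go leader-election loop: when a component cannot renew its lease within the renew deadline, it crashes on purpose rather than risk running two leaders. The root cause here is the etcd latency, but as a stopgap the election timeouts can be loosened via flags in the static-pod manifests under /etc/kubernetes/manifests/. This is only a sketch; the values below are illustrative, not recommendations:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml
# (the same flags exist for kube-scheduler.yaml; kubelet restarts the
# static pod automatically when the manifest file changes)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --leader-elect=true
    - --leader-elect-lease-duration=60s   # default 15s
    - --leader-elect-renew-deadline=40s   # default 10s
    - --leader-elect-retry-period=5s      # default 2s
```

Note that this only buys headroom for the restarts; requests that take 17+ seconds in etcd still need their own explanation.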

Just in case someone is interested, the script get_logs.sh used is:

#!/bin/bash
NOW=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="logs-${NOW}"
PODS=$(kubectl get pods -A -o custom-columns=:metadata.name --no-headers=true)
LINE=$(printf "%0.s-" {1..20})
for p in ${PODS}
do
    echo "${LINE}" >> "${LOG_FILE}"
    echo "POD: ${p}" >> "${LOG_FILE}"
    echo "${LINE}" >> "${LOG_FILE}"
    kubectl logs -n kube-system "${p}" >> "${LOG_FILE}"
done

# Timestamp (HH:MM:SS) of the first "leaderelection lost" fatal, if any
LSL=$(awk -F'[ :.]' '/ leaderelection lost/{printf("%s:%s:%s\n", $2, $3, $4)}' "${LOG_FILE}" | sort -u | head -1)
if [[ -z ${LSL} ]]; then
	echo "No leader election lost"
	exit
fi
echo "Leader election lost: ${LSL}"
LSL_SEC=$(date +%s -d "${LSL}")
LSL_HM=$(date +"%H:%M" -d "@${LSL_SEC}")
MIN_BEFORE=$(date +"%H:%M" -d "@$((LSL_SEC - 60))")
echo "Logs from ${MIN_BEFORE} to ${LSL_HM}"
grep -E -e '^---|^POD' -e " ${MIN_BEFORE}" -e " ${LSL_HM}" "${LOG_FILE}"
@cja000 (Author) commented Feb 14, 2020

Regarding the metrics, with this script:

curl -s localhost:2381/metrics | awk -F' ' '/fsync.*sum/{t=$2; printf("%s: ", $1); getline; if ($2 != 0) {print t/$2*1000 "ms"} else {print "0 ms"}}'

I get these values:

etcd_disk_wal_fsync_duration_seconds_sum: 21.0832ms
etcd_snap_db_fsync_duration_seconds_sum: 0ms
etcd_snap_fsync_duration_seconds_sum: 0.644806ms
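Those sum/count figures are only averages, which can hide the multi-second spikes seen in the log. The histogram buckets on the same metrics endpoint show the tail directly; each line is a cumulative count of fsyncs that completed within `le` seconds:

```shell
# Dump the metrics once, then inspect the WAL fsync latency distribution.
curl -s localhost:2381/metrics > metrics.txt
grep '^etcd_disk_wal_fsync_duration_seconds_bucket' metrics.txt
```

If the counts for the larger `le` buckets keep growing relative to the small ones, the fsync tail is slow even though the average looks fine.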
@fedebongio (Contributor) commented Feb 20, 2020

/assign @jingyih

@k8s-ci-robot (Contributor) commented Feb 20, 2020

@fedebongio: GitHub didn't allow me to assign the following users: jingyih.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @jingyih

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cja000 (Author) commented Feb 21, 2020

Tested with CentOS 8 and got the same problem

@jingyih (Contributor) commented Feb 21, 2020

The disk performance (fio test) looks good to me.

Questions:

  • Which version of etcd are you running?
  • Do you see "took too long" warnings in the etcd log all the time, or do they appear only for a short period?
  • Since you mentioned this only happens in your production environment, is etcd configured differently there?