
etcdserver: read-only range request took too long with etcd 3.2.24 #70082

Open · ArchiFleKs opened this issue Oct 22, 2018 · 40 comments

@ArchiFleKs commented Oct 22, 2018:

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

When deploying a brand new cluster with either t2.medium (4 GB RAM) or t3.large (8 GB RAM) instances, I get errors in the etcd logs:

2018-10-22 11:22:14.706081 W | etcdserver: read-only range request "key:\"foo\" " with result "range_response_count:0 size:5" took too long (1.310910984s) to execute
2018-10-22 11:22:16.247803 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims/\" range_end:\"/registry/persistentvolumeclaims0\" " with result "range_response_count:0 size:5" took too long (128.414553ms) to execute
2018-10-22 11:22:16.361601 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims\" range_end:\"/registry/persistentvolumeclaimt\" count_only:true " with result "range_response_count:0 size:5" took too long (242.137365ms) to execute
2018-10-22 11:22:16.363943 W | etcdserver: read-only range request "key:\"/registry/configmaps\" range_end:\"/registry/configmapt\" count_only:true " with result "range_response_count:0 size:7" took too long (115.223655ms) to execute
2018-10-22 11:22:16.364641 W | etcdserver: read-only range request "key:\"/registry/configmaps/\" range_end:\"/registry/configmaps0\" " with result "range_response_count:1 size:2564" took too long (117.026891ms) to execute
2018-10-22 11:22:18.298297 W | wal: sync duration of 1.84846442s, expected less than 1s
2018-10-22 11:22:20.375327 W | wal: sync duration of 2.067539108s, expected less than 1s
proto: no coders for int
proto: no encoder for ValueSize int [GetProperties]
2018-10-22 11:22:20.892736 W | etcdserver: request "header:<ID:16312995093338348449 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/serviceaccounts/kube-system/node-controller\" mod_revision:283 > success:<request_put:<key:\"/registry/serviceaccounts/kube-system/node-controller\" value_size:166 >> failure:<>>" with result "size:16" took too long (516.492646ms) to execute

What you expected to happen:

I expect the logs to be free of errors.

How to reproduce it (as minimally and precisely as possible):

Launch a kubeadm cluster with Kubernetes version v1.12.1

Anything else we need to know?:

When downgrading to Kubernetes v1.11.3 the errors disappear. Also, staying on v1.12.1 and manually downgrading etcd to v3.2.18 (the version shipped with Kubernetes v1.11.3) works around the issue.

Environment:

  • Kubernetes version (use kubectl version): v1.12.1
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CoreOS 1855.4.0
  • Kernel (e.g. uname -a): Linux ip-10-0-3-11.eu-west-1.compute.internal 4.14.67-coreos #1 SMP Mon Sep 10 23:14:26 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
  • Install tools: kubeadm
  • Others: etcd version 3.2.24, as shipped with Kubernetes 1.12.1

@ArchiFleKs (Author) commented Oct 22, 2018:

/kind bug

@ArchiFleKs (Author) commented Oct 22, 2018:

/sig api-machinery

@ArchiFleKs (Author) commented Oct 22, 2018:

@dims (Member) commented Oct 23, 2018:

/area etcd

@jennybuckley (Contributor) commented Oct 25, 2018:

@jingyih (Contributor) commented Oct 25, 2018:

Are you using SSD or HDD? (I think t2 and t3 instances could come with either SSD or HDD?) Do you see 'wal: sync duration of' warning message with etcd v3.2.18?
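
For anyone wanting to check this on their own nodes, a quick sketch (the instance ID below is hypothetical, and the commands assume a configured AWS CLI):

# Query the EBS volume type backing an instance (gp2 = SSD, standard = magnetic HDD):
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].{ID:VolumeId,Type:VolumeType}' --output table

# Inside the instance, lsblk reports whether the kernel sees the device as rotational
# (ROTA 1 = spinning disk, 0 = SSD); note that virtualized volumes may report 0 either way:
lsblk -d -o NAME,ROTA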

@ArchiFleKs (Author) commented Oct 26, 2018:

@jingyih you might be right; it turns out my launch templates are using HDD (standard, not gp2).

I'll retry with etcd 3.2.24 and report back here.

To answer your question, I see no such errors with etcd 3.2.18.

@lixianyang commented Oct 27, 2018:

kubernetes version: 1.12.1
etcd version: 3.2.24
os: CentOS Linux release 7.5.1804
kernel: 3.10.0-862.2.3.el7.x86_64
VM on OpenStack, Ceph RBD as storage, 16 cores, 32 GB memory
load average: 0.45, 0.36, 0.41

free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        5.5G        361M        1.6G         25G         23G
Swap:            0B          0B          0B
Oct 27 16:19:45 k8s-web3. etcd[24400]: finished scheduled compaction at 1435982 (took 1.77165ms)
Oct 27 16:24:45 k8s-web3. etcd[24400]: store.index: compact 1436519
Oct 27 16:24:45 k8s-web3. etcd[24400]: finished scheduled compaction at 1436519 (took 1.682657ms)
Oct 27 16:27:14 k8s-web3. etcd[24400]: read-only range request "key:\"/registry/initializerconfigurations/\" range_end:\"/registry/initializerconfigurations0\" " with result "range_response_count:0 size:6" took too long (103.929647ms) to execute
Oct 27 16:27:14 k8s-web3. etcd[24400]: read-only range request "key:\"/registry/initializerconfigurations/\" range_end:\"/registry/initializerconfigurations0\" " with result "range_response_count:0 size:6" took too long (104.75876ms) to execute
Oct 27 16:27:14 k8s-web3. etcd[24400]: read-only range request "key:\"/registry/initializerconfigurations/\" range_end:\"/registry/initializerconfigurations0\" " with result "range_response_count:0 size:6" took too long (104.778056ms) to execute
Oct 27 16:29:45 k8s-web3. etcd[24400]: store.index: compact 1437057
Oct 27 16:29:45 k8s-web3. etcd[24400]: finished scheduled compaction at 1437057 (took 2.345841ms)
Oct 27 16:30:10 k8s-web3. etcd[24400]: read-only range request "key:\"/registry/secrets\" range_end:\"/registry/secrett\" count_only:true " with result "range_response_count:0 size:8" took too long (530.546591ms) to execute
Oct 27 16:30:10 k8s-web3. etcd[24400]: read-only range request "key:\"/registry/priorityclasses\" range_end:\"/registry/priorityclasset\" count_only:true " with result "range_response_count:0 size:8" took too long (317.166687ms) to execute

@lixianyang commented Oct 27, 2018:

Probably the disk is too slow:

https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean

https://github.com/etcd-io/etcd/blob/master/Documentation/metrics.md#disk

Etcd metrics in my cluster:

# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by wal.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 98806
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 630289
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 770350
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 775197
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 776834
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 777738
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 778567
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 779201
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 779595
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 779788
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 779875
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 779897
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 779902
etcd_disk_wal_fsync_duration_seconds_sum 6224.105096278966
etcd_disk_wal_fsync_duration_seconds_count 779902

@ArchiFleKs (Author) commented Oct 27, 2018:

Hi, I tested with gp2 SSD on AWS and I have the same issue. I don't get the wal fsync duration warnings, though.

A colleague of mine has the same issue with Rancher and the same etcd version, also on SSD.

@ArchiFleKs (Author) commented Oct 27, 2018:

I'll try with an EBS-optimized instance and a dedicated disk to rule out disk latency.

The cluster seems to function normally even with etcd 3.2.24.

@jpbetz (Contributor) commented Oct 30, 2018:

Please check the backend_commit_duration_seconds metric and report back (per https://github.com/etcd-io/etcd/blob/master/Documentation/metrics.md#disk). If the value is high we can attribute the issue to disk latency; if it's low, we know we should investigate further.
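
For reference, one way to pull this metric straight from a local etcd member (a sketch; the certificate paths assume a kubeadm-provisioned control plane and may differ in your deployment):

# Scrape etcd's Prometheus endpoint and filter the backend commit duration histogram:
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
     https://127.0.0.1:2379/metrics \
  | grep etcd_disk_backend_commit_duration_seconds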

@ArchiFleKs (Author) commented Oct 31, 2018:

@jpbetz here is what I have:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 1
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 228127
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 348658
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 352308
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 354316
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 354594
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 354672
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 354757
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 354838
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 354841
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 354842
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 354842
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 354842
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 354842
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 354842
etcd_disk_backend_commit_duration_seconds_sum 752.9406572860102
etcd_disk_backend_commit_duration_seconds_count 354842

I'm not sure how to read this.

@RiceBowlJr commented Oct 31, 2018:

I have the same issue with my cluster; I didn't notice it before I saw this issue opened.
kubernetes version: 1.12.2
etcd version: 3.2.24
os: Ubuntu 16.04.5 LTS
kernel: Linux 4.9.58-xxxx-std-ipv6-64 #1 SMP Mon Oct 23 11:35:59 CEST 2017 x86_64 x86_64 x86_64 GNU/Linux
VM on kimsufi.com (Serveur KS-4C - i5-2300 - 16GB - 1x2TB)
load average: 0.45, 0.36, 0.41

free -h
              total        used        free      shared  buff/cache   available
Mem:          7,7Gi       4,1Gi       1,1Gi       529Mi       2,5Gi       2,8Gi
Swap:         7,9Gi       1,7Gi       6,1Gi

Errors:

2018-10-29 13:47:59.656160 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:440" took too long (105.398841ms) to execute
2018-10-29 13:48:15.855931 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:458" took too long (107.508688ms) to execute
2018-10-29 13:49:16.905572 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/coredns-576cbf47c7-hzgxm.15621870fab0aa01\" " with result "range_response_count:1 size:495" took too long (125.232824ms) to execute
2018-10-29 13:49:16.905630 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/coredns-576cbf47c7-hzgxm\" " with result "range_response_count:1 size:1337" took too long (125.512723ms) to execute
2018-10-29 13:49:16.905667 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/coredns-576cbf47c7-hzgxm\" " with result "range_response_count:1 size:1337" took too long (125.429248ms) to execute
2018-10-29 13:49:26.897178 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/coredns-576cbf47c7-hzgxm\" " with result "range_response_count:1 size:1337" took too long (108.176701ms) to execute
2018-10-29 13:49:26.897266 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/coredns-576cbf47c7-hzgxm.15621870fab0aa01\" " with result "range_response_count:1 size:495" took too long (107.939493ms) to execute
2018-10-29 13:49:26.897479 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/coredns-576cbf47c7-hzgxm\" " with result "range_response_count:1 size:1337" took too long (107.949024ms) to execute
2018-10-29 13:52:50.820939 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:459" took too long (106.079668ms) to execute

@jpbetz (Contributor) commented Oct 31, 2018:

@ArchiFleKs It says that of 354842 total operations, 228127 took less than or equal to 0.002 seconds and 348658 took less than or equal to 0.004 seconds, where the 348658 number also includes those that took less than 0.002 seconds. A very small portion (85, to be exact) of disk backend commits took over 128ms. I'm not well enough calibrated to say for sure whether those numbers are outside the healthy range, but they don't look particularly alarming. The "wal: sync duration of 2.067539108s, expected less than 1s" line from the original report does, however, look quite alarming and suggests high disk IO latency. How about your etcd_disk_wal_fsync_duration_seconds_bucket metric, @ArchiFleKs? Does that show anything?
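
To make the bucket arithmetic explicit (the buckets are cumulative, so a tail count is just the difference between two buckets):

# Commits slower than 128ms: +Inf bucket minus the le="0.128" bucket
echo $((354842 - 354757))                          # 85
# Mean commit latency: _sum divided by _count
echo "scale=4; 752.9406572860102 / 354842" | bc    # ~0.0021 s, i.e. about 2.1 ms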

@ArchiFleKs (Author) commented Nov 2, 2018:

Yes, I agree about the original report. Just to clarify, this is an empty cluster with nothing on it, except a fresh kubeadm 1.12 install and CoreDNS.

What I find very weird is that I have no issue with the same config and etcd 3.2.18. I'll check etcd_disk_wal_fsync_duration_seconds_bucket and report back ASAP.

@ArchiFleKs (Author) commented Nov 4, 2018:

Hi,

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 926288
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1.269401e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 1.327218e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1.332407e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 1.334417e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 1.334858e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 1.335046e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 1.335151e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 1.335183e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 1.335185e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 1.335187e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 1.335187e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 1.335188e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 1.335188e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1.335188e+06
etcd_disk_wal_fsync_duration_seconds_sum 1336.26308381703
etcd_disk_wal_fsync_duration_seconds_count 1.335188e+06

@jpbetz (Contributor) commented Nov 5, 2018:

@ArchiFleKs looks like there was one fsync that took between 2.048 and 4.096 seconds and two that took between 0.512 and 1.024 seconds. This would result in messages like the one you saw ("wal: sync duration of 2.067539108s, expected less than 1s"). https://github.com/etcd-io/etcd/blob/e8b940f268a80c14f7082589f60cbfd3de531d12/wal/wal.go#L572 both tallies this metric and logs the message.

If you're seeing the log message at higher rates than the metric suggests, that might require further investigation, but the log message does appear to be correctly telling us that there was excessively high disk latency.

@lixianyang commented Nov 29, 2018:

So, what's the recommended etcd version for 1.12.x? 3.2.24?

@jpbetz (Contributor) commented Nov 29, 2018:

@lixianyang Yes, etcd 3.2.24 is recommended for k8s 1.12.x.

@kovetskiy commented Dec 10, 2018:

Is there any possible workaround? I'm experiencing the same issue on Azure. I also tried running etcd on a separate SSD disk; it didn't help.

@kovetskiy commented Dec 12, 2018:

@lixianyang if you're still wondering, I tried 3.2.18 and it works better than 3.2.24; I don't see these "read-only range request took too long" messages anymore.

@lostick (Contributor) commented Dec 28, 2018:

FYI, having the same issue with 3.2.25 on 1.12.3.

@jcrowthe commented Mar 14, 2019:

As an update here, I'm seeing this same read-only range request ... took too long error in Azure k8s clusters with ~200 nodes, but notably the etcd data is mounted on a ramdisk. This means that our 99th percentile backend commit duration is ~6ms, indicating this is not due to slowness of disk.

Have we identified whether this is due to the version of etcd? Currently 3.2.24 is the recommended etcd version for 1.12 and 1.13, with 1.14 updating to 3.3.10 (source).

@lixianyang commented Apr 19, 2019:

Same issue in etcd 3.3.10

@dotbalo commented Apr 29, 2019:

My Kubernetes cluster is v1.13.5 and etcd is v3.3.10.
I have the same error, but my etcd cluster uses an SSD drive. The read and write speeds are as follows:

[k8s-user@k8s-master01 etcd]$ sudo dd if=/dev/zero of=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 2.14125 s, 1.2 GB/s
[k8s-user@k8s-master01 etcd]$ sudo dd if=/dev/zero of=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 3.47247 s, 755 MB/s
[k8s-user@k8s-master01 etcd]$ sudo dd if=/dev/zero of=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 3.5717 s, 734 MB/s
[k8s-user@k8s-master01 etcd]$ sudo dd if=/dev/zero of=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 3.58915 s, 730 MB/s
[k8s-user@k8s-master01 etcd]$ sudo dd of=/dev/zero if=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 0.503606 s, 5.2 GB/s
[k8s-user@k8s-master01 etcd]$ sudo dd of=/dev/zero if=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 0.614005 s, 4.3 GB/s
[k8s-user@k8s-master01 etcd]$ sudo dd of=/dev/zero if=testspeed bs=5M count=500
500+0 records in
500+0 records out
2621440000 bytes (2.6 GB) copied, 0.522527 s, 5.0 GB/s

error logs:

Apr 29 12:11:10 k8s-master01 etcd: read-only range request "key:\"/registry/daemonsets\" range_end:\"/registry/daemonsett\" count_only:true " with result "range_response_count:0 size:8" took too long (333.471477ms) to execute
Apr 29 12:11:14 k8s-master01 etcd: read-only range request "key:\"/registry/configmaps\" range_end:\"/registry/configmapt\" count_only:true " with result "range_response_count:0 size:8" took too long (221.436047ms) to execute
Apr 29 12:11:17 k8s-master01 etcd: read-only range request "key:\"/registry/configmaps\" range_end:\"/registry/configmapt\" count_only:true " with result "range_response_count:0 size:8" took too long (640.252868ms) to execute
Apr 29 12:11:21 k8s-master01 etcd: read-only range request "key:\"/registry/deployments\" range_end:\"/registry/deploymentt\" count_only:true " with result "range_response_count:0 size:8" took too long (179.700872ms) to execute
Apr 29 12:11:21 k8s-master01 etcd: read-only range request "key:\"/registry/secrets/kube-system/metrics-server-token-sccgm\" " with result "range_response_count:1 size:2761" took too long (184.019088ms) to execute
Apr 29 12:11:21 k8s-master01 etcd: read-only range request "key:\"/registry/services/specs\" range_end:\"/registry/services/spect\" count_only:true " with result "range_response_count:0 size:8" took too long (120.62712ms) to execute
Apr 29 12:11:21 k8s-master01 etcd: read-only range request "key:\"/registry/statefulsets\" range_end:\"/registry/statefulsett\" count_only:true " with result "range_response_count:0 size:8" took too long (158.092446ms) to execute

Is my disk too slow?

@Nefelim4ag commented Apr 29, 2019:

@dotbalo, yep, a very slow SSD for this workload.
(Your tests are not meaningful: at the very least you must use oflag/iflag with direct/sync. Better to use fio or ioping, because you need to care about IOPS and latency, not throughput.)
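
For example, the fio invocation usually suggested for etcd measures fdatasync latency rather than throughput (a sketch; point --directory at a scratch folder on the same device as the etcd data dir, assumed here to be under /var/lib/etcd):

mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-test
# In the output, look at the fsync/fdatasync latency percentiles; the usual guidance
# is a 99th percentile well under 10ms for a healthy etcd disk.
rm -rf /var/lib/etcd/fio-test

# ioping gives a quick per-request latency spot check on the same directory:
ioping -c 10 /var/lib/etcd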

@dotbalo commented Apr 30, 2019:

@Nefelim4ag Thanks for your answer. I tested the bandwidth, IOPS, and latency of this SSD separately. The test results are as follows:

IOPS Test:

Sequential Write
-----------------
[root@k8s-master03 temp]# fio --name=sequential_write_iops_test --filename=testfile --filesize=10G --time_based --ramp_time=2s --runtime=1m --ioengine=libaio --direct=1 --verify=0 --randrepeat=0 --bs=4k --iodepth=4096 --rw=write --group_reporting


sequential_write_iops_test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=4096
fio-3.1
Starting 1 process
sequential_write_iops_test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][71.6%][r=0KiB/s,w=98.5MiB/s][r=0,w=25.2k IOPS][eta 00m:25s]
sequential_write_iops_test: (groupid=0, jobs=1): err= 0: pid=10884: Tue Apr 30 10:24:39 2019
  write: IOPS=30.7k, BW=120MiB/s (126MB/s)(7212MiB/60005msec)
    slat (usec): min=4, max=90212, avg=28.25, stdev=293.83
    clat (msec): min=3, max=2266, avg=133.20, stdev=103.51
     lat (msec): min=3, max=2266, avg=133.23, stdev=103.53
    clat percentiles (msec):
     |  1.00th=[   57],  5.00th=[   67], 10.00th=[   72], 20.00th=[   82],
     | 30.00th=[   92], 40.00th=[  106], 50.00th=[  116], 60.00th=[  128],
     | 70.00th=[  144], 80.00th=[  165], 90.00th=[  201], 95.00th=[  236],
     | 99.00th=[  384], 99.50th=[  558], 99.90th=[ 1854], 99.95th=[ 2022],
     | 99.99th=[ 2265]
   bw (  KiB/s): min= 4672, max=255016, per=99.76%, avg=122775.19, stdev=48923.02, samples=120
   iops        : min= 1168, max=63754, avg=30693.76, stdev=12230.82, samples=120
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%, 100=35.66%
  lat (msec)   : 250=60.46%, 500=3.40%, 750=0.30%, 1000=0.04%, 2000=0.21%
  lat (msec)   : >=2000=0.05%
  cpu          : usr=9.00%, sys=40.71%, ctx=44121, majf=0, minf=33
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=105.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=0,1842133,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4096

Run status group 0 (all jobs):
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=7212MiB (7562MB), run=60005-60005msec

Disk stats (read/write):
    dm-1: ios=0/1941738, merge=0/0, ticks=0/7197168, in_queue=7200241, util=99.88%, aggrios=0/1941216, aggrmerge=0/522, aggrticks=0/6178305, aggrin_queue=7143170, aggrutil=99.43%
  vdc: ios=0/1941216, merge=0/522, ticks=0/6178305, in_queue=7143170, util=99.43%

I think the performance of this SSD is not very good and its latency is too high, but my k8s cluster is a new cluster and I estimate that it will not exceed 100 nodes. Will this disk performance affect the stability of the cluster?

@Nefelim4ag commented Apr 30, 2019:

@dotbalo,
Do you actually have any problems with your cluster?
I think not; and if not, which problem are you trying to solve here?

If you just see a few "slow request" records, that is not a problem.
If you see them many times per day, see the cluster freeze, or see slow requests of 1s+, then you have a real problem.

What I mean is:
You haven't read up on how to measure drive performance, and you come here asking "Is my disk too slow?".
You have a normal SOHO SSD; it's enough for etcd.

This issue does not have a direct relationship with disk IO.

It's an internal etcd problem and/or a K8s key-value design issue; it can cause slow range requests, and slow write IO can also trigger it.

I also see these messages sometimes, but they cause no harm and will not.

@ghost commented May 4, 2019:

I have met the same problem in a k8s cluster. This warning occurs 10+ times per second every time the controller-manager elects a leader on a new master, and this lasts 30+ minutes. In that time range most request latencies increase dramatically, as does etcd backend commit latency.

Cluster info:
Kubernetes version: v1.13.4
etcd version: 3.2.24
Scale: 5000 nodes, 3 masters, 3 etcd instances, running 55000+ pods in total.

rand-write: (groupid=3, jobs=1): err= 0: pid=5325: Sat May  4 10:56:03 2019
  write: IOPS=12.2k, BW=47.8MiB/s (50.2MB/s)(2870MiB/60001msec)
    slat (usec): min=2, max=413, avg= 5.06, stdev= 2.46
    clat (usec): min=12, max=20147, avg=319.02, stdev=589.52
     lat (usec): min=53, max=20152, avg=324.60, stdev=589.61
    clat percentiles (usec):
     |  1.00th=[   64],  5.00th=[   76], 10.00th=[   85], 20.00th=[   91],
     | 30.00th=[   95], 40.00th=[   99], 50.00th=[  104], 60.00th=[  113],
     | 70.00th=[  126], 80.00th=[  281], 90.00th=[  775], 95.00th=[ 2040],
     | 99.00th=[ 2606], 99.50th=[ 2671], 99.90th=[ 4555], 99.95th=[ 5800],                                                              
     | 99.99th=[10945]                                              
   bw (  KiB/s): min=39832, max=135936, per=99.94%, avg=48946.31, stdev=8554.08, samples=119                                            
   iops        : min= 9958, max=33984, avg=12236.56, stdev=2138.52, samples=119                                                         
  lat (usec)   : 20=0.01%, 50=0.01%, 100=43.32%, 250=36.04%, 500=3.87%                                                                  
  lat (usec)   : 750=6.23%, 1000=3.48%                              
  lat (msec)   : 2=1.84%, 4=5.09%, 10=0.10%, 20=0.01%, 50=0.01%     
  cpu          : usr=2.60%, sys=8.78%, ctx=290994, majf=0, minf=9   
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%                                                          
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%                                                         
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%                                                         
     issued rwt: total=0,734693,0, short=0,0,0, dropped=0,0,0       
     latency   : target=0, window=0, percentile=100.00%, depth=4   

@jingyih (Contributor) commented May 6, 2019:

As discussed in another thread, the warning message was added in v3.2.23. Between v3.2.18 and v3.2.24 there are no changes related to performance (based on the etcd changelogs for each release). Is there an actual performance regression when using etcd v3.2.24 in K8s 1.12?

@jordanrinke commented Jun 12, 2019:

I am using 3.3.10 with k8s 1.14 and seeing the same issue. My SSD volume gets around 600 MB/s. I was able to move the etcd data dir to an NVMe drive that gets around 2980 MB/s, and everything seems to be working fine now, so there is definitely a performance issue. This is an empty k8s cluster; it certainly shouldn't need NVMe speeds. I can reproduce this by flipping the data volume between the SSD and the NVMe drive, so if there are any commands I could run or outputs I could provide that would be useful, let me know. When going back to the SSD drive, the "took too long" errors start popping up in the log within a minute.

For me, while this is happening CoreDNS fails because it can't look up any service records, so it is in fact an impacting issue, not just a warning.
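
For anyone who wants to try the same experiment, a rough sketch of relocating the data dir on a kubeadm-style node (the device name and mount point are hypothetical; back up /var/lib/etcd first and adapt to your setup):

# Temporarily remove the static pod manifest so the kubelet stops etcd:
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml
mkfs.ext4 /dev/nvme1n1                      # hypothetical NVMe device
mkdir -p /mnt/nvme && mount /dev/nvme1n1 /mnt/nvme
rsync -a /var/lib/etcd/ /mnt/nvme/etcd/
# Either edit the hostPath volume backing --data-dir in the manifest to point at
# /mnt/nvme/etcd, or bind-mount the faster location over the existing path:
mount --bind /mnt/nvme/etcd /var/lib/etcd
# Restore the manifest so the kubelet starts etcd again:
mv /root/etcd.yaml /etc/kubernetes/manifests/etcd.yaml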

@dotbalo commented Jun 12, 2019:

@jordanrinke The "took too long" warning may not be a problem with SSD disk performance; it may also be caused by high network latency. If it is not very frequent, it has no effect on the cluster. I saw "took too long" when I first created a new cluster; later I migrated the etcd data directory to other storage and the warnings went away, just like for you.

@jordanrinke commented Jun 12, 2019:

This is all local, single node, single master, so it certainly isn't network latency. It is constant: I can't perform any service lookups, they error out saying the request took too long. In my case I discovered this because CoreDNS was not working, because it could not list any services from Kubernetes. That goes away when I move the data dir to the NVMe; move it back to the SSD and the errors start again.

@bbotte commented Aug 2, 2019:

Installing etcd 3.3.13 or later solved the problem for me; I have verified it.
No relationship with disk or network.

@llussy commented Aug 17, 2019:

I am using 3.3.13 with k8s 1.13.9 and seeing the same issue.

# etcd -version
etcd Version: 3.3.13
Git SHA: 98d3084
Go Version: go1.10.8
Go OS/Arch: linux/amd64

# kubectl  version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.9", GitCommit:"3e4f6a92de5f259ef313ad876bb008897f6a98f0", GitTreeState:"clean", BuildDate:"2019-08-05T09:22:00Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.9", GitCommit:"3e4f6a92de5f259ef313ad876bb008897f6a98f0", GitTreeState:"clean", BuildDate:"2019-08-05T09:14:29Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

# /var/log/messages    a lot of timeout
Aug 17 17:37:30 k8s-master kube-apiserver: Trace[30044365]: [751.474113ms] [751.310476ms] Object stored in database
Aug 17 17:37:30 k8s-master etcd: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:464" took too long (468.584252ms) to execute
Aug 17 17:37:32 k8s-master etcd: read-only range request "key:\"/registry/events/\" range_end:\"/registry/events0\" limit:500 " with result "range_response_count:1 size:471" took too long (462.97586ms) to execute

# pidstat -d 1 10
Linux 4.20.2-1.el7.elrepo.x86_64 (k8s-master)         08/17/2019      _x86_64_        (24 CPU)

05:38:02 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:38:03 PM     0      8911      0.00      3.92      3.92  rsyslogd
05:38:03 PM     0     17571      0.00     27.45      0.00  etcd
05:38:03 PM   167     32227      0.00      3.92      0.00  ceph-mon

05:38:03 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:38:04 PM     0     17571      0.00     44.00      0.00  etcd

05:38:04 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:38:05 PM     0      8911      0.00      8.00      4.00  rsyslogd
05:38:05 PM     0     17571      0.00     56.00      0.00  etcd
05:38:05 PM     0     25082      0.00      8.00      8.00  confd
05:38:05 PM   167     32227      0.00      8.00      0.00  ceph-mon
05:38:05 PM     0    667331      0.00     16.00      0.00  monitor-agent

05:38:05 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:38:06 PM     0      8911      0.00      4.00      4.00  rsyslogd
05:38:06 PM     0     17571      0.00     76.00      0.00  etcd
05:38:06 PM     0     25082      0.00     16.00     16.00  confd
05:38:06 PM   167     32227      0.00      4.00      0.00  ceph-mon

etcd write throughput is not high

@bbotte commented Aug 19, 2019:

[quoting @llussy's comment above in full]

Watch it for a few more days and compare with before, to see whether it still affects the cluster. Occasional errors can be ignored.

@dotbalo commented Aug 19, 2019:

I think this is caused by a slow disk or high network latency. I also had this warning before. Then I moved the etcd data directory to high-performance SSD storage, and there were almost no more "took too long" warnings.
Here is the monitoring diagram for my etcd:

[screenshot: etcd monitoring dashboard]

And my cluster version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:26:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

etcd version:

$ etcd -version
etcd Version: 3.3.9
Git SHA: fca8add78
Go Version: go1.10.3
Go OS/Arch: linux/amd64

So I think it is caused by a slow disk, not the etcd version.

@llussy commented Aug 19, 2019:

[quoting @llussy's comment and @bbotte's reply above in full]

It's OK now; I did not do anything, and disk IO is now normal. I don't know what was affecting it before.
IO diagram (IO was high for 33 hours):
[screenshot: disk IO graph]

@bbotte commented Aug 20, 2019:

[quoting @dotbalo's comment above in full]

Take a look at etcd-io/etcd#10808 and etcd-io/bbolt#141; I hope they can help you.
