external-provisioner crashes after concurrently creating 100 PVCs #322

Closed
heymingwei opened this issue Aug 8, 2019 · 21 comments
Labels: help wanted, lifecycle/frozen

Comments

@heymingwei

After concurrently creating 100 PVCs, the external-provisioner crashes.
The sidecar uses leader election (lease-based).

When I change the worker-thread count to 4, it works!
So the question is: is my CPU too weak, or is the default worker-thread count too high? Why is 100 the default number of worker threads?

@msau42
Collaborator

msau42 commented Aug 9, 2019

cc @jsafrane, do we really need the default worker thread count to be so high?

@jsafrane
Contributor

100 was requested by @saad-ali in https://docs.google.com/document/d/1wyq_9-EFsr7U90JMYXOHxoJlChwWXJqWsah3ctmuDDo/edit?disco=AAAACevk7-4

We do not need 100, but then we need to make up another number.

@GongWilliam, what CSI driver do you use? Can you check why the provisioner crashed?

I created 500 PVCs with the mock driver and got them provisioned (with a lot of API server throttling) in ~4 minutes without any issues, on a VM with 4 CPU cores.

@heymingwei
Author

heymingwei commented Aug 15, 2019

The crash happens because the provisioner cannot renew the lease; it then exits and waits for the kubelet to restart it.
@jsafrane

@zhucan
Member

zhucan commented Aug 15, 2019

I encountered the same problem: after concurrently creating 50 PVCs, the external-provisioner crashed.
In my case I had set 'csi.storage.k8s.io/provisioner-secret-name' and 'csi.storage.k8s.io/provisioner-secret-namespace' in the StorageClass. @jsafrane @GongWilliam

@hoyho
Contributor

hoyho commented Aug 15, 2019

Perhaps it happens when too many threads read/write the ConfigMap concurrently.

@msau42
Collaborator

msau42 commented Aug 15, 2019

Is there a stack trace of the crash?

@zhucan
Member

zhucan commented Aug 16, 2019

logs:
provision "default/csi-pvc48" class "expand-sc": started
I0816 11:20:29.842122 1 controller.go:596] successfully created PV {GCEPersistentDisk:nil AWSElasticBlockStore:nil HostPath:nil Glusterfs:nil NFS:nil RBD:nil ISCSI:nil Cinder:nil CephFS:nil FC:nil Flocker:nil FlexVolume:nil AzureFile:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil PortworxVolume:nil ScaleIO:nil Local:nil StorageOS:nil CSI:&CSIPersistentVolumeSource{Driver:csi.block.com,VolumeHandle:csi-iscsi-pvc-85b63906-2706-4101-8e23-380d11e7d6e3,ReadOnly:false,FSType:xfs,VolumeAttributes:map[string]string{accessPaths: admin,admin1,admin2,pool: data_pool,storage.kubernetes.io/csiProvisionerIdentity: 1565925614881-8081-.csi.block.com,xmsServers: 10.255.101.225,10.255.101.226,10.255.101.227,},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,}}
I0816 11:20:29.842534 1 controller.go:1278] provision "default/csi-pvc42" class "expand-sc": volume "pvc-85b63906-2706-4101-8e23-380d11e7d6e3" provisioned
I0816 11:20:29.842581 1 controller.go:1295] provision "default/csi-pvc42" class "expand-sc": succeeded
I0816 11:20:29.916490 1 leaderelection.go:263] failed to renew lease default/csi-block-com: failed to tryAcquireOrRenew context deadline exceeded
F0816 11:20:29.916544 1 leader_election.go:169] stopped leading

@msau42
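For context on why the failed renewal above is fatal: client-go's leader election invokes the OnStoppedLeading callback once RenewDeadline elapses without a successful renew, and the CSI sidecars exit from that callback so the pod is restarted and a new election happens. The following is a minimal sketch of that pattern, not the provisioner's actual code; the helper name and timings are illustrative.

```go
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection is an illustrative helper showing the standard
// client-go pattern where losing the lease is fatal, so the container exits
// and the kubelet restarts it.
func runWithLeaderElection(cfg *rest.Config, namespace, lockName, id string, run func(context.Context)) {
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock, namespace, lockName,
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatalf("error creating lease lock: %v", err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // illustrative timings
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			// If renewal does not succeed within RenewDeadline (for example
			// because the renew goroutine is starved by many busy workers),
			// this fires and the process exits, which is the "stopped
			// leading" fatal log seen above.
			OnStoppedLeading: func() {
				klog.Fatal("stopped leading; exiting so the pod restarts and re-elects")
			},
		},
	})
}
```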

@msau42
Collaborator

msau42 commented Sep 19, 2019

It sounds like the default of 100 threads is too much for machines with fewer cores. Capturing a CPU/memory profile under stress would be useful for comparing different values. I think lowering the default is reasonable, but we will need to wait until we're ready for a 2.0 release, since changing the default is a breaking change.

For now, you can set the --worker-threads flag lower as needed.
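For capturing such a profile, Go's built-in pprof endpoint is usually enough. A minimal sketch follows; the port and the standalone main are illustrative, and the external-provisioner may or may not expose this itself.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux

	"k8s.io/klog/v2"
)

func main() {
	// Serve the profiler on a local port while the provisioning stress test runs.
	go func() {
		klog.Info(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the sidecar's real work loop
}
```

A CPU profile can then be captured during the PVC burst with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`, and a heap snapshot from `/debug/pprof/heap`.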

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2019
@msau42 msau42 added this to the 2.0 milestone Dec 24, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 23, 2020
@msau42
Collaborator

msau42 commented Jan 30, 2020

/remove-lifecycle rotten
/help

Need help running some performance benchmarks to determine what a good default number would be.

@k8s-ci-robot
Contributor

@msau42:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@k8s-ci-robot k8s-ci-robot added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jan 30, 2020
@yoesmat

yoesmat commented Jan 30, 2020

The safest option would probably be to set the default to some multiple of the CPU count. Without priorities on goroutines, you always risk the lease goroutine not getting scheduled in time and thus losing the leader election.
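A sketch of what that could look like, assuming a hypothetical defaultWorkerThreads helper and an arbitrary multiplier (the real --worker-threads flag still defaults to 100):

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// defaultWorkerThreads is hypothetical: derive the worker count from the
// available CPUs instead of a fixed 100, with a small floor for tiny nodes.
// The multiplier of 4 is illustrative, not a benchmarked value.
func defaultWorkerThreads() uint {
	n := uint(runtime.NumCPU()) * 4
	if n < 4 {
		n = 4
	}
	return n
}

func main() {
	workerThreads := flag.Uint("worker-threads", defaultWorkerThreads(),
		"Number of provisioning worker threads.")
	flag.Parse()
	fmt.Printf("running with %d worker threads\n", *workerThreads)
}
```

Even with a CPU-derived default, the renew goroutine still shares the scheduler with the workers, so a conservative multiplier matters more than the exact formula.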

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2020
@msau42
Collaborator

msau42 commented Jun 19, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 19, 2020
@msau42 msau42 added this to To do in K8s 1.19 Jun 25, 2020
@msau42 msau42 moved this from To do to In progress in K8s 1.19 Jul 20, 2020
@chrishenzie
Contributor

I attempted to reproduce this issue several different ways, but without success:

  • Using a revision prior to this change (Configurable throughput for clients to the API server. #447)
    • Testing if API-throttling on a single k8s client could impact leader election
  • Setting CPU limits (10m on the external-provisioner container)
  • Sleeping during CreateVolume() in the mock-driver container

This was benchmarked by creating and deleting 5000 PVCs using the external-provisioner and mock driver on a node pool of GCE g1-small instances (1 vCPU). Creating the PVCs took ~2 minutes, and provisioning all the PVs took an additional ~20+ minutes, but during this time I only ever saw logs for acquiring the lease:

csi-provisioner I0723 23:51:01.725031       1 leaderelection.go:252] successfully acquired lease default/io-kubernetes-storage-mock

but never for the lease expiring. There were also no pod restarts.
/close

@k8s-ci-robot
Contributor

@chrishenzie: You can't close an active issue/PR unless you authored it or you are a collaborator.


@msau42
Collaborator

msau42 commented Jul 23, 2020

Thanks for investigating! It seems we can't reproduce the issue on a 1-CPU machine, and separating the leader election client from the provisioning k8s client should help avoid API throttling, which could have been one avenue of starvation.

/close
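The separation mentioned above roughly amounts to giving leader election its own client with its own rate limits, so heavy PVC/PV traffic cannot starve lease renewals. A sketch under those assumptions, not the sidecar's actual wiring; the QPS/Burst numbers are illustrative:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

// buildClients is a sketch: one clientset for provisioning work and a second,
// independently rate-limited clientset reserved for leader election, so that
// heavy PVC/PV traffic on the first client cannot delay lease renewals.
func buildClients(base *rest.Config) (provisioning, leaderElection kubernetes.Interface) {
	provisioningCfg := rest.CopyConfig(base)
	provisioningCfg.QPS = 100 // illustrative values, not the sidecar's defaults
	provisioningCfg.Burst = 200

	leCfg := rest.CopyConfig(base)
	leCfg.QPS = 5 // lease renewals are tiny; a small budget is enough
	leCfg.Burst = 10

	var err error
	if provisioning, err = kubernetes.NewForConfig(provisioningCfg); err != nil {
		klog.Fatalf("failed to create provisioning client: %v", err)
	}
	if leaderElection, err = kubernetes.NewForConfig(leCfg); err != nil {
		klog.Fatalf("failed to create leader election client: %v", err)
	}
	return provisioning, leaderElection
}
```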

K8s 1.19 automation moved this from In progress to Done Jul 23, 2020
@k8s-ci-robot
Contributor

@msau42: Closing this issue.


@andyzhangx
Member

Check my large-scale test results: I got relatively accurate memory limits for csi-provisioner, csi-attacher, and csi-resizer from the test below:
