external-provisioner crashes after concurrently creating 100 PVCs #322

Closed
heymingwei opened this issue Aug 8, 2019 · 21 comments
Labels: help wanted, lifecycle/frozen

Comments

@heymingwei

After concurrently creating 100 PVCs, the external-provisioner crashes.
The sidecar uses leader election (lease-based).

When I change the worker-thread count to 4, it works!
So the question is: is my CPU too weak, or is the default worker-thread count too high? Why is 100 the default number of worker threads?

@msau42
Collaborator

msau42 commented Aug 9, 2019

cc @jsafrane, do we really need the default worker thread count to be so high?

@jsafrane
Contributor

100 was requested by @saad-ali in https://docs.google.com/document/d/1wyq_9-EFsr7U90JMYXOHxoJlChwWXJqWsah3ctmuDDo/edit?disco=AAAACevk7-4

We do not need 100, but then we need to make up another number.

@GongWilliam, what CSI driver do you use? Can you check why the provisioner crashed?

I created 500 PVCs with the mock driver and got them provisioned (with a lot of API server throttling) in ~4 minutes without any issues, on a VM with 4 CPU cores.

@heymingwei
Author

heymingwei commented Aug 15, 2019

The crash happens because the provisioner cannot renew the lease; it then exits and waits for the kubelet to restart it.
@jsafrane

@zhucan
Member

zhucan commented Aug 15, 2019

I encountered the same problem: after concurrently creating 50 PVCs, the external-provisioner crashed.
In my case I had set 'csi.storage.k8s.io/provisioner-secret-name' and 'csi.storage.k8s.io/provisioner-secret-namespace' in the StorageClass. @jsafrane @GongWilliam

@hoyho
Contributor

hoyho commented Aug 15, 2019

Perhaps it happens when too many threads read/write the ConfigMap concurrently.

@msau42
Collaborator

msau42 commented Aug 15, 2019

Is there a stack trace of the crash?

@zhucan
Member

zhucan commented Aug 16, 2019

logs:
provision "default/csi-pvc48" class "expand-sc": started
I0816 11:20:29.842122 1 controller.go:596] successfully created PV {GCEPersistentDisk:nil AWSElasticBlockStore:nil HostPath:nil Glusterfs:nil NFS:nil RBD:nil ISCSI:nil Cinder:nil CephFS:nil FC:nil Flocker:nil FlexVolume:nil AzureFile:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil PortworxVolume:nil ScaleIO:nil Local:nil StorageOS:nil CSI:&CSIPersistentVolumeSource{Driver:csi.block.com,VolumeHandle:csi-iscsi-pvc-85b63906-2706-4101-8e23-380d11e7d6e3,ReadOnly:false,FSType:xfs,VolumeAttributes:map[string]string{accessPaths: admin,admin1,admin2,pool: data_pool,storage.kubernetes.io/csiProvisionerIdentity: 1565925614881-8081-.csi.block.com,xmsServers: 10.255.101.225,10.255.101.226,10.255.101.227,},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,}}
I0816 11:20:29.842534 1 controller.go:1278] provision "default/csi-pvc42" class "expand-sc": volume "pvc-85b63906-2706-4101-8e23-380d11e7d6e3" provisioned
I0816 11:20:29.842581 1 controller.go:1295] provision "default/csi-pvc42" class "expand-sc": succeeded
I0816 11:20:29.916490 1 leaderelection.go:263] failed to renew lease default/csi-block-com: failed to tryAcquireOrRenew context deadline exceeded
F0816 11:20:29.916544 1 leader_election.go:169] stopped leading

@msau42
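For context on why the failed renewal above is fatal: client-go's leader election invokes the OnStoppedLeading callback once RenewDeadline elapses without a successful renew, and the CSI sidecars exit from that callback so the pod is restarted and a new election happens. The following is a minimal sketch of that pattern, not the provisioner's actual code; the helper name and timings are illustrative.

```go
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection is an illustrative helper showing the standard
// client-go pattern where losing the lease is fatal, so the container exits
// and the kubelet restarts it.
func runWithLeaderElection(cfg *rest.Config, namespace, lockName, id string, run func(context.Context)) {
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock, namespace, lockName,
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatalf("error creating lease lock: %v", err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // illustrative timings
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			// If renewal does not succeed within RenewDeadline (for example
			// because the renew goroutine is starved by many busy workers),
			// this fires and the process exits, which is the "stopped
			// leading" fatal log seen above.
			OnStoppedLeading: func() {
				klog.Fatal("stopped leading; exiting so the pod restarts and re-elects")
			},
		},
	})
}
```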

@msau42
Collaborator

msau42 commented Sep 19, 2019

It sounds like the default of 100 threads is too much for machines with fewer cores. Capturing a CPU/memory profile under stress would be useful for comparing different values. I think lowering the default is reasonable, but we will need to wait until we're ready for a 2.0 release, since changing the default is a breaking change.

For now, you can set the --worker-threads flag lower as needed.
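For capturing such a profile, Go's built-in pprof endpoint is usually enough. A minimal sketch follows; the port and the standalone main are illustrative, and the external-provisioner may or may not expose this itself.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux

	"k8s.io/klog/v2"
)

func main() {
	// Serve the profiler on a local port while the provisioning stress test runs.
	go func() {
		klog.Info(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the sidecar's real work loop
}
```

A CPU profile can then be captured during the PVC burst with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`, and a heap snapshot from `/debug/pprof/heap`.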

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2019
@msau42 msau42 added this to the 2.0 milestone Dec 24, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 23, 2020
@msau42
Collaborator

msau42 commented Jan 30, 2020

/remove-lifecycle rotten
/help

Need help running some performance benchmarks to determine what a good default number would be.

@k8s-ci-robot
Contributor

@msau42:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@k8s-ci-robot k8s-ci-robot added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jan 30, 2020
@yoesmat

yoesmat commented Jan 30, 2020

The safest option would probably be to set the default to some multiple of the CPU count. Without priorities on goroutines, you always risk the lease goroutine not getting scheduled in time and thus losing the leader election.
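A sketch of what that could look like, assuming a hypothetical defaultWorkerThreads helper and an arbitrary multiplier (the real --worker-threads flag still defaults to 100):

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// defaultWorkerThreads is hypothetical: derive the worker count from the
// available CPUs instead of a fixed 100, with a small floor for tiny nodes.
// The multiplier of 4 is illustrative, not a benchmarked value.
func defaultWorkerThreads() uint {
	n := uint(runtime.NumCPU()) * 4
	if n < 4 {
		n = 4
	}
	return n
}

func main() {
	workerThreads := flag.Uint("worker-threads", defaultWorkerThreads(),
		"Number of provisioning worker threads.")
	flag.Parse()
	fmt.Printf("running with %d worker threads\n", *workerThreads)
}
```

Even with a CPU-derived default, the renew goroutine still shares the scheduler with the workers, so a conservative multiplier matters more than the exact formula.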

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2020
@msau42
Collaborator

msau42 commented Jun 19, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 19, 2020
@msau42 msau42 added this to To do in K8s 1.19 Jun 25, 2020
@msau42 msau42 moved this from To do to In progress in K8s 1.19 Jul 20, 2020
@chrishenzie
Contributor

I attempted to reproduce this issue several different ways, but without success:

  • Using a revision prior to this change (Configurable throughput for clients to the API server. #447)
    • Testing if API-throttling on a single k8s client could impact leader election
  • Setting CPU limits (10m on the external-provisioner container)
  • Sleeping during CreateVolume() in the mock-driver container

This was benchmarked by creating and deleting 5000 PVCs using the external-provisioner and mock driver on a node pool of GCE g1-small instances (1 vCPU). Creating the PVCs took ~2 minutes, and provisioning all the PVs took an additional ~20+ minutes, but during this time I only ever saw logs for acquiring the lease:

csi-provisioner I0723 23:51:01.725031       1 leaderelection.go:252] successfully acquired lease default/io-kubernetes-storage-mock

but never for the lease expiring. There were also no pod restarts.
/close

@k8s-ci-robot
Contributor

@chrishenzie: You can't close an active issue/PR unless you authored it or you are a collaborator.


@msau42
Collaborator

msau42 commented Jul 23, 2020

Thanks for investigating! It seems we can't reproduce the issue on a 1-CPU machine, and separating the leader election client from the provisioning k8s client should help avoid API throttling, which could have been one avenue of starvation.

/close
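The separation mentioned above roughly amounts to giving leader election its own client with its own rate limits, so heavy PVC/PV traffic cannot starve lease renewals. A sketch under those assumptions, not the sidecar's actual wiring; the QPS/Burst numbers are illustrative:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

// buildClients is a sketch: one clientset for provisioning work and a second,
// independently rate-limited clientset reserved for leader election, so that
// heavy PVC/PV traffic on the first client cannot delay lease renewals.
func buildClients(base *rest.Config) (provisioning, leaderElection kubernetes.Interface) {
	provisioningCfg := rest.CopyConfig(base)
	provisioningCfg.QPS = 100 // illustrative values, not the sidecar's defaults
	provisioningCfg.Burst = 200

	leCfg := rest.CopyConfig(base)
	leCfg.QPS = 5 // lease renewals are tiny; a small budget is enough
	leCfg.Burst = 10

	var err error
	if provisioning, err = kubernetes.NewForConfig(provisioningCfg); err != nil {
		klog.Fatalf("failed to create provisioning client: %v", err)
	}
	if leaderElection, err = kubernetes.NewForConfig(leCfg); err != nil {
		klog.Fatalf("failed to create leader election client: %v", err)
	}
	return provisioning, leaderElection
}
```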

K8s 1.19 automation moved this from In progress to Done Jul 23, 2020
@k8s-ci-robot
Contributor

@msau42: Closing this issue.


@andyzhangx
Member

Check my large-scale test results: I got relatively accurate memory limits for csi-provisioner, csi-attacher, and csi-resizer from the test below:
