(JENKINS-68371) improve asynchrony of StandardPlannedNodeBuilder #1171
Conversation
When the NodeProvisioner is building agents in the provision step (https://github.com/jenkinsci/kubernetes-plugin/blob/307d9791dcf7dfc3bbbcbdf1a7eab44ed752a4c8/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L536), it currently loops through the number of nodes to provision. This is implemented in StandardPlannedNodeBuilder as a blocking operation, even though a Future interface is used to satisfy the consumers. In our testing, this operation could take upwards of 100 seconds under load, effectively stopping provisioning for periods of time.

This change introduces a configurable thread pool so that the agent creation step occurs in a separate thread. In our testing, this resolves the serial bottleneck and allows provisioning to continue while the blocking operations occur separately.

https://issues.jenkins.io/browse/JENKINS-68371

I'm not sure exactly how to test this effectively.
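To illustrate the shape of the change, here is a minimal sketch of handing the blocking agent construction to an executor and returning the resulting Future through NodeProvisioner.PlannedNode. The class name, the system property, and the createAgent() helper are hypothetical; the actual builder in the PR may be structured differently.

```java
import hudson.model.Node;
import hudson.slaves.NodeProvisioner;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncPlannedNodeBuilderSketch {

    // Hypothetical pool-size knob; the real PR may expose this differently.
    private static final int POOL_SIZE =
            Integer.getInteger("kubernetes.plannedNodeBuilder.threads", 4);

    // Shared pool so agent construction does not run on the NodeProvisioner thread.
    private static final ExecutorService AGENT_BUILDER_POOL =
            Executors.newFixedThreadPool(POOL_SIZE);

    public NodeProvisioner.PlannedNode build(String displayName, int numExecutors) {
        // Run the potentially slow, blocking construction off the provisioning loop.
        CompletableFuture<Node> future =
                CompletableFuture.supplyAsync(() -> createAgent(displayName), AGENT_BUILDER_POOL);
        return new NodeProvisioner.PlannedNode(displayName, future, numExecutors);
    }

    // Placeholder for the blocking KubernetesSlave construction the real builder performs.
    private Node createAgent(String displayName) {
        throw new UnsupportedOperationException("blocking agent construction goes here");
    }
}
```

The provisioning loop only needs the PlannedNode and its Future, so it can continue to the next agent immediately while the pool works through the blocking creations.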
Don't quite understand the test failure. Seems like some sort of race between the pod failing and it getting killed?
This doesn't seem right. I would expect computation for a single agent to take up to 1 second, not 100 seconds, even on a loaded system.
There are some logs in the related ticket that demonstrate the delays we have seen. Without this fix, provisioning when hundreds of agents are needed can take up to an hour before those resources are available.
This step is completely local and I don't think there is justification for running it asynchronously.
What we have seen in our logs is that some blocking operation prevents completion of this step (when whatever resource is freed, the blocked threads all return at the same time). When the thread pool is leveraged, this blocking operation does not prevent the KS from being created, allowing things to move smoothly. Given that the expected result is already a Future, and that we have a complex system with various moving parts that involve locking, I don't understand the objection to leveraging a thread pool to prevent future performance issues. We could certainly make the thread pool smaller to avoid memory overhead; it will just queue the work to be done. For us, profiling this situation would be very difficult, as it only seems to manifest under production-level loads on our production instances. This fix demonstrably addresses the underlying issue for us.
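For example (a sketch only, not the PR's actual configuration; the class and method names are illustrative), a small fixed pool backed by an unbounded queue keeps memory bounded to a few worker threads while pending agent creations simply wait their turn:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SmallAgentBuilderPool {
    // Hypothetical helper: a small, fixed pool whose extra tasks wait in the
    // queue instead of each occupying a blocked thread.
    public static ThreadPoolExecutor create(int threads) {
        return new ThreadPoolExecutor(
                threads, threads,             // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,    // no keep-alive needed for a fixed pool
                new LinkedBlockingQueue<>()); // unbounded queue holds pending creations
    }
}
```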
Would additional tests be beneficial? I'm not sure exactly what I would test.
Would you mind trying out #1178? This should already speed up the main loop and I still don't think asynchrony is necessary here even if the framework allows it.
We have tried it locally and still have issues. We would like to show you some logs, but don't really want to share them in public. Any suggestions about how to do that? cc @sbeaulie