Spot instances are not provisioning if target capacity decreases to 0 #172
We are seeing the same issue. Manually changing "Target capacity" to 1 in the AWS console starts the spot instances. This issue happens only when
I faced this issue today after replacing the fleet ID in the plugin configuration:
The new fleet stayed at 0 until I manually made its target positive. Later in the day it was scaled down by the plugin to 0 and got stuck there again.
Hi, I just tried the following scenario and the plugin works as expected:
Did I miss some preconditions or points? Do you use a weight for the Fleet?
Just realized, plugin version
@terma
@dimaov could you please share the Jenkins logs? Thanks.
In addition to the previous scenario I tried one more:
Works fine. If you can share logs, they would help me see your case, thanks.
@terma
One clarification: Jenkins dictates some rules for plugins which extend capacity, like this one or the Azure plugin. Whenever you change the configuration, Jenkins doesn't actually update the plugin settings; in fact, Jenkins always removes the plugin configuration and creates a new one. If you change the fleet ID, the current version of the plugin will remove all instances related to the old fleet and start adding instances for the new fleet without any transition period, which seems correct to me:
In case you have a problem with scaling down to
@terma When we faced this issue we did check the logs, and they said nothing about anything going wrong. There were no errors in the Jenkins logs or in the AWS Spot fleet events. There seemed to be no actions in the logs or AWS events at all; it looked like the plugin doesn't ask to create additional spot instances when it is in the 0 state. Could you clarify which exact logs would be useful for you?
Yep. First of all, Jenkins calls the plugin to check whether it is possible to scale, so the log should contain records like:
Additionally, the plugin generates regular working logs like:
All of them will help me see what the existing capacity is, what capacity should be provisioned soon, and what the demand is; plus they will show how the plugin reacts to Jenkins requests. If you want, you can share them privately at artem.stasuk@gmail.com.
A real-world example of plugin-related logs:
Great, thanks.
@terma what you wrote about settings update ("change configuration") is interesting but not the behavior I see.
It looks like during a settings update, even if Jenkins replaces the plugin configuration, some "internal" structures/parameters are kept. Regarding the issue, I can confirm that it looked like Jenkins did not request new instances at all.
You are right on both counts. Changing the fleet ID is special from the plugin's point of view:
At the same time, Jenkins always replaces the entire configuration, but because of the internal case above, the plugin could be left in a wrong state if any scaling activities are in progress.
Today, after about 2 weeks of normal work, we are stuck at zero again.
Maybe it's some kind of timing issue? The plugin sees the last VM (which is being shut down) and waits for it?
Update: I tried replacing the spot fleet with an ASG fleet with the same labels, but the new fleet kept staying at 0 and waiting jobs stayed in the queue. Only when I removed the fleet and re-added it after 1-2 minutes (again with the same labels) did the plugin set the target according to the number of jobs in the queue.
Update 2: Another fleet is stuck; it seems I'll have to remove all fleets and re-add everything again.
Unfortunately, even with an ASG, I still have to change timeouts sometimes (e.g. "Max Idle Minutes Before Scaledown"), and that breaks the configuration again, so I have to remove/recreate all fleets.
Here I have an example of this issue. It's currently affecting one of our fleets, while other fleets with almost identical configuration (the differences being labels and which ASG to use) are fine. It's just stubbornly refusing to scale the ASG up by itself. Log (the block between two "provisioning completed" lines):
The affected cloud/label is
Jenkins 2.225, fleet plugin 2.0.0
@chrono First of all, there is plugin version 2.0.2, and I think it works better.
We're currently running into this problem with Auto Scaling groups, and I'm wondering if anybody has had the same problem with EC2 spot fleets? Perhaps switching from an ASG to an EC2 spot fleet might be a solution.
We are running into this on Jenkins 2.222.4 with plugin 2.1.2 when using an ASG; we haven't tried a spot fleet yet.
This bug is turning into a showstopper for us, because the requirement to always have 1 spot running in the fleet causes that always-running spot to eventually fill up disk inodes or space, as Jenkins will naturally build on it. 'MaxInstanceLifetime=7 days' and the 'OldestInstanceFirst' termination policy in the ASG don't seem to work reliably either. The spot fleet plugin works well once min_size = desired_capacity is set to 1 (so there is always a spot running in the fleet) and scale-out can happen properly.

Jenkins 2.176.2

When min_size = desired_capacity = 0 (i.e. the scale-out bug) AND there are Jenkins builds queued:
INFO: currentDemand -1 availableCapacity 2 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 2 additionalPlannedCapacity 0)

When min_size = desired_capacity = 1 (i.e. to work around the scale-out bug), currentDemand changes according to the number of spots requested and seems to work normally:
INFO: currentDemand 3 availableCapacity 12 (availableExecutors 0 connectingExecutors 5 plannedCapacitySnapshot 7 additionalPlannedCapacity 0)

currentDemand is NOT printed once the fleet starts scale-in.
currentDemand is NOT printed after the fleet terminates the idle spots and the fleet stabilises on 1 spot (min_size = desired_capacity = 1).
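For what it's worth, the arithmetic those INFO lines suggest can be modeled in a few lines (a sketch under assumptions: the function names are mine, and the queue lengths are back-computed from the logged numbers rather than taken from the plugin's actual code):

```python
# Hypothetical model of the demand calculation implied by the log lines above.
# Field names mirror the log output; this is NOT the plugin's real code.

def available_capacity(available_executors, connecting_executors,
                       planned_capacity_snapshot, additional_planned_capacity):
    """Capacity Jenkins believes it already has or will have soon."""
    return (available_executors + connecting_executors
            + planned_capacity_snapshot + additional_planned_capacity)

def current_demand(queue_length, capacity):
    """Only a positive value triggers a request for new instances."""
    return queue_length - capacity

# Stuck case (min_size = desired_capacity = 0): a leftover
# plannedCapacitySnapshot of 2 swallows the queued work, so demand
# stays negative and the fleet never leaves 0.
cap = available_capacity(0, 0, 2, 0)   # 2, matches "availableCapacity 2"
print(current_demand(1, cap))          # -1, matches "currentDemand -1"

# Healthy case (min_size = desired_capacity = 1): demand stays positive.
cap = available_capacity(0, 5, 7, 0)   # 12, matches "availableCapacity 12"
print(current_demand(15, cap))         # 3, matches "currentDemand 3"
```

Under this model, a stale plannedCapacitySnapshot alone is enough to pin currentDemand below zero, which matches the stuck-at-zero behavior described in the thread.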
Similar problem. I'd like to run certain jobs on the master and others in the auto scaling group. My Jenkinsfile has:
(the EC2 Fleet plugin label is
If an instance in the Auto Scaling Group is already running, the job builds fine. If there are no instances currently running in the ASG, then builds hang with the message:
As a test, I removed the label from the Jenkinsfile and reduced the number of executors on the master to one. Now the first job in the queue builds on the master (as I'd expect), but a second one will successfully create an instance in the ASG and run there, even if no instance was previously running in the ASG. So this bug only appears for me when the Jenkinsfile specifies a label.
Guessing, could this be related to the way labels are defined at a different level when using a dockerfile? I.e. without the dockerfile the above agent block would be:
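To make the contrast concrete, the two agent declarations being compared presumably look something like this (an illustrative sketch: 'my-asg-label' and the Dockerfile path stand in for the commenter's real values, which aren't shown in the thread):

```groovy
// Illustrative only: 'my-asg-label' is a placeholder for the real
// EC2 Fleet plugin label.

// Variant 1: agent with a dockerfile -- the label sits a level deeper.
pipeline {
    agent {
        dockerfile {
            filename 'Dockerfile'
            label 'my-asg-label'
        }
    }
    stages {
        stage('build') { steps { sh 'make' } }
    }
}

// Variant 2: the same pipeline without the dockerfile -- the label is a
// direct property of the agent block (shown commented, since a Jenkinsfile
// holds only one pipeline):
// pipeline {
//     agent { label 'my-asg-label' }
//     ...
// }
```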
I've been unable to reproduce this bug in either the spot-fleet or ASG case. I think we simply need more information to figure out what's going wrong, so I'm going to work on adding additional log statements.
We still run into the issue, albeit much less frequently now. Whenever it happens, we see
Restarting Jenkins when it occurs gets us back to a healthy state. It seems to happen more frequently if we're actively modifying the command clouds, if there are multiple command clouds servicing the same node labels, or if we're manually altering the desired capacity of the underlying ASG (things we were doing much more frequently early on, when we were seeing the problem almost daily). Now we're seeing it maybe once every 2-3 weeks.
I switched from using an Auto Scaling Group to a Spot Fleet Request and I've not seen the problem since; it has been running for about a week now.
I found a way to reproduce this 🎉
When jobs are submitted and in the queue, we iterate through the available clouds trying to find one that can provision capacity. When we find one, toAdd is modified for that fleet to the number of new instances we need, and this becomes the
Restarting Jenkins resets
Here are logs showing this; some are new ones I've just added locally to my dev build. The cloud interval is 60 seconds.
I'm not sure how much control we have over changing the
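Based on that reproduction, the failure mode can be sketched as a minimal model (all names are hypothetical, not the plugin's real API; the assumption, per the discussion above, is that a config save replaces the cloud object while Jenkins keeps planned-capacity records derived from the old one):

```python
# Minimal model of the reproduced bug: config saves replace the cloud
# object, but planned-capacity state tied to the OLD object survives and
# permanently offsets demand. Hypothetical names throughout.

class FleetCloud:
    def __init__(self, fleet_id):
        self.fleet_id = fleet_id
        self.to_add = 0          # instances this cloud has promised to start

    def provision(self, excess):
        self.to_add += excess

def excess_workload(queued, planned_count):
    # Jenkins only asks clouds to provision when this is positive.
    return queued - planned_count

# 1. One job is queued; the cloud promises one instance.
old_cloud = FleetCloud("fleet-A")
planned = []                      # planned-node records held by Jenkins
old_cloud.provision(excess_workload(1, len(planned)))
planned.append(old_cloud)         # Jenkins remembers the promise

# 2. The config is saved: the old cloud is discarded and a new one created,
#    but the planned record still references the old object.
new_cloud = FleetCloud("fleet-A")

# 3. A new job arrives. The stale planned record cancels out the demand,
#    so the NEW cloud is never asked to provision and its fleet stays at 0.
print(excess_workload(1, len(planned)))   # 0 -> no provisioning call
print(new_cloud.to_add)                   # 0 -> target capacity never raised
```

Restarting Jenkins clears the stale planned records, which is consistent with a restart getting things back to a healthy state.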
Nice catch @haugenj! This checks out with the observed behavior of the issue becoming rare when we stopped actively tinkering with the config as much :) |
We are still occasionally running into this bug. We are using version
We did make some edits to our fleet configs recently, which might have affected this, but it's hard to say for sure. We'll restart Jenkins this weekend and report back if it happens again, hopefully with a bit more info.
👍 Sounds good. If you've still got the logs on hand, send them my way; perhaps I'll be able to figure out some case where we still don't properly cover config changes, or perhaps there's another cause of this bug that we just don't know about.
It seems the only logs for the
Are there other logs that would be useful for troubleshooting this? |
Yeah, I'm curious about the logs coming before this, which could point us to what created this state. The more you can provide the better, and if you want you can send them directly to me via email: haugenj@amazon.com
Hmm, by the time we notice the issue, the
I have noticed some strange behavior in the plugin. "Minimum cluster size" in Jenkins is configured to 0.
When "Target Capacity" is 0 and builds start, the Spot fleet plugin doesn't provision new instances; builds can stay in the queue for hours, and there are no errors in the AWS Console or Jenkins logs.
If I go to the AWS Console and manually change "Target Capacity" from 0 to 1, spot instances provision fine.
I'm not sure if the issue is related to scaling down to 0, but in another Jenkins where "Minimum cluster size" is configured to 1 this issue is absent.
Please advise.
To reproduce:
Create a new spot request manually for use in Jenkins and set "Target Capacity" to 0
Add the fleet to Jenkins via this plugin, scaling from 0-4
Start a build on a spot instance in Jenkins
Expected behavior:
New spot instances are provisioned automatically by the plugin
Actual behavior:
Builds stay in the queue for hours until you manually change "Target capacity" to 1 in the AWS console.
Versions
Jenkins 2.204.2
EC2 Fleet Jenkins Plugin 1.3.1