
Spot instances are not provisioning if target capacity decreases to 0 #172

Closed
dimaov opened this issue Feb 10, 2020 · 34 comments · Fixed by #236
dimaov commented Feb 10, 2020

I've noticed strange behavior of the plugin. "Minimum cluster size" in Jenkins is configured to 0.
When "Target Capacity" is 0 and builds start, the Spot fleet plugin doesn't provision new instances; builds can stay in the queue for hours, and there are no errors in the AWS Console or the Jenkins logs.
If I go to the AWS Console and manually change "Target Capacity" from 0 to 1, spot instances are provisioned fine.
I'm not sure if the issue is related to scaling down to 0, but on another Jenkins where "Minimum cluster size" is configured to 1 this issue is absent.
Please advise.

To reproduce:
Create a new spot fleet request manually for use in Jenkins and set "Target Capacity" to 0
Add the fleet to Jenkins via this plugin, scaling from 0-4
Start a build on a spot instance in Jenkins

Expected behavior:
New spot instances are provisioned automatically by the plugin

Actual behavior:
Builds stay in the queue for hours until you manually change "Target Capacity" to 1 in the AWS console.

Versions
Jenkins 2.204.2
EC2 Fleet Jenkins Plugin 1.3.1

@suganyaravikumar

We are seeing the same issue. Manually changing "Target capacity" to 1 in the AWS console starts the spot instances. This issue happens only when:

  1. cluster min size is 0 in jenkins fleet configuration
  2. spot fleet target capacity is set to 0
  3. fleet scales down to 0 (when starting for the first time or during weekends/nights)

@akrasnov-drv

I faced this issue today after replacing the fleet ID in the plugin configuration:

  • The fleet was in use for some time; when I needed a configuration change I waited till it scaled down to 0 (actually, at the moment of the change it had 0 of 5 in pending_fulfilment state for quite some time)
  • I created a new (replacement) fleet and replaced the fleet ID in the existing fleet configuration of the plugin in Jenkins, then removed the old fleet from AWS

The new fleet stayed at 0 till I manually made its target positive. Later during the day it was scaled down by the plugin to 0 and got stuck there again.

@terma terma added the bug label Mar 20, 2020

terma commented Mar 20, 2020

Hi, I just tried the following scenario and the plugin works as expected:

  1. Create EC2 Spot Fleet in Console
  2. Set Target Capacity to 0
  3. Wait until the fleet is empty
  4. Configure the plugin and attach the fleet
  5. Run a job
  6. Plugin provisions a new EC2 instance (all good)

Did I miss some preconditions or steps? Do you use weights for the Fleet?


terma commented Mar 20, 2020

Just realized: plugin version 1.3.x is really old and could have this defect. Could you upgrade to the latest one, 2.0.0? That version is compatible with your version of Jenkins and has a lot of fixes and improvements.


dimaov commented Mar 20, 2020

@terma
Yeah, the plugin was updated to 2.0.0 some time ago, but the same issue is still present when the spot fleet scales down to 0.


terma commented Mar 22, 2020

@dimaov could you please share the Jenkins logs, thx


terma commented Mar 22, 2020

In addition to the previous scenario I tried one more:

  1. Create EC2 Spot Fleet in Console
  2. Set Target Capacity to 1
  3. Wait until the fleet is empty
  4. Configure the plugin and attach the fleet
  5. Run a job
  6. Plugin provisions a new EC2 instance (all good)
  7. Wait for scale down to 0
  8. Run a job
  9. Plugin provisions a new EC2 instance (all good)

Works fine. If you can share logs, they would help me to see your case, thx

@akrasnov-drv

@terma
As I wrote above, I had the issue after replacing a fleet in the plugin configuration.
Maybe it is worth checking the following scenario?

  • set target capacity to some positive value (e.g. 5)
  • wait till the relevant instances are created
  • add the fleet to the plugin
  • check if it scales properly (to 0 and up)
  • if yes, try replacing the fleet with another one within the same plugin fleet configuration (just a fleet ID replacement)


terma commented Mar 22, 2020

One clarification:

Jenkins dictates some rules for plugins which extend capacity, like this one or the Azure plugin. Any time you change the configuration, Jenkins doesn't actually update the plugin settings in place; in fact, Jenkins always removes the plugin configuration and creates a new one.

If you change the fleet ID, the current version of the plugin will remove all instances related to the old fleet and start adding instances related to the new fleet without any transition period, which seems correct to me:

  • changing the fleet is rare, more of a one-time operation; I don't really see use cases where people do that on a regular basis
  • simpler implementation and fewer places for plugin logic mistakes (compared to a graceful transition)
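
A quick way to observe the "remove and recreate" behavior is to print the identity of each configured cloud from the script console before and after saving the configuration; the hash changes because the cloud object is recreated. A minimal Groovy sketch using only the standard Jenkins cloud API:

    import jenkins.model.Jenkins

    // List every configured cloud with its JVM identity hash; after saving the
    // cloud configuration the hash changes, showing the object was recreated.
    Jenkins.get().clouds.each { cloud ->
        println "${cloud.name} (${cloud.class.simpleName}) " +
                "@${Integer.toHexString(System.identityHashCode(cloud))}"
    }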


terma commented Mar 22, 2020

In case you have a problem with scale down to 0 (without updating settings), please provide Jenkins logs; that will help me to reconstruct the case, thx


dimaov commented Mar 22, 2020

@terma
Thanks for your attention to this.

When we faced this issue we actually checked the logs, and they said nothing about anything going wrong. There were no errors in the Jenkins logs or in the AWS Spot fleet events either. It looks like there were no actions in the logs or AWS events at all; it seemed like the plugin doesn't ask to create additional spot instances when it is at 0.

Could you clarify which exact logs would be useful for you?
For now we only scale spot instances down to 1 and it's working without issues, but I'll try to reproduce the issue again and provide the needed logs to you.


terma commented Mar 22, 2020

Yep.

First of all, Jenkins calls the plugin to check whether it is possible to scale, so the log should contain records like:

currentDemand {0} availableCapacity {1} (availableExecutors {2} connectingExecutors {3} plannedCapacitySnapshot {4} additionalPlannedCapacity {5})

Additionally, the plugin generates regular work logs like:

fleet instances [...]
described instances [...]
jenkins nodes [...]
jenkins nodes without instance ...
new instances [...]

All of them will help me to see what the existing capacity is, what capacity should be provisioned soon, and what the demand is; plus they will show how the plugin reacts to Jenkins requests. If you want, you can share them in private: artem.stasuk@gmail.com
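
If it is easier, the plugin loggers can also be turned up via Manage Jenkins → System Log or from the script console. A small Groovy sketch using the logger/class names visible in the samples in this thread (java.util.logging can drop loggers that are no longer referenced, so keep the returned list around if you script this):

    import java.util.logging.Level
    import java.util.logging.Logger

    // Raise verbosity for the plugin classes whose messages are quoted above;
    // keep the returned logger objects referenced so the JUL registry retains them.
    def loggers = [
        'com.amazon.jenkins.ec2fleet.EC2FleetCloud',
        'com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy'
    ].collect { name ->
        def logger = Logger.getLogger(name)
        logger.level = Level.FINE
        logger
    }
    println "FINE logging enabled for ${loggers.size()} logger(s)"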


terma commented Mar 22, 2020

A real-world example of plugin-related logs:

Mar 22, 2020 2:32:15 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: currentDemand 1 availableCapacity 0 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 0 additionalPlannedCapacity 0)
Mar 22, 2020 2:32:15 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: FleetCloud [ec2-fleet] excessWorkload 1
Mar 22, 2020 2:32:15 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: FleetCloud [ec2-fleet] to provision = 1
Mar 22, 2020 2:32:15 PM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
INFO: Started provisioning FleetNode-0 from FleetCloud with 1 executors. Remaining excess workload: 0
Mar 22, 2020 2:32:16 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: FleetCloud [ec2-fleet] fleet instances []
Mar 22, 2020 2:32:16 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: FleetCloud [ec2-fleet] described instances []
Mar 22, 2020 2:32:16 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: FleetCloud [ec2-fleet] jenkins nodes []


dimaov commented Mar 23, 2020

Great, thanks.
I've reconfigured the plugin with Min cluster size equal to 0.
I'll provide you with the logs and related screenshots when the issue appears again.


akrasnov-drv commented Mar 23, 2020

@terma what you wrote about settings updates ("change configuration") is interesting, but it is not the behavior I see.

  • if I change timeouts or some other parameters it works fine (no replacement, but a real update)
  • if I replace just the fleet ID I have issues with scaling, but if I replace the full config everything works smoothly.

It looks like during a settings update, even if Jenkins replaces the plugin configuration, some "internal" structures/parameters are kept.
// sorry for going slightly off-topic

Regarding the issue, I can confirm that it looked like Jenkins did not request new instances at all.
There was a queue waiting for labels defined in a fleet, but target capacity for that fleet stayed at 0 (queued tasks had the "jenkins does not have instances with label" status). When I set the target capacity manually, Jenkins happily took the new instances and started to request additional ones as it usually does. When it scaled down to 0 later, the above repeated.
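
For anyone else debugging this, the blocked reason Jenkins reports for each queued item can be dumped from the script console; a small sketch using only core Jenkins APIs:

    import jenkins.model.Jenkins

    // Print every queued item, the label it is waiting for, and the reason
    // Jenkins gives for not running it (e.g. "... doesn't have label ...").
    Jenkins.get().queue.items.each { item ->
        println "${item.task.fullDisplayName} | label: ${item.task.assignedLabel} | why: ${item.why}"
    }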


terma commented Mar 24, 2020

You are right on both counts.

Changing the fleet ID is special from the plugin's point of view:

  • the plugin realizes that the old instances are not related to the new fleet and removes them (that's why you see a drop in existing capacity)
  • the plugin has an internal list of pending changes which is dropped on a fleet update; this is a defect which could be addressed

At the same time, Jenkins always replaces the entire configuration, but because of the internal behavior above, the plugin could end up in a wrong state if any scale activities are in progress.


akrasnov-drv commented Apr 5, 2020

Today, after about 2 weeks of normal work, we are stuck at zero again.
There are several things that could be relevant to this problem, or could just be coincidence:

  • Several hours before, I added a new fleet and removed another one, but nothing was changed in the fleet that's stuck
  • The new job request (which has stayed in the queue for an hour and a half now) was placed at 0:00:15 (local time, i.e. 21:00:15 UTC)
  • According to the fleet history, the last matching node (the one that could have been used) was brought down at 0:01:38

Maybe it's some kind of timing issue? The plugin sees the last VM (which is being shut down) and waits for it?
Meanwhile I'm going to try auto-scaling groups. Maybe they will work more stably...

Upd. I tried replacing the spot fleet with an ASG fleet with the same labels, but the new fleet kept staying at 0 and waiting jobs stayed in the queue. Only when I removed the fleet and re-added it after 1-2 minutes (again with the same labels) did the plugin set the target according to the number of jobs in the queue.

Upd2. Another fleet is stuck; it seems I'll have to remove all fleets and re-add everything again.
Each update is a real pain. Hopefully with ASG I'll be able to change everything we need in the ASG config and not touch the plugin configuration. :(

@akrasnov-drv

Unfortunately, even with an ASG, I still have to change timeouts sometimes (e.g. "Max Idle Minutes Before Scaledown"), and that breaks the configuration again, so I have to remove/recreate all fleets.


chrono commented Jun 23, 2020

Here is an example of this issue. It's currently affecting one of our fleets while other fleets with almost identical configuration (the differences being labels and which ASG to use) are fine. It's just stubbornly refusing to scale the ASG up by itself.

log (block between two "provisioning completed" lines)
Jun 23, 2020 1:15:26 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] fleet instances [i-0e51057bb436b5e12]
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] described instances [i-0e51057bb436b5e12]
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] jenkins nodes [i-0e51057bb436b5e12]
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] jenkins nodes without instance []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] terminated instances []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] new instances []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] start
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] fleet instances []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] described instances []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] jenkins nodes []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] jenkins nodes without instance []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] terminated instances []
Jun 23, 2020 1:15:27 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
services us-east-1 [ci services] new instances []
Jun 23, 2020 1:15:29 AM INFO com.amazon.jenkins.ec2fleet.IdleRetentionStrategy check
Check if node idle i-01ebaafa1a5a687de
Jun 23, 2020 1:15:29 AM INFO com.amazon.jenkins.ec2fleet.IdleRetentionStrategy check
Check if node idle i-02dcf8664d7083b61
Jun 23, 2020 1:15:29 AM INFO com.amazon.jenkins.ec2fleet.IdleRetentionStrategy isIdleForTooLong
Instance: thm heavy spot us-east-2 i-02dcf8664d7083b61 Age: 801667 Max Age:900000
Jun 23, 2020 1:15:29 AM INFO com.amazon.jenkins.ec2fleet.IdleRetentionStrategy check
Check if node idle i-0472a600b6ed9d32b
Jun 23, 2020 1:15:29 AM INFO com.amazon.jenkins.ec2fleet.IdleRetentionStrategy isIdleForTooLong
Instance: thm heavy spot us-east-2 i-0472a600b6ed9d32b Age: 778710 Max Age:900000
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] start
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] fleet instances [i-0472a600b6ed9d32b, i-02dcf8664d7083b61]
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] described instances [i-0472a600b6ed9d32b, i-02dcf8664d7083b61]
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] jenkins nodes [i-0472a600b6ed9d32b, i-02dcf8664d7083b61]
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] jenkins nodes without instance []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] terminated instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
thm heavy spot us-east-2 [ci thm heavy_spot spot us-east-2] new instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] start
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] fleet instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] described instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] jenkins nodes []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] jenkins nodes without instance []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] terminated instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
96_vcpu spot us-east-2 [ci spot 96_vcpu us-east-2] new instances []
Jun 23, 2020 1:15:35 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] start
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] fleet instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] described instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] jenkins nodes []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] jenkins nodes without instance []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] terminated instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
16_vcpu ondemand us-east-1 [ci bulk 16_vcpu us-east-1] new instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] start
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] fleet instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] described instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] jenkins nodes []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] jenkins nodes without instance []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] terminated instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
heavy thm ondemand us-east-1 [ci heavy thm ondemand us-east-1] new instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] start
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] fleet instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] described instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] jenkins nodes []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] jenkins nodes without instance []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] terminated instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm ondemand us-east-1 [ci slim ondemand thm us-east-1] new instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] start
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] fleet instances [i-06db61274612c911c]
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] described instances [i-06db61274612c911c]
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] jenkins nodes [i-06db61274612c911c]
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] jenkins nodes without instance []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] terminated instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
slim thm spot us-east-1 [ci slim_spot thm spot us-east-1] new instances []
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] start
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
2_vcpu ondemand us-east-1 [ci fastlane builder thm 2_vcpu us-east-1] fleet instances [i-0e51057bb436b5e12]
Jun 23, 2020 1:15:36 AM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
currentDemand -2 availableCapacity 3 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 3 additionalPlannedCapacity 0)
Jun 23, 2020 1:15:36 AM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy
Provisioning completed

The affected cloud/label is 16_vcpu, and there were 2 builds waiting for this label at the time of the log.

Jenkins 2.225, fleet plugin 2.0.0


akrasnov-drv commented Jun 23, 2020

@chrono first of all, there is plugin 2.0.2 and I think it works better.
Second, from my experience, the only option you have (without a Jenkins restart) is to remove the fleet definition from Jenkins and recreate it. Be aware that somehow this can affect other fleets too.
... Actually there is one additional, temporary option: if you scale the fleet manually in AWS (e.g. set desired capacity to 1), it usually starts working properly until it gets back to 0.
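
If you prefer to script that temporary workaround instead of clicking through the console, something like the following should work from the Jenkins script console, assuming the AWS SDK classes bundled with the plugin are on the classpath and credentials are available to the controller; the fleet request ID and ASG name below are placeholders:

    import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
    import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
    import com.amazonaws.services.ec2.model.ModifySpotFleetRequestRequest

    // Spot fleet variant: bump the target capacity to 1 so the stuck fleet
    // gets an instance and the plugin starts tracking it again.
    AmazonEC2ClientBuilder.defaultClient().modifySpotFleetRequest(
        new ModifySpotFleetRequestRequest()
            .withSpotFleetRequestId('sfr-00000000-0000-0000-0000-000000000000')  // placeholder
            .withTargetCapacity(1))

    // ASG variant: same idea, set the desired capacity to 1.
    AmazonAutoScalingClientBuilder.defaultClient().setDesiredCapacity(
        new SetDesiredCapacityRequest()
            .withAutoScalingGroupName('my-jenkins-build-asg')  // placeholder
            .withDesiredCapacity(1))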

@jroylance

We're currently running into this problem with Auto Scaling groups, and I'm wondering if anybody has had the same problem with EC2 Spot Fleets? Perhaps switching from an ASG to an EC2 Spot Fleet might be a solution.

@SrodriguezO

We are running into this on Jenkins 2.222.4 with plugin 2.1.2 when using an ASG; we haven't tried a spot fleet yet.

@davedomainx

This bug is turning into a showstopper for us, because the requirement to always have 1 spot instance running in the fleet causes that always-running spot to eventually fill up disk inodes or space, since Jenkins will naturally build on it.

'MaxInstanceLifetime=7 days' and the 'OldestInstanceFirst' termination policy in the ASG don't seem to work reliably either.

The Spot fleet plugin works well once min_size = desired_capacity is set to 1 (so there is always a spot instance running in the fleet), and scale-out then happens properly.

. Jenkins 2.176.2
. EC2 Fleet plugin 2.0.2
. ASG group, no weighting, nothing special or extraordinary about the ASG.
. We also have another spot fleet running on Jenkins 2.235.3 with plugin 2.0.2 and the behaviour is exactly the same as below.

When min_size = desired_capacity = 0 (i.e. the scale-out bug) AND there are Jenkins builds waiting for the spots to launch (which never happens), the line below is printed every 10 seconds:

INFO: currentDemand -1 availableCapacity 2 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 2 additionalPlannedCapacity 0)

When min_size = desired_capacity = 1 (i.e. to work around the scale-out bug):

. currentDemand changes according to the number of spots requested and seems to work normally:

INFO: currentDemand 3 availableCapacity 12 (availableExecutors 0 connectingExecutors 5 plannedCapacitySnapshot 7 additionalPlannedCapacity 0)

. currentDemand is NOT printed once the fleet starts scale-in.

. currentDemand is NOT printed after the fleet terminates the idle spots and the fleet stabilises on 1 spot (min_size = desired_capacity = 1)


reb197 commented Nov 16, 2020

Similar problem. I'd like to run certain jobs on the master and others in the auto scaling group. My Jenkinsfile has:

    agent {
        dockerfile {
            filename "docker/base/ci.Dockerfile"
            label "spot-instances"
        }
    }

(the EC2 Fleet plugin label is spot-instances).

If an instance in the Auto Scaling Group is already running, the job builds fine.

If there are no instances currently running in the ASG, then builds hang with the message "Jenkins doesn't have label 'spot-instances'" until an instance is available. I can launch one manually by adjusting the Desired Capacity of the ASG in the AWS console, and the hung build then resumes.

As a test I removed the label from the Jenkinsfile and reduced the number of executors on the master to one. Now the first job in the queue builds on the master (as I'd expect), but a second one will successfully create an instance in the ASG and run there, even if no instance was currently running in the ASG. So this bug only appears for me when the Jenkinsfile specifies a label.

Guessing, but could this be related to the way labels are defined at a different level when using dockerfile? i.e. without the dockerfile the above agent block would be:

    agent {
         label "spot-instances"
    }


haugenj commented Nov 19, 2020

I've been unable to reproduce this bug for both the spot-fleet and ASG cases. I think we simply need more information to be able to figure out what's going wrong, so I'm going to work on adding additional log statements.

@SrodriguezO

We still run into the issue, albeit much less frequently now. Whenever it happens, we see that plannedCapacitySnapshot in the logs remains stuck at a non-zero value. This makes the plugin think the available capacity (which includes planned capacity) is higher than it should be, and prevents launching new nodes unless demand grows beyond the inaccurate perceived capacity.

Restarting Jenkins when it occurs gets us back to a healthy state. It seems to happen more frequently if we're actively modifying the command clouds, if there are multiple command clouds servicing the same node labels, or if we're manually altering the desired capacity of the underlying ASG (things we were doing much more frequently early on, when we were seeing the problem almost daily). Now we're seeing it maybe once every 2-3 weeks.
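
To illustrate why a stuck value suppresses provisioning, here is roughly how the fields in the currentDemand/availableCapacity log lines elsewhere in this thread appear to relate (an inferred illustration with made-up numbers, not the plugin's actual code):

    // A leftover plannedCapacitySnapshot inflates availableCapacity beyond the
    // queue length, so demand never goes positive and nothing new is provisioned.
    int queueLength = 1                  // example: one job waiting for the label
    int availableExecutors = 0
    int connectingExecutors = 0
    int plannedCapacitySnapshot = 7      // stuck non-zero value
    int additionalPlannedCapacity = 0

    int availableCapacity = availableExecutors + connectingExecutors +
            plannedCapacitySnapshot + additionalPlannedCapacity
    int currentDemand = queueLength - availableCapacity          // 1 - 7 = -6

    if (currentDemand > 0) {
        println "would ask the cloud to provision ${currentDemand} executor(s)"
    } else {
        println "no provisioning: currentDemand ${currentDemand} is not positive"
    }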


reb197 commented Nov 26, 2020

I switched from using an Auto Scaling Group to a Spot Fleet Request and I've not seen the problem since; it has been running for about a week now.


haugenj commented Dec 7, 2020

I found a way to reproduce this 🎉

When jobs are submitted and in the queue, we iterate through the available clouds trying to find one that can provision capacity. When we find one, toAdd for that fleet is set to the number of new instances we need, and this becomes the plannedCapacitySnapshot. If that fleet is modified before the next round of update() is called on it (determined by the cloud status interval config), toAdd is reset to 0 but plannedCapacitySnapshot remains > 0, so we never provision new nodes.

Restarting Jenkins resets plannedCapacitySnapshot, which fixes the mismatch in state.

Here are logs showing this; some are new ones I've just added locally to my dev build. The cloud interval is 60 seconds.

INFO: spot-fleet-1 [spot-fleet] start
Dec 07, 2020 12:51:30 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] toAdd: 0
Dec 07, 2020 12:51:31 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] new targetCapacity should be: 0
Dec 07, 2020 12:51:31 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] fleet instances: []
Dec 07, 2020 12:51:31 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] described instances: []
Dec 07, 2020 12:51:31 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] jenkins nodes: []
Dec 07, 2020 12:51:40 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet snapshot: LoadStatisticsSnapshot{definedExecutors=0, onlineExecutors=0, connectingExecutors=0, busyExecutors=0, idleExecutors=0, availableExecutors=0, queueLength=1}
Dec 07, 2020 12:51:40 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet: currentDemand 1 availableCapacity 0 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 0 additionalPlannedCapacity 0)
Dec 07, 2020 12:51:40 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@212ed739 cant provision for label {1}, continuing...
Dec 07, 2020 12:51:40 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] excessWorkload 1
Dec 07, 2020 12:51:40 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] to provision = 1

# Fleet was changed here to a different id

Dec 07, 2020 12:51:44 PM com.amazon.jenkins.ec2fleet.utils.EC2FleetCloudAwareUtils reassign
INFO: Finish to reassign resources from old cloud with id 94551af8-375f-44c9-9d18-a0854050f332 to spot-fleet-1
Dec 07, 2020 12:51:44 PM com.amazon.jenkins.ec2fleet.IdleRetentionStrategy check
INFO: Check if node idle i-081cf67b7c34deee9
Dec 07, 2020 12:51:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet snapshot: LoadStatisticsSnapshot{definedExecutors=0, onlineExecutors=0, connectingExecutors=0, busyExecutors=0, idleExecutors=0, availableExecutors=0, queueLength=1}
Dec 07, 2020 12:51:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet: currentDemand 0 availableCapacity 1 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Dec 07, 2020 12:51:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: currentDemand is less than 1, not provisioning
...
INFO: spot-fleet-1 [spot-fleet] start
Dec 07, 2020 12:52:44 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] toAdd: 0
Dec 07, 2020 12:52:45 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] new targetCapacity should be: 0
Dec 07, 2020 12:52:45 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] fleet instances: []
Dec 07, 2020 12:52:45 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] described instances: []
Dec 07, 2020 12:52:45 PM com.amazon.jenkins.ec2fleet.EC2FleetCloud info
INFO: spot-fleet-1 [spot-fleet] jenkins nodes: []
Dec 07, 2020 12:52:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet snapshot: LoadStatisticsSnapshot{definedExecutors=0, onlineExecutors=0, connectingExecutors=0, busyExecutors=0, idleExecutors=0, availableExecutors=0, queueLength=1}
Dec 07, 2020 12:52:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: label spot-fleet: currentDemand 0 availableCapacity 1 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Dec 07, 2020 12:52:50 PM com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
INFO: currentDemand is less than 1, not provisioning

I'm not sure how much control we have over changing the plannedCapacitySnapshot, because that looks like it's handled by Jenkins itself, not our plugin, but we might be able to check toAdd when reassigning resources and set it on the updated fleet. I'll continue working on a fix.
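
For what it's worth, a rough sketch of that idea; the field and method names are hypothetical (the real reassignment lives in EC2FleetCloudAwareUtils), so this is only the shape of the change, not tested code:

    // Hypothetical sketch: when Jenkins saves the configuration and the plugin
    // reassigns resources from the old EC2FleetCloud object to the new one,
    // carry the pending scale-up count across so it is not reset to 0 while
    // Jenkins still counts it inside plannedCapacitySnapshot.
    void reassign(def oldCloud, def newCloud) {
        // ... existing reassignment of nodes/computers to the new cloud ...
        newCloud.toAdd += oldCloud.toAdd   // preserve not-yet-applied provisioning
    }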

@SrodriguezO

Nice catch @haugenj! This checks out with the observed behavior of the issue becoming rare once we stopped actively tinkering with the config as much :)

@SrodriguezO

We are still occasionally running into this bug. We are using version 2.3.2 of the plugin. The logs show plannedCapacitySnapshot stuck above 0 despite no new nodes coming in.

We did make some edits to our fleet configs recently, which might have affected this, but it's hard to say for sure. We'll restart Jenkins this weekend and report back if it happens again, hopefully with a bit more info.


haugenj commented Mar 12, 2021

👍 Sounds good. If you've still got the logs on hand, send them my way and perhaps I'll be able to figure out a case where we still don't properly cover the config changes, or perhaps there's another cause of this bug that we just don't know about.

@SrodriguezO

It seems the only logs for the NoDelayProvisionStrategy logger are of this nature:

INFO	c.a.j.e.NoDelayProvisionStrategy#apply: label [<node_labels>]: currentDemand -6 availableCapacity 7 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 7 additionalPlannedCapacity 0)

Are there other logs that would be useful for troubleshooting this?


haugenj commented Mar 12, 2021

Yeah, I'm curious about the logs coming before this that could point us to what created this state. The more you can provide the better, and if you want you can send them directly to me via email: haugenj@amazon.com

@SrodriguezO

Hmm, by the time we notice the issue, the NoDelayProvisionStrategy log line will have occurred hundreds of times :/ so we can't effectively pinpoint when it first started. I'll try to track config changes going forward, so if it happens again we can search around the time of recent config changes and see if anything stands out.
