NoDelayProvisionStrategy won't provision after scaling down to 0 instances in auto scaling group #425

Open
cccCody opened this issue Nov 21, 2023 · 7 comments

@cccCody

cccCody commented Nov 21, 2023

I think this is the same issue as #180

Describe the bug
I'm currently trying to move from ec2-plugin to this plugin, but I'm seeing that the final stage of my build doesn't ever get an executor. My build looks roughly like this:

  1. build step on a single node
  2. test in parallel on several nodes (150 of them!)
  3. collect coverage reports on a single node

Everything works nicely until the last step, where it gets stuck on:

All nodes of label ec2-fleet are offline

When I check the system logs, I see this on repeat:

Nov 21, 2023 3:08:12 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [ec2-fleet]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 21, 2023 3:08:12 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [ec2-fleet]: No excess workload, provisioning not needed.

I'm especially suspicious of plannedCapacitySnapshot 1, which, if I'm reading the source code right, seems to mean that it thinks it's already started scaling up another node (and is waiting for it to come online?) but it never does.
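
To make the numbers in that log line concrete, here is a minimal sketch of the arithmetic they appear to describe. It assumes `availableCapacity` is the sum of the three values in parentheses and that provisioning only happens when the queue exceeds it. The class and method names are hypothetical and this is not the plugin's actual source.

```java
// Hypothetical model of the check implied by the log output; not the plugin's code.
class CapacityCheck {

    // Assumption: availableCapacity is the sum of the three values the log prints in parentheses.
    static int availableCapacity(int availableExecutors,
                                 int plannedCapacitySnapshot,
                                 int additionalPlannedCapacity) {
        return availableExecutors + plannedCapacitySnapshot + additionalPlannedCapacity;
    }

    // Queue items not covered by existing or planned capacity (never negative).
    static int excessWorkload(int queueLength,
                              int availableExecutors,
                              int plannedCapacitySnapshot,
                              int additionalPlannedCapacity) {
        int capacity = availableCapacity(
                availableExecutors, plannedCapacitySnapshot, additionalPlannedCapacity);
        return Math.max(0, queueLength - capacity);
    }

    public static void main(String[] args) {
        // Values from the log: queueLength 1, availableExecutors 0,
        // plannedCapacitySnapshot 1, additionalPlannedCapacity 0.
        System.out.println("excess with planned node counted: " + excessWorkload(1, 0, 1, 0));
        // The same queue if no planned node were counted.
        System.out.println("excess without planned node: " + excessWorkload(1, 0, 0, 0));
    }
}
```

Under these assumptions, a single planned node that never comes online is enough to cancel out a queue length of 1, which would match the repeated "No excess workload" message.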

Other misc info, may or may not be relevant:

  • All steps use the same label ("ec2-fleet") and run one executor per node.
  • cloud configuration includes:
    • Minimum Cluster Size: 0
    • Maximum Cluster Size: 2000
    • Minimum Spare Size: 0
    • Maximum Total Uses: 1

Environment Details

Plugin Version?
3.1.0 (latest as of opening this)

Jenkins Version?
2.426.1 (latest LTS version as of opening this issue)

Spot Fleet or ASG?
ASG

Label based fleet?
no

Linux or Windows?
linux

@cccCody

cccCody commented Nov 21, 2023

I was able to get it to work for a single run by setting "Minimum Spare Size" to 1, but then when I started another build after that, it hit the issue when provisioning the first node that time. It seems like, more generally, this is an issue with scaling out shortly after scaling in.

@icep87

icep87 commented Nov 22, 2023

We are also seeing this issue. When there are no agents available, meaning they have all been scaled down, the plugin won't spin up agents at all.

In the logs it says:

Nov 22, 2023 7:02:32 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy
label [linux]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 22, 2023 7:02:32 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [linux]: No excess workload, provisioning not needed.
Nov 22, 2023 7:02:32 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy
label [powerful]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 22, 2023 7:02:32 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [powerful]: No excess workload, provisioning not needed.

I'm wondering why it reports an availableCapacity of 1 when there is clearly none, so no scale-up is triggered.

@icep87

icep87 commented Nov 28, 2023

@cccCody Did you manage to find the cause of this?

@taka-papa

I encountered a similar problem.
I ran the following in the Jenkins script console:

Jenkins jenkins = Jenkins.getInstance()

// Print every label that still has planned (pending) node launches.
jenkins.getLabels().each { Label label ->
    def nodeProvisioner = label.nodeProvisioner
    def pendingLaunches = nodeProvisioner.getPendingLaunches()

    // Inside each { }, return skips to the next label.
    if (pendingLaunches.size() == 0) {
        return
    }

    println("Label: ${label.name}")
    pendingLaunches.each {
        println("  Planned Node: ${it.displayName}, Executors: ${it.numExecutors}")
    }
}

Output

Label: xxx
  Planned Node: NodeName-xx, Executors: 1
  Planned Node: NodeName-xx, Executors: 1

There were no jobs running.
Restarting Jenkins solved it.
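
One way to think about such "stuck" planned nodes is a launch whose future has not completed long after it was planned. The sketch below only models that idea: the `Planned` class, its `plannedAt` timestamp, and the 30-minute threshold are all hypothetical stand-ins, not Jenkins or plugin APIs.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical model of planned nodes; not the Jenkins PlannedNode class.
class StalePlannedNodes {

    static final class Planned {
        final String displayName;
        final int numExecutors;
        final Future<?> future;   // completes when the node comes online
        final Instant plannedAt;  // assumption: we know when the launch was planned

        Planned(String displayName, int numExecutors, Future<?> future, Instant plannedAt) {
            this.displayName = displayName;
            this.numExecutors = numExecutors;
            this.future = future;
            this.plannedAt = plannedAt;
        }
    }

    // Returns planned nodes that are still pending after more than maxWait.
    static List<Planned> findStale(List<Planned> pending, Instant now, Duration maxWait) {
        List<Planned> stale = new ArrayList<>();
        for (Planned p : pending) {
            boolean overdue = Duration.between(p.plannedAt, now).compareTo(maxWait) > 0;
            if (!p.future.isDone() && overdue) {
                stale.add(p);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Planned> pending = List.of(
                new Planned("NodeName-a", 1, new CompletableFuture<Void>(),
                        now.minus(Duration.ofMinutes(90))),
                new Planned("NodeName-b", 1, new CompletableFuture<Void>(),
                        now.minus(Duration.ofMinutes(1))));

        // Only the launch pending for 90 minutes exceeds the 30-minute threshold.
        for (Planned p : findStale(pending, now, Duration.ofMinutes(30))) {
            System.out.println("Stale planned node: " + p.displayName
                    + ", executors: " + p.numExecutors);
        }
    }
}
```

This only illustrates how a leftover planned node could be told apart from one that is genuinely still starting. Whether the plugin tracks anything comparable is not established in this thread.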

@opajonk

opajonk commented Jan 5, 2024

I think we are running into the same issue here, with the NoDelayProvisionStrategy. Digging around in the issues, I found #149, so this one reads like a regression. Could that be?

Restarting Jenkins "fixed" the issue, but I suspect it will come back. When it does, I will run the script console snippet from @taka-papa to see if we also have "stuck planned" machines.

@pawel-t

pawel-t commented Feb 8, 2024

I have faced the same issue on 3.2.0.

This started once we switched to ASG from SpotFleet. My SpotFleet was set up with Min = 0 and Spare = 0; after it scaled down to 0 instances, it didn't scale up for 1.5h while jobs were waiting in the queue.

I needed to increase Min and Spare to get it to work.

@ldmonkey

We downgraded the plugin from 3.2.0 to 3.0.1. Waiting for a fix for this issue.
