Huge lag in provisioning workers, workers become suspended but not deleted #322

Closed
wosiu opened this issue Feb 8, 2022 · 34 comments · Fixed by #376

@wosiu

wosiu commented Feb 8, 2022

Issue Details

Sometimes there is a huge lag before workers are provisioned. Our developers started to observe this a few weeks ago.
For example:

16:36:56  Still waiting to schedule task
16:36:56  All nodes of label ‘[integration-tests&&spot&&disable-resubmit](https://anonymised.it/label/integration-tests&&spot&&disable-resubmit/)’ are offline
18:16:54  Running on [ci-jenkins-executor--integration-tests-spot i-0fda2dca2c245b03a](https://anonymised.it/computer/i-0fda2dca2c245b03a/) in /mnt/jenkins/workspaces/workspace/sanity-check-on-stag

As you can see, the machine was provisioned almost 2 hours later, and I suspect it started only because we manually changed the minimum cluster size for this label to 1 to kind of kick it.

I don't have the full log, but I was able to collect some logs from the period while the build was waiting for a node:

2022-02-08 16:45:42.495+0000 [id=42]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] setting stats
2022-02-08 16:45:42.495+0000 [id=42]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Jenkins nodes: []
2022-02-08 16:45:42.495+0000 [id=42]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Described instances: []
2022-02-08 16:45:42.495+0000 [id=42]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Fleet instances: []
2022-02-08 16:45:42.392+0000 [id=42]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] start cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@26e970e3
2022-02-08 16:45:41.234+0000 [id=41]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: Provisioning completed
2022-02-08 16:45:41.234+0000 [id=41]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: label [integration-tests&&spot&&disable-resubmit]: currentDemand is less than 1, not provisioning
2022-02-08 16:45:41.234+0000 [id=41]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: label [integration-tests&&spot&&disable-resubmit]: currentDemand -7 availableCapacity 8 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 8 additionalPlannedCapacity 0)
2022-02-08 16:45:32.506+0000 [id=41]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] setting stats
2022-02-08 16:45:32.506+0000 [id=41]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Jenkins nodes: []
2022-02-08 16:45:32.506+0000 [id=41]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Described instances: []
2022-02-08 16:45:32.505+0000 [id=41]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] Fleet instances: []
2022-02-08 16:45:32.392+0000 [id=41]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] start cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@26e970e3
2022-02-08 16:45:31.234+0000 [id=40]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: Provisioning completed
2022-02-08 16:45:31.234+0000 [id=40]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: label [integration-tests&&spot&&disable-resubmit]: currentDemand is less than 1, not provisioning
2022-02-08 16:45:31.234+0000 [id=40]  FINE  c.a.j.ec2fleet.NoDelayProvisionStrategy: label [integration-tests&&spot&&disable-resubmit]: currentDemand -7 availableCapacity 8 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 8 additionalPlannedCapacity 0)
2022-02-08 16:45:22.476+0000 [id=37]  FINE  c.amazon.jenkins.ec2fleet.EC2FleetCloud: ci-jenkins-executor--integration-tests-spot [ integration-tests spot disable-resubmit 1executors] setting stats

At 18:10 the status showed target: 0, so it looks like the plugin didn't realise that there was a build waiting for this node: (screenshot)

Around 18:15 we changed the minimum cluster size for this label to 1, and a new node was started.

Interestingly, we noticed it on the 2 separate Jenkins instances we have. We don't recall a problem like this before.

We suspect it started to happen after the migration from version 2.3.7 to 2.4.1, but we're not sure.
Also, we migrated to 2.4.1 around 21.12.2021, whereas we THINK the problem started some time after the upgrade and Jenkins restart. Not sure though.

To Reproduce
I don't know. It just happens from time to time and, as a result, our jobs time out. The frequency of this problem seems to increase from week to week.

Environment Details

Plugin Version?
2.4.1

Jenkins Version?
2.325

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?
Linux

EC2Fleet Configuration as Code

    - eC2Fleet:
        name: "ci-jenkins-executor--integration-tests-spot"
        fleet: "ci-jenkins-executor--integration-tests-spot"
        labelString: "integration-tests spot disable-resubmit 1executors"
        minSize: 0
        maxSize: 120
        maxTotalUses: 50
        numExecutors: 1
        disableTaskResubmit: true
        addNodeOnlyIfRunning: false
        alwaysReconnect: true
        cloudStatusIntervalSec: 10
        computerConnector:
          sSHConnector:
            credentialsId: "standard-runner-ubuntu-user-private-key"
            launchTimeoutSeconds: 60
            maxNumRetries: 10
            port: 22
            retryWaitTime: 15
            sshHostKeyVerificationStrategy:
              manuallyTrustedKeyVerificationStrategy:
                requireInitialManualTrust: false
        fsRoot: "/mnt/jenkins/workspaces"
        idleMinutes: 3
        initOnlineCheckIntervalSec: 15
        initOnlineTimeoutSec: 600
        noDelayProvision: true
        privateIpUsed: true
        region: "us-west-2"
        restrictUsage: true
        scaleExecutorsByWeight: false

Anything else unique about your setup?

No

@wosiu wosiu added the bug label Feb 8, 2022
@fdaca

fdaca commented Feb 17, 2022

This can be found in the logs:

currentDemand -1 availableCapacity 8 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 8 additionalPlannedCapacity 0)

and

currentDemand is less than 1, not provisioning

@imuqtadir

From the limited logs you provided, I see plannedCapacitySnapshot=8, which means Jenkins is waiting for the instances to come up, and also Described instances: [], which basically means that no instances were launched/running within the fleet. I don't see the target capacity printed in the logs, so I can't really point to a particular issue. There could be multiple reasons for this to happen.

When this happens, please check whether the target on your EC2 Fleet/ASG was updated. If it was indeed updated correctly and you are not getting new instances launched, it could mean that your chosen instance types are NOT currently available for Spot. I'm not sure what set of instance types you have selected, so expanding your instance type options could be something to try. Another reason could be a user-data script attached to your instances that causes this huge lag before they become available. From the plugin's point of view, it is responsible for calculating the correct target capacity.

If the target is not getting updated, then it probably requires additional debugging. However, since plannedCapacitySnapshot 8 has been updated correctly, I highly doubt this is the case.

Side note: we recently launched minSpareInstances as part of 2.5.0. This will always keep the set number of instances available, which avoids the provisioning delay. Please see #321

@fdaca

fdaca commented Feb 18, 2022

@imuqtadir Thank you for the details. We're using spot nodes with the EC2 Fleet plugin. What we're observing is plannedCapacitySnapshot being set to the desired amount for a long time, but:

  • no scale-out events visible on the ASG side
  • EC2 Fleet status reporting nodes: 0, target: 0
  • our ASG maximum capacity is high - around 120
  • our spot instance request utilisation is well within limits, with a lot of room
  • blocks like this take 1h+ to self-resolve

We haven't bumped into this issue since we moved to EC2 Fleet 2.5.0 and Jenkins 2.334, which might be the cure (?)

@imuqtadir

@fdaca Great - if it happens next time, please check the target capacity on the EC2 side (along with the EC2 Fleet status in the plugin). I'll close this issue out for now, but feel free to reopen it or create a new issue.

@mwos-sl

mwos-sl commented Mar 5, 2022

@imuqtadir I don't have GitHub permission to reopen the issue.

But the issue is definitely still there (on the 2 independent Jenkins instances we have). It seems the plugin does not figure out that there is demand for a given label.

Plugin version: 2.5.0
Jenkins: 2.336

Behaviour is exactly the same as described above.
The pipeline has been hanging for more than 1 hour on: (screenshot)

while the plugin says target is 0 for that whole time: (screenshot)

Attaching logs from the log recorder for this plugin:
EC2 Fleet Plugin.log

On the AWS side, the ASG desired count is 0, and there haven't been any events in the Activity history for the last several hours. The last events are: (screenshot)

whereas the job that is hanging at the moment started ~4 hours after the last event in the Activity history.

@mwos-sl

mwos-sl commented Mar 5, 2022

I noticed it gets unstuck only when a new job arrives that needs the same label. Right now I'm observing a situation where a different build using the same label started (while the previous build is still hanging), and as a result the "target" capacity on the plugin side just increased to 1. Eventually both builds passed. So I wouldn't even say it "auto-resolves": if there isn't a new build, the hanging one stays stuck forever.

@imuqtadir imuqtadir reopened this Mar 5, 2022
@imuqtadir imuqtadir added the stalebot-ignore To NOT let the stalebot update or close the Issue / PR label Mar 8, 2022
@mwos-sl

mwos-sl commented Mar 10, 2022

Is there anything I could do to help resolve this, like providing some extra data?

@imuqtadir

@mwos-sl Thanks for the details. This seems like a transient issue that is difficult to reproduce, especially since a new build with the same label is able to increment the total capacity later. How often do you see this issue happening, and does it affect other labels or is it always specific to a single label? If you see a pattern that helps us reproduce the bug, that would be super helpful.

@mwos-sl

mwos-sl commented Mar 14, 2022

does it affect other labels

Seems like all the labels managed by the plugin are affected.
Labels managed by the other plugin (https://plugins.jenkins.io/ec2/) are just fine.
I'll try to gather more data over time.

@mwos-sl

mwos-sl commented Mar 21, 2022

I just saw a situation where 2 jobs were waiting tens of minutes for a single label, yet the target capacity was set to 0 by the plugin and there were no scaling events on the ASG side. So it seems:

since the new build with the same label is able to increment the total capacity later.

is not always the case :( We even implemented a "workaround" job, which runs every hour and provisions 1 machine for every label to unblock the queue, but the improvement is questionable.

@xocasdashdash

So I think we've hit the same issue. Somehow Jenkins is under the impression that we have an executor with the label, but it's not showing up in the UI.
The workaround for now has been to retry the jobs manually, which is something we want to avoid.

From my limited Java skills it seems the issue comes from Jenkins, as the data that comes in here (public NodeProvisioner.StrategyDecision apply(final NodeProvisioner.StrategyState strategyState)) is not good.

I'm trying to find out where this data is coming from and why Jenkins thinks this is the case; I suspect some internal caching that, once it expires, fixes the situation.

@mwos-sl have you found a valid workaround? I'm tempted to try the plain ec2 plugin instead of this one if there are no issues there.

@mwos-sl

mwos-sl commented Jun 7, 2022

@mwos-sl have you found a valid workaround?

Nope, unfortunately not. The issue described here was causing serious lags on our Jenkins instances, so we rolled back to the ec2 plugin you mentioned. We don't have any issues with the ec2 plugin, except that it is very poor for spot instances, so we use on-demand only there. This plugin (ec2-fleet-plugin) is way better at handling spots (possibility to declare multiple spot pools).

@xocasdashdash Can you experiment with setting the "no delay provision strategy" to false? Maybe the bug is there and all we need to do is disable it? Unfortunately I'm not able to test it myself anymore, but I wish to go back to this plugin at some point :(

@xocasdashdash

I'll try to check it out, but it seems to be a deeper issue in how Jenkins assigns jobs to labels; there's not much this plugin can do if that fails 😕

@wosiu
Author

wosiu commented Oct 18, 2022

I gave it one more shot with the recent release of the plugin (2.5.2), and so far it works fine!!! 🤞
It seems @h-okon fixed the issue with #343 (many thanks!!!).
Let me soak it for a few more weeks and then I'll close this ticket.

EDIT: Nope, the issue is still there :(

@igtsekov

I am using 2.5.2 and it still happens. :(

@avikivity

/cc @benipeled

@mwos-sl

mwos-sl commented Nov 10, 2022

OK, so indeed there is still some issue. It is better than last time, but it still happens.
Basically, "idle time" was not respected. A machine was marked for termination, yet it was still listed in Jenkins (with a small red cross). And even though there were jobs waiting for that particular label, the existing workers weren't picked, because they had previously been marked for termination due to being idle (some time before). But they still seem to participate in the available capacity, because no new workers were provisioned. As a result, builds are waiting for executors while no new workers are provisioned :(

Checking if node 'i-07d22ecd9c1ddc6af' is idle 
Nov 10, 2022 6:16:28 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy isIdleForTooLong
Instance: build-jenkins-fleet-cse-general-spot i-07d22ecd9c1ddc6af Builds left: 46  Age: 42362305 Max Age:180000
Nov 10, 2022 6:16:28 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
build-jenkins-fleet-cse-general-spot [cse general spot] Scheduling instance 'i-07d22ecd9c1ddc6af' for termination on cloud com.amazon.jenkins.ec2fleet.EC2FleetCloud@6f3c3f8d with force: false
Nov 10, 2022 6:16:28 AM FINE com.amazon.jenkins.ec2fleet.EC2FleetCloud fine
build-jenkins-fleet-cse-general-spot [cse general spot] InstanceIdsToTerminate: [i-07d22ecd9c1ddc6af, i-07228fb5b5499afa6, i-0632d358da022654c]
Nov 10, 2022 6:16:28 AM FINE com.amazon.jenkins.ec2fleet.EC2RetentionStrategy check
Checking if node 'i-07d3ac95b1dd1c768' is idle 
Nov 10, 2022 6:16:28 AM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy isIdleForTooLong
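
For context, the check in this log boils down to comparing the node's idle age against a maximum age and, if exceeded, only scheduling the instance for termination. A rough sketch of that shape (illustrative names and simplified logic, not the plugin's actual code):

import java.util.HashSet;
import java.util.Set;

// Illustrative only: an idle node whose age exceeds the max age is scheduled for
// termination, but nothing here removes it from Jenkins, so it can still count
// toward available capacity until the cloud actually terminates it.
public class IdleCheckSketch {
    static final Set<String> instanceIdsToTerminate = new HashSet<>();

    static boolean isIdleForTooLong(long idleSinceMillis, long maxIdleMillis) {
        long age = System.currentTimeMillis() - idleSinceMillis;
        return age > maxIdleMillis;
    }

    public static void main(String[] args) {
        long maxIdleMillis = 180_000;                              // "Max Age:180000" (3 minutes) from the log
        long idleSince = System.currentTimeMillis() - 42_362_305;  // "Age: 42362305" from the log

        if (isIdleForTooLong(idleSince, maxIdleMillis)) {
            instanceIdsToTerminate.add("i-07d22ecd9c1ddc6af");     // scheduled for termination, not yet deleted
            System.out.println("InstanceIdsToTerminate: " + instanceIdsToTerminate);
        }
    }
}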

@mwos-sl

mwos-sl commented Nov 15, 2022

After some investigation, it seems that plannedCapacitySnapshot becomes drifted after a configuration reload (without a Jenkins restart, which is a no-go for us):

label [cse&&general]: currentDemand -3 availableCapacity 4 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 4 additionalPlannedCapacity 0)

There was 1 job in the queue at the time, which is why:
currentDemand = queueLength - availableCapacity = 1 - 4 = -3
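
For illustration, here is a minimal sketch of that arithmetic, assuming the quantities combine exactly as printed in the log line above (the class and variable names are made up, not the plugin's actual code):

// Illustrative only: reproduces the arithmetic from the log line above.
public class StaleSnapshotSketch {
    public static void main(String[] args) {
        int queueLength = 1;               // 1 build waiting in the queue
        int availableExecutors = 0;
        int connectingExecutors = 0;
        int plannedCapacitySnapshot = 4;   // stale planned nodes left over after the configuration reload
        int additionalPlannedCapacity = 0;

        int availableCapacity = availableExecutors + connectingExecutors
                + plannedCapacitySnapshot + additionalPlannedCapacity;   // 4
        int currentDemand = queueLength - availableCapacity;             // 1 - 4 = -3

        if (currentDemand < 1) {
            // matches the plugin log: "currentDemand is less than 1, not provisioning"
            System.out.println("currentDemand " + currentDemand + ", not provisioning");
        }
    }
}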

Some of the already-closed issues might be related:
#261
#172
They were "solved" by restarting Jenkins, which suggests the bug has been there all along.

@mwos-sl

mwos-sl commented Dec 13, 2022

After several days without problems, we hit the issue again. I >think< the problem starts after a configuration reload.

@HDatPaddle

After several days without problems, we hit the issue again. I >think< the problem starts after a configuration reload.

Agree - it also appears to happen to us after a configuration reload. A restart clears it.

@naorc123

Is there any planned fix for this, or maybe another workaround?
We encounter it as well, but restarting Jenkins after every configuration reload makes no sense.

@mwos-sl

mwos-sl commented May 23, 2023

@pdk27 any updates? Is there any chance the plugin will be maintained again?

@mtn-boblloyd

mtn-boblloyd commented May 23, 2023

I wound up fixing this myself and using a custom version of the plugin. It looks like either Jenkins changed the meaning of some of the NodeProvisioner.StrategyState fields, or something else changed. In NoDelayProvisionStrategy.java, I changed the logic that computes the current number of functional nodes to this:

        final LoadStatistics.LoadStatisticsSnapshot snapshot = strategyState.getSnapshot();
        final int availableCapacity = snapshot.getOnlineExecutors() + snapshot.getConnectingExecutors() - snapshot.getBusyExecutors();

        int currentDemand = snapshot.getQueueLength() - availableCapacity;
        LOGGER.log(currentDemand < 1 ? Level.FINE : Level.INFO,
                "label [{0}]: currentDemand {1} availableCapacity {2} (onlineExecutors {3} connectingExecutors {4} busyExecutors {5} queueLength {6})",
                new Object[]{label, currentDemand, availableCapacity, snapshot.getOnlineExecutors(),
                        snapshot.getConnectingExecutors(), snapshot.getBusyExecutors(), snapshot.getQueueLength()});

This causes the check to find the actual number of nodes that are either already provisioned or in the process of being provisioned, and to compute the available capacity without busy executors. We only use 1 executor per node, so this may not fly if you're using multiple executors (I haven't tested).

This seems to keep our nodes provisioning and scaling down regularly, and I'm happy with it for our purposes, now.

I don't have branch push permissions on this repo, so I can't push my changes up for a pull request, unfortunately :(

@xocasdashdash

@mtn-boblloyd can't you open a PR from your fork?

@pdk27
Collaborator

pdk27 commented May 24, 2023

@mtn-boblloyd Thanks so much for your PR. I am also working on a fix for the currentDemand computation. I will get in touch about it soon.


@mwos-sl Apologies for the delay. We have been short-staffed since the developers who maintained this plugin moved on.

I have been looking into reproducing the scenarios detailed here and in various previous (related) issues in order to get the full context. I wasn't able to reproduce it intentionally, but I did come across the scenario below after running the plugin for many hours:

  • 1 (sometimes more than 1 🤷‍♀️ ) active/connected node takes incoming builds and runs them successfully
  • As the queue gets bigger, new instances/nodes are launched, but Jenkins fails to connect to them; they get suspended for a while and are eventually terminated.
  • Changing plugin configurations like noDelayProvision didn't seem to affect the suspended-instances situation

Some observations:

  1. Agent/Slave logs on the active node showed a LinkageError (stacktrace screenshot omitted) with NO impact on the node's ability to take on new builds.

  2. After the LinkageError is logged, subsequent node connections (some of which were in the process of establishing a connection) just hang, according to their logs. Logging on the plugin side showed exceptions like java.lang.IllegalStateException: Failed to provision node. Could not connect to node 'i- XXXXX' before timeout or java.io.IOException: Agent failed to connect, even though the launcher didn't report it. See the log output for details., after which a new instance was launched to keep up with the demand, and this cycle repeated.

  3. The logs for suspended nodes would hang at different stages. So I made sure that I was able to SSH into the new EC2 instances and also checked the Java version - this was not a problem.

Explanation:

This is the only explanation I could think of. Please share thoughts if any. The combination of things below seems to be the problem.

  1. Jenkins being unable to connect to new capacity +
  2. Extended time during which connections are attempted without luck (depends on various configurations like number of retries, wait time between retries, timeout, etc.) +
  3. plannedCapacitySnapshot being decremented ONLY after the connection to the node is established or times out (code) +
  4. Available capacity computation double counting executors for such (connecting) nodes via connectingExecutors (because of this the availableCapacity could differ wildly from reality, especially when a huge number of nodes are affected) - see the sketch after this list
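
A minimal sketch of how points 3 and 4 can combine, under the assumption described above that an agent which is still connecting is counted both in connectingExecutors and (via its unresolved planned-node future) in plannedCapacitySnapshot; the names are illustrative, this is not the plugin's code:

// Illustrative only: one agent that is still connecting is counted twice,
// once via connectingExecutors and once via plannedCapacitySnapshot,
// because its planned-node future has not been resolved yet (point 3).
public class DoubleCountSketch {
    public static void main(String[] args) {
        int queueLength = 3;                             // 3 builds waiting
        int connectingAgents = 3;                        // 3 agents Jenkins is still trying to connect to

        int availableExecutors = 0;
        int connectingExecutors = connectingAgents;      // counted once here...
        int plannedCapacitySnapshot = connectingAgents;  // ...and again here, futures still pending
        int additionalPlannedCapacity = 0;

        int availableCapacity = availableExecutors + connectingExecutors
                + plannedCapacitySnapshot + additionalPlannedCapacity;   // 6, although only 3 agents exist
        int currentDemand = queueLength - availableCapacity;             // 3 - 6 = -3, so no new provisioning

        System.out.printf("availableCapacity=%d currentDemand=%d%n", availableCapacity, currentDemand);
    }
}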

The fix that seems to have worked for me:

The LinkageError was fixed in Jenkins 2.277.2. I upgraded to Jenkins 2.277.2 and I have not come across the suspended-nodes situation since.

@mwos-sl It looks like you are running a more recent version of Jenkins than 2.277.2? Have you seen other reasons why Jenkins might not be able to connect to the nodes? Is there a way for you to test with Jenkins version 2.277.2?
FYI, the nodes that are suspended have shouldAcceptTasks set to false, which happens when the plugin attempts to terminate the instance (code).

I am working on fixing the computations.
In the meantime, we will be upgrading the minimum Jenkins version for the plugin.

@ikaakkola

ikaakkola commented May 26, 2023

To clear plannedCapacitySnapshot (technically these are PlannedNode instances in the NodeProvisioner, here), you can use a Groovy script through the Script Console.

Once you do this, scaling will work again for a while, until you accumulate enough stuck "pending launches", causing plannedCapacitySnapshot to grow. When plannedCapacitySnapshot grows, the plugin will only scale up when there are more items in the queue than there are stuck "pending launches".

List and optionally delete pending launches for all labels

// If set to true, all pending launches are deleted
def deletePending = false;

for (label in Jenkins.instance.getLabels()) {
  println("Label " + label.name);
  println("   Load statistics: " + label.loadStatistics.computeSnapshot());
  println("   Nodeprovisioner state: " + label.nodeProvisioner.provisioningState);
  println("   Nodeprovisioner pending launches:");
  for (launch in label.nodeProvisioner.pendingLaunches) {
    println("      pending launch: " + launch.displayName + ", hashCode=" + launch.hashCode());
    if (deletePending) {
      println("      -> terminating!");
      launch.future.setException(new java.lang.RuntimeException("Terminating pending launch!"));
    } 
  }
  println("");
}

I have not been able to figure out what is causing these stuck pending launches, but clearly there is some code path where a PlannedNode is created but the Future for it is never completed.

@pdk27
Collaborator

pdk27 commented May 26, 2023

@ikaakkola That's correct! It makes sense to include planned nodes in the available capacity to avoid over-provisioning.

The plugin (specifically this class) controls when the planned nodes' futures are resolved - after Jenkins is able to connect to the node successfully, or when that attempt times out.
In the scenario I described above, I saw that Jenkins would connect to nodes successfully until the LinkageError occurred, after which subsequent attempts to connect to new nodes would just hang and time out. Hence this PR to upgrade the minimum Jenkins version.
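
As a plain-Java illustration of that lifecycle (no Jenkins APIs; all names are made up for the example), a planned slot is held until the agent's future resolves, either through a successful connection or through a timeout:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a planned slot is reserved when provisioning starts and is
// released only when the agent's future completes - on successful connection or,
// as here, when the attempt times out because the agent never connects.
public class PlannedSlotSketch {
    public static void main(String[] args) throws Exception {
        AtomicInteger plannedCapacity = new AtomicInteger(0);

        CompletableFuture<String> agentConnected = new CompletableFuture<>();
        plannedCapacity.incrementAndGet();                      // slot reserved when provisioning starts

        agentConnected.orTimeout(2, TimeUnit.SECONDS)           // give up if the agent never connects
                .whenComplete((node, error) -> plannedCapacity.decrementAndGet());  // released either way

        // Nothing ever completes agentConnected here, simulating an agent that hangs,
        // so the slot stays reserved for the whole timeout window.
        System.out.println("planned while connecting: " + plannedCapacity.get());   // 1
        Thread.sleep(2500);
        System.out.println("planned after timeout:    " + plannedCapacity.get());   // 0
    }
}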

What version of Jenkins are you running? Do you see any errors in agent logs?

@ikaakkola

@pdk27 We are currently on ancient versions, hence I did not add my findings here. We are waiting to update to the latest versions of Jenkins and the plugin; if there still are 'pending launches' that never get completed (either done, cancelled, or failed with an exception), I'll dig deeper. I just wanted to share the workaround we currently use (clearing the pending launches manually every now and then), which doesn't need a Jenkins restart.

(For the record, we are not seeing LinkageErrors)

pdk27 added a commit that referenced this issue Jun 1, 2023
* [Fix] Fix computation of excess workload and available capacity

#322

#359

* Update src/main/java/com/amazon/jenkins/ec2fleet/NoDelayProvisionStrategy.java

Co-authored-by: Jerad C <jeradc@amazon.com>

@pdk27
Collaborator

pdk27 commented Jun 1, 2023

Opened a discussion for release 2.6.0, which includes some fixes and other changes; since upgrading I don't see the lag in provisioning in my environment. Please share details relevant to the release in the discussion.

@pdk27 pdk27 removed the stalebot-ignore To NOT let the stalebot update or close the Issue / PR label Jun 2, 2023
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 27, 2023
[fix] Fix maxtotaluses decrement logic

add logs in post job action to expose tasks terminated with problems

jenkinsci#322

add and fix tests
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jun 28, 2023
[fix] Fix maxtotaluses decrement logic

add logs in post job action to expose tasks terminated with problems

jenkinsci#322

add and fix tests
@pdk27 pdk27 mentioned this issue Jun 28, 2023
pdk27 added a commit that referenced this issue Jun 28, 2023
* [fix] Terminate scheduled instances ONLY IF idle

#363

* [fix] leave maxTotalUses alone and track remainingUses correctly

add a flag to track termination of agents by plugin

* [fix] Fix lost state (instanceIdsToTerminate) on configuration change

[fix] Fix maxtotaluses decrement logic

add logs in post job action to expose tasks terminated with problems

#322

add and fix tests

* add integration tests for configuration change leading to lost state and rebuilding lost state to terminate instances previously marked for termination
@pdk27 pdk27 reopened this Jun 28, 2023
@pdk27
Collaborator

pdk27 commented Jun 28, 2023

@wosiu I was finally able to reproduce this issue and here is what I think is happening:

  1. Your maxTotalUses is set to 50; let's say your instances finish 50 builds and are scheduled for termination, i.e. suspended.
  2. A cloud config change is initiated, leading to:
    • recreation of the EC2FleetCloud object - hence losing state like instanceIdsToTerminate (problem#1; see the sketch below)
  3. EC2RetentionStrategy checks for idle instances to terminate, but this doesn't terminate the suspended instances because:
  4. suspended nodes remain hanging
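
A minimal sketch of problem#1 (hypothetical classes, not the plugin's actual code): bookkeeping held in memory on the cloud object is gone once the configuration change replaces that object:

import java.util.HashSet;
import java.util.Set;

// Illustrative only: state kept in a field of the cloud object is lost when a
// configuration change creates a fresh object, so instances previously marked
// for termination are forgotten.
public class LostStateSketch {
    static class FleetCloud {
        final Set<String> instanceIdsToTerminate = new HashSet<>();
    }

    public static void main(String[] args) {
        FleetCloud cloud = new FleetCloud();
        cloud.instanceIdsToTerminate.add("i-0aaaaaaaaaaaaaaaa");  // suspended after hitting maxTotalUses

        cloud = new FleetCloud();                                 // config reload recreates the cloud object
        System.out.println(cloud.instanceIdsToTerminate);         // [] - the suspended instance is forgotten
    }
}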

Fixes: see the commits referenced below.

pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jul 5, 2023
… tracking of cloud objects

[fix] Remove plannedNodeScheduledFutures

[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type
[refactor] Dont provision if Jenkins is quieting down and terminating
[refactor] Replace more occurences of 'slave' with 'agent'

jenkinsci#360

jenkinsci#322
pdk27 added a commit to pdk27/ec2-fleet-plugin that referenced this issue Jul 11, 2023
… tracking of cloud instance

[refactor] Added instanceId to FleetNode for clarity, added getDescriptor to return sub type

[refactor] Dont provision if Jenkins is quieting down and terminating

Fix jenkinsci#360
Fix jenkinsci#322
@pdk27 pdk27 closed this as completed in a0d67cf Jul 17, 2023