Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very often get two Fargate tasks when only one agent is needed #311

Open
helloPiers opened this issue Mar 13, 2023 · 3 comments
Open

Very often get two Fargate tasks when only one agent is needed #311

helloPiers opened this issue Mar 13, 2023 · 3 comments
Labels

Comments

@helloPiers
Copy link

Jenkins and plugins versions report

Environment
Jenkins: 2.394
OS: Linux - 5.10.167-147.601.amzn2.x86_64
Java: 17.0.6 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
amazon-ecs:1.47
apache-httpcomponents-client-4-api:4.5.14-150.v7a_b_9d17134a_5
aws-credentials:191.vcb_f183ce58b_9
aws-java-sdk-ec2:1.12.406-370.v8f993c987059
aws-java-sdk-ecs:1.12.406-370.v8f993c987059
aws-java-sdk-efs:1.12.406-370.v8f993c987059
aws-java-sdk-minimal:1.12.406-370.v8f993c987059
bootstrap5-api:5.2.1-3
bouncycastle-api:2.27
branch-api:2.1071.v1a_188a_562481
caffeine-api:2.9.3-65.v6a_47d0f4d1fe
checks-api:1.8.1
cloudbees-folder:6.815.v0dd5a_cb_40e0e
command-launcher:90.v669d7ccb_7c31
commons-lang3-api:3.12.0-36.vd97de6465d5b_
commons-text-api:1.10.0-36.vc008c8fcda_7b_
configuration-as-code:1569.vb_72405b_80249
configurationslicing:430.v966357576543
credentials:1214.v1de940103927
credentials-binding:523.vd859a_4b_122e6
data-tables-api:1.12.1-4
display-url-api:2.3.7
durable-task:504.vb10d1ae5ba2f
echarts-api:5.4.0-1
font-awesome-api:6.2.1-1
git:5.0.0
git-client:4.1.0
instance-identity:142.v04572ca_5b_265
ionicons-api:31.v4757b_6987003
jackson2-api:2.14.2-319.v37853346a_229
jakarta-activation-api:2.0.1-3
jakarta-mail-api:2.0.1-3
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.8-1
jdk-tool:63.v62d2fd4b_4793
jquery3-api:3.6.1-2
junit:1177.v90374a_ef4d09
lockable-resources:1131.vb_7c3d377e723
mailer:448.v5b_97805e3767
matrix-project:785.v06b_7f47b_c631
metrics:4.2.13-420.vea_2f17932dd6
mina-sshd-api-common:2.9.2-50.va_0e1f42659a_a
mina-sshd-api-core:2.9.2-50.va_0e1f42659a_a
next-build-number:1.8
pipeline-build-step:486.vd08f550cceee
pipeline-graph-analysis:202.va_d268e64deb_3
pipeline-graph-view:175.vb_1a_b_b_cd0fb_86
pipeline-groovy-lib:629.vb_5627b_ee2104
pipeline-input-step:466.v6d0a_5df34f81
pipeline-milestone-step:111.v449306f708b_7
pipeline-model-api:2.2118.v31fd5b_9944b_5
pipeline-model-definition:2.2118.v31fd5b_9944b_5
pipeline-model-extensions:2.2118.v31fd5b_9944b_5
pipeline-rest-api:2.31
pipeline-stage-step:305.ve96d0205c1c6
pipeline-stage-tags-metadata:2.2118.v31fd5b_9944b_5
pipeline-stage-view:2.31
plain-credentials:143.v1b_df8b_d3b_e48
plugin-util-api:2.20.0
popper2-api:2.11.6-2
scm-api:631.v9143df5b_e4a_a
script-security:1229.v4880b_b_e905a_6
snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4
ssh-credentials:305.v8f4381501156
sshd:3.275.v9e17c10f2571
structs:324.va_f5d6774f3a_d
trilead-api:2.84.v72119de229b_7
variant:59.vf075fe829ccb
workflow-aggregator:596.v8c21c963d92d
workflow-api:1208.v0cc7c6e0da_9e
workflow-basic-steps:1010.vf7a_b_98e847c1
workflow-cps:3641.vf58904a_b_b_5d8
workflow-durable-task-step:1234.v019404b_3832a
workflow-job:1282.ve6d865025906
workflow-multibranch:733.v109046189126
workflow-scm-step:400.v6b_89a_1317c9a_
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:839.v35e2736cfd5c

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux in Fargate containers, using images based on jenkins/jenkins:alpine-jdk17 and jenkins/inbound-agent:jdk17.

Reproduction steps

  1. Wait until no ECS Cloud agents are running
  2. Start a job that needs a agent (main node does not run jobs)

Expected Results

A new Fargate Task for a new Agent node started up, does the Job.

Actual Results

A new Fargate Task is started by the ECS Cloud and runs the Job. Meanwhile a second, identical Fargate Task starts and does nothing as there are no other jobs to run.

After the idle period both are stopped.

Anything else?

I've tried adjusting various timeouts and other values such as poll times for Tasks, to no avail.

It seems like the Task is not considered started quickly enough to avoid Jenkins asking again for a provision?

There are two c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic entries - here are what look like the important logs extracted from the full log included below:

2023-03-13 17:13:52.645+0000 [id=41]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic
2023-03-13 17:14:43.191+0000 [id=6870]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #16 from /10.28.137.77:48718

2023-03-13 17:14:52.645+0000 [id=29]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic

2023-03-13 17:15:18.407+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: Task started, waiting for agent to become online
2023-03-13 17:15:18.407+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-211x7]: Agent connected
2023-03-13 17:15:18.459+0000 [id=6856]	INFO	c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobName: FreeStyle1


2023-03-13 17:16:15.305+0000 [id=6901]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #18 from /10.28.107.11:55180
2023-03-13 17:16:18.565+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: Task started, waiting for agent to become online
2023-03-13 17:16:18.565+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-zfqhk]: Agent connected



Node fargate-agent-cloud-generic-211x7 (first to launch) carried out the job.

Full controller node log
2023-03-13 17:13:52.645+0000 [id=41]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic
2023-03-13 17:13:52.645+0000 [id=41]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision fargate-agent-cloud-generic-211x7, for label: generic
2023-03-13 17:14:02.682+0000 [id=40]	INFO	hudson.slaves.NodeProvisioner#update: fargate-agent-cloud-generic-211x7 provisioning successfully completed. We have now 2 computer(s)
2023-03-13 17:14:02.721+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-211x7]: Starting agent with task definition arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16}
2023-03-13 17:14:03.293+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-211x7]: Agent started with task arn : arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b
2023-03-13 17:14:03.293+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: TaskArn: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b
2023-03-13 17:14:03.293+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: TaskDefinitionArn: arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16
2023-03-13 17:14:03.293+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: ClusterArn: arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test
2023-03-13 17:14:03.293+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: ContainerInstanceArn: null
2023-03-13 17:14:43.110+0000 [id=6869]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #15 from /10.28.137.77:48708 failed: null
2023-03-13 17:14:43.191+0000 [id=6870]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #16 from /10.28.137.77:48718
2023-03-13 17:14:52.645+0000 [id=29]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic
2023-03-13 17:14:52.645+0000 [id=29]	INFO	c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision fargate-agent-cloud-generic-zfqhk, for label: generic
2023-03-13 17:15:02.679+0000 [id=38]	INFO	hudson.slaves.NodeProvisioner#update: fargate-agent-cloud-generic-zfqhk provisioning successfully completed. We have now 3 computer(s)
2023-03-13 17:15:02.715+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-zfqhk]: Starting agent with task definition arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16}
2023-03-13 17:15:03.448+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-zfqhk]: Agent started with task arn : arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817
2023-03-13 17:15:03.448+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: TaskArn: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817
2023-03-13 17:15:03.448+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: TaskDefinitionArn: arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16
2023-03-13 17:15:03.448+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: ClusterArn: arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test
2023-03-13 17:15:03.448+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: ContainerInstanceArn: null
2023-03-13 17:15:18.407+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: Task started, waiting for agent to become online
2023-03-13 17:15:18.407+0000 [id=6855]	INFO	c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-211x7]: Agent connected
2023-03-13 17:15:18.459+0000 [id=6856]	INFO	c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobName: FreeStyle1
2023-03-13 17:15:18.459+0000 [id=6856]	INFO	c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobUrl: job/FreeStyle1/
2023-03-13 17:16:15.185+0000 [id=6900]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #17 from /10.28.107.11:55170 failed: null
2023-03-13 17:16:15.305+0000 [id=6901]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #18 from /10.28.107.11:55180
2023-03-13 17:16:18.565+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: Task started, waiting for agent to become online
2023-03-13 17:16:18.565+0000 [id=6874]	INFO	c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-zfqhk]: Agent connected
2023-03-13 17:20:56.862+0000 [id=35]	INFO	h.slaves.CloudRetentionStrategy#check: Disconnecting fargate-agent-cloud-generic-211x7
2023-03-13 17:20:56.862+0000 [id=35]	INFO	c.c.j.plugins.amazonecs.ECSSlave#_terminate: [fargate-agent-cloud-generic-211x7]: Stopping: TaskArn arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b, ClusterArn arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test
2023-03-13 17:20:56.866+0000 [id=35]	INFO	c.c.j.p.amazonecs.ECSService#stopTask: Delete ECS agent task: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b
2023-03-13 17:20:56.928+0000 [id=35]	INFO	h.slaves.CloudRetentionStrategy#check: Disconnecting fargate-agent-cloud-generic-zfqhk
2023-03-13 17:20:56.928+0000 [id=35]	INFO	c.c.j.plugins.amazonecs.ECSSlave#_terminate: [fargate-agent-cloud-generic-zfqhk]: Stopping: TaskArn arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817, ClusterArn arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test
2023-03-13 17:20:56.929+0000 [id=6948]	INFO	j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#62] for fargate-agent-cloud-generic-211x7 terminated: java.nio.channels.ClosedChannelException
2023-03-13 17:20:56.930+0000 [id=35]	INFO	c.c.j.p.amazonecs.ECSService#stopTask: Delete ECS agent task: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817
2023-03-13 17:20:56.985+0000 [id=6949]	INFO	j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#63] for fargate-agent-cloud-generic-zfqhk terminated: java.nio.channels.ClosedChannelException
@helloPiers helloPiers added the bug label Mar 13, 2023
@maiconbaum
Copy link

Same problem here.

@maiconbaum
Copy link

maiconbaum commented Aug 22, 2023

@helloPiers

Kinda solve this increasing the initialDelay of NodeProvisioner:

{
          name  = "JAVA_OPTS"
          value = "-Dhudson.slaves.NodeProvisioner.initialDelay=60 -Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85"
}

This setting increases the node provisioner delay typically to 600 seconds. Depending on hudson.model.LoadStatistics.clock.

Refer to this link if it helps. Official "documentation" can be found here.

@helloPiers
Copy link
Author

Thanks, that's interesting - I hadn't delved that deep into the internal configuration before.

I ended up writing my own plugin, which of course is much easier to do for a single use case that I define for myself, than for a general tool for a mass audience - I don't underestimate the effort of maintaining a public plugin.

(Aside: even making my own simple plugin for Fargate would have been almost impossible given the state of the plugin development documentation, without using this plugin as inspiration.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants