
[JENKINS-49707] If a KubernetesComputer disconnects, remove the KubernetesSlave #461

Merged (23 commits), Jul 10, 2019

Conversation

jglick (Member) commented Apr 30, 2019

Downstream of jenkinsci/workflow-durable-task-step-plugin#104. If the pod has disconnected, I am assuming it will not be able to reconnect, and the agent state is lost—true? If so, there is no point in letting any running sh step continue: it should fail now.

carlossg (Contributor) left a comment:

LGTM

jglick (Member, Author) commented May 1, 2019

Seems RestartPipelineTest is broken by these changes. Investigating.

jglick added 3 commits May 1, 2019 09:06

- While this plugin had tried to turn off the rerun feature, what it actually did was make it so that failing tests are rerun once—i.e., run twice before finally failing. This happens even when -Dtest=… is specified, which disables the profile in the parent POM, annoyingly making all failing local tests run twice.
- …rely due to a Remoting channel being disconnected. Among other things, this fixes failures in RestartPipelineTest. (Perhaps attributable to org.jvnet.hudson.test.ChannelShutdownListener rather than production behavior, but still.)
jglick (Member, Author) commented May 1, 2019

Hmm, live tests fail because I did not consider this RBAC scenario:

WARNING	o.c.j.p.k.pod.retention.Reaper#activate: failed to set up watcher on kubernetes
io.fabric8.kubernetes.client.KubernetesClientException: pods is forbidden: User "system:serviceaccount:…" cannot watch pods at the cluster scope

I suppose the answer is to watch only within KubernetesCloud.namespace (and also do a startup-time pod listing in that namespace), and not attempt to clean up agents associated with pods in a custom namespace, which I assume is a rare case anyway.
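For reference, the namespace-scoped pod access described above could be granted with RBAC along these lines. This is a hypothetical sketch, not part of this PR; the `jenkins` namespace and service account names are placeholders:

```yaml
# Hypothetical RBAC sketch: grant the controller's service account get/list/watch
# on pods only within one namespace, avoiding the cluster-scope "forbidden" error above.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-watcher
  namespace: jenkins            # placeholder: the KubernetesCloud namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-watcher-binding
  namespace: jenkins
subjects:
- kind: ServiceAccount
  name: jenkins                 # placeholder service account
  namespace: jenkins
roleRef:
  kind: Role
  name: pod-watcher
  apiGroup: rbac.authorization.k8s.io
```

With a namespaced Role rather than a ClusterRole, a watch opened via the Kubernetes API against that single namespace succeeds even though a cluster-scope watch is forbidden.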

jglick (Member, Author) commented May 1, 2019

With @schottsfired’s help I got 61307bf tested in a semirealistic environment.

schottsfired (Contributor) commented May 2, 2019

Brilliant, @jglick! Looks good on my end.

jglick (Member, Author) commented May 28, 2019

Perhaps related to #201.

jglick (Member, Author) commented Jun 6, 2019

Tested using AWS Quickstart for CloudBees Core on EKS with a Pipeline targeted to Spot instances. Simulated a termination in various ways. kubectl delete pod:

Cannot contact …: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from … failed. The channel is closing down or has closed down

sudo halt -ff:

java.net.SocketTimeoutException: sent ping but didn't receive pong within 1000ms (after 182 successful ping/pongs)
	at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
	at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Cannot contact …: java.lang.InterruptedException

…and in either case the agent does get removed and the build aborted. 🎉

@jglick jglick marked this pull request as ready for review June 6, 2019 17:47
Review threads on Jenkinsfile and pom.xml (outdated) were resolved.
jglick and others added 2 commits July 5, 2019 15:43
Co-Authored-By: Devin Nusbaum <dwnusbaum@users.noreply.github.com>
@jglick jglick removed the on hold label Jul 5, 2019
@Vlatombe Vlatombe added the enhancement Improvements label Jul 6, 2019
5 participants