
[JENKINS-49707] If a KubernetesComputer disconnects, remove the KubernetesSlave #461

Merged (23 commits), Jul 10, 2019

Conversation

jglick (Member) commented Apr 30, 2019

Downstream of jenkinsci/workflow-durable-task-step-plugin#104. If the pod has disconnected, I am assuming it will not be able to reconnect, and the agent state is lost—true? If so, there is no point in letting any running sh step continue: it should fail now.

carlossg (Contributor) left a comment:

LGTM

jglick (Member, Author) commented May 1, 2019

Seems RestartPipelineTest is broken by these changes. Investigating.

jglick added 3 commits May 1, 2019 09:06

- While this plugin had tried to turn off the rerun feature, what it actually did was make it so that failing tests are rerun once—i.e., run twice before finally failing. This happens even when -Dtest=… is specified, which disables the profile in the parent POM, annoyingly making all failing local tests run twice.
- …rely due to a Remoting channel being disconnected. Among other things, this fixes failures in RestartPipelineTest. (Perhaps attributable to org.jvnet.hudson.test.ChannelShutdownListener rather than production behavior, but still.)
jglick (Member, Author) commented May 1, 2019

Hmm, live tests fail because I did not consider this RBAC scenario:

WARNING	o.c.j.p.k.pod.retention.Reaper#activate: failed to set up watcher on kubernetes
io.fabric8.kubernetes.client.KubernetesClientException: pods is forbidden: User "system:serviceaccount:…" cannot watch pods at the cluster scope

I suppose the answer is to watch only within KubernetesCloud.namespace (and also do a startup-time pod listing in that namespace), and not attempt to clean up agents associated with pods in a custom namespace, which I assume is a rare case anyway.
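For reference, the namespace-scoped pod access described above could be granted with RBAC along these lines. This is a hypothetical sketch, not part of this PR; the `jenkins` namespace and service account names are placeholders:

```yaml
# Hypothetical RBAC sketch: grant the controller's service account get/list/watch
# on pods only within one namespace, avoiding the cluster-scope "forbidden" error above.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-watcher
  namespace: jenkins            # placeholder: the KubernetesCloud namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-watcher-binding
  namespace: jenkins
subjects:
- kind: ServiceAccount
  name: jenkins                 # placeholder service account
  namespace: jenkins
roleRef:
  kind: Role
  name: pod-watcher
  apiGroup: rbac.authorization.k8s.io
```

With a namespaced Role rather than a ClusterRole, a watch opened via the Kubernetes API against that single namespace succeeds even though a cluster-scope watch is forbidden.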

jglick (Member, Author) commented May 1, 2019

With @schottsfired’s help I got 61307bf tested in a semirealistic environment.

schottsfired (Contributor) commented May 2, 2019

Brilliant, @jglick! Looks good on my end.

jglick (Member, Author) commented May 28, 2019

Perhaps related to #201.

jglick (Member, Author) commented Jun 6, 2019

Tested using AWS Quickstart for CloudBees Core on EKS with a Pipeline targeted to Spot instances. Simulated a termination in various ways. kubectl delete pod:

Cannot contact …: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from … failed. The channel is closing down or has closed down

sudo halt -ff:

java.net.SocketTimeoutException: sent ping but didn't receive pong within 1000ms (after 182 successful ping/pongs)
	at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
	at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Cannot contact …: java.lang.InterruptedException

…and in either case the agent does get removed and the build aborted. 🎉

@jglick jglick marked this pull request as ready for review June 6, 2019 17:47
Review threads on Jenkinsfile and pom.xml (outdated) were resolved.
jglick and others added 2 commits July 5, 2019 15:43
Co-Authored-By: Devin Nusbaum <dwnusbaum@users.noreply.github.com>
@jglick jglick removed the on hold label Jul 5, 2019
@Vlatombe Vlatombe added the enhancement Improvements label Jul 6, 2019
5 participants