
Fix KubernetesSlave NPE on getting Workspace after Jenkins Master Restart #446

Closed
wants to merge 1 commit into from

Conversation

@jeffersongirao (Contributor) commented Mar 20, 2019

@carlossg Thanks a lot for your work on the kubernetes-plugin, it is greatly appreciated! 🙂 Could you please let me know your thoughts on this?

How to reproduce the issue

Kubernetes-plugin 1.14.9
Jenkins LTS 2.164.1

  • Configure a Kubernetes PodTemplate so that the pod stays up after the job finishes (the idleMinutes setting; see the sketch after these steps)
  • Run a Job that uses that PodTemplate to provision an Agent (this should work fine)
  • Restart the Jenkins Master (for instance by killing the Jenkins process inside the pod)
  • When the Jenkins Master comes back online, the Agent reconnects and is reported as online both in the Agent logs and on the Jenkins master at /computer/<agent_name>/
  • When running a Job that tries to use that Agent, you get an error saying:
ERROR: Issue with creating launcher for agent <agent name>. Computer has been disconnected
Building remotely on <agent name> (k8s)ERROR: <agent name> seems to be offline
Finished: FAILURE
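
For step 1, here is a minimal sketch of a pod template kept alive via idleMinutes, configured programmatically. The names and values are illustrative, and the KubernetesCloud/PodTemplate setters are assumptions based on the plugin's API rather than copied from my setup; the same can of course be done through the cloud configuration UI.

```java
import jenkins.model.Jenkins;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud;
import org.csanchez.jenkins.plugins.kubernetes.PodTemplate;

public class ReproSetup {
    // Step 1: register a pod template that stays up after the job finishes.
    public static void configure() {
        KubernetesCloud cloud = new KubernetesCloud("kubernetes");
        PodTemplate template = new PodTemplate();
        template.setName("k8s-agent");
        template.setLabel("k8s");      // jobs bind the agent through this label
        template.setIdleMinutes(10);   // keep the agent alive after the build finishes
        cloud.addTemplate(template);
        Jenkins.get().clouds.add(cloud);
    }
}
```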

Root cause

When fetching a Workspace for a Job we get a null value, which triggers the error above. Digging into how Jenkins resolves the Workspace root path for slaves: Jenkins looks the Computer up in a Map keyed by Node, and the KubernetesSlave (a Node) is the key for that Map.

When Jenkins restarts and the agent reconnects, that Map is rebuilt and the lookup is done with a new KubernetesSlave instance. The new instance is not equals() to the stored key and has a different hashCode(): they differ on the private template field, because PodTemplate does not implement equals()/hashCode().
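
To make the mechanism concrete, here is a toy illustration using simplified stand-in classes (not the actual Jenkins types): a map keyed by an object whose equality depends on a field without equals()/hashCode() cannot find a re-created key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class LookupFailureDemo {

    /** Stand-in for PodTemplate: no equals()/hashCode(), so identity comparison applies. */
    static class Template {}

    /** Stand-in for KubernetesSlave: equality includes the template field. */
    static class Slave {
        final String name;
        final Template template;

        Slave(String name, Template template) {
            this.name = name;
            this.template = template;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Slave)) return false;
            Slave other = (Slave) o;
            return Objects.equals(name, other.name) && Objects.equals(template, other.template);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, template);
        }
    }

    public static void main(String[] args) {
        Map<Slave, String> computers = new HashMap<>();
        computers.put(new Slave("agent-1", new Template()), "computer-1");

        // After a restart the node is rebuilt with a *new* Template instance, so
        // hashCode()/equals() no longer match the stored key and the lookup misses.
        System.out.println(computers.get(new Slave("agent-1", new Template()))); // prints: null
    }
}
```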

Proposed Fix

Perhaps it is reasonable to assume that agents with the same name are the same agent, independently of the PodTemplate definition. I have changed both equals() and hashCode() on KubernetesSlave accordingly.
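
As a rough sketch of that name-based variant (simplified, meant to live inside KubernetesSlave; not necessarily the exact diff in this PR):

```java
import java.util.Objects;

// Sketch only: equality based on the agent name, ignoring the PodTemplate field.
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof KubernetesSlave)) return false;
    KubernetesSlave that = (KubernetesSlave) o;
    return Objects.equals(getNodeName(), that.getNodeName());
}

@Override
public int hashCode() {
    return Objects.hash(getNodeName());
}
```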

An alternative would be to implement equals()/hashCode() on PodTemplate itself; let me know if you would like me to do it that way instead.

@Vlatombe (Member)

Hi @jeffersongirao it would be great if you could provide a test showing this bug in order to avoid future regressions. It could be added to https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/test/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/RestartPipelineTest.java
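
Something along these lines could serve as a starting point (a minimal sketch assuming the RestartableJenkinsRule API from jenkins-test-harness; the 'k8s' label and job name are placeholders, and a real test would need the Kubernetes fixtures already used in RestartPipelineTest):

```java
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.RestartableJenkinsRule;

public class WorkspaceAfterRestartTest {

    @Rule
    public RestartableJenkinsRule story = new RestartableJenkinsRule();

    @Test
    public void workspaceResolvesAfterRestart() {
        story.then(r -> {
            // First session: run a job on an agent labelled 'k8s' (configured with
            // idleMinutes so the agent survives the build).
            WorkflowJob p = r.jenkins.createProject(WorkflowJob.class, "p");
            p.setDefinition(new CpsFlowDefinition("node('k8s') { echo 'first run' }", true));
            r.buildAndAssertSuccess(p);
        });
        story.then(r -> {
            // Second session: the reconnected agent should still resolve a workspace
            // instead of failing with "Computer has been disconnected".
            WorkflowJob p = r.jenkins.getItemByFullName("p", WorkflowJob.class);
            r.buildAndAssertSuccess(p);
        });
    }
}
```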

@carlossg (Contributor)

Thanks! I think you can have multiple agents with the same name, so equals()/hashCode() on the PodTemplate would be better.

@Vlatombe added the bug label on Jul 6, 2019
@jeffersongirao (Contributor, Author) commented Aug 27, 2019

@carlossg I think I actually need to take back what I said about the possibility of solving this at the PodTemplate level: the comparison happens in a Map of <Node, Computer>, which is up the hierarchy, and the template field is not available there (https://github.com/jenkinsci/jenkins/blob/jenkins-2.164.1/core/src/main/java/hudson/model/AbstractCIBase.java#L176). How do you suggest we proceed?

@jeffersongirao (Contributor, Author)

@Vlatombe I have made an attempt to reproduce it in the tests as you suggested; unfortunately it seems like RestartableJenkinsRule does not behave exactly the same as a real instance. I'm wondering whether it caches references to the registered computers somewhere. Any pointers?

@Vlatombe (Member) commented Dec 23, 2020

If I understand the original report correctly, this should have been fixed as a side-effect of #783 (1.25.6 and later).

@Vlatombe closed this on Dec 23, 2020