JENKINS-45295 Swarm Client should update labels on the fly when labelsFile changes #110
## Problem

In JENKINS-45295 and JENKINS-48247, users have reported that when the contents of the file specified by `-labelsFile` are changed, the client is blindly restarted, interrupting any running jobs. Needless to say, this is disruptive.

## Evaluation

`LabelFileWatcher` has two modes of operation: "soft" updates (which attempt to modify the labels in place) and "hard" updates (which restart the client, interrupting any running jobs). A soft update is first tried, followed by a hard update if the soft update is unsuccessful. I verified experimentally that after changing the labels file, a soft update was attempted and failed. Then the hard update was done, which restarted the client.

Why did the soft update fail? I stepped through the failure and saw that the soft update process made a call to the backend, but the backend returned a 404, claiming that a node by that name could not be found. Looking into this further, I saw that the node created on the backend had a hash appended to its name, while the node name being provided by the client did not have this hash. This provided the root cause: we were setting up the label watcher before the node had been named by the server. The requested name was not actually used by the server, but the label watcher thread was still trying to use it when communicating with the server.
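The soft-then-hard fallback described above can be sketched as follows. This is a simplified illustration, not the actual `LabelFileWatcher` code; the `Client` interface and method names are hypothetical stand-ins for the real client/server calls.

```java
public class LabelUpdate {
    // Hypothetical abstraction of the two update modes: a "soft" update asks
    // the server to change the node's labels in place (and can fail, e.g.
    // with a 404 if the node name is wrong); a "hard" update restarts the
    // client, interrupting any running jobs.
    interface Client {
        boolean softUpdateLabels(String nodeName, String newLabels);
        void hardRestart(String newLabels);
    }

    static void applyLabelChange(Client c, String nodeName, String newLabels) {
        if (!c.softUpdateLabels(nodeName, newLabels)) {
            // Before the fix, nodeName was the name the client *requested*,
            // not the one the server assigned, so the soft update always
            // 404'd and we always fell through to a disruptive restart.
            c.hardRestart(newLabels);
        }
    }

    public static void main(String[] args) {
        applyLabelChange(new Client() {
            public boolean softUpdateLabels(String n, String l) {
                System.out.println("soft update of " + n + " -> " + l);
                return true; // succeeded; no restart needed
            }
            public void hardRestart(String l) {
                System.out.println("hard restart with labels " + l);
            }
        }, "agent-deadbeef", "linux docker");
    }
}
```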
Incidentally, the hash is only appended to the node name when `-disableClientsUniqueId` is not provided, and I was able to work around the issue by passing in `-disableClientsUniqueId`.

## Solution
Create the node on the server before creating the label watcher thread, and use the name returned by the server when setting up the label watcher thread.
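A minimal sketch of the reordering, with hypothetical method names standing in for the real swarm-client calls (the actual API differs; the only point is the order of operations and which name the watcher receives):

```java
public class StartupOrder {
    // Hypothetical stand-in for creating the node on the server. The server
    // may decorate the requested name with a unique hash suffix, so the name
    // it returns is the one that must be used in later calls.
    static String createNodeOnServer(String requestedName) {
        return requestedName + "-" + Integer.toHexString(requestedName.hashCode());
    }

    // Hypothetical stand-in for starting the LabelFileWatcher thread.
    static Thread startLabelWatcher(String actualNodeName) {
        Thread t = new Thread(
                () -> System.out.println("watching labels for " + actualNodeName),
                "LabelFileWatcher");
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Fix: create the node first, then hand the *server-assigned* name
        // to the watcher thread, instead of the name we requested.
        String actual = createNodeOnServer("agent");
        startLabelWatcher(actual).join();
    }
}
```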
## Implementation
The implementation was pretty straightforward. I had to wean our test framework off `-disableClientsUniqueId` in order to be able to test this. I started by removing `-disableClientsUniqueId` from `TestUtils`, which entailed writing a custom `getComputer` function to iterate through nodes with a possible hash in the name. Once that was done, `PipelineTest` was no longer using `-disableClientsUniqueId`, but its tests that restart the agent still needed it, so I added the flag back to just those tests. Finally, I removed `-disableClientsUniqueId` from the label tests.
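The lookup logic that `getComputer` needs can be sketched like this. This is an illustrative, self-contained version, not the plugin's code: the hash format (a hyphen followed by hex digits) is an assumption, and the real helper iterates Jenkins `Computer` objects rather than plain strings.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

public class ComputerLookup {
    // Find the node whose name is either exactly `base` (agent started with
    // -disableClientsUniqueId) or `base` plus a hash suffix such as
    // "agent-deadbeef" (assumed format: hyphen plus hex digits).
    static Optional<String> findComputerName(List<String> allNames, String base) {
        Pattern withHash = Pattern.compile(Pattern.quote(base) + "-[0-9a-f]+");
        return allNames.stream()
                .filter(n -> n.equals(base) || withHash.matcher(n).matches())
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("other", "agent-deadbeef");
        System.out.println(findComputerName(nodes, "agent").orElse("none"));
    }
}
```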