JENKINS-45295 Swarm Client should update labels on the fly when labelsFile changes #110
## Problem

In JENKINS-45295 and JENKINS-48247, users have reported that when the contents of the file specified by `-labelsFile` are changed, the client is blindly restarted, interrupting any running jobs. Needless to say, this is disruptive.

## Evaluation

`LabelFileWatcher` has two modes of operation: "soft" updates (which attempt to modify the labels in place) and "hard" updates (which restart the client, interrupting any running jobs). A soft update is first tried, followed by a hard update if the soft update is unsuccessful. I verified experimentally that after changing the labels file, a soft update was attempted and failed. Then the hard update was done, which restarted the client.

Why did the soft update fail? I stepped through the failure and saw that the soft update process made a call to the backend, but the backend returned a 404, claiming that a node by that name could not be found. Looking into this further, I saw that the node created on the backend had a hash appended to its name, while the node name being provided by the client did not have this hash. This provided the root cause: we were setting up the label watcher before the node had been named by the server. The requested name was not actually used by the server, but the label watcher thread was still trying to use it when communicating with the server.
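The soft-then-hard fallback described above can be sketched as follows. This is a simplified illustration, not the actual `LabelFileWatcher` code; the `Client` interface and method names are hypothetical stand-ins for the real client/server calls.

```java
public class LabelUpdate {
    // Hypothetical abstraction of the two update modes: a "soft" update asks
    // the server to change the node's labels in place (and can fail, e.g.
    // with a 404 if the node name is wrong); a "hard" update restarts the
    // client, interrupting any running jobs.
    interface Client {
        boolean softUpdateLabels(String nodeName, String newLabels);
        void hardRestart(String newLabels);
    }

    static void applyLabelChange(Client c, String nodeName, String newLabels) {
        if (!c.softUpdateLabels(nodeName, newLabels)) {
            // Before the fix, nodeName was the name the client *requested*,
            // not the one the server assigned, so the soft update always
            // 404'd and we always fell through to a disruptive restart.
            c.hardRestart(newLabels);
        }
    }

    public static void main(String[] args) {
        applyLabelChange(new Client() {
            public boolean softUpdateLabels(String n, String l) {
                System.out.println("soft update of " + n + " -> " + l);
                return true; // succeeded; no restart needed
            }
            public void hardRestart(String l) {
                System.out.println("hard restart with labels " + l);
            }
        }, "agent-deadbeef", "linux docker");
    }
}
```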
Incidentally, the hash is only appended to the node name when `-disableClientsUniqueId` is not provided, and I was able to work around the issue by passing in `-disableClientsUniqueId`.

## Solution
Create the node on the server before creating the label watcher thread, and use the name returned by the server when setting up the label watcher thread.
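A minimal sketch of the reordering, with hypothetical method names standing in for the real swarm-client calls (the actual API differs; the only point is the order of operations and which name the watcher receives):

```java
public class StartupOrder {
    // Hypothetical stand-in for creating the node on the server. The server
    // may decorate the requested name with a unique hash suffix, so the name
    // it returns is the one that must be used in later calls.
    static String createNodeOnServer(String requestedName) {
        return requestedName + "-" + Integer.toHexString(requestedName.hashCode());
    }

    // Hypothetical stand-in for starting the LabelFileWatcher thread.
    static Thread startLabelWatcher(String actualNodeName) {
        Thread t = new Thread(
                () -> System.out.println("watching labels for " + actualNodeName),
                "LabelFileWatcher");
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Fix: create the node first, then hand the *server-assigned* name
        // to the watcher thread, instead of the name we requested.
        String actual = createNodeOnServer("agent");
        startLabelWatcher(actual).join();
    }
}
```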
## Implementation
The implementation was pretty straightforward. I had to wean our test framework off `-disableClientsUniqueId` in order to be able to test this. I started by removing `-disableClientsUniqueId` from `TestUtils`, which entailed writing a custom `getComputer` function to iterate through nodes with a possible hash in the name. Once that was done, `PipelineTest` was no longer using `-disableClientsUniqueId`, but its tests that restart the agent still needed it, so I added the flag back to just those tests. Finally, I removed `-disableClientsUniqueId` from the label tests.
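The lookup logic that `getComputer` needs can be sketched like this. This is an illustrative, self-contained version, not the plugin's code: the hash format (a hyphen followed by hex digits) is an assumption, and the real helper iterates Jenkins `Computer` objects rather than plain strings.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

public class ComputerLookup {
    // Find the node whose name is either exactly `base` (agent started with
    // -disableClientsUniqueId) or `base` plus a hash suffix such as
    // "agent-deadbeef" (assumed format: hyphen plus hex digits).
    static Optional<String> findComputerName(List<String> allNames, String base) {
        Pattern withHash = Pattern.compile(Pattern.quote(base) + "-[0-9a-f]+");
        return allNames.stream()
                .filter(n -> n.equals(base) || withHash.matcher(n).matches())
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("other", "agent-deadbeef");
        System.out.println(findComputerName(nodes, "agent").orElse("none"));
    }
}
```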