Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collection of orphan pods #1543

Merged
merged 7 commits into from
May 7, 2024

Conversation

Vlatombe
Copy link
Member

@Vlatombe Vlatombe commented May 6, 2024

  • Provide a new configuration section to garbage collect orphan pods

  • All kubernetes agents are regularly annotated with the current timestamp.

  • Then look at agents managed by the current instance, and look at pods that have not been refreshed for the timeout defined in the configuration section (2 minutes minimum).

  • By default, the default namespace for each Kubernetes Cloud is monitored. If there are pods using different namespaces, these need to be configured accordingly in order to be picked up by the garbage collection logic.

Testing done

Submitter checklist

Edit tasklist title
Beta Give feedback Tasklist Submitter checklist, more options

Delete tasklist

Delete tasklist block?
Are you sure? All relationships in this tasklist will be removed.
  1. Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
    Options
  2. Ensure that the pull request title represents the desired changelog entry
    Options
  3. Please describe what you did
    Options
  4. Link to relevant issues in GitHub or Jira
    Options
  5. Link to relevant pull requests, esp. upstream and downstream changes
    Options
  6. Ensure you have provided tests - that demonstrates feature works or fixes the issue
    Options
Loading

* Provide a new configuration section to garbage collect orphan pods

* All kubernetes agents are regularly annotated with the current timestamp.

* Then look at agents managed by the current instance, and look at pods that have not been refreshed for the timeout defined in the configuration section (2 minutes minimum).

* By default, the default namespace for each Kubernetes Cloud is monitored. If there are pods using different namespaces, these need to be configured accordingly in order to be picked up by the garbage collection logic.
@Vlatombe Vlatombe added the enhancement Improvements label May 6, 2024
@Vlatombe Vlatombe requested a review from a team as a code owner May 6, 2024 14:23
Comment on lines 877 to 882
r.jenkins.removeNode(node);
break;
}
}
// Build is marked as failed because the agent has vanished
r.assertBuildStatus(Result.FAILURE, r.waitForCompletion(b));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I guess but not very realistic.

Maybe you can doKill the build to simulate a loss of regular cleanup from podTemplate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doKill cancels the build, but that leaves the node running.

Copy link
Member

@jglick jglick May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is also something that ought to be fixed I think. Rather than refreshing the timestamp when a node is online, can we do so only when it is busy? Or is there some use case for a long-lived idle K8s agent?

@@ -538,6 +543,51 @@ public static Builder builder() {
return new Builder();
}

public void annotateTtl(TaskListener listener) {
var kubernetesCloud = getKubernetesCloud();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raises an IllegalStateException if so, which is caught in order to avoid breaking the whole loop.

@Vlatombe Vlatombe requested a review from jglick May 6, 2024 16:04
return Duration.between(
Instant.ofEpochMilli(refreshTime),
Instant.now())
Level.FINE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW you can use Logger.fine with a Supplier which is a bit more concise. (Not, alas, with a Throwable.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not, alas, with a Throwable

Indeed, this is why I use this form for consistency.

Comment on lines +47 to +48
private static Long RECURRENCE_PERIOD = SystemProperties.getLong(
GarbageCollection.class.getName() + ".recurrencePeriod", TimeUnit.MINUTES.toSeconds(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have SystemProperties.getDuration that would parse 250ms, 5s, 10m, etc.

Copy link
Member Author

@Vlatombe Vlatombe May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, too bad Duration.parse takes ISO-8601 format which is not practical.

@Vlatombe Vlatombe enabled auto-merge (squash) May 7, 2024 07:28
@Vlatombe Vlatombe merged commit f10083a into jenkinsci:master May 7, 2024
6 checks passed
@Vlatombe Vlatombe deleted the feat/garbage-collection branch May 7, 2024 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements
Projects
None yet
2 participants