
Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations #26971

Merged
pmbrull merged 8 commits into main from task/fix-issue-26772-with-proper-tests-728b027c
Apr 14, 2026

Conversation

Collaborator

@pmbrull pmbrull commented Apr 2, 2026

requires open-metadata/openmetadata-helm-charts#492

Summary

  • Fix exit handler pod creation loop: findExitHandlerPod() was missing the name-based fallback that findMainPod() already had. When label propagation was delayed, the reconciler kept entering the "create" branch every 10 seconds, publishing ExitHandlerCreated events indefinitely.
  • Fix TTL-based pod cleanup: handleTerminalPhase() returned noUpdate() without rescheduling, so the operator never re-checked when ttlSecondsAfterFinished expired. Now it reschedules after the remaining TTL duration.
  • Add tolerations support: Users can now specify Kubernetes tolerations for ingestion pod scheduling, propagated through the full stack (CRD schema, operator, server-side config and OMJob builder).

Closes #26772
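The TTL fix in the summary can be sketched as follows. This is a minimal illustration of the decision logic only — the method name, parameters, and return convention are assumptions for the sketch, not the actual OMJobReconciler code:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the TTL rescheduling decision in a terminal-phase
// handler. Names are illustrative; the real OMJobReconciler differs.
public class TtlRescheduleDemo {

    /**
     * Returns the number of seconds to wait before re-reconciling, -1 when no
     * TTL is configured (terminal state is final), or 0 when the TTL has
     * already expired and cleanup should run now.
     */
    public static long rescheduleAfterSeconds(Long ttlSecondsAfterFinished,
                                              Instant finishedAt,
                                              Instant now) {
        if (ttlSecondsAfterFinished == null) {
            return -1; // no TTL: never reschedule, matching the old noUpdate() path
        }
        long elapsed = Duration.between(finishedAt, now).getSeconds();
        long remaining = ttlSecondsAfterFinished - elapsed;
        if (remaining <= 0) {
            return 0; // TTL expired: clean up pods immediately
        }
        return remaining; // re-check exactly when the TTL expires
    }

    public static void main(String[] args) {
        Instant finished = Instant.parse("2026-04-02T10:00:00Z");
        // 30s elapsed of a 120s TTL: re-check in 90s
        System.out.println(rescheduleAfterSeconds(120L, finished, finished.plusSeconds(30)));
        // TTL long expired: clean up now
        System.out.println(rescheduleAfterSeconds(120L, finished, finished.plusSeconds(300)));
        // No TTL configured: no reschedule
        System.out.println(rescheduleAfterSeconds(null, finished, finished));
    }
}
```

The key change the PR describes is the middle branch: before the fix, the handler returned without any reschedule, so the "remaining" case never got a second chance to run.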

Test plan

  • OMJobReconcilerTest — verifies exit handler pod is not recreated when already found, TTL rescheduling for terminal phases, immediate cleanup when TTL expired, no reschedule without TTL
  • PodManagerTest — verifies tolerations are applied to pod spec, name-based fallback for exit handler pod discovery, no fallback when status has no pod name
  • CRDSchemaValidationTest — verifies tolerations field exists in CRD schema and Java model
  • K8sPipelineClientConfigTest — verifies tolerations parsing from config maps, empty/invalid tolerations handling

…ns support (#26772)

Fix two bugs in the OMJob operator:
- Exit handler pods were recreated indefinitely because findExitHandlerPod()
  lacked the name-based fallback that findMainPod() already had, causing
  label propagation delays to trigger repeated pod creation events
- Terminal phase handler never rescheduled for TTL-based cleanup, so pods
  were never cleaned up after ttlSecondsAfterFinished expired

Add tolerations support for ingestion pod scheduling across the full stack:
- Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
- Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient
  builder, K8sJobUtils serialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
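The name-based fallback described above can be sketched roughly like this. The lookup functions and the label selector string are stand-ins for the Kubernetes API calls, not the real PodManager implementation:

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Simplified sketch of label-selector lookup with a name-based fallback.
// In the real operator the lookups go through the Kubernetes API; here they
// are modeled as functions so only the control flow is visible.
public class PodLookupDemo {

    /**
     * Try the label selector first; if label propagation is delayed and the
     * selector finds nothing, fall back to the pod name recorded in status.
     * Without the fallback, a "not found" result here re-enters the create
     * branch on every reconcile, recreating the exit handler pod forever.
     */
    public static Optional<String> findExitHandlerPod(
            Function<String, Optional<String>> byLabelSelector,
            String recordedPodName,
            Function<String, Optional<String>> byName) {
        Optional<String> byLabels = byLabelSelector.apply("omjob-role=exit-handler");
        if (byLabels.isPresent()) {
            return byLabels;
        }
        if (recordedPodName == null) {
            return Optional.empty(); // nothing in status yet: safe to create
        }
        return byName.apply(recordedPodName); // name-based fallback
    }

    public static void main(String[] args) {
        // Pod exists but its labels have not propagated to the selector yet:
        Map<String, String> podsByName = Map.of("job-1-exit", "job-1-exit");
        Optional<String> found = findExitHandlerPod(
                sel -> Optional.empty(),                       // labels not visible yet
                "job-1-exit",
                name -> Optional.ofNullable(podsByName.get(name)));
        System.out.println(found.orElse("NOT_FOUND"));         // fallback finds the pod
    }
}
```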
Copilot AI review requested due to automatic review settings April 2, 2026 10:51
@github-actions github-actions bot added the Ingestion and safe to test (run secure GitHub workflows on PRs) labels Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes two operational issues in the OMJob Kubernetes operator (exit handler pod recreation loop and missing TTL-based cleanup rescheduling) and adds end-to-end tolerations support for ingestion pod scheduling.

Changes:

  • Add a name-based fallback when discovering the exit handler pod to prevent repeated creation when label propagation is delayed.
  • Reschedule reconciliation in terminal phases based on remaining TTL so pod cleanup happens when ttlSecondsAfterFinished expires.
  • Introduce tolerations configuration and propagate it through the server-side OMJob builder, CRD schema, operator model, and pod creation.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
openmetadata-service/src/test/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClientConfigTest.java Adds unit tests covering tolerations parsing and default/invalid cases.
openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/OMJob.java Extends OMJob pod spec model to include tolerations.
openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClientConfig.java Adds tolerations config key and parsing into V1Toleration objects.
openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java Propagates tolerations into constructed Job/CronJob pod specs and OMJob specs.
openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sJobUtils.java Serializes tolerations into the OMJob pod spec map when present.
openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/unit/CRDSchemaValidationTest.java Ensures tolerations exists in the CRD schema and Java model field set.
openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/service/PodManagerTest.java Adds tests for tolerations application and exit handler pod name-fallback lookup behavior.
openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/controller/OMJobReconcilerTest.java Adds reconciler tests for exit handler create-once behavior and TTL rescheduling/cleanup.
openmetadata-k8s-operator/src/main/resources/crds/omjob-crd.yaml Adds tolerations schema to both main and exit handler pod specs.
openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/service/PodManager.java Adds name-based fallback to find exit handler pods and applies tolerations to pod specs.
openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/model/OMJobSpec.java Extends operator-side OMJob pod spec model with tolerations field.
openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/controller/OMJobReconciler.java Reschedules reconciliation in terminal phases until TTL expiry; cleans up when expired.

Contributor

github-actions bot commented Apr 2, 2026

🟡 Playwright Results — all passed (21 flaky)

✅ 3597 passed · ❌ 0 failed · 🟡 21 flaky · ⏭️ 207 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 456 0 1 2
🟡 Shard 2 640 0 2 32
🟡 Shard 3 648 0 3 26
🟡 Shard 4 613 0 9 47
🟡 Shard 5 605 0 2 67
🟡 Shard 6 635 0 4 33
🟡 21 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Bulk selection operations (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Database Schema (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Contract Status badge should be visible on condition if Contract Tab is present/hidden by Persona (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with assets (tables, topics, dashboards) preserves associations (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Verify Domain entity API calls do not include invalid domains field in glossary term assets (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Set & update column-level custom property (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Glossary Term Add, Update and Remove for child entities (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tag Add, Update and Remove (shard 4, 2 retries)
  • Pages/ExploreTree.spec.ts › Verify Database and Database Schema available in explore tree (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Stored Procedure (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ProfilerConfigurationPage.spec.ts › Non admin user (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Adds the tolerations config binding so the server picks up the
K8S_TOLERATIONS env var set by the Helm chart secret.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 2, 2026 14:28
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

akash-jain-10 previously approved these changes Apr 6, 2026
- Remove redundant server-created pod selector fallback in findMainPod()
  since buildPodSelector() now matches all pods by omjob-name alone
- Add null guard for getItems() in deletePods() to prevent NPE
- Update local test values for namespace and image config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
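The single-selector idea from this commit can be illustrated with a small sketch. The `omjob-name` label key and the helper names are assumptions based on the commit message, not the actual code:

```java
import java.util.Map;

// Hypothetical sketch of a label selector keyed only on the omjob name, as
// described in the commit above: one selector matches every pod of the job,
// server-created or operator-created, so no separate fallback is needed.
public class SelectorDemo {

    /** Build the selector from the omjob name alone. */
    static Map<String, String> buildPodSelector(String omJobName) {
        return Map.of("omjob-name", omJobName);
    }

    /** A pod matches when its labels contain every selector entry. */
    static boolean matches(Map<String, String> podLabels, Map<String, String> selector) {
        return podLabels.entrySet().containsAll(selector.entrySet());
    }

    public static void main(String[] args) {
        Map<String, String> selector = buildPodSelector("ingest-job-1");
        // Main pod and exit handler both carry omjob-name, so one selector covers both:
        System.out.println(matches(
                Map.of("omjob-name", "ingest-job-1", "omjob-role", "main"), selector));
        System.out.println(matches(
                Map.of("omjob-name", "ingest-job-1", "omjob-role", "exit-handler"), selector));
    }
}
```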
Copilot AI review requested due to automatic review settings April 9, 2026 06:42
@pmbrull pmbrull added the To release label (will cherry-pick this PR into the release branch) Apr 9, 2026

gitar-bot bot commented Apr 9, 2026

Code Review ✅ Approved — 1 of 1 findings resolved

Fixes the Kubernetes operator exit handler pod loop and TTL cleanup, adding tolerations support. Resolves the TTL boundary race condition where pods were never cleaned at exact expiry second.

✅ 1 resolved
Edge Case: TTL boundary race: pods never cleaned at exact expiry second

📄 openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/controller/OMJobReconciler.java:309 📄 openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/controller/OMJobReconciler.java:321-335
In handleTerminalPhase, when remaining == 0 (i.e., elapsed time equals TTL exactly), shouldCleanup() returns false (it uses strict >) and the reschedule guard remaining > 0 is also false. The method falls through to the no-reschedule terminal return at line 339, so pods are never cleaned up.

This is a very narrow 1-second window and any subsequent reconciliation event would resolve it, but it could theoretically leave pods behind if no other event triggers reconciliation.
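A minimal sketch of the boundary case described here, with hypothetical names: the buggy variant mirrors the strict `>` in shouldCleanup plus the `remaining > 0` reschedule guard, and the fixed variant treats the exact expiry second as expired so nothing falls through:

```java
// Illustrative sketch of the TTL boundary hole; not the actual reconciler code.
public class TtlBoundaryDemo {

    enum Decision { CLEANUP, RESCHEDULE, NO_ACTION }

    /** Strict '>' cleanup plus 'remaining > 0' guard: remaining == 0 falls through. */
    static Decision buggy(long remaining) {
        if (remaining < 0) return Decision.CLEANUP;    // elapsed strictly greater than TTL
        if (remaining > 0) return Decision.RESCHEDULE;
        return Decision.NO_ACTION;                     // the 1-second hole at expiry
    }

    /** Treating remaining <= 0 as expired closes the gap. */
    static Decision fixed(long remaining) {
        if (remaining <= 0) return Decision.CLEANUP;   // expiry second included
        return Decision.RESCHEDULE;
    }

    public static void main(String[] args) {
        System.out.println(buggy(0));  // NO_ACTION: pods left behind
        System.out.println(fixed(0));  // CLEANUP
    }
}
```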


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comment on lines +267 to +271
if (map.get("tolerationSeconds") != null) {
  toleration.setTolerationSeconds(
      Long.parseLong(map.get("tolerationSeconds").toString()));
}
result.add(toleration);

Copilot AI Apr 9, 2026


parseTolerations wraps the whole parsing loop in a single try/catch, so a single malformed field (e.g., non-numeric tolerationSeconds) will throw and cause the method to return an empty list, discarding any tolerations already parsed. Consider handling parse errors per-item/per-field (e.g., catch NumberFormatException around tolerationSeconds) so valid tolerations are still applied while skipping only invalid entries.
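A sketch of the per-item parsing this comment suggests, using a plain record in place of V1Toleration; the surrounding method is illustrative, not the actual K8sPipelineClientConfig code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of per-item toleration parsing: one malformed entry is skipped
// instead of discarding the whole list. The record stands in for the
// Kubernetes V1Toleration type.
public class TolerationParseDemo {

    record Toleration(String key, String operator, String effect, Long tolerationSeconds) {}

    /** Parse each entry independently so a bad field only drops that entry. */
    public static List<Toleration> parseTolerations(List<? extends Map<String, ?>> raw) {
        List<Toleration> result = new ArrayList<>();
        for (Map<String, ?> map : raw) {
            try {
                Long seconds = null;
                Object s = map.get("tolerationSeconds");
                if (s != null) {
                    seconds = Long.parseLong(s.toString()); // may throw per item
                }
                result.add(new Toleration(
                        (String) map.get("key"),
                        (String) map.get("operator"),
                        (String) map.get("effect"),
                        seconds));
            } catch (NumberFormatException | ClassCastException e) {
                // skip only this entry; previously parsed tolerations survive
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> raw = List.of(
                Map.of("key", "ingestion", "operator", "Exists", "effect", "NoSchedule"),
                Map.of("key", "bad", "operator", "Exists", "effect", "NoSchedule",
                       "tolerationSeconds", "not-a-number"));
        System.out.println(parseTolerations(raw).size()); // the valid entry is kept
    }
}
```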

Comment on lines 316 to 320
LOG.info("TTL expired for OMJob: {}, cleaning up pods", omJob.getMetadata().getName());
podManager.deletePods(omJob);

eventPublisher.publishNormalEvent(
    omJob, "ResourcesCleanedUp", "Cleaned up pods due to TTL expiration");

Copilot AI Apr 9, 2026


deletePods is called here, but PodManager.deletePods swallows exceptions internally. That means reconciliation will proceed to publish ResourcesCleanedUp even if deletion failed. Consider having deletePods return a boolean (or throw) and only publishing the cleanup event on success.
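One way to sketch the suggested success-reporting contract, with hypothetical names standing in for PodManager and the event publisher:

```java
import java.util.function.Supplier;

// Sketch of the reviewer's suggestion: deletePods reports success so the
// caller only publishes ResourcesCleanedUp when deletion actually worked.
// Names and the delete action are illustrative, not the real PodManager API.
public class CleanupEventDemo {

    /** Returns true only when deletion succeeded, instead of swallowing errors. */
    static boolean deletePods(Supplier<Boolean> deleteAction) {
        try {
            return deleteAction.get();
        } catch (RuntimeException e) {
            return false; // surface the failure to the caller
        }
    }

    static String reconcileCleanup(Supplier<Boolean> deleteAction) {
        if (deletePods(deleteAction)) {
            return "ResourcesCleanedUp";   // publish the event only on success
        }
        return "CleanupRetryScheduled";    // reschedule rather than claim success
    }

    public static void main(String[] args) {
        System.out.println(reconcileCleanup(() -> true));
        System.out.println(reconcileCleanup(() -> { throw new RuntimeException("API error"); }));
    }
}
```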

Comment on lines +321 to +323

return UpdateControl.<OMJobResource>noUpdate();
}

Copilot AI Apr 9, 2026


After attempting TTL cleanup, the reconciler returns noUpdate() with no reschedule. If pod deletion fails due to a transient API error, the operator won't retry cleanup until the next external reconcile trigger. Consider rescheduling after a short delay when cleanup fails (and only stopping reconciliation once deletion succeeds).

Comment on lines +300 to +308
List<Pod> pods = client.pods().inNamespace(namespace).withLabels(selector).list().getItems();
if (pods == null) {
  pods = List.of();
}
for (Pod pod : pods) {
  String podName = pod.getMetadata().getName();
  client.pods().inNamespace(namespace).withName(podName).delete();
  LOG.info("Deleted pod: {}", podName);
}

Copilot AI Apr 9, 2026


This changes pod cleanup from a single label-selector delete to listing pods and deleting them one-by-one, which increases API calls and can become expensive when many pods exist (e.g., after prior reconciliation loops). Unless per-pod deletion is required for correctness, consider using withLabels(selector).delete() (or batch deletion) and rely on logging at a higher level.

Suggested change — replace the per-pod loop:

  List<Pod> pods = client.pods().inNamespace(namespace).withLabels(selector).list().getItems();
  if (pods == null) {
    pods = List.of();
  }
  for (Pod pod : pods) {
    String podName = pod.getMetadata().getName();
    client.pods().inNamespace(namespace).withName(podName).delete();
    LOG.info("Deleted pod: {}", podName);
  }

with a single label-selector delete:

  client.pods().inNamespace(namespace).withLabels(selector).delete();
  LOG.info("Deleted pods for OMJob: {}", omJob.getMetadata().getName());

sonarqubecloud bot commented Apr 9, 2026

@pmbrull pmbrull merged commit cfd71e8 into main Apr 14, 2026
78 of 108 checks passed
@pmbrull pmbrull deleted the task/fix-issue-26772-with-proper-tests-728b027c branch April 14, 2026 07:42
@github-actions
Contributor

Changes have been cherry-picked to the 1.12.6 branch.

github-actions bot pushed a commit that referenced this pull request Apr 14, 2026
…ns (#26971)

* Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations support (#26772)

Fix two bugs in the OMJob operator:
- Exit handler pods were recreated indefinitely because findExitHandlerPod()
  lacked the name-based fallback that findMainPod() already had, causing
  label propagation delays to trigger repeated pod creation events
- Terminal phase handler never rescheduled for TTL-based cleanup, so pods
  were never cleaned up after ttlSecondsAfterFinished expired

Add tolerations support for ingestion pod scheduling across the full stack:
- Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
- Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient
  builder, K8sJobUtils serialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add K8S_TOLERATIONS env var mapping in openmetadata.yaml

Adds the tolerations config binding so the server picks up the
K8S_TOLERATIONS env var set by the Helm chart secret.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add tolerations to k8s test values for local validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix cleanup

* Address PR review: remove redundant pod lookup and guard null items

- Remove redundant server-created pod selector fallback in findMainPod()
  since buildPodSelector() now matches all pods by omjob-name alone
- Add null guard for getItems() in deletePods() to prevent NPE
- Update local test values for namespace and image config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit cfd71e8)


Development

Successfully merging this pull request may close these issues.

k8s operator service improvements
