Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations#26971
Conversation
…ns support (#26772)

Fix two bugs in the OMJob operator:
- Exit handler pods were recreated indefinitely because findExitHandlerPod() lacked the name-based fallback that findMainPod() already had, causing label propagation delays to trigger repeated pod creation events
- Terminal phase handler never rescheduled for TTL-based cleanup, so pods were never cleaned up after ttlSecondsAfterFinished expired

Add tolerations support for ingestion pod scheduling across the full stack:
- Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
- Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient builder, K8sJobUtils serialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR fixes two operational issues in the OMJob Kubernetes operator (exit handler pod recreation loop and missing TTL-based cleanup rescheduling) and adds end-to-end tolerations support for ingestion pod scheduling.
Changes:
- Add a name-based fallback when discovering the exit handler pod to prevent repeated creation when label propagation is delayed.
- Reschedule reconciliation in terminal phases based on remaining TTL so pod cleanup happens when `ttlSecondsAfterFinished` expires.
- Introduce tolerations configuration and propagate it through the server-side OMJob builder, CRD schema, operator model, and pod creation.
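The name-based fallback for exit handler pod discovery can be sketched in plain Java. This is a minimal model, not the operator's actual code: a `Map` stands in for the Kubernetes pod list, and the `-exit-handler` name suffix is an assumed naming convention for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ExitHandlerLookup {
  // Hypothetical pod store: pod name -> omjob-name label value. A null label
  // value simulates a pod whose labels have not propagated yet.
  static Optional<String> findByLabel(Map<String, String> pods, String omJobName) {
    return pods.entrySet().stream()
        .filter(e -> omJobName.equals(e.getValue()))
        .map(Map.Entry::getKey)
        .findFirst();
  }

  // Name-based fallback mirroring findMainPod(): if the label selector finds
  // nothing (label propagation delayed), look the pod up by its deterministic
  // name before concluding it must be created again.
  static Optional<String> findExitHandlerPod(Map<String, String> pods, String omJobName) {
    return findByLabel(pods, omJobName)
        .or(() -> {
          String expected = omJobName + "-exit-handler"; // assumed naming convention
          return pods.containsKey(expected) ? Optional.of(expected) : Optional.empty();
        });
  }

  public static void main(String[] args) {
    // Pod exists, but its omjob-name label has not propagated yet.
    Map<String, String> pods = new HashMap<>();
    pods.put("job1-exit-handler", null);
    // Fallback finds it by name, so no duplicate creation event is published.
    System.out.println(findExitHandlerPod(pods, "job1").isPresent());
  }
}
```

Without the `.or(...)` fallback, the label lookup alone would return empty here and the reconciler would enter the "create" branch again, which is exactly the recreation loop this PR fixes.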
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-service/src/test/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClientConfigTest.java | Adds unit tests covering tolerations parsing and default/invalid cases. |
| openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/OMJob.java | Extends OMJob pod spec model to include tolerations. |
| openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClientConfig.java | Adds tolerations config key and parsing into V1Toleration objects. |
| openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java | Propagates tolerations into constructed Job/CronJob pod specs and OMJob specs. |
| openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sJobUtils.java | Serializes tolerations into the OMJob pod spec map when present. |
| openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/unit/CRDSchemaValidationTest.java | Ensures tolerations exists in the CRD schema and Java model field set. |
| openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/service/PodManagerTest.java | Adds tests for tolerations application and exit handler pod name-fallback lookup behavior. |
| openmetadata-k8s-operator/src/test/java/org/openmetadata/operator/controller/OMJobReconcilerTest.java | Adds reconciler tests for exit handler create-once behavior and TTL rescheduling/cleanup. |
| openmetadata-k8s-operator/src/main/resources/crds/omjob-crd.yaml | Adds tolerations schema to both main and exit handler pod specs. |
| openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/service/PodManager.java | Adds name-based fallback to find exit handler pods and applies tolerations to pod specs. |
| openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/model/OMJobSpec.java | Extends operator-side OMJob pod spec model with tolerations field. |
| openmetadata-k8s-operator/src/main/java/org/openmetadata/operator/controller/OMJobReconciler.java | Reschedules reconciliation in terminal phases until TTL expiry; cleans up when expired. |
🟡 Playwright Results — all passed (21 flaky)

✅ 3597 passed · ❌ 0 failed · 🟡 21 flaky · ⏭️ 207 skipped

🟡 21 flaky test(s) (passed on retry)

How to debug locally:

```shell
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip  # view trace
```
Adds the tolerations config binding so the server picks up the K8S_TOLERATIONS env var set by the Helm chart secret.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove redundant server-created pod selector fallback in findMainPod() since buildPodSelector() now matches all pods by omjob-name alone
- Add null guard for getItems() in deletePods() to prevent NPE
- Update local test values for namespace and image config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review ✅ Approved · 1 resolved / 1 finding

Fixes the Kubernetes operator exit handler pod loop and TTL cleanup, adding tolerations support. Resolves the TTL boundary race condition where pods were never cleaned at the exact expiry second.

✅ 1 resolved · Edge Case: TTL boundary race: pods never cleaned at exact expiry second
```java
if (map.get("tolerationSeconds") != null) {
  toleration.setTolerationSeconds(
      Long.parseLong(map.get("tolerationSeconds").toString()));
}
result.add(toleration);
```
parseTolerations wraps the whole parsing loop in a single try/catch, so a single malformed field (e.g., non-numeric tolerationSeconds) will throw and cause the method to return an empty list, discarding any tolerations already parsed. Consider handling parse errors per-item/per-field (e.g., catch NumberFormatException around tolerationSeconds) so valid tolerations are still applied while skipping only invalid entries.
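The per-entry error handling the comment suggests can be sketched as follows. This is a simplified model, not the actual `K8sPipelineClientConfig` code: the nested `Toleration` record is a stand-in for `V1Toleration`, carrying only the fields parsed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TolerationParsing {
  // Simplified stand-in for V1Toleration: only the fields this sketch parses.
  record Toleration(String key, Long tolerationSeconds) {}

  // Parse each toleration map independently, so one malformed entry (e.g. a
  // non-numeric tolerationSeconds) is skipped instead of discarding every
  // toleration parsed so far.
  static List<Toleration> parseTolerations(List<Map<String, Object>> raw) {
    List<Toleration> result = new ArrayList<>();
    for (Map<String, Object> map : raw) {
      try {
        Long seconds = null;
        Object s = map.get("tolerationSeconds");
        if (s != null) {
          seconds = Long.parseLong(s.toString()); // may throw NumberFormatException
        }
        result.add(new Toleration((String) map.get("key"), seconds));
      } catch (NumberFormatException e) {
        // Skip only this entry; previously parsed tolerations are still applied.
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Toleration> parsed = parseTolerations(List.of(
        Map.of("key", "gpu", "tolerationSeconds", "300"),
        Map.of("key", "bad", "tolerationSeconds", "oops"),  // invalid, skipped
        Map.of("key", "spot")));                            // no seconds, valid
    System.out.println(parsed.size());
  }
}
```

With the catch inside the loop, the invalid middle entry is dropped while the two valid tolerations survive; with a single try/catch around the whole loop, all three would be lost.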
```java
LOG.info("TTL expired for OMJob: {}, cleaning up pods", omJob.getMetadata().getName());
podManager.deletePods(omJob);

eventPublisher.publishNormalEvent(
    omJob, "ResourcesCleanedUp", "Cleaned up pods due to TTL expiration");
```
deletePods is called here, but PodManager.deletePods swallows exceptions internally. That means reconciliation will proceed to publish ResourcesCleanedUp even if deletion failed. Consider having deletePods return a boolean (or throw) and only publishing the cleanup event on success.
```java
return UpdateControl.<OMJobResource>noUpdate();
}
```
After attempting TTL cleanup, the reconciler returns noUpdate() with no reschedule. If pod deletion fails due to a transient API error, the operator won't retry cleanup until the next external reconcile trigger. Consider rescheduling after a short delay when cleanup fails (and only stopping reconciliation once deletion succeeds).
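One way to act on both review comments is to have `deletePods` report success and branch on the result. The sketch below models only the decision logic: the `Control` record is a hypothetical stand-in for the operator SDK's `UpdateControl`, and the 30-second retry delay is an assumption, not a value from the PR.

```java
import java.time.Duration;

public class TerminalPhaseRetry {
  // Hypothetical stand-in for UpdateControl: either stop reconciling
  // (reschedule == false) or retry after a delay.
  record Control(boolean reschedule, Duration delay) {
    static Control noUpdate() { return new Control(false, null); }
    static Control rescheduleAfter(Duration d) { return new Control(true, d); }
  }

  static final Duration CLEANUP_RETRY = Duration.ofSeconds(30); // assumed value

  // If deletePods reported failure (e.g. a transient API error), retry after a
  // short delay; publish ResourcesCleanedUp and stop only once deletion
  // actually succeeded.
  static Control afterTtlCleanup(boolean podsDeleted) {
    if (!podsDeleted) {
      return Control.rescheduleAfter(CLEANUP_RETRY);
    }
    // eventPublisher.publishNormalEvent(omJob, "ResourcesCleanedUp", ...);
    return Control.noUpdate();
  }

  public static void main(String[] args) {
    System.out.println(afterTtlCleanup(false).reschedule()); // failure -> retry
    System.out.println(afterTtlCleanup(true).reschedule());  // success -> done
  }
}
```

This couples the cleanup event to an observed deletion result, so a swallowed exception inside `deletePods` can no longer produce a misleading `ResourcesCleanedUp` event.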
```java
List<Pod> pods = client.pods().inNamespace(namespace).withLabels(selector).list().getItems();
if (pods == null) {
  pods = List.of();
}
for (Pod pod : pods) {
  String podName = pod.getMetadata().getName();
  client.pods().inNamespace(namespace).withName(podName).delete();
  LOG.info("Deleted pod: {}", podName);
}
```
This changes pod cleanup from a single label-selector delete to listing pods and deleting them one-by-one, which increases API calls and can become expensive when many pods exist (e.g., after prior reconciliation loops). Unless per-pod deletion is required for correctness, consider using withLabels(selector).delete() (or batch deletion) and rely on logging at a higher level.
Suggested change:

```diff
-List<Pod> pods = client.pods().inNamespace(namespace).withLabels(selector).list().getItems();
-if (pods == null) {
-  pods = List.of();
-}
-for (Pod pod : pods) {
-  String podName = pod.getMetadata().getName();
-  client.pods().inNamespace(namespace).withName(podName).delete();
-  LOG.info("Deleted pod: {}", podName);
-}
+client.pods().inNamespace(namespace).withLabels(selector).delete();
+LOG.info("Deleted pods for OMJob: {}", omJob.getMetadata().getName());
```
Changes have been cherry-picked to the 1.12.6 branch.
…ns (#26971)

* Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations support (#26772)

  Fix two bugs in the OMJob operator:
  - Exit handler pods were recreated indefinitely because findExitHandlerPod() lacked the name-based fallback that findMainPod() already had, causing label propagation delays to trigger repeated pod creation events
  - Terminal phase handler never rescheduled for TTL-based cleanup, so pods were never cleaned up after ttlSecondsAfterFinished expired

  Add tolerations support for ingestion pod scheduling across the full stack:
  - Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
  - Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient builder, K8sJobUtils serialization

* Add K8S_TOLERATIONS env var mapping in openmetadata.yaml

  Adds the tolerations config binding so the server picks up the K8S_TOLERATIONS env var set by the Helm chart secret.

* Add tolerations to k8s test values for local validation

* fix cleanup

* Address PR review: remove redundant pod lookup and guard null items

  - Remove redundant server-created pod selector fallback in findMainPod() since buildPodSelector() now matches all pods by omjob-name alone
  - Add null guard for getItems() in deletePods() to prevent NPE
  - Update local test values for namespace and image config

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

(cherry picked from commit cfd71e8)



requires open-metadata/openmetadata-helm-charts#492
Summary
- `findExitHandlerPod()` was missing the name-based fallback that `findMainPod()` already had. When label propagation was delayed, the reconciler kept entering the "create" branch every 10 seconds, publishing `ExitHandlerCreated` events indefinitely.
- `handleTerminalPhase()` returned `noUpdate()` without rescheduling, so the operator never re-checked when `ttlSecondsAfterFinished` expired. Now it reschedules after the remaining TTL duration.

Closes #26772
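The remaining-TTL computation this rescheduling relies on can be sketched with plain `java.time`; the method name and the clamp-to-zero behavior are illustrative, not taken from the operator source.

```java
import java.time.Duration;
import java.time.Instant;

public class TtlReschedule {
  // How long to wait before re-reconciling a finished OMJob:
  //   remaining = (completionTime + ttlSecondsAfterFinished) - now
  // A non-positive result means the TTL already expired, so cleanup should
  // run immediately instead of rescheduling.
  static Duration remainingTtl(Instant completionTime, long ttlSeconds, Instant now) {
    Duration remaining = Duration.between(now, completionTime.plusSeconds(ttlSeconds));
    return remaining.isNegative() ? Duration.ZERO : remaining;
  }

  public static void main(String[] args) {
    Instant done = Instant.parse("2024-01-01T00:00:00Z");
    // 100s after completion with a 300s TTL: reschedule after the remaining 200s.
    System.out.println(remainingTtl(done, 300, done.plusSeconds(100)).toSeconds());
    // 400s after completion: TTL already expired, clean up now.
    System.out.println(remainingTtl(done, 300, done.plusSeconds(400)).toSeconds());
  }
}
```

Rescheduling after exactly the remaining duration means the reconciler wakes up at (or just past) expiry, which also closes the boundary case where pods were never cleaned at the exact expiry second.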
Test plan
- `OMJobReconcilerTest` — verifies exit handler pod is not recreated when already found, TTL rescheduling for terminal phases, immediate cleanup when TTL expired, no reschedule without TTL
- `PodManagerTest` — verifies tolerations are applied to pod spec, name-based fallback for exit handler pod discovery, no fallback when status has no pod name
- `CRDSchemaValidationTest` — verifies tolerations field exists in CRD schema and Java model
- `K8sPipelineClientConfigTest` — verifies tolerations parsing from config maps, empty/invalid tolerations handling