Fixing pod leak in KubernetesJobOperator by Copilot · Pull Request #1 · jykae/airflow

Copilot · 2026-05-22T11:17:50Z

Inspect unresolved review comments and affected docs/newsfragment text
Update KubernetesJobOperator docs for on_finish_action and on_kill_action behavior
Update newsfragment wording to match actual supported actions/kill behavior
Run targeted validation for touched docs files
Run final validation checks and summarize

…sJobOperator

KubernetesJobOperator inherited from KubernetesPodOperator but overrode execute() without calling post_complete_action(), so the monitoring / log-streaming pods discovered via get_pods() were never deleted. These pods have no ownerReferences to the V1Job, so ttl_seconds_after_finished and the Foreground cascade in on_kill don't reap them either.

- execute() and execute_complete() now wrap their work in try/finally and call post_complete_action() for each pod in self.pods. on_finish_action (delete_pod / delete_succeeded_pod / keep_pod) is now honoured.
- on_kill() additionally calls pod_manager.delete_pod() for each monitoring pod (the Job's foreground cascade doesn't reach them).
- Per-pod cleanup errors are logged but never mask the in-flight exception, so Job-level failures keep propagating.
- execute_complete() resolves monitoring pods once and shares the lookup between the log-retrieval path and the cleanup path.
- Added unit tests, a bugfix newsfragment, and an operators.rst section documenting the cleanup contract.
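The try/finally contract described in this commit message can be illustrated with a minimal, library-free sketch. The names `run_job`, `cleanup_pods` and `cleanup_one` are hypothetical and are not taken from the operator's code; the sketch only shows the general pattern of always attempting per-pod cleanup while never letting a cleanup error replace the exception that is already propagating.

```python
import logging

log = logging.getLogger(__name__)


def cleanup_pods(pods, cleanup_one):
    """Call cleanup_one(pod) for every pod; log failures instead of raising."""
    failed = []
    for pod in pods:
        try:
            cleanup_one(pod)
        except Exception:
            # Log and continue, so one bad pod neither stops the loop nor
            # replaces an exception that is already in flight.
            log.exception("Failed to clean up pod %s", pod)
            failed.append(pod)
    return failed


def run_job(work, pods, cleanup_one):
    """Run work() and always attempt pod cleanup afterwards."""
    try:
        return work()
    finally:
        # Runs on success and on failure; cleanup_pods never raises,
        # so the original result or exception is preserved.
        cleanup_pods(pods, cleanup_one)
```

Because `cleanup_pods` catches its own errors, a failing `work()` still surfaces its original exception to the caller, which is the behaviour the commit message calls "never mask the in-flight exception".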

… drop unused import, relax pod-deletion ordering in test, fix trailing comma

Copilot

Pull request overview

This PR addresses a resource-leak in KubernetesJobOperator by ensuring the monitoring/log-streaming pods discovered via get_pods() are cleaned up at the end of execution (including kill and deferrable-resume paths), and documents how on_finish_action applies to these pods.

Changes:

Add monitoring-pod cleanup to execute() (via finally), execute_complete(), and on_kill().
Add unit tests covering cleanup on success/failure, honoring on_finish_action, deferrable resume cleanup, and kill-path cleanup.
Document the distinction between Job-owned worker pods vs. operator-created monitoring pods, and add a bugfix newsfragment.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py`	Adds cleanup logic for monitoring pods across success/failure/kill/deferrable resume paths.
`providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py`	Adds regression tests asserting monitoring-pod cleanup behavior and `on_finish_action` handling.
`providers/cncf/kubernetes/newsfragments/kubernetes-job-operator-cleanup-monitoring-pods.bugfix.rst`	Release note describing the monitoring-pod leak fix and new behavior.
`providers/cncf/kubernetes/docs/operators.rst`	Documentation explaining monitoring pods and how `on_finish_action` affects their cleanup.

In _cleanup_monitoring_pods, remote_pod is resolved via find_pod(), which is designed to locate a single matching pod by task-instance labels and can invoke duplicate-pod resolution logic (process_duplicate_label_pods). For KubernetesJobOperator with parallelism > 1, this lookup can return the wrong pod (or trigger duplicate-handling side effects), so post_complete_action() may receive a mismatched remote_pod. Consider using the already-discovered pod’s name/namespace to refresh state (e.g. via hook.get_pod) or just pass remote_pod=pod when you already have the V1Pod object from get_pods().

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
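To make the concern concrete, here is a small sketch using an in-memory stand-in for the cluster. `Pod`, `FakeCluster`, `find_by_labels` and `resolve_remote_pod` are all hypothetical names invented for illustration; they are not the Airflow or Kubernetes client APIs. The point is only that a label-based lookup becomes ambiguous once several pods share the same labels, while a lookup by name and namespace stays exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    namespace: str
    labels: tuple = ()


class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self, pods):
        self._pods = {(p.namespace, p.name): p for p in pods}

    def find_by_labels(self, namespace, labels):
        # Label lookup: may match several pods when parallelism > 1.
        wanted = set(labels)
        return [
            p
            for p in self._pods.values()
            if p.namespace == namespace and wanted <= set(p.labels)
        ]

    def get_pod(self, name, namespace):
        # Identity lookup: at most one pod can match.
        return self._pods[(namespace, name)]


def resolve_remote_pod(cluster, pod):
    """Refresh a known pod by identity; fall back to the object in hand."""
    try:
        return cluster.get_pod(pod.name, pod.namespace)
    except KeyError:
        # The pod is already gone; reuse the object we discovered earlier.
        return pod
```

With two pods carrying identical labels, `find_by_labels` returns both, so picking "the" pod from that result is a guess. `resolve_remote_pod` avoids the guess because it starts from a pod that was already discovered.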

The new unit tests add several mock.MagicMock() instances (pods, jobs, TI, etc.) without spec/autospec, and some patch() usages also create non-spec'd mocks by default. Using autospec=True on patches and create_autospec(...)/MagicMock(spec=...) for key Kubernetes objects helps catch typos/attribute mismatches in these tests and aligns with Airflow’s test hardening guidance.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
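The difference between a bare `MagicMock` and a spec'd mock can be shown with the standard library alone. In this sketch, `PodManager` and `call_misspelled` are hypothetical stand-ins written for the example; only `unittest.mock.MagicMock` and `unittest.mock.create_autospec` are real APIs.

```python
from unittest import mock


class PodManager:
    """Hypothetical collaborator with a single real method."""

    def delete_pod(self, pod):
        raise NotImplementedError


def call_misspelled(manager):
    # Deliberate typo: the real method is delete_pod, not delete_pods.
    manager.delete_pods("pod-0")


# A bare MagicMock invents any attribute on access, so the typo passes.
loose = mock.MagicMock()
call_misspelled(loose)

# An autospec'd mock only exposes attributes that exist on PodManager.
strict = mock.create_autospec(PodManager, instance=True)
try:
    call_misspelled(strict)
    typo_caught = False
except AttributeError:
    typo_caught = True
```

An autospec'd mock also checks call signatures, so calling `strict.delete_pod()` without the `pod` argument raises `TypeError` rather than silently recording the call.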

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

…nd test assertions

- triggers/job.py: Always include pod_names/pod_namespace in trigger event regardless of get_logs setting, so execute_complete() can reliably clean up monitoring pods even when get_logs=False
- operators/job.py: Log unexpected ApiException in on_kill() instead of suppressing all ApiExceptions; remove unused `suppress` import
- tests/test_job.py: Rewrite test_execute_respects_keep_pod and test_execute_deletes_pod_default to keep process_pod_deletion real and assert on pod_manager.delete_pod; stub hook.get_pod for remote_pod resolution
- tests/test_job.py: Add regression test for get_logs=False deferrable path
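The first item above, about the trigger event, can be sketched as follows. Only the `pod_names` and `pod_namespace` keys come from the commit message; the function names and every other key in the payload are assumptions made for illustration, and the real trigger's event shape may differ.

```python
def build_event(job_name, namespace, pod_names, get_logs, status="success"):
    """Build a trigger event payload (hypothetical shape)."""
    return {
        "name": job_name,
        "namespace": namespace,
        "status": status,
        "get_logs": get_logs,
        # Always present, even when get_logs is False, so the resume
        # path can still find the pods it has to clean up.
        "pod_names": list(pod_names),
        "pod_namespace": namespace,
    }


def pods_to_clean(event):
    """Return (namespace, name) pairs the resume path should clean up."""
    namespace = event.get("pod_namespace")
    return [(namespace, name) for name in event.get("pod_names") or []]
```

If the pod fields were only added when `get_logs` is true, `pods_to_clean` would return an empty list on the `get_logs=False` path and the pods would be left behind, which is the gap this commit closes.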

…ccidental deletion of method signature

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

+* ``delete_pod`` (default) — the pod is deleted after the task
+  finishes (success or failure).
+* ``delete_succeeded_pod`` — the pod is deleted only when the task
+  succeeded.
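As a rough illustration of these semantics, the decision can be written as a small pure function. The `OnFinishAction` enum below is a local stand-in defined for this sketch, and `should_delete` is a hypothetical helper; the action names come from the newsfragment (which also lists `keep_pod`), and "succeeded" follows the wording of the docs text above rather than any specific pod-phase check.

```python
from enum import Enum


class OnFinishAction(str, Enum):
    """Local stand-in listing the three documented actions."""

    DELETE_POD = "delete_pod"
    DELETE_SUCCEEDED_POD = "delete_succeeded_pod"
    KEEP_POD = "keep_pod"


def should_delete(action, task_succeeded):
    """Decide whether a pod should be deleted once the task has finished."""
    action = OnFinishAction(action)  # raises ValueError for unknown names
    if action is OnFinishAction.DELETE_POD:
        return True
    if action is OnFinishAction.DELETE_SUCCEEDED_POD:
        return task_succeeded
    return False  # keep_pod
```

For example, `should_delete("delete_succeeded_pod", False)` returns `False`, so a failed run keeps its pod available for debugging.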


+When the task is killed, ``on_kill`` deletes the Job (with foreground cascade)
+and additionally attempts pod deletion directly for the discovered pods.
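The kill path can be sketched in the same spirit. `ApiError` is a hypothetical stand-in for an API exception that carries an HTTP status, and `kill`, `delete_job` and `delete_pod` are illustrative names. Treating a 404 as "already gone" is an assumption of this sketch; the PR only states that unexpected API errors are logged rather than suppressed wholesale.

```python
import logging

log = logging.getLogger(__name__)


class ApiError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def kill(delete_job, delete_pod, pods):
    """Delete the Job, then try to delete each discovered pod directly."""
    delete_job()
    unexpected = []
    for pod in pods:
        try:
            delete_pod(pod)
        except ApiError as err:
            if err.status == 404:
                # Assumed benign: the pod no longer exists.
                continue
            # Anything else is logged so it stays visible to operators.
            log.warning("Could not delete pod %s: %s", pod, err)
            unexpected.append((pod, err.status))
    return unexpected
```

Returning the list of unexpected failures is a choice made for the sketch so that the behaviour is easy to assert on in a test.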


@@ -0,0 +1 @@
+Fix pod cleanup gaps in ``KubernetesJobOperator``: ``execute()``, ``execute_complete()`` and ``on_kill()`` now consistently clean up pods discovered via ``get_pods()``, including deferrable resume paths where pod lookup can fail. The inherited ``on_finish_action`` parameter (``delete_pod`` / ``delete_succeeded_pod`` / ``keep_pod``) is honoured for these pods, matching ``KubernetesPodOperator`` behavior. Per-pod cleanup errors are logged but never mask a Job-level failure.


* fix(providers/cncf/kubernetes): clean up monitoring pods in KubernetesJobOperator

  KubernetesJobOperator inherited from KubernetesPodOperator but overrode execute() without calling post_complete_action(), so the monitoring / log-streaming pods discovered via get_pods() were never deleted. These pods have no ownerReferences to the V1Job, so ttl_seconds_after_finished and the Foreground cascade in on_kill don't reap them either.

  - execute() and execute_complete() now wrap their work in try/finally and call post_complete_action() for each pod in self.pods. on_finish_action (delete_pod / delete_succeeded_pod / keep_pod) is now honoured.
  - on_kill() additionally calls pod_manager.delete_pod() for each monitoring pod (the Job's foreground cascade doesn't reach them).
  - Per-pod cleanup errors are logged but never mask the in-flight exception, so Job-level failures keep propagating.
  - execute_complete() resolves monitoring pods once and shares the lookup between the log-retrieval path and the cleanup path.
  - Added unit tests, a bugfix newsfragment, and an operators.rst section documenting the cleanup contract.

* Address code review feedback: remove dead PodNotFoundException check, drop unused import, relax pod-deletion ordering in test, fix trailing comma

* Potential fix for pull request finding

  In _cleanup_monitoring_pods, remote_pod is resolved via find_pod(), which is designed to locate a single matching pod by task-instance labels and can invoke duplicate-pod resolution logic (process_duplicate_label_pods). For KubernetesJobOperator with parallelism > 1, this lookup can return the wrong pod (or trigger duplicate-handling side effects), so post_complete_action() may receive a mismatched remote_pod. Consider using the already-discovered pod’s name/namespace to refresh state (e.g. via hook.get_pod) or just pass remote_pod=pod when you already have the V1Pod object from get_pods().

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Use isinstance(exc, TaskDeferred) instead of brittle string comparison

* Potential fix for pull request finding

  The new unit tests add several mock.MagicMock() instances (pods, jobs, TI, etc.) without spec/autospec, and some patch() usages also create non-spec'd mocks by default. Using autospec=True on patches and create_autospec(...)/MagicMock(spec=...) for key Kubernetes objects helps catch typos/attribute mismatches in these tests and aligns with Airflow’s test hardening guidance.

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Address PR review comments: fix trigger pod_names, on_kill logging, and test assertions

  - triggers/job.py: Always include pod_names/pod_namespace in trigger event regardless of get_logs setting, so execute_complete() can reliably clean up monitoring pods even when get_logs=False
  - operators/job.py: Log unexpected ApiException in on_kill() instead of suppressing all ApiExceptions; remove unused `suppress` import
  - tests/test_job.py: Rewrite test_execute_respects_keep_pod and test_execute_deletes_pod_default to keep process_pod_deletion real and assert on pod_manager.delete_pod; stub hook.get_pod for remote_pod resolution
  - tests/test_job.py: Add regression test for get_logs=False deferrable path

* Fix orphaned test_on_kill_deletes_monitoring_pods method body after accidental deletion of method signature

* Make pod resolution best-effort in execute_complete

* Address remaining KubernetesJobOperator review comments

* Finalize review-comment fixes for KubernetesJobOperator

* Fix remaining KubernetesJobOperator review comments

* Update KubernetesJobOperator docs for action semantics

* Improve KubernetesJobOperator newsfragment readability

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ville Jyrkkä <vjyrkka@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI added 2 commits May 22, 2026 11:10

Address code review feedback: remove dead PodNotFoundException check, drop unused import, relax pod-deletion ordering in test, fix trailing comma

9a4e8a2

Copilot AI assigned Copilot and jykae May 22, 2026

Copilot created this pull request from a session on behalf of jykae May 22, 2026 11:17

jykae requested a review from Copilot May 22, 2026 11:18

Copilot started reviewing on behalf of jykae May 22, 2026 11:18

jykae marked this pull request as ready for review May 22, 2026 11:22

Copilot AI reviewed May 22, 2026

Copilot started work on behalf of jykae May 22, 2026 11:26

Use isinstance(exc, TaskDeferred) instead of brittle string comparison

dadd9fd

Copilot finished work on behalf of jykae May 22, 2026 11:31

Copilot AI requested a review from jykae May 22, 2026 11:31

jykae requested review from Copilot May 22, 2026 11:33

Copilot started reviewing on behalf of jykae May 22, 2026 11:33

Copilot AI reviewed May 22, 2026

Copilot started work on behalf of jykae May 22, 2026 11:39

Copilot AI added 2 commits May 22, 2026 11:47

Fix orphaned test_on_kill_deletes_monitoring_pods method body after accidental deletion of method signature

a480cf9

Copilot finished work on behalf of jykae May 22, 2026 11:50

jykae requested review from Copilot and removed request for Copilot May 22, 2026 11:50

Copilot started reviewing on behalf of jykae May 22, 2026 11:51

Copilot AI reviewed May 22, 2026

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py

Copilot started work on behalf of jykae May 22, 2026 11:55

Make pod resolution best-effort in execute_complete

205c9d6

Copilot finished work on behalf of jykae May 22, 2026 11:59

jykae requested a review from Copilot May 22, 2026 11:59

Copilot started reviewing on behalf of jykae May 22, 2026 11:59

Copilot AI reviewed May 22, 2026

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py

Comment thread providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py Outdated

Copilot started work on behalf of jykae May 22, 2026 12:11

Copilot AI added 2 commits May 22, 2026 12:15

Address remaining KubernetesJobOperator review comments

fcc5201

Finalize review-comment fixes for KubernetesJobOperator

5fcc46a

Copilot finished work on behalf of jykae May 22, 2026 12:18

jykae requested a review from Copilot May 22, 2026 12:19

Copilot started reviewing on behalf of jykae May 22, 2026 12:19

Copilot AI reviewed May 22, 2026

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py

Comment thread providers/cncf/kubernetes/docs/operators.rst Outdated

Comment thread ...ers/cncf/kubernetes/newsfragments/kubernetes-job-operator-cleanup-monitoring-pods.bugfix.rst Outdated

Copilot started work on behalf of jykae May 22, 2026 12:29

Fix remaining KubernetesJobOperator review comments

ca8e811

Copilot finished work on behalf of jykae May 22, 2026 12:34

jykae requested a review from Copilot May 22, 2026 12:35

Copilot started reviewing on behalf of jykae May 22, 2026 12:35

Copilot AI reviewed May 22, 2026

Copilot started work on behalf of jykae May 22, 2026 12:43

Copilot AI added 2 commits May 22, 2026 12:45

Update KubernetesJobOperator docs for action semantics

7c93a66

Improve KubernetesJobOperator newsfragment readability

f6e9e18

jykae approved these changes May 22, 2026

Copilot finished work on behalf of jykae May 22, 2026 12:47

Copilot AI requested a review from jykae May 22, 2026 12:47

jykae merged commit 7d3f581 into main May 22, 2026

jykae deleted the copilot/fix-kubernetesjoboperator-pod-leak branch May 22, 2026 12:48

3 participants
