fix(deps): pin below kubernetes 36.0.0 (multiple client regressions)#511

Merged
andreyvelich merged 1 commit into
kubeflow:mainfrom
tariq-hasan:fix/pin-kubernetes-client-v36-regressions
May 28, 2026

Conversation

@tariq-hasan
Member

What this PR does / why we need it:

This is a follow-up to #507 that pins the Kubernetes Python client to <36.0.0.

The recent 36.0.1 release breaks the tests again because it does not fix the error with read_namespaced_pod_log.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings May 27, 2026 13:11
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Python dependency constraint for the Kubernetes client to avoid known regressions in version 36.0.0.

Changes:

  • Replaces an exclusion specifier (!=36.0.0) with an upper bound (<36.0.0) for kubernetes.
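
The difference between the two specifiers can be sketched in a few lines of Python. This is a simplified illustration that compares plain X.Y.Z release tuples; it is not how pip resolves versions (real tools follow PEP 440, including pre-release handling), and the helper names and sample versions are hypothetical choices made for the example.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release support)."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_exclusion(version: str) -> bool:
    """Mimics the old constraint: kubernetes != 36.0.0."""
    return parse_release(version) != (36, 0, 0)


def allowed_by_upper_bound(version: str) -> bool:
    """Mimics the new constraint: kubernetes < 36.0.0."""
    return parse_release(version) < (36, 0, 0)


# Print, for each sample version, whether each constraint would admit it.
for candidate in ("35.0.0", "36.0.0", "36.0.1"):
    print(candidate, allowed_by_exclusion(candidate), allowed_by_upper_bound(candidate))
```

The exclusion only blocks 36.0.0 itself, so 36.0.1 remains installable; the upper bound blocks the whole 36.x line and anything later.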

Comment thread pyproject.toml
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
@tariq-hasan tariq-hasan force-pushed the fix/pin-kubernetes-client-v36-regressions branch from 2defe53 to e0accd6 Compare May 27, 2026 13:14
Member

@andreyvelich andreyvelich left a comment

/lgtm
/approve

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Member

@tariq-hasan Could you check why the E2E tests are failing?

@tariq-hasan
Member Author

The E2E tests are failing due to a timeout, which is unrelated to the Kubernetes Python client error.

TimeoutError                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 TrainerClient().wait_for_job_status(name=job_id, timeout=20)

File /opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/kubeflow/trainer/api/trainer_client.py:261, in TrainerClient.wait_for_job_status(self, name, status, timeout, polling_interval, callbacks)
    233 def wait_for_job_status(
    234     self,
    235     name: str,
   (...)    239     callbacks: list[Callable[[types.TrainJob], None]] | None = None,
    240 ) -> types.TrainJob:
    241     """Wait for a TrainJob to reach a desired status.
    242 
    243     Args:
   (...)    259         TimeoutError: Timeout to wait for TrainJob status.
    260     """
--> 261     return self.backend.wait_for_job_status(
    262         name=name,
    263         status=status,
    264         timeout=timeout,
    265         polling_interval=polling_interval,
    266         callbacks=callbacks,
    267     )

File /opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/kubeflow/trainer/backends/kubernetes/backend.py:496, in KubernetesBackend.wait_for_job_status(self, name, status, timeout, polling_interval, callbacks)
    492         return trainjob
    494     time.sleep(polling_interval)
--> 496 raise TimeoutError(f"Timeout waiting for TrainJob {name} to reach status: {status} status")

TimeoutError: Timeout waiting for TrainJob uc2470bf1d5f to reach status: {'Complete'} status

The original error, ApiTypeError: Got an unexpected keyword argument 'watch' to method read_namespaced_pod_log, traced by @XploY04 (kubeflow/trainer#3556 (comment)), no longer appears in the logs.

@XploY04
Contributor

XploY04 commented May 27, 2026

@tariq-hasan
That makes sense. My earlier investigation was based on the Trainer PR 3556 notebook artifacts, where the failure definitely came from client.get_job_logs(..., follow=True) with kubernetes==36.0.1.

For this SDK PR, since the ApiTypeError no longer appears after pinning <36.0.0, that part seems fixed or bypassed. The remaining failure is a separate timeout in wait_for_job_status(..., timeout=20). So the next thing to debug is why the TrainJob is not reaching Complete within 20 seconds, not the Kubernetes Python client log-streaming issue.

@XploY04
Contributor

XploY04 commented May 27, 2026

The notebook-level exception is a timeout, but the pod logs show the TrainJob failed earlier while loading the SQuAD dataset.

The failing line is load_dataset("squad", split="train[:100]"), and the pod logs show: huggingface_hub.errors.HfUriError: Invalid HF URI 'hf://datasets/squad@.../.huggingface.yaml'. Repository id must be 'namespace/name', got 'squad'.

So the TrainJob never reaches Complete, and then the notebook reports:
Timeout waiting for TrainJob ... to reach status: {'Complete'} status
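
The way a job that never completes surfaces only as a timeout can be sketched with a generic polling loop. This is not the SDK's implementation: `wait_for_status`, `get_status`, and the "Running" status string are hypothetical, and the sketch assumes the job stays in a non-target state until the deadline passes.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    wanted: set[str],
    timeout: float,
    polling_interval: float = 0.01,
) -> str:
    """Poll get_status() until it returns a wanted status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_status()
        if current in wanted:
            return current
        time.sleep(polling_interval)
    raise TimeoutError(f"Timeout waiting for status: {wanted}")


# A job that is stuck (hypothetically, because its training script keeps
# crashing) never reports "Complete", so the caller sees only the timeout
# and has to read the pod logs to find the root cause.
try:
    wait_for_status(lambda: "Running", {"Complete"}, timeout=0.05)
except TimeoutError as exc:
    print(exc)
```

The loop has no knowledge of why the target status was not reached, which is why the notebook-level message alone does not point at the dataset loading error.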

The likely reason this started failing now is that the notebook installs unpinned Hugging Face packages:
packages_to_install=["datasets", "transformers[torch]", "cloudpathlib[all]"]
so CI can pick up newer Hugging Face behavior. The current HF URI parser expects a namespaced dataset id. The notebook itself links SQuAD as rajpurkar/squad, so I think the fix should be to change:
load_dataset("squad", split="train[:100]")
to:
load_dataset("rajpurkar/squad", split="train[:100]")
and rerun the notebook E2E.
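
The 'namespace/name' rule quoted in the error message can be illustrated with a small check. This is a hypothetical helper written for illustration, not the parser in huggingface_hub; it only tests whether an id has the two-part shape the error message asks for.

```python
def is_namespaced_repo_id(repo_id: str) -> bool:
    """Return True if repo_id looks like 'namespace/name' with both parts non-empty."""
    parts = repo_id.split("/")
    return len(parts) == 2 and all(parts)


# Compare the bare id from the failing call with the namespaced id from the proposed fix.
for repo_id in ("squad", "rajpurkar/squad"):
    print(repo_id, is_namespaced_repo_id(repo_id))
```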

@tariq-hasan
Member Author

@XploY04 Thanks for the detailed analysis.

It would be helpful if you could open a small PR against the Kubeflow Trainer notebook with the fix.

That would unblock the E2E tests here, since they read from the Trainer notebooks.

@andreyvelich
Member

Let's manually merge it

@andreyvelich andreyvelich merged commit 7202402 into kubeflow:main May 28, 2026
11 of 17 checks passed
@google-oss-prow google-oss-prow Bot added this to the v0.5 milestone May 28, 2026
@tariq-hasan tariq-hasan deleted the fix/pin-kubernetes-client-v36-regressions branch May 28, 2026 07:46