Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*job API(master) cannot compatible with old job #1725

Closed
nkflash opened this issue Jan 16, 2023 · 14 comments
Closed

*job API(master) cannot compatible with old job #1725

nkflash opened this issue Jan 16, 2023 · 14 comments

Comments

@nkflash
Copy link

nkflash commented Jan 16, 2023

image

training-operator master import Kubeflow common v0.4.5 which change basic ReplicaStatus.LabelSelector to string.

image

This change cause old job cannot be recognized by new training-operator.

E0116 09:19:36.195020       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/reflector.go:169: Failed to watch *v1.PyTorchJob: failed to list *v1.PyTorchJob: json: cannot unmarshal object into Go struct field ReplicaStatus.items.status.replicaStatuses.labelSelector of type string
W0116 09:20:00.172838       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/reflector.go:169: failed to list *v1.PyTorchJob: json: cannot unmarshal object into Go struct field ReplicaStatus.items.status.replicaStatuses.labelSelector of type string
E0116 09:20:00.172897       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/reflector.go:169: Failed to watch *v1.PyTorchJob: failed to list *v1.PyTorchJob: json: cannot unmarshal object into Go struct field ReplicaStatus.items.status.replicaStatuses.labelSelector of type string
W0116 09:20:37.963944       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/reflector.go:169: failed to list *v1.PyTorchJob: json: cannot unmarshal object into Go struct field ReplicaStatus.items.status.replicaStatuses.labelSelector of type string
E0116 09:20:37.963996       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/reflector.go:169: Failed to watch *v1.PyTorchJob: failed to list *v1.PyTorchJob: json: cannot unmarshal object into Go struct field ReplicaStatus.items.status.replicaStatuses.labelSelector of type string
{"level":"error","ts":1673860878.8079154,"msg":"Could not wait for Cache to sync","controller":"mpijob-controller","error":"failed to wait for mpijob-controller caches to sync: timed out waiting for cache to be synced"}
{"level":"error","ts":1673860878.8079553,"msg":"Could not wait for Cache to sync","controller":"tfjob-controller","error":"failed to wait for tfjob-controller caches to sync: timed out waiting for cache to be synced"}
{"level":"error","ts":1673860878.807957,"msg":"Could not wait for Cache to sync","controller":"mxjob-controller","error":"failed to wait for mxjob-controller caches to sync: timed out waiting for cache to be synced"}

training-operator can not start up If there are old jobs in system.

@johnugeorge
Copy link
Member

Yes. This is a breaking change which will be in next release. The current label selector field can't work for HPA in K8s and hence change was made. In the coming release 1.7, we have breaking changes for both operator and SDK

Ref: kubeflow/common#197
#1719

@tenzen-y @terrytangyuan @zw0610 @andreyvelich Any thoughts?

@andreyvelich
Copy link
Member

andreyvelich commented Jan 16, 2023

That's right.
@nkflash Since master branch might be unstable, any concerns of using release version of operator (e.g. v1.5-branch )?

@tenzen-y
Copy link
Member

@johnugeorge IMO, it might be better to add a new member like selector to ReplicaStatus instead of replacing the type of labelSelector with string. And we mark labelSelector as Deprecated.

And we inform users training-operator does not support HPA in v1.5.0.

This means we provide API for a while in the following:

type ReplicaStatus struct {
...
        // Deprecated: LabelSelector does not work fine.
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	Selector string `json:"selector,omitempty"`
}
...

WDYT?

@nkflash
Copy link
Author

nkflash commented Jan 17, 2023

@johnugeorge IMO, it might be better to add a new member like selector to ReplicaStatus instead of replacing the type of labelSelector with string. And we mark labelSelector as Deprecated.

And we inform users training-operator does not support HPA in v1.5.0.

This means we provide API for a while in the following:

type ReplicaStatus struct {
...
        // Deprecated: LabelSelector does not work fine.
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	Selector string `json:"selector,omitempty"`
}
...

WDYT?

I agree with @tenzen-y. Or there is another solution, training-operator may increase the api version? @andreyvelich @johnugeorge

@nkflash
Copy link
Author

nkflash commented Jan 17, 2023

That's right.
@nkflash Since master branch might be unstable, any concerns of using release version of operator (e.g. v1.5-branch )?

I try to use paddle operator recently(only in master branch), so I caught this issue.
As I mentation, in other comments, training-operator need increase api version, so client could handle this.

@tenzen-y
Copy link
Member

tenzen-y commented Jan 17, 2023

@johnugeorge IMO, it might be better to add a new member like selector to ReplicaStatus instead of replacing the type of labelSelector with string. And we mark labelSelector as Deprecated.
And we inform users training-operator does not support HPA in v1.5.0.
This means we provide API for a while in the following:

type ReplicaStatus struct {
...
        // Deprecated: LabelSelector does not work fine.
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	Selector string `json:"selector,omitempty"`
}
...

WDYT?

I agree with @tenzen-y. Or there is another solution, training-operator may increase the api version? @andreyvelich @johnugeorge

I think we need to increase the API Version if we remove LabelSelector.
Also, we should not do breaking API changes since the training-operator provides a stable v1 API version, not a beta or alpha version.

@nkflash
Copy link
Author

nkflash commented Jan 17, 2023

@johnugeorge IMO, it might be better to add a new member like selector to ReplicaStatus instead of replacing the type of labelSelector with string. And we mark labelSelector as Deprecated.
And we inform users training-operator does not support HPA in v1.5.0.
This means we provide API for a while in the following:

type ReplicaStatus struct {
...
        // Deprecated: LabelSelector does not work fine.
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
	Selector string `json:"selector,omitempty"`
}
...

WDYT?

I agree with @tenzen-y. Or there is another solution, training-operator may increase the api version? @andreyvelich @johnugeorge

I think we need to increase the API Version if we remove LabelSelector. Also, we should not significantly change the API since the training-operator provides a stable v1 API version, not a beta or alpha version.

Agree!

@andreyvelich
Copy link
Member

andreyvelich commented Jan 17, 2023

@johnugeorge What do you think ?

@johnugeorge
Copy link
Member

I will deprecate the existing field and add a new Selector field. If we want to remove the deprecated field in the future, we can update to v2beta1 api as we are currently in stable version

@tenzen-y
Copy link
Member

@johnugeorge If you don't have enough time to work on this, I could work on this.

@johnugeorge
Copy link
Member

Thanks @tenzen-y. I have started on it. I will post if I cannot make it in a day

@tenzen-y
Copy link
Member

Thanks. Feel free to send a ping to me.

@tenzen-y
Copy link
Member

@nkflash This was fixed by #1724. If you have the same issue again, feel free to reopen this issue or create another issue.

/close

@google-oss-prow
Copy link

@tenzen-y: Closing this issue.

In response to this:

@nkflash This was fixed by #1724. If you have the same issue again, feel free to reopen this issue or create another issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants