New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TfJob operator stops working on invalid spec #561

Closed
jlewi opened this Issue Apr 26, 2018 · 10 comments

Comments

Projects
None yet
4 participants
@jlewi
Copy link
Collaborator

jlewi commented Apr 26, 2018

I submitted a job with an invalid spec (container args contained integrs and not strings). The job was created but it was never started and the status was never updated. Furthermore, I think this blocked the TFJob operator from processing any other jobs. Deleting the job fixed things.

The TFJob operator showed the following logs.

E0422 01:21:04.083809       1 reflector.go:205] github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/factory.go:59: Failed to list *v1alpha1.TFJob: v1alpha1.TFJobList: Items: []v1alpha1.TFJob: v1alpha1.TFJob: Spec: v1alpha1.TFJobSpec: ReplicaSpecs: []*v1alpha1.TFReplicaSpec: v1alpha1.TFReplicaSpec: Template: v1.PodTemplateSpec: Spec: v1.PodSpec: Containers: []v1.Container: v1.Container: Env: []v1.EnvVar: v1.EnvVar: v1.EnvVar: Value: ReadString: expects " or n, parsing 1996 ...,"value":1... at {"apiVersion":"kubeflow.org/v1alpha1","items":[{"apiVersion":"kubeflow.org/v1alpha1","kind":"TF

I believe what's happening is that since the spec is invalid the result of List can't be successfully parsed into a Go struct. As a result, I think the TFJob operator is unable to work.

I think this is a problem in the underlying informer package; i.e. its not robust to invalid specs. We should check if this is a known issue and if there is an existing bug. (Ideally, it would just ignore invalid specs).

I think we could fix this a number of ways in TFJob controller

  • If we use CRD's spec validation feature and provide a swagger spec I think we could prevent invalid specs from being accepted in the first place

  • The operator could try to catch this error and then find and update the invalid spec

    • Deleting the job might be confusing for users but maybe we could leave the job there but delete most fields and set status to "invalid spec"?

Swagger is probably the best place to start.

We should try to get this fixed in 0.2

@jlewi

This comment has been minimized.

Copy link
Collaborator Author

jlewi commented Apr 30, 2018

See also #437 - OpenAPI Validation for TFJob controller

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented May 25, 2018

It is hard to validate the config using the feature crd validation since we have podtemplatespec in the definition. The feature does not support $ref, podtemplatespec heavily uses $ref. If we want to check it, we need to get the OpenAPI schema and have a hack on it, which I think is not what we want.

@jlewi

This comment has been minimized.

Copy link
Collaborator Author

jlewi commented May 25, 2018

Can you explain more about $ref and how it is used? Would it be possible to just use OpenAPI validation to ensure that container args are strings not integers?

I guess another solution might be to use an admission controller to validate the spec.

@enisoc Any suggestions on how to handle this?

@jessesuen I think you faced a similar problem with the Argo CRD what did you do?

@enisoc

This comment has been minimized.

Copy link

enisoc commented May 25, 2018

If you just want to validate container args, and not everything in PodTemplateSpec, then it may be feasible to write an OpenAPI schema by hand for that.

If you want to validate the whole PodTemplateSpec, the best workaround I've heard of so far is this one (although I haven't tried it personally):

kubernetes/kubernetes#54579 (comment)

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented May 26, 2018

Thanks, I will take a look. We want to validate the whole podtemplatespec 😄

@jessesuen

This comment has been minimized.

Copy link

jessesuen commented May 28, 2018

@jessesuen I think you faced a similar problem with the Argo CRD what did you do?

@jlewi coincidentally, I literally just "fixed" this in the workflow controller. But it's more to workaround the upstream kubernetes issue: kubernetes/kubernetes#57705.

I followed the recommendation in this comment:
kubernetes/kubernetes#57705 (comment)

Instead of using the auto-generated workflow informer, I wrote a unstructured.Unstructured informer which watches Unstructured objects of the argoproj.io/v1alpha1 workflow API resource type. Then I convert unstructured object to workflow objects using runtime.DefaultUnstructuredConverter.FromUnstructured. If this fails to convert, I now can update the workflow with a proper error message indicating the spec was invalid.

The fix can be seen here:
argoproj/argo@8d96ea7

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented May 28, 2018

Thanks for your reply @jessesuen

@jlewi I wrote a tool to generate the validation from the OpenAPI specification: https://github.com/gaocegege/crd-validation. Generated CRD for tfjob v1alpha2 is https://github.com/gaocegege/crd-validation/blob/master/generated/tfjob-crd-v1alpha2.yaml

While we meet an issue from Kubernetes side: kubernetes/kubernetes#59485 (comment). Kubernetes does not support addtionalproperties while it is needed for map type. Unfortunately, TFRelicaSpec is a map.

The upstream said it will be implemented in 1.11. Maybe we should wait for it. After 1.11 I think we are able to use the tool above to generate validations for all crds in kubeflow community.

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented May 29, 2018

At this moment validating all types in the CRD is not practical, and the crd validation feature has some limitations, such as lack of addtional properties support. I think the workaround from jessesuen is a good way to solve the problem.

@jlewi

This comment has been minimized.

Copy link
Collaborator Author

jlewi commented May 29, 2018

@gaocegege I like the idea of following @jessesuen's work around and then implementing what validation we can with CRD validation.

@gaocegege

This comment has been minimized.

Copy link
Member

gaocegege commented May 29, 2018

Yeah, I am working on it. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment