TfJob operator stops working on invalid spec #561
Comments
jlewi added priority/p1 kind/bug labels Apr 26, 2018
gaocegege added api/v1alpha2 api/v1alpha1 labels Apr 27, 2018
See also #437 - OpenAPI Validation for TFJob controller
gaocegege added the area/operator label May 18, 2018
It is hard to validate the config using the CRD validation feature since we have PodTemplateSpec in the definition. The feature does not support
Can you explain more about $ref and how it is used? Would it be possible to just use OpenAPI validation to ensure that container args are strings, not integers? I guess another solution might be to use an admission controller to validate the spec. @enisoc Any suggestions on how to handle this? @jessesuen I think you faced a similar problem with the Argo CRD; what did you do?
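To illustrate the kind of check an admission controller could run, here is a minimal sketch in Go using only the standard library. It assumes a simplified, hypothetical object layout (a top-level `containers` list with `name` and `args`), not the real TFJob or PodTemplateSpec schema, and the function name `validateContainerArgs` is made up for this example. A real admission webhook would receive the object wrapped in an AdmissionReview request; only the validation step is shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validateContainerArgs inspects a raw JSON object without decoding it into
// typed structs, so a wrong type is reported instead of failing the decode.
// The "containers"/"args" layout is a simplified assumption, not the TFJob API.
func validateContainerArgs(raw []byte) ([]string, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err // not a JSON object at all
	}
	containers, _ := obj["containers"].([]interface{})
	var problems []string
	for i, c := range containers {
		cm, ok := c.(map[string]interface{})
		if !ok {
			problems = append(problems, fmt.Sprintf("containers[%d]: expected an object", i))
			continue
		}
		rawArgs, present := cm["args"]
		if !present || rawArgs == nil {
			continue // args are optional in this sketch
		}
		args, ok := rawArgs.([]interface{})
		if !ok {
			problems = append(problems, fmt.Sprintf("containers[%d].args: expected a list", i))
			continue
		}
		for j, a := range args {
			if _, isString := a.(string); !isString {
				problems = append(problems,
					fmt.Sprintf("containers[%d].args[%d]: expected string, got %T", i, j, a))
			}
		}
	}
	return problems, nil
}

func main() {
	spec := []byte(`{"containers":[{"name":"tf","args":["--steps",100]}]}`)
	problems, err := validateContainerArgs(spec)
	if err != nil {
		fmt.Println("not a valid JSON object:", err)
		return
	}
	for _, p := range problems {
		fmt.Println(p)
	}
}
```

Rejecting the object at admission time would keep the invalid spec out of the API server, so the operator would never have to decode it.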
enisoc commented May 25, 2018
If you just want to validate container args, and not everything in PodTemplateSpec, then it may be feasible to write an OpenAPI schema by hand for that. If you want to validate the whole PodTemplateSpec, the best workaround I've heard of so far is this one (although I haven't tried it personally):
Thanks, I will take a look. We want to validate the whole PodTemplateSpec.
jessesuen commented May 28, 2018
@jlewi coincidentally, I literally just "fixed" this in the workflow controller, though it's more of a workaround for the upstream Kubernetes issue: kubernetes/kubernetes#57705. I followed the recommendation in this comment: Instead of using the auto-generated workflow informer, I wrote a The fix can be seen here:
Thanks for your reply @jessesuen @jlewi. I wrote a tool to generate the validation from the OpenAPI specification: https://github.com/gaocegege/crd-validation. The generated CRD for tfjob v1alpha2 is at https://github.com/gaocegege/crd-validation/blob/master/generated/tfjob-crd-v1alpha2.yaml, but we hit an issue on the Kubernetes side: kubernetes/kubernetes#59485 (comment). Kubernetes does not support additionalProperties, which is needed for map types. Upstream said it will be implemented in 1.11, so maybe we should wait for it. After 1.11, I think we will be able to use the tool above to generate validation for all CRDs in the Kubeflow community.
At the moment, validating all types in the CRD is not practical, and the CRD validation feature has some limitations, such as the lack of additionalProperties support. I think the workaround from jessesuen is a good way to solve the problem.
@gaocegege I like the idea of following @jessesuen's workaround and then implementing what validation we can with CRD validation.
Yeah, I am working on it.
jlewi commented Apr 26, 2018
I submitted a job with an invalid spec (container args contained integers and not strings). The job was created but it was never started and the status was never updated. Furthermore, I think this blocked the TFJob operator from processing any other jobs. Deleting the job fixed things.
The TFJob operator showed the following logs.
I believe what's happening is that since the spec is invalid the result of List can't be successfully parsed into a Go struct. As a result, I think the TFJob operator is unable to work.
I think this is a problem in the underlying informer package; i.e. it's not robust to invalid specs. We should check if this is a known issue and if there is an existing bug. (Ideally, it would just ignore invalid specs.)
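The suspected failure mode can be reproduced in miniature with Go's standard `encoding/json`. This is only a sketch: the `Job`, `Container`, and `JobList` types are simplified stand-ins invented for illustration (the real TFJob embeds a full PodTemplateSpec), and it assumes the typed client decodes the whole List response in one step, as described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical stand-ins for the real API types.
type Container struct {
	Name string   `json:"name"`
	Args []string `json:"args"` // args must be strings
}

type Job struct {
	Name       string      `json:"name"`
	Containers []Container `json:"containers"`
}

type JobList struct {
	Items []Job `json:"items"`
}

// decodeList mimics a typed List call: the whole response is decoded at once,
// so a type error in any single item surfaces as an error for the entire list.
func decodeList(raw []byte) (*JobList, error) {
	var list JobList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	return &list, nil
}

// One valid job and one job whose args contain an integer.
const goodAndBad = `{"items":[
  {"name":"good","containers":[{"name":"tf","args":["--steps","100"]}]},
  {"name":"bad","containers":[{"name":"tf","args":["--steps",100]}]}
]}`

func main() {
	_, err := decodeList([]byte(goodAndBad))
	// The error is non-nil even though the first item is perfectly valid.
	fmt.Println("decode error:", err)
}
```

If the informer behaves like `decodeList` here, a single bad object would hide every other job in the list, which would match the observed symptom that deleting the invalid job unblocked the operator.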
I think we could fix this in a number of ways in the TFJob controller:
If we use CRD's spec validation feature and provide a swagger spec I think we could prevent invalid specs from being accepted in the first place
The operator could try to catch this error and then find and update the invalid spec
Swagger is probably the best place to start.
We should try to get this fixed in 0.2
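As a rough illustration of the second option (catching the error and locating the invalid spec), the sketch below decodes each list item separately so that one bad item does not fail the rest. It uses only the Go standard library; the `Job` and `Container` types and the `decodeTolerant` helper are hypothetical simplifications, not the operator's actual code or the client-go API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical stand-ins for the real API types.
type Container struct {
	Name string   `json:"name"`
	Args []string `json:"args"`
}

type Job struct {
	Name       string      `json:"name"`
	Containers []Container `json:"containers"`
}

// invalidItem records which list entry failed to decode and why.
type invalidItem struct {
	Index int
	Name  string // best effort; empty if the name itself cannot be read
	Err   error
}

// decodeTolerant keeps every item as raw JSON first, then decodes items one
// by one, collecting failures instead of aborting the whole list.
func decodeTolerant(raw []byte) ([]Job, []invalidItem, error) {
	var list struct {
		Items []json.RawMessage `json:"items"`
	}
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, nil, err // the envelope itself is malformed
	}
	var valid []Job
	var invalid []invalidItem
	for i, item := range list.Items {
		var job Job
		if err := json.Unmarshal(item, &job); err != nil {
			// Try to recover just the name so the bad object can be reported.
			var meta struct {
				Name string `json:"name"`
			}
			_ = json.Unmarshal(item, &meta)
			invalid = append(invalid, invalidItem{Index: i, Name: meta.Name, Err: err})
			continue
		}
		valid = append(valid, job)
	}
	return valid, invalid, nil
}

func main() {
	raw := []byte(`{"items":[
	  {"name":"good","containers":[{"name":"tf","args":["--steps","100"]}]},
	  {"name":"bad","containers":[{"name":"tf","args":["--steps",100]}]}
	]}`)
	valid, invalid, err := decodeTolerant(raw)
	if err != nil {
		fmt.Println("malformed list:", err)
		return
	}
	fmt.Println("valid jobs:", len(valid))
	for _, bad := range invalid {
		fmt.Printf("invalid job %q at index %d: %v\n", bad.Name, bad.Index, bad.Err)
	}
}
```

With the invalid items identified by name, the operator could then update their status to report the error while continuing to process the valid jobs.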