TfJob operator stops working on invalid spec #561
I submitted a job with an invalid spec (container args contained integers and not strings). The job was created, but it was never started and its status was never updated. Furthermore, I think this blocked the TFJob operator from processing any other jobs. Deleting the job fixed things.
The TFJob operator showed the following logs.
I believe what's happening is that since the spec is invalid the result of List can't be successfully parsed into a Go struct. As a result, I think the TFJob operator is unable to work.
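To illustrate the suspected failure mode, here is a minimal sketch using only `encoding/json`. The `container`, `job`, and `jobList` types and the `decodeList` helper are hypothetical stand-ins, not the real Kubernetes or TFJob types, and the real client decodes through its own machinery. The point is only that decoding a whole list into typed structs returns an error as soon as one item carries an integer where a string is expected, so the caller gets no usable list at all.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container, job and jobList are simplified, hypothetical stand-ins for the
// real Kubernetes and TFJob types; only the fields needed here are kept.
type container struct {
	Args []string `json:"args"`
}

type job struct {
	Name       string      `json:"name"`
	Containers []container `json:"containers"`
}

type jobList struct {
	Items []job `json:"items"`
}

// decodeList decodes an entire list response into typed structs in one call.
// A type mismatch in any single item makes the whole call return an error.
func decodeList(data []byte) (*jobList, error) {
	var list jobList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	return &list, nil
}

func main() {
	// One valid job and one job whose args contain an integer.
	payload := []byte(`{"items":[
		{"name":"good","containers":[{"args":["--steps","100"]}]},
		{"name":"bad","containers":[{"args":["--steps",100]}]}
	]}`)

	if _, err := decodeList(payload); err != nil {
		// The valid job is lost too, because the list is rejected as a whole.
		fmt.Println("list failed:", err)
	}
}
```

If the informer behaves like this sketch, a single malformed object is enough to make every List call fail, which would match the operator no longer processing other jobs.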
I think this is a problem in the underlying informer package, i.e. it's not robust to invalid specs. We should check whether this is a known issue and whether there is an existing bug. (Ideally, it would just ignore invalid specs.)
I think we could fix this in a number of ways in the TFJob controller
Swagger is probably the best place to start.
We should try to get this fixed in 0.2
It is hard to validate the config using the CRD validation feature, since we have PodTemplateSpec in the definition. The feature does not support
Can you explain more about $ref and how it is used? Would it be possible to just use OpenAPI validation to ensure that container args are strings not integers?
I guess another solution might be to use an admission controller to validate the spec.
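As a rough sketch of the check such an admission controller (or the operator itself) could run, the code below decodes the submitted object as untyped JSON and rejects it if any `args` list contains a non-string element. The helper names `validateSpec` and `checkArgs` are hypothetical, and scanning for every key named `args` rather than following the exact TFJob field path is a simplifying assumption. The webhook plumbing itself is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkArgs walks a decoded JSON value and returns an error for the first
// "args" list it finds that contains a non-string element.
func checkArgs(v interface{}, path string) error {
	switch node := v.(type) {
	case map[string]interface{}:
		for key, child := range node {
			childPath := path + "." + key
			if key != "args" {
				if err := checkArgs(child, childPath); err != nil {
					return err
				}
				continue
			}
			if child == nil {
				continue // a null args field is treated as empty
			}
			items, ok := child.([]interface{})
			if !ok {
				return fmt.Errorf("%s: expected a list, got %T", childPath, child)
			}
			for i, item := range items {
				if _, isString := item.(string); !isString {
					return fmt.Errorf("%s[%d]: expected string, got %T", childPath, i, item)
				}
			}
		}
	case []interface{}:
		for i, child := range node {
			if err := checkArgs(child, fmt.Sprintf("%s[%d]", path, i)); err != nil {
				return err
			}
		}
	}
	return nil
}

// validateSpec decodes raw JSON without a typed struct, so a malformed
// object produces a descriptive error instead of a decode failure.
func validateSpec(raw []byte) error {
	var obj interface{}
	if err := json.Unmarshal(raw, &obj); err != nil {
		return fmt.Errorf("not valid JSON: %v", err)
	}
	return checkArgs(obj, "object")
}

func main() {
	bad := []byte(`{"spec":{"containers":[{"args":["--steps",100]}]}}`)
	if err := validateSpec(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Because the object is never forced into a typed struct, the check can report exactly which field is wrong, and the job can be rejected at creation time instead of breaking later List calls.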
@enisoc Any suggestions on how to handle this?
@jessesuen I think you faced a similar problem with the Argo CRD; what did you do?
If you just want to validate container args, and not everything in PodTemplateSpec, then it may be feasible to write an OpenAPI schema by hand for that.
If you want to validate the whole PodTemplateSpec, the best workaround I've heard of so far is this one (although I haven't tried it personally):
I followed the recommendation in this comment:
Instead of using the auto-generated workflow informer, I wrote a
The fix can be seen here:
Thanks for your reply @jessesuen
@jlewi I wrote a tool to generate the validation from the OpenAPI specification: https://github.com/gaocegege/crd-validation. Generated CRD for tfjob v1alpha2 is https://github.com/gaocegege/crd-validation/blob/master/generated/tfjob-crd-v1alpha2.yaml
However, we hit an issue on the Kubernetes side: kubernetes/kubernetes#59485 (comment). Kubernetes does not support additionalProperties, which is needed for map types. Unfortunately,
Upstream said it will be implemented in 1.11, so maybe we should wait for that. After 1.11, I think we will be able to use the tool above to generate validation for all CRDs in the Kubeflow community.