Description
Since we use custom CRDs (Volcano) rather than typed Kubernetes constructs, this validation does not happen until the job is picked up by the scheduler.
For example, https://kubernetes.io/docs/concepts/workloads/controllers/job clearly states:

> When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the basis for naming those Pods. The name of a Job must be a valid [DNS subdomain](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names) value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a [DNS label](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names). Even when the name is a DNS subdomain, the name must be no longer than 63 characters.
Yet we are allowed to submit such a name freely, only to see a silent failure later:

```
Execute plugin when job add failed, err: Service "blablablabla..." is invalid: [metadata.name: Invalid value: "blablablabla...": must be no more than 63 characters, spec.selector: Invalid value: "blablablabla...": must be no more than 63 characters]
```
Motivation/Background
If we know the resource violates the spec, we should fail fast.
Detailed Proposal
Add a flag, e.g. validate_spec (defaulting to False for now), to the Kubernetes scheduler that would call create_namespaced_custom_object prior to the actual job submission; see the sketch below.
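A minimal sketch of what this could look like, assuming the official kubernetes Python client, Volcano's batch.volcano.sh/v1alpha1 Job CRD, and a client version that exposes the server-side dry_run parameter; the function and parameter names here are illustrative, not existing TorchX API:

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

VOLCANO = dict(group="batch.volcano.sh", version="v1alpha1", plural="jobs")


def create_job(
    api: client.CustomObjectsApi,
    namespace: str,
    job: dict,
    validate_spec: bool = False,
) -> None:
    if validate_spec:
        try:
            # dry_run="All" asks the API server to run its full validation
            # chain (names, selectors, etc.) without persisting anything.
            api.create_namespaced_custom_object(
                namespace=namespace, body=job, dry_run="All", **VOLCANO
            )
        except ApiException as e:
            # Fail fast with the server's validation message instead of
            # letting the job die silently inside the scheduler later.
            raise ValueError(f"job spec failed validation: {e.body}") from e

    api.create_namespaced_custom_object(namespace=namespace, body=job, **VOLCANO)
```

The dry-run call surfaces the same errors shown above at submission time, so nothing beyond the extra round trip changes when the flag is off.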
Alternatives
- Do manual validation using code in TorchX itself (see the sketch after this list), but that is brittle.
- Keep the status quo, which leads to an unpleasantly surprising customer experience.
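For reference, a rough sketch of the manual-validation alternative; the regex and the 63-character limit below mirror Kubernetes' RFC 1123 DNS-label rule, but having to keep such copies in sync with upstream is exactly what makes this approach brittle:

```python
import re

# Mirrors Kubernetes' DNS-1123 label validation: lowercase alphanumerics
# and '-', starting and ending with an alphanumeric, at most 63 characters.
DNS_LABEL_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
DNS_LABEL_MAX_LEN = 63


def validate_dns_label(name: str) -> None:
    if len(name) > DNS_LABEL_MAX_LEN:
        raise ValueError(
            f"{name!r} must be no more than {DNS_LABEL_MAX_LEN} characters"
        )
    if not DNS_LABEL_RE.match(name):
        raise ValueError(
            f"{name!r} must consist of lowercase alphanumeric characters or '-', "
            "and must start and end with an alphanumeric character"
        )
```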
Additional context/links