
Validate job spec prior to submission in kubernetes scheduler #1152

@clumsy

Description

Since we use custom resources (Volcano CRDs) rather than typed k8s constructs, this validation does not happen until the job is picked up by the scheduler.

For example, https://kubernetes.io/docs/concepts/workloads/controllers/job clearly states:

When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the basis for naming those Pods. The name of a Job must be a valid [DNS subdomain](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names) value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a [DNS label](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names). Even when the name is a DNS subdomain, the name must be no longer than 63 characters.

Yet we are allowed to submit such a job freely, only to see a silent failure:

Execute plugin when job add failed, err: Service "blablablabla..." is invalid: [metadata.name: Invalid value: "blablablabla...": must be no more than 63 characters, spec.selector: Invalid value: "blablablabla...": must be no more than 63 characters]

Motivation/Background

If we know the resource violates the spec, we should fail fast.

Detailed Proposal

Add a flag to the k8s scheduler, e.g. validate_spec (defaulting to False for now), that would call create_namespaced_custom_object prior to the actual job submission; see the sketch below.
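A minimal sketch of the idea, assuming the kubernetes Python client with server-side dry-run support; the validate_spec flag and the helper name below are illustrative, not existing TorchX API:

```python
# Sketch only: assumes the kubernetes Python client (with dry_run support) and a
# Volcano Job custom resource. validate_job_spec is an illustrative helper, not a
# TorchX API; the k8s scheduler would call it only when validate_spec=True.
from kubernetes import client, config


def validate_job_spec(job_resource: dict, namespace: str = "default") -> None:
    """Ask the API server to validate the resource without persisting it."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    try:
        api.create_namespaced_custom_object(
            group="batch.volcano.sh",
            version="v1alpha1",
            namespace=namespace,
            plural="jobs",
            body=job_resource,
            dry_run="All",  # server-side validation only; nothing is created
        )
    except client.ApiException as e:
        # Surface the same error the controller would otherwise log silently,
        # but at submission time, so the caller fails fast.
        raise ValueError(f"job spec failed validation: {e.body}") from e
```

Since the flag defaults to False, the default submission path stays unchanged.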

Alternatives

  1. Do manual validation using code in TorchX itself, but that is brittle (see the sketch after this list).
  2. Keep the status quo, which leads to an unpleasantly surprising customer experience.
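For completeness, a sketch of what option 1 could look like: re-implementing the RFC 1123 DNS label checks that the error above trips over. The constants and regex below mirror Kubernetes' own validation rules, and keeping such copies in sync with everything the API server enforces is what makes this brittle:

```python
import re

# RFC 1123 DNS label rules, as enforced by Kubernetes for names like metadata.name:
# at most 63 characters, lowercase alphanumerics and '-', starting and ending with
# an alphanumeric character.
DNS_LABEL_MAX_LEN = 63
DNS_LABEL_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def check_dns_label(name: str) -> None:
    """Raise if `name` is not a valid DNS label (a hand-rolled, partial check)."""
    if len(name) > DNS_LABEL_MAX_LEN:
        raise ValueError(f"{name!r} must be no more than {DNS_LABEL_MAX_LEN} characters")
    if not DNS_LABEL_RE.match(name):
        raise ValueError(
            f"{name!r} must consist of lowercase alphanumeric characters or '-', "
            "and must start and end with an alphanumeric character"
        )
```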

Additional context/links
