
Bug: allowing non-DNS-1035-compliant PyTorchJob names results in service creation failures and missing state #1745

Closed
MEllis-github opened this issue Jan 27, 2023 · 4 comments · Fixed by #1748
MEllis-github commented Jan 27, 2023

Overview

Instantiating a PyTorchJob whose PyTorchJob.metadata.name does not comply with RFC 1035 but is valid in every other way results in the following:

  • successful pytorchjob creation
  • missing pytorchjob state / nil status
  • successful pod creation
  • failed service creation (service names must be DNS-1035 compliant)

This behavior was first noticed in a production environment, where it caused distributed PyTorch jobs to stall: workers failed to rendezvous during initialization because of the service creation failure, and the job never transitioned state because its status was missing.

A greatly simplified reproduction case is provided below.
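For context, the test.yaml used in the transcript below was approximately the following. This is a hypothetical reconstruction from the `oc describe` output further down — the structure is the standard PyTorchJob schema, but the exact field values are assumptions:

```yaml
# Hypothetical reconstruction of test.yaml, based on the `oc describe` output
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: test   # changing this to "1test" triggers the reported bug
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: bash
              imagePullPolicy: IfNotPresent
              command:
                - bash
                - -c
                - echo "Container started!" && sleep 30 && echo "Bye now"
              resources:
                limits:
                  cpu: 1
                  memory: 1Gi
```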

Questions

Is the community aware of this problem?
Has a resolution been proposed?
What is the best way to follow or participate in a resolution?

Version

training-operator Release v1.5

Reproducing

Note: kubectl can replace oc in the following.

bash-3.2$ # Baseline: create a pytorchjob with a valid name (all should go well)
bash-3.2$ oc get pytorchjob
No resources found
bash-3.2$ oc create -f test.yaml
pytorchjob.kubeflow.org/test created
bash-3.2$ oc get pytorchjob
NAME   STATE     AGE
test   Running   8s
bash-3.2$ oc describe pytorchjob test
Name:         test
Namespace:    test-env
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-01-27T21:02:39Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:imagePullSecrets:
                f:volumes:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2023-01-27T21:02:39Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Master:
            .:
            f:labelSelector:
              .:
              f:matchLabels:
                .:
                f:group-name:
                f:job-name:
                f:training.kubeflow.org/job-name:
                f:training.kubeflow.org/operator-name:
                f:training.kubeflow.org/replica-type:
            f:succeeded:
        f:startTime:
    Manager:         manager
    Operation:       Update
    Subresource:     status
    Time:            2023-01-27T21:03:15Z
  Resource Version:  232228338
  UID:               ff185e74-3cff-417f-92e8-d8adb578fd1a
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Spec:
          Containers:
            Command:
              bash
              -c
              echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
#
# User commands
#
echo "Container started!" && sleep 30 && echo "Bye now"

            Env:
            Image:              bash
            Image Pull Policy:  IfNotPresent
            Name:               pytorch
            Resources:
              Limits:
                Cpu:             1
                Memory:          1Gi
                nvidia.com/gpu:  0
              Requests:
                Cpu:             1
                Memory:          1Gi
                nvidia.com/gpu:  0
            Volume Mounts:
          Image Pull Secrets:
          Volumes:
Status:
  Completion Time:  2023-01-27T21:03:15Z
  Conditions:
    Last Transition Time:  2023-01-27T21:02:39Z
    Last Update Time:      2023-01-27T21:02:39Z
    Message:               PyTorchJob test is created.
    Reason:                PyTorchJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2023-01-27T21:02:43Z
    Last Update Time:      2023-01-27T21:02:43Z
    Message:               PyTorchJob test is running.
    Reason:                JobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2023-01-27T21:03:15Z
    Last Update Time:      2023-01-27T21:03:15Z
    Message:               PyTorchJob test is successfully completed.
    Reason:                JobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Master:
      Label Selector:
        Match Labels:
          Group - Name:                         kubeflow.org
          Job - Name:                           test
          training.kubeflow.org/job-name:       test
          training.kubeflow.org/operator-name:  pytorchjob-controller
          training.kubeflow.org/replica-type:   Master
      Succeeded:                                1
  Start Time:                                   2023-01-27T21:02:39Z
Events:
  Type    Reason                   Age              From                   Message
  ----    ------                   ----             ----                   -------
  Normal  SuccessfulCreatePod      41s              pytorchjob-controller  Created pod: test-master-0
  Normal  SuccessfulCreateService  41s              pytorchjob-controller  Created service: test-master-0
  Normal  ExitedWithCode           5s (x2 over 7s)  pytorchjob-controller  Pod: test-env.test-master-0 exited with code 0
  Normal  JobSucceeded             5s               pytorchjob-controller  PyTorchJob test is successfully completed.
bash-3.2$ # All went well as expected
bash-3.2$ # Now prefix a number to the name to trigger the reported issue
bash-3.2$ vi test.yaml
bash-3.2$ oc create -f test.yaml
pytorchjob.kubeflow.org/1test created
bash-3.2$ oc get pytorchjob
NAME    STATE       AGE
1test               7s
test    Succeeded   2m19s
bash-3.2$ # No state on "1test" !
bash-3.2$ oc describe pytorchjob 1test
Name:         1test
Namespace:    test-env
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-01-27T21:04:51Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
                f:imagePullSecrets:
                f:volumes:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2023-01-27T21:04:51Z
  Resource Version:  232230735
  UID:               33d2d935-3da5-42d2-ba0a-3931f1a7928b
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Spec:
          Containers:
            Command:
              bash
              -c
              echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
#
# User commands
#
echo "Container started!" && sleep 30 && echo "Bye now"

            Env:
            Image:              bash
            Image Pull Policy:  IfNotPresent
            Name:               pytorch
            Resources:
              Limits:
                Cpu:             1
                Memory:          1Gi
                nvidia.com/gpu:  0
              Requests:
                Cpu:             1
                Memory:          1Gi
                nvidia.com/gpu:  0
            Volume Mounts:
          Image Pull Secrets:
          Volumes:
Events:
  Type     Reason               Age                 From                   Message
  ----     ------               ----                ----                   -------
  Normal   SuccessfulCreatePod  18s                 pytorchjob-controller  Created pod: 1test-master-0
  Warning  FailedCreateService  13s (x13 over 18s)  pytorchjob-controller  Error creating: Service "1test-master-0" is invalid: metadata.name: Invalid value: "1test-master-0": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name',  or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
bash-3.2$ # service creation failure!
bash-3.2$ # And nil status!

Note: in this greatly simplified test case the pod can still run to completion because, unlike typical PyTorchJob workloads, it does not depend on the service (it just runs echo and sleep) — but the reported behaviors are still observable.

@tenzen-y (Member)

@MEllis-github Thanks for reporting this!

The training-operator doesn't verify whether the CustomJob (e.g., PyTorchJob) name meets DNS-1035.

However, we may want to validate the CustomJob name so that the name meets DNS-1035. Or it might be better to convert the CustomJob name following DNS-1035 only when we operate (CRUD) the Service.

@kubeflow/wg-training-leads WDYT?
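The second option — converting the name only when operating on the Service — could look roughly like this. This is a hedged sketch, not actual training-operator code; the function name and normalization rules are assumptions:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Characters not permitted in a DNS-1035 label.
var invalidChars = regexp.MustCompile(`[^-a-z0-9]`)

// toDNS1035Label is an illustrative normalizer: it rewrites an arbitrary
// job name into a DNS-1035 label before it is used as a Service name.
func toDNS1035Label(name string) string {
	name = strings.ToLower(name)
	name = invalidChars.ReplaceAllString(name, "-")
	// DNS-1035 labels must start with a letter...
	if name == "" || name[0] < 'a' || name[0] > 'z' {
		name = "j" + name
	}
	// ...and end with an alphanumeric character.
	name = strings.TrimRight(name, "-")
	return name
}

func main() {
	fmt.Println(toDNS1035Label("1test"))    // j1test
	fmt.Println(toDNS1035Label("Test_Job")) // test-job
}
```

One drawback of silent conversion is that two distinct job names can map to the same Service name, and the Service name no longer matches the job name users see — which is presumably why validation at admission time is the alternative under discussion.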

zw0610 (Member) commented Jan 29, 2023

I would suggest letting each CustomJob validator decide whether to check that the job name meets DNS-1035, which means a CustomJob like MPIJob may skip this.

@tenzen-y (Member)

I would suggest letting each CustomJob validator decide whether to check that the job name meets DNS-1035, which means a CustomJob like MPIJob may skip this.

Makes sense.
I was thinking of adding validation to each validator.
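A minimal sketch of such a per-validator check, using the DNS-1035 regex quoted in the FailedCreateService event above. The function name and error text here are illustrative, not the actual training-operator code:

```go
package main

import (
	"fmt"
	"regexp"
)

// Regex quoted in the Service validation error above.
var dns1035 = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// validateJobName rejects, at admission time, names that would later
// make Service creation fail.
func validateJobName(name string) error {
	if !dns1035.MatchString(name) {
		return fmt.Errorf("metadata.name %q must be a valid DNS-1035 label "+
			"(lower case alphanumerics or '-', starting with a letter)", name)
	}
	return nil
}

func main() {
	fmt.Println(validateJobName("test"))  // passes
	fmt.Println(validateJobName("1test")) // rejected
}
```

Rejecting the job at creation time surfaces the problem immediately, instead of leaving the job with a nil status and a repeatedly failing Service as in the reproduction above.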

@tenzen-y (Member)

/assign
