
PyTorchJob does not run #1856

Closed
hongbo-miao opened this issue Jul 10, 2023 · 9 comments
@hongbo-miao

hongbo-miao commented Jul 10, 2023

I deployed Kubeflow (including Kubeflow Training operator) in a local Kubernetes by

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

Then I deployed a training job by

kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

It gets stuck there forever:

➜ kubectl get pytorchjobs --namespace=kubeflow
NAME             STATE   AGE
pytorch-simple           30m

Any ideas? Thanks! 😃

@johnugeorge
Member

Can you check the events? Is there anything in the controller logs? Can you run
kubectl describe pytorchjobs --namespace=kubeflow
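
For example, something along these lines (the training-operator deployment name and the training.kubeflow.org/job-name label are assumptions based on the standalone manifests):

# recent events in the namespace
kubectl get events --namespace=kubeflow --sort-by=.lastTimestamp
# pods the operator should have created for the job (label name is an assumption)
kubectl get pods --namespace=kubeflow --selector=training.kubeflow.org/job-name=pytorch-simple
# controller logs (assuming the standalone deployment name)
kubectl logs --namespace=kubeflow deployment/training-operator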

@hongbo-miao
Author

hongbo-miao commented Jul 10, 2023

Thanks @johnugeorge !

I assume you mean the workflow controller pod log(?). I recreated the training job, and there is nothing helpful in this controller pod's log:

time="2023-07-10T16:51:03.999Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:04.003Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:09.011Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:09.018Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:12.781Z" level=info msg="List workflows 200"
time="2023-07-10T16:51:12.781Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=kubeflow
time="2023-07-10T16:51:14.023Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:14.026Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:19.033Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:19.038Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:24.044Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:24.050Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:29.056Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:29.061Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:30.401Z" level=info msg="Watch configmaps 200"
time="2023-07-10T16:51:31.425Z" level=info msg="Watch workflowtemplates 200"
time="2023-07-10T16:51:34.066Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:34.072Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:39.077Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:39.081Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:44.086Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:44.091Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:49.096Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:49.100Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:54.104Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:54.108Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:58.424Z" level=info msg="Watch workflowtaskresults 200"
time="2023-07-10T16:51:59.114Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:59.119Z" level=info msg="Update leases 200"
time="2023-07-10T16:52:04.125Z" level=info msg="Get leases 200"
time="2023-07-10T16:52:04.130Z" level=info msg="Update leases 200"

And here is the result of kubectl describe pytorchjobs --namespace=kubeflow:

Name:         pytorch-simple
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-07-10T05:54:56Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2023-07-10T05:54:56Z
  Resource Version:  6951698
  UID:               12dc5c33-f248-4b0a-81b6-aaa640f331f9
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
    Worker:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
Events:                         <none>

@johnugeorge
Member

I meant the training operator pod logs.

@hongbo-miao
Author

hongbo-miao commented Jul 10, 2023

Hi @johnugeorge

I am using the default one from Kubeflow based on this doc:

By default, PyTorch Operator will be deployed as a controller in training operator.

I verified that I have pytorchjobs.kubeflow.org (it is only there because I installed the standalone version before and then deleted the pod):

➜ kubectl get crd
NAME                                                   CREATED AT
...
pytorchjobs.kubeflow.org                               2023-07-10T03:55:47Z
tfjobs.kubeflow.org                                    2023-07-10T03:55:47Z
xgboostjobs.kubeflow.org                               2023-07-10T03:55:48Z

I know that if I used the standalone training operator, I would have a pod called something like training-operator.
Hmm, given that I am not using the standalone training operator, I just wonder which pod's log I should print out, thanks!

➜ kubectl get pods -n kubeflow
NAME                                               READY   STATUS    RESTARTS      AGE
metadata-writer-79d569c46f-km7nh                   1/1     Running   0             17h
metadata-envoy-deployment-59687d9798-f2bxl         1/1     Running   0             17h
ml-pipeline-persistenceagent-84f946b944-zcs5d      1/1     Running   0             17h
ml-pipeline-scheduledworkflow-54d88874b-mcd49      1/1     Running   0             17h
ml-pipeline-viewer-crd-75c6d588df-pwd4c            1/1     Running   0             17h
cache-deployer-deployment-779655b9f7-gr9z5         1/1     Running   0             17h
workflow-controller-5f6fdf89d7-pcg2z               1/1     Running   0             17h
ml-pipeline-ui-679784dfd6-c4r4h                    1/1     Running   0             17h
minio-549846c488-pb6q6                             1/1     Running   0             17h
ml-pipeline-visualizationserver-7f8f7fdbdc-w6w6k   1/1     Running   0             17h
mysql-5f968d4688-mtqr4                             1/1     Running   0             17h
cache-server-55c88c76c5-p9hpx                      1/1     Running   0             17h
metadata-grpc-deployment-6d744c66bb-k9w92          1/1     Running   2 (17h ago)   17h
ml-pipeline-867f66dc54-sfc2f                       1/1     Running   1 (17h ago)   17h

@johnugeorge
Member

There should be a training operator pod when you install Kubeflow. I see that Pipelines is the only component that is installed.
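
One quick way to confirm (the deployment name training-operator is what the standalone manifests use, so treat it as an assumption):

kubectl get deployment training-operator --namespace=kubeflow
kubectl get pods --namespace=kubeflow | grep training-operator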

@hongbo-miao
Author

hongbo-miao commented Jul 11, 2023

Sorry, I guess

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=${PIPELINE_VERSION}"

does not install the training operator, right?

I was originally confused by this sentence at https://www.kubeflow.org/docs/components/training/pytorch/#installing-pytorch-operator

I thought that installing Kubeflow Pipelines also installs the training operator, which it does not:

[screenshot]

I guess that after installing Kubeflow Pipelines, I have to install the training operator separately. Please correct me if I am wrong. I have another question at #1855 regarding how the versions match.

Anyway, I will try Kubeflow Pipelines 2.0 and Kubeflow Training Operator 1.6 to see if they work together, and report the results. Thanks!
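
If I am reading the training-operator README correctly, the standalone install should be something along these lines (the v1.6.0 ref is my assumption):

kubectl apply --kustomize="github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"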

@hongbo-miao
Author

hongbo-miao commented Jul 14, 2023

Thanks @johnugeorge !

I finally succeeded in deploying Kubeflow Training Operator 1.6 based on #1841 (comment)

Here is my script:

# Install Kubeflow Pipelines
export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

# Install Kubeflow Training Operator
# Steps are at https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868

# Create a PyTorch training job
kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

This demo PyTorch training job finished successfully:

[screenshot]
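
From the CLI, something like the following should also confirm it (assuming the operator names the master pod pytorch-simple-master-0):

kubectl get pytorchjobs --namespace=kubeflow
kubectl logs pytorch-simple-master-0 --namespace=kubeflow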

However, the job is not listed in my Kubeflow Pipelines UI:

image

I feel that this Kubeflow Training Operator is not connected to my Kubeflow Pipelines correctly. Any ideas? Thanks!

Also, I just want to confirm: "Kubeflow Pipelines" does not include the "Kubeflow Training Operator", right? And they are supposed to be deployed individually?

@johnugeorge
Member

No. Kubeflow Pipelines is an ML workflow orchestrator; it is up to you to decide the workflow graph. If you want to see a training job inside the Pipelines UI, you have to trigger the job within a pipeline experiment.

@hongbo-miao
Author

I see, thank you so much, @johnugeorge !

Demo machine learning code is at https://github.com/Hongbo-Miao/hongbomiao.com/pull/9807/files,
and I can see it start training and show up in the UI 😃
[screenshot]
