
PyTorchJob does not run #1856

Closed
hongbo-miao opened this issue Jul 10, 2023 · 9 comments
@hongbo-miao

hongbo-miao commented Jul 10, 2023

I deployed Kubeflow (including Kubeflow Training operator) in a local Kubernetes by

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

Then I deployed a training job by

kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

It gets stuck there forever:

➜ kubectl get pytorchjobs --namespace=kubeflow
NAME             STATE   AGE
pytorch-simple           30m

Any ideas? Thanks! 😃

@johnugeorge
Member

Can you check the events? Is there anything in the controller logs? Can you run
kubectl describe pytorchjobs --namespace=kubeflow
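
For example, something along these lines (the training-operator deployment name and the training.kubeflow.org/job-name label are assumptions based on the standalone manifests):

# recent events in the namespace
kubectl get events --namespace=kubeflow --sort-by=.lastTimestamp
# pods the operator should have created for the job (label name is an assumption)
kubectl get pods --namespace=kubeflow --selector=training.kubeflow.org/job-name=pytorch-simple
# controller logs (assuming the standalone deployment name)
kubectl logs --namespace=kubeflow deployment/training-operator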

@hongbo-miao
Author

hongbo-miao commented Jul 10, 2023

Thanks @johnugeorge !

I assume you mean the workflow controller pod log(?). I recreated the training job, and there is nothing helpful in this controller pod's log:

time="2023-07-10T16:51:03.999Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:04.003Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:09.011Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:09.018Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:12.781Z" level=info msg="List workflows 200"
time="2023-07-10T16:51:12.781Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=kubeflow
time="2023-07-10T16:51:14.023Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:14.026Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:19.033Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:19.038Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:24.044Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:24.050Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:29.056Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:29.061Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:30.401Z" level=info msg="Watch configmaps 200"
time="2023-07-10T16:51:31.425Z" level=info msg="Watch workflowtemplates 200"
time="2023-07-10T16:51:34.066Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:34.072Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:39.077Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:39.081Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:44.086Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:44.091Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:49.096Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:49.100Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:54.104Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:54.108Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:58.424Z" level=info msg="Watch workflowtaskresults 200"
time="2023-07-10T16:51:59.114Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:59.119Z" level=info msg="Update leases 200"
time="2023-07-10T16:52:04.125Z" level=info msg="Get leases 200"
time="2023-07-10T16:52:04.130Z" level=info msg="Update leases 200"

And here is the result of kubectl describe pytorchjobs --namespace=kubeflow:

Name:         pytorch-simple
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-07-10T05:54:56Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2023-07-10T05:54:56Z
  Resource Version:  6951698
  UID:               12dc5c33-f248-4b0a-81b6-aaa640f331f9
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
    Worker:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
Events:                         <none>

@johnugeorge
Member

I meant the training operator pod logs.

@hongbo-miao
Author

hongbo-miao commented Jul 10, 2023

Hi @johnugeorge

I am using the default one from Kubeflow based on this doc:

By default, PyTorch Operator will be deployed as a controller in training operator.

I verified that I have pytorchjobs.kubeflow.org (it is only there because I installed the standalone version before and then deleted the pod):

➜ kubectl get crd
NAME                                                   CREATED AT
...
pytorchjobs.kubeflow.org                               2023-07-10T03:55:47Z
tfjobs.kubeflow.org                                    2023-07-10T03:55:47Z
xgboostjobs.kubeflow.org                               2023-07-10T03:55:48Z

I know that if I used the standalone training operator, I would have a pod called something like training-operator.
Hmm, given that I am not using the standalone training operator, I just wonder which pod's log I should print out, thanks!

➜ kubectl get pods -n kubeflow
NAME                                               READY   STATUS    RESTARTS      AGE
metadata-writer-79d569c46f-km7nh                   1/1     Running   0             17h
metadata-envoy-deployment-59687d9798-f2bxl         1/1     Running   0             17h
ml-pipeline-persistenceagent-84f946b944-zcs5d      1/1     Running   0             17h
ml-pipeline-scheduledworkflow-54d88874b-mcd49      1/1     Running   0             17h
ml-pipeline-viewer-crd-75c6d588df-pwd4c            1/1     Running   0             17h
cache-deployer-deployment-779655b9f7-gr9z5         1/1     Running   0             17h
workflow-controller-5f6fdf89d7-pcg2z               1/1     Running   0             17h
ml-pipeline-ui-679784dfd6-c4r4h                    1/1     Running   0             17h
minio-549846c488-pb6q6                             1/1     Running   0             17h
ml-pipeline-visualizationserver-7f8f7fdbdc-w6w6k   1/1     Running   0             17h
mysql-5f968d4688-mtqr4                             1/1     Running   0             17h
cache-server-55c88c76c5-p9hpx                      1/1     Running   0             17h
metadata-grpc-deployment-6d744c66bb-k9w92          1/1     Running   2 (17h ago)   17h
ml-pipeline-867f66dc54-sfc2f                       1/1     Running   1 (17h ago)   17h

@johnugeorge
Member

There should be a training operator pod when you install Kubeflow. I see that Pipelines is the only component that is installed.
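
One quick way to confirm (the deployment name training-operator is what the standalone manifests use, so treat it as an assumption):

kubectl get deployment training-operator --namespace=kubeflow
kubectl get pods --namespace=kubeflow | grep training-operator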

@hongbo-miao
Author

hongbo-miao commented Jul 11, 2023

Sorry, I guess

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=${PIPELINE_VERSION}"

does not install the training operator, right?

I was originally confused by this sentence at https://www.kubeflow.org/docs/components/training/pytorch/#installing-pytorch-operator

I thought that installing Kubeflow Pipelines also installs the training operator, which it does not:

[screenshot]

I guess that after installing Kubeflow Pipelines, I have to install the training operator separately. Please correct me if I am wrong. I have another question at #1855 regarding how the versions match.

Anyway, I will try Kubeflow Pipelines 2.0 and Kubeflow Training Operator 1.6 to see if they work together, and report the results. Thanks!
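
If I am reading the training-operator README correctly, the standalone install should be something along these lines (the v1.6.0 ref is my assumption):

kubectl apply --kustomize="github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"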

@hongbo-miao
Author

hongbo-miao commented Jul 14, 2023

Thanks @johnugeorge !

I finally succeeded in deploying Kubeflow Training Operator 1.6 based on #1841 (comment)

Here is my script:

# Install Kubeflow Pipelines
export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

# Install Kubeflow Training Operator
# Steps are at https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868

# Create a PyTorch training job
kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

This demo PyTorch training job finished successfully:

[screenshot]
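
From the CLI, something like the following should also confirm it (assuming the operator names the master pod pytorch-simple-master-0):

kubectl get pytorchjobs --namespace=kubeflow
kubectl logs pytorch-simple-master-0 --namespace=kubeflow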

However, the job is not listed in my Kubeflow Pipelines UI:

image

I feel that this Kubeflow Training Operator is not connected to my Kubeflow Pipelines correctly. Any ideas? Thanks!

Also, I just want to confirm: "Kubeflow Pipelines" does not include the "Kubeflow Training Operator", right? And they are supposed to be deployed individually?

@johnugeorge
Member

No. Kubeflow Pipelines is an ML workflow orchestrator; it is up to you to decide the workflow graph. If you want to see a training job inside the Pipelines UI, you have to trigger the job within a pipeline experiment.

@hongbo-miao
Author

I see, thank you so much, @johnugeorge !

Demo machine learning code is at https://github.com/Hongbo-Miao/hongbomiao.com/pull/9807/files,
and I can see it start training and show up in the UI 😃
[screenshot]
