Cut beta release of tf-operator for 1.4 release #1385

Closed
Jeffwan opened this issue Aug 27, 2021 · 4 comments

Jeffwan commented Aug 27, 2021

We have noticed a few issues recently:

1. Gang scheduling issue with Volcano #1382
2. Installation issue due to the kustomize or kubectl version #1381
3. Testing on the existing SDK
4. E2E testing for the rest of the frameworks besides TF.

I will spend some time this weekend addressing some of the above issues before we cut the beta release.

Jeffwan commented Aug 29, 2021

SDK installation issue

1. Failed to create the job, but our e2e integration tests pass, so it is probably a problem with my environment.

tfjob_client = TFJobClient()
tfjob_client.create(tfjob, namespace=namespace)


    236             # model definition for request.
    237             obj_dict = {obj.attribute_map[attr]: getattr(obj, attr)
--> 238                         for attr, _ in six.iteritems(obj.openapi_types)
    239                         if getattr(obj, attr) is not None}
    240 

AttributeError: 'V1TFJob' object has no attribute 'openapi_types'
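
The traceback shows the serializer iterating over `obj.openapi_types`. One possible cause, not confirmed in this thread, is that the installed `kubernetes` client expects generated models to expose `openapi_types` while the installed SDK's `V1TFJob` was generated with the older `swagger_types` attribute. A minimal sketch for checking which attribute a model exposes follows; `LegacyModel`, `type_map_attribute`, and `to_dict` are hypothetical names used only for illustration.

```python
class LegacyModel:
    """Hypothetical stand-in for a model generated with `swagger_types`."""

    swagger_types = {"api_version": "str", "kind": "str"}
    attribute_map = {"api_version": "apiVersion", "kind": "kind"}

    def __init__(self, api_version=None, kind=None):
        self.api_version = api_version
        self.kind = kind


def type_map_attribute(obj):
    """Return the name of the type-map attribute the object exposes, or None."""
    for name in ("openapi_types", "swagger_types"):
        if hasattr(obj, name):
            return name
    return None


def to_dict(obj):
    """Serialize like the loop in the traceback, using whichever type map exists."""
    name = type_map_attribute(obj)
    if name is None:
        raise AttributeError("object exposes neither openapi_types nor swagger_types")
    return {
        obj.attribute_map[attr]: getattr(obj, attr)
        for attr in getattr(obj, name)
        if getattr(obj, attr) is not None
    }


if __name__ == "__main__":
    job = LegacyModel(api_version="kubeflow.org/v1", kind="TFJob")
    print(type_map_attribute(job))
    print(to_dict(job))
```

If `type_map_attribute(tfjob)` reports `swagger_types` for the real `V1TFJob` object, a version mismatch between the SDK and the `kubernetes` package would be worth checking.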

2. Installation issue

➜  python git:(sdk) ✗ python3 setup.py install --user
running install
error: can't combine user with prefix, exec_prefix/home, or install_(plat)base

It works after adding `--prefix=`:

➜  python git:(sdk) ✗ python3 setup.py install --user --prefix=
running install
running bdist_egg
.....
Using /usr/local/lib/python3.9/site-packages
Finished processing dependencies for kubeflow-tfjob==0.1.4
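
To confirm that the install landed in the interpreter actually being used, a small check like the one below may help. It is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`), takes the distribution name `kubeflow-tfjob` from the install log above, and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name taken from the install log; prints None if not installed.
    print(installed_version("kubeflow-tfjob"))
```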

Jeffwan commented Aug 30, 2021

I manually tested all 4 frameworks, and all of them work well.

apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: test
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"

MXNet and XGBoost were tested with the example manifests in the repository:

https://github.com/kubeflow/tf-operator/blob/master/examples/mxnet/mxjob_dist_v1.yaml
https://github.com/kubeflow/tf-operator/blob/master/examples/xgboost/xgboostjob.yaml
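
Manual checks like these could also be scripted. The sketch below builds the TFJob manifest shown above as a plain dict and submits it through the generic custom-objects API of the `kubernetes` Python client. It is an illustration under assumptions, not the project's e2e harness: `build_tfjob` and `submit` are hypothetical helpers, `submit` needs the `kubernetes` package and a reachable cluster, and the plural `tfjobs` is assumed to match the installed CRD.

```python
def build_tfjob(name, namespace, image, command, workers=2):
    """Build a TFJob manifest equivalent to the YAML above as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": image,
                                    "command": list(command),
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


def submit(manifest, plural="tfjobs"):
    """Create the custom resource; requires `kubernetes` and cluster access."""
    # Imported lazily so build_tfjob works without the client installed.
    from kubernetes import client, config

    config.load_kube_config()
    group, version = manifest["apiVersion"].split("/")
    return client.CustomObjectsApi().create_namespaced_custom_object(
        group=group,
        version=version,
        namespace=manifest["metadata"]["namespace"],
        plural=plural,
        body=manifest,
    )


if __name__ == "__main__":
    job = build_tfjob(
        name="test",
        namespace="kubeflow",
        image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
        command=["python", "/var/tf_mnist/mnist_with_summaries.py"],
    )
    print(job["kind"], job["metadata"]["name"])
    # submit(job)  # uncomment when a cluster with the operator is available
```

Going through the generic API sidesteps the SDK model classes, which may be useful while the `openapi_types` error is unresolved.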

Jeffwan commented Sep 1, 2021

I cut the https://github.com/kubeflow/tf-operator/releases/tag/v1.3.0-rc.0 release. The official release will be discussed in the WG-training meeting.

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
