[Test Failure] TFJob test failure; no module named py #1218

Closed
jlewi opened this issue Jul 16, 2018 · 20 comments

@jlewi
Contributor

jlewi commented Jul 16, 2018

TFJob test is failing with

/usr/bin/python: No module named py

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/1200/kubeflow-presubmit/2586/

Seems completely unrelated to that PR.

Need to check whether it's also failing at HEAD.

@jlewi
Contributor Author

jlewi commented Jul 16, 2018

For #1200, it looks like the test passed on retry, which is very weird.

@lluunn
Contributor

lluunn commented Jul 16, 2018

Looking into this.

@jlewi
Contributor Author

jlewi commented Jul 16, 2018

/assign @lluunn Thanks

@lluunn
Contributor

lluunn commented Jul 16, 2018

Found pod: kubeflow-test-infra kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af-237888385

Logs:
/usr/bin/python: No module named py

describe pod:

Command:
      python
      -m
      py.test_runner
      test
      --cluster=e2e-21af

Environment:
      PYTHONPATH:  /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/kubeflow:/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/testing/py:/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/tf-operator

cannot exec into it: error: cannot exec into a container in a completed pod; current phase is Failed

@lluunn
Contributor

lluunn commented Jul 16, 2018

root@debug-worker-0:/# ls /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/tf-operator
Gopkg.lock  OWNERS     cmd                 docs      linter_config.json  py                     test
Gopkg.toml  README.md  dashboard           examples  pkg                 releasing.md           tf_job_design_doc.md
LICENSE     build      developer_guide.md  hack      prow_config.yaml    submit_release_job.sh  vendor

However, there is no __init__.py in py:

root@debug-worker-0:/# ls /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/tf-operator/py -a
.        Pipfile.lock             deploy.py     py_checks.py     test_runner.py     tf_job_client.py
..       README.md                prow.py       release.py       test_util.py       util.py
Pipfile  build_and_push_image.py  prow_test.py  release_test.py  test_util_test.py  util_test.py
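
For context, the manual check above can be sketched in Python. This is only an illustration, not part of the test infrastructure: the function name `find_package_problems` is made up here, and it assumes `/usr/bin/python` is Python 2, where a directory is importable as a package only if it contains an `__init__.py`. Under that assumption, `python -m py.test_runner` fails with "No module named py" when the `py` directory is on the search path but its `__init__.py` is missing.

```python
import os


def find_package_problems(module, search_path):
    """Return (entry, missing_file) pairs explaining why `module` may not import.

    `module` is a dotted name such as "py.test_runner"; `search_path` is a list
    of directories, e.g. the entries of PYTHONPATH. Only regular packages
    (directories containing an __init__.py) are considered, which matches the
    Python 2 import rules assumed here.
    """
    package_parts = module.split(".")[:-1]
    problems = []
    for entry in search_path:
        current = entry
        for part in package_parts:
            current = os.path.join(current, part)
            if not os.path.isdir(current):
                break  # this entry does not hold the package at all
            init_file = os.path.join(current, "__init__.py")
            if not os.path.isfile(init_file):
                problems.append((entry, init_file))
                break
    return problems
```

Calling it with `"py.test_runner"` and the PYTHONPATH entries split on `os.pathsep` would report the `tf-operator` entry together with the missing `py/__init__.py`.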

@lluunn
Contributor

lluunn commented Jul 16, 2018

@lluunn
Contributor

lluunn commented Jul 16, 2018

It's weird:

root@debug-worker-0:/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-e8f6cef-2600-21af/src/kubeflow/tf-operator/py# git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	deleted:    ../examples/tf_sample/tf_sample/__init__.py
	deleted:    __init__.py
	deleted:    ../test/workflows/app.lock
	deleted:    ../vendor/github.com/modern-go/reflect2/reflect2_amd64.s

__init__.py is somehow deleted
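
To run the same check across several checked-out repos without reading the output by eye, something like the following could work. It is a rough sketch: the helper names `deleted_paths` and `deleted_in_repo` are invented for illustration, and it relies on the machine-readable `git status --porcelain` format (two status characters, a space, then the path) rather than the human-readable output pasted above.

```python
import subprocess


def deleted_paths(porcelain_output):
    """Extract paths marked deleted from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path. A "D" in
    either column means the file was deleted in the index or the work tree.
    Paths containing unusual characters may be quoted by git; this sketch
    does not unquote them.
    """
    deleted = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        if "D" in status:
            deleted.append(path)
    return deleted


def deleted_in_repo(repo_dir):
    """Run git in `repo_dir` and return the tracked files reported as deleted."""
    output = subprocess.check_output(
        ["git", "status", "--porcelain"], cwd=repo_dir
    )
    return deleted_paths(output.decode("utf-8"))
```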

@lluunn
Contributor

lluunn commented Jul 17, 2018

So __init__.py should be there right after the checkout step, but it is gone by the tfjob-test step.
I will do a binary search to find which step deleted it.
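
The binary search over workflow steps could look roughly like this. It is a sketch under a strong assumption: the file is present before the first step and, once deleted, stays deleted, so the per-step results form a run of "present" followed by a run of "missing". The function name and the predicate `file_present_after` are hypothetical; in practice the predicate would mean re-running the workflow up to a given step and listing the directory. The later comments in this thread suggest the assumption did not hold here, since the file disappeared independently of any particular step.

```python
def first_bad_step(steps, file_present_after):
    """Return the first step after which the file is missing, or None.

    `steps` is the ordered list of workflow step names and
    `file_present_after(step)` reports whether the file still exists once
    that step has finished. Assumes results are True... then False...
    """
    lo, hi = 0, len(steps)
    while lo < hi:
        mid = (lo + hi) // 2
        if file_present_after(steps[mid]):
            lo = mid + 1  # file survived this step, culprit is later
        else:
            hi = mid  # file already gone, culprit is this step or earlier
    return steps[lo] if lo < len(steps) else None
```

With n steps this needs about log2(n) checks instead of n, which matters when each check means waiting for part of an e2e workflow to run.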

@lluunn
Contributor

lluunn commented Jul 17, 2018

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-kubeflow-e2e-gke-1224-a45247a-2608-9146?tab=workflow

screenshot from 2018-07-17 10-32-29

__init__.py is gone

root@debug-worker-0:/# ls /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-a45247a-2608-9146/src/kubeflow/tf-operator/py -a
.        Pipfile.lock             deploy.py     py_checks.py     test_runner.py     tf_job_client.py
..       README.md                prow.py       release.py       test_util.py       util.py
Pipfile  build_and_push_image.py  prow_test.py  release_test.py  test_util_test.py  util_test.py

@lluunn
Contributor

lluunn commented Jul 17, 2018

root@debug-worker-0:/# ls /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-1296b1b-2610-cc8a/src/kubeflow/tf-operator/py -a
.        Pipfile.lock  build_and_push_image.py  prow_test.py  release_test.py  test_util_test.py  util_test.py
..       README.md     deploy.py                py_checks.py  test_runner.py   tf_job_client.py
Pipfile  __init__.py   prow.py                  release.py    test_util.py     util.py

It's there for this one http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-kubeflow-e2e-gke-1224-1296b1b-2610-cc8a?tab=workflow

screenshot from 2018-07-17 10-45-34

So it looks like the bootstrapper deleted it somehow..?

@lluunn
Contributor

lluunn commented Jul 17, 2018

screenshot from 2018-07-17 11-22-26

__init__.py is still there at the end of bootstrapper step...

@lluunn
Contributor

lluunn commented Jul 17, 2018

#1218 (comment)
Retrying the command from that comment (two comments up): __init__.py is gone..

root@debug-worker-0:/# ls /mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-1296b1b-2610-cc8a/src/kubeflow/tf-operator/py -a
.        Pipfile.lock             deploy.py     py_checks.py     test_runner.py     tf_job_client.py
..       README.md                prow.py       release.py       test_util.py       util.py
Pipfile  build_and_push_image.py  prow_test.py  release_test.py  test_util_test.py  util_test.py

So something else is deleting it..
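
Since the file vanished between two manual `ls` runs, one way to narrow down when it happens would be to poll from the debug worker and record a timestamp. The following is a minimal sketch, not something that was run for this issue; the function name `watch_for_deletion` and its parameters are made up, and polling only gives a time window, not the process responsible.

```python
import os
import time


def watch_for_deletion(paths, checks, interval_seconds=1.0, sleep=time.sleep):
    """Poll `paths` and return (timestamp, path) for each file that vanishes.

    Polls `checks` times, sleeping `interval_seconds` before each poll. A path
    is reported once, the first time it is seen missing after having existed
    when the watch started. `sleep` is injectable to make testing easier.
    """
    present = {p for p in paths if os.path.exists(p)}
    events = []
    for _ in range(checks):
        sleep(interval_seconds)
        for path in sorted(present):
            if not os.path.exists(path):
                events.append((time.time(), path))
                present.discard(path)
    return events
```

The recorded timestamps could then be compared against the workflow step start and end times to see whether any step lines up with the deletion.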

@lluunn
Contributor

lluunn commented Jul 17, 2018

root@debug-worker-0:/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1224-1296b1b-2610-cc8a/src/kubeflow/testing# git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	deleted:    py/kubeflow/__init__.py
	deleted:    py/kubeflow/testing/__init__.py
	deleted:    py/kubeflow/tests/__init__.py

The __init__.py files in the testing repo also got deleted.

@jlewi
Contributor Author

jlewi commented Jul 17, 2018

@lluunn Was this PR submitted after you submitted #1228?

@lluunn
Contributor

lluunn commented Jul 17, 2018

What do you mean?
So I redeployed the NFS, and that seems to have fixed the issue.

@jlewi
Contributor Author

jlewi commented Jul 18, 2018

@lluunn Nice work!

@jlewi
Contributor Author

jlewi commented Jul 18, 2018

@lluunn Should we revert #1228?

@lluunn
Contributor

lluunn commented Jul 18, 2018

That one is not merged, it's closed.

@lluunn
Contributor

lluunn commented Jul 18, 2018

Seems fixed now

@lluunn lluunn closed this as completed Jul 18, 2018
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
…1218)

* GoogleCloudPlatform/kubeflow-distribution#33 is tracking GCP blueprints on private GKE with VPC-SC

  * This PR doesn't fully enable that but it includes a lot of necessary
    changes.

* cluster-private-patch.yaml is a cluster patch that turns on a lot of
  settings to deploy GKE with private GKE

  * For ease of use we make the master publicly accessible anywhere; users
    could configure that behavior if desired using patch overlays.

* Use kpt setters to name all the networking resources (firewall rules, networks, etc...)

  * This ensures the names are unique based on the KF deployment name and won't conflict with
    existing rules.

  * The setters also ensure that the references get set correctly; e.g. the firewall rules
    correctly refer to the newly created network.

* Add a CNRM resource to enable CloudDNS.

  * Per GoogleCloudPlatform/kubeflow-distribution#31 we should probably use CNRM and not AnthosCLI to enable
    all required services.

* Add a kpt setter to control firewall rule logging

  * Enabling firewall rule logging can be useful to debug why connections are blocked.

    Enable logging on firewall rules.

* Add an extra firewall rule for ISTIO

  * Per https://istio.io/docs/setup/platform-setup/gke/ we need to manually create an additional firewall rule to allow traffic to the ISTIO pilot webhook port.

* Add a NAT to allow outbound internet egress

  * Egress is still blocked by firewall rules
  * Per kbueflow/gcp-blueprints#34 this was an attempt to make it possible
    to pull images from DockerHub and Quay.IO. This was partially
    successful; pulling from DockerHub works but for Quay.IO the firewall
    rules are still blocking required connections.

* Fix the v3 version of the cert-manager package.

  * kubeflow#1134 moved the kubeflow issuer into its own package to avoid
    race conditions

   * That refactoring meant that the v3 packages no longer included the
     actual cert-manager resources
   * This PR fixes that by having the v3 package pull in the base package