TFServing test is failing, blocking submits #1126
Following the instructions and looking at the K8s events, I see:
Looks like the pod was created but was never assigned to a node.
I also see:
Which is odd; I'd expect the GPU pool to scale up so the pod would fit.
Autoscaling is enabled for the gpu-pool.
On retest it seemed to work; I manually observed the pods. So I think the issue might be that autoscaling isn't able to keep up with heavy load.
/area 0.4.0
Closing this issue since there's no additional work needed. |
http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846?tab=workflow&nodeId=kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215&sidePanel=logs%3Akubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215%3Amain
Test is failing
ERROR|2018-07-03T15:36:26|/mnt/test-data-volume/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846/src/kubeflow/testing/py/kubeflow/testing/util.py|296| Timeout waiting for deployment inception-gpu-v1 in namespace kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 to be ready
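For context, the error above is the kind of message produced by a poll-with-deadline loop: the test repeatedly checks whether the Deployment is ready and raises once the timeout elapses. A minimal sketch of that pattern is below; `is_ready`, the function name, and the parameters are illustrative assumptions, not the actual kubeflow/testing util.py code (which checks the Deployment's available replicas via the Kubernetes API).

```python
import datetime
import time


def wait_for_deployment_ready(is_ready, name, namespace,
                              timeout_minutes=10, poll_seconds=10):
    """Poll the caller-supplied is_ready() until it returns True.

    Raises TimeoutError with a message shaped like the test log above
    if the deadline passes first. is_ready is a hypothetical callable
    standing in for a real Kubernetes API readiness check.
    """
    end_time = datetime.datetime.now() + datetime.timedelta(
        minutes=timeout_minutes)
    while datetime.datetime.now() < end_time:
        if is_ready():
            return True
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Timeout waiting for deployment {0} in namespace {1} to be ready"
        .format(name, namespace))
```

With autoscaling, a too-short timeout is a plausible failure mode even when the cluster is healthy: spinning up a new GPU node can take several minutes before the pod is ever scheduled, so the loop can expire while scale-up is still in progress.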