TFServing test is failing, blocking submits #1126

Closed
jlewi opened this issue Jul 5, 2018 · 6 comments

jlewi commented Jul 5, 2018

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846?tab=workflow&nodeId=kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215&sidePanel=logs%3Akubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215%3Amain

Test is failing:

ERROR|2018-07-03T15:36:26|/mnt/test-data-volume/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846/src/kubeflow/testing/py/kubeflow/testing/util.py|296| Timeout waiting for deployment inception-gpu-v1 in namespace kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 to be ready
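
(For reference, a minimal sketch of inspecting the stuck deployment directly; the namespace is taken from the error message above, and the commands are plain kubectl:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get deployment inception-gpu-v1
kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 describe deployment inception-gpu-v1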


jlewi commented Jul 5, 2018

Following the instructions at https://github.com/kubeflow/testing and looking at the K8s events, I see:

2018-07-03 08:36:39.000 PDT
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
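
(A minimal sketch of pulling those events with kubectl, assuming access to the test cluster; sorting by .lastTimestamp is just one convenient way to surface the newest events:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get events --sort-by='.lastTimestamp'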


jlewi commented Jul 5, 2018

Looks like the pod was created but was never assigned to a node.

Created pod: inception-gpu-v1-6889ddb9cc-mhd6l

I also see:

pod didn't trigger scale-up (it wouldn't fit if a new node is added)

Which is odd; I'd expect that scaling up the GPU pool would make it fit.
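
(A minimal sketch of checking whether the pod could ever fit: compare its GPU request against what a gpu-pool node advertises as allocatable. The jsonpath expression is illustrative, and cloud.google.com/gke-nodepool is the standard GKE node-pool label:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get pod inception-gpu-v1-6889ddb9cc-mhd6l -o jsonpath='{.spec.containers[*].resources}'
kubectl describe node -l cloud.google.com/gke-nodepool=gpu-pool | grep -A 8 Allocatable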


jlewi commented Jul 5, 2018

Autoscaling is enabled for the gpu-pool:

gcloud --project=kubeflow-ci container node-pools describe --zone=us-east1-d --cluster=kubeflow-testing gpu-pool
autoscaling:
  enabled: true
  maxNodeCount: 10
  minNodeCount: 2
config:
  accelerators:
  - acceleratorCount: '2'
    acceleratorType: nvidia-tesla-k80
  diskSizeGb: 100
  imageType: COS
  machineType: n1-standard-8
  minCpuPlatform: Automatic
  oauthScopes:
  - https://www.googleapis.com/auth/compute
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/service.management
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  serviceAccount: default
initialNodeCount: 1
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/kubeflow-ci/zones/us-east1-d/instanceGroupManagers/gke-kubeflow-testing-gpu-pool-117658d4-grp
management: {}
name: gpu-pool
selfLink: https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters/kubeflow-testing/nodePools/gpu-pool
status: RUNNING
version: 1.9.6-gke.0
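
(Two hedged follow-ups, assuming the usual GKE setup: the cluster autoscaler publishes a cluster-autoscaler-status ConfigMap in kube-system that records why recent scale-ups were skipped, and the pool can be resized manually as a stopgap; the exact resize flag may differ by gcloud version:)

kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
gcloud --project=kubeflow-ci container clusters resize kubeflow-testing --zone=us-east1-d --node-pool=gpu-pool --num-nodes=3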


jlewi commented Jul 5, 2018

On retest it seemed to work; I observed the pods manually.

So I think the issue might be that autoscaling isn't keeping up under heavy load.


jbottum commented Sep 30, 2018

/area 0.4.0


jlewi commented Oct 14, 2018

Closing this issue since there's no additional work needed.

@jlewi jlewi closed this as completed Oct 14, 2018
@carmine carmine added this to the 0.4.0 milestone Nov 6, 2018
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022