TFServing test is failing, blocking submits #1126

Closed
jlewi opened this issue Jul 5, 2018 · 6 comments

jlewi commented Jul 5, 2018

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846?tab=workflow&nodeId=kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215&sidePanel=logs%3Akubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846-1071956215%3Amain

Test is failing:

ERROR|2018-07-03T15:36:26|/mnt/test-data-volume/kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846/src/kubeflow/testing/py/kubeflow/testing/util.py|296| Timeout waiting for deployment inception-gpu-v1 in namespace kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 to be ready
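
(For reference, a minimal sketch of inspecting the stuck deployment directly; the namespace is taken from the error message above, and the commands are plain kubectl:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get deployment inception-gpu-v1
kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 describe deployment inception-gpu-v1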


jlewi commented Jul 5, 2018

Following the instructions at https://github.com/kubeflow/testing and looking at the K8s events, I see:

2018-07-03 08:36:39.000 PDT
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
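
(A minimal sketch of pulling those events with kubectl, assuming access to the test cluster; sorting by .lastTimestamp is just one convenient way to surface the newest events:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get events --sort-by='.lastTimestamp'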


jlewi commented Jul 5, 2018

Looks like the pod was created but was never assigned to a node.

Created pod: inception-gpu-v1-6889ddb9cc-mhd6l

I also see:

pod didn't trigger scale-up (it wouldn't fit if a new node is added)

Which is odd; I'd expect that scaling up the GPU pool would make it fit.
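
(A minimal sketch of checking whether the pod could ever fit: compare its GPU request against what a gpu-pool node advertises as allocatable. The jsonpath expression is illustrative, and cloud.google.com/gke-nodepool is the standard GKE node-pool label:)

kubectl -n kubeflow-presubmit-tf-serving-image-1109-866b43a-2354-4846 get pod inception-gpu-v1-6889ddb9cc-mhd6l -o jsonpath='{.spec.containers[*].resources}'
kubectl describe node -l cloud.google.com/gke-nodepool=gpu-pool | grep -A 8 Allocatable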


jlewi commented Jul 5, 2018

Autoscaling is enabled for the gpu-pool:

gcloud --project=kubeflow-ci container node-pools describe --zone=us-east1-d --cluster=kubeflow-testing gpu-pool
autoscaling:
  enabled: true
  maxNodeCount: 10
  minNodeCount: 2
config:
  accelerators:
  - acceleratorCount: '2'
    acceleratorType: nvidia-tesla-k80
  diskSizeGb: 100
  imageType: COS
  machineType: n1-standard-8
  minCpuPlatform: Automatic
  oauthScopes:
  - https://www.googleapis.com/auth/compute
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/service.management
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  serviceAccount: default
initialNodeCount: 1
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/kubeflow-ci/zones/us-east1-d/instanceGroupManagers/gke-kubeflow-testing-gpu-pool-117658d4-grp
management: {}
name: gpu-pool
selfLink: https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters/kubeflow-testing/nodePools/gpu-pool
status: RUNNING
version: 1.9.6-gke.0
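
(Two hedged follow-ups, assuming the usual GKE setup: the cluster autoscaler publishes a cluster-autoscaler-status ConfigMap in kube-system that records why recent scale-ups were skipped, and the pool can be resized manually as a stopgap; the exact resize flag may differ by gcloud version:)

kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
gcloud --project=kubeflow-ci container clusters resize kubeflow-testing --zone=us-east1-d --node-pool=gpu-pool --num-nodes=3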


jlewi commented Jul 5, 2018

On retest it seemed to work; I observed the pods manually.

So I think the issue might be that autoscaling isn't keeping up under heavy load.


jbottum commented Sep 30, 2018

/area 0.4.0


jlewi commented Oct 14, 2018

Closing this issue since there's no additional work needed.

@jlewi jlewi closed this as completed Oct 14, 2018
@carmine carmine added this to the 0.4.0 milestone Nov 6, 2018
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022