DM template fails with GPU enabled #2392

Closed
hongye-sun opened this issue Feb 5, 2019 · 3 comments

hongye-sun (Contributor) commented Feb 5, 2019

It seems that there is a timing conflict on the DM side. Our test deployments kept failing with the following error:

e2e-9b40778-1331 has resource warnings
e2e-9b40778-1331-gpu-pool-v1: {"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters.nodePools","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Cluster is currently being created, deleted, updated or repaired and cannot be updated.","status":"FAILED_PRECONDITION","statusMessage":"Bad Request","requestPath":"https://container.googleapis.com/v1beta1/projects/ml-pipeline-test/locations/us-central1-a/clusters/e2e-9b40778-1331/nodePools","httpMethod":"POST"}}

It seems that DM is trying to create the GPU nodepool while the cluster is still being created.

The expanded template looks correct: the GPU node pool depends on the cluster setup completing. The same provisioning script worked before. Is this a regression on the DM side?

resources:
- name: e2e-9b40778-1331-admin
  properties:
    accountId: e2e-9b40778-1331-admin
    displayName: Service Account used for Kubeflow admin actions.
  type: iam.v1.serviceAccount
- name: e2e-9b40778-1331-user
  properties:
    accountId: e2e-9b40778-1331-user
    displayName: Service Account used for Kubeflow user actions.
  type: iam.v1.serviceAccount
- name: e2e-9b40778-1331-vm
  properties:
    accountId: e2e-9b40778-1331-vm
    displayName: GCP Service Account to use as VM Service Account for Kubeflow Cluster
      VMs
  type: iam.v1.serviceAccount
- metadata:
    dependsOn:
    - e2e-9b40778-1331-vm
  name: e2e-9b40778-1331
  properties:
    cluster:
      autoscaling:
        enableNodeAutoprovisioning: true
        resourceLimits:
        - maximum: 20
          resourceType: cpu
        - maximum: 200
          resourceType: memory
        - maximum: 8
          resourceType: nvidia-tesla-k80
      initialClusterVersion: '1.11'
      loggingService: logging.googleapis.com/kubernetes
      monitoringService: monitoring.googleapis.com/kubernetes
      name: e2e-9b40778-1331
      nodePools:
      - autoscaling:
          enabled: true
          maxNodeCount: 10
          minNodeCount: 0
        config:
          machineType: n1-standard-8
          minCpuPlatform: Intel Broadwell
          oauthScopes:
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          - https://www.googleapis.com/auth/devstorage.read_only
          serviceAccount: e2e-9b40778-1331-vm@ml-pipeline-test.iam.gserviceaccount.com
        initialNodeCount: 2
        name: default-pool
      podSecurityPolicyConfig:
        enabled: false
    parent: projects/ml-pipeline-test/locations/us-central1-a
    zone: us-central1-a
  type: gcp-types/container-v1beta1:projects.locations.clusters
- metadata:
    dependsOn:
    - e2e-9b40778-1331
  name: e2e-9b40778-1331-gpu-pool-v1
  properties:
    clusterId: e2e-9b40778-1331
    nodePool:
      autoscaling:
        enabled: true
        maxNodeCount: 1
        minNodeCount: 1
      config:
        accelerators:
        - acceleratorCount: 1
          acceleratorType: nvidia-tesla-k80
        machineType: n1-standard-8
        minCpuPlatform: Intel Broadwell
        oauthScopes:
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/monitoring
        - https://www.googleapis.com/auth/devstorage.read_only
        serviceAccount: e2e-9b40778-1331-vm@ml-pipeline-test.iam.gserviceaccount.com
      initialNodeCount: 1
      name: gpu-pool
    parent: projects/ml-pipeline-test/locations/us-central1-a/clusters/e2e-9b40778-1331
    project: null
    zone: us-central1-a
  type: gcp-types/container-v1beta1:projects.locations.clusters.nodePools
- name: e2e-9b40778-1331-ip
  properties:
    description: Static IP for Kubeflow ingress.
  type: compute.v1.globalAddress
jlewi (Contributor) commented Feb 5, 2019

I believe we only recently changed our DM configs to not install the GPU node pool by default.
In the past I don't recall any problems with DM having race conditions with creating the cluster and the node pools.

One possibility is that DM's dependsOn is insufficient: if the cluster enters some state that prevents updates (e.g. because it is autoscaling), then updates that add things like node pools might fail.
It's possible that changes to our default config have triggered this, e.g. enabling autoscaling/autoprovisioning by default.
Or it's possible that other changes you are making are triggering it.
Assuming that's the problem, you probably need to add some backoff and retry.
You could either add this to kfctl.sh, or add backoff and retry around the calls to kfctl update.
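
A minimal sketch of that backoff-and-retry idea, assuming the node pool update is driven by shelling out to a script such as kfctl.sh (the command name and arguments below are placeholders, not the actual CLI surface):

import subprocess
import time

def run_with_retry(cmd, max_attempts=5, initial_delay_s=30):
    """Retry cmd with exponential backoff while the cluster is busy."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # FAILED_PRECONDITION is what DM surfaced above while the cluster was
        # still being created/updated, so treat it as "busy, try again later".
        busy = "FAILED_PRECONDITION" in (result.stdout + result.stderr)
        if not busy or attempt == max_attempts:
            raise RuntimeError("command failed after %d attempt(s): %s" % (attempt, result.stderr))
        time.sleep(delay)
        delay *= 2

# Hypothetical usage around the node pool update step:
# run_with_retry(["./kfctl.sh", "update"])

Exponential backoff keeps retrying while the cluster reports that it cannot be updated yet, without hammering the API.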

kunmingg (Contributor) commented Feb 5, 2019

@hongye-sun
Can you retry with an older GKE version?
For example, at https://github.com/kubeflow/kubeflow/blob/master/deployment/gke/deployment_manager_configs/cluster-kubeflow.yaml#L34,
set version = 1.11.6-gke.2?

hongye-sun (Contributor, Author) commented

It turns out that I cannot repro the issue any more after we switched to another region. This might be related to a GCE stockout issue that we hit before.
