DM template fails with GPU enabled #2392

Closed
hongye-sun opened this issue Feb 5, 2019 · 3 comments

hongye-sun (Contributor) commented Feb 5, 2019

It seems that there is a timing conflict on the DM side. Our test deployments kept failing with the following error:

e2e-9b40778-1331 has resource warnings
e2e-9b40778-1331-gpu-pool-v1: {"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters.nodePools","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Cluster is currently being created, deleted, updated or repaired and cannot be updated.","status":"FAILED_PRECONDITION","statusMessage":"Bad Request","requestPath":"https://container.googleapis.com/v1beta1/projects/ml-pipeline-test/locations/us-central1-a/clusters/e2e-9b40778-1331/nodePools","httpMethod":"POST"}}

It seems that DM is trying to create the GPU nodepool while the cluster is still being created.

The expanded template looks correct: the GPU node pool depends on the cluster setup completing. The same provisioning script worked before. Is this a regression on the DM side?

resources:
- name: e2e-9b40778-1331-admin
  properties:
    accountId: e2e-9b40778-1331-admin
    displayName: Service Account used for Kubeflow admin actions.
  type: iam.v1.serviceAccount
- name: e2e-9b40778-1331-user
  properties:
    accountId: e2e-9b40778-1331-user
    displayName: Service Account used for Kubeflow user actions.
  type: iam.v1.serviceAccount
- name: e2e-9b40778-1331-vm
  properties:
    accountId: e2e-9b40778-1331-vm
    displayName: GCP Service Account to use as VM Service Account for Kubeflow Cluster
      VMs
  type: iam.v1.serviceAccount
- metadata:
    dependsOn:
    - e2e-9b40778-1331-vm
  name: e2e-9b40778-1331
  properties:
    cluster:
      autoscaling:
        enableNodeAutoprovisioning: true
        resourceLimits:
        - maximum: 20
          resourceType: cpu
        - maximum: 200
          resourceType: memory
        - maximum: 8
          resourceType: nvidia-tesla-k80
      initialClusterVersion: '1.11'
      loggingService: logging.googleapis.com/kubernetes
      monitoringService: monitoring.googleapis.com/kubernetes
      name: e2e-9b40778-1331
      nodePools:
      - autoscaling:
          enabled: true
          maxNodeCount: 10
          minNodeCount: 0
        config:
          machineType: n1-standard-8
          minCpuPlatform: Intel Broadwell
          oauthScopes:
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          - https://www.googleapis.com/auth/devstorage.read_only
          serviceAccount: e2e-9b40778-1331-vm@ml-pipeline-test.iam.gserviceaccount.com
        initialNodeCount: 2
        name: default-pool
      podSecurityPolicyConfig:
        enabled: false
    parent: projects/ml-pipeline-test/locations/us-central1-a
    zone: us-central1-a
  type: gcp-types/container-v1beta1:projects.locations.clusters
- metadata:
    dependsOn:
    - e2e-9b40778-1331
  name: e2e-9b40778-1331-gpu-pool-v1
  properties:
    clusterId: e2e-9b40778-1331
    nodePool:
      autoscaling:
        enabled: true
        maxNodeCount: 1
        minNodeCount: 1
      config:
        accelerators:
        - acceleratorCount: 1
          acceleratorType: nvidia-tesla-k80
        machineType: n1-standard-8
        minCpuPlatform: Intel Broadwell
        oauthScopes:
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/monitoring
        - https://www.googleapis.com/auth/devstorage.read_only
        serviceAccount: e2e-9b40778-1331-vm@ml-pipeline-test.iam.gserviceaccount.com
      initialNodeCount: 1
      name: gpu-pool
    parent: projects/ml-pipeline-test/locations/us-central1-a/clusters/e2e-9b40778-1331
    project: null
    zone: us-central1-a
  type: gcp-types/container-v1beta1:projects.locations.clusters.nodePools
- name: e2e-9b40778-1331-ip
  properties:
    description: Static IP for Kubeflow ingress.
  type: compute.v1.globalAddress
jlewi (Contributor) commented Feb 5, 2019

I believe we only recently changed our DM configs to not install the GPU node pool by default.
In the past I don't recall any problems with DM having race conditions with creating the cluster and the node pools.

One possibility is that DM's dependsOn is insufficient: if the cluster enters some state that prevents updates (e.g. because it is autoscaling), then updates that add things like node pools might fail.
It's possible that changes to our default config have triggered this, e.g. enabling autoscaling/autoprovisioning by default.
Or it's possible that other changes you are making are triggering it.
Assuming that's the problem, you probably need to add some backoff and retry.
You could either add this to kfctl.sh, or add backoff and retry around the calls to kfctl update.
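
A minimal sketch of that backoff-and-retry idea, assuming the node pool update is driven by shelling out to a script such as kfctl.sh (the command name and arguments below are placeholders, not the actual CLI surface):

import subprocess
import time

def run_with_retry(cmd, max_attempts=5, initial_delay_s=30):
    """Retry cmd with exponential backoff while the cluster is busy."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # FAILED_PRECONDITION is what DM surfaced above while the cluster was
        # still being created/updated, so treat it as "busy, try again later".
        busy = "FAILED_PRECONDITION" in (result.stdout + result.stderr)
        if not busy or attempt == max_attempts:
            raise RuntimeError("command failed after %d attempt(s): %s" % (attempt, result.stderr))
        time.sleep(delay)
        delay *= 2

# Hypothetical usage around the node pool update step:
# run_with_retry(["./kfctl.sh", "update"])

Exponential backoff keeps retrying while the cluster reports that it cannot be updated yet, without hammering the API.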

kunmingg (Contributor) commented Feb 5, 2019

@hongye-sun
Can you retry with an older GKE version?
For example, at https://github.com/kubeflow/kubeflow/blob/master/deployment/gke/deployment_manager_configs/cluster-kubeflow.yaml#L34,
set version = 1.11.6-gke.2?

hongye-sun (Contributor, Author) commented

It turns out that I cannot repro the issue any more after we switched to another region. This might be related to a GCE stockout issue that we hit before.
