It seems that DM is trying to create the GPU nodepool while the cluster is still being created.
From the expanded template, the config looks correct: the GPU node pool depends on the cluster setup completing. The same provisioning script worked before. Is this a regression on the DM side?
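For context, a minimal sketch of how that kind of dependency is typically expressed in a Deployment Manager config; the resource names and properties here are illustrative, not the actual template:

```yaml
resources:
- name: e2e-cluster            # illustrative name
  type: gcp-types/container-v1beta1:projects.locations.clusters
  properties:
    # ... cluster properties elided ...
- name: gpu-pool-v1            # illustrative name
  type: gcp-types/container-v1beta1:projects.locations.clusters.nodePools
  metadata:
    dependsOn:
    - e2e-cluster              # DM waits for the cluster resource to complete first
  properties:
    # ... node pool properties elided ...
```

With `metadata.dependsOn` in place, DM should not issue the node pool create until the cluster resource reports success, which is why a race here is surprising.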
I believe we only recently changed our DM configs to not install the GPU node pool by default.
In the past I don't recall DM having any race-condition problems between creating the cluster and the node pools.
One possibility is that the dependency DM relies on is insufficient: if the cluster enters a state that prevents updates (e.g. if it's autoscaling), then updates adding things like node pools might fail.
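One way to guard against that state, sketched below under the assumption that gcloud is available and that the cluster/zone/project names are placeholders for this deployment, is to poll the cluster status and only apply node pool updates once it reports RUNNING:

```shell
#!/usr/bin/env bash
# Poll a GKE cluster's status and return once it is RUNNING.
# The cluster, zone, and project arguments are placeholders.
wait_for_cluster_running() {
  local cluster="$1" zone="$2" project="$3"
  local status i=0
  while [ "$i" -lt 30 ]; do
    status=$(gcloud container clusters describe "${cluster}" \
      --zone "${zone}" --project "${project}" \
      --format='value(status)')
    if [ "${status}" = "RUNNING" ]; then
      return 0
    fi
    echo "Cluster status: ${status}; waiting..." >&2
    sleep 20
    i=$((i + 1))
  done
  return 1
}

# Example (placeholder names):
# wait_for_cluster_running e2e-9b40778-1331 us-central1-a ml-pipeline-test
```

A cluster that is being created, repaired, or autoscaled reports a status other than RUNNING, which is exactly when the nodePools POST returns FAILED_PRECONDITION.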
It's possible changes to our default config have triggered this, e.g. enabling autoscaling/auto-provisioning by default.
Or it's possible other changes you are making are triggering it.
Assuming that's the problem, you probably need to add some backoff and retry.
You could either add this to kfctl.sh,
or you could add backoff and retry around calls to kfctl update.
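A minimal sketch of such a retry loop for kfctl.sh; the wrapped command (kfctl update) and the retry parameters are illustrative, not the actual script:

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff.
# RETRY_BASE_DELAY (seconds) is an illustrative knob; it doubles
# after each failed attempt.
retry_with_backoff() {
  local max_attempts=5
  local delay="${RETRY_BASE_DELAY:-10}"
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay}s..." >&2
    sleep "${delay}"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example: wrap the update so transient FAILED_PRECONDITION errors
# during cluster creation are retried instead of failing the deployment.
# retry_with_backoff kfctl update
```

Wrapping only the update call (rather than all of kfctl.sh) keeps genuinely fatal errors from being retried pointlessly.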
It turns out that I cannot repro the issue any more after we switched to another region. This might be related to a GCE stockout issue we hit before.
It seems that there is a timing conflict on the DM side. Our test deployments kept failing with this error:
```
e2e-9b40778-1331 has resource warnings
e2e-9b40778-1331-gpu-pool-v1: {
  "ResourceType": "gcp-types/container-v1beta1:projects.locations.clusters.nodePools",
  "ResourceErrorCode": "400",
  "ResourceErrorMessage": {
    "code": 400,
    "message": "Cluster is currently being created, deleted, updated or repaired and cannot be updated.",
    "status": "FAILED_PRECONDITION",
    "statusMessage": "Bad Request",
    "requestPath": "https://container.googleapis.com/v1beta1/projects/ml-pipeline-test/locations/us-central1-a/clusters/e2e-9b40778-1331/nodePools",
    "httpMethod": "POST"
  }
}
```