[GCP] Node Auto provisioner pool is missing VM service account and oauth scopes #4259
@Toeplitz How did you deploy Kubeflow? GKE uses the VM service account by default to pull GCR images. So you likely don't have a VM service account with the proper IAM roles, or you haven't set the OAuth scopes on the VM. Alternatively, you can set image pull secrets.
@jlewi I deployed using the CLI instructions at https://www.kubeflow.org/docs/gke/deploy/deploy-cli/ with OAuth, and I'm accessing Kubeflow through https://KFAPP.endpoints.project-id.cloud.goog/. The service accounts are created. I'm not sure how to check for the OAuth scopes on the VM?
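One way to check the scopes is to inspect the node's underlying Compute Engine instance. This is a sketch; `NODE_NAME` and `ZONE` are placeholders for your own values.

```shell
# Show the service account and OAuth scopes attached to a node VM.
# NODE_NAME and ZONE are placeholders; GKE node names double as instance names.
gcloud compute instances describe NODE_NAME --zone ZONE \
  --format="yaml(serviceAccounts)"
```

The output lists each attached service account together with its `scopes`; for private GCR pulls you'd expect a storage read scope (e.g. `devstorage.read_only`) or `cloud-platform` to be present.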
@Toeplitz I suspect you are running into an issue with the node auto-provisioner not setting the service account and scopes correctly. Can you provide the output of
That should provide the name of the node it's running on. Then do
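The original commands were not preserved in this copy of the thread, but the two steps described can be sketched as follows (pod and node names are placeholders):

```shell
# Step 1: find which node the failing pod was scheduled on.
kubectl -n kubeflow get pod POD_NAME -o jsonpath='{.spec.nodeName}'

# Step 2: inspect that node; autoprovisioned nodes carry a
# cloud.google.com/gke-nodepool=<pool-name> label identifying their pool.
kubectl describe node NODE_NAME
</imports>
```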
Assuming you are running into problems with the node autoprovisioner, here's how to fix it. Run the commands below to set the service account and OAuth scopes for the auto-provisioner.
Then look for an autoprovisioned node pool (it should have
Then delete that node pool
The next time a new node pool is created, it will have the newly added scope and VM service account.
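The exact commands did not survive in this copy, but the procedure described above can be sketched like this (cluster, zone, and pool names are placeholders; per a later comment in this thread, service account and scopes could not both be set at the time, and flag availability may depend on your gcloud version):

```shell
# Set the default VM service account for autoprovisioned node pools.
gcloud container clusters update CLUSTER_NAME --zone ZONE \
  --enable-autoprovisioning \
  --autoprovisioning-service-account=VM_SERVICE_ACCOUNT_EMAIL

# Find the autoprovisioned node pool (names typically start with "nap-").
gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE

# Delete the stale pool so the autoprovisioner recreates it with the new defaults.
gcloud container node-pools delete NAP_POOL_NAME \
  --cluster CLUSTER_NAME --zone ZONE
```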
@jlewi Thank you for the answer. It turns out that the accounts were not set up correctly on the autoprovisioned GPU nodes. Your suggestion fixed the issue. Am I wrong to think this should be added to the default setup? As a small aside, is it possible to have the autoprovisioned node be n1-standard-8 instead of n1-standard-1?
* See kubeflow/kfctl#48: 1.14.6-gke-13 has a bug with workload identity, so to work around it we temporarily pin to 1.14.6-gke-2.
* Fix node autoprovisioning defaults (kubeflow/kubeflow#4259): we need to set the default service account, otherwise we won't be able to pull images from private GCR.
* Note: We can't set both service account and scopes, so we only set the service account.
This should be fixed by kubeflow/manifests#498. We just need to wait for the next 0.7 RC and then verify it is fixed before closing this bug.
Using kubeflow/manifests 99246dd, I used GPUs to force a pod onto a NAP node. Unfortunately it still can't pull the image; there is an authorization issue. Here's the node spec:
So it's using the right service account but is missing the OAuth scope for storage. The service account has the objectViewer permission.
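A quick way to compare the service account and scopes on the autoprovisioned pool against a working pool is to query the pool config directly. This is a sketch with placeholder names:

```shell
# Show the service account and OAuth scopes configured on a node pool.
# POOL_NAME, CLUSTER_NAME, and ZONE are placeholders.
gcloud container node-pools describe POOL_NAME \
  --cluster CLUSTER_NAME --zone ZONE \
  --format="yaml(config.serviceAccount, config.oauthScopes)"
```

A pool missing a storage scope here would match the pull failure described above, even when `config.serviceAccount` is correct.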
So I confirmed with the node autoprovisioning team that there is a known issue. The next release of GKE will support setting both the service account and scopes. So here's where things stand:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen The API changes to allow setting both scopes and service account should be rolled out. It should be easy to update the deployment manager configs and then verify this is working. We should do this for 1.0.
* kubeflow/kubeflow#4259: GKE now allows setting OAuth scopes and the VM service account for NAP pools. We need to use this, otherwise private images won't be accessible.
* kubeflow/kubeflow#3930: Bump NAP resource constraints because these are global constraints, not per-node constraints.
  * Bump max CPU to 128 CPUs
  * Bump memory
The fix is cherry-picked onto both the release and master branches. The full spec for a 1.0 GKE cluster is below. Looking at the node autoprovisioning settings, both service accounts and OAuth scopes are set.
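The cluster spec itself was not preserved in this copy, but the relevant autoprovisioning defaults can be verified with a command along these lines (placeholder names):

```shell
# Inspect the cluster-wide defaults applied to autoprovisioned node pools,
# including serviceAccount and oauthScopes.
gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format="yaml(autoscaling.autoprovisioningNodePoolDefaults)"
```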
/kind bug
What steps did you take and what happened:
After deploying kubeflow successfully on GCP I'm trying to run a pipeline.
Components in the pipeline try to pull a private container from gcr.io and fail to authenticate. See the description below.
The following code is used to define the pipeline component:
Now the relevant parts of the compiled yaml file for the pipeline:
After initialization the pod falls over with the following:
Anything else you would like to add:
I have noticed that the user-gcp-sa.json file is not in /secret but rather in /mainctrfs; could this be causing problems?
Entering the relevant pod with a shell shows:
All I could find in this was https://www.kubeflow.org/docs/gke/authentication/#authentication-from-kubeflow-pipelines which does not tell you much other than to do what I'm describing.
Environment:
Python module version:
kfp (0.1.31.2)
kfp-server-api (0.1.18.3)