
[GCP] Node Auto provisioner pool is missing VM service account and oauth scopes #4259

Closed
Toeplitz opened this issue Oct 9, 2019 · 12 comments


Toeplitz commented Oct 9, 2019

/kind bug

What steps did you take and what happened:
After deploying Kubeflow successfully on GCP I'm trying to run a pipeline.
Components in the pipeline try to pull a private container image from gcr.io and fail to authenticate. See the description below.

The following code is used to define the pipeline component:

    from kfp import dsl, gcp

    train = dsl.ContainerOp(
        name='train',
        image='gcr.io/<project>/<container>:latest',
        arguments=[ .... ]
    ).set_gpu_limit(1)

    steps = [train]
    for step in steps:
        step.apply(gcp.use_gcp_secret('user-gcp-sa'))

Here are the relevant parts of the compiled YAML file for the pipeline:


  - container:
      env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /secret/gcp-credentials/user-gcp-sa.json
      - name: CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE
        value: /secret/gcp-credentials/user-gcp-sa.json
      image: gcr.io/<project>/<container>:latest
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
      - mountPath: /secret/gcp-credentials
        name: gcp-credentials-user-gcp-sa
    inputs:
    name: train
    volumes:
    - name: gcp-credentials-user-gcp-sa
      secret:
        secretName: user-gcp-sa

After initialization, the pod falls over with the following events:

Normal Pulling 56s kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 pulling image "argoproj/argoexec:v2.3.0"
Normal Pulled 48s kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Successfully pulled image "argoproj/argoexec:v2.3.0"
Normal Started 44s kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Started container
Normal Created 44s kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Created container
Normal BackOff 16s (x2 over 43s) kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Back-off pulling image "gcr.io/project/container:latest"
Warning Failed 16s (x2 over 43s) kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Error: ImagePullBackOff
Normal Pulling 2s (x3 over 44s) kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 pulling image "gcr.io/project/container:latest"
Warning Failed 2s (x3 over 44s) kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Failed to pull image "gcr.io/project/container:latest": rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
Warning Failed 2s (x3 over 44s) kubelet, gke-pipelines-nap-n1-standard-1-gpu1--30d96aa7-5sq4 Error: ErrImagePull

Anything else you would like to add:

I have noticed that the user-gcp-sa.json file is not in /secret but rather under /mainctrfs. Could this be causing problems?

Entering the relevant pod with a shell shows:

root@pod-b9jks-3315453858:/# find / -name "user-gcp-sa.json"
/mainctrfs/secret/gcp-credentials/user-gcp-sa.json
/mainctrfs/secret/gcp-credentials/..2019_10_09_10_01_29.770571990/user-gcp-sa.json

All I could find on this was https://www.kubeflow.org/docs/gke/authentication/#authentication-from-kubeflow-pipelines, which does not tell you much beyond doing what I'm already describing.

Environment:

  • Kubeflow version: build commit 812ca7f
  • kfctl version: kfctl v0.6.2-0-g47a0e4c7
  • Kubernetes platform: gcp 1.12.10-gke.5
  • kubectl version: 1.6
  • OS: Ubuntu Bionic x64

Python module version:
kfp (0.1.31.2)
kfp-server-api (0.1.18.3)



jlewi commented Oct 10, 2019

@Toeplitz How did you deploy Kubeflow? GKE uses the VM service account by default to pull GCR images. So you likely don't have a VM service account with proper IAM roles or you haven't set the OAuth scopes on the VM.
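
As a rough check of the IAM side, something like this should list the project-level roles bound to the VM service account (a sketch; ${PROJECT} and ${SA_EMAIL} are placeholders for your project ID and the *-vm service account email):

# List the IAM roles granted to the VM service account on the project.
gcloud projects get-iam-policy ${PROJECT} \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SA_EMAIL}" \
  --format="table(bindings.role)"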

Alternatively, you can set image pull secrets: https://medium.com/@michaelmorrissey/using-cross-project-gcr-images-in-gke-1ddc36de3d42
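
A minimal sketch of that approach (assuming a downloaded service account key named key.json with GCR read access, a secret named gcr-pull-secret, and pipeline pods running in the kubeflow namespace; all of these names are placeholders):

# Create a docker-registry secret from the service account key.
kubectl create secret docker-registry gcr-pull-secret \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  --docker-email=any@example.com \
  -n kubeflow

# Attach it to the default service account in that namespace
# (adjust if your pods run under a different service account).
kubectl patch serviceaccount default -n kubeflow \
  -p '{"imagePullSecrets": [{"name": "gcr-pull-secret"}]}'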


Toeplitz commented Oct 10, 2019

@jlewi I deployed using the CLI instructions https://www.kubeflow.org/docs/gke/deploy/deploy-cli/ with OAuth, and I'm accessing Kubeflow through https://KFAPP.endpoints.project-id.cloud.goog/

The service accounts are created:
kfapp-user@project.iam.gserviceaccount.com (has Storage Admin and Project Viewer) and kfapp-vm@project.iam.gserviceaccount.com (has Storage Object Viewer), so that looks correct to me in that both of them should be able to pull from GCR?

I'm not sure how to check the OAuth scopes on the VM?


jlewi commented Oct 10, 2019

@Toeplitz I suspect you are running into an issue with the node auto-provisioner not setting the service account and scopes correctly.

Can you provide the output of

kubectl get pod ${POD} -o yaml

That should provide the name of the node it's running on (spec.nodeName).

Then do

gcloud --project=${PROJECT} compute instances describe --zone=${ZONE} $INSTANCE
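
To check just the service account and OAuth scopes on that node, a narrower describe works as well (a sketch using gcloud's --format projection):

gcloud --project=${PROJECT} compute instances describe $INSTANCE --zone=${ZONE} --format="yaml(serviceAccounts)"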

Assuming you are running into problems with the node auto-provisioner, here's how to fix it.

Run the commands below to set the service account and OAuth scopes for the auto-provisioner:

gcloud --project=${PROJECT} beta container clusters update ${KFAPP} --enable-autoprovisioning --autoprovisioning-service-account=${KFAPP}-vm@${PROJECT}.iam.gserviceaccount.com --zone=${ZONE}

gcloud --project=${PROJECT} beta container clusters update ${KFAPP} --enable-autoprovisioning --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only

Then look for an auto-provisioned node pool (it should have "nap" in the name):

gcloud --project=${PROJECT} container node-pools list --zone=${ZONE} --cluster=${CLUSTER}

Then delete that node pool

gcloud --project=${PROJECT} container node-pools delete --zone=${ZONE} --cluster=${CLUSTER} ${NODEPOOL}

The next time a new node pool is created, it will have the newly added scopes and VM service account.
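
Once a new auto-provisioned pool comes up, its effective settings can be verified (a sketch; ${NODEPOOL} is the name of the new nap pool):

gcloud --project=${PROJECT} container node-pools describe ${NODEPOOL} --cluster=${CLUSTER} --zone=${ZONE} --format="yaml(config.serviceAccount,config.oauthScopes)"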


Toeplitz commented Oct 10, 2019

@jlewi Thank you for the answer. It turns out that the service accounts were not set up correctly on the auto-provisioned GPU nodes. Your suggestion fixed the issue.

Am I wrong to think this should be added to the default setup?

As a small aside, is it possible to have the auto-provisioned node be n1-standard-8 instead of n1-standard-1?

jlewi changed the title from "Pod authentication with GCR fails or fails to use user-gcp-sa.json" to "[GCP] Node Auto provisioner pool is missing VM service account and oauth scopes" on Oct 10, 2019

jlewi commented Oct 10, 2019

@Toeplitz Yes, we should be configuring the default service account on the auto-provisioned pool.

I don't think we should set a default machine size for the NAP pool; the whole idea of NAP is that it picks a VM size based on the resources requested by your pod.

/cc @amygdala

jlewi pushed a commit to jlewi/manifests that referenced this issue Oct 11, 2019
* See kubeflow/kfctl#48: 1.14.6-gke-13 has a bug with workload identity, so to work around it we temporarily pin to 1.14.6-gke-2.

* Fix node autoprovisioning defaults (kubeflow/kubeflow#4259): we need to set the default service account, otherwise we won't be able to pull images from private GCR.

  * Note: We can't set both the service account and scopes, so we only set the service account.
k8s-ci-robot pushed a commit to kubeflow/manifests that referenced this issue Oct 12, 2019

jlewi commented Oct 12, 2019

This should be fixed by kubeflow/manifests#498. We just need to wait for the next 0.7 RC and then verify it is fixed before closing this bug.


jlewi commented Oct 13, 2019

Using kubeflow/manifests 99246dd

I used GPUs to force the pod onto a NAP node. Unfortunately it still can't pull the image; there is an authorization issue.

pod.describe.txt

Here's the node spec
nap-vm.txt

serviceAccounts:
- email: kftest-1012-170754-vm@jlewi-dev.iam.gserviceaccount.com
  scopes:
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring

So it's using the right service account but is missing the OAuth scope for storage.
Here's the iam policy
iam-policy.txt

The service account has objectViewer permission.


jlewi commented Oct 15, 2019

So I confirmed with the node auto-provisioning team and there is a known issue.

The next release of GKE will support setting both service account and scopes.

So here's where things stand

  • In 0.7.0 we will set the service account

  • In 0.7.0 users will still need to run the gcloud command to set scopes on NAP

    gcloud --project=${PROJECT} beta container clusters update ${KFAPP} --enable-autoprovisioning --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only
    
  • In 0.7.X, once GKE has rolled out the fix to allow setting both, we will update our DM configs to set both.

  • Downgrading to P1 since this is not going to block 0.7.0 and we are waiting on a GKE push


stale bot commented Jan 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Jan 20, 2020
jlewi reopened this Jan 21, 2020
stale bot removed the lifecycle/stale label Jan 21, 2020

jlewi commented Jan 21, 2020

/lifecycle frozen

The API changes to allow setting both scopes and service account should be rolled out. It should be easy to update the deployment manager configs and then verify this is working.

We should do this for 1.0.

jlewi pushed a commit to jlewi/manifests that referenced this issue Jan 29, 2020
* kubeflow/kubeflow#4259 GKE now allows setting OAuth scopes and the VM service account for NAP pools. We need to use this, otherwise private images won't be accessible.

* kubeflow/kubeflow#3930 Bump NAP resource constraints because these are global constraints, not per-node constraints.

  * Bump max CPU to 128 CPUs
  * Bump memory
k8s-ci-robot pushed a commit to kubeflow/manifests that referenced this issue Jan 29, 2020

jlewi commented Jan 30, 2020

The fix is cherry-picked onto both the release and master branches.

The full spec for a 1.0 GKE cluster is below. Looking at the node auto-provisioning settings, both the service account and OAuth scopes are set.

autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    serviceAccount: kf-v1-01300120-b52-vm@kubeflow-ci-deployment.iam.gserviceaccount.com
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '128'
    resourceType: cpu
  - maximum: '2000'
    resourceType: memory
  - maximum: '16'
    resourceType: nvidia-tesla-k80

kf-v1-01300120-b52.yaml.txt
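
For reference, the same autoscaling block can be pulled from a live cluster (a sketch using gcloud's --format projection):

gcloud --project=${PROJECT} container clusters describe ${CLUSTER} --zone=${ZONE} --format="yaml(autoscaling)"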

jlewi closed this as completed Jan 30, 2020