
Workaround workload identity bug and fix node autoprovisioning. #498

Merged
merged 1 commit on Oct 12, 2019

Conversation

@jlewi (Contributor) commented Oct 11, 2019

Which issue is resolved by this Pull Request:
Resolves #

Description of your changes:

Checklist:

  • Unit tests have been rebuilt:
    1. cd manifests/tests
    2. make generate
    3. make test


* See kubeflow/kfctl#48: 1.14.6-gke-13 has a bug with workload identity,
  so to work around it we temporarily pin to 1.14.6-gke-2.

* Fix node autoprovisioning defaults (kubeflow/kubeflow#4259): we need to set
  the default service account, otherwise we won't be able to pull images
  from private GCR.

  * Note: We can't set both the service account and the OAuth scopes, so we
    only set the service account (a config sketch follows below).
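
For reference, a minimal sketch of what the pinned cluster spec could look like. This is an assumption for illustration only: the field names mirror the container v1beta1 API visible in the `gcloud ... describe` output below, but the file layout, cluster name, and service-account value are placeholders, not the actual manifest changed in this PR.

```yaml
# Hypothetical excerpt of the cluster spec (field names follow the
# container v1beta1 API; layout and values are illustrative only).
cluster:
  name: kubeflow
  # 1.14.6-gke-13 has a workload identity bug (kubeflow/kfctl#48),
  # so temporarily pin to the last known-good patch release.
  initialClusterVersion: 1.14.6-gke-2
  autoscaling:
    enableNodeAutoprovisioning: true
    autoprovisioningNodePoolDefaults:
      # Set only the service account: the API returns a 400 if both
      # oauthScopes and serviceAccount are specified (see error below).
      serviceAccount: kubeflow-vm@my-project.iam.gserviceaccount.com
    resourceLimits:
    - resourceType: cpu
      maximum: '20'
    - resourceType: memory
      maximum: '200'
```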
@jlewi (Contributor, Author) commented Oct 11, 2019

/assign @lluunn
/assign @gabrielwen
/hold because still doing some manual verification

@jlewi (Contributor, Author) commented Oct 11, 2019

I checked the cluster settings:

gcloud --project=jlewi-dev beta container clusters describe jlewi-v07-002
addonsConfig:
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: jlewi-v07-002-vm@jlewi-dev.iam.gserviceaccount.com
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '20'
    resourceType: cpu
  - maximum: '200'
    resourceType: memory
  - maximum: '8'
    resourceType: nvidia-tesla-k80

The autoprovisioning service account is set. It looks like the OAuth scopes got populated, but they don't include devstorage.

When I tried setting both, though, I got an error:

ERRO[0010] Updating jlewi-v07-002 error: &{Code:RESOURCE_ERROR Location:/deployments/jlewi-v07-002/resources/jlewi-v07-002 Message:{"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"At most one of node_autoprovisioning_defaults.oauth_scopes and node_autoprovisioning_defaults.service_account should be specified.","status":"INVALID_ARGUMENT","statusMessage":"Bad Request","requestPath":"https://container.googleapis.com/v1beta1/projects/jlewi-dev/locations/us-east1-d/clusters","httpMethod":"POST"}} ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:380"
Error: couldn't apply KfApp:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 400 with message: gcp apply could not update deployment manager Error could not update deployment manager entries; Updating jlewi-v07-002 error(400): BAD REQUEST

Let's hope that when it pulls from GCR it uses IAM and not scopes. We should run a simple test to verify; a sketch of one is below.
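
One way to run that test, sketched here as an assumption (the pod name and image path are placeholders, not anything from this PR): schedule a pod that forces a pull of a private GCR image and watch whether it starts on an autoprovisioned node.

```yaml
# Hypothetical smoke test: a pod whose only job is to pull a private
# GCR image. If it reaches Completed, the node pulled via IAM alone.
apiVersion: v1
kind: Pod
metadata:
  name: gcr-pull-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    # Placeholder: any image that lives in the project's private GCR repo.
    image: gcr.io/<project-id>/<private-image>:<tag>
    command: ["echo", "image pull succeeded"]
```

If the pod completes, the node's service account can pull from private GCR through IAM; if it gets stuck in ErrImagePull, the missing devstorage scope is still a problem.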

@jlewi (Contributor, Author) commented Oct 11, 2019

/assign @kunmingg

@kunmingg (Contributor) commented
/lgtm
/approve

@k8s-ci-robot (Contributor) commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kunmingg


@jlewi (Contributor, Author) commented Oct 12, 2019

Error waiting for metadata deployment:

util.py                    429 ERROR    Timeout waiting for deployment metadata-deployment in namespace kubeflow to be ready
util.py                     45 INFO     Running: kubectl describe deployment -n kubeflow metadata-deployment cwd=None
util.py                     60 INFO     Subprocess output:
util.py                     71 INFO     Name:                   metadata-deployment
util.py                     71 INFO     Namespace:              kubeflow
util.py                     71 INFO     CreationTimestamp:      Sat, 12 Oct 2019 00:56:00 +0000
util.py                     71 INFO     Labels:                 component=server
util.py                     71 INFO     kustomize.component=metadata
util.py                     71 INFO     Annotations:            deployment.kubernetes.io/revision=1
util.py                     71 INFO     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"component":"server","kustomize.component":"metadata"},"name":"metad...
util.py                     71 INFO     Selector:               component=server,kustomize.component=metadata
util.py                     71 INFO     Replicas:               3 desired | 3 updated | 3 total | 0 available | 3 unavailable
util.py                     71 INFO     StrategyType:           RollingUpdate
util.py                     71 INFO     MinReadySeconds:        0

@jlewi (Contributor, Author) commented Oct 12, 2019

/test all
