Start deploying the bootstrapper via deployment manager. #823

Merged (1 commit) on May 23, 2018


jlewi
Contributor

jlewi commented May 17, 2018

  • This config creates the K8s resources needed to run the bootstrapper

  • Enable the ResourceManager API; this is used to get IAM policies

  • Add IAM roles to the cloudservices account. This is needed so that
    the deployment manager has sufficient RBAC permissions to do what it needs
    to.

  • Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

  • The bootstrapper isn't running successfully; it looks like it's trying
    to create a pytorch component but it's using an older version of the registry,
    which doesn't include the pytorch operator.

Related to: #802 Use deployment manager to run bootstrapper

Related to #802 On GCP trigger bootstrapper with deployment manager
Related to #757 Install Kubeflow via deployment manager
Related to #833 Internal errors
/assign @kunmingg



@jlewi
Contributor Author

jlewi commented May 18, 2018

Test failed because of an IAM permission issue

ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation [operation-1526618835540-56c73a5665820-80e4de3c-fba39e36]: errors:
- code: RESOURCE_ERROR
  location: /deployments/z23-cb334c0-1566-349b/resources/patch-iam-policy
  message: '{"ResourceType":"gcp-types/cloudresourcemanager-v1:cloudresourcemanager.projects.setIamPolicy","ResourceErrorCode":"403","ResourceErrorMessage":{"code":403,"message":"The caller does not have permission","status":"PERMISSION_DENIED","statusMessage":"Forbidden","requestPath":"https://cloudresourcemanager.googleapis.com/v1/projects/kubeflow-ci:setIamPolicy","httpMethod":"POST"}}'

The cloudservices account in the test project needs permission to set IAM policies (the failing call is setIamPolicy).

I added the Project IAM Admin role to the cloudservices account in project kubeflow-ci.
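
For readers following along, here is a minimal sketch of the read-modify-write pattern behind an IAM policy patch: fetch the policy, merge in a binding, and write the result back. It only manipulates a dict shaped like an IAM policy (a `bindings` list of `role` / `members` entries) and makes no API calls. The helper name `add_binding` and the project number are hypothetical; `roles/resourcemanager.projectIamAdmin` is assumed to be the role ID behind "Project IAM Admin".

```python
import copy


def add_binding(policy, role, member):
    """Return a copy of `policy` with `member` bound to `role`.

    Existing bindings are preserved, and adding the same member twice
    leaves the policy unchanged.
    """
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == role:
            members = binding.setdefault("members", [])
            if member not in members:
                members.append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical project number, for illustration only.
PROJECT_NUMBER = "123456789012"
member = f"serviceAccount:{PROJECT_NUMBER}@cloudservices.gserviceaccount.com"

current = {"bindings": [{"role": "roles/editor", "members": [member]}]}
patched = add_binding(current, "roles/resourcemanager.projectIamAdmin", member)
```

Merging into the fetched policy, rather than writing a fresh one, is what keeps existing bindings from being dropped when the policy is set.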

# 2. Create two separate deployments and launch the boot strapper
# after the cluster is created.
#
# Two separate deployments doesn't make much sense; we could just use

Contributor

Having both initialNodeCount and cpu-pool-initialNodeCount can be confusing. Can the initialNodeCount property be removed from here and defaulted to 1 in cluster.jinja?

Contributor Author

Done

{% set K8S_ENDPOINTS = {'': 'api/v1', '-v1beta1-extensions': 'apis/extensions/v1beta1'} %}
{% set RBAC_TYPE_NAME = TYPE_NAME + '-rbac-v1' %}

{% set K8S_ENDPOINTS = {'': 'api/v1', '-v1beta1-extensions': 'apis/extensions/v1beta1', '-rbac-v1': 'apis/rbac.authorization.k8s.io/v1'} %}
{% set CPU_POOL = 'cpu-pool-' + properties['pool-version'] %}

Contributor

Use {{ CPU_POOL }} on line 68 instead of duplicating this

# Wait for the type provider to be created.
- {{ TYPE_NAME }}

{# StatefulSet needs a service. #}

Contributor

I don't think this is correct.
I was able to create a StatefulSet without a Service:

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: website
spec:
  serviceName: "nginx"
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
EOF

Contributor Author

That's what I thought. Thanks for trying it out.

Contributor

Thanks

@jlewi
Contributor Author

jlewi commented May 18, 2018

/test all

@ankushagarwal
Contributor

/lgtm
/approve

/hold

Feel free to run /hold cancel when ready to submit

@jlewi
Contributor Author

jlewi commented May 19, 2018

The test failed with an internal error while tearing down the deployment.

Your active configuration is: [default]
+ gcloud deployment-manager --project=kubeflow-ci --quiet deployments delete z23-25b96cd-1584-c7cc
Waiting for delete [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]...
..............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396 failed.
Error in Operation [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]: errors:
- code: INTERNAL_ERROR
 message: "Code: '-3751873619725894346'"

@jlewi
Contributor Author

jlewi commented May 19, 2018

/test all

@jlewi
Contributor Author

jlewi commented May 19, 2018

@kunmingg @ankushagarwal Can you PTAL? I had to fix a bug that was causing the tests to fail, and I also updated the config to actually run the bootstrapper.

@ankushagarwal
Contributor

/lgtm
/approve

@ankushagarwal
Contributor

/hold cancel

@jlewi
Contributor Author

jlewi commented May 21, 2018

/test all

k8s-ci-robot removed the lgtm label May 21, 2018
@ankushagarwal
Contributor

/lgtm
/approve

@jlewi
Contributor Author

jlewi commented May 21, 2018

/hold

* This config creates the K8s resources needed to run the bootstrapper
* Enable the ResourceManager API; this is used to get IAM policies
* Add IAM roles to the cloudservices account. This is needed so that
  the deployment manager has sufficient RBAC permissions to do what it needs
  to.

* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

* The bootstrapper isn't running successfully; it looks like it's trying
  to create a pytorch component but it's using an older version of the registry,
  which doesn't include the pytorch operator.

* Set the delete policy on K8s resources to ABANDON; otherwise we get internal errors.
* Use actions to enable APIs; that way we don't try to delete
  the API when the deployment is deleted, which causes errors.

fix kubeflow#833
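
As a rough illustration of the last two bullets, the sketch below expresses them as a Deployment Manager template written in Python rather than Jinja. It is not the config from this PR. The `generate_config(context)` entry point, the `servicemanagement.services.enable` action string, and the placement of `deletePolicy` under the resource's `metadata` are all assumptions that should be checked against the Deployment Manager documentation. The resource names, the namespace name, and the `namespaceType` property are hypothetical.

```python
from types import SimpleNamespace


def generate_config(context):
    """Build a resource list for a hypothetical deployment."""
    project = context.env["project"]

    # An action issues an API call instead of creating a managed resource,
    # so deleting the deployment does not try to delete (disable) the API.
    # The action string is an assumption; verify it before use.
    enable_api = {
        "name": "resource-manager-api",  # hypothetical resource name
        "action": "gcp-types/servicemanagement-v1:servicemanagement.services.enable",
        "properties": {
            "consumerId": "project:" + project,
            "serviceName": "cloudresourcemanager.googleapis.com",
        },
    }

    # ABANDON asks Deployment Manager to drop the resource from the
    # deployment on delete without deleting the underlying K8s object.
    namespace = {
        "name": "bootstrapper-namespace",  # hypothetical resource name
        "type": context.properties["namespaceType"],  # hypothetical property
        "properties": {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": "kubeflow-admin"},  # hypothetical namespace
        },
        "metadata": {"deletePolicy": "ABANDON"},
    }
    return {"resources": [enable_api, namespace]}


# Stand-in for the context object Deployment Manager passes to templates.
fake_context = SimpleNamespace(
    env={"project": "my-project"},
    properties={"namespaceType": "my-project/k8s-type:/api/v1/namespaces"},
)
config = generate_config(fake_context)
```

Running the function locally with a stand-in context, as above, is a cheap way to inspect the generated resources before creating a deployment.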
@jlewi
Contributor Author

jlewi commented May 22, 2018

/test all
The most recent test failure was because we ran out of IP address quota.

@jlewi
Contributor Author

jlewi commented May 22, 2018

Another internal error while deleting the deployment:

+ gcloud deployment-manager --project=kubeflow-ci --quiet deployments delete z23-c4971e3-1605-9f0c
Waiting for delete [operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298]...
...............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298 failed.
Error in Operation [operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298]: errors:
- code: INTERNAL_ERROR
 message: "Code: '-7647148574420173012'"

@jlewi
Contributor Author

jlewi commented May 22, 2018

/test all

@jlewi
Contributor Author

jlewi commented May 22, 2018

@ankushagarwal Could you take another look? I had to update the tests again, and I switched to using actions to enable the APIs so that we don't need a separate deployment.

@ankushagarwal
Contributor

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankushagarwal


@jlewi
Contributor Author

jlewi commented May 22, 2018

/hold cancel

@jlewi
Contributor Author

jlewi commented May 23, 2018

/test all

k8s-ci-robot merged commit 4cffc20 into kubeflow:master May 23, 2018