
GCP deployment manager test handle internal errors #833

Closed
jlewi opened this issue May 19, 2018 · 9 comments
jlewi commented May 19, 2018

In #823 we observed test flakes during teardown due to internal errors.

```
Your active configuration is: [default]
+ gcloud deployment-manager --project=kubeflow-ci --quiet deployments delete z23-25b96cd-1584-c7cc
Waiting for delete [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]...
..............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396 failed.
Error in Operation [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]: errors:
- code: INTERNAL_ERROR
  message: "Code: '-3751873619725894346'"
```

We should make the tests more robust to these flakes.

/priority p1
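As a sketch of one way to make teardown robust, assuming the test shells out to gcloud (the command shape is taken from the log above; the function name, retry count, and delay are made up):

```python
import subprocess
import time

def delete_deployment(project, name, retries=3, delay=60, run=subprocess.run):
    """Delete a Deployment Manager deployment, retrying on failure.

    INTERNAL_ERROR / 503 failures like the one above appear transient,
    so a simple retry loop with a delay is usually enough.
    `run` is injectable so the retry logic can be tested without gcloud.
    """
    cmd = [
        "gcloud", "deployment-manager", "deployments", "delete", name,
        "--project=" + project, "--quiet",
    ]
    for attempt in range(1, retries + 1):
        # subprocess.run returns an object with a returncode attribute;
        # 0 means the delete operation succeeded.
        if run(cmd).returncode == 0:
            return True
        if attempt < retries:
            time.sleep(delay)
    return False
```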

jlewi commented May 21, 2018

Encountered this again.

```
Waiting for delete [operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e]...
...............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e failed.
Error in Operation [operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e]: errors:
- code: INTERNAL_ERROR
  message: "Code: '-1846714646454417798'"
```

```
gcloud --project=kubeflow-ci deployment-manager operations describe operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e

endTime: '2018-05-21T12:41:08.283-07:00'
error:
  errors:
  - code: INTERNAL_ERROR
    message: "Code: '-1846714646454417798'"
httpErrorMessage: SERVICE UNAVAILABLE
httpErrorStatusCode: 503
id: '7124900659503996460'
insertTime: '2018-05-21T12:40:51.360-07:00'
kind: deploymentmanager#operation
name: operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
operationType: delete
progress: 100
selfLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/operations/operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
startTime: '2018-05-21T12:40:51.478-07:00'
status: DONE
targetId: '8916802122527187256'
targetLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-25b96cd-1597-5bbe
user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
```

jlewi commented May 21, 2018

The deployment z23-25b96cd-1597-5bbe still exists and is in an error state.

```
gcloud --project=kubeflow-ci deployment-manager deployments describe z23-25b96cd-1597-5bbe
---
fingerprint: d3WV5xXfHWwvG9HtWG-mow==
id: '8916802122527187256'
insertTime: '2018-05-21T12:36:55.923-07:00'
manifest: manifest-1526931415935
name: z23-25b96cd-1597-5bbe
operation:
  endTime: '2018-05-21T12:41:08.283-07:00'
  error:
    errors:
    - code: INTERNAL_ERROR
      message: "Code: '-1846714646454417798'"
  name: operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
  operationType: delete
  progress: 100
  startTime: '2018-05-21T12:40:51.478-07:00'
  status: DONE
  user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
update:
  manifest: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-25b96cd-1597-5bbe/manifests/empty-manifest-for-delete
NAME                                         TYPE                                                                                                            STATE    INTENT
admin-namespace                              kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type:/api/v1/namespaces                                              ABORTED  DELETE
bootstrap-statefulset                        kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-apps-v1:/apis/apps/v1/namespaces/{namespace}/statefulsets       FAILED   DELETE
dm-rbac                                      kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-rbac-v1:/apis/rbac.authorization.k8s.io/v1/clusterrolebindings  ABORTED  DELETE
resource-manager-api                         deploymentmanager.v2.virtual.enableService                                                                      ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow               container.v1.cluster                                                                                            ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-cpu-pool-v1   container.v1.nodePool                                                                                           ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-type          deploymentmanager.v2beta.typeProvider                                                                           ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-type-rbac-v1  deploymentmanager.v2beta.typeProvider                                                                           ABORTED  DELETE
```

jlewi commented May 21, 2018

Let's try reissuing the delete:

```
gcloud --project=kubeflow-ci deployment-manager deployments delete z23-25b96cd-1597-5bbe
The following deployments will be deleted:
- z23-25b96cd-1597-5bbe

Do you want to continue (y/N)?  y

Waiting for delete [operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c]...failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c failed.
Error in Operation [operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c]: errors:
- code: RESOURCE_NOT_FOUND
  message: The type [kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow
```

The error shown in the Pantheon UI is:

```
The type [kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-apps-v1:/apis/apps/v1/namespaces/{namespace}/statefulsets] was not found. Consider using the delete policy 'ABANDON'.
```

jlewi commented May 21, 2018

I added

```yaml
metadata:
  deletePolicy: ABANDON
```

to all the K8s resources. Now I can delete all the K8s resources and the node pools, but I get the following error:

```
{"ResourceType":"deploymentmanager.v2.virtual.enableService","ResourceErrorCode":"412","ResourceErrorMessage":"deactivation has some internal error:[]"}
```

I tried setting the deletePolicy on that resource as well but it looks like it didn't work.
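For reference, the shape of that change in a Deployment Manager config, with a hypothetical resource name and type provider (only the `metadata.deletePolicy` lines are the actual fix; everything else is an assumed surrounding config):

```yaml
resources:
- name: admin-namespace
  # Hypothetical type-provider-backed K8s type, matching the pattern above.
  type: kubeflow-ci/my-deployment-kubeflow-type:/api/v1/namespaces
  metadata:
    # On deployment delete, abandon the resource instead of asking the
    # type provider to delete it (avoids the teardown INTERNAL_ERROR).
    deletePolicy: ABANDON
  properties:
    apiVersion: v1
    kind: Namespace
```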

jlewi added a commit to jlewi/kubeflow that referenced this issue May 22, 2018
* This config creates the K8s resources needed to run the bootstrapper
* Enable the ResourceManager API; this is used to get IAM policies
* Add IAM roles to the cloudservices account. This is needed so that
  the deployment manager has sufficient RBAC permissions to do what it needs
  to.

* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

* The bootstrapper isn't running successfully; it looks like it's trying
  to create a pytorch component but it's using an older version of the registry
  which doesn't include the pytorch operator.

* Set delete policy on K8s resources to ABANDON otherwise we get internal errors.
* We can use actions to enable APIs and then we won't try to delete
  the API when the deployment is deleted which causes errors.

fix kubeflow#833
jlewi commented May 22, 2018

So I'm observing two failure modes on #823

  1. Problems deleting the deployment because it's trying to deactivate an API
  2. Occasional failures trying to delete the statefulset for the bootstrapper

jlewi commented May 22, 2018

Happened again on presubmit for #833

```
gcloud --project=kubeflow-ci deployment-manager operations describe operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298

endTime: '2018-05-21T18:06:04.694-07:00'
error:
  errors:
  - code: INTERNAL_ERROR
    message: "Code: '-7647148574420173012'"
httpErrorMessage: SERVICE UNAVAILABLE
httpErrorStatusCode: 503
id: '6631691422962471428'
insertTime: '2018-05-21T18:05:47.132-07:00'
kind: deploymentmanager#operation
name: operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298
operationType: delete
progress: 100
selfLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/operations/operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298
startTime: '2018-05-21T18:05:47.272-07:00'
status: DONE
targetId: '486701487157597464'
targetLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-c4971e3-1605-9f0c
user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
```

jlewi commented May 22, 2018

Looks like the deployment was successfully deleted.

jlewi commented May 24, 2018

The RESOURCE_NOT_FOUND errors occur because the type providers are deleted before the K8s resources that use them, so the subsequent deletes of those K8s resources fail.

We can fix this either by

  1. Explicitly adding a dependsOn to each K8s resource so it depends on its type provider
  2. Using references in the type value of K8s resources; this causes the K8s resources to be deleted before the type provider
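Option 2 might look like the following sketch (resource names and the descriptorUrl are hypothetical): putting a `$(ref...)` in the type field gives Deployment Manager an implicit dependency edge, so deletes happen in reverse order:

```yaml
resources:
- name: kf-type
  type: deploymentmanager.v2beta.typeProvider
  properties:
    # Placeholder endpoint; in practice this points at the cluster's API.
    descriptorUrl: https://example-cluster-endpoint/openapi/v2
- name: admin-namespace
  # $(ref.kf-type.name) makes this resource depend on kf-type, so it is
  # created after the provider and deleted before it.
  type: kubeflow-ci/$(ref.kf-type.name):/api/v1/namespaces
```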

jlewi reopened this May 24, 2018
jlewi added a commit to jlewi/kubeflow that referenced this issue May 24, 2018
…nager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.
k8s-ci-robot pushed a commit that referenced this issue May 25, 2018
…er. (#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to #836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.

* deletePolicy can't be set per resource

* Autoformat jsonnet.
jlewi commented May 29, 2018

#866 should add retries that handle internal errors.

jlewi closed this as completed May 29, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
* This config creates the K8s resources needed to run the bootstrapper
* Enable the ResourceManager API; this is used to get IAM policies
* Add IAM roles to the cloudservices account. This is needed so that
  the deployment manager has sufficient RBAC permissions to do what it needs
  to.

* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

* The bootstrapper isn't running successfully; it looks like it's trying
  to create a pytorch component but it's using an older version of the registry
  which doesn't include the pytorch operator.

* Set delete policy on K8s resources to ABANDON otherwise we get internal errors.
* We can use actions to enable APIs and then we won't try to delete
  the API when the deployment is deleted which causes errors.

fix kubeflow#833
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
…er. (kubeflow#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.

* deletePolicy can't be set per resource

* Autoformat jsonnet.
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
Signed-off-by: Ce Gao <gaoce@caicloud.io>
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Remove 1.15 selectors for Seldon

* Add kubeflow namespace selectors to webhook configuration

* Change seldon webhook conf to matchLabels