
GCP deployment manager test handle internal errors #833

Closed
jlewi opened this issue May 19, 2018 · 9 comments
jlewi commented May 19, 2018

In #823 we observed test flakes during teardown due to internal errors.

```
Your active configuration is: [default]
+ gcloud deployment-manager --project=kubeflow-ci --quiet deployments delete z23-25b96cd-1584-c7cc
Waiting for delete [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]...
..............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396 failed.
Error in Operation [operation-1526763570795-56c95584b11f9-22f0bb38-c3ac6396]: errors:
- code: INTERNAL_ERROR
  message: "Code: '-3751873619725894346'"
```

We should make the tests more robust to these flakes.

/priority p1
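As a sketch of one way to make teardown robust, assuming the test shells out to gcloud (the command shape is taken from the log above; the function name, retry count, and delay are made up):

```python
import subprocess
import time

def delete_deployment(project, name, retries=3, delay=60, run=subprocess.run):
    """Delete a Deployment Manager deployment, retrying on failure.

    INTERNAL_ERROR / 503 failures like the one above appear transient,
    so a simple retry loop with a delay is usually enough.
    `run` is injectable so the retry logic can be tested without gcloud.
    """
    cmd = [
        "gcloud", "deployment-manager", "deployments", "delete", name,
        "--project=" + project, "--quiet",
    ]
    for attempt in range(1, retries + 1):
        # subprocess.run returns an object with a returncode attribute;
        # 0 means the delete operation succeeded.
        if run(cmd).returncode == 0:
            return True
        if attempt < retries:
            time.sleep(delay)
    return False
```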

jlewi commented May 21, 2018

Encountered this again.

```
Waiting for delete [operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e]...
...............failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e failed.
Error in Operation [operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e]: errors:
- code: INTERNAL_ERROR
  message: "Code: '-1846714646454417798'"
```

```
gcloud --project=kubeflow-ci deployment-manager operations describe operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e

endTime: '2018-05-21T12:41:08.283-07:00'
error:
  errors:
  - code: INTERNAL_ERROR
    message: "Code: '-1846714646454417798'"
httpErrorMessage: SERVICE UNAVAILABLE
httpErrorStatusCode: 503
id: '7124900659503996460'
insertTime: '2018-05-21T12:40:51.360-07:00'
kind: deploymentmanager#operation
name: operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
operationType: delete
progress: 100
selfLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/operations/operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
startTime: '2018-05-21T12:40:51.478-07:00'
status: DONE
targetId: '8916802122527187256'
targetLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-25b96cd-1597-5bbe
user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
```

jlewi commented May 21, 2018

The deployment z23-25b96cd-1597-5bbe still exists and is in an error state.

```
gcloud --project=kubeflow-ci deployment-manager deployments describe z23-25b96cd-1597-5bbe
---
fingerprint: d3WV5xXfHWwvG9HtWG-mow==
id: '8916802122527187256'
insertTime: '2018-05-21T12:36:55.923-07:00'
manifest: manifest-1526931415935
name: z23-25b96cd-1597-5bbe
operation:
  endTime: '2018-05-21T12:41:08.283-07:00'
  error:
    errors:
    - code: INTERNAL_ERROR
      message: "Code: '-1846714646454417798'"
  name: operation-1526931651153-56cbc7aa9cc68-dea36067-bcd2893e
  operationType: delete
  progress: 100
  startTime: '2018-05-21T12:40:51.478-07:00'
  status: DONE
  user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
update:
  manifest: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-25b96cd-1597-5bbe/manifests/empty-manifest-for-delete
NAME                                         TYPE                                                                                                            STATE    INTENT
admin-namespace                              kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type:/api/v1/namespaces                                              ABORTED  DELETE
bootstrap-statefulset                        kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-apps-v1:/apis/apps/v1/namespaces/{namespace}/statefulsets       FAILED   DELETE
dm-rbac                                      kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-rbac-v1:/apis/rbac.authorization.k8s.io/v1/clusterrolebindings  ABORTED  DELETE
resource-manager-api                         deploymentmanager.v2.virtual.enableService                                                                      ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow               container.v1.cluster                                                                                            ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-cpu-pool-v1   container.v1.nodePool                                                                                           ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-type          deploymentmanager.v2beta.typeProvider                                                                           ABORTED  DELETE
z23-25b96cd-1597-5bbe-kubeflow-type-rbac-v1  deploymentmanager.v2beta.typeProvider                                                                           ABORTED  DELETE
```

jlewi commented May 21, 2018

Let's try reissuing the delete:

```
gcloud --project=kubeflow-ci deployment-manager deployments delete z23-25b96cd-1597-5bbe
The following deployments will be deleted:
- z23-25b96cd-1597-5bbe

Do you want to continue (y/N)?  y

Waiting for delete [operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c]...failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c failed.
Error in Operation [operation-1526936941319-56cbdb5fb5358-2c1b9345-aa6bb86c]: errors:
- code: RESOURCE_NOT_FOUND
  message: The type [kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow
```

The error shown in the Pantheon UI is:

```
The type [kubeflow-ci/z23-25b96cd-1597-5bbe-kubeflow-type-apps-v1:/apis/apps/v1/namespaces/{namespace}/statefulsets] was not found. Consider using the delete policy 'ABANDON'.
```

jlewi commented May 21, 2018

I added

```yaml
metadata:
  deletePolicy: ABANDON
```

to all the K8s resources. Now I can delete all the K8s resources and the node pools, but I get the following error:

```
{"ResourceType":"deploymentmanager.v2.virtual.enableService","ResourceErrorCode":"412","ResourceErrorMessage":"deactivation has some internal error:[]"}
```

I tried setting the deletePolicy on that resource as well but it looks like it didn't work.
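For reference, the shape of that change in a Deployment Manager config, with a hypothetical resource name and type provider (only the `metadata.deletePolicy` lines are the actual fix; everything else is an assumed surrounding config):

```yaml
resources:
- name: admin-namespace
  # Hypothetical type-provider-backed K8s type, matching the pattern above.
  type: kubeflow-ci/my-deployment-kubeflow-type:/api/v1/namespaces
  metadata:
    # On deployment delete, abandon the resource instead of asking the
    # type provider to delete it (avoids the teardown INTERNAL_ERROR).
    deletePolicy: ABANDON
  properties:
    apiVersion: v1
    kind: Namespace
```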

jlewi added a commit to jlewi/kubeflow that referenced this issue May 22, 2018
* This config creates the K8s resources needed to run the bootstrapper
* Enable the ResourceManager API; this is used to get IAM policies
* Add IAM roles to the cloudservices account. This is needed so that
  the deployment manager has sufficient RBAC permissions to do what it needs
  to.

* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

* The bootstrapper isn't running successfully; it looks like it's trying
  to create a pytorch component but it's using an older version of the registry
  which doesn't include the pytorch operator.

* Set delete policy on K8s resources to ABANDON otherwise we get internal errors.
* We can use actions to enable APIs and then we won't try to delete
  the API when the deployment is deleted which causes errors.

fix kubeflow#833
jlewi commented May 22, 2018

So I'm observing two failure modes on #823

  1. Problems deleting the deployment because it's trying to deactivate an API
  2. Occasional failures trying to delete the statefulset for the bootstrapper

jlewi commented May 22, 2018

Happened again on presubmit for #833

```
gcloud --project=kubeflow-ci deployment-manager operations describe operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298

endTime: '2018-05-21T18:06:04.694-07:00'
error:
  errors:
  - code: INTERNAL_ERROR
    message: "Code: '-7647148574420173012'"
httpErrorMessage: SERVICE UNAVAILABLE
httpErrorStatusCode: 503
id: '6631691422962471428'
insertTime: '2018-05-21T18:05:47.132-07:00'
kind: deploymentmanager#operation
name: operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298
operationType: delete
progress: 100
selfLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/operations/operation-1526951147049-56cc104b5902b-342b1ac7-1f9f4298
startTime: '2018-05-21T18:05:47.272-07:00'
status: DONE
targetId: '486701487157597464'
targetLink: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-ci/global/deployments/z23-c4971e3-1605-9f0c
user: kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com
```

jlewi commented May 22, 2018

Looks like the deployment was successfully deleted.

jlewi commented May 24, 2018

The RESOURCE_NOT_FOUND errors occur because the type providers are deleted before the K8s resources that use them, so the subsequent deletes of those K8s resources fail.

We can fix this either by

  1. Explicitly adding a dependsOn to each K8s resource so it depends on its type provider
  2. Using references in the type value of K8s resources; this causes the K8s resources to be deleted before the type provider
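Option 2 might look like the following sketch (resource names and the descriptorUrl are hypothetical): putting a `$(ref...)` in the type field gives Deployment Manager an implicit dependency edge, so deletes happen in reverse order:

```yaml
resources:
- name: kf-type
  type: deploymentmanager.v2beta.typeProvider
  properties:
    # Placeholder endpoint; in practice this points at the cluster's API.
    descriptorUrl: https://example-cluster-endpoint/openapi/v2
- name: admin-namespace
  # $(ref.kf-type.name) makes this resource depend on kf-type, so it is
  # created after the provider and deleted before it.
  type: kubeflow-ci/$(ref.kf-type.name):/api/v1/namespaces
```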

jlewi reopened this May 24, 2018
jlewi added a commit to jlewi/kubeflow that referenced this issue May 24, 2018
…nager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.
k8s-ci-robot pushed a commit that referenced this issue May 25, 2018
…er. (#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to #836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.

* deletePolicy can't be set per resource

* Autoformat jsonnet.
jlewi commented May 29, 2018

#866 should add retries that handle internal errors.

jlewi closed this as completed May 29, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
* This config creates the K8s resources needed to run the bootstrapper
* Enable the ResourceManager API; this is used to get IAM policies
* Add IAM roles to the cloudservices account. This is needed so that
  the deployment manager has sufficient RBAC permissions to do what it needs
  to.

* Delete initialNodeCount and just make the default node pool a 1 CPU node pool.

* The bootstrapper isn't running successfully; it looks like it's trying
  to create a pytorch component but it's using an older version of the registry
  which doesn't include the pytorch operator.

* Set delete policy on K8s resources to ABANDON otherwise we get internal errors.
* We can use actions to enable APIs and then we won't try to delete
  the API when the deployment is deleted which causes errors.

fix kubeflow#833
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
…er. (kubeflow#866)

* Create python scripts for deploying Kubeflow on GCP via deployment manager.

* The scripts replace our bash commands
* For teardown we want to add retries to better handle INTERNAL_ERRORS
  with deployment manager that are causing the test to be flaky.

Related to kubeflow#836 verify Kubeflow deployed correctly with deployment manager.

* Fix resource_not_found errors in delete (kubeflow#833)

* The not found error was due to the type providers for K8s resources
  being deleted before the corresponding K8s resources. So the subsequent
  delete of the K8s resource would fail because the type provider did not
  exist.

* We fix this by using a $ref to refer to the type provider in the type field
  of K8s resources.

* deletePolicy can't be set per resource

* Autoformat jsonnet.
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
Signed-off-by: Ce Gao <gaoce@caicloud.io>
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Remove 1.15 selectors for Seldon

* Add kubeflow namespace selectors to webhook configuration

* Change seldon webhook conf to matchLabels