This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Recurring/intermittent issue - user account env reset fails (configmaps fabric8-environments not found) #3500

Closed
ldimaggi opened this issue May 7, 2018 · 25 comments

Comments

@ldimaggi
Collaborator

ldimaggi commented May 7, 2018

Resetting a user's environment sometimes fails with this error:


Failed to load resource: the server responded with a status of 500 (Internal Server Error)
_cleanup:1 Form submission canceled because the form is not connected
api.openshift.io/api/spaces...
Failed to load resource: the server responded with a status of 500 (Internal Server Error)

Over time, I have seen this error in fewer than 5% of resets. While debugging tests today, I have been seeing it in roughly 20% of them, at random.

@joshuawilson
Member

I just ran into this. I ran the reset again and it cleared it out.

@ldimaggi
Collaborator Author

ldimaggi commented May 8, 2018

For the most part, a retry resolves this problem. However, I have seen instances where multiple retries are needed. No details other than the 500 error are logged.

@joshuawilson
Member

I think this is a duplicate of #2867

@ldimaggi
Collaborator Author

ldimaggi commented May 9, 2018

Seeing some additional patterns.

Before a reset operation that goes on to fail, the reset env page displays only a subset of the user's current spaces, and the horizontal row listing the spaces is truncated. In the following example, there were actually 2 spaces in the user's account:

[screenshot: "1"]

Here's the correct set of spaces:

[screenshot: "4"]

So - perhaps the error does not occur in the actual deletion of the spaces, but in the collecting of the spaces before the reset operation is performed?

@ldimaggi
Collaborator Author

Seeing a pattern where additional information is available:

The 500/server error returned includes:

configmaps fabric8-environments not found

@ldimaggi
Collaborator Author

This is happening repeatedly - this should be a SEV2.

@joshuawilson
Member

could be related to #3556

@ldimaggi
Collaborator Author

+1 - that is the same error message as in #3556

@ldimaggi ldimaggi changed the title from "Recurring/intermittent issue - user account env reset fails" to "Recurring/intermittent issue - user account env reset fails (configmaps fabric8-environments not found)" May 24, 2018
@joshuawilson
Member

@aslakknutsen do you have any info or ideas on this?

@joshuawilson joshuawilson added this to the Sprint 150 milestone May 29, 2018
@ebaron
Collaborator

ebaron commented Jun 8, 2018

I think what is happening (worked out with some help from @jiekang) is that the frontend calls the Delete Space API and the Clean Tenant API asynchronously [1]. Previously, these two APIs operated on different backend resources, but that is no longer true now that Delete Space also cleans up OpenShift resources. Interleaving the two API calls could trigger a variety of errors.

Perhaps we should make cleaning up OpenShift resources optional in the Delete Space API. Then it could do so when deleting individual spaces, but defer to the more robust/faster cleanup in Clean Tenant when resetting the environment.

[1] https://github.com/fabric8-ui/fabric8-ui/blob/7494b283e1b86875aae6592b119ace9c86dd2d3c/src/app/profile/cleanup/cleanup.component.ts#L90

@jiekang
Collaborator

jiekang commented Jun 12, 2018

@ebaron I wouldn't mind fixing the race condition in cleanup.component.ts. What's the suggested API to use now? Is having it run the delete spaces one by one, followed by clean tenant, an okay fix?

@ebaron
Collaborator

ebaron commented Jun 12, 2018

@jiekang Yes, I think what you described would be the way to fix it in the frontend. We could also add an optional parameter to the API where you can skip deleting OpenShift resources when deleting a space. Then I don't believe we would have to synchronize between deleting spaces and cleaning the tenant.
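
A minimal sketch of the frontend-side fix discussed here: delete the spaces one by one, then clean the tenant. The `deleteSpace` and `cleanTenant` callbacks are hypothetical stand-ins for the real API calls, not the actual fabric8-ui services.

```typescript
// Hypothetical signatures standing in for the Delete Space and Clean Tenant
// API calls; the real frontend services may look different.
type DeleteSpace = (spaceId: string) => Promise<void>;
type CleanTenant = () => Promise<void>;

// Delete spaces one at a time and only then clean the tenant, so the two
// APIs never operate on the same OpenShift resources at the same moment.
async function resetEnvironmentSequentially(
  spaceIds: string[],
  deleteSpace: DeleteSpace,
  cleanTenant: CleanTenant,
): Promise<void> {
  for (const id of spaceIds) {
    // Wait for each deletion to finish before starting the next one.
    await deleteSpace(id);
  }
  // Reached only if every deletion above completed without rejecting.
  await cleanTenant();
}
```

This trades speed for safety: nothing overlaps, but the reset takes as long as all the calls added together.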

@jiekang jiekang self-assigned this Jun 12, 2018
@jiekang
Collaborator

jiekang commented Jun 12, 2018

Ah; so the longer path is to provide a parameter in the API, and then have the UI use it. Hmm...

@jiekang
Collaborator

jiekang commented Jun 12, 2018

@ebaron Were you planning to take a look at providing that parameter?

@ebaron
Collaborator

ebaron commented Jun 12, 2018

@jiekang Sure, if there are no objections, I can add this parameter to the Delete Space API. The frontend will have to make sure that this argument is set to true only when resetting the environment.

@jiekang
Collaborator

jiekang commented Jun 12, 2018

I've assigned myself here as well. I can alter the frontend cleanup code to set the parameter to true.

@ldimaggi
Collaborator Author

Just to confirm - the fix will be to delete spaces one by one, followed by clean tenant - correct?

@jiekang
Collaborator

jiekang commented Jun 12, 2018

@ldimaggi We're currently going with a different fix:

Give the delete space API a parameter that, when true, keeps it from clashing with the clean tenant API. The front-end is then free to call the delete space and clean tenant APIs at the same time.
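
A rough sketch of that approach, assuming the new parameter is passed as a boolean flag (it is named skipCluster later in this thread). The `deleteSpace` and `cleanTenant` callbacks are hypothetical wrappers around the two APIs, not the real frontend code.

```typescript
// Hypothetical wrappers: deleteSpace takes the space ID plus the flag that
// tells the backend to leave OpenShift resources alone.
type DeleteSpace = (spaceId: string, skipCluster: boolean) => Promise<void>;
type CleanTenant = () => Promise<void>;

// Start every deletion and the tenant cleanup together. With the flag set,
// Delete Space should not touch the resources that Clean Tenant is clearing.
async function resetEnvironmentConcurrently(
  spaceIds: string[],
  deleteSpace: DeleteSpace,
  cleanTenant: CleanTenant,
): Promise<void> {
  const deletions = spaceIds.map((id) => deleteSpace(id, true));
  // Promise.all rejects as soon as any one of the calls rejects.
  await Promise.all([...deletions, cleanTenant()]);
}
```

Because nothing waits on anything else, the reset should take roughly as long as the slowest single call.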

@jiekang
Collaborator

jiekang commented Jun 12, 2018

However I'm open to any opinions otherwise.

@ebaron
Collaborator

ebaron commented Jun 12, 2018

I am open to suggestions as well. The clean tenant API clears a superset of what is cleared by the delete space API. When resetting the environment, this makes any OpenShift cleanup done by the delete space API redundant and also prone to racing with the clean tenant API.

@jiekang
Collaborator

jiekang commented Jun 12, 2018

I'd prefer a single endpoint exist for the 'reset environment' action that handles it however it wants, but that might be exposing too much at a single point. If there aren't any opinions I think we can just go ahead with the optional parameter until someone objects :)

@ebaron
Collaborator

ebaron commented Jun 15, 2018

@joshuawilson @jiekang Just a heads up, I opened a PR for the backend portion of the fix I proposed above (#3500 (comment)): fabric8-services/fabric8-wit#2121

@ebaron ebaron modified the milestones: Sprint 150, Sprint 151 Jun 20, 2018
ebaron added a commit to fabric8-services/fabric8-wit that referenced this issue Jul 11, 2018
…2121)

When a user resets their environment, the front-end makes calls to the Delete Space API and Clean Tenant API. It makes these calls asynchronously, and due to both APIs acting on the same resources, I suspect this is the reason we are seeing a variety of errors in openshiftio/openshift.io#3500.

Since the Clean Tenant API cleans out the user's entire namespaces, it is not necessary in this case for Delete Space to delete anything from OpenShift. This PR adds an optional parameter, skipCluster, to the Delete Space API; when it is true, Delete Space will not attempt to delete any deployments from OpenShift. The front-end could then use this parameter only when resetting the user's environment. An alternative would be for the front-end to synchronize between deleting spaces and calling Clean Tenant, but this would be less efficient.

Fixes (partially): openshiftio/openshift.io#3500
kwk added a commit to openshiftio/saas-openshiftio that referenced this issue Jul 16, 2018
**commit** fabric8-services/fabric8-wit@cefe36a
**Author:** Rohit Kumar Rai <rohitkrai03@gmail.com>
**Date:**   Wed Jul 11 17:54:40 2018 +0530

Changed SHA checksum for dep-darwin-amd64 and setting UNAME_S variable (fabric8-services/fabric8-wit#2163)

- For macOS, the dep package was updated with a new SHA. The `Makefile` verifies the dep package against a hardcoded SHA checksum.
- Updated the SHA value to the new one.
- Added initialization of the `$(UNAME_S)` variable, `UNAME_S=$(shell uname -s)`, since it was being used but never set, which led to explicitly exporting the variable from the shell.

Similar PR for fabric8-auth - fabric8-services/fabric8-auth#549

Note: We can probably look into automating the process where SHA value is fetched dynamically instead of hard coding.

**commit** fabric8-services/fabric8-wit@f13e094
**Author:** Elliott Baron <ebaron@redhat.com>
**Date:**   Wed Jul 11 13:46:08 2018 -0400

Add parameter to Delete Space to skip deleting OpenShift resources (fabric8-services/fabric8-wit#2121)

When a user resets their environment, the front-end makes calls to the Delete Space API and Clean Tenant API. It makes these calls asynchronously, and due to both APIs acting on the same resources, I suspect this is the reason we are seeing a variety of errors in openshiftio/openshift.io#3500.

Since the Clean Tenant API cleans out the user's entire namespaces, it is not necessary in this case for Delete Space to delete anything from OpenShift. This PR adds an optional parameter, skipCluster, to the Delete Space API; when it is true, Delete Space will not attempt to delete any deployments from OpenShift. The front-end could then use this parameter only when resetting the user's environment. An alternative would be for the front-end to synchronize between deleting spaces and calling Clean Tenant, but this would be less efficient.

Fixes (partially): openshiftio/openshift.io#3500

**commit** fabric8-services/fabric8-wit@cb85aa7
**Author:** Ibrahim Jarif <jarifibrahim@gmail.com>
**Date:**   Fri Jul 13 12:57:43 2018 +0530

Refactor search_blackbox_test.go (fabric8-services/fabric8-wit#2148)

**commit** eae146f7d3cdf1d6c40588608e60d688aaf6ad83
**Author:** Dhriti Shikhar <dhriti.shikhar.rokz@gmail.com>
**Date:**   Fri Jul 13 16:55:40 2018 +0530

Increase paging limit (fabric8-services/fabric8-wit#2166)

**commit** fabric8-services/fabric8-wit@d198813
**Author:** Baiju Muthukadan <baiju.m.mail@gmail.com>
**Date:**   Fri Jul 13 17:46:12 2018 +0530

Revert "List work items part of child iterations (fabric8-services/fabric8-wit#2146)" (fabric8-services/fabric8-wit#2168)

This reverts commit 68996d1555a29a9ef310403403855b47559d5a71.


This is required to address #3974

**commit** fabric8-services/fabric8-wit@a4d9061
**Author:** Michael Kleinhenz <kleinhenz@redhat.com>
**Date:**   Fri Jul 13 16:24:37 2018 +0200

feat(boardview): Board View for WIT. (fabric8-services/fabric8-wit#2111)
aslakknutsen pushed a commit to openshiftio/saas-openshiftio that referenced this issue Jul 17, 2018
@ebaron
Collaborator

ebaron commented Jul 19, 2018

@jiekang there is now a "skipCluster" boolean argument for the delete space API in production. By setting this to true (e.g. DELETE https://openshift.io/api/spaces/780dc9bb-0f04-4e39-97cf-c3110be78005?skipCluster=true), WIT will not attempt to delete any OpenShift resources. We should specify this only when calling the delete space API when resetting the user's environment, since tenant will clean it up, and not when deleting an individual space.
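
A small sketch of how a caller might build the request URL so that the flag is set only during an environment reset. The helper name and its parameters are hypothetical; only the `/api/spaces/<id>` path shape and the `skipCluster=true` query parameter come from the example above.

```typescript
// Hypothetical helper: builds the Delete Space URL, adding skipCluster=true
// only when the deletion is part of an environment reset.
function deleteSpaceUrl(
  apiBase: string,
  spaceId: string,
  resettingEnvironment: boolean,
): string {
  const url = `${apiBase}/spaces/${encodeURIComponent(spaceId)}`;
  // Leave the parameter off for an individual space deletion, so that WIT
  // still cleans up that space's OpenShift resources.
  return resettingEnvironment ? `${url}?skipCluster=true` : url;
}
```

For instance, `deleteSpaceUrl("https://openshift.io/api", "780dc9bb-0f04-4e39-97cf-c3110be78005", true)` produces the URL quoted in the comment above.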

@jiekang
Collaborator

jiekang commented Jul 19, 2018

Okay I will look at opening a PR for that.

@ebaron
Collaborator

ebaron commented Aug 21, 2018

All PRs have been merged to make use of the "skipCluster" argument when deleting spaces during an environment reset, and are now in production. If the issue reoccurs, feel free to reopen.

@ebaron ebaron closed this as completed Aug 21, 2018