
Deleting a ServiceInstance or ServiceBinding while async operation in progress fails #1587

Closed
staebler opened this issue Nov 17, 2017 · 9 comments · Fixed by #1708
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@staebler
Contributor

When an async operation is in progress for a ServiceInstance or ServiceBinding, the Controller prioritizes polling the last operation over deleting the resource. If the async operation completes after the user has deleted the resource, the reconciled generation gets set to the wrong value: the generation after the delete rather than the generation before it. When the Controller then attempts to update the status of the resource to set the current operation to deprovision or unbind, the update is rejected by the API Server, because a current operation cannot be set when the reconciled generation equals the current generation.

Here is an integration test that fails but should succeed.

func TestServiceInstanceDeleteWithAsyncOperationInProgress(t *testing.T) {
	ct := controllerTest{
		t:                            t,
		broker:                       getTestBroker(),
		instance:                     getTestInstance(),
		skipVerifyingInstanceSuccess: true,
		setup: func(ct *controllerTest) {
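			// Make the fake broker provision asynchronously and keep reporting
			// the last operation as in progress on every poll.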
			ct.osbClient.ProvisionReaction.(*fakeosb.ProvisionReaction).Response.Async = true
			ct.osbClient.PollLastOperationReaction = &fakeosb.PollLastOperationReaction{
				Response: &osb.LastOperationResponse{
					State: osb.StateInProgress,
				},
			}
		},
	}
	ct.run(func(ct *controllerTest) {
		if err := util.WaitForInstanceCondition(ct.client, ct.instance.Namespace, ct.instance.Name,
			v1beta1.ServiceInstanceCondition{
				Type:   v1beta1.ServiceInstanceConditionReady,
				Status: v1beta1.ConditionFalse,
				Reason: "Provisioning",
			}); err != nil {
			t.Fatalf("error waiting for instance to be provisioning asynchronously: %v", err)
		}
		if err := ct.client.ServiceInstances(ct.instance.Namespace).Delete(ct.instance.Name, &metav1.DeleteOptions{}); err != nil {
			t.Fatalf("failed to delete instance: %v", err)
		}
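		// Let the in-flight async provision complete; the controller should
		// still deprovision and remove the deleted instance.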
		ct.osbClient.PollLastOperationReaction = &fakeosb.PollLastOperationReaction{
			Response: &osb.LastOperationResponse{
				State: osb.StateSucceeded,
			},
		}
		if err := util.WaitForInstanceToNotExist(ct.client, ct.instance.Namespace, ct.instance.Name); err != nil {
			t.Fatalf("error waiting for instance to not exist: %v", err)
		}
	})
}
staebler added the kind/bug label on Nov 17, 2017
@jboyd01
Contributor

jboyd01 commented Jan 30, 2018

I've got several issues reported against this in OpenShift. I discussed two possible approaches with @staebler:

A. At completion of async provisioning, if deletion timestamp is set, don't update the reconciled generation.

B. Store the generation in use when the async operation starts and make that the reconciled generation when the operation completes successfully.

Matt said deletion is the only pending operation that should cause this problem: the API server prevents all other operations from queuing up while an async operation is pending.

The second option may be right in the long term, but for the short term I'm pursuing option A. @kibbles-n-bytes I'd like to get your review and thoughts on this as an immediate approach.
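For illustration, a minimal sketch of what option A could look like on the async-provision success path; the helper names processAsyncProvisionSuccess and clearCurrentOperation are placeholders rather than the actual service-catalog functions:

func processAsyncProvisionSuccess(instance *v1beta1.ServiceInstance) {
	// Sketch only: if the instance was deleted while the async provision
	// was in flight, leave ReconciledGeneration at its pre-delete value so
	// the controller can still set deprovision as the current operation.
	if instance.DeletionTimestamp == nil {
		instance.Status.ReconciledGeneration = instance.Generation
	}
	clearCurrentOperation(instance) // placeholder for the real status cleanup
}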

@kibbles-n-bytes
Contributor

I'm okay with (A) for the short term. I agree (B) would be ideal long-term, and would come in handy as well if we ever wanted to support concurrent updates.

An additional alternative would be to change our logic to bail out of the poll and start attempting to deprovision. That could be done by adding more complex logic to getReconciliationActionForServiceInstance that prioritizes reconcileServiceInstanceDelete over pollServiceInstance. There are pros and cons to this, as some brokers may not be equipped to handle the sudden flip and would return an error until the provision or update finished anyway. But some could potentially handle the logic to abort a long-running operation, which would get us a significant speedup.

(... though, deprovision requires a plan_id, and in the update case we're back to not knowing which one to send... argh! 😭 )
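A rough sketch of that prioritization, assuming action constants along these lines; the exact names and return type are illustrative, not a definitive implementation:

func getReconciliationActionForServiceInstance(instance *v1beta1.ServiceInstance) reconciliationAction {
	switch {
	case instance.DeletionTimestamp != nil:
		// Prioritize the delete: stop polling the last operation and let
		// reconcileServiceInstanceDelete send the deprovision request.
		return reconcileDelete
	case instance.Status.AsyncOpInProgress:
		return reconcilePoll
	default:
		return reconcileAdd
	}
}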

@staebler
Contributor Author

I prefer bailing out of the last-operation polling when the object is deleted. I wasn't sure whether we wanted to disobey the guidance in the OSB API spec that says the following (https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#polling-last-operation).

The Platform SHOULD continue polling until the Service Broker returns a valid response or the maximum polling duration is reached.

@jboyd01
Contributor

jboyd01 commented Jan 30, 2018

I've got (A) implemented and tested and was just adding the integration test for it.

I'm concerned about (C): my feeling is that many brokers won't be able to abort immediately, although, as kibbles wrote, I'd expect them all to eventually finish the provisioning and then successfully execute the deprovision on the next pass.

@staebler, @kibbles-n-bytes, do you feel strongly about (C) over (A)?

@staebler
Contributor Author

staebler commented Jan 30, 2018

@jboyd01 I am good with (A) being the short-term solution.

As for (C), even if a broker did finish the provision before accepting the deprovision request, we would be no worse off than if service-catalog waited until the provision completed. The only difference would be that service-catalog would be sending deprovision requests (to which the broker would respond with 422 Unprocessable Entity) instead of poll-last-operation requests.

@kibbles-n-bytes
Contributor

kibbles-n-bytes commented Jan 30, 2018

@jboyd01 Same; I'm good with (A), and we should definitely get your changes in for it.

@staebler Interesting. I agree it seems like we're going against the guideline, though I think it's fine in this case. There's also precedent with the maximum retry duration for the Platform aborting a poll loop and starting to deprovision in order to perform orphan mitigation. I think we should clarify the spec, though, to make it clear that Platforms can decide to abort polling for a good reason, but that they shouldn't just abort at the first sign of a poll error.

kibbles-n-bytes pushed a commit that referenced this issue Feb 15, 2018
…nc update (#1587) (#1708)

* handle instance deletion that occurs during async provisioning (#1587)

* updated processUpdateServiceInstanceFailure(), processUpdateServiceInstanceSuccess(), and processProvisionFailure(); corrected and supplemented test cases.

* added/refined tests for delete during async instance operations

* reworked termination of async operation to avoid potential race condition

* fix race condition with a channel

* use atomic int for signaling end of async operation

* add TODO and refactor common code
@jboyd01
Contributor

jboyd01 commented Feb 16, 2018

Reopening. Pull #1708 only addressed instances; async bindings still have this same problem. Additionally, instances were addressed with a quick fix (don't bump the reconciled generation if there is a pending delete) rather than the agreed-upon correct fix (persist the async operation's generation and set that at completion).

@jboyd01 jboyd01 reopened this Feb 16, 2018
@nilebox
Contributor

nilebox commented Feb 20, 2018

How about (D): with ObservedGeneration (see #1747), we can simply update ObservedGeneration at the start of the async operation and never update it on polling completion.
As described in #1747, the combination of an up-to-date ObservedGeneration and Ready: True would be a sign of completed processing.

Then we wouldn't need to store a generation separately as proposed in (B).
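A rough sketch of (D), assuming the ObservedGeneration status field proposed in #1747; the function name here is hypothetical:

func recordAsyncOperationStart(instance *v1beta1.ServiceInstance) {
	// Record the generation being acted on when the async operation starts;
	// the poll-completion path never touches ObservedGeneration, so a delete
	// that arrives mid-operation cannot make the status claim a newer
	// generation was fully processed.
	instance.Status.ObservedGeneration = instance.Generation
	instance.Status.AsyncOpInProgress = true
}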

@jboyd01
Contributor

jboyd01 commented Feb 21, 2018

@nilebox I'm not yet up to speed on all the changes required for #1747. On the surface I agree: this looks like a good solution that doesn't require persisting the pending generation.
