
Deleting a ServiceInstance or ServiceBinding while async operation in progress fails #1587

Closed
staebler opened this issue Nov 17, 2017 · 9 comments · Fixed by #1708
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@staebler
Contributor

When an async operation is in progress for a ServiceInstance or ServiceBinding, the Controller prioritizes polling the last operation over deleting the resource. If the async operation completes after the user has deleted the resource, the reconciled generation gets set to the wrong value: the generation after the delete rather than the generation before it. When the Controller then attempts to update the status of the resource to set the current operation to deprovision or unbind, the update is rejected by the API Server, because a current operation cannot be set when the reconciled generation equals the current generation.

Here is an integration test that fails but should succeed.

func TestServiceInstanceDeleteWithAsyncOperationInProgress(t *testing.T) {
	ct := controllerTest{
		t:                            t,
		broker:                       getTestBroker(),
		instance:                     getTestInstance(),
		skipVerifyingInstanceSuccess: true,
		setup: func(ct *controllerTest) {
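			// Make the fake broker provision asynchronously and keep reporting
			// the last operation as in progress on every poll.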
			ct.osbClient.ProvisionReaction.(*fakeosb.ProvisionReaction).Response.Async = true
			ct.osbClient.PollLastOperationReaction = &fakeosb.PollLastOperationReaction{
				Response: &osb.LastOperationResponse{
					State: osb.StateInProgress,
				},
			}
		},
	}
	ct.run(func(ct *controllerTest) {
		if err := util.WaitForInstanceCondition(ct.client, ct.instance.Namespace, ct.instance.Name,
			v1beta1.ServiceInstanceCondition{
				Type:   v1beta1.ServiceInstanceConditionReady,
				Status: v1beta1.ConditionFalse,
				Reason: "Provisioning",
			}); err != nil {
			t.Fatalf("error waiting for instance to be provisioning asynchronously: %v", err)
		}
		if err := ct.client.ServiceInstances(ct.instance.Namespace).Delete(ct.instance.Name, &metav1.DeleteOptions{}); err != nil {
			t.Fatalf("failed to delete instance: %v", err)
		}
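		// Let the in-flight async provision complete; the controller should
		// still deprovision and remove the deleted instance.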
		ct.osbClient.PollLastOperationReaction = &fakeosb.PollLastOperationReaction{
			Response: &osb.LastOperationResponse{
				State: osb.StateSucceeded,
			},
		}
		if err := util.WaitForInstanceToNotExist(ct.client, ct.instance.Namespace, ct.instance.Name); err != nil {
			t.Fatalf("error waiting for instance to not exist: %v", err)
		}
	})
}
staebler added the kind/bug label on Nov 17, 2017
@jboyd01
Contributor

jboyd01 commented Jan 30, 2018

I've got several issues reported against this in OpenShift. I discussed two possible approaches with @staebler:

A. At completion of async provisioning, if deletion timestamp is set, don't update the reconciled generation.

B. Store the generation in use when the async operation starts and make that the reconciled generation when the operation completes successfully.

Matt said deletion is the only pending operation that should cause this problem: the API server prevents all other operations from queuing up while an async operation is pending.

The second option may be right in the long term, but for the short term I'm pursuing option A. @kibbles-n-bytes I'd like to get your review and thoughts on this as an immediate approach.
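For illustration, a minimal sketch of what option A could look like on the async-provision success path; the helper names processAsyncProvisionSuccess and clearCurrentOperation are placeholders rather than the actual service-catalog functions:

func processAsyncProvisionSuccess(instance *v1beta1.ServiceInstance) {
	// Sketch only: if the instance was deleted while the async provision
	// was in flight, leave ReconciledGeneration at its pre-delete value so
	// the controller can still set deprovision as the current operation.
	if instance.DeletionTimestamp == nil {
		instance.Status.ReconciledGeneration = instance.Generation
	}
	clearCurrentOperation(instance) // placeholder for the real status cleanup
}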

@kibbles-n-bytes
Contributor

I'm okay with (A) for the short term. I agree (B) would be ideal long-term, and would come in handy as well if we ever wanted to support concurrent updates.

An additional alternative would be to change our logic to bail out of the poll and start attempting to deprovision. That could be done by adding more complex logic to getReconciliationActionForServiceInstance that prioritizes reconcileServiceInstanceDelete over pollServiceInstance. There are pros and cons to this, as some brokers may not be equipped to handle the sudden flip and would return an error until the provision or update finished anyway. But some could potentially handle the logic to abort a long-running operation, which would get us a significant speedup.

(... though, deprovision requires a plan_id, and in the update case we're back to not knowing which one to send... argh! 😭 )
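A rough sketch of that prioritization, assuming action constants along these lines; the exact names and return type are illustrative, not a definitive implementation:

func getReconciliationActionForServiceInstance(instance *v1beta1.ServiceInstance) reconciliationAction {
	switch {
	case instance.DeletionTimestamp != nil:
		// Prioritize the delete: stop polling the last operation and let
		// reconcileServiceInstanceDelete send the deprovision request.
		return reconcileDelete
	case instance.Status.AsyncOpInProgress:
		return reconcilePoll
	default:
		return reconcileAdd
	}
}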

@staebler
Contributor Author

I prefer bailing out of the last-operation polling when the object is deleted. I wasn't sure whether we wanted to disobey the guidance in the OSB API spec that says the following (https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#polling-last-operation).

The Platform SHOULD continue polling until the Service Broker returns a valid response or the maximum polling duration is reached.

@jboyd01
Contributor

jboyd01 commented Jan 30, 2018

I've got (A) implemented and tested and was just adding the integration test for it.

I'm concerned about (C): my feeling is that many brokers won't be able to abort immediately, although, as kibbles wrote, I'd expect them all to eventually finish the provisioning and then successfully execute the deprovision on the next pass.

@staebler, @kibbles-n-bytes, do you feel strongly about (C) over (A)?

@staebler
Contributor Author

staebler commented Jan 30, 2018

@jboyd01 I am good with (A) being the short-term solution.

As for (C), even if a broker did finish the provision before accepting the deprovision request, we would be no worse off than if service-catalog waited until the provision completed. The only difference would be that service-catalog would be sending deprovision requests (to which the broker would respond with 422 Unprocessable Entity) instead of poll-last-operation requests.

@kibbles-n-bytes
Contributor

kibbles-n-bytes commented Jan 30, 2018

@jboyd01 Same; I'm good with (A), and we should definitely get your changes in for it.

@staebler Interesting. I agree it seems like we're going against the guideline, though I think it's fine in this case. There's also precedent with the maximum retry duration for the Platform aborting a poll loop and starting to deprovision in order to perform orphan mitigation. I think we should clarify the spec, though, to make it clear that Platforms can decide to abort polling for a good reason, but that they shouldn't just abort at the first sign of a poll error.

kibbles-n-bytes pushed a commit that referenced this issue Feb 15, 2018
…nc update (#1587) (#1708)

* handle instance deletion that occurs during async provisioning (#1587)

* updated processUpdateServiceInstanceFailure(), processUpdateServiceInstanceSuccess(), and processProvisionFailure(); corrected and supplemented test cases.

* added/refined tests for delete during async instance operations

* reworked termination of async operation to avoid potential race condition

* fix race condition with a channel

* use atomic int for signaling end of async operation

* add TODO and refactor common code
@jboyd01
Contributor

jboyd01 commented Feb 16, 2018

Reopening. Pull #1708 only addressed instances; async bindings still have this same problem. Additionally, instances were addressed with a quick fix (don't bump the reconciled generation if there is a pending delete) rather than the agreed-upon correct fix (persist the async operation's generation and set that at completion).

@jboyd01 jboyd01 reopened this Feb 16, 2018
@nilebox
Contributor

nilebox commented Feb 20, 2018

How about (D): with ObservedGeneration (see #1747), we can simply update ObservedGeneration at the start of the async operation and never update it on polling completion.
As described in #1747, the combination of an up-to-date ObservedGeneration and Ready: True would be a sign of completed processing.

Then we wouldn't need to store a generation separately as proposed in (B).
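A rough sketch of (D), assuming the ObservedGeneration status field proposed in #1747; the function name here is hypothetical:

func recordAsyncOperationStart(instance *v1beta1.ServiceInstance) {
	// Record the generation being acted on when the async operation starts;
	// the poll-completion path never touches ObservedGeneration, so a delete
	// that arrives mid-operation cannot make the status claim a newer
	// generation was fully processed.
	instance.Status.ObservedGeneration = instance.Generation
	instance.Status.AsyncOpInProgress = true
}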

@jboyd01
Contributor

jboyd01 commented Feb 21, 2018

@nilebox I'm not yet up to speed on all the changes required for #1747. On the surface I agree: this looks like a good solution that doesn't require persisting the pending generation.
