Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Duplicate OSB create/delete requests #1639

Closed
nilebox opened this issue Dec 22, 2017 · 9 comments
Closed

Duplicate OSB create/delete requests #1639

nilebox opened this issue Dec 22, 2017 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@nilebox
Copy link
Contributor

nilebox commented Dec 22, 2017

I've noticed duplicate OSB requests for a single change in ServiceInstance in different scenarios:

  • 5 deprovisioning requests reported there
  • 2 create requests (logs from ups-broker):
I1221 23:46:09.728136       1 server.go:111] Get Service Broker Catalog...
I1221 23:46:09.728152       1 controller.go:57] Catalog()
I1221 23:54:58.838849       1 server.go:137] CreateServiceInstance 434423cf-8a6e-482a-937e-ddf95f992486...
I1221 23:54:58.839113       1 controller.go:87] CreateServiceInstance()
I1221 23:54:58.839195       1 controller.go:118] Created User Provided Service Instance:
&{434423cf-8a6e-482a-937e-ddf95f992486 0xc4200fd1c0}
I1221 23:54:58.937091       1 server.go:137] CreateServiceInstance 434423cf-8a6e-482a-937e-ddf95f992486...
I1221 23:54:58.937207       1 controller.go:87] CreateServiceInstance()
I1221 23:54:58.937225       1 controller.go:118] Created User Provided Service Instance:
&{434423cf-8a6e-482a-937e-ddf95f992486 0xc4200fd1c8}

By looking at the Service Catalog logs, it seems that there is an issue with picking up a stale object from the queue (before it is synchronized with the latest version from etcd) which leads to duplicate processing:

I1221 23:54:58.942144       1 round_trippers.go:436] PUT https://10.96.0.1:443/apis/servicecatalog.k8s.io/v1beta1/namespaces/test-ns/serviceinstances/ups-instance123/status 409 Conflict in 4 milliseconds
I1221 23:54:58.942444       1 request.go:836] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on serviceinstances.servicecatalog.k8s.io \"ups-instance123\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"ups-instance123","group":"servicecatalog.k8s.io","kind":"serviceinstances"},"code":409}
E1221 23:54:58.942585       1 controller_instance.go:1844] ServiceInstance "test-ns/ups-instance123": Failed to update status: Operation cannot be fulfilled on serviceinstances.servicecatalog.k8s.io "ups-instance123": the object has been modified; please apply your changes to the latest version and try again
I1221 23:54:58.942617       1 controller.go:237] Error syncing ServiceInstance test-ns/ups-instance123: Operation cannot be fulfilled on serviceinstances.servicecatalog.k8s.io "ups-instance123": the object has been modified; please apply your changes to the latest version and try again

@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

This conflict is a "normal" situation for Kubernetes-style APIs with eventual consistency and idempotency, but it could be a problem for OSB brokers who don't necessarily guarantee the idempotency semantics.

@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

My guess is that the problem is that we pick up the events from the queue "too early" without giving enough time to receive the latest version of the instance we just updated processing a previous event:

  1. ServiceInstance is created by the user: Event e1, object version o1.
  2. Service Catalog picks up the event e1 from the queue and starts processing it.
  3. Service Catalog updates the status of the ServiceInstance to "provisioning": Event e2, object version o2.
  4. Service Catalog successfully sends the OSB request and receives a success response. It marks the ServiceInstance with Ready: True condition. Event e3, object version o3.
  5. Service Catalog finishes processing of the e1 event.
  6. Service Catalog picks up the e2 event, and in the cache there is o2 version of the object (o3 hasn't been received yet), which still says that the provisioning has not been done, so it goes and sends the OSB request again. It tries to send a status update with Ready: True but fails with a conflict (because the update is based on the outdated o2 version of the object). This step can actually occur several times if we receive another event in the meantime (as I mentioned above, I've seen delete request sent 5 times).
  7. Service Catalog finally picks up the e3 event with o3 version in the cache, and this time the processing ends with no-op.

@nilebox nilebox added the kind/bug Categorizes issue or PR as related to a bug. label Dec 22, 2017
@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

The naive solution for this problem I see is to add an artificial delay in between processing events of a single ServiceInstance instance (and other resource types too). Just adding 2-3 seconds should be enough.

We can use a workqueue.DelayingInterface instead of workqueue.Interface for that: https://github.com/kubernetes/client-go/blob/master/util/workqueue/delaying_queue.go#L32
and invoke queue.AddAfter(key, <delay>) instead of queue.Add(key) in controller_instance.go

It will mean that there will be an artificial pause between making a change in the ServiceInstance spec and processing it, but I think it will significantly reduce the likelihood of duplicate requests (if the delay will be big enough for syncing the cache with the latest status).

If there are no other suggestions, I can give it a try.

@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

After changing the code to

c.instanceQueue.AddAfter(key, 3*time.Second)

(adding a delay of 3 seconds for adding ServiceInstance key to the queue), I cannot reproduce the issue anymore:

I1222 05:57:17.028697       1 server.go:111] Get Service Broker Catalog...
I1222 05:57:17.028753       1 controller.go:57] Catalog()
I1222 05:57:49.550884       1 server.go:137] CreateServiceInstance a25c8a51-f4b9-43e8-9a1c-47a9b3ac5d3d...
I1222 05:57:49.551098       1 controller.go:87] CreateServiceInstance()
I1222 05:57:49.551121       1 controller.go:118] Created User Provided Service Instance:
&{a25c8a51-f4b9-43e8-9a1c-47a9b3ac5d3d 0xc42000e298}
I1222 05:58:40.238310       1 server.go:166] RemoveServiceInstance a25c8a51-f4b9-43e8-9a1c-47a9b3ac5d3d...
I1222 05:58:40.238326       1 controller.go:138] RemoveServiceInstance()
I1222 05:59:07.547323       1 server.go:137] CreateServiceInstance 1264669f-a334-448f-95eb-5e71a52e1c1e...
I1222 05:59:07.547450       1 controller.go:87] CreateServiceInstance()
I1222 05:59:07.547473       1 controller.go:118] Created User Provided Service Instance:
&{1264669f-a334-448f-95eb-5e71a52e1c1e 0xc42000e300}
I1222 05:59:19.636256       1 server.go:166] RemoveServiceInstance 1264669f-a334-448f-95eb-5e71a52e1c1e...
I1222 05:59:19.636270       1 controller.go:138] RemoveServiceInstance()
I1222 06:01:32.648728       1 server.go:137] CreateServiceInstance 3fe4302c-256a-43bf-9b03-5bde6845602f...
I1222 06:01:32.648801       1 controller.go:87] CreateServiceInstance()
I1222 06:01:32.648807       1 controller.go:118] Created User Provided Service Instance:
&{3fe4302c-256a-43bf-9b03-5bde6845602f 0xc420078250}
I1222 06:01:44.736378       1 server.go:166] RemoveServiceInstance 3fe4302c-256a-43bf-9b03-5bde6845602f...
I1222 06:01:44.736475       1 controller.go:138] RemoveServiceInstance()

so it seems to work?

If we consider this a valid fix, I can send a PR to make this change in all our controller queues.

@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

The MutationCache (see mutation_cache.go) interface could be another option to resolve this issue: every time we modify ServiceInstance we could update it in this cache, and it will be up-to-date by the time we start processing the next event from the queue. It's not widely used in the Kubernetes codebase, so not sure if it's a recommended to use interface. Also not clear if it's compatible with the Lister interface we currently use to fetch objects from sharedIndexInformer.

@staebler
Copy link
Contributor

This is a duplicate of #1363.
See #1213 for the introduction of this bug.

@nilebox
Copy link
Contributor Author

nilebox commented Dec 22, 2017

@staebler I agree that the cause of the issue is the same, but it wasn't mentioned that this behavior leads to duplicate OSB requests. As I mentioned above, it's probably okay to have duplicate processing within Kubernetes world, but OSB spec doesn't guarantee idempotency so the provisioning/deprovisioning could break because of that.

@staebler
Copy link
Contributor

@nilebox The OSB spec does guarantee idempotency. For example, if the same provision request is made again after the first one is successful, the spec states that the broker must respond with a 200 OK. That is why there has not been a greater push to resolve this issue in service catalog.

@nilebox
Copy link
Contributor Author

nilebox commented Dec 24, 2017

@staebler OSB doesn't guarantee a deterministic behaviour on concurrent requests though. If Service Catalog issues a duplicate request while initial request is still in progress, it might lead to either 200 or 422, if I understand correctly. See https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#blocking-operations

But I personally agree that brokers should be idempotent.

luksa added a commit to luksa/service-catalog that referenced this issue Mar 1, 2018
…tance object

This prevents multiple provision requests from being sent to the broker
and posting old resource versions to the API server.

Fixes kubernetes-retired#1639 and kubernetes-retired#1363
luksa added a commit to luksa/service-catalog that referenced this issue Mar 5, 2018
…tance object

This prevents multiple provision requests from being sent to the broker
and posting old resource versions to the API server.

Fixes kubernetes-retired#1639 and kubernetes-retired#1363
@nilebox nilebox closed this as completed in 07ef743 Mar 7, 2018
jeremyrickard pushed a commit to jeremyrickard/service-catalog that referenced this issue Mar 13, 2018
…es-retired#1779)

* Simplify generateChecksumOfParameters() in tests
* Make the reconciliation loop exit after each update of the ServiceInstance object

This prevents multiple provision requests from being sent to the broker
and posting old resource versions to the API server.

Fixes kubernetes-retired#1639 and kubernetes-retired#1363
* Make resolveReferences return true only when an error didn't occur
* Remove unnecessary var
* Don't return the updatedInstance in controller.resolveReferences()
* Improve name of test helper function
* Exit the reconciliation loop iteration after each update when handling ServiceInstance update and deprovision calls
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

2 participants