
Reconciliation Proposal #997

Closed

Conversation

alaypatel07
Contributor

Describe what this PR does and why we need it:

Add spec-reconciliation.md

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 26, 2018
@alaypatel07 alaypatel07 changed the title Add spec-reconciliation.md Reconciliation Proposal Jun 26, 2018
Member

@djzager djzager left a comment

I think this is a good start. My recommendation would be to focus on one approach and refine that.

My understanding is that the service-catalog has a reconciliation process it runs whenever it relists. Have you looked into how they handle it?

2. Fetch new specs
3. Add all the fetched specs to the datastore

This leads to a bunch of bugs [Bug 1583495](https://bugzilla.redhat.com/show_bug.cgi?id=1583495), [Bug 1577810](https://bugzilla.redhat.com/show_bug.cgi?id=1583495), [Bug 1586345](https://bugzilla.redhat.com/show_bug.cgi?id=1586345)
Member

I think these would be better added as a list with a small summary of the problem.

Contributor Author

yes I can add a summary describing the problem

Member

I think this one needs to be updated [Bug 1577810](https://bugzilla.redhat.com/show_bug.cgi?id=1583495)



Instead, the broker can take the following actions:
####Approach#1
Member

I think this should be ### Approach 1

Contributor Author

ok sure


Either approach will take care of all the objectives and bugs.

##Other ideas
Member

My gut feeling is that these should be separate proposals and are not immediately related to the reconciliation problem.

1. add new specs to the datastore
2. delete that specs that are removed from datastore and do not have a service instance.
3. mark the specs that are removed from datastore and has a service instance
2. During deprovision, if the instance getting deprovisioned is the last one delete the spec from datastore
Member

How would you know this is the last service instance associated with a particular bundle spec?

Contributor Author

Either maintain a counter of the number of instances provisioned in bundle.Spec or compare all the instance specs with the spec of the instance being deprovisioned. I think the former would be better

Contributor

The bundle spec should only get deleted in the case that it had been previously marked for deletion, but was not deleted because it had an outstanding instance, correct?

@alaypatel07
Contributor Author

@djzager both approaches have pros and cons; which approach do you think I should pick to refine?

And yes, I started looking into the service catalog reconciliation, but the biggest difference is that in their environment they get all the changes as callbacks. So I stopped looking, thinking their approach would not be applicable in our environment, since we use polling to sync the specs.

Describes broker behavior for:
1) untimely provision request
2) out of sync provision request
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 26, 2018
Contributor

@shawn-hurley shawn-hurley left a comment

I think that we should go with approach 2 for the reconciliation.

I have inline comments about the other approaches.
Once a decision has been made on the approach to take, we need to update this doc accordingly.

### Approaches to safely deleting specs

The broker can take the following actions to safely delete the specs:
#### Approach#1
Contributor

NIT: space between # and 1

Contributor

I would actually remove the # and make it #### Approach 1

Contributor

Also, let's summarize each approach with a paragraph before going into the individual steps. Approach 1 seems to be an in-place merge of the specs. Approach 2 seems to use a reaper task to remove specs marked for deletion. Alternatively, we could just change the headline to include that.

Approach 1 - In place merge

Approach 2 - Reaper task

1. If the fetch fails increment a failed connection counter. If the failed connection counter reaches a limit delete all the specs that do not have a service instance
2. If the fetch succeeds:
1. add new specs to the datastore
2. delete that specs that are removed from datastore and do not have a service instance.
Contributor

I think that we should be looking at bundle instances and not service instances

#### Approach#1
The broker always sends an error code if it is busy provisioning. OSB [spec](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2) does not have an appropriate error code for busy.

#### Approach#2
Contributor

I think that a combination of 2 and 3 is an option. In this scenario we start the provision with some sort of waiting mechanism that will get notified when bootstrap finishes. While this is happening, async provision last_operation requests can be called just like they are now and will error if the spec is deleted.

Contributor

-1 to Approach 1: sending an error code because we are busy is bad and, as you already pointed out, not OSB-spec compliant.

-1 to Approach 2: by itself it is inadequate; we can't block the request, we must return within 60 seconds, which is why we have the async provision concept.

Approach 3 is the most correct, in my opinion: we begin the provision and, if we can't find the spec we need, we fail.

Contributor

I can also see the combo approach @shawn-hurley suggested above would be acceptable.

Contributor

Agreed with the above comments. I don't think 1 is reasonable, 2 shouldn't block. It would be semantically correct to respond with a 202 saying "okay, we've accepted your request and we're working on it". The fact that we're waiting on the bootstrap to complete is a broker implementation detail; it has nothing to do with the OSB spec.

Once the bootstrap completes, the job waiting should then investigate the current state of the specs. If the requested instance refers to a spec that has been deleted, we can mark the job failed. If the spec still exists, proceed as normal.
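A minimal Go sketch of that flow, under stated assumptions: all names (`provisionJob`, the `bootstrapDone` channel, the `specExists` callback) are hypothetical stand-ins, not the broker's actual job machinery.

```go
package main

import "fmt"

// provisionJob models the async work behind a 202 Accepted response: it
// blocks until the in-flight bootstrap completes, then re-checks whether
// the requested spec survived reconciliation.
func provisionJob(specID string, bootstrapDone <-chan struct{}, specExists func(string) bool) string {
	<-bootstrapDone // wait for bootstrap to finish
	if !specExists(specID) {
		return "failed" // spec was deleted during bootstrap
	}
	return "succeeded" // spec still present; proceed as normal
}

func main() {
	done := make(chan struct{})
	close(done) // pretend bootstrap has already finished

	exists := func(id string) bool { return id == "postgresql-apb" }

	fmt.Println(provisionJob("postgresql-apb", done, exists))
	fmt.Println(provisionJob("deleted-apb", done, exists))
}
```

The HTTP layer would return 202 immediately; only the job goroutine waits, which keeps the response inside the spec's timing expectations.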

Member

I like approach 3 the best and agree with the above comments.


These ideas are not directly related to the objectives of this proposal but are mentioned here and might become independent proposals of their own when addressing those issues.

#### APB Versioning
Contributor

I think this section is a different PR.

Contributor

Create another proposal called "bundle versioning" and move these two sections to that proposal. Feel free to link to that proposal from this proposal if you want. But I agree with @shawn-hurley these are two different topics. It will be easier to review in a separate doc as well.

Contributor

+1

Contributor

@jmrodri jmrodri left a comment

Okay, finished reviewing. A few typos and grammar issues, and a suggestion for changing the workflow to approach 1 of the safely-deleting scenario.

### Approaches to handle untimely provision request


To to handle a case where provision is requested after the broker has started bootstrapping but before it has completed bootstrapping, the broker can take the following approaches:
Contributor

TYPO: To to should just be To

Successfully serves the provision request (very similar to scenario #1): This approach might lead to unexpected problems during execution.

#### Approach#2
Error out the provision request. It is clear with the polling approach of the Service Catalog to update the [specs](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2), that this scenario will eventually popup, but the specs doesnt have an error code specifically meant for this scenario.
Contributor

TYPO: popup -> pop up

TYPO: doesnt -> don't

# Spec Reconciliation Proposal

## Introduction
Add a way for broker to sync the list of specs it has with registries while handling the following corner cases:
Contributor

GRAMMAR: for broker -> for the broker

Contributor

I would reword the first sentence:

Refactor the way the broker syncs the list of specs it has with registries while handling the following corner cases

It already has a way to sync them, we're just changing how it works.

1. Temporary loss of connection with registry
2. A spec removed from the registry but has provisioned instance/s
3. The service catalog sends a provision request to the broker after the broker has started bootstrapping but before it has completed bootstrapping (untimely provision request)
4. The service catalog sends a provision request to the broker for specs that has been marked deleted in the broker (out of sync provision request)
Contributor

GRAMMAR: specs that has -> specs that have

Contributor

👍 the corner cases seem correct.


The broker can take the following actions to safely delete the specs:
#### Approach#1
1. Fetch the specs from registry
Contributor

GRAMMAR: from registry -> from a registry






### Approaches to safely deleting specs

The broker can take the following actions to safely delete the specs:
#### Approach#1
Contributor

I like Approach 1 best but I'd like to change it a bit.

  1. Fetch the specs from a registry
  2. Fetch the specs from the datastore
  3. Delete any specs in the datastore currently marked for deletion, assuming there is no instance that has a reference to this spec which is marked for deletion.
    i. if there is a reference, leave the spec alone
    ii. if there is no reference, delete the spec
  4. For each spec from the registry, get the spec from the datastore
    i. if found, update it, and save to datastore
    ii. if not found, add it to the datastore
  5. For each spec in the datastore, NOT in registry list, mark it for deletion.

Connection failure
i. if connection fails, retry 10 times, backing off 10 seconds each time, then give up until the next bootstrap loop runs
if the bootstrap loop interval is < our retry time, I'd abort and wait for the next bootstrap loop to run.

Catalog payload should change such that it returns all specs that are NOT marked for deletion

I would not do reference counting, I would simply look for any instances that have a reference to a spec that was marked for deletion. If so, then we do not delete it.

Contributor

I don't think we should mark all specs for deletion if the fetch fails. If we get NOTHING back, then yes. But if there is an ERROR, then no, leave them alone until the next bootstrap call.

Contributor

@eriknelson @jmrodri I think this is where we should discuss the approach to the actual reconciliation loop proposal.

I find approach 1 to be limiting and to overload different pieces. The bootstrap could mark that it no longer sees a spec, and have another process decide whether it should be deleted. This would give the bootstrap loop less work to do, and might allow the bootstrap to avoid timing out.

I also think some sort of control loop that only deals with managing bundles gets us much closer to being kubelike, and will better separate our logic.

Member

I agree with @shawn-hurley; we already frequently hit 504s with the broker route when bootstrapping, so I would prefer not to add too much computation to the bootstrap loop.

Contributor

@eriknelson eriknelson left a comment

Overall looks good @alaypatel07. I'm a little uneasy with reference counters; they can get out of sync, especially in concurrency situations, and I'm not sure the complexity of coordinating these is worth it.



To handle the case where service catalog sends a request to provision an instance of a spec that has been marked for deletion the broker can take the following approaches:

#### Approach#1
Successfully serves the provision request (very similar to scenario #1): This approach might lead to unexpected problems during execution.
Contributor

I'm not sure what it means to "successfully serve the provision request" here, but I don't think this is the correct approach. If we know the spec has been marked for deletion, IMO it should effectively not exist anymore, and therefore we should respond as if it doesn't exist.

Error out the provision request. It is clear with the polling approach of the Service Catalog to update the [specs](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2), that this scenario will eventually popup, but the specs doesnt have an error code specifically meant for this scenario.

#### Approach#3
It can send a 202 Accepted status, suggesting that provisioning is in progress. The service catalog will then keep polling to check if the service instance is fully provisioned. The [last operation](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-1) endpoint has a mechanism whereby the broker can tell if the provision is a success or a failure.
Contributor

I'm not convinced we should respond with a 202 on the initial response if we know the spec has been marked for deletion. It's effectively gone at that point, and I don't think we should accept any more requests for instances (I think I'm advocating a bad-request response here).



@jmrodri
Contributor

jmrodri commented Jun 28, 2018

So I had a private conversation with Alay earlier today, and another one with Erik & Shawn. Figured I'll post my comments here:

Not a huge reason; if the reaper was preferred by the majority, I'd be okay with it. I just wouldn't do that if I were implementing this.

A reaper will have to be yet another "goroutine" running that interacts with a bootstrap "goroutine", and each provision/deprovision, etc. is another set of "goroutines".

I don't think deprovision should ever delete a spec. If something is marked for deletion and a deprovision comes in, the spec should remain marked for deletion until the bootstrap reruns. My in-place bootstrap loop is basically a reaper process as well: when the bootstrap runs again, the first thing it should do is remove the specs that were marked for deletion and are no longer provisioned.

And I don't think provision or deprovision should ever have to have logic about a marked-for-deletion spec. When they ask for a spec, the method should filter it for them.

though I might be glossing over some details, I'd have to look at the code again to see how we get them for provision, deprovision. /me assumes it's just a getSpec or something like that

    if spec, err = a.dao.GetSpec(specID); err != nil {
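That filtering could be sketched in Go as below; the `spec` type, the map-backed store, and `getSpec` are hypothetical stand-ins (the real `dao.GetSpec` signature may differ).

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical stand-in for the broker's bundle spec type.
type spec struct {
	ID                string
	MarkedForDeletion bool
}

var errSpecNotFound = errors.New("spec not found")

// getSpec hides marked-for-deletion specs from its callers, so provision
// and deprovision never need their own mark-for-deletion logic: a marked
// spec is indistinguishable from a missing one.
func getSpec(store map[string]spec, id string) (spec, error) {
	s, ok := store[id]
	if !ok || s.MarkedForDeletion {
		return spec{}, errSpecNotFound
	}
	return s, nil
}

func main() {
	store := map[string]spec{
		"live":   {ID: "live"},
		"marked": {ID: "marked", MarkedForDeletion: true},
	}
	if _, err := getSpec(store, "marked"); err != nil {
		fmt.Println("marked spec is hidden:", err)
	}
	s, _ := getSpec(store, "live")
	fmt.Println("found:", s.ID)
}
```

The same filter would back the catalog payload change mentioned above, so the catalog only ever returns specs not marked for deletion.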

3. the admin can know what version of apb is provisioned in the cluster


#### Multiple tags
Member

This should be a separate proposal as well IMO.


This would require modification to the bootstrap function where after fetching all the specs, the broker needs to iterate through the list of named specs and fetch specific tags from the registry

Adding a method to adaptor interface that takes two inputs, image name and image tag, and fetches its spec. The broker will then call this method iteratively for all named pair of specs and tags
Contributor Author

@jmrodri @eriknelson @shawn-hurley I think the in-place merge (approach 1) could still be a good way to achieve reconciliation. My modified approach is as follows:

  1. Maintain a bundle instance counter along with each spec in the CRD.
  2. Compare the list of registrySpecs with crdSpecs.
  3. Add all the specs that are present in registrySpecList but absent in crdSpecList, with the bundle instance counter initialized to 0
  4. Remove all specs present in crdSpecList but absent in registrySpecList and with bundle instance counter value 0

Increment the spec bundle instance counter during provision
Decrement the spec bundle instance counter during deprovision

I think steps 2 and 3 in the above algorithm would be the ones that take most of the time. But those steps would be required even if the reaper approach is taken, in order to mark the specs for deletion. Maintaining the counter value would help save the time-consuming computation required to determine whether a spec has a bundle instance. This will also avoid all the concurrency problems that complicate the reaper approach.

Contributor

I don't like maintaining a separate counter because that is information we can calculate by looking for instances that have references to the specs. While that's not trivial, it is accurate. My gut tells me those counters are going to get out of date and create another issue. But that's just me, I'll let others chime in

Contributor Author

@jmrodri @eriknelson @shawn-hurley I think modifying the reaper approach as follows could be one way to handle reconciliation with some simplicity:

  1. There is a background reaper process which listens for data from bootstrap. Bootstrap runs periodically as it does now.
  2. The bootstrap process is responsible for determining which specs should be added and adding them. The bootstrap will also mark specs for deletion, so that other functions (catalog and provision) can handle deleted specs.
  3. The bootstrap process will then determine which specs should be deleted and send them to the reaper (one by one as it determines them).
  4. The reaper always runs in the background waiting to get a spec from bootstrap. When the reaper gets a spec, it deletes the spec if there are no bundle instances; otherwise it ignores the spec.

The reaper then tries to safely delete the specs it gets from bootstrap. If it is not able to delete a spec safely, it ignores it. Since the adding of specs is done by bootstrap, and safe removal is done by the reaper, they no longer compete for a shared resource. This will always ensure a desirable state of the broker.

Contributor

Unfortunately, the delete is not the expensive operation; it's the iteration over outstanding service instances to determine which instances correspond to which specs. So it's not desirable to have the bootstrap process determine the specs that are safe to delete; the purpose of the reaper is to delegate this work to the background.

@jmrodri
Contributor

jmrodri commented Jul 3, 2018

I met with @eriknelson & @alaypatel07 to discuss the bootstrap. The main issue is that deleting ALL specs causes many bugs, primarily in the cases where they are going to be added again.

Each of the 3 bugs identified in the proposal was a case where we deleted the specs before we were able to process the specs from the registry. Either we hadn't added the spec yet, or there was an error reaching the registry. Both scenarios exhibited the same behavior because our process was to delete ALL specs first.

Given the above, we came to the conclusion that the simplest solution to this problem is to simply not delete all of the specs, but to process them inline. Basically back to approach #1 as outlined by:

#997 (comment)

  1. Fetch the specs from a registry
  2. Fetch the specs from the datastore
  3. Delete any specs in the datastore currently marked for deletion, assuming there is no instance that has a reference to this spec which is marked for deletion.
    i. if there is a reference, leave the spec alone
    ii. if there is no reference, delete the spec
  4. For each spec from the registry, get the spec from the datastore
    i. if found, update it, and save to datastore
    ii. if not found, add it to the datastore
  5. For each spec in the datastore, NOT in registry list, mark it for deletion.
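Steps 3-5 above can be sketched in Go as follows; the `Spec` and `Instance` types and the map-based "datastore" are hypothetical stand-ins, not the broker's actual code.

```go
package main

import "fmt"

// Spec and Instance are illustrative stand-ins for the broker's types.
type Spec struct {
	ID                string
	MarkedForDeletion bool
}

type Instance struct {
	SpecID string
}

// reconcile deletes previously marked specs with no referencing instance,
// upserts every spec seen in the registry, and marks everything else.
func reconcile(registry []Spec, datastore map[string]*Spec, instances []Instance) {
	referenced := map[string]bool{}
	for _, inst := range instances {
		referenced[inst.SpecID] = true
	}

	// Step 3: delete marked specs that no instance references.
	for id, s := range datastore {
		if s.MarkedForDeletion && !referenced[id] {
			delete(datastore, id)
		}
	}

	// Step 4: upsert every spec fetched from the registry.
	inRegistry := map[string]bool{}
	for _, s := range registry {
		inRegistry[s.ID] = true
		spec := s
		spec.MarkedForDeletion = false // it is back in the registry
		datastore[spec.ID] = &spec
	}

	// Step 5: mark datastore specs that the registry no longer lists.
	for id, s := range datastore {
		if !inRegistry[id] {
			s.MarkedForDeletion = true
		}
	}
}

func main() {
	ds := map[string]*Spec{
		"kept":   {ID: "kept"},
		"gone":   {ID: "gone"},
		"doomed": {ID: "doomed", MarkedForDeletion: true},
		"pinned": {ID: "pinned", MarkedForDeletion: true},
	}
	reconcile(
		[]Spec{{ID: "kept"}, {ID: "new"}},
		ds,
		[]Instance{{SpecID: "pinned"}}, // pinned still has an instance
	)
	fmt.Println(len(ds), ds["gone"].MarkedForDeletion, ds["pinned"].MarkedForDeletion)
}
```

Everything happens in one pass of the bootstrap loop, which is exactly why no cross-goroutine coordination is needed.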

This removes all of the concurrency issues we were trying to "solve" in all of the calls. It's easy to understand. Because we will be doing the handling of the delete (#3 above) in-place, this will cause the bootstrap to take longer.

We can make the bootstrap endpoint async so that it won't be affected by how long the bootstrap takes to reconcile. The topic of adding a last_operation like endpoint for bootstrap came up, but for now, we are forgoing that portion of the solution. The last_operation is only of interest to clients that interact with the broker directly, for instance, sbcli. We can implement the same polling mechanism that the catalog uses for the standard OSB methods.

The above will fix the original bugs we outlined, without too much intrusion. We didn't go too deep into backwards compatibility with existing clients, which seems to only affect the apb tool. The apb tool is being replaced by sbcli anyway.

The plan is as follows:

  1. send out this email
  2. add the above solution as a comment to the proposal PR
  3. close proposal PR
  4. create trello card with above steps
  5. @alaypatel07 will implement the solution from the trello card

If anyone has any strong objections please speak up now.

@jmrodri jmrodri closed this Jul 3, 2018
Labels: size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

7 participants