New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconciliation Proposal #997
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good start. My recommendation would be to focus on one approach and refine that.
My understanding that the service-catalog has a reconciliation it does whenever it relists. Have you looked into how they handle it?
| 2. Fetch new specs | ||
| 3. Add all the fetched specs to the datastore | ||
|
|
||
| This leads to a bunch of bugs [Bug 1583495](https://bugzilla.redhat.com/show_bug.cgi?id=1583495), [Bug 1577810](https://bugzilla.redhat.com/show_bug.cgi?id=1583495), [Bug 1586345](https://bugzilla.redhat.com/show_bug.cgi?id=1586345) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these would be better added as a list with a small summary of the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I can add a summary describing the problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one needs to be updated [Bug 1577810](https://bugzilla.redhat.com/show_bug.cgi?id=1583495)
|
|
||
|
|
||
| Instead, the broker can take the following actions: | ||
| ####Approach#1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be ### Approach 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sure
|
|
||
| Either approach will take care of the all objectives and bugs. | ||
|
|
||
| ##Other ideas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling is that these should be separate proposals and are immediately related to the reconciliation problem.
| 1. add new specs to the datastore | ||
| 2. delete that specs that are removed from datastore and do not have a service instance. | ||
| 3. mark the specs that are removed from datastore and has a service instance | ||
| 2. During deprovision, if the instance getting deprovisioned is the last one delete the spec from datastore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you know this is the last service instance associated with a particular bundle spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either maintain a counter of the number of instances provisioned in bundle.Spec or compare all the instance specs with the spec of the instance being deprovisioned. I think the former would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bundle spec should only get deleted in the case that it had been previously marked for deletion, but was not deleted because it had an outstanding instance, correct?
|
@djzager both the approaches have pros and cons which approach do you think I should pick to refine? And yes, I started looking into the service catalog reconciliation but the biggest difference in their environment is, they are getting all the changes as a callback. So, I stopped looking, thinking their approach would not be applicable in our environment as we use polling to sync the specs. |
Describes broker behavior for: 1) untimely provision request 2) out of sync provision request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should go with approach 2 for the reconciliation.
I have inline comments about the other approaches.
Once a decision has been made on the approaches to take we need to update this doc with that.
| ### Approaches to safely deleting specs | ||
|
|
||
| The broker can take the following actions to safely delete the specs: | ||
| #### Approach#1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: space between # and 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually remove the # and make it #### Approach 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, let's summarize the approach with a paragraph before going into the individual steps. Approach 1 seems to be in place merge of the specs. Approach 2 seems to be use a reaper task to remove specs with mark for deletion. Alternatively, we could just change the headline to include that.
Approach 1 - In place merge
Approach 2 - Reaper task
| 1. If the fetch fails increment a failed connection counter. If the failed connection counter reaches a limit delete all the specs that do not have a service instance | ||
| 2. If the fetch succeeds: | ||
| 1. add new specs to the datastore | ||
| 2. delete that specs that are removed from datastore and do not have a service instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should be looking at bundle instances and not service instances
| #### Approach#1 | ||
| The broker always sends an error code if it is busy provisioning. OSB [spec](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2) does not have an appropriate error code for busy. | ||
|
|
||
| #### Approach#2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that a combination of 2 and 3 is an option. in this scenario we start the provision with some soft of waiting mechanism, that will get notified when bootstrap finishes. While this is happening async provision last operation request's can be called just like they are now and will error if the spec is deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 to Approach 1, sending an error code because we are busy is bad and as you already pointed out not OSB spec compliant.
-1 Approach 2 by itself is inadequate we can't block the request, we must return within 60 seconds which is why we have the async provision concept.
Approach 3 is the most correct, in my opinion, we begin the provision and if we can't find the spec we need we fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also see the combo approach @shawn-hurley suggested above would be acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with the above comments. I don't think 1 is reasonable, 2 shouldn't block. It would be semantically correct to respond with a 202 saying "okay, we've accepted your request and we're working on it". The fact that we're waiting on the bootstrap to complete is a broker implementation detail; it has nothing to do with the OSB spec.
Once the bootstrap completes, the job waiting should then investigate the current state of the specs. If the requested instance refers to a spec that has been deleted, we can mark the job failed. If the spec still exists, proceed as normal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like approach 3 the best and agree with the above comments.
|
|
||
| These ideas are not directly related to the objectives of this proposal but are mentioned here and might become independent proposals of their own when addressing those issues. | ||
|
|
||
| #### APB Versioning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section is a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create another proposal called "bundle versioning" and move these two sections to that proposal. Feel free to link to that proposal from this proposal if you want. But I agree with @shawn-hurley these are two different topics. It will be easier to review in a separate doc as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay finished reviewing. A few typos and grammar issues. And a suggestion for changing the workflow to approach 1 of the safely deleting scenario.
| ### Approaches to handle untimely provision request | ||
|
|
||
|
|
||
| To to handle a case where provision is requested after the broker has started bootstrapping but before it has completed bootstrapping, the broker can take the following approaches: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYPO: To to should just be To
| Successfully serves the provision request (very similar to scenario #1): This approach might lead to unexpected problems during execution. | ||
|
|
||
| #### Approach#2 | ||
| Error out the provision request. It is clear with the polling approach of the Service Catalog to update the [specs](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2), that this scenario will eventually popup, but the specs doesnt have an error code specifically meant for this scenario. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYPO: popup -> pop up
TYPO: doesnt -> don't
| # Spec Reconciliation Proposal | ||
|
|
||
| ## Introduction | ||
| Add a way for broker to sync the list of specs it has with registries while handling the following corner cases: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GRAMMAR: for broker -> for the broker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reword the first sentence:
Refactor the way the broker syncs the list of specs it has with registries while handling the following corner cases
It already has a way to sync them, we're just changing how it works.
| 1. Temporary loss of connection with registry | ||
| 2. A spec removed from the registry but has provisioned instance/s | ||
| 3. The service catalog sends a provision request to the broker after the broker has started bootstrapping but before it has completed bootstrapping (untimely provision request) | ||
| 4. The service catalog sends a provision request to the broker for specs that has been marked deleted in the broker (out of sync provision request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GRAMMAR: specs that has -> specs that have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 the corner cases seem correct.
|
|
||
| The broker can take the following actions to safely delete the specs: | ||
| #### Approach#1 | ||
| 1. Fetch the specs from registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GRAMMAR: from registry -> from a registry
| ### Approaches to safely deleting specs | ||
|
|
||
| The broker can take the following actions to safely delete the specs: | ||
| #### Approach#1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, let's summarize the approach with a paragraph before going into the individual steps. Approach 1 seems to be in place merge of the specs. Approach 2 seems to be use a reaper task to remove specs with mark for deletion. Alternatively, we could just change the headline to include that.
Approach 1 - In place merge
Approach 2 - Reaper task
| #### Approach#1 | ||
| The broker always sends an error code if it is busy provisioning. OSB [spec](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2) does not have an appropriate error code for busy. | ||
|
|
||
| #### Approach#2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 to Approach 1, sending an error code because we are busy is bad and as you already pointed out not OSB spec compliant.
-1 Approach 2 by itself is inadequate we can't block the request, we must return within 60 seconds which is why we have the async provision concept.
Approach 3 is the most correct, in my opinion, we begin the provision and if we can't find the spec we need we fail.
| #### Approach#1 | ||
| The broker always sends an error code if it is busy provisioning. OSB [spec](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2) does not have an appropriate error code for busy. | ||
|
|
||
| #### Approach#2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also see the combo approach @shawn-hurley suggested above would be acceptable.
|
|
||
| These ideas are not directly related to the objectives of this proposal but are mentioned here and might become independent proposals of their own when addressing those issues. | ||
|
|
||
| #### APB Versioning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create another proposal called "bundle versioning" and move these two sections to that proposal. Feel free to link to that proposal from this proposal if you want. But I agree with @shawn-hurley these are two different topics. It will be easier to review in a separate doc as well.
| ### Approaches to safely deleting specs | ||
|
|
||
| The broker can take the following actions to safely delete the specs: | ||
| #### Approach#1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like Approach 1 best but I'd like to change it a bit.
- Fetch the specs from a registry
- Fetch the specs from the datastore
- Delete any specs in the datastore currently marked for deletion, assuming there is no instance that has a reference to this spec which is marked for deletion.
i. if there is a reference, leave the spec alone
ii. if there is no reference, delete the spec - For each spec from the registry, get the spec from the datastore
i. if found, update it, and save to datastore
ii. if not found, add it to the datastore - For each spec in the datastore, NOT in registry list, mark it for deletion.
Connection failure
i. if connection fails, retry 10 times backing of 10 seconds each time, then give up until the next bootstrap loop runs
if the bootstrap loop is < our retry time, I'd abort and wait for the next bootstrap loop to run.
Catalog payload should change such that it returns all specs that are NOT marked for deletion
I would not do reference counting, I would simply look for any instances that have a reference to a spec that was marked for deletion. If so, then we do not delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should mark all specs for deletion if the fetch fails. If we get NOTHING back, then yes. But if there is an ERROR, then no, leave them alone until the next bootstrap call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknelson @jmrodri I think this is where we should discuss the approach to the actual reconciliation loop proposal.
I find approach 1 to be limiting and overloading different pieces. The bootstrap could mark that it no longer sees a spec, and have another process handle if it should delete it. This will make the bootstrap loop have less work to do and will allow for the bootstrap to maybe not timeout?
I also think some sort of control loop that only deals with managing bundles gets us much closer to being kubelike, and will better separate our logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @shawn-hurley, we already frequently hit 504s with the broker route when bootstrapping I would prefer to not add too much computation to the bootstrap loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good @alaypatel07. I'm a little uneasy with reference counters; they can get out of sync, especially in concurrency situations, and I'm not sure the complexity of coordinating these is worth it.
| 1. add new specs to the datastore | ||
| 2. delete that specs that are removed from datastore and do not have a service instance. | ||
| 3. mark the specs that are removed from datastore and has a service instance | ||
| 2. During deprovision, if the instance getting deprovisioned is the last one delete the spec from datastore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bundle spec should only get deleted in the case that it had been previously marked for deletion, but was not deleted because it had an outstanding instance, correct?
| #### Approach#1 | ||
| The broker always sends an error code if it is busy provisioning. OSB [spec](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2) does not have an appropriate error code for busy. | ||
|
|
||
| #### Approach#2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with the above comments. I don't think 1 is reasonable, 2 shouldn't block. It would be semantically correct to respond with a 202 saying "okay, we've accepted your request and we're working on it". The fact that we're waiting on the bootstrap to complete is a broker implementation detail; it has nothing to do with the OSB spec.
Once the bootstrap completes, the job waiting should then investigate the current state of the specs. If the requested instance refers to a spec that has been deleted, we can mark the job failed. If the spec still exists, proceed as normal.
| To handle the case where service catalog sends a request to provision an instance of a spec that has been marked for deletion the broker can take the following approaches: | ||
|
|
||
| #### Approach#1 | ||
| Successfully serves the provision request (very similar to scenario #1): This approach might lead to unexpected problems during execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what it means to "successfully serve the provision request" here, but I don't think this is the correct approach. If we know the spec has been marked for deletion, IMO it should effectively not exist anymore, and therefore we should respond as if it doesn't exist.
| Error out the provision request. It is clear with the polling approach of the Service Catalog to update the [specs](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-2), that this scenario will eventually popup, but the specs doesnt have an error code specifically meant for this scenario. | ||
|
|
||
| #### Approach#3 | ||
| It can send a 202 Accepted status, suggesting that provisioning is in progress. The service catalog will when keep polling to check if the service instance is fully provisioned. The [last operation](https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#response-1) has a mechanism where broker can tell if the provision is success of failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we should respond with a 202 on the initial response if we know the spec has been marked for deletion. It's effectively gone at that point, and I don't think we should accepting any more requests for instances (I think I'm advocating a bad request response here).
|
|
||
| These ideas are not directly related to the objectives of this proposal but are mentioned here and might become independent proposals of their own when addressing those issues. | ||
|
|
||
| #### APB Versioning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
So I had a private conversation with Alay earlier today, and another one with Erik & Shawn. Figured I'll post my comments here:
|
| 3. the admin can know what version of apb is provisioned in the cluster | ||
|
|
||
|
|
||
| #### Multiple tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a separate proposal as well IMO.
|
|
||
| This would require modification to the bootstrap function where after fetching all the specs, the broker needs to iterate through the list of named specs and fetch specific tags from the registry | ||
|
|
||
| Adding a method to adaptor interface that takes two inputs, image name and image tag, and fetches its spec. The broker will then call this method iteratively for all named pair of specs and tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmrodri @eriknelson @shawn-hurley I think the in-place merge (approach 1) could still be a good way to achieve reconciliation. My modified approach is as follows:
- Maintain a bundle instance counter along with each spec in the CRD.
- Compare the list of registrySpecs with crdSpecs.
- Add all the specs that present in registrySpecList but absent in crdSpecList, with bundle instance counter initialized to 0
- Remove all specs present in crdSpecList but absent in registrySpecList and with bundle instance counter value 0
Increment the spec bundle instance counter during provision
Decrement the spec bundle instance counter during deprovision
I think step 2 and 3 in the above algorithm would be the ones that would take most of the time. But those steps would be required even if the reaper approach is taken, to mark the specs for deletion. Maintaining the counter value would help save the time consuming computation required to determine whether a spec has a bundle instance. This will also avoid all the concurrency problems that complicates the reaper approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like maintaining a separate counter because that is information we can calculate by looking for instances that have references to the specs. While that's not trivial, it is accurate. My gut tells me those counters are going to get out of date and create another issue. But that's just me, I'll let others chime in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmrodri @eriknelson @shawn-hurley I think modifying the reaper approach as follows could be one way to handle reconciliation with some simplicity:
- There is a background process reaper which listens for data from bootstrap. Bootstrap runs periodically as it does now.
- The bootstrap process is responsible for determining which spec should be added and adding them. The bootstrap will also mark it for deletion, so that other function(catalog and provision) can handle deletion of spec.
- The bootstrap process will then determine which spec to be deleted and send them to reaper(one by one as it determines).
- The reaper always runs in the background waiting to get a spec from bootstrap. When the reaper gets a spec, it deletes the spec if there are no bundle instances, else it will ignore the spec.
The reaper now tries to safely delete the specs it gets from bootstrap. If it is not safely able to delete, it will ignore it. Since the adding of specs is done by bootstrap, and safe removal is done by reaper, they no longer compete for a shared resource. This will always ensure a desirable state of the broker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the delete is not the expensive operation, it's the iteration of outstanding service instances to determine which instances correspond to specs, so it's not desirable to have the bootstrap process determine the specs that are safe to delete. The purpose of the reaper is to delegate this work to the background.
|
I met with @eriknelson & @alaypatel07 to discuss the bootstrap. The main issue is deleting ALL specs causes many bugs, primarily in the cases where they are going to be added again. Each of the 3 bugs identified in the proposal was case where we deleted the specs, before we were able to process the specs from the registry. Either, we didn't add the spec yet or there was an error Given the above, we came to the conclusion that the simplest solution to this problem is to simply not delete all of the specs, but to process them inline. Basically back to approach #1 as outlined by:
This removes all of the concurrency issues we were trying to "solve" in all of the calls. It's easy to understand. Because we will be doing the handling of the delete (#3 above) in-place this will cause the We can make the bootstrap endpoint async so that it won't be affected by how long the bootstrap takes to reconcile. The topic of adding a last_operation like endpoint for bootstrap came up, but for now, we are forgoing that portion of the solution. The last_operation is only of interest to clients that interact with the broker directly, for instance, sbcli. We can implement the same polling mechanism that the catalog uses for the standard OSB methods. The above will fix the original bugs we outlined, without too much intrusion. We didn't go too deep into backwards compatibility with existing clients, which seems to only affect apb tool. Apb tool is being replaced by sbcli anyway. The plan is as follows:
If anyone has any strong objections please speak up now. |
Describe what this PR does and why we need it:
Add spec-reconciliation.md