
WIP: Force delete a service-catalog resource #751

Closed
wants to merge 2 commits

Conversation

philipgough
Contributor

Describe what this PR does and why we need it:

When a user deprovisions and the APB fails, they won't be able to redeploy without renaming the template or tearing down the environment. We want to allow the user to set some config which will help clean up the resources left behind.

Changes proposed in this pull request

  • Add a force_delete flag to the broker config
  • If a deprovision fails and the flag is true, go ahead and clean up the service instance in etcd
  • Log that we will try to clean up, but that it is ultimately up to the user to do so
  • When the client calls the deprovision endpoint and it is in progress, return success regardless

Which issue this PR fixes (This will close that issue when PR gets merged)
fixes #666

@openshift-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) labels on Feb 9, 2018
Contributor

@rthallisey left a comment


Nice work. Minor nit.

@@ -441,6 +442,11 @@ parameters:
name: AUTO_ESCALATE
value: "false"

- description: Allow the broker to prune services instances orphaned by failed deprovision
displayname: Force delete
name: FROCE_DELETE
Contributor


misspelling

@jmrodri self-requested a review February 9, 2018 13:27
@philipgough
Contributor Author

@rthallisey, a couple of questions before we think about merging this:

  • Would it be best to return early from the deprovision handler with this code: https://github.com/PhilipGough/ansible-service-broker/blob/b9f72c717d8ae3fe3321059daa2ee8c587c2e570/pkg/handler/handler.go#L454-L457? Doing so would avoid the last operation endpoint being called at all.
  • When I delete the provisioned service, I see the logs showing the DAO deleting the instance, but in the UI the "Provisioned Service" shows "The service was marked for deletion." and never goes away. Am I missing a step? The backing Pods never go away either; I assume we don't need to clean these up?
  • Any test cases you would like to see added?
  • A previous comment was to import the config pkg and use GetBool instead of injecting it directly, but I'm not sure how that works, as I need a reference to the config.

Thanks

Contributor

@jmrodri left a comment


I think the idea of doing a forced cleanup is good. I'm not entirely sure if we need a force_delete to do this. I almost feel like this should be the normal behavior. We should try to deprovision more than once during an error. If those fail, we still delete from etcd and log all errors that happened during the deprovision.

Is there ever a time we do not want to delete from etcd?

@@ -298,6 +298,7 @@ objects:
ssl_cert_key: /etc/tls/private/tls.key
ssl_cert: /etc/tls/private/tls.crt
auto_escalate: ${AUTO_ESCALATE}
force_delete: ${FORCE_DELETE}
Contributor


What if this value is not set here? Will it default to false? I would advocate not adding this to the simple template. If folks want this turned on they can edit it manually. I don't want this template to become unwieldy like the others.
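For reference on the default: if the broker just unmarshals the config YAML into a struct with a plain bool field, leaving force_delete out keeps the Go zero value, i.e. false. A minimal standalone illustration of that behaviour (not the broker's actual config code; the struct and field names here are only for the example):

```go
package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

// Illustrative settings struct: a bool field that is absent from the YAML
// keeps its zero value, so force_delete would effectively default to false.
type settings struct {
	AutoEscalate bool `yaml:"auto_escalate"`
	ForceDelete  bool `yaml:"force_delete"`
}

func main() {
	raw := []byte("auto_escalate: false\n") // no force_delete key at all
	var cfg settings
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.ForceDelete) // prints "false"
}
```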

Member


I agree with @jmrodri on not adding this to the simple template if possible. The deploy-ansible-service-broker.template.yaml template is good for the more complicated stuff.

@@ -722,6 +724,10 @@ func (a AnsibleBroker) Deprovision(
log.Info("Synchronous deprovision in progress")
_, err = apb.Deprovision(&instance)
if err != nil {
if a.brokerConfig.ForceDelete {
log.Infof("Deprovision failed. Attempting to clean up related resources. User should ensure clean up is successful")
cleanupDeprovision(&instance, a.dao)
Contributor


We're ignoring the err returned by cleanupDeprovision. At a minimum we need to log it so we can figure out what was really going on.

Contributor Author


We do ignore the error from cleanupDeprovision, but that function does take care of logging it. Also, we return the original error to the caller. Do you think this is sufficient?
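If you would prefer the error surfaced at the call site as well, it would only be a couple of lines along these lines (a sketch only; I'm assuming our log package exposes an Errorf alongside the Infof used above):

```go
// Sketch: also log the cleanup error where cleanupDeprovision is called,
// in addition to the logging already done inside cleanupDeprovision itself.
if cleanupErr := cleanupDeprovision(&instance, a.dao); cleanupErr != nil {
	log.Errorf("force_delete cleanup of the service instance failed: %v", cleanupErr)
}
```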

@philipgough
Contributor Author

@jmrodri thanks for your input. What I took from the original issue was that this should be a config change, enabled by the developer, to enhance the developer workflow. If this option is false, which it is by default, then we will return the error to the client and they can retry. Thoughts?

@rthallisey
Contributor

I think the idea of doing a forced cleanup is good. I'm not entirely sure if we need a force_delete to do this. I almost feel like this should be the normal behavior. We should try to deprovision more than once during an error. If those fail, we still delete from etcd and log all errors that happened during the deprovision.

Is there ever a time we do not want to delete from etcd?

Re-reading the Open Service Broker API spec I found: "When a Service Broker receives a deprovision request from a Platform, it MUST delete any resources it created during the provision. Usually this means that all resources are immediately reclaimed for future provisions."

I think you're right @jmrodri. I think the broker is not completely spec compliant on deprovision because we don't guarantee that all provision resources are cleaned up. I think this PR gets us most of the way there by allowing the service instance to be removed. I agree that this should be the default behaviour.

@philipgough
Contributor Author

@rthallisey @jmrodri ok, thanks guys. Looks like I need to refactor then with the flag removed and just force this behaviour.

@philipgough
Contributor Author

So, having had a think about this issue, and specifically about this from the spec:

"When a Service Broker receives a deprovision request from a Platform, it MUST delete any resources it created during the provision. "

To me this means we should clean up all the k8s objects that get created as part of the provision. However, I see some issues with this.

What happens if the deprovision playbook is broken? We have no way (currently) of tracking these resources. We won't have the context of what needs to be deleted.

So going forward, my suggestion for this PR is as follows, if you guys agree on it.

  1. Assume the deprovision playbook is correct and attempt to retry the deprovision a specific number of times (should this be configurable?). A rough sketch follows below.
  2. If the above fails, assume at this point that the playbook is broken and we can't progress the deprovision. At this point, do what we currently do in this PR and make a final best effort to clean up the service instance, etc.
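The sketch for point 1 (purely illustrative; retryDeprovision, maxRetries and the fixed backoff are not existing broker code, just the shape I mean):

```go
package broker

import (
	"log"
	"time"
)

// retryDeprovision retries the APB deprovision a fixed number of times before
// giving up. The name, signature, retry count and backoff are illustrative only.
func retryDeprovision(deprovision func() error, maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = deprovision(); err == nil {
			return nil
		}
		log.Printf("deprovision attempt %d/%d failed: %v", attempt, maxRetries, err)
		time.Sleep(backoff)
	}
	// All retries exhausted: the caller would then fall back to the
	// best-effort cleanup of the service instance record (point 2).
	return err
}
```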

Thoughts?

@rthallisey
Contributor

Assume the deprovision playbook is correct and attempt to retry the deprovision a specific number of times (should this be configurable?).

Retrying could have some additional challenges. The broker has no idea what's in the playbook and retrying could trigger some weird path in the playbook. I think retrying is something worth looking into in the future, but let's have a separate discussion for that.

If the above fails, assume at this point that the playbook is broken and we can't progress the deprovision. At this point, do what we currently do in this PR and make a final best effort to clean up the service instance, etc.

👍. This will mean we can at least re-provision the same service, and we'll error really loudly so that the user is aware of what happened.

@rthallisey
Contributor

Highlights from IRC discussion:

  1. Stick with using a flag that will remove the guarantee that resources are cleaned up.
  2. Follow up in the mailing list (ansible-service-broker@redhat.com) with a discussion about what should be the default behaviour on a deprovision so that we can improve the user experience.

@rthallisey
Contributor

Would it be best to return early from the deprovision handler with this code: https://github.com/PhilipGough/ansible-service-broker/blob/b9f72c717d8ae3fe3321059daa2ee8c587c2e570/pkg/handler/handler.go#L454-L457? Doing so would avoid the last operation endpoint being called at all.

Is there any benefit to avoiding the last operation endpoint? Might be worth continuing to process last_operation for logging.

When I delete the provisioned service, I see the logs showing the DAO deleting the instance, but in the UI the "Provisioned Service" shows "The service was marked for deletion." and never goes away. Am I missing a step?

I think the provisioned service should get removed in the UI. I think that's the service instance resource.

The backing Pods never go away either; I assume we don't need to clean these up?

If the deprovision failed without cleaning up anything then it's up to the user to cleanup.

Any test cases you would like to see added?

Ya, let's test the Run function. pkg/broker/provision_job_test.go should provide a template.
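Something along these lines as a starting point; this is only a skeleton in the table-driven style of provision_job_test.go, and shouldCleanup is a stand-in for wherever the force_delete decision ends up living, not an existing function:

```go
package broker

import (
	"errors"
	"testing"
)

// shouldCleanup stands in for the force_delete decision in this PR: clean up
// the stored instance only when the deprovision failed and the flag is set.
func shouldCleanup(deprovisionErr error, forceDelete bool) bool {
	return deprovisionErr != nil && forceDelete
}

func TestForceDeleteCleanup(t *testing.T) {
	cases := []struct {
		name        string
		deprovErr   error
		forceDelete bool
		want        bool
	}{
		{"deprovision succeeded", nil, true, false},
		{"deprovision failed, force_delete on", errors.New("apb failed"), true, true},
		{"deprovision failed, force_delete off", errors.New("apb failed"), false, false},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			if got := shouldCleanup(c.deprovErr, c.forceDelete); got != c.want {
				t.Errorf("shouldCleanup() = %v, want %v", got, c.want)
			}
		})
	}
}
```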

A previous comment was to import the config pkg and use GetBool instead of injecting it directly, but I'm not sure how that works, as I need a reference to the config.

You do need a reference to that object. Makes sense to keep it as is.

@philipgough
Contributor Author

philipgough commented Feb 12, 2018

@rthallisey

"Retrying could have some additional challenges. The broker has no idea what's in the playbook and retrying could trigger some weird path in the playbook."

Sure, I was operating on the assumption that the playbook would be idempotent but I think this assumption is not a correct one to make.

"I think the provisioned service should get removed in the UI. I think that's the service instance resource."

I came back to this today, and this is in fact only cleaned up when the last operation handler returns a success. So I'm wondering if it's sufficient to return 200 from the deprovision handler, as I suggested above, when the force_delete flag is set?

@rthallisey
Contributor

Sure, I was operating on the assumption that the playbook would be idempotent but I think this assumption is not a correct one to make.

+1 @philipgough, APBs should be idempotent. My comment was not entirely accurate. But let's do retries separately so this PR is easier to review.

"I think the provisioned service should get removed in the UI. I think that's the service instance resource."

I came back to this today, and this is in fact only cleaned up when the last operation handler returns a success. So I'm wondering if it's sufficient to return 200 from the deprovision handler, as I suggested above, when the force_delete flag is set?

Returning a 200 from deprovision when force_delete is set makes sense to me.
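In handler terms that's roughly the shape below (a hand-wavy sketch, not the broker's actual handler code; the real deprovision handler already has its own response helpers, and the names here are only illustrative):

```go
package handler

import (
	"log"
	"net/http"
)

// writeDeprovisionResponse sketches the early-return idea: when force_delete
// is set and the APB deprovision failed, report success to the platform after
// the best-effort cleanup instead of surfacing the failure.
func writeDeprovisionResponse(w http.ResponseWriter, deprovisionErr error, forceDelete bool) {
	if deprovisionErr != nil && !forceDelete {
		// Without force_delete the failure is still surfaced so the caller can retry.
		http.Error(w, deprovisionErr.Error(), http.StatusInternalServerError)
		return
	}
	if deprovisionErr != nil {
		log.Printf("force_delete set: returning success despite deprovision error: %v", deprovisionErr)
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("{}")) // deprovision success body is an empty JSON object
}
```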

@philipgough
Contributor Author

philipgough commented Feb 13, 2018

@rthallisey I began writing unit tests for these changes today but realised I would need a further refactor to inject the Deprovision function into the job.

I spoke to @maleck13 about this, who pointed me at #619, which he is working on. See the DI for the job here and the related test here.

The majority of the work is already done, and after discussion @maleck13 felt it would be better to potentially merge his PR (if that's what you guys want to do) than for me to port his changes into this PR and cause lots of conflicts for his.

Not sure how you want to progress, but at this point I can't do much more with this PR. It works as expected, I believe, with the service instance cleaned up immediately in the UI, and I was able to redeploy a failed APB.

@jmrodri
Contributor

jmrodri commented Feb 14, 2018

@philipgough @eriknelson @shawn-hurley @rthallisey based on the discussion on this PR and the mailing list, I'd like to suggest we close this PR in favor of a proposal. Any objections?

@rthallisey
Contributor

That's fine with me. We can always reopen if we decide we want to use this solution.

@eriknelson
Contributor

eriknelson commented Feb 14, 2018

+1. I would like to pull back and state the problem in a proposal.
