Always ensure to have machine-delete-finalizer #811

Merged
merged 1 commit into from
Oct 16, 2020

Conversation

phiphi282
Contributor

@phiphi282 phiphi282 commented Aug 17, 2020

What this PR does / why we need it:
After creating a new provider instance we always want to have the
machine-delete-finalizer. Possible provider specific finalizers can't be
deleted without it.

We have noticed this especially for Azure clusters, where the machine-controller doesn't try to run deleteCloudProvider instance because the machine-delete-finalizer finalizer is missing.

Which issue(s) this PR fixes (optional, in fixes #<issue number> format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Optional Release Note:

NONE

After creating a new provider instance we always want to have the
machine-delete-finalizer. Possible provider-specific finalizers can't be
deleted without it.

Signed-off-by: Phillip Stagnet <p.stagnet@syseleven.de>
@kubermatic-bot kubermatic-bot added dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 17, 2020
@kubermatic-bot
Contributor

Hi @phiphi282. Thanks for your PR.

I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubermatic-bot kubermatic-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 17, 2020
@phiphi282
Contributor Author

/assign @xrstf

@xrstf
Contributor

xrstf commented Sep 1, 2020

/ok-to-test
/test all

@kubermatic-bot kubermatic-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 1, 2020
@phiphi282
Contributor Author

Is there any progress on this? :)

@irozzo-1A
Contributor

Hey @phiphi282. Let me see if I got your problem right. With some providers (at least Azure), the Create method can return an error even though an instance was created and now requires clean-up. Is that correct?

@phiphi282
Contributor Author

The problem is with the deletion of the machine-delete-finalizer.

If the provider instance is created but the function returns an error, the machine-delete-finalizer will be deleted.

The cleanupCloudProviderInstance function is only trying to delete the cloud-provider instance when having this finalizer though. At least for azure this keeps the instances from getting deleted because there are azure specific finalizers that don't get cleaned up as well.
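
To illustrate why a missing finalizer leaves the machine stuck, here is a small sketch of a finalizer-gated deletion pass. It is an assumption-laden model, not the machine-controller code: `reconcileDeletion`, the `Machine` struct, and the finalizer name `provider-specific-finalizer` are hypothetical, and the clean-up callback merely stands in for whatever the provider does.

```go
package main

import "fmt"

const (
	// finalizerDeleteInstance is the finalizer name discussed in this PR.
	finalizerDeleteInstance = "machine-delete-finalizer"
	// providerFinalizer is a made-up name for a provider-specific finalizer.
	providerFinalizer = "provider-specific-finalizer"
)

// Machine is a hypothetical, simplified stand-in for the real Machine object.
type Machine struct {
	Finalizers []string
}

// contains reports whether s is in list.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// without returns a copy of list with every occurrence of s removed.
func without(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}

// reconcileDeletion models one deletion pass: provider clean-up only runs
// when the delete finalizer is present. It returns true when no finalizers
// remain, meaning the object could actually be removed.
func reconcileDeletion(m *Machine, cleanup func(*Machine) error) (bool, error) {
	if contains(m.Finalizers, finalizerDeleteInstance) {
		if err := cleanup(m); err != nil {
			return false, err
		}
		m.Finalizers = without(m.Finalizers, finalizerDeleteInstance)
	}
	return len(m.Finalizers) == 0, nil
}

func main() {
	// The clean-up is assumed to drop the provider-specific finalizer.
	cleanup := func(m *Machine) error {
		m.Finalizers = without(m.Finalizers, providerFinalizer)
		return nil
	}

	stuck := &Machine{Finalizers: []string{providerFinalizer}}
	gone, _ := reconcileDeletion(stuck, cleanup)
	fmt.Println("delete finalizer missing, machine gone:", gone)

	healthy := &Machine{Finalizers: []string{finalizerDeleteInstance, providerFinalizer}}
	gone, _ = reconcileDeletion(healthy, cleanup)
	fmt.Println("delete finalizer present, machine gone:", gone)
}
```

In this model the first machine never loses its provider-specific finalizer, because nothing ever calls the clean-up for it.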

@irozzo-1A
Contributor

@phiphi282 Ok, I think I got your point right. Having an error returned by provider.Create() method does not necessarily mean that the instance clean-up logic should not be run. I will dig a bit to understand when and why this check was added.

Do you have an example at hand of an error returned by the Azure provider.Create() where instance clean-up is needed?

@phiphi282
Contributor Author

Unfortunately I don't remember exactly what we did, but we created and deleted some instances in quick succession. I think this caused the machine to be created on the Azure side without the creation completely finishing.

The machine already had all Azure-related finalizers attached to it, which kept it from being deleted.
cc @eduardostalinho Maybe you can describe a bit better how to reproduce the error.

@eduardostalinho
Contributor

I'm afraid that I don't have the logs here anymore.
But I remember we tried to delete a cluster that wasn't created properly (or in time) because of prov.Create, and the deletion hung because this finalizer was missing.

@irozzo-1A
Contributor

I think that this PR risks having side effects. The cluster deletion hung, as mentioned by @eduardostalinho, probably because the instance was created despite the error returned by the Create method, and this prevented the deletion of the subnet or some other resources.

The problem I see is that when Create fails and no instance is actually created, the Cleanup method could fail:

completelyGone, err := prov.Cleanup(machine, r.providerData)

causing the machine deletion to hang.

@irozzo-1A
Contributor

@PhillipAmend @eduardostalinho What I propose is to keep this on hold at the moment, wait for a re-occurrence of the issues to have a better understanding of what happened. Is it ok for you?

@phiphi282
Contributor Author

@irozzo-1A From what I could see in the provider code, all Delete functions check for the existence of the instance on the provider side first.

e.g. AWS:

instance, err := p.get(machine)
if err != nil {
	if err == cloudprovidererrors.ErrInstanceNotFound {
		return true, nil
	}
	return false, err
}

With this the machine deletion should not be hanging because of a failed delete. If we don't try to delete the machine at all however (because of the missing finalizer) the machine deletion will definitely hang when there are still other finalizers on the machine.
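
The argument above relies on the clean-up being safe to call when nothing exists on the provider side. A minimal sketch of that property, using an in-memory fake instead of a real provider, could look like this. `fakeProvider` and `errInstanceNotFound` are hypothetical stand-ins (the latter for `cloudprovidererrors.ErrInstanceNotFound`), and the sketch uses `errors.Is` where the quoted snippet compares with `==`.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical stand-in for the provider's
// "instance not found" sentinel error.
var errInstanceNotFound = errors.New("instance not found")

// fakeProvider is an in-memory stand-in for a cloud provider.
type fakeProvider struct {
	instances map[string]bool
}

// get returns errInstanceNotFound when no instance with that name exists.
func (p *fakeProvider) get(name string) error {
	if !p.instances[name] {
		return errInstanceNotFound
	}
	return nil
}

// Cleanup returns (true, nil) once nothing is left on the provider side,
// including the case where the instance was never created. When an instance
// exists, it requests deletion and returns (false, nil) so the caller
// checks again on a later pass.
func (p *fakeProvider) Cleanup(name string) (bool, error) {
	if err := p.get(name); err != nil {
		if errors.Is(err, errInstanceNotFound) {
			return true, nil
		}
		return false, err
	}
	delete(p.instances, name)
	return false, nil
}

func main() {
	p := &fakeProvider{instances: map[string]bool{"worker-0": true}}

	gone, err := p.Cleanup("never-created")
	fmt.Println("never created:", gone, err)

	gone, err = p.Cleanup("worker-0")
	fmt.Println("first pass:", gone, err)

	gone, err = p.Cleanup("worker-0")
	fmt.Println("second pass:", gone, err)
}
```

Under these assumptions a failed or partial Create cannot make the clean-up itself fail with "not found", which is the case the earlier objection was about.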

It would be okay for me to wait a bit, but we probably won't see a re-occurrence of the issue on our side, since we are already using this fix (and from what we can tell it works fine on OpenStack, AWS and Azure at least).

@irozzo-1A
Contributor

Hi @phiphi282, I agree that if the Cleanup methods of all providers can run safely even when Create did not complete successfully, this PR should not have side effects. Knowing that this is already tested on AWS, OpenStack and Azure is also reassuring.

I will take another look beginning of next week.

@irozzo-1A
Contributor

/lgtm

@kubermatic-bot kubermatic-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2020
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 16b55425d25d9741145de0adea18c534c243119e

@irozzo-1A
Contributor

/approve

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: irozzo-1A, phiphi282

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2020
@irozzo-1A
Contributor

Thx for this contribution @phiphi282. I double-checked and I think you were right, this should not have side effects. If we do observe regressions for some providers, they won't be critical.

@kubermatic-bot kubermatic-bot merged commit e6a2601 into kubermatic:master Oct 16, 2020
@phiphi282 phiphi282 deleted the fix_delete_machines branch October 16, 2020 12:39
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants