
Use --force flag for reconciling addons #69424

Closed
wants to merge 1 commit

Conversation

kawych
Contributor

@kawych kawych commented Oct 4, 2018

What this PR does / why we need it:
Addon Manager should use the kubectl --force flag when reconciling addons. Without it, the following scenario breaks the cluster:

  • kubectl delete on any resource with immutable fields, e.g. the kube-dns service
  • immediately kubectl create your own resource with the same name but a different value for the immutable field, for example a kube-dns service with ClusterIP = 10.35.240.12

After that, Addon Manager fails to reconcile this resource. A failure in this step leads to further failures; for example, Addon Manager doesn't attempt to prune unwanted addons.
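
A minimal reproduction sketch of the scenario above (the namespace, manifest file name, and ClusterIP value are illustrative assumptions, not taken from the PR):

# Delete an addon resource that has an immutable field (ClusterIP on a Service).
kubectl delete service kube-dns -n kube-system
# Immediately recreate it with a different ClusterIP, before Addon Manager reacts.
kubectl create -f my-kube-dns-service.yaml   # spec.clusterIP: 10.35.240.12
# From now on, a plain `kubectl apply` of the original addon manifest fails with an
# "field is immutable" style error, and the Addon Manager reconcile run breaks.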

Release note:

Fix Addon Manager failure to reconcile addons with immutable fields.
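
For context, the fix is essentially adding --force to the addon manager's kubectl apply invocation; a rough sketch of that kind of call (the path and the flags other than --force are illustrative, not the literal kube-addons.sh line):

kubectl apply -f /etc/kubernetes/addons \
    --prune=true -l addonmanager.kubernetes.io/mode=Reconcile \
    --force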

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 4, 2018
@MrHohn
Member

MrHohn commented Oct 4, 2018

Hmm, good observation. What does adding the --force flag mean for those immutable fields? Would we get into a situation where the apiserver forces the API objects to be changed, but controllers fail to operate because of the invalid state?

/assign @mikedanese
/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 4, 2018
@MrHohn
Member

MrHohn commented Oct 4, 2018

/assign

@kawych
Contributor Author

kawych commented Oct 5, 2018

I've tested this; it also requires a newer version of kubectl. I have one more problem with it: if Addon Manager finds the kube-dns service with an incorrect ClusterIP (an immutable field), instead of re-applying it immediately, the kube-dns service gets pruned and recreated only in the next Addon Manager run.

@MrHohn The behavior I expect is for the resource to be re-created with the correct configuration.
"Would we get into a situation where the apiserver forces the API objects to be changed, but controllers fail to operate because of the invalid state?" - I didn't fully understand. Do you mean:
(1) If a cluster user forces addon changes, can it cause controllers to fail?
(2) If Addon Manager forces addon changes, can it cause controllers to fail?
For (1) I don't know, maybe it's possible. For (2), I believe not: any difference in addon configuration is probably caused by a cluster user applying/forcing their own changes, and I believe Addon Manager should revert those changes.

@MrHohn
Member

MrHohn commented Oct 6, 2018

@kawych Thanks for the explanation. I wasn't sure what --force means for kubectl apply, but now it is clear (also took a look at #66602).

I was asking whether kubectl apply --force would force an in-place update of the API object, in which case the controller may not know what to do when that happens, but it seems that is not the case.
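
In other words (my understanding of the behavior, see also the patcher snippet quoted later in this thread): on a conflict such as an immutable-field change, apply --force does not patch the live object in place; it falls back to something roughly equivalent to a delete followed by a create of the desired state:

# Conceptually what `kubectl apply --force` does when the patch hits an
# immutable-field conflict (sketch, not the literal implementation):
kubectl delete service kube-dns -n kube-system
kubectl create -f kube-dns-service.yaml   # the addon's desired manifest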

Member

@MrHohn MrHohn left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2018
@MrHohn
Member

MrHohn commented Oct 6, 2018

I have one more problem with it: if Addon Manager finds the kube-dns service with an incorrect ClusterIP (an immutable field), instead of re-applying it immediately, the kube-dns service gets pruned and recreated only in the next Addon Manager run.

That means a ~1 minute gap between deletion and recreation; guessing that's not too bad?
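
The ~1 minute presumably comes from the addon manager's periodic loop; a sketch of that shape (the interval variable name is quoted from memory and should be treated as an assumption):

# kube-addons.sh style reconcile loop (sketch):
ADDON_CHECK_INTERVAL_SEC=${TEST_ADDON_CHECK_INTERVAL_SEC:-60}
while true; do
  reconcile_addons   # hypothetical helper wrapping the kubectl apply calls
  sleep "${ADDON_CHECK_INTERVAL_SEC}"
done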

@kawych
Contributor Author

kawych commented Oct 8, 2018

/retest

@kawych
Contributor Author

kawych commented Oct 8, 2018

/cc @mikedanese

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 18, 2019
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 19, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kawych
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mikedanese

If they are not already assigned, you can assign the PR to them by writing /assign @mikedanese in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kawych
Contributor Author

kawych commented Mar 19, 2019

@mikedanese PTAL

@MrHohn
Member

MrHohn commented Mar 21, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2019
@kawych
Contributor Author

kawych commented Mar 22, 2019

/retest

@mikedanese
Member

I'm looking at the kubectl apply flags and I see this:

--force=false: Only used when grace-period=0. If true, immediately remove resources from API and bypass graceful deletion. Note that immediate deletion of some resources may result in inconsistency or data loss and requires confirmation.
  1. This doesn't sound like the behavior you are describing.
  2. It requires confirmation?

From the code, it looks like this does what you describe. But this looks really scary to me:

// In kubectl's apply patcher: if the patch failed with a conflict or an
// invalid-object error and --force was given, fall back to deleting the live
// object and recreating it from the desired configuration.
if err != nil && (errors.IsConflict(err) || errors.IsInvalid(err)) && p.Force {
	patchBytes, patchObject, err = p.deleteAndCreate(current, modified, namespace, name)
}

What happens if we push an object config that is invalid in some clusters? What if a customer accidentally misconfigures an admission webhook and we start getting invalid requests on applies? It seems like we would just start deleting objects and recreating them.

Even more likely, what if the deployment controller updates a deployment at the same time we run apply and get a conflict? Wouldn't we delete the deployment, wait for it to fully delete, then recreate it?

cc @apelisse

@kawych
Contributor Author

kawych commented Mar 25, 2019

/remove-lifecycle rotten

What if a customer accidentally misconfigures an admission webhook and we start getting invalid requests on applies? It seems like we would just start deleting objects and recreating them.

I think the same behavior happens for a normal apply as well, with the exception of objects with immutable fields, only because a normal kubectl apply will fail for those.

Even more likely, what if the deployment controller updates a deployment at the same time we run apply and get a conflict? Wouldn't we delete the deployment, wait for it to fully delete, then recreate it?

I'm not sure if this is possible; if so, it does indeed seem like an issue. However, the desired outcome would be not to simply skip reconciling, but to pick up the updated deployment and reconcile it, right? The code you linked includes retries for the patch operation; shouldn't that resolve the problem?

Note the problem I'm trying to address here: Addon Manager fails to reconcile some addons and the reconcile loop breaks. I believe the right solution is to force reconciliation of objects with immutable fields. If that is too risky, we have to think of another solution, for example logic that resumes the reconcile loop so that a failing kubectl apply on a single object doesn't affect all the other objects.
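
A rough sketch of that kind of "keep going" logic, assuming the addon manager applied manifests one by one from a directory (the path and loop are hypothetical; as I understand it the real script applies everything in a single kubectl apply --prune call, so this would change pruning semantics):

# Apply each addon manifest independently so one failure doesn't abort the whole run.
for manifest in /etc/kubernetes/addons/*.yaml; do
  if ! kubectl apply -f "${manifest}"; then
    echo "failed to reconcile ${manifest}, continuing with remaining addons" >&2
  fi
done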

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 25, 2019
@apelisse
Member

I'm not an expert on client-side apply; adding the people who wrote/reviewed the change: @dixudx @soltysh

@soltysh
Contributor

soltysh commented Mar 29, 2019

--force in apply kicks in when a conflict occurs or in case of invalid resources. So what @mikedanese writes holds true. I don't have a way to verify how this affects the addon manager, but I can say that using --force is very dangerous, and I'd discourage using it in automated scripts.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 27, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
