
Deployment enters creation hot-loop when rs field is mutated by API server #57167

Open · janetkuo opened this issue Dec 14, 2017 · 19 comments
Labels: area/reliability, kind/bug, lifecycle/frozen, priority/important-soon, sig/apps

janetkuo (Member, Author) commented Dec 14, 2017

/kind bug

What Happened

A Deployment enters a hot loop of creating new ReplicaSets.

In the Deployment's spec.template:

      volumes:
      - emptyDir:
          sizeLimit: "0"
        name: foo

In its ReplicaSet's spec.template:

      volumes:
      - emptyDir: {}
        name: foo

This happens when you create a Deployment that specifies volumes.EmptyDir in 1.7.0–1.7.5 and then upgrade the cluster to >= 1.8.0 with LocalStorageCapacityIsolation disabled.

Root Cause

Some background information:

  1. In the pod spec, a new optional field volumes.EmptyDir.sizeLimit was introduced in 1.7.0, but it was incorrectly declared as a non-pointer resource.Quantity type.
  2. To fix this, the field was later changed to the pointer type *resource.Quantity (Change SizeLimit to a pointer #50163) in 1.8.0 and backported to 1.7.6.
  3. In 1.8.0, this field is set to nil if the LocalStorageCapacityIsolation feature isn’t enabled:
    https://github.com/kubernetes/kubernetes/blob/v1.8.0/pkg/api/pod/util.go#L242 (this fix will soon be cherry-picked to the next 1.7.x release).
  4. A Deployment creates a new ReplicaSet by copying its own template into the ReplicaSet, and it finds its new ReplicaSet by comparing the Deployment's template against the ReplicaSets it owns.

If you create a Deployment that specifies volumes.EmptyDir in 1.7.0–1.7.5, sizeLimit is incorrectly defaulted to "0", because of (1) above.

      volumes:
      - emptyDir:
          sizeLimit: "0"
        name: foo
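
For illustration, here is a minimal, self-contained Go sketch of why the non-pointer field defaults to "0" (the Quantity type below is a simplified stand-in for resource.Quantity, not the real implementation): encoding/json's omitempty never elides a struct value, so an unset value-typed sizeLimit still serializes as "0", while a nil pointer is dropped entirely.

package main

import (
    "encoding/json"
    "fmt"
)

// Quantity is a simplified stand-in for resource.Quantity: a struct type
// whose zero value marshals as the string "0".
type Quantity struct{ s string }

func (q Quantity) MarshalJSON() ([]byte, error) {
    if q.s == "" {
        return []byte(`"0"`), nil
    }
    return []byte(`"` + q.s + `"`), nil
}

// emptyDirValue mirrors the 1.7.0-1.7.5 shape: omitempty never elides a
// struct value, so an unset sizeLimit still serializes as "0".
type emptyDirValue struct {
    SizeLimit Quantity `json:"sizeLimit,omitempty"`
}

// emptyDirPointer mirrors the 1.7.6/1.8.0 fix: a nil pointer means "unset"
// and the field is omitted from the output.
type emptyDirPointer struct {
    SizeLimit *Quantity `json:"sizeLimit,omitempty"`
}

func main() {
    v, _ := json.Marshal(emptyDirValue{})
    p, _ := json.Marshal(emptyDirPointer{})
    fmt.Println(string(v)) // {"sizeLimit":"0"}
    fmt.Println(string(p)) // {}
}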

If you then upgrade the cluster to 1.8.0, the sizeLimit: "0" in the ReplicaSet will be cleared, because of (3) above.

The Deployment cannot find its new ReplicaSet because of the template change, so it continuously creates new ReplicaSets, which still end up with a different template after creation.

Solution

A possible long-term solution is to implement Create() in dry-run mode and have Deployments use the dry-run-created ReplicaSet template (instead of the Deployment template) to compare against and find the current ReplicaSet.
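
A rough sketch of what that could look like, assuming a later client-go release where server-side dry-run is available (this is not the actual controller code): create the candidate ReplicaSet with DryRun so the API server runs defaulting and admission without persisting the object, then compare the returned template against the owned ReplicaSets instead of the raw Deployment template.

package sketch

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// dryRunReplicaSet asks the API server to run defaulting, admission, and
// validation for the candidate ReplicaSet without persisting it, returning
// the object as it would actually be stored.
func dryRunReplicaSet(ctx context.Context, c kubernetes.Interface, rs *appsv1.ReplicaSet) (*appsv1.ReplicaSet, error) {
    return c.AppsV1().ReplicaSets(rs.Namespace).Create(ctx, rs, metav1.CreateOptions{
        DryRun: []string{metav1.DryRunAll},
    })
}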

A possible short-term solution is to implement a hack that clears the Deployment's volumes.EmptyDir.sizeLimit, consistent with ReplicaSets. The code here should do the trick: https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/registry/extensions/deployment/strategy.go#L90-L91, except that the Deployment needs to be updated to trigger this cleanup code.

Workaround

For anyone who hits this issue, updating the Deployment will trigger https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/registry/extensions/deployment/strategy.go#L90-L91 and thus solve the problem automatically.
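
Concretely, any update to the Deployment object is enough to trigger that code path, for example adding or bumping an annotation. A sketch of doing that with client-go (the annotation key is made up and a recent client-go API is assumed):

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// touchDeployment applies a metadata-only merge patch; the resulting update
// still goes through the Deployment strategy's PrepareForUpdate, which clears
// the disabled sizeLimit field from the stored template.
func touchDeployment(ctx context.Context, c kubernetes.Interface, ns, name string) error {
    patch := []byte(`{"metadata":{"annotations":{"example.com/touch":"1"}}}`)
    _, err := c.AppsV1().Deployments(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}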

@kubernetes/sig-apps-bugs @liggitt

janetkuo added the sig/apps label Dec 14, 2017
janetkuo self-assigned this Dec 14, 2017
k8s-ci-robot added the kind/bug label Dec 14, 2017
janetkuo (Member, Author) commented:

Another hack to prevent the hot loop is to call DropDisabledAlphaFields in deployment controller code where it compares pod templates:

// DropDisabledAlphaFields removes disabled fields from the pod spec.
// This should be called from PrepareForCreate/PrepareForUpdate for all resources containing a pod spec.
func DropDisabledAlphaFields(podSpec *api.PodSpec) {
    if !utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
        podSpec.Priority = nil
        podSpec.PriorityClassName = ""
    }
    if !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        for i := range podSpec.Volumes {
            if podSpec.Volumes[i].EmptyDir != nil {
                podSpec.Volumes[i].EmptyDir.SizeLimit = nil
            }
        }
    }
    for i := range podSpec.Containers {
        DropDisabledVolumeMountsAlphaFields(podSpec.Containers[i].VolumeMounts)
    }
    for i := range podSpec.InitContainers {
        DropDisabledVolumeMountsAlphaFields(podSpec.InitContainers[i].VolumeMounts)
    }
}
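
A hypothetical sketch (not actual controller code) of how the comparison site could use it, assuming the same api package as above plus apiequality "k8s.io/apimachinery/pkg/api/equality": normalize copies of both pod templates before the equality check used to find the new ReplicaSet.

// templatesEqualIgnoringDisabledFields compares two pod templates after
// applying the same normalization the registry applies, so a template whose
// disabled fields were dropped on write still matches its source template.
func templatesEqualIgnoringDisabledFields(a, b api.PodTemplateSpec) bool {
    // Work on deep copies so the cached/stored objects are not mutated.
    aCopy, bCopy := a.DeepCopy(), b.DeepCopy()
    DropDisabledAlphaFields(&aCopy.Spec)
    DropDisabledAlphaFields(&bCopy.Spec)
    return apiequality.Semantic.DeepEqual(aCopy, bCopy)
}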

liggitt (Member) commented Dec 14, 2017

The Deployment cannot find its new ReplicaSet because of the template change, so it continuously creates new ReplicaSets, which still end up with a different template after creation.

This is a fragile check, and means deployments can already encounter this in the face of admission plugin modifications, not just because of this specific field.

janetkuo (Member, Author) commented:

This is a fragile check, and means deployments can already encounter this in the face of admission plugin modifications, not just because of this specific field.

Agree, but there's no better existing way to check it. This is also documented here (re. mutating webhooks): https://github.com/kubernetes/website/pull/6650/files#diff-50fc51cb7d01e2cae2085d75b41e9ce8R324

liggitt (Member) commented Dec 14, 2017

Agree, but there's no better existing way to check it.

Seems like a perfect use case for generation... on creation, record the generation of the deployment the replicaset is being created from. The new replicaset generation gets set to 1. If the replicaset spec is modified, its generation is incremented and can no longer be assumed to match the deployment spec.

janetkuo (Member, Author) commented Dec 14, 2017

If the replicaset spec is modified, its generation is incremented and can no longer be assumed to match the deployment spec.

What if the replicaset is created manually and then adopted by the deployment? What if the user decided to roll back to a previous version of deployment (using replicaset as history)? This will break both use cases. What's more, the deployment should only care about templates, not other things in the spec, such as replicas.

liggitt (Member) commented Dec 14, 2017

What if the replicaset is created manually and then adopted by the deployment?

it could record the name and generation it adopted

What if the user decided to roll back to a previous version of deployment (using replicaset as history)?

it could record the name and generation it rolled back to

What's more, the deployment should only care about templates, not other things in the spec, such as replicas.

That doesn't seem right. I'd expect things auto-modifying scale to be controlling the top of the object chain, otherwise scale changes would be lost when rolling out the next version of the deployment.

smarterclayton (Contributor) commented:

I’m pretty sure that if deployments are completely broken when people very reasonably try to initialize / default fields on a pod spec, then deployments may not be sufficiently well designed.

I think it’s reasonable to convert a tag to an image sha when a rs is created. It’s also reasonable to set resources, or add annotations to a pod template, or turn a config map ref into a copied sha.

I will note we don’t have this option with StatefulSets, so callers may still have to solve this via direct mutation of the set in some cases, and we could argue consistency matters between deployment and stateful set more than flexibility on RS

janetkuo (Member, Author) commented:

@liggitt

it could record the name and generation it adopted

But the RS already has generation=X set before it's adopted. How does the deployment use its own generation to compare with the adopted RS's generation?

it could record the name and generation it rolled back to

When rolling back, the deployment's template gets updated and then its generation++ (say it's N). Who will update RS's generation to make it match N?

That doesn't seem right. I'd expect things auto-modifying scale to be controlling the top of the object chain, otherwise scale changes would be lost when rolling out the next version of the deployment.

Scale changes have never been recorded in workloads. It's a fundamental design decision made early on. Rollouts are only triggered by template updates, and rollbacks never touch anything except templates. When users roll back, their workloads won't be scaled, nor will anything else such as the rollout strategy be updated. This is implemented in the other workloads APIs too.

re revision comparison:

We solved this revision-comparison problem before with DaemonSet. We implemented templateGeneration, which is only increased when the template is updated, and labeled the child resources with the parent's templateGeneration. This was needed for DaemonSet at the time because DaemonSet had no history object then, so we couldn't compare its template with its pods. templateGeneration was later deprecated with the introduction of the history object (ControllerRevision), and for consistency.

Another downside of templateGeneration is that when a user wants to delete and recreate a Deployment with orphan adoption (kubectl delete deployment --cascade=false), they need to manually set the Deployment's templateGeneration, otherwise the Deployment's history is messed up.

Also, with templateGeneration, when the user rolls back a Deployment, the Deployment needs to update its own spec (applying the template from an old ReplicaSet), and then update that ReplicaSet's label (it becomes the new ReplicaSet after the rollback) with the Deployment's updated templateGeneration. This can't be done atomically. Then how does the Deployment find its new ReplicaSet if the label update fails?

Another open question w.r.t. mutating webhooks and workloads: if a webhook controller is updated, should it trigger rollouts? For example, if an RS is now mutated differently on creation, should the Deployment start a new rollout? If it should, the generation approach won't work either.

janetkuo (Member, Author) commented:

@smarterclayton

I think it’s reasonable to convert a tag to an image sha when a rs is created. It’s also reasonable to set resources, or add annotations to a pod template, or turn a config map ref into a copied sha.

Is there a reason not to initialize/default this at pod-level or deployment-level?

If a webhook mutates ReplicaSets created by Deployments, it changes Deployment history too. It may have side effects on rollback; is that a concern too?

I wish we could simply diff rs and deployment with strategic merge.
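
For illustration, a rough sketch of that idea using apimachinery's strategicpatch package (just an illustration of the diff, not a full proposal that handles defaulting or admission changes): compute a two-way strategic merge patch from the ReplicaSet's template to the Deployment's template and treat an empty patch as "no template change".

package sketch

import (
    "encoding/json"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
)

// templatesMatch reports whether applying the Deployment's template on top of
// the ReplicaSet's template would change nothing, i.e. the strategic merge
// patch between them is empty.
func templatesMatch(rsTemplate, deployTemplate v1.PodTemplateSpec) (bool, error) {
    original, err := json.Marshal(rsTemplate)
    if err != nil {
        return false, err
    }
    modified, err := json.Marshal(deployTemplate)
    if err != nil {
        return false, err
    }
    patch, err := strategicpatch.CreateTwoWayMergePatch(original, modified, v1.PodTemplateSpec{})
    if err != nil {
        return false, err
    }
    return string(patch) == "{}", nil
}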

mattmoor (Contributor) commented:

Is there a reason not to initialize/default this at pod-level or deployment-level?

IIUC Initializers only run at creation, so doing this at the Deployment level would miss updates.

For tasks like resolving tag->digest, you want this to happen at a cut-point in the deployment process (to give strong consistency across replication), so per-Pod resolution is not much better than imagePullPolicy: Always.

smarterclayton (Contributor) commented Jan 3, 2018 via email

janetkuo (Member, Author) commented:

OTOH, DropDisabledAlphaFields (clears fields in registry) should be done only at pod level, but not pod template level.

liggitt (Member) commented Jan 10, 2018

OTOH, DropDisabledAlphaFields (clears fields in registry) should be done only at pod level, but not pod template level.

No, alpha fields should not be persisted in any object.

dhilipkumars commented:

@janetkuo Are rollbacks not deprecated in apps/v1?

janetkuo (Member, Author) commented Jan 25, 2018

The rollbackTo field was deprecated, but the rollback behavior is still supported in kubectl.

We deprecated this field not because we didn't want to support rollback, but because we didn't want the controller to mutate its own spec.

kow3ns added this to Backlog in Workloads Feb 26, 2018
kow3ns moved this from Backlog to In Progress in Workloads Feb 27, 2018
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Apr 25, 2018
janetkuo (Member, Author) commented:

/remove-lifecycle stale
/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Apr 26, 2018
janetkuo (Member, Author) commented:

I wrote a doc to discuss the intersection of Deployment & mutating admission controllers: https://goo.gl/1JEEhS

smarterclayton (Contributor) commented:

I came back to this, but this also means you can't toggle a cluster to enable a feature gate without potentially causing your deployments to go into a hot loop (for this and other DropDisabledTemplateFields, of which we have a lot more now).

liggitt added the priority/important-soon and area/reliability labels Jun 12, 2019