
Deployment enters creation hot-loop when rs field is mutated by API server #57167

Open · janetkuo opened this issue Dec 14, 2017 · 19 comments
Labels: area/reliability, kind/bug, lifecycle/frozen, priority/important-soon, sig/apps

janetkuo (Member, Author) commented Dec 14, 2017

/kind bug

What Happened

A Deployment enters a hot loop of creating new ReplicaSets.

In the Deployment's spec.template:

      volumes:
      - emptyDir:
          sizeLimit: "0"
        name: foo

In its ReplicaSet's spec.template:

      volumes:
      - emptyDir: {}
        name: foo

This happens when you create a Deployment that specifies volumes.EmptyDir in 1.7.0–1.7.5 and then upgrade the cluster to >= 1.8.0 with LocalStorageCapacityIsolation disabled.

Root Cause

Some background information:

  1. In the pod spec, a new optional field volumes.EmptyDir.sizeLimit was introduced in 1.7.0, but it was incorrectly declared as a non-pointer resource.Quantity type.
  2. To fix this, the field was later changed to the pointer type *resource.Quantity (Change SizeLimit to a pointer #50163) in 1.8.0 and backported to 1.7.6.
  3. In 1.8.0, this field is set to nil if the LocalStorageCapacityIsolation feature isn’t enabled:
    https://github.com/kubernetes/kubernetes/blob/v1.8.0/pkg/api/pod/util.go#L242 (this fix will soon be cherry-picked to the next 1.7.x release).
  4. A Deployment creates a new ReplicaSet by copying its own template into the ReplicaSet, and it finds its new ReplicaSet by comparing the Deployment's template against the ReplicaSets it owns.

If you create a Deployment that specifies volumes.EmptyDir in 1.7.0–1.7.5, sizeLimit is incorrectly defaulted to "0", because of (1) above.

      volumes:
      - emptyDir:
          sizeLimit: "0"
        name: foo
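
For illustration, here is a minimal, self-contained Go sketch of why the non-pointer field defaults to "0" (the Quantity type below is a simplified stand-in for resource.Quantity, not the real implementation): encoding/json's omitempty never elides a struct value, so an unset value-typed sizeLimit still serializes as "0", while a nil pointer is dropped entirely.

package main

import (
    "encoding/json"
    "fmt"
)

// Quantity is a simplified stand-in for resource.Quantity: a struct type
// whose zero value marshals as the string "0".
type Quantity struct{ s string }

func (q Quantity) MarshalJSON() ([]byte, error) {
    if q.s == "" {
        return []byte(`"0"`), nil
    }
    return []byte(`"` + q.s + `"`), nil
}

// emptyDirValue mirrors the 1.7.0-1.7.5 shape: omitempty never elides a
// struct value, so an unset sizeLimit still serializes as "0".
type emptyDirValue struct {
    SizeLimit Quantity `json:"sizeLimit,omitempty"`
}

// emptyDirPointer mirrors the 1.7.6/1.8.0 fix: a nil pointer means "unset"
// and the field is omitted from the output.
type emptyDirPointer struct {
    SizeLimit *Quantity `json:"sizeLimit,omitempty"`
}

func main() {
    v, _ := json.Marshal(emptyDirValue{})
    p, _ := json.Marshal(emptyDirPointer{})
    fmt.Println(string(v)) // {"sizeLimit":"0"}
    fmt.Println(string(p)) // {}
}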

If you then upgrade the cluster to 1.8.0, the sizeLimit: "0" in the ReplicaSet will be cleared, because of (3) above.

The Deployment cannot find its new ReplicaSet because of the template change, so it continuously creates new ReplicaSets, which still end up with a different template after creation.

Solution

A possible long-term solution is to implement Create() in dry-run mode and have Deployments use the dry-run-created ReplicaSet template (instead of the Deployment template) to compare against and find the current ReplicaSet.
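
A rough sketch of what that could look like, assuming a later client-go release where server-side dry-run is available (this is not the actual controller code): create the candidate ReplicaSet with DryRun so the API server runs defaulting and admission without persisting the object, then compare the returned template against the owned ReplicaSets instead of the raw Deployment template.

package sketch

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// dryRunReplicaSet asks the API server to run defaulting, admission, and
// validation for the candidate ReplicaSet without persisting it, returning
// the object as it would actually be stored.
func dryRunReplicaSet(ctx context.Context, c kubernetes.Interface, rs *appsv1.ReplicaSet) (*appsv1.ReplicaSet, error) {
    return c.AppsV1().ReplicaSets(rs.Namespace).Create(ctx, rs, metav1.CreateOptions{
        DryRun: []string{metav1.DryRunAll},
    })
}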

A possible short-term solution is to implement a hack that clears the Deployment's volumes.EmptyDir.sizeLimit, consistent with ReplicaSets. The code here should do the trick: https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/registry/extensions/deployment/strategy.go#L90-L91, except that the Deployment needs to be updated to trigger this cleanup code.

Workaround

For anyone who hits this issue, updating the Deployment will trigger https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/registry/extensions/deployment/strategy.go#L90-L91 and thus solve the problem automatically.
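
Concretely, any update to the Deployment object is enough to trigger that code path, for example adding or bumping an annotation. A sketch of doing that with client-go (the annotation key is made up and a recent client-go API is assumed):

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// touchDeployment applies a metadata-only merge patch; the resulting update
// still goes through the Deployment strategy's PrepareForUpdate, which clears
// the disabled sizeLimit field from the stored template.
func touchDeployment(ctx context.Context, c kubernetes.Interface, ns, name string) error {
    patch := []byte(`{"metadata":{"annotations":{"example.com/touch":"1"}}}`)
    _, err := c.AppsV1().Deployments(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}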

@kubernetes/sig-apps-bugs @liggitt

janetkuo added the sig/apps label Dec 14, 2017
janetkuo self-assigned this Dec 14, 2017
k8s-ci-robot added the kind/bug label Dec 14, 2017
janetkuo (Member, Author) commented:

Another hack to prevent the hot loop is to call DropDisabledAlphaFields in deployment controller code where it compares pod templates:

// DropDisabledAlphaFields removes disabled fields from the pod spec.
// This should be called from PrepareForCreate/PrepareForUpdate for all resources containing a pod spec.
func DropDisabledAlphaFields(podSpec *api.PodSpec) {
    if !utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
        podSpec.Priority = nil
        podSpec.PriorityClassName = ""
    }
    if !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        for i := range podSpec.Volumes {
            if podSpec.Volumes[i].EmptyDir != nil {
                podSpec.Volumes[i].EmptyDir.SizeLimit = nil
            }
        }
    }
    for i := range podSpec.Containers {
        DropDisabledVolumeMountsAlphaFields(podSpec.Containers[i].VolumeMounts)
    }
    for i := range podSpec.InitContainers {
        DropDisabledVolumeMountsAlphaFields(podSpec.InitContainers[i].VolumeMounts)
    }
}
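
A hypothetical sketch (not actual controller code) of how the comparison site could use it, assuming the same api package as above plus apiequality "k8s.io/apimachinery/pkg/api/equality": normalize copies of both pod templates before the equality check used to find the new ReplicaSet.

// templatesEqualIgnoringDisabledFields compares two pod templates after
// applying the same normalization the registry applies, so a template whose
// disabled fields were dropped on write still matches its source template.
func templatesEqualIgnoringDisabledFields(a, b api.PodTemplateSpec) bool {
    // Work on deep copies so the cached/stored objects are not mutated.
    aCopy, bCopy := a.DeepCopy(), b.DeepCopy()
    DropDisabledAlphaFields(&aCopy.Spec)
    DropDisabledAlphaFields(&bCopy.Spec)
    return apiequality.Semantic.DeepEqual(aCopy, bCopy)
}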

liggitt (Member) commented Dec 14, 2017

The Deployment cannot find its new ReplicaSet because of the template change, so it continuously creates new ReplicaSets, which still end up with a different template after creation.

This is a fragile check, and means deployments can already encounter this in the face of admission plugin modifications, not just because of this specific field.

janetkuo (Member, Author) commented:

This is a fragile check, and means deployments can already encounter this in the face of admission plugin modifications, not just because of this specific field.

Agree, but there's no better existing way to check it. This is also documented here (re. mutating webhooks): https://github.com/kubernetes/website/pull/6650/files#diff-50fc51cb7d01e2cae2085d75b41e9ce8R324

liggitt (Member) commented Dec 14, 2017

Agree, but there's no better existing way to check it.

Seems like a perfect use case for generation... on creation, record the generation of the deployment the replicaset is being created from. The new replicaset generation gets set to 1. If the replicaset spec is modified, its generation is incremented and can no longer be assumed to match the deployment spec.

janetkuo (Member, Author) commented Dec 14, 2017

If the replicaset spec is modified, its generation is incremented and can no longer be assumed to match the deployment spec.

What if the replicaset is created manually and then adopted by the deployment? What if the user decided to roll back to a previous version of deployment (using replicaset as history)? This will break both use cases. What's more, the deployment should only care about templates, not other things in the spec, such as replicas.

liggitt (Member) commented Dec 14, 2017

What if the replicaset is created manually and then adopted by the deployment?

it could record the name and generation it adopted

What if the user decided to roll back to a previous version of deployment (using replicaset as history)?

it could record the name and generation it rolled back to

What's more, the deployment should only care about templates, not other things in the spec, such as replicas.

That doesn't seem right. I'd expect things auto-modifying scale to be controlling the top of the object chain, otherwise scale changes would be lost when rolling out the next version of the deployment.

smarterclayton (Contributor) commented:

I’m pretty sure that if deployments are completely broken when people very reasonably try to initialize / default fields on a pod spec, then deployments may not be sufficiently well designed.

I think it’s reasonable to convert a tag to an image sha when a rs is created. It’s also reasonable to set resources, or add annotations to a pod template, or turn a config map ref into a copied sha.

I will note we don’t have this option with StatefulSets, so callers may still have to solve this via direct mutation of the set in some cases, and we could argue consistency matters between deployment and stateful set more than flexibility on RS

janetkuo (Member, Author) commented:

@liggitt

it could record the name and generation it adopted

But the RS already has generation=X set before it's adopted. How does the deployment use its own generation to compare with the adopted RS's generation?

it could record the name and generation it rolled back to

When rolling back, the deployment's template gets updated and then its generation++ (say it's N). Who will update RS's generation to make it match N?

That doesn't seem right. I'd expect things auto-modifying scale to be controlling the top of the object chain, otherwise scale changes would be lost when rolling out the next version of the deployment.

Scale changes have never been recorded in workloads. It's a fundamental design decision made early on. Rollouts are only triggered by template updates, and rollbacks never touch anything except templates. When users roll back, their workloads won't be scaled, nor will anything else such as the rollout strategy be updated. This is implemented in the other workloads APIs too.

re revision comparison:

We solved this revision-comparison problem before with DaemonSet. We implemented templateGeneration, which is only increased when the template is updated, and labeled the child resources with the parent's templateGeneration. This was needed for DaemonSet at the time because DaemonSet had no history object then, so we couldn't compare its template with its pods. templateGeneration was later deprecated with the introduction of the history object (ControllerRevision), and for consistency.

Another downside of templateGeneration is that when a user wants to delete and recreate a Deployment with orphan adoption (kubectl delete deployment --cascade=false), they need to manually set the Deployment's templateGeneration, otherwise the Deployment's history is messed up.

Also, with templateGeneration, when the user rolls back a Deployment, the Deployment needs to update its own spec (applying the template from an old ReplicaSet), and then update that ReplicaSet's label (it becomes the new ReplicaSet after the rollback) with the Deployment's updated templateGeneration. This can't be done atomically. Then how does the Deployment find its new ReplicaSet if the label update fails?

Another open question w.r.t. mutating webhooks and workloads: if a webhook controller is updated, should it trigger rollouts? For example, if an RS is now mutated differently on creation, should the Deployment start a new rollout? If it should, the generation approach won't work either.

janetkuo (Member, Author) commented:

@smarterclayton

I think it’s reasonable to convert a tag to an image sha when a rs is created. It’s also reasonable to set resources, or add annotations to a pod template, or turn a config map ref into a copied sha.

Is there a reason not to initialize/default this at pod-level or deployment-level?

If a webhook mutates ReplicaSets created by Deployments, it changes Deployment history too. It may have side effects on rollback; is that a concern too?

I wish we could simply diff rs and deployment with strategic merge.
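
For illustration, a rough sketch of that idea using apimachinery's strategicpatch package (just an illustration of the diff, not a full proposal that handles defaulting or admission changes): compute a two-way strategic merge patch from the ReplicaSet's template to the Deployment's template and treat an empty patch as "no template change".

package sketch

import (
    "encoding/json"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
)

// templatesMatch reports whether applying the Deployment's template on top of
// the ReplicaSet's template would change nothing, i.e. the strategic merge
// patch between them is empty.
func templatesMatch(rsTemplate, deployTemplate v1.PodTemplateSpec) (bool, error) {
    original, err := json.Marshal(rsTemplate)
    if err != nil {
        return false, err
    }
    modified, err := json.Marshal(deployTemplate)
    if err != nil {
        return false, err
    }
    patch, err := strategicpatch.CreateTwoWayMergePatch(original, modified, v1.PodTemplateSpec{})
    if err != nil {
        return false, err
    }
    return string(patch) == "{}", nil
}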

mattmoor (Contributor) commented:

Is there a reason not to initialize/default this at pod-level or deployment-level?

IIUC Initializers only run at creation, so doing this at the Deployment level would miss updates.

For tasks like resolving tag->digest, you want this to happen at a cut-point in the deployment process (to give strong consistency across replication), so per-Pod resolution is not much better than imagePullPolicy: Always.

smarterclayton (Contributor) commented Jan 3, 2018 via email

janetkuo (Member, Author) commented:

OTOH, DropDisabledAlphaFields (clears fields in registry) should be done only at pod level, but not pod template level.

liggitt (Member) commented Jan 10, 2018

OTOH, DropDisabledAlphaFields (clears fields in registry) should be done only at pod level, but not pod template level.

No, alpha fields should not be persisted in any object.

dhilipkumars commented:

@janetkuo Are rollbacks not deprecated in apps/v1?

janetkuo (Member, Author) commented Jan 25, 2018

The rollbackTo field was deprecated, but the rollback behavior is still supported in kubectl.

We deprecated this field not because we didn't want to support rollback, but because we didn't want the controller to mutate its own spec.

kow3ns added this to Backlog in Workloads Feb 26, 2018
kow3ns moved this from Backlog to In Progress in Workloads Feb 27, 2018
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Apr 25, 2018
janetkuo (Member, Author) commented:

/remove-lifecycle stale
/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Apr 26, 2018
janetkuo (Member, Author) commented:

I wrote a doc to discuss the intersection of Deployment & mutating admission controllers: https://goo.gl/1JEEhS

smarterclayton (Contributor) commented:

I came back to this, but this also means you can't toggle a cluster to enable a feature gate without potentially causing your deployments to go into a hot loop (for this and other DropDisabledTemplateFields, of which we have a lot more now).

liggitt added the priority/important-soon and area/reliability labels Jun 12, 2019