
KEP for garbage collection #2287

Closed
wants to merge 1 commit into master from janetkuo:gc-design

Conversation

@janetkuo
Member

janetkuo commented Jun 19, 2018

Expand Kubernetes garbage collection to handle common use cases that are awkward today. We will add an object-lifecycle-respecting TTL mechanism to the garbage collector. We will also allow users to mark resource objects to be cleaned up by the garbage collector after those objects are no longer in use: either terminated, completed, or not referenced by other resource objects.

This KEP covers existing problems and an overview of a proposed solution. Implementation details are discussed separately (also linked in KEP):

  1. Object-lifecycle-respecting TTL mechanism: https://goo.gl/9AUQ3n
  2. Garbage collect resources that finish execution: https://goo.gl/ws9t3b
  3. Garbage collect resources that are not used by other resources: https://goo.gl/KeHPiQ
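For orientation, the discussion below refers to two proposed opt-in metadata fields, `.metadata.gcttl` and `.metadata.gcifunused`. A minimal sketch of an object opting into both, using the field names as they appear in this thread (the exact schema is illustrative, not part of any released API):

```yaml
# Hypothetical schema based on the fields discussed in this KEP; neither
# field exists in the Kubernetes API as written here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-7b9f4    # illustrative name
  gcifunused: true          # opt in: eligible for GC once nothing references it
  gcttl: 86400              # wait at least 24h after becoming eligible before deletion
data:
  LOG_LEVEL: debug
```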

/assign @kow3ns @lavalamp @erictune
/cc @kow3ns @lavalamp @erictune
@liggitt @deads2k @smarterclayton @bgrant0607

ref kubernetes/kubernetes#22368, kubernetes/kubernetes#28486, kubernetes/kubernetes#64470

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 19, 2018

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: calebamiles

Assign the PR to them by writing /assign @calebamiles in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@janetkuo janetkuo force-pushed the janetkuo:gc-design branch from 05c3765 to 40f6d86 Jun 20, 2018


Current TTL-based garbage collection has the following issues:

* Not generic. Only supports Events.


@liggitt

liggitt Jun 20, 2018

Member

that's true for kubernetes resources... other apiservers using the apimachinery storage layer can make use of this today to set variable-length TTLs on objects based on fields in the object


@janetkuo

janetkuo Jun 21, 2018

Author Member

Yes. I should clarify that this point is specific to the built-in apiserver.

@liggitt

Member

liggitt commented Jun 20, 2018

The design has three main parts:

1. **Object-lifecycle-respecting TTL mechanism.** Provide observability before
deletion, and address the resource creation ordering problems. See [detailed


@liggitt

liggitt Jun 20, 2018

Member

address the resource creation ordering problems

it wasn't clear to me how this addressed that... was it because the TTL gives you a window within which the owning objects can be created? what would we recommend someone wanting to create a set of interdependent objects without worrying about creation order set this to? a few seconds? a few minutes? we probably don't want authors trying to guess how much latency the GC system and object creation might be subject to


@janetkuo

janetkuo Jun 21, 2018

Author Member

The main problem I want to solve is observability before deletion. We can solve the creation ordering problem with something more reliable, such as having the GC controller remember whether it has ever observed the owner.

If we don't want to encourage using TTLs to solve problems like creation order, we can make TTLs "GC after use" only. We can even combine the TTL and GCIfUnused fields. For example, have a "durability" field, which can be durable (default, don't GC after use) or ephemeral (GC after not used for 2d), etc.

design of GC after use (part 1)][].
1. **Garbage collect resources that are not used by other resources.** Extract
usage references of a resource object and delete the object when it is no
longer referenced. Garbage collector will count usage references and address


@liggitt

liggitt Jun 20, 2018

Member

inverting the graph direction garbage collection is responsible for has some significant implications:

  • latency or disruption in the GC controller's informers means that new usages of a GCIfUnused: true resource may go unobserved, so the GC can incorrectly treat that resource as unused and delete it
  • any inability to discover or reach any aggregated API has to halt GC of all GCIfUnused: true resources until resolved, since the undiscovered resources could contain usages that would prevent deletion


@janetkuo

janetkuo Jun 21, 2018

Author Member

The GC controller can do a sanity check before deleting a resource. This check could be expensive, though.

It'd be useful for GCIfUnused resources to specify the API group versions of their potential users. This would make queries less expensive, and an unreachable aggregated API would only affect the GCIfUnused resources that care about that API.


@lavalamp

lavalamp Jun 26, 2018

Member

To solve this problem, I suggested making the GC keep a per-group-version used/not used marker on objects opting in to this mechanism.

If the apiserver serving the group-version is down, it's true that GC can't observe new uses, but also true that people can't create new uses. So this leaves only a small window where someone adds a new use and then the apiserver goes down before GC observes it. This is likely preferable to blocking the whole GC (which could render the control plane inoperable) if an apiserver is down.


@janetkuo

janetkuo Jul 2, 2018

Author Member

Can we solve the informer latency problem with TTL, by requiring TTL to be larger than, say, 1 hour?


@caesarxuchao

caesarxuchao Jul 16, 2018

Member

A few concerns/thoughts on usage counting:

  • In the detailed doc, it says the ref count is recorded in memory. Does the GC kick in when the usage reference reaches 0 (vs. being 0)? If so, because the memory doesn't survive GC crashes, GC will miss the "usage reference reaching 0" events. If GC starts deletion anytime the usage count is 0, that could result in spuriously starting and then reverting deletions.
    If the usage count or the "per-group-version used/not used marker" is kept on the objects, how do we record it when the owner is created before the dependent?

  • Usage references are difficult to debug. As a user, if my configmap is not garbage collected, how can I know what's blocking its deletion?

  • "Can we solve the informer latency problem with TTL, by requiring TTL to be larger than, say, 1 hour?"
    Despite what we discussed offline, it might not work. Counter-example: If a user sets TTL of a configmap to 1 hour, the expectation is that if the configmap is used in the next hour, it will not be deleted. If the usage happens 59 mins after the configmap's creation, but GC's informer latency > 1 min, then the configmap will be wrongly deleted.

  • Thinking out loud, instead of introducing usage referencing, how about introducing an admission controller that updates the dependent's ownerReferences? It doesn't have the two problems liggitt mentioned. If we have such an admission controller, TTL-based GC, and optional UID in metadata.ownerRef, do we solve most use cases?

@janetkuo

Member Author

janetkuo commented Jun 21, 2018

Thanks @liggitt for the review. Great feedback.

* Have a low cognitive burden. Easy to set up and reason about what resources
should be garbage collected.
* Be efficient enough that it wouldn’t introduce a new scalability bottleneck.


@erictune

erictune Jun 22, 2018

Member

The above goals are great. Also, the doc mentions in several places avoiding cases where an excess number of objects are created and the apiserver/master slows down. I suggest calling that out as a specific goal. I think the way you state this goal will affect how one evaluates the rest of the design.

In particular around default behaviors and whether to change them. One way to state it is that you want to give users tools, so that, if they use them, they won't create too many objects. This does not imply changing default behaviors. Another way to describe it is that you explicitly want to make it much less likely that users/controller-authors who are not aware of the new GC mechanisms will break a cluster. This, I think, has to imply changing default behavior. I'm not sure which we are trying to accomplish here.

resource has at least one owner. However, not all resources have owners (i.e.
non-empty `.metadata.ownerReferences`). You can think of the resources that
don't have owners as root nodes in a DAG (directed acyclic graph). Those
resources without owners may never be deleted after they're no longer needed.


@erictune

erictune Jun 22, 2018

Member

Though kubectl apply --prune will delete unneeded root items.


@janetkuo

janetkuo Jun 27, 2018

Author Member

kubectl apply --prune requires users to manually manage those objects, i.e. think about whether a manifest can be removed or not.

design of TTL][].
1. **Garbage collect resources that finish execution.** Delete a resource object
when it finishes execution (either terminates or completes). See [detailed
design of GC after use (part 1)][].


@erictune

erictune Jun 22, 2018

Member

Some Helm Charts include jobs in their template. See for example Minio's post-install-create-bucket-job.yaml

What advice should we give to those Charts?

  1. set metadata.gcifunused to false, so that once the job is completed, the tool can always see that it is finished, and so it won't try to recreate it.
  2. Discourage use of Jobs and other "finishable" resources in the template of charts. Instead, move them to the "hooks" portion of helm charts. If so, then do other declarative tools like kubectl apply need a "hooks" concept too? (How does Helm know it has run the hooks, by the way, or is it best effort?)
  3. Ensure that the job is idempotent, and so it is okay if it gets GC'ed and Helm re-creates it.


@janetkuo

janetkuo Jun 22, 2018

Author Member

Totally agree with the advice.

In most cases "finishable" resources should be hooks. The Job used in the Minio chart is actually a hook -- it has hook related annotations. Finishable resources shouldn't be GC'd after use and their lifecycle should be managed by Helm. Helm provides cleanup after hook success/failure.

I like the idea of adding hooks to kubectl apply. Users applying configs in bulk should follow the same advice for Charts.
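For reference, a chart Job managed as a Helm hook rather than as a regular templated resource is declared roughly like this (the annotation keys are Helm's documented hook mechanism; the Job itself is a made-up example):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: post-install-create-bucket               # illustrative
  annotations:
    "helm.sh/hook": post-install                 # run after the release is installed
    "helm.sh/hook-delete-policy": hook-succeeded # Helm deletes the Job once it succeeds
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: create-bucket
        image: example/bucket-init:1.0           # placeholder image
        command: ["create-bucket"]
```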

deletion, and address the resource creation ordering problems. See [detailed
design of TTL][].
1. **Garbage collect resources that finish execution.** Delete a resource object
when it finishes execution (either terminates or completes). See [detailed


@erictune

erictune Jun 22, 2018

Member

I like the referenced detailed design a lot.

usage references of a resource object and delete the object when it is no
longer referenced. Garbage collector will count usage references and address
the problem of unbounded list of owner references. See [detailed design of GC
after use (part 2)][].


@erictune

erictune Jun 22, 2018

Member

I also like the referenced detailed design (part 2) a lot.

performance and scalability issues. The timestamp only needs to be updated when
a resource is either in use or not in use.

Combining with TTL, users can specify the earliest safe time to clean up


@erictune

erictune Jun 22, 2018

Member

What is the default value for .metadata.gcttl?

If there is a default, is it for all types or is it a per-type value (which would be weird for something in metadata)?

If there is no default, how does this help protect etcd from users who fail to opt in? Even if we also have resource quotas, users still need some help figuring out which resources are "oldest" and should be considered for deletion. Can they use any of the proposed new features to do a "manual cleanup" once they realize they need one?


@janetkuo

janetkuo Jun 22, 2018

Author Member

Adding a default TTL should be fine, as it'll only make GC wait longer before deleting things.

I think what you meant is making resources be cleaned up by default (metadata.gcifunused = true) with some TTL value? I'm worried that it would break people unexpectedly, and there's no way to recover from a delete.

In the "Future Work - Namespace-wide Garbage Collection Policy" section in this KEP, I proposed allowing users to create a "clean up policy" object that sets a default cleanup policy for resources. We can also create a global policy resource or allow cluster admins to use a flag to set a global default value.


@tnozicka

tnozicka Jul 3, 2018

If you set a default GC delay via gcttl, will it affect cascading deletion?


@janetkuo

janetkuo Jul 9, 2018

Author Member

If you set a default GC delay via gcttl, will it affect cascading deletion?

It'll affect background cascading deletion.

### Clean Up Resources After Use

A resource is no longer *in use* either when it finishes execution (such as a
terminated Pod or a completed Job), or when it’s not referenced/used by other


@erictune

erictune Jun 22, 2018

Member

Does Cron controller need to see completed jobs to update its status and decide whether or not to launch new ones? What strategy should it use?


@janetkuo

janetkuo Jun 22, 2018

Author Member

CronJob only needs to see the Jobs that haven't completed, so that it can react based on its concurrency policy. It doesn't need to see completed Jobs.

The proposed GC will respect owner references; therefore, Jobs owned by a CronJob won't be deleted by the new GC after they complete. That's taken care of by the CronJob controller using the job history limits.

* Owner references of a resource may be excessively long. Resources like
ConfigMap can be referenced by Pods, ReplicaSets, and Deployments. If a
Deployment has 1k replicas, it will create 1k Pods and the ConfigMap needs to
add those 1k Pods to the ConfigMap’s owner references list. It will be easy


@erictune

erictune Jun 22, 2018

Member

Instead of pod-to-configMap owner references, what about pod-to-replicaSet ownerReferences, assuming the ReplicaSets of the Deployment are immutable?


@janetkuo

janetkuo Jun 22, 2018

Author Member

Configmaps can be "owned" by lots of pods (any pod that uses the configmap owns it), but pods will usually only be "owned" 1 replicaset.


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

This raises the question of whether we should enforce that pods are owned by at most one parent (0 or 1) at a time. This one-to-one mapping (if the pod is owned) may enable a GC optimization in the future.

need to change the configuration of their applications. They don’t need to
manually check which ConfigMaps and Secrets are not used and can be cleaned
up. If the ConfigMaps and Secrets are shared among applications, other
applications can still use them before TTL expires.


@erictune

erictune Jun 22, 2018

Member

Consider this scenario:

  1. There is a Deployment, which has 1 RS, which has 1 replica pod. The Pod uses a ConfigMap.
  2. Then, something happens that makes the pod go away. Examples:
    1. HPA controls the deployment. Scales it to zero for a while.
    2. User accidentally updates the pod spec to one which is invalid, so the controller can't create it, and goes to a long lunch. And something causes the original pod to be deleted first, e.g. deployment settings.
    3. Pod fails, and deployment controller cannot re-create right now (quota, authz issue, bad version of controller-managed updated, network partition, etc.)
  3. An hour passes
  4. Configmap is GCed.
  5. Whatever happened in step 2 is fixed
  6. System fails to return to normal because the configmap is lost.


@janetkuo

janetkuo Jun 23, 2018

Author Member

Thanks that's a good example!

Actually I've started thinking about combining .metadata.gcttl and .metadata.gcifunused into one field named durability (#2287 (comment)), as TTL seems less useful for owner references. An object's durability could be durable, ephemeral, or something else. Each durability level comes with a sane TTL value.


@erictune

erictune Jun 27, 2018

Member

The answer you gave me in person is that there is a usage reference from the pod to the RS and to the deployment's pod template, so when it is scaled to zero it is still kept alive.


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

I think the notion of durability is hard to achieve because different applications will have different sane TTL values. +1 for usage references to handle the case of scaling down to zero.


@mattfarina

mattfarina Nov 2, 2018

Member

@janetkuo do you have an update to the kep for durability?

@jdumars jdumars added this to In SIG Review in KEP Tracking Jun 25, 2018


Generic TTL-based garbage collection will allow the user to set a TTL on each of
Kubernetes resources. The TTL starts when a resource is eligible for garbage
collection and stops when the resource is no longer eligible; the system will


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

Does the TTL reset to zero when the resource becomes no longer eligible for garbage collection?


@janetkuo

janetkuo Jul 2, 2018

Author Member

To clarify, TTL value is not reset, but the TTL timer is reset. See the diagram.

frequently created, short-lived resources, those resources can accumulate and
overload a Kubernetes cluster very easily. Even if we can avoid the overload
issue by implementing a cluster-wide (global) resource quota, users won't be
able to create new resources without cleaning up old ones first.


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

Do you mean that global resource quota is inefficient because the users have to wait for the old resources to be deleted before creating the new resources? It seems to me that the quota approach does work, but I still think the proposed TTL approach is better.

Another way is to create a global parent object for Jobs with no CronJob parent, Secrets, and ConfigMaps to trigger OwnerReference-based garbage collection.


@janetkuo

janetkuo Jul 2, 2018

Author Member

TTL is complementary to quota. Quota only stops you from creating more than the allowed capacity; you cannot move forward without deleting objects. Enhanced GC lets you clean up objects easily and even avoid hitting resource quota.

A global parent only allows those resource objects to be deleted along with the parent. However, we'd like to clean up objects when they're not needed. The fact that one child resource isn't needed doesn't imply that the parent object isn't needed.

user expects a resource to be needed intermittently and/or want to keep the
resource around for a while before garbage collection kicks in.

Note that we will not use etcd TTL to support this feature, because etcd TTLs


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

It sounds to me like the proposed TTL GC is a generic version of the etcd TTL that works only on Events. Will the proposed TTL GC be able to replace etcd TTL in the future? To take it a step further, I think the proposed TTL GC should have the long-run goal of unifying all GC flavors (OwnerReference, Capacity, and TTL) into one.


@janetkuo

janetkuo Jul 2, 2018

Author Member

How we should handle Event TTL is still under discussion. TTL GC won't replace existing Event TTL mechanism in the first (alpha) release.

Secrets and ConfigMaps as immutable resources, and therefore Secrets and
ConfigMaps are frequently created and deployed to a Kubernetes cluster, but
never updated or cleaned up. Secrets and ConfigMaps accumulate over time, which
could cause master component downtime.


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

+1 for having immutable Secrets, but I am neutral on ConfigMaps. I think ConfigMaps should provide enough versatility for minor config or script injection changes.


@janetkuo

janetkuo Jul 2, 2018

Author Member

The downside of updating ConfigMaps in place is that it doesn't work well with the rolling update features in the workloads APIs. You don't have control over when a ConfigMap change will be rolled out to a Pod, it doesn't block bad rollouts, and you cannot roll back.


@tnozicka

tnozicka Jul 3, 2018

We were/are actually thinking about creating a top-level object to allow for ConfigMap/Secret snapshotting and also triggering rollouts. (It would also have a revisionHistoryLimit and be bound to a particular RS or ControllerRevision.) /cc @mfojtik

I don't think Secrets/ConfigMaps should be immutable per type, but rather per instance - like having metadata.immutable=true - as there are always different use cases across the ecosystem.


@crimsonfaith91

crimsonfaith91 Jul 3, 2018

Contributor

+1 on @tnozicka's suggestion of creating a top-level object for managing ConfigMap/Secret snapshots and rollouts.

If we are going to introduce a field indicating immutability preference, it may be better to add a standalone .immutable field outside of .metadata.


@janetkuo

janetkuo Jul 9, 2018

Author Member

I'm aware that RH has something called "trigger controller" and @kargakis had a proposal for it earlier.

This GC proposal isn't about making resources like ConfigMaps/Secrets immutable (see the "Non-Goals" section in this proposal), but about supporting the use cases that treat ConfigMaps/Secrets as immutable resources, i.e. keep creating new ones and update the references in the pod spec, but never update existing ConfigMaps/Secrets. kustomize is one example, and kubectl create configmap has an --append-hash flag that appends a hash suffix to a ConfigMap's name, which makes it easy to roll out newly generated ConfigMaps.
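As a concrete illustration of that pattern, a kustomize configMapGenerator emits a ConfigMap whose name carries a content hash and rewrites the references in the pod spec, so every config change produces a new object (the names below are made up):

```yaml
# kustomization.yaml (illustrative)
resources:
- deployment.yaml
configMapGenerator:
- name: app-config
  literals:
  - LOG_LEVEL=debug
# kustomize generates something like ConfigMap/app-config-5f9c8b7d2k and
# rewrites references in deployment.yaml to the hashed name; the superseded
# ConfigMaps are left behind, which is exactly what this KEP wants to clean up.
```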


@mattfarina

mattfarina Nov 2, 2018

Member

FYI, many take advantage of Secrets being mutable. You see this in many Helm charts. Helm can handle rolling back and changing secret values today. I imagine others do this, too. Making them immutable should require v2 Secrets.

The ability to set a TTL could help simplify the cleanup controller code some apps have today.

resources after use, without regressions.

This will be promoted to GA once it has spent a sufficient amount of time as beta
with no changes.


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

+1 on this. For a GC to become GA, the best yardstick is to have it working for a predefined period of time (e.g., 6 months) without encountering any substantial issue.

However, if we are going ahead with the direction of unifying all GC flavors, I am not sure whether we should start thinking about it when the proposed TTL GC is beta or GA.

We can implement a garbage collection policy that applies to all resources in
the same namespace by default for easier garbage collection policy management.

### Move Event TTL out of etcd TTL


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

+1 on this. This helps in decoupling API server and etcd.

The biggest problem caused by Pod GC is that it breaks Job controller
([#28486][]).

We may move Pod GC and Event GC into garbage collector, or deprecate them and


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

+1 on deprecating and deleting them entirely in the long run (i.e., I don't agree with moving Pod GC and Event GC into the garbage collector). Ideally, we want to maintain as small a number of GC flavors as possible.

Because users cannot set or modify this field on a non-Pod resource (the field
will be silently dropped), it will be compatible to move it outside of
`.metadata` of resources other than Pod. This field can be part of a
Pod-specific `.metadata`, or we could even move this field to Pod spec in the future


@crimsonfaith91

crimsonfaith91 Jun 29, 2018

Contributor

.metadata attributes across various types of resources should be as similar as possible. Hence, I think it makes more sense to move the deletionGracePeriodSeconds field to the pod spec in the future.


@janetkuo

janetkuo Jul 2, 2018

Author Member

Yes, but it has to be v2 Pod (for compatibility).

@deads2k

Contributor

deads2k commented Jul 10, 2018

Inverting the direction of GC references means that inspection of a resource is insufficient to determine whether that resource can be deleted. Instead, the current state of every resource must be known in order to make that decision.

There is no reasonable timeout that will be sufficient. For instance, an aggregated API server can go down for an extended period of time, during which it would not be safe to GC any object. The direction of our current links ensures that such a situation does not freeze unaffected objects. The new design proposed here can't make a safe decision in those conditions.

I don't see why we would trade a design that is not vulnerable to such a problem (what we have today) for one where we do have that problem and have to build various and numerous compensation methods to avoid those repercussions.

Is there a separate proposal just for the TTL aspect? A TTL that respects finalizers is a thing I could see being generally useful.

@janetkuo

Member Author

janetkuo commented Jul 11, 2018

@deads2k thank you for the review!

The current owner-references GC is not enough for cleaning up unused objects. We could implement "clean up when unused" with owner references by treating each "user object" as an "owner", but that would require "GC when there are no owners/users", which is not currently supported. I also added a diagram (link) to compare owner references and usage references:
[diagram: owner references vs. usage references]

For instance, an aggregated API server can go down for an extended period of time, during which it would not be safe to GC any object.

Only objects that enable "clean up when unused" would be affected. If the GC controller keeps track of the user objects of a used object, the GC controller won't need to pause when there are no user objects behind the aggregated API server.

Is there a separate proposal just for the TTL aspect? A TTL that respects finalizers is a thing I could see being generally useful.

Yes, I split the KEP into 3 parts so that they can be discussed and implemented separately.

  1. The TTL design doc is here: https://goo.gl/9AUQ3n
  2. The design for GC after finished: https://goo.gl/ws9t3b - this is also useful and doesn't introduce new GC references.
  3. The design for GC after unused: https://goo.gl/KeHPiQ - this seems to be much more controversial

Would it help if I make "3" its own KEP and don't mention it in this KEP? Alternatively, we can also make this KEP a high level overview of the problems we'd like to solve, and leave the design details to the design docs I linked above.

ConfigMaps and Secrets by creating new ConfigMaps or Secrets and updating the
references to those ConfigMaps or Secrets in containers ([#22368][]). The old
ConfigMaps and Secrets, if not referenced by any container, are no longer
needed.


@mfojtik

mfojtik Jul 11, 2018

Contributor

ConfigMaps and Secrets can be consumed by means other than containers, right? You can have an operator that reads the content of a configMap or secret (it can be a CA or something)...


@mfojtik

mfojtik Jul 11, 2018

Contributor

(also you can reference configMap via downward api)


@janetkuo

janetkuo Jul 11, 2018

Author Member

Resource objects, including ConfigMaps/Secrets, can decide whether they want to use this feature (GC after unused) or not.

Referencing via the downward API is also covered by this proposal. The schema information can be encoded in the OpenAPI spec. If the operator uses an API object to read/reference a ConfigMap/Secret, the usage can be discovered via the OpenAPI spec. If the operator simply reads/gets the ConfigMap/Secret directly, I would imagine that such a ConfigMap/Secret won't be created in large numbers or be short-lived, but will be updated in place. In that case, it doesn't need GC after unused.


@mattfarina

mattfarina Nov 2, 2018

Member

@janetkuo Just to verify: GC of non-referenced ConfigMaps/Secrets that aren't attached to a container is opt-in rather than opt-out, right?


Another common use case is completed [Jobs][]. Users keep creating Jobs to run
some batch tasks, and those Jobs, once complete, are no longer needed in the
system.


@mfojtik

mfojtik Jul 11, 2018

Contributor

@soltysh might prove me wrong, but I thought there is something like revisionHistoryLimit for Jobs which will cause the old jobs to be deleted by the jobs controller.


@janetkuo

janetkuo Jul 11, 2018

Author Member

Here I'm talking about the Jobs that aren't managed by CronJobs. Some users create Jobs directly with their own tools.

@deads2k

Contributor

deads2k commented Jul 12, 2018

Would it help if I make "3" its own KEP and don't mention it in this KEP? Alternatively, we can also make this KEP a high level overview of the problems we'd like to solve, and leave the design details to the design docs I linked above

I think that 3 is the most controversial. I would suggest separate KEPs for separate features.

A generic TTL (1) that mirrors etcd TTL and respects finalizers is a thing that I can see being broadly useful. Without inverted GC links (3), I don't think I see why I'd tie it to GC, but an etcd-like TTL seems pretty useful.

Deleting finished resources (2) doesn't seem related to GC to me. Given an object, a separate controller could watch for finished things (I think I'd suggest using a condition) without any ownerRefs. Without inverted GC links (3), I don't think I see why I'd tie it to GC, but a "clean up completed things" mechanism may be useful. I suggest a separate KEP for this too. Removing resources without express user permission is a pretty aggressive thing to do so I don't expect controllers to set it.

@janetkuo

Member Author

janetkuo commented Jul 13, 2018

(1) and (2) tie to GC because an object is only safe to clean up after all GC mechanisms agree it's safe to clean up. If an object's owners still exist, it shouldn't be cleaned up even after it finishes and its TTL expires.

Removing resources without express user permission is a pretty aggressive thing to do so I don't expect controllers to set it.

Users can decide if they want to clean up a completed thing or not (opt-in/out in a field of .metadata). GC controller then provides a generic way to clean up finished resources with TTL and respects owner references and finalizers.

@deads2k

Contributor

deads2k commented Jul 13, 2018

(1) and (2) tie to GC because an object is only safe to clean up after all GC mechanisms agree it's safe to clean up. If an object's owners still exist, it shouldn't be cleaned up even after it finishes and its TTL expires.

A TTL is a general machinery item, but if a TTL is set, GC doesn't enter into it until the item is deleted. Same as an etcd TTL today, like the one we use for Events. The only reason it seemed related is that, if links are inverted, you need a mechanism to avoid "early" reaping. If you don't invert links, you don't have the problem.

Similarly for deleted resources: GC doesn't get involved until the item is actually deleted. Deciding to delete the item belongs to something else. The only reason it seemed related is that, if links are inverted, you need a mechanism to build the entire graph and chase the links. If you don't invert the links, you don't have the problem, because the single object is all the context you need.

@janetkuo

Member Author

janetkuo commented Jul 13, 2018

The use cases I want to address are cleaning up frequently created, short-lived resource objects in a generic way. That includes resources that have already finished (2) or are unused (3). Both use cases can benefit from TTL (1) to allow some delay before cleanup actually happens. We also want to avoid early reaping of finished resource objects; for example, after a Job completes, users need some time to collect results before it can be deleted. If GC doesn't respect owner references, it'd be confusing to users. Similarly, if a user declares a TTL, the GC controller shouldn't delete an object earlier, even when the object's owners are deleted. How might we support the aforementioned use cases?

@caesarxuchao
Member

caesarxuchao left a comment

I really like the TTL-based GC. I also like the "finish-execution" GC, but I'm not sure if we should reuse other GC fields or introduce a general .status.finished field; see question. I have a few concerns about the usage references; I'd like to discuss more alternatives.


@deads2k

Contributor

deads2k commented Jul 16, 2018

The use cases I want to address are cleaning up frequently created, short-lived resource objects in a generic way. That includes resources that have already finished (2) or are unused (3). Both use cases can benefit from TTL (1) to allow some delay before cleanup actually happens. We also want to avoid early reaping of finished resource objects; for example, after a Job completes, users need some time to collect results before it can be deleted. If GC doesn't respect owner references, it'd be confusing to users. Similarly, if a user declares a TTL, the GC controller shouldn't delete an object earlier, even when the object's owners are deleted. How might we support the aforementioned use cases?

You've described a new TTL controller which can delete a resource after a certain time, and what amounts to a graceful-deletion controller that gives a separate entity a deadline to do work after a deletion timestamp is set and removes a finalizer after that time. If you need a guarantee, not grace, you've described a new finalizer for your type.

Neither of those things involves GC, which serves the narrow purpose of traversing links to reap dead children.

@janetkuo

Member Author

janetkuo commented Jul 18, 2018

We have discussed this proposal offline. We can start by implementing (1) + (2) for Jobs only, and generalize it based on user feedback. I have drafted another proposal to add a TTL controller for Jobs: https://goo.gl/YxtxTi.
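The Job-scoped direction referenced above corresponds to what later shipped as the TTL-after-finished mechanism; a minimal sketch of a Job using its `.spec.ttlSecondsAfterFinished` field (the Job name and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-generator              # illustrative
spec:
  ttlSecondsAfterFinished: 3600       # delete the Job (and its Pods) 1h after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: report
        image: example/report:1.0     # placeholder image
        command: ["generate-report"]
```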

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jul 18, 2018

@janetkuo: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

For example, when a Deployment owns a ReplicaSet, that ReplicaSet's
`.metadata.ownerReferences` will include the Deployment. If the generic garbage
collector finds that the Deployment doesn't exist in the system anymore, and if
the Deployment is the ReplicaSet's only owner, it will delete the ReplicaSet.
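For readers less familiar with the existing mechanism, an owner reference on a ReplicaSet looks roughly like this (name and uid are placeholders); the garbage collector deletes the ReplicaSet once every listed owner is gone:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-6d4cf56db6                # illustrative
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: web
    uid: 3c1a7f2e-0000-0000-0000-000000000000   # placeholder UID
    controller: true                  # this owner is the managing controller
    blockOwnerDeletion: true          # foreground deletion waits for this dependent
```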


@krmayankk

krmayankk Aug 14, 2018

Contributor

What about the case when the selector is changed? Currently we have a case where, because of selector changes, all of our old Pods became orphaned. I have the following questions:

  • The DaemonSet's selector changed, which orphaned the old pods, and the DaemonSet created new pods. I saw that the ownerReferences of the old Pods no longer exist. My question is: who removed those ownerReferences?
  • Does your proposal cover deleting such orphans whose parents are gone because of a selector change?
    If we didn't remove the ownerReferences of these old Pods, but instead changed the uid of the owner when the owner's selector changes, we could achieve this. Thoughts @janetkuo ?


@janetkuo

janetkuo Aug 14, 2018

Author Member

Currently we have a case where, because of selector changes, all of our old Pods became orphaned.

That is exactly why we made selectors immutable in apps/v1. Changing the selector of a workload controller is dangerous, and this specific use case is not something we want to support.


@krmayankk

krmayankk Aug 15, 2018

Contributor

Thanks @janetkuo, I am fine with this. But why, when the selector changed, did the orphaned pods lose their ownerReferences?


@janetkuo

janetkuo Aug 15, 2018

Author Member

ControllerReferences (OwnerReferences with controller==true) of Pods are updated automatically by workload controllers, to make sure that no two controllers are fighting over the same group of Pods (think of it as a lock). When a controller no longer controls a group of Pods (the selector no longer matches), it removes itself from those Pods' ControllerReferences to release the lock.

Label selectors are used for identifying relationships between controllers and controllees. When a controller's selector is changed to something that doesn't match its controllees' labels, it has the same effect as orphaning the controllees by changing their labels, because the controller finds its controllees through label/selector matching.

If you really want to change the selector of a controller, the recommended way is to create a new one with the new selector / template labels, and delete the old one.


@krmayankk

krmayankk Aug 17, 2018

Contributor

Thanks @janetkuo for helping me understand. The intention is not to change the selectors; we changed it by accident, but we want to be able to properly reason about its behavior. How does the controller know that a pod's label no longer matches its selector? The controller can know which pods match its label, and that everything else does not match its label. So does the controller actually go and look at all pods that didn't match its selector, check whether they have an owner reference pointing to it, and then remove itself from those pods? Isn't that too costly? Is the selector immutable in apps/v1 for all workload controllers? Do you see any issues with GC always deleting a pod which doesn't have any ownerReferences, or whose owner doesn't exist? Why does it have to be triggered by the owner's deletion? Sorry, lots of questions :-)


@janetkuo

janetkuo Aug 17, 2018

Author Member

So does the controller actually go and look at all pods that didn't match its selector, check whether they have an owner reference pointing to it, and then remove itself from those pods?

This is correct.

Isn’t that too costly?

Most controllers use informer framework, so instead of listing all controllees (e.g. pods), the controller watches them and acts upon their changes. So it's not too costly.

Do you see any issues with GC always deleting a pod which doesn't have any ownerReferences, or whose owner doesn't exist? Why does it have to be triggered by the owner's deletion?

GC doesn't delete things that don't have ownerRefs. It deletes things whose owners are all gone. Think of ownerRefs as a signal for cascading deletion: the owner wants its dependents to be gone with it; if there is more than one owner, GC should only happen after all owners are deleted. Note that if an item is removed from a dependent's ownerRefs list, that item is no longer considered an owner of the dependent.

@kargakis

Member

kargakis commented Aug 14, 2018

@justaugustus

Member

justaugustus commented Oct 13, 2018

/kind kep

@mattfarina

Member

mattfarina commented Nov 2, 2018

/sig apps

Since it's listed as a participating sig

@justaugustus

Member

justaugustus commented Nov 20, 2018

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus

Member

justaugustus commented Dec 1, 2018

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 1, 2018

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
