
Add MaxCheckpointsPerContainer to the kubelet #115888

Conversation

@adrianreber (Contributor) commented Feb 19, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This adds the configuration option "MaxCheckpointsPerContainer" to the kubelet. The goal of this change is to provide a mechanism, in combination with container checkpointing, that prevents a large number of checkpoints of a container from filling up all available disk space.

"MaxCheckpointsPerContainer" defaults to 10 and this means that once 10 checkpoints of a certain container have been created the oldest existing container checkpoint archive will be removed from disk. This way only the defined number of checkpoints is kept on disk.

This also moves the location of the checkpoint archives from /var/lib/kubelet/checkpoints to /var/lib/kubelet/checkpoints/POD-ID/.

The main reason for this move is to avoid ambiguity between checkpoint archives of different pods whose namespace, pod name, and container name combinations could otherwise collide.

This also changes the time stamp encoded into the file name from RFC3339 to UnixNano(). The reason for this change is that there were questions about whether the ':' characters in an RFC3339-formatted file name would be problematic on Windows.

This also introduces a counter after the time stamp in the file name to ensure that each checkpoint archive has a unique file name.

As this is still an Alpha feature, it should be acceptable to change the location of the checkpoint archives.
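
For illustration only, a hedged sketch of how such an archive path could be assembled; the exact prefix, extension, separators, and padding widths here are assumptions, not necessarily the PR's literal format:

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// checkpointArchivePath builds a per-pod archive path whose name embeds a
// zero-padded UnixNano time stamp plus a zero-padded node-wide counter, so
// names are unique and lexicographic order equals chronological order.
func checkpointArchivePath(podUID, podFullName, containerName string, counter int32) string {
	name := fmt.Sprintf("checkpoint-%s-%s-%020d-%010d.tar",
		podFullName,
		containerName,
		time.Now().UnixNano(), // integer time stamp instead of RFC3339
		counter,               // disambiguates identical time stamps
	)
	return filepath.Join("/var/lib/kubelet/checkpoints", podUID, name)
}

func main() {
	fmt.Println(checkpointArchivePath("example-pod-uid", "mypod_default", "mycontainer", 1))
}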

Does this PR introduce a user-facing change?

Checkpoint archives created by the Alpha feature `ContainerCheckpoint` are now located in `/var/lib/kubelet/checkpoints/POD-ID/` instead of `/var/lib/kubelet/checkpoints`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/2008
- [Other doc]: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

@k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2023
@k8s-ci-robot (Contributor)

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 19, 2023
@k8s-ci-robot added area/code-generation area/kubelet area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 19, 2023
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@bart0sh added this to Triage in SIG Node PR Triage Feb 20, 2023
@saschagrunert (Member) left a comment:

/ok-to-test

pkg/kubelet/kubelet.go: four outdated review threads, resolved
@k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 21, 2023
@cici37 (Contributor) commented Feb 21, 2023

/remove-sig api-machinery

@k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 21, 2023
@bart0sh (Contributor) commented Feb 23, 2023

@adrianreber please fix CI failures, thanks.

@bart0sh moved this from Triage to Waiting on Author in SIG Node PR Triage Feb 24, 2023
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from fddbb33 to 58e029e on February 27, 2023 15:49
@adrianreber (Contributor, Author)

/test pull-kubernetes-conformance-kind-ipv6-parallel

@adrianreber (Contributor, Author)

/retest-required

@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch 5 times, most recently from d76d6ad to 32fb688, on September 17, 2023 16:20
@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind

@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

2 similar comments
@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

@mrunalp (Contributor) commented Sep 18, 2023

@rphillips @mikebrow ptal.

@rphillips (Member)

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 92f3e5c378642a22b3b5f4456dd89b6bb8cb0ccd

@mikebrow (Member) left a comment:

See comments..

A general comment: This is for the forensics use case... but I'm having a hard time mapping the KEP to this change. It reads more like a checkpoint manager enhancement, where one can create some number of checkpoints at a given rate, like backups, and the kubelet will manage garbage collection. I was thinking we'd open the KEP up for the additional case(s) and add a checkpoint manager. Thoughts? Maybe draft up a use case description for this enhancement in the context of forensic debugging..


// MaxCheckpointsPerContainer specifies the maximum number of checkpoints
// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
Member:

Suggested change
- // checkpoints. This option exist to ensure the local disk is not filled
+ // checkpoints. This option exists to ensure the local disk is not filled

Member:

but it will fill it.. 2147483647 checkpoints probably isn't the right maximum limit to the number of checkpoints per container per pod..

Contributor Author:

Are you saying a hardcoded upper limit should exist? What should it be?

Member:

I was thinking a few, or two, as default, depending on how the user/client is using these. For forensics I'm not sure why it's more than 1 at a time, unless you envision a diff tool to compare a success case vs. a failure case, which would be 2 or 3, where 3 could allow 3-way diff cases? Still.. the idea of creating a managed set.. implies the discussion of a manager for checkpoints.

// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
Member:

size or total % of disk use, or a different drive, or some other reasonable mechanism for lowering impact.. perhaps a default of 1 or 2? with the idea the user would offload one before adding another..

// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
Member:

put another way.. why was 10 selected and not 1?

podFullName,
containerName,
time.Now().Format(time.RFC3339),
Member:

I was thinking for the forensic case the caller would be responsible for the clean up if needed (probably with a listener) and the use would be rare.

Contributor Author:

> I was thinking for the forensic case

Even if the story around checkpoint/restore is the forensic case, this does not mean that is what people use it for. There are many use cases, and the forensic use case is just one.

> the caller would be responsible for the clean up if needed (probably with a listener) and the use would be rare.

This is a strange argument from my point of view, that the user is responsible for the clean up. At this point it feels like the complete PR is being questioned, and this argument comes very late in the lifetime of this PR. This PR has been completely reworked multiple times since the first posting 7 months ago. This is a feature people have been asking for, because they see it as a problem and do not want to clean up manually. It is one of the first things people ask for during conference talks.

Not sure what the goal of this comment is, sorry.

@mikebrow (Member) commented Sep 19, 2023:

The reason for limiting to the forensics case was probably to avoid slow-walking the managed checkpoint use cases, until there was a WG or whatnot put in place to design how Kubernetes would manage the checkpoints for the other cases people would use it for.

At this point the user requested the checkpoint, vs. the kubelet creating the checkpoint based on a pod policy / contract. Asking the kubelet to garbage collect that which it did not ask to be created implies the kubelet knows why the containers are being checkpointed for these pods.

If the rule is to keep 10, which 10: the last ten or the first ten? What if the last 10 are all created during failure modes? If this is for rolling backups, you may want one per month and one per week, going back over the last month to drop the per-week checkpoints from that month, etc.

Member:

.. today was the first day I heard about this PR... Apologies for the frustration. I agree 100%: overuse of the checkpointing endpoint without having a management design is a problem. I didn't expect overuse to be a problem for the sig-node-approved forensics use case. That's all I mean. For logs we've used similar designs to this one, in the kubelet, to employ rolling log models for long-lived containers. So I'm trying to understand this rolling checkpoint idea.. to map it to a use case. If someone is wanting to do rolling backups, ok.. I would get the use case, but even in that case I would want to have a discussion about how we do it. Checkpoint 1 + delta + delta, for example, would be orders of magnitude better from a resource consumption perspective.

// name of the checkpoint archive is unique. It already contains a time
// stamp but in case the clock resolution is not good enough this counter
// make it unique.
maxCheckpointsPerContainerCounter int32
Member:

from this I take it that you want to use the counter as a node level index?

Contributor Author:

Yes.

Member:

suggest maxCheckpointOnNodeCounter or something similar..

Contributor Author:

Included OnNode in the variable name.

Member:

thx..

podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
Member:

why UnixNano? going from a human-readable format for the forensic case to a large number...

Contributor Author:

I thought I submitted this comment two days ago. Trying once more:

This PR has gone through multiple iterations. The first iteration used the existing file name with the human-readable time stamp, and the number of checkpoints was tracked in a JSON file. That was then changed to work without a JSON file, using stat(). The time resolution of stat() across all file systems was questioned, so the goal became to use a regex and file-system-level sorting. The problem with the human-readable file name was that it is really hard to capture via regex if time zones are used in the file name. In combination with pod-name/container-name collisions during regex or file-system-level sorting, this led to the current implementation using an integer time stamp and a counter.

As the second resolution is too coarse to be unique (without the counter), I switched to something else. I thought about micro- or nanoseconds, and settled on nanoseconds because they will be just as unreadable as microseconds.

We have written the tool checkpointctl, which will display all information, including the time created and checkpointed, so if wanted this can be displayed with the help of an additional tool.

The current approach gives us file-system-based sorting by using unique names that combine the podUID, nanoseconds, and a node-wide counter in the file name. No external JSON file is needed anymore, and there is no need to read the result of stat(), which might be too coarse depending on the file system used.

With the old implementation it would have been possible to create two checkpoints of different pods/containers at the same time, and one would replace the other. The current implementation does not have this problem and is easier to sort.
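
A small self-contained demonstration of the sorting argument (illustrative values only, not kubelet code): RFC3339 names sort chronologically only when every stamp uses the same UTC offset, while zero-padded UnixNano stamps always do:

package main

import (
	"fmt"
	"sort"
	"time"
)

func main() {
	// a is chronologically earlier (06:00 UTC) than b (07:00 UTC) ...
	a, _ := time.Parse(time.RFC3339, "2023-09-19T08:00:00+02:00")
	b, _ := time.Parse(time.RFC3339, "2023-09-19T07:00:00Z")

	rfc := []string{a.Format(time.RFC3339), b.Format(time.RFC3339)}
	sort.Strings(rfc)
	fmt.Println(rfc)         // ... yet b's string sorts first: wrong order
	fmt.Println(a.Before(b)) // true

	// Zero-padded integer stamps always sort chronologically.
	nano := []string{
		fmt.Sprintf("%020d", b.UnixNano()),
		fmt.Sprintf("%020d", a.UnixNano()),
	}
	sort.Strings(nano)
	fmt.Println(nano) // a's smaller, earlier stamp sorts first: correct
}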

Member:

looking it up... the dreaded Windows aversion to the colon, due to their drive-letter ':' format..

sounds like you had a lot of reasons..

podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
int(math.Log10(float64(math.MaxInt32)))+1,
@mikebrow (Member) commented Sep 18, 2023:

? please add a comment explaining this ..

the width of the format will be 10 every time, right?

Contributor Author:

Yes, for sorting the counter should always include all leading zeroes. I can change this to a hardcoded 10; I just didn't want to count it manually and instead let the computer do the work.
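
For reference, a tiny runnable check of the expression under discussion (plain arithmetic, not kubelet code):

package main

import (
	"fmt"
	"math"
)

func main() {
	// math.MaxInt32 is 2147483647, so log10 is about 9.33 and the
	// computed field width is always 10 for an int32 counter.
	width := int(math.Log10(float64(math.MaxInt32))) + 1
	fmt.Println(width) // 10

	// Zero-padding to that width keeps lexicographic order equal to
	// numeric order for every possible int32 counter value.
	fmt.Printf("%0*d\n", width, 42) // 0000000042
}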

Member:

kk.. :-) leading zeros for the nanos too? though I think we're better off with the human-readable timestamp.

Contributor Author:

I hardcoded the number of leading zeros for both fields.

Member:

thx


checkpointDirectoryPath := filepath.Join(
kl.getCheckpointsDir(),
string(podUID),
@mikebrow (Member) commented Sep 18, 2023:

I believe podUID can be reused here.. which could create a reuse issue ..

Contributor Author:

Only if the namespace, pod, and container have the same name, or in one of the collisions described above. Thanks to the timestamp and the counter it should not result in checkpoints being overwritten, but it may mean that older checkpoints from a podUID collision are removed first. Not sure if that is a problem or not.

Member:

yeah, it was more a tree/identifier issue... here with the podUID, one will need to tree | grep to find their checkpoints by name; here you can also identify by podUID.. if that was necessary and missing, we could add the podUID to the original name as an alternative

Contributor Author:

The name of the checkpoint is returned when triggering the kubelet checkpoint API endpoint. I am also working on code to see how it could look if the kubelet checkpoint API endpoint were available at the kubectl level.

What I am trying to say is that searching for a checkpoint archive is not something I would expect to happen a lot, so the path is not super important. Whatever makes the most sense works for me. The main goal is to avoid any ambiguity in the file name. The podUID was suggested somewhere along the review of this PR and makes sense to me.

There was also agreement that we can change the location as long as this is still marked as an Alpha feature.

As mentioned in another comment, using https://github.com/checkpoint-restore/checkpointctl is something I would recommend to get all details from the checkpoint archive and rely less on encoding information into the file name.

Member:

nod.. typically, in this space, we would create references to image files like these in a meta DB, and the images/layers would be stored by SHA. This change looks/feels like we're inching into managed-checkpoint use case scenarios. I would feel more comfortable about this change if we had a KEP open for r.next checkpointing / a WG charting out the rough direction this is going to take, so we could map this change to that direction.

The pain point this PR is addressing is overuse of the drive.. which should not be happening in the approved forensics cases? IMO there are other designs that would more appropriately address the desired feature(s).

At minimum, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit, and get a general consensus on whether that is acceptable in the interim before the next checkpoint KEP update.

Contributor Author:

> The pain point this PR is addressing is overuse of the drive.. which should not be happening in the approved forensics cases?

It feels like this is one of our main discussion points. I agree that it "should not be happening", but at the same time it is something people are actively talking about. From my point of view it feels unrealistic to expect that people always clean up unused files. That is why I want to help by cleaning them up automatically.

> At minimum, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit, and get a general consensus on whether that is acceptable in the interim before the next checkpoint KEP update.

Okay. I will add it to the sig-node agenda.

Member:

I think if we want to introduce this cleanup functionality we need to add it to the existing KEP as an additional feature or start a new one. As discussed at the SIG Node meeting today, once we have this setting we can start exploring integrating it into the GC and eviction logic, having per-pod policies, etc. This all creates an unnecessary burden on the kubelet.

pkg/kubelet/kubelet.go: outdated review thread, resolved
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from 32fb688 to 6f93687 on September 19, 2023 11:34
@k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2023
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@adrianreber (Contributor, Author)

/retest-required

This adds the configuration option "MaxCheckpointsPerContainer" to the
kubelet. The goal of this change is to provide a mechanism, in
combination with container checkpointing, that prevents a large number
of checkpoints of a container from filling up all available disk space.

"MaxCheckpointsPerContainer" defaults to 10, which means that once 10
checkpoints of a given container have been created, the oldest existing
checkpoint archive is removed from disk. This way only the configured
number of checkpoints is kept on disk.

This also moves the location of the checkpoint archives from
/var/lib/kubelet/checkpoints to
/var/lib/kubelet/checkpoints/POD-ID/

The main reason for this move is to avoid ambiguity between checkpoint
archives of different pods whose namespace, pod name, and container
name combinations could otherwise collide.

This also changes the time stamp encoded into the file name from
RFC3339 to UnixNano(). The reason for this change is that there were
questions about whether the ':' characters in an RFC3339-formatted
file name would be problematic on Windows.

This also introduces a counter after the time stamp in the file name
to ensure that each checkpoint archive has a unique file name.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from 6f93687 to e89d14d on September 20, 2023 12:15
@SergeyKanzhelev (Member)

I feel like to a big extent this functionality is suggested for the kubelet because the kubelet is guaranteed to be present on the node. If we only look at checkpoints as a forensic mechanism today, this functionality does not belong in the kubelet. Maybe a better place would be some separate agent. For example, we may consider updating the NPD to detect and resolve local problems, so NPD could do checkpoint file rotation. Or kubelet health checks. Or similar.

SIG Node PR Triage automation moved this from Needs Approver to Done Sep 27, 2023