
Add MaxCheckpointsPerContainer to the kubelet #115888

Conversation

@adrianreber (Contributor) commented Feb 19, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This adds the configuration option "MaxCheckpointsPerContainer" to the kubelet. The goal of this change is to provide a mechanism, in combination with container checkpointing, that prevents a large number of checkpoints of a container from filling up all available disk space.

"MaxCheckpointsPerContainer" defaults to 10 and this means that once 10 checkpoints of a certain container have been created the oldest existing container checkpoint archive will be removed from disk. This way only the defined number of checkpoints is kept on disk.

This also moves the location of the checkpoint archives from /var/lib/kubelet/checkpoints to /var/lib/kubelet/checkpoints/POD-ID/.

The main reason for this move is to avoid ambiguity between checkpoint archives of different pods whose namespace, pod name, and container name combinations could otherwise collide.

This also changes the time stamp encoded into the file name from RFC3339 to UnixNano(). The reason for this change is that there were questions about whether the ':' characters in an RFC3339-formatted file name would be problematic on Windows.

This also introduces a counter after the time stamp in the file name to ensure that each checkpoint archive has a unique file name.

As this is still an Alpha feature, it should be acceptable to change the location of the checkpoint archives.
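
For illustration only, a hedged sketch of how such an archive path could be assembled; the exact prefix, extension, separators, and padding widths here are assumptions, not necessarily the PR's literal format:

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// checkpointArchivePath builds a per-pod archive path whose name embeds a
// zero-padded UnixNano time stamp plus a zero-padded node-wide counter, so
// names are unique and lexicographic order equals chronological order.
func checkpointArchivePath(podUID, podFullName, containerName string, counter int32) string {
	name := fmt.Sprintf("checkpoint-%s-%s-%020d-%010d.tar",
		podFullName,
		containerName,
		time.Now().UnixNano(), // integer time stamp instead of RFC3339
		counter,               // disambiguates identical time stamps
	)
	return filepath.Join("/var/lib/kubelet/checkpoints", podUID, name)
}

func main() {
	fmt.Println(checkpointArchivePath("example-pod-uid", "mypod_default", "mycontainer", 1))
}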

Does this PR introduce a user-facing change?

Checkpoint archives created by the Alpha feature `ContainerCheckpoint` are now located in `/var/lib/kubelet/checkpoints/POD-ID/` instead of `/var/lib/kubelet/checkpoints`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/2008
- [Other doc]: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

@k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2023
@k8s-ci-robot (Contributor)

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 19, 2023
@k8s-ci-robot added area/code-generation area/kubelet area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 19, 2023
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@bart0sh added this to Triage in SIG Node PR Triage Feb 20, 2023
@saschagrunert (Member) left a comment:

/ok-to-test

pkg/kubelet/kubelet.go: four outdated review threads, resolved
@k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 21, 2023
@cici37 (Contributor) commented Feb 21, 2023

/remove-sig api-machinery

@k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 21, 2023
@bart0sh (Contributor) commented Feb 23, 2023

@adrianreber please fix CI failures, thanks.

@bart0sh moved this from Triage to Waiting on Author in SIG Node PR Triage Feb 24, 2023
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from fddbb33 to 58e029e on February 27, 2023 15:49
@adrianreber (Contributor, Author)

/test pull-kubernetes-conformance-kind-ipv6-parallel

@adrianreber (Contributor, Author)

/retest-required

@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch 5 times, most recently from d76d6ad to 32fb688, on September 17, 2023 16:20
@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind

@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

2 similar comments
@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

@adrianreber (Contributor, Author)

/test pull-kubernetes-e2e-kind-ipv6

@mrunalp (Contributor) commented Sep 18, 2023

@rphillips @mikebrow ptal.

@rphillips (Member)

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 92f3e5c378642a22b3b5f4456dd89b6bb8cb0ccd

@mikebrow (Member) left a comment:

See comments..

A general comment: This is for the forensics use case... but I'm having a hard time mapping the KEP to this change. It reads more like a checkpoint manager enhancement, where one can create some number of checkpoints at a given rate, like backups, and the kubelet will manage garbage collection. I was thinking we'd open the KEP up for the additional case(s) and add a checkpoint manager. Thoughts? Maybe draft up a use case description for this enhancement in the context of forensic debugging..


// MaxCheckpointsPerContainer specifies the maximum number of checkpoints
// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
Member:

Suggested change
- // checkpoints. This option exist to ensure the local disk is not filled
+ // checkpoints. This option exists to ensure the local disk is not filled

Member:

but it will fill it.. 2147483647 checkpoints probably isn't the right maximum limit to the number of checkpoints per container per pod..

Contributor Author:

Are you saying a hardcoded upper limit should exist? What should it be?

Member:

I was thinking a few, or two, as default, depending on how the user/client is using these. For forensics I'm not sure why it's more than 1 at a time, unless you envision a diff tool to compare a success case vs. a failure case, which would be 2 or 3, where 3 could allow 3-way diff cases? Still.. the idea of creating a managed set.. implies the discussion of a manager for checkpoints.

// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
Member:

size or total % of disk use, or a different drive, or some other reasonable mechanism for lowering impact.. perhaps a default of 1 or 2? with the idea the user would offload one before adding another..

// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
Member:

put another way.. why was 10 selected and not 1?

podFullName,
containerName,
time.Now().Format(time.RFC3339),
Member:

I was thinking for the forensic case the caller would be responsible for the clean up if needed (probably with a listener) and the use would be rare.

Contributor Author:

> I was thinking for the forensic case

Even if the story around checkpoint/restore is the forensic case, this does not mean that is what people use it for. There are many use cases, and the forensic use case is just one.

> the caller would be responsible for the clean up if needed (probably with a listener) and the use would be rare.

This is a strange argument from my point of view, that the user is responsible for the clean up. At this point it feels like the complete PR is being questioned, and this argument comes very late in the lifetime of this PR. This PR has been completely reworked multiple times since the first posting 7 months ago. This is a feature people have been asking for, because they see it as a problem and do not want to clean up manually. It is one of the first things people ask for during conference talks.

Not sure what the goal of this comment is, sorry.

@mikebrow (Member) commented Sep 19, 2023:

The reason for limiting to the forensics case was probably to avoid slow-walking the managed checkpoint use cases, until there was a WG or whatnot put in place to design how Kubernetes would manage the checkpoints for the other cases people would use it for.

At this point the user requested the checkpoint, vs. the kubelet creating the checkpoint based on a pod policy / contract. Asking the kubelet to garbage collect that which it did not ask to be created implies the kubelet knows why the containers are being checkpointed for these pods.

If the rule is to keep 10, which 10: the last ten or the first ten? What if the last 10 are all created during failure modes? If this is for rolling backups, you may want one per month and one per week, going back over the last month to drop the per-week checkpoints from that month, etc.

Member:

.. today was the first day I heard about this PR... Apologies for the frustration. I agree 100%: overuse of the checkpointing endpoint without having a management design is a problem. I didn't expect overuse to be a problem for the sig-node-approved forensics use case. That's all I mean. For logs we've used similar designs to this one, in the kubelet, to employ rolling log models for long-lived containers. So I'm trying to understand this rolling checkpoint idea.. to map it to a use case. If someone is wanting to do rolling backups, ok.. I would get the use case, but even in that case I would want to have a discussion about how we do it. Checkpoint 1 + delta + delta, for example, would be orders of magnitude better from a resource consumption perspective.

// name of the checkpoint archive is unique. It already contains a time
// stamp but in case the clock resolution is not good enough this counter
// make it unique.
maxCheckpointsPerContainerCounter int32
Member:

from this I take it that you want to use the counter as a node level index?

Contributor Author:

Yes.

Member:

suggest maxCheckpointOnNodeCounter or something similar..

Contributor Author:

Included OnNode in the variable name.

Member:

thx..

podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
Member:

why UnixNano? going from a human-readable format for the forensic case to a large number...

Contributor Author:

I thought I submitted this comment two days ago. Trying once more:

This PR has gone through multiple iterations. The first iteration used the existing file name with the human-readable time stamp, and the number of checkpoints was tracked in a JSON file. That was then changed to work without a JSON file, using stat(). The time resolution of stat() across all file systems was questioned, so the goal became to use a regex and file-system-level sorting. The problem with the human-readable file name was that it is really hard to capture via regex if time zones are used in the file name. In combination with pod-name/container-name collisions during regex or file-system-level sorting, this led to the current implementation using an integer time stamp and a counter.

As the second resolution is too coarse to be unique (without the counter), I switched to something else. I thought about micro- or nanoseconds, and settled on nanoseconds because they will be just as unreadable as microseconds.

We have written the tool checkpointctl, which will display all information, including the time created and checkpointed, so if wanted this can be displayed with the help of an additional tool.

The current approach gives us file-system-based sorting by using unique names that combine the podUID, nanoseconds, and a node-wide counter in the file name. No external JSON file is needed anymore, and there is no need to read the result of stat(), which might be too coarse depending on the file system used.

With the old implementation it would have been possible to create two checkpoints of different pods/containers at the same time, and one would replace the other. The current implementation does not have this problem and is easier to sort.
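
A small self-contained demonstration of the sorting argument (illustrative values only, not kubelet code): RFC3339 names sort chronologically only when every stamp uses the same UTC offset, while zero-padded UnixNano stamps always do:

package main

import (
	"fmt"
	"sort"
	"time"
)

func main() {
	// a is chronologically earlier (06:00 UTC) than b (07:00 UTC) ...
	a, _ := time.Parse(time.RFC3339, "2023-09-19T08:00:00+02:00")
	b, _ := time.Parse(time.RFC3339, "2023-09-19T07:00:00Z")

	rfc := []string{a.Format(time.RFC3339), b.Format(time.RFC3339)}
	sort.Strings(rfc)
	fmt.Println(rfc)         // ... yet b's string sorts first: wrong order
	fmt.Println(a.Before(b)) // true

	// Zero-padded integer stamps always sort chronologically.
	nano := []string{
		fmt.Sprintf("%020d", b.UnixNano()),
		fmt.Sprintf("%020d", a.UnixNano()),
	}
	sort.Strings(nano)
	fmt.Println(nano) // a's smaller, earlier stamp sorts first: correct
}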

Member:

looking it up... the dreaded Windows aversion to the colon, due to their drive-letter ':' format..

sounds like you had a lot of reasons..

podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
int(math.Log10(float64(math.MaxInt32)))+1,
@mikebrow (Member) commented Sep 18, 2023:

? please add a comment explaining this ..

the width of the format will be 10 every time, right?

Contributor Author:

Yes, for sorting the counter should always include all leading zeroes. I can change this to a hardcoded 10; I just didn't want to count it manually and instead let the computer do the work.
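
For reference, a tiny runnable check of the expression under discussion (plain arithmetic, not kubelet code):

package main

import (
	"fmt"
	"math"
)

func main() {
	// math.MaxInt32 is 2147483647, so log10 is about 9.33 and the
	// computed field width is always 10 for an int32 counter.
	width := int(math.Log10(float64(math.MaxInt32))) + 1
	fmt.Println(width) // 10

	// Zero-padding to that width keeps lexicographic order equal to
	// numeric order for every possible int32 counter value.
	fmt.Printf("%0*d\n", width, 42) // 0000000042
}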

Member:

kk.. :-) leading zeros for the nanos too? though I think we're better off with the human-readable timestamp.

Contributor Author:

I hardcoded the number of leading zeros for both fields.

Member:

thx


checkpointDirectoryPath := filepath.Join(
kl.getCheckpointsDir(),
string(podUID),
@mikebrow (Member) commented Sep 18, 2023:

I believe podUID can be reused here.. which could create a reuse issue ..

Contributor Author:

Only if the namespace, pod, and container have the same name, or in one of the collisions described above. Thanks to the timestamp and the counter it should not result in checkpoints being overwritten, but it may mean that older checkpoints from a podUID collision are removed first. Not sure if that is a problem or not.

Member:

yeah, it was more a tree/identifier issue... here with the podUID, one will need to tree | grep to find their checkpoints by name; here you can also identify by podUID.. if that was necessary and missing, we could add the podUID to the original name as an alternative

Contributor Author:

The name of the checkpoint is returned when triggering the kubelet checkpoint API endpoint. I am also working on code to see how it could look if the kubelet checkpoint API endpoint were available at the kubectl level.

What I am trying to say is that searching for a checkpoint archive is not something I would expect to happen a lot, so the path is not super important. Whatever makes the most sense works for me. The main goal is to avoid any ambiguity in the file name. The podUID was suggested somewhere along the review of this PR and makes sense to me.

There was also agreement that we can change the location as long as this is still marked as an Alpha feature.

As mentioned in another comment, using https://github.com/checkpoint-restore/checkpointctl is something I would recommend to get all details from the checkpoint archive and rely less on encoding information into the file name.

Member:

nod.. typically, in this space, we would create references to image files like these in a meta DB, and the images/layers would be stored by SHA. This change looks/feels like we're inching into managed-checkpoint use case scenarios. I would feel more comfortable about this change if we had a KEP open for r.next checkpointing / a WG charting out the rough direction this is going to take, so we could map this change to that direction.

The pain point this PR is addressing is overuse of the drive.. which should not be happening in the approved forensics cases? IMO there are other designs that would more appropriately address the desired feature(s).

At minimum, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit, and get a general consensus on whether that is acceptable in the interim before the next checkpoint KEP update.

Contributor Author:

> The pain point this PR is addressing is overuse of the drive.. which should not be happening in the approved forensics cases?

It feels like this is one of our main discussion points. I agree that it "should not be happening", but at the same time it is something people are actively talking about. From my point of view it feels unrealistic to expect that people always clean up unused files. That is why I want to help by cleaning them up automatically.

> At minimum, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit, and get a general consensus on whether that is acceptable in the interim before the next checkpoint KEP update.

Okay. I will add it to the sig-node agenda.

Member:

I think if we want to introduce this cleanup functionality we need to add it to the existing KEP as an additional feature or start a new one. As discussed at the SIG Node meeting today, once we have this setting we can start exploring integrating it into the GC and eviction logic, having per-pod policies, etc. This all creates an unnecessary burden on the kubelet.

pkg/kubelet/kubelet.go: outdated review thread, resolved
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from 32fb688 to 6f93687 on September 19, 2023 11:34
@k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2023
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@adrianreber (Contributor, Author)

/retest-required

This adds the configuration option "MaxCheckpointsPerContainer" to the
kubelet. The goal of this change is to provide a mechanism, in
combination with container checkpointing, that prevents a large number
of checkpoints of a container from filling up all available disk space.

"MaxCheckpointsPerContainer" defaults to 10, which means that once 10
checkpoints of a given container have been created, the oldest existing
checkpoint archive is removed from disk. This way only the configured
number of checkpoints is kept on disk.

This also moves the location of the checkpoint archives from
/var/lib/kubelet/checkpoints to
/var/lib/kubelet/checkpoints/POD-ID/

The main reason for this move is to avoid ambiguity between checkpoint
archives of different pods whose namespace, pod name, and container
name combinations could otherwise collide.

This also changes the time stamp encoded into the file name from
RFC3339 to UnixNano(). The reason for this change is that there were
questions about whether the ':' characters in an RFC3339-formatted
file name would be problematic on Windows.

This also introduces a counter after the time stamp in the file name
to ensure that each checkpoint archive has a unique file name.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber force-pushed the 2023-02-19-max-container-checkpoints branch from 6f93687 to e89d14d on September 20, 2023 12:15
@SergeyKanzhelev (Member)

I feel like to a big extent this functionality is suggested for the kubelet because the kubelet is guaranteed to be present on the node. If we only look at checkpoints as a forensic mechanism today, this functionality does not belong in the kubelet. Maybe a better place would be some separate agent. For example, we may consider updating the NPD to detect and resolve local problems, so NPD could do checkpoint file rotation. Or kubelet health checks. Or similar.

SIG Node PR Triage automation moved this from Needs Approver to Done Sep 27, 2023