Conversation

@Tal-or
Contributor

@Tal-or Tal-or commented Sep 2, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

For a resource within a group, such as memory,
we should validate the total Free and total Reserved size of the expected machineState against the state restored from the checkpoint file after kubelet starts.
If the total Free and total Reserved are equal, the restored state is valid.

The old comparison, however, was done by reflection, which requires the per-NUMA-node values to match exactly.

There are cases where the overall memory accounting is equal
but the allocations across the NUMA nodes differ.

In such cases we still need to consider the states equal.
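
To illustrate the difference between the two comparisons, here is a minimal sketch. It does not reproduce the kubelet code: `memoryTable`, `groupState`, `totals`, `equalByTotals`, and `equalByReflection` are hypothetical names, and the types are deliberately simplified to just the Free and Reserved counters of each NUMA node in one group.

```go
package main

import "reflect"

// memoryTable is a simplified, hypothetical stand-in for the accounting of
// one resource (for example "memory") on one NUMA node. Sizes are in bytes.
type memoryTable struct {
	Free     uint64
	Reserved uint64
}

// groupState maps a NUMA node ID to its accounting. All nodes in the map are
// assumed to belong to the same group.
type groupState map[int]memoryTable

// totals sums Free and Reserved over every node in the group.
func totals(s groupState) (free, reserved uint64) {
	for _, t := range s {
		free += t.Free
		reserved += t.Reserved
	}
	return free, reserved
}

// equalByTotals reports whether two group states account for the same total
// Free and total Reserved memory, ignoring how it is spread across nodes.
func equalByTotals(expected, restored groupState) bool {
	expFree, expReserved := totals(expected)
	resFree, resReserved := totals(restored)
	return expFree == resFree && expReserved == resReserved
}

// equalByReflection is the strict structural comparison: every node must
// carry exactly the same Free and Reserved values.
func equalByReflection(expected, restored groupState) bool {
	return reflect.DeepEqual(expected, restored)
}
```

With two nodes where the expected state is 3 GiB reserved on node 0 and 1 GiB on node 1, and the restored state is 2 GiB reserved on each, `equalByReflection` returns false while `equalByTotals` returns true, because both states reserve 4 GiB in total.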

Which issue(s) this PR fixes:

Fixes #113130

Special notes for your reviewer:

This PR serves as a replacement for #114501, which has been inactive for a long time, and is needed for MemoryManager GA graduation.
see: kubernetes/enhancements#1769 (comment)

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#design-overview

@k8s-ci-robot k8s-ci-robot added the following labels on Sep 2, 2024: release-note-none (Denotes a PR that doesn't merit a release note.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.), kind/bug (Categorizes issue or PR as related to a bug.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.), needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
@k8s-ci-robot
Contributor

Hi @Tal-or. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 2, 2024
@k8s-ci-robot k8s-ci-robot added the area/kubelet and sig/node (Categorizes an issue or PR as relevant to SIG Node.) labels and removed the do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.) label on Sep 2, 2024
@Tal-or Tal-or force-pushed the mm_fix_checkpoint_file_comparison branch from 6f18f11 to fddd074 Compare September 2, 2024 14:25
@Tal-or
Contributor Author

Tal-or commented Sep 2, 2024

/cc @ffromani @bart0sh

@ffromani
Contributor

ffromani commented Sep 2, 2024

/ok-to-test
/priority important-longterm
/triage accepted

@k8s-ci-robot k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.), priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.), and triage/accepted (Indicates an issue or PR is ready to be actively worked on.) labels, and removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.), needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.), and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Sep 2, 2024
@Tal-or Tal-or mentioned this pull request Sep 3, 2024
23 tasks
@Tal-or Tal-or force-pushed the mm_fix_checkpoint_file_comparison branch from fddd074 to dc2170e Compare September 3, 2024 07:46
@bart0sh
Contributor

bart0sh commented Sep 9, 2024

/retest

Contributor

@ffromani ffromani left a comment

provisional LGTM (need to do another pass).
The ordering of allocation should not matter indeed, only the totals should matter

Contributor

@ffromani ffromani left a comment

provisional LGTM

since we cannot guarantee we recreate pods in the very same ordering across restarts, there are multiple legal representations (even including pinning) of the same memory state, thus we should check the aggregate values
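
The ordering point can be shown with a toy model. This is only a sketch under stated assumptions: `node`, `placeFirstFit`, and `aggregate` are hypothetical names, and the first-fit placement is an illustrative policy, not the Memory Manager's allocation logic. Sizes are in GiB to keep the arithmetic readable.

```go
package main

// node is a hypothetical NUMA node with a fixed capacity and a running
// reserved amount, both in GiB.
type node struct {
	Capacity uint64
	Reserved uint64
}

// placeFirstFit reserves each request on the first node that still has
// enough free space. Requests that fit on no node are skipped.
// This is a toy policy used only to show that ordering changes the layout.
func placeFirstFit(capacities []uint64, requests []uint64) []node {
	nodes := make([]node, len(capacities))
	for i, c := range capacities {
		nodes[i].Capacity = c
	}
	for _, req := range requests {
		for i := range nodes {
			if nodes[i].Capacity-nodes[i].Reserved >= req {
				nodes[i].Reserved += req
				break
			}
		}
	}
	return nodes
}

// aggregate returns the total free and total reserved memory over all nodes.
func aggregate(nodes []node) (free, reserved uint64) {
	for _, n := range nodes {
		free += n.Capacity - n.Reserved
		reserved += n.Reserved
	}
	return free, reserved
}
```

With two 4 GiB nodes, placing requests in the order 3, 2, 1 reserves 4 GiB on the first node and 2 GiB on the second, while the order 2, 1, 3 reserves 3 GiB on each. The per-node layouts differ, yet `aggregate` reports 2 GiB free and 6 GiB reserved in both cases, which is why comparing aggregate values accepts both as the same memory state.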

donggangcj and others added 4 commits October 29, 2024 12:08
(cherry picked from commit de03335)
(cherry picked from commit 91a9a19)
For a resource within a group, such as memory,
we should validate the total `Free` and total `Reserved` size of the expected `machineState` against the state restored from the checkpoint file after kubelet starts.
If the total `Free` and total `Reserved` are equal, the restored state is valid.

The old comparison, however, was done by reflection.

There are cases where the overall memory accounting is equal
but the allocations across the NUMA nodes differ.

In such cases we still need to consider the states equal.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the mm_fix_checkpoint_file_comparison branch 2 times, most recently from 4459d6b to 18696da Compare October 29, 2024 10:24
perform the memoryStates comparison in helper function

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the mm_fix_checkpoint_file_comparison branch from 18696da to d64f34e Compare October 29, 2024 12:22
Contributor

@ffromani ffromani left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 2, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b5f370a652a9e22d8f83920885c9f39569958492

@ffromani
Contributor

ffromani commented Nov 2, 2024

/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3036d10 into kubernetes:master Nov 4, 2024
20 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 4, 2024
@Tal-or Tal-or deleted the mm_fix_checkpoint_file_comparison branch November 4, 2024 07:36
Labels

- `approved`: Indicates a PR has been approved by an approver from all required OWNERS files.
- `area/kubelet`
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/bug`: Categorizes issue or PR as related to a bug.
- `lgtm`: "Looks good to me", indicates that a PR is ready to be merged.
- `ok-to-test`: Indicates a non-member PR verified by an org member that is safe to test.
- `priority/important-longterm`: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- `release-note-none`: Denotes a PR that doesn't merit a release note.
- `sig/node`: Categorizes an issue or PR as relevant to SIG Node.
- `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.
- `triage/accepted`: Indicates an issue or PR is ready to be actively worked on.

6 participants