Memory manager #95479

Merged
31 commits
4c75be0
memory manager: provide the skeleton for the memory manager
Mar 5, 2020
48ca6e5
memory manager: provide and use the checkpoint manager
Mar 10, 2020
86df524
memory manager: provide unittest for the state package
Oct 8, 2020
d0caec9
memory manager: add the policy interface
Oct 8, 2020
95f8137
memory manager: implement the manager interface methods
Mar 19, 2020
b95d45e
memory manager: add new flag type BracketSeparatedSliceMapStringString
Oct 8, 2020
93accb5
memory manager: add memory manager flag under kubelet options and kub…
Oct 8, 2020
9ae499a
memory manager: pass memory manager flags to the container manager
Oct 8, 2020
711e85a
memory manager: adding additional tests for server.go file, for parse…
k-wiatrzyk Apr 22, 2020
4a64102
memory manager: validate reserved-memory against Node Allocatable
cezaryzukowski Apr 23, 2020
afb1ae3
memory manager: add fake memory manager
Mar 24, 2020
371c918
memory manager: add memory manager policy to defaulter and conversion…
Apr 1, 2020
abb94be
memory manager: implement the memory manager static policy
Mar 29, 2020
18c8a82
memory manager: implement GetPodTopologyHints method
pablitoergosum Sep 25, 2020
d7175a8
memory manager: adding Memory Manager component unit tests
k-wiatrzyk Jun 3, 2020
f7845ed
memory manager: provide memory manager static policy unittests
Oct 11, 2020
24be74e
memory manager: update bazel files
Oct 11, 2020
27c5efe
memory manager: fix scheme unit test
Nov 2, 2020
aa63e5a
memory manager: provide an additional validation for reserved memory
Nov 4, 2020
a015e41
memory manager: rename state structs and fields
Nov 4, 2020
f3d4ac2
memory manager: add basice e2e tests
Oct 15, 2020
606fea2
memory manager: add e2e test to run guaranteed pod with init containers
pablitoergosum Nov 4, 2020
74eeef2
memory manager: provide additional e2e tests
Nov 5, 2020
ff2a110
memory manager: provide the new type to contain resources for each NU…
Nov 12, 2020
d0089db
memory manager: remove unused variable under stateCheckpoint
Nov 12, 2020
0fa5dd5
memory manager: move the fakeTopologyManagerWithHint
Nov 12, 2020
b7cfc40
memory manager: update kubelet config API
Nov 17, 2020
7561a0f
memory manager: provide new flag var to parse reserved-memory parameter
Nov 17, 2020
e8ea461
memory manager: update all relevant part of code to use []MemoryReser…
Nov 17, 2020
9321340
memory manager: update API constant to have camel case format
Nov 17, 2020
1021244
memory manager: improve the reserved memory validation logic
Dec 14, 2020
1 change: 1 addition & 0 deletions api/api-rules/violation_exceptions.list
Expand Up @@ -392,6 +392,7 @@ API rule violation: list_type_missing,k8s.io/kubelet/config/v1alpha1,CredentialP
API rule violation: list_type_missing,k8s.io/kubelet/config/v1beta1,KubeletConfiguration,AllowedUnsafeSysctls
API rule violation: list_type_missing,k8s.io/kubelet/config/v1beta1,KubeletConfiguration,ClusterDNS
API rule violation: list_type_missing,k8s.io/kubelet/config/v1beta1,KubeletConfiguration,EnforceNodeAllocatable
API rule violation: list_type_missing,k8s.io/kubelet/config/v1beta1,KubeletConfiguration,ReservedMemory
API rule violation: list_type_missing,k8s.io/kubelet/config/v1beta1,KubeletConfiguration,TLSCipherSuites
API rule violation: list_type_missing,k8s.io/metrics/pkg/apis/metrics/v1alpha1,PodMetrics,Containers
API rule violation: list_type_missing,k8s.io/metrics/pkg/apis/metrics/v1beta1,PodMetrics,Containers
Expand Down
5 changes: 5 additions & 0 deletions cmd/kubelet/app/options/options.go
Expand Up @@ -550,4 +550,9 @@ Runtime log sanitization may introduce significant computation overhead and ther

// Graduated experimental flags, kept for backward compatibility
fs.BoolVar(&c.KernelMemcgNotification, "experimental-kernel-memcg-notification", c.KernelMemcgNotification, "Use kernelMemcgNotification configuration, this flag will be removed in 1.23.")

// Memory Manager Flags
fs.StringVar(&c.MemoryManagerPolicy, "memory-manager-policy", c.MemoryManagerPolicy, "Memory Manager policy to use. Possible values: 'None', 'Static'. Default: 'None'")
// TODO: once documentation link is available, replace KEP link with the documentation one.
fs.Var(&utilflag.ReservedMemoryVar{Value: &c.ReservedMemory}, "reserved-memory", "A comma separated list of memory reservations for NUMA nodes. (e.g. --reserved-memory 0:memory=1Gi,hugepages-1M=2Gi --reserved-memory 1:memory=2Gi). The total sum for each memory type should be equal to the sum of kube-reserved, system-reserved and eviction-threshold. See more details under https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#reserved-memory-flag")
}
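The `--reserved-memory` help text above describes values of the form `<numa-node>:<type>=<quantity>[,<type>=<quantity>...]`. As a rough illustration of that format only, here is a minimal sketch of a parser. The function `parseReservedMemory` is hypothetical and is not the `utilflag.ReservedMemoryVar` implementation from this PR; it keeps quantities as raw strings rather than parsing them into `resource.Quantity`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseReservedMemory is a hypothetical helper (not the kubelet's ReservedMemoryVar).
// It splits a value such as "0:memory=1Gi,hugepages-1M=2Gi" into a NUMA node ID
// and a map from memory type to the uninterpreted quantity string.
func parseReservedMemory(value string) (int, map[string]string, error) {
	parts := strings.SplitN(value, ":", 2)
	if len(parts) != 2 {
		return 0, nil, fmt.Errorf("expected <numa-node>:<type>=<quantity>[,...], got %q", value)
	}
	node, err := strconv.Atoi(parts[0])
	if err != nil || node < 0 {
		return 0, nil, fmt.Errorf("invalid NUMA node ID %q", parts[0])
	}
	limits := make(map[string]string)
	for _, pair := range strings.Split(parts[1], ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 || kv[0] == "" || kv[1] == "" {
			return 0, nil, fmt.Errorf("invalid reservation %q", pair)
		}
		if _, dup := limits[kv[0]]; dup {
			return 0, nil, fmt.Errorf("duplicate memory type %q", kv[0])
		}
		limits[kv[0]] = kv[1]
	}
	return node, limits, nil
}

func main() {
	// The example value is taken from the flag's help text.
	node, limits, err := parseReservedMemory("0:memory=1Gi,hugepages-1M=2Gi")
	fmt.Println(node, limits, err)
}
```

Because the flag may be repeated once per NUMA node, a real flag type would accumulate one such result per occurrence.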
19 changes: 11 additions & 8 deletions cmd/kubelet/app/server.go
Expand Up @@ -687,6 +687,7 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
s.SystemReserved["cpu"] = strconv.Itoa(reservedSystemCPUs.Size())
klog.Infof("After cpu setting is overwritten, KubeReserved=\"%v\", SystemReserved=\"%v\"", s.KubeReserved, s.SystemReserved)
}

kubeReserved, err := parseResourceList(s.KubeReserved)
if err != nil {
return err
Expand Down Expand Up @@ -732,14 +733,16 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
ReservedSystemCPUs: reservedSystemCPUs,
HardEvictionThresholds: hardEvictionThresholds,
},
- QOSReserved: *experimentalQOSReserved,
- ExperimentalCPUManagerPolicy: s.CPUManagerPolicy,
- ExperimentalCPUManagerReconcilePeriod: s.CPUManagerReconcilePeriod.Duration,
- ExperimentalPodPidsLimit: s.PodPidsLimit,
- EnforceCPULimits: s.CPUCFSQuota,
- CPUCFSQuotaPeriod: s.CPUCFSQuotaPeriod.Duration,
- ExperimentalTopologyManagerPolicy: s.TopologyManagerPolicy,
- ExperimentalTopologyManagerScope: s.TopologyManagerScope,
+ QOSReserved: *experimentalQOSReserved,
+ ExperimentalCPUManagerPolicy: s.CPUManagerPolicy,
+ ExperimentalCPUManagerReconcilePeriod: s.CPUManagerReconcilePeriod.Duration,
+ ExperimentalMemoryManagerPolicy: s.MemoryManagerPolicy,
+ ExperimentalMemoryManagerReservedMemory: s.ReservedMemory,
+ ExperimentalPodPidsLimit: s.PodPidsLimit,
+ EnforceCPULimits: s.CPUCFSQuota,
+ CPUCFSQuotaPeriod: s.CPUCFSQuotaPeriod.Duration,
+ ExperimentalTopologyManagerPolicy: s.TopologyManagerPolicy,
+ ExperimentalTopologyManagerScope: s.TopologyManagerScope,
},
s.FailSwapOn,
devicePluginEnabled,
Expand Down
7 changes: 7 additions & 0 deletions pkg/features/kube_features.go
Expand Up @@ -123,6 +123,12 @@ const (
// Enable resource managers to make NUMA aligned decisions
TopologyManager featuregate.Feature = "TopologyManager"

// owner: @cynepco3hahue(alukiano) @cezaryzukowski @k-wiatrzyk
// alpha: v1.20
//
// Allows setting memory affinity for a container based on NUMA topology
MemoryManager featuregate.Feature = "MemoryManager"

// owner: @sjenning
// beta: v1.11
//
Expand Down Expand Up @@ -697,6 +703,7 @@ var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureS
ExpandInUsePersistentVolumes: {Default: true, PreRelease: featuregate.Beta},
ExpandCSIVolumes: {Default: true, PreRelease: featuregate.Beta},
CPUManager: {Default: true, PreRelease: featuregate.Beta},
MemoryManager: {Default: false, PreRelease: featuregate.Alpha},
CPUCFSQuotaPeriod: {Default: false, PreRelease: featuregate.Alpha},
TopologyManager: {Default: true, PreRelease: featuregate.Beta},
ServiceNodeExclusion: {Default: true, PreRelease: featuregate.GA, LockToDefault: true}, // remove in 1.22
Expand Down
1 change: 1 addition & 0 deletions pkg/kubelet/apis/config/fuzzer/fuzzer.go
Expand Up @@ -62,6 +62,7 @@ func Funcs(codecs runtimeserializer.CodecFactory) []interface{} {
obj.KernelMemcgNotification = false
obj.MaxOpenFiles = 1000000
obj.MaxPods = 110
obj.MemoryManagerPolicy = v1beta1.NoneMemoryManagerPolicy
obj.PodPidsLimit = -1
obj.NodeStatusUpdateFrequency = metav1.Duration{Duration: 10 * time.Second}
obj.NodeStatusReportFrequency = metav1.Duration{Duration: time.Minute}
Expand Down
9 changes: 9 additions & 0 deletions pkg/kubelet/apis/config/helpers_test.go
Expand Up @@ -206,6 +206,7 @@ var (
"StaticPodURLHeader[*][*]",
"MaxOpenFiles",
"MaxPods",
"MemoryManagerPolicy",
"NodeLeaseDurationSeconds",
"NodeStatusMaxImages",
"NodeStatusUpdateFrequency.Duration",
Expand All @@ -220,6 +221,14 @@ var (
"ReadOnlyPort",
"RegistryBurst",
"RegistryPullQPS",
"ReservedMemory[*].Limits[*].Format",
"ReservedMemory[*].Limits[*].d.Dec.scale",
"ReservedMemory[*].Limits[*].d.Dec.unscaled.abs[*]",
"ReservedMemory[*].Limits[*].d.Dec.unscaled.neg",
"ReservedMemory[*].Limits[*].i.scale",
"ReservedMemory[*].Limits[*].i.value",
"ReservedMemory[*].Limits[*].s",
"ReservedMemory[*].NumaNode",
"ReservedSystemCPUs",
"RuntimeRequestTimeout.Duration",
"RunOnce",
Expand Down
Expand Up @@ -55,6 +55,7 @@ logging:
makeIPTablesUtilChains: true
maxOpenFiles: 1000000
maxPods: 110
memoryManagerPolicy: None
nodeLeaseDurationSeconds: 40
nodeStatusMaxImages: 50
nodeStatusReportFrequency: 5m0s
Expand Down
Expand Up @@ -55,6 +55,7 @@ logging:
makeIPTablesUtilChains: true
maxOpenFiles: 1000000
maxPods: 110
memoryManagerPolicy: None
nodeLeaseDurationSeconds: 40
nodeStatusMaxImages: 50
nodeStatusReportFrequency: 5m0s
Expand Down
23 changes: 23 additions & 0 deletions pkg/kubelet/apis/config/types.go
Expand Up @@ -224,6 +224,9 @@ type KubeletConfiguration struct {
// CPU Manager reconciliation period.
// Requires the CPUManager feature gate to be enabled.
CPUManagerReconcilePeriod metav1.Duration
// MemoryManagerPolicy is the name of the policy to use.
// Requires the MemoryManager feature gate to be enabled.
MemoryManagerPolicy string
// TopologyManagerPolicy is the name of the policy to use.
// Policies other than "none" require the TopologyManager feature gate to be enabled.
TopologyManagerPolicy string
Expand Down Expand Up @@ -382,6 +385,20 @@ type KubeletConfiguration struct {
// Defaults to 10 seconds, requires GracefulNodeShutdown feature gate to be enabled.
// For example, if ShutdownGracePeriod=30s, and ShutdownGracePeriodCriticalPods=10s, during a node shutdown the first 20 seconds would be reserved for gracefully terminating normal pods, and the last 10 seconds would be reserved for terminating critical pods.
ShutdownGracePeriodCriticalPods metav1.Duration
// ReservedMemory specifies a comma-separated list of memory reservations for NUMA nodes.
// The parameter makes sense only in the context of the memory manager feature. The memory manager will not allocate reserved memory for container workloads.
// For example, if you have a NUMA0 with 10Gi of memory and the ReservedMemory was specified to reserve 1Gi of memory at NUMA0,
// the memory manager will assume that only 9Gi is available for allocation.
// You can specify a different number of NUMA nodes and memory types.
// You can omit this parameter entirely, but be aware that the total amount of reserved memory across all NUMA nodes
// should be equal to the amount of memory specified by the node allocatable features (https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable).
// If at least one node allocatable parameter has a non-zero value, you will need to specify at least one NUMA node.
// Also, avoid specifying:
// 1. Duplicates: the same NUMA node and memory type, but with a different value.
// 2. Zero limits for any memory type.
// 3. NUMA node IDs that do not exist on the machine.
// 4. Memory types other than memory and hugepages-<size>.
ReservedMemory []MemoryReservation
Contributor

How does this interact with the SystemReserved, KubeReserved, and QOSReserved flags? The documentation for the field and the flag needs to explain it. Can you describe it for me here in a comment so I can review before it gets changed in the code?

Contributor

Hi @smarterclayton ,

Could you provide a reference for QOSReserved? (This is the first time I have encountered this term.)

I hope this excerpt from the draft of the official documentation helps (note that the syntax for --reserved-memory= is outdated; we will update it in the final doc):

Reserved memory flag

The Node Allocatable feature is commonly used by node administrators to reserve Kubernetes node system resources for the kubelet or operating system processes in order to enhance node stability. A dedicated set of flags sets the total amount of reserved memory for a node. This pre-configured value is then used to calculate the real amount of the node's "allocatable" memory available to pods. The Kubernetes scheduler also uses "allocatable" to optimise pod scheduling. These flags are --kube-reserved, --system-reserved and --eviction-threshold; the sum of their values is the total amount of reserved memory.

A new --reserved-memory flag was added to Memory Manager to allow for this total reserved memory to be split (by a node administrator) and accordingly reserved across many NUMA nodes.

Syntax:

--reserved-memory=[{numa-node=int,type=string,limit=string}][,][...]

  • numa-node index, e.g. 0
  • type of memory:
    • memory - conventional memory
    • hugepages-2Mi or hugepages-1Gi - hugepages
  • limit - the amount of reserved memory, e.g. 1Gi

Example usage:

--reserved-memory={numa-node=0,type=memory,limit=1Gi},{numa-node=1,type=memory,limit=2Gi}

When you specify values for the --reserved-memory flag, they must be consistent with the settings you previously provided via the Node Allocatable feature flags. That is, the following rule must hold for each memory type:

sum(reserved-memory(i)) = kube-reserved + system-reserved + eviction-threshold,

where i is an index of a NUMA node.

If you do not follow the formula above, the Memory Manager will show an error on startup.

In other words, the example above illustrates that for the conventional memory (type=memory), we reserve 3Gi in total, i.e.:

sum(reserved-memory(i)) = reserved-memory(0) + reserved-memory(1) = 1Gi + 2Gi = 3Gi

An example of Node Allocatable Feature flags configuration:

  • --kube-reserved=cpu=500m,memory=50Mi
  • --system-reserved=cpu=123m,memory=333Mi
  • --eviction-hard=memory.available<500Mi

NOTICE: the hard eviction threshold defaults to 100Mi, not zero, so remember to include this 100Mi in the total amount set via --reserved-memory. Otherwise, the Memory Manager will display an error. Here is an example of a correct configuration:

--feature-gates=MemoryManager=true 
--kube-reserved=cpu=4,memory=4Gi 
--system-reserved=cpu=1,memory=1Gi 
--memory-manager-policy=Static 
--reserved-memory={numa-node=0,type=memory,limit=3Gi},{numa-node=1,type=memory,limit=2148Mi}

Let us validate the configuration above:

  1. kube-reserved + system-reserved + eviction-hard(default) = reserved-memory(0) + reserved-memory(1)
  2. 4Gi + 1Gi + 100Mi = 3Gi + 2148Mi
  3. 5120Mi + 100Mi = 3072Mi + 2148Mi
  4. 5220Mi = 5220Mi (correct!)
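
The check walked through above can be sketched in a few lines of Go. This is only an illustration of the arithmetic, not the validation code added by this PR: `checkReservedMemory` is a hypothetical helper, and quantities are plain MiB integers rather than `resource.Quantity` values.

```go
package main

import "fmt"

// checkReservedMemory is a hypothetical helper illustrating the rule
//   sum(reserved-memory(i)) = kube-reserved + system-reserved + eviction-hard
// for a single memory type. All arguments are in MiB (an assumption made
// to keep the sketch free of Kubernetes dependencies).
func checkReservedMemory(perNUMAMi []int64, kubeMi, systemMi, evictionMi int64) error {
	var total int64
	for _, mi := range perNUMAMi {
		total += mi
	}
	want := kubeMi + systemMi + evictionMi
	if total != want {
		return fmt.Errorf("reserved memory across NUMA nodes is %dMi, expected %dMi", total, want)
	}
	return nil
}

func main() {
	// Values from the example: 3Gi and 2148Mi on two NUMA nodes,
	// kube-reserved 4Gi, system-reserved 1Gi, default hard eviction 100Mi.
	if err := checkReservedMemory([]int64{3072, 2148}, 4096, 1024, 100); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("valid")
}
```

Dropping the 100Mi from the second NUMA node (2048Mi instead of 2148Mi) makes the same call return an error, which is the startup failure the notice warns about.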

Author

Sure. We expect the following formula to hold for each memory type:
sum(ReservedMemory(i)) = kube-reserved + system-reserved + hard-eviction-threshold

For example, given:

--kube-reserved=cpu=500m,memory=250Mi
--system-reserved=cpu=500m,memory=250Mi
--eviction-hard=memory.available<500Mi

the reserved total is 250Mi + 250Mi + 500Mi = 1000Mi, so the total amount of memory defined under the reserved-memory flag should be equal to 1000Mi. Different combinations are possible: reserve all of the memory from the first NUMA node with --reserved-memory 0:memory=1000Mi, or reserve equally from two NUMA nodes with --reserved-memory 0:memory=500Mi --reserved-memory 1:memory=500Mi.

Until the documentation PR is merged we can point to the KEP section; once it is merged I will update the link to point to the documentation (the same way as for the CLI flag).

}

// KubeletAuthorizationMode denotes the authorization mode for the kubelet
Expand Down Expand Up @@ -535,3 +552,9 @@ type ExecEnvVar struct {
Name string
Value string
}

// MemoryReservation specifies the memory reservation of different types for each NUMA node
type MemoryReservation struct {
NumaNode int32
Limits v1.ResourceList
}
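
The "avoid specifying" rules in the `ReservedMemory` field comment can be illustrated with a small standalone check. The sketch below is an assumption-laden stand-in, not the contents of `validation_reserved_memory.go`: `memoryReservation` and `validateReservations` are hypothetical names, limits are byte counts instead of a `v1.ResourceList`, and the rule about NUMA node IDs existing on the machine is left out because it needs the machine topology.

```go
package main

import (
	"fmt"
	"strings"
)

// memoryReservation is a hypothetical stand-in for MemoryReservation,
// with limits expressed in bytes to avoid Kubernetes dependencies.
type memoryReservation struct {
	NumaNode int32
	Limits   map[string]int64
}

// validateReservations reports duplicate (NUMA node, memory type) pairs,
// non-positive limits, and memory types other than memory and hugepages-<size>.
func validateReservations(reservations []memoryReservation) []error {
	var errs []error
	seen := make(map[string]bool)
	for _, r := range reservations {
		for name, limit := range r.Limits {
			if name != "memory" && !strings.HasPrefix(name, "hugepages-") {
				errs = append(errs, fmt.Errorf("NUMA node %d: unsupported memory type %q", r.NumaNode, name))
			}
			if limit <= 0 {
				errs = append(errs, fmt.Errorf("NUMA node %d: limit for %q must be greater than zero", r.NumaNode, name))
			}
			key := fmt.Sprintf("%d/%s", r.NumaNode, name)
			if seen[key] {
				errs = append(errs, fmt.Errorf("NUMA node %d: duplicate reservation for %q", r.NumaNode, name))
			}
			seen[key] = true
		}
	}
	return errs
}

func main() {
	errs := validateReservations([]memoryReservation{
		{NumaNode: 0, Limits: map[string]int64{"memory": 1 << 30}},
		{NumaNode: 0, Limits: map[string]int64{"memory": 2 << 30}}, // duplicate of node 0 / memory
		{NumaNode: 1, Limits: map[string]int64{"cpu": 0}},          // unsupported type and zero limit
	})
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

Collecting every violation instead of returning on the first one matches how `ValidateKubeletConfiguration` aggregates errors into `allErrors` in this diff.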
2 changes: 2 additions & 0 deletions pkg/kubelet/apis/config/v1beta1/BUILD
Expand Up @@ -19,10 +19,12 @@ go_library(
],
importpath = "k8s.io/kubernetes/pkg/kubelet/apis/config/v1beta1",
deps = [
"//pkg/apis/core/v1:go_default_library",
"//pkg/cluster/ports:go_default_library",
"//pkg/kubelet/apis/config:go_default_library",
"//pkg/kubelet/qos:go_default_library",
"//pkg/kubelet/types:go_default_library",
"//staging/src/k8s.io/api/core/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/apis/meta/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/conversion:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/runtime:go_default_library",
Expand Down
4 changes: 4 additions & 0 deletions pkg/kubelet/apis/config/v1beta1/defaults.go
Expand Up @@ -23,6 +23,7 @@ import (
kruntime "k8s.io/apimachinery/pkg/runtime"
componentbaseconfigv1alpha1 "k8s.io/component-base/config/v1alpha1"
kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"

// TODO: Cut references to k8s.io/kubernetes, eventually there should be none from this package
"k8s.io/kubernetes/pkg/cluster/ports"
"k8s.io/kubernetes/pkg/kubelet/qos"
Expand Down Expand Up @@ -154,6 +155,9 @@ func SetDefaults_KubeletConfiguration(obj *kubeletconfigv1beta1.KubeletConfigura
// Keep the same as default NodeStatusUpdateFrequency
obj.CPUManagerReconcilePeriod = metav1.Duration{Duration: 10 * time.Second}
}
if obj.MemoryManagerPolicy == "" {
obj.MemoryManagerPolicy = kubeletconfigv1beta1.NoneMemoryManagerPolicy
}
if obj.TopologyManagerPolicy == "" {
obj.TopologyManagerPolicy = kubeletconfigv1beta1.NoneTopologyManagerPolicy
}
Expand Down
37 changes: 37 additions & 0 deletions pkg/kubelet/apis/config/v1beta1/zz_generated.conversion.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions pkg/kubelet/apis/config/v1beta1/zz_generated.defaults.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 9 additions & 1 deletion pkg/kubelet/apis/config/validation/BUILD
Expand Up @@ -11,14 +11,17 @@ go_library(
srcs = [
"validation.go",
"validation_others.go",
"validation_reserved_memory.go",
"validation_windows.go",
],
importpath = "k8s.io/kubernetes/pkg/kubelet/apis/config/validation",
deps = [
"//pkg/apis/core/v1/helper:go_default_library",
"//pkg/features:go_default_library",
"//pkg/kubelet/apis/config:go_default_library",
"//pkg/kubelet/cm/cpuset:go_default_library",
"//pkg/kubelet/types:go_default_library",
"//staging/src/k8s.io/api/core/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/apis/meta/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/util/errors:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/util/validation:go_default_library",
Expand All @@ -43,10 +46,15 @@ filegroup(

go_test(
name = "go_default_test",
- srcs = ["validation_test.go"],
+ srcs = [
+ "validation_reserved_memory_test.go",
+ "validation_test.go",
+ ],
embed = [":go_default_library"],
deps = [
"//pkg/kubelet/apis/config:go_default_library",
"//staging/src/k8s.io/api/core/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/api/resource:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/apis/meta/v1:go_default_library",
"//staging/src/k8s.io/apimachinery/pkg/util/errors:go_default_library",
],
Expand Down
2 changes: 2 additions & 0 deletions pkg/kubelet/apis/config/validation/validation.go
Expand Up @@ -193,6 +193,8 @@ func ValidateKubeletConfiguration(kc *kubeletconfig.KubeletConfiguration) error
}
}

allErrors = append(allErrors, validateReservedMemoryConfiguration(kc)...)

if err := validateKubeletOSConfiguration(kc); err != nil {
allErrors = append(allErrors, err)
}
Expand Down