
Implement support for multiple sizes huge pages #84051

Merged

Conversation

bart0sh (Contributor) commented Oct 17, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is an implementation of the recently merged update of the hugepages KEP.

It was tested on a local cluster with allocated huge pages of two sizes:

$ kubectl describe node |grep -A4 Allocatable
Allocatable:
  cpu:                24
  ephemeral-storage:  905877587288
  hugepages-1Gi:      2Gi
  hugepages-2Mi:      20Mi

With this pod configuration:

kind: Pod
apiVersion: v1
metadata:
  name: test
spec:
  containers:
    - name: test
      image: ubuntu
      command: ["/bin/sh", "-c", "sleep 30000"]
 
      resources:
        requests:
          cpu: "250m"
          hugepages-2Mi: 2Mi
          hugepages-1Gi: 2Gi
        limits:
          cpu: "250m"
          hugepages-2Mi: 2Mi
          hugepages-1Gi: 2Gi

      volumeMounts:
      - mountPath: /hugepages-2Mi
        name: hugepage-2mi
      - mountPath: /hugepages-1Gi
        name: hugepage-1gi

  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi

  restartPolicy: Never

Huge pages of both sizes were mounted correctly in the container:

# mount |grep hugepages
nodev on /hugepages-2Mi type hugetlbfs (rw,relatime,pagesize=2M)
nodev on /hugepages-1Gi type hugetlbfs (rw,relatime,pagesize=1024M)

Pod-level allocations for both huge page sizes look correct as well:

$ cat /sys/fs/cgroup/hugetlb/kubepods/burstable/pod1b2ff802-2560-4e77-ba16-96f0aa50530d/hugetlb.2MB.limit_in_bytes 
2097152
$ cat /sys/fs/cgroup/hugetlb/kubepods/burstable/pod1b2ff802-2560-4e77-ba16-96f0aa50530d/hugetlb.1GB.limit_in_bytes 
2147483648
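
As a sanity check, the cgroup limits above are exactly the byte values of the requested quantities. A minimal Go sketch using the apimachinery resource package (not part of the PR, just the arithmetic):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Quantities requested by the pod above; their byte values should match
	// the pod-level hugetlb cgroup limits shown in the output.
	twoMi := resource.MustParse("2Mi")
	twoGi := resource.MustParse("2Gi")
	fmt.Println(twoMi.Value()) // 2097152, matches hugetlb.2MB.limit_in_bytes
	fmt.Println(twoGi.Value()) // 2147483648, matches hugetlb.1GB.limit_in_bytes
}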

Does this PR introduce a user-facing change?:

Added support for multiple sizes of huge pages on a container level
bart0sh (Contributor, Author) commented Oct 17, 2019 (comment minimized)

fejta-bot commented Oct 17, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

odinuge (Member) left a comment

Just some initial thoughts, but otherwise this makes much sense to me. Thanks for your work on this.

The questions about validation are open for discussion, and should probably not be a part of this PR.

Test failures look valid, so those should be addressed. We also have some e2e tests for hugepages; have you verified that they still pass?

Adding a hold since this depends on node-level support: #82820
/hold

Review threads:
pkg/volume/emptydir/empty_dir.go (2 resolved, 3 outdated)
pkg/apis/core/v1/helper/helpers.go (resolved)
pkg/apis/core/types.go (resolved)
odinuge (Member) commented Oct 17, 2019

/sig node

bg-chun (Member) commented Oct 17, 2019

I think that the overall changes in emptyDir.go follow the KEP update as well.

Regarding medium: HugePages, I have a question.
It seems that the validation logic will pass the case below.
The sample Pod Spec below shows a Pod that has both medium: HugePages and medium: HugePages-1Gi volumes and consumes only 1Gi huge pages.

The KEP says that "For backwards compatibility, a pod that uses one page size should pass validation if a volume emptyDir medium=HugePages notation is used."
There is no actual restriction on the volume itself; we only put a restriction on the page size.

Would it be okay to allow this case?
I think it is just a little bit weird, but I guess there will be no issue with consuming single-size huge pages in the pod below.
(Correct me if my understanding is wrong.)

[Pod Spec]

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container1
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  - name: container2
    volumeMounts:
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi
kad (Member) commented Oct 18, 2019

(quoting the [Pod Spec] from the previous comment)

This is an interesting corner case. Volumes are pod-level, and right now the sizes of those volumes are calculated as the sum of all container requests/limits for hugepages. In theory, if a pod has more than one container and more than one hugetlbfs mount, and not all of those volumeMounts are used in all of the containers, we need to change the validation logic and the logic for calculating the size of each hugepage volume.
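
A minimal, illustrative Go sketch of the pod-level accounting described here (summing each container's limits per hugepage size). The helper below is not the actual kubelet code, only an assumption of how that sum could be expressed:

package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podHugePageLimits sums container limits per hugepage size, which is how the
// pod-level hugetlb cgroup limits are effectively derived in this discussion.
func podHugePageLimits(pod *v1.Pod) map[v1.ResourceName]int64 {
	limits := map[v1.ResourceName]int64{}
	for _, c := range pod.Spec.Containers {
		for name, qty := range c.Resources.Limits {
			if strings.HasPrefix(string(name), "hugepages-") {
				limits[name] += qty.Value()
			}
		}
	}
	return limits
}

func main() {
	// Two containers, each limited to 2Gi of 1Gi pages, as in the spec above.
	pod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{
		{Resources: v1.ResourceRequirements{Limits: v1.ResourceList{
			"hugepages-1Gi": resource.MustParse("2Gi"),
		}}},
		{Resources: v1.ResourceRequirements{Limits: v1.ResourceList{
			"hugepages-1Gi": resource.MustParse("2Gi"),
		}}},
	}}}
	fmt.Println(podHugePageLimits(pod)) // map[hugepages-1Gi:4294967296], i.e. 4Gi at the pod level
}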

odinuge (Member) commented Oct 18, 2019

@kad: This is an interesting corner case. Volumes are pod-level, and right now the sizes of those volumes are calculated as the sum of all container requests/limits for hugepages. In theory, if a pod has more than one container and more than one hugetlbfs mount, and not all of those volumeMounts are used in all of the containers, we need to change the validation logic and the logic for calculating the size of each hugepage volume.

Not sure if I understand what you mean. The "size" of a volume mount of type hugetlbfs is the page size used, not the amount of memory available. By default a program can use all the pre-allocated huge page memory from such a mount, but this is limited via the hugetlb cgroup.

How to handle multiple containers with multiple hugepage sizes, together with different volumes, is maybe something we can incorporate into the KEP, but I think it should be OK as it is now too. The reason we verify that the page size used in a mount appears in requests/limits is to make sure that it is valid on the node that schedules the pod (a sketch of this check follows the examples below). It is not a problem that a container without any huge page limit/request mounts a hugetlbfs volume (even when we start supporting container-level cgroup enforcement), since the cgroup will not allow the processes to use it.

The problem arises when a pod uses a page size in a volume without having that size in the requests/limits:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages # alternatively HugePages-1Gi

This (the example above) can be valid on a given node if it supports 1GiB pages, but we cannot be sure, since the scheduler doesn't take huge page support into account when finding a node. This is a case we successfully validate during volume mounting today, but the pod spec is still "valid" in the sense that the apiserver will accept it.

In the pod below we know that the node running the pod supports 1GiB huge pages, so container2 can mount the volume, but cgroup enforcement will limit its usage to 0 (once we start supporting container-level enforcement; today we only enforce it at the pod level). So I think this example is a pod spec that should be treated as valid:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container1
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  - name: container2
    volumeMounts:
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi
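
A minimal, illustrative sketch of the check described above: every sized hugepage emptyDir medium should be backed by a matching hugepages-<size> resource somewhere in the pod. This is not the PR's actual validation code, and it compares the size suffixes as plain strings for simplicity:

package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// hugePageMediumsCovered reports whether every emptyDir volume with a
// HugePages-<size> medium has a matching hugepages-<size> limit in at least
// one container. The legacy, size-less "HugePages" medium is skipped here.
func hugePageMediumsCovered(pod *v1.Pod) bool {
	requested := map[string]bool{}
	for _, c := range pod.Spec.Containers {
		for name := range c.Resources.Limits {
			if strings.HasPrefix(string(name), "hugepages-") {
				requested[strings.TrimPrefix(string(name), "hugepages-")] = true
			}
		}
	}
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir == nil {
			continue
		}
		medium := string(vol.EmptyDir.Medium)
		if strings.HasPrefix(medium, "HugePages-") {
			if !requested[strings.TrimPrefix(medium, "HugePages-")] {
				return false
			}
		}
	}
	return true
}

func main() {
	// As in the example above: only container1 requests 1Gi pages, but the
	// HugePages-1Gi volume is still covered at the pod level.
	pod := &v1.Pod{Spec: v1.PodSpec{
		Containers: []v1.Container{{
			Resources: v1.ResourceRequirements{Limits: v1.ResourceList{
				"hugepages-1Gi": resource.MustParse("2Gi"),
			}},
		}},
		Volumes: []v1.Volume{{
			Name:         "hugepage-1Gi",
			VolumeSource: v1.VolumeSource{EmptyDir: &v1.EmptyDirVolumeSource{Medium: "HugePages-1Gi"}},
		}},
	}}
	fmt.Println(hugePageMediumsCovered(pod)) // true
}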
@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch from 5d89c9a to 1feb251 Oct 21, 2019
bart0sh (Contributor, Author) commented Oct 21, 2019

@odinuge @bg-chun thank you for the review! I've updated the PR according to your suggestions. Please review again.

bart0sh (Contributor, Author) commented Oct 22, 2019

/test pull-kubernetes-e2e-gce-storage-slow

@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch from 515630e to 1a5cad6 Feb 13, 2020
bart0sh (Contributor, Author) commented Feb 13, 2020

@liggitt > is it expected that pods today could be specifying these new hugepages mediums?

Yes, if they need to use multiple huge page sizes.

pkg/apis/core/validation/validation.go (2 outdated review threads)
@@ -290,11 +290,11 @@ func (ed *emptyDir) setupHugepages(dir string) error {
 	}
 	// If the directory is a mountpoint with medium hugepages, there is no
 	// work to do since we are already in the desired state.
-	if isMnt && medium == v1.StorageMediumHugePages {
+	if isMnt && v1helper.IsHugePageMedium(medium) {

liggitt (Member) commented Feb 13, 2020:

Previously, there could only be a single hugepage size per pod, so all hugepage mounts had to agree. With this PR, that is no longer the case, correct?

@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch 2 times, most recently from 6df5aa5 to 54f60a1 Feb 13, 2020
bart0sh (Contributor, Author) commented Feb 14, 2020

/retest

@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch from 54f60a1 to c38dbb6 Feb 19, 2020
@k8s-ci-robot k8s-ci-robot added size/XXL and removed size/XL labels Feb 19, 2020
liggitt (Member) commented Feb 19, 2020

API validation changes lgtm, will defer the rest of the review to node/storage folks. Please assign me once those have lgtm and I'll add approval for the API bits.

/unassign

@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch from c38dbb6 to 4430f88 Feb 19, 2020
bart0sh and others added 2 commits Oct 17, 2019
This implementation allows a Pod to request multiple hugepage resources
of different sizes and to mount hugepage volumes using the storage medium
HugePages-<size>, e.g.

spec:
  containers:
    resources:
      requests:
        hugepages-2Mi: 2Mi
        hugepages-1Gi: 2Gi
    volumeMounts:
      - mountPath: /hugepages-2Mi
        name: hugepage-2mi
      - mountPath: /hugepages-1Gi
        name: hugepage-1gi
    ...
  volumes:
    - name: hugepage-2mi
      emptyDir:
        medium: HugePages-2Mi
    - name: hugepage-1gi
      emptyDir:
        medium: HugePages-1Gi

NOTE: This is an alpha feature.
      Feature gate HugePageStorageMediumSize must be enabled for it to work.
Co-Authored-By: Odin Ugedal <odin@ugedal.com>
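
For illustration only, a hedged sketch of how a sized medium string can be mapped to a page-size quantity; the helper name pageSizeFromMedium is made up here and is not the helper added by this PR:

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

// pageSizeFromMedium is a hypothetical helper: it extracts the page size from
// a sized medium such as "HugePages-2Mi" or "HugePages-1Gi". The suffix after
// the dash is an ordinary resource quantity.
func pageSizeFromMedium(medium string) (resource.Quantity, error) {
	suffix := strings.TrimPrefix(medium, "HugePages-")
	if suffix == medium {
		// Covers the legacy "HugePages" medium and any non-hugepages medium.
		return resource.Quantity{}, fmt.Errorf("medium %q has no explicit page size", medium)
	}
	return resource.ParseQuantity(suffix)
}

func main() {
	q, _ := pageSizeFromMedium("HugePages-2Mi")
	fmt.Println(q.Value()) // 2097152
}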
@bart0sh bart0sh force-pushed the bart0sh:PR0079-multiple-sizes-hugepages branch 3 times, most recently from 23d1c36 to 03ecc20 Feb 20, 2020
Extended the GetMountMedium function to check whether a hugetlbfs volume
is mounted with a page size equal to the medium size.

The page size is obtained from the 'pagesize' mount option of the
mounted hugetlbfs volume.
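
A hedged sketch of what parsing the 'pagesize' mount option could look like; this is illustrative and not the PR's actual GetMountMedium change:

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

// pageSizeFromMountOptions extracts the hugetlbfs page size from a mount
// option string such as "rw,relatime,pagesize=2M". The parsing is simplified
// and only handles the "<n>M" form shown in the mount output earlier.
func pageSizeFromMountOptions(opts string) (resource.Quantity, error) {
	for _, opt := range strings.Split(opts, ",") {
		if !strings.HasPrefix(opt, "pagesize=") {
			continue
		}
		size := strings.TrimPrefix(opt, "pagesize=")
		if strings.HasSuffix(size, "M") {
			// hugetlbfs reports "2M" or "1024M"; resource quantities use binary
			// suffixes, so map M to Mi before parsing.
			size = strings.TrimSuffix(size, "M") + "Mi"
		}
		return resource.ParseQuantity(size)
	}
	return resource.Quantity{}, fmt.Errorf("no pagesize option in %q", opts)
}

func main() {
	q, err := pageSizeFromMountOptions("rw,relatime,pagesize=1024M")
	fmt.Println(q.Value(), err) // 1073741824 <nil>, i.e. a 1Gi page size
}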
bart0sh (Contributor, Author) commented Feb 20, 2020

/retest

derekwaynecarr (Member) commented Feb 25, 2020

the kubelet changes still lgtm.

/approve
/lgtm

assign @liggitt

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 25, 2020
liggitt (Member) commented Feb 25, 2020

/approve

@liggitt liggitt added the approved label Feb 25, 2020
k8s-ci-robot (Contributor) commented Feb 25, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, derekwaynecarr, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 851efa8 into kubernetes:master Feb 25, 2020
19 of 20 checks passed

tide: Not mergeable. Retesting: pull-kubernetes-kubemark-e2e-gce-big
cla/linuxfoundation: bart0sh authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-alpha-features: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-storage-slow: Job succeeded.
pull-kubernetes-e2e-gce-storage-snapshot: Job succeeded.
pull-kubernetes-e2e-kind: Job succeeded.
pull-kubernetes-e2e-kind-ipv6: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 25, 2020
-	if len(hugePageResources) > 1 {
-		allErrs = append(allErrs, field.Invalid(specPath, hugePageResources, "must use a single hugepage size in a pod spec"))
+	if !opts.AllowMultipleHugePageResources {
+		allErrs = append(allErrs, ValidatePodSingleHugePageResources(pod, specPath)...)

jingxu97 (Contributor) commented Feb 28, 2020:

If HugePageStorageMediumSize is enabled, should we validate the name and format etc. too? What if a user gives an arbitrary string in this field?

liggitt (Member) commented Feb 28, 2020:

If I understand correctly, the user could already give an arbitrary string in that field. If that's the case, we cannot easily tighten validation on an existing field.

jingxu97 (Contributor) commented Feb 28, 2020:

Oh, this ValidatePodSingleHugePageResources is for validating whether multiple huge page sizes are specified in the resources. So it makes sense to only run that validation when the feature is disabled. (A sketch of such a check follows this comment.)

So one scenario is rollback: if multiple huge page resources are enabled during pod creation, the pod sets multiple sizes. If during pod rollback multiple huge page resources are disabled, might it fail to update the pod?

I also checked a few cases; the following can pass validation, which seems not right:

resources:
  limits:
    hugepages-xGi: 100Mi

volumes:
- name: hugepage
  emptyDir:
    medium: abc
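
ValidatePodSingleHugePageResources itself is not quoted in this thread; the following is a minimal, assumed sketch of what a single-hugepage-size check does (reject pod specs that name more than one hugepage size), for illustration only:

package main

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// singleHugePageSize reports whether the pod uses at most one hugepage size
// across all container requests and limits.
func singleHugePageSize(pod *v1.Pod) bool {
	sizes := map[v1.ResourceName]bool{}
	collect := func(rl v1.ResourceList) {
		for name := range rl {
			if strings.HasPrefix(string(name), "hugepages-") {
				sizes[name] = true
			}
		}
	}
	for _, c := range pod.Spec.Containers {
		collect(c.Resources.Requests)
		collect(c.Resources.Limits)
	}
	return len(sizes) <= 1
}

func main() {
	pod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{
		Resources: v1.ResourceRequirements{Limits: v1.ResourceList{
			"hugepages-2Mi": resource.MustParse("2Mi"),
			"hugepages-1Gi": resource.MustParse("2Gi"),
		}},
	}}}}
	fmt.Println(singleHugePageSize(pod)) // false: two sizes, rejected unless multiple sizes are allowed
}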

liggitt (Member) commented Feb 29, 2020:

If during pod rollback multiple huge page resources are disabled, might it fail to update the pod?

that is addressed here:

func (podStrategy) ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList {
	oldFailsSingleHugepagesValidation := len(validation.ValidatePodSingleHugePageResources(old.(*api.Pod), field.NewPath("spec"))) > 0
	opts := validation.PodValidationOptions{
		// Allow multiple huge pages on pod create if feature is enabled or if the old pod already has multiple hugepages specified
		AllowMultipleHugePageResources: oldFailsSingleHugepagesValidation || utilfeature.DefaultFeatureGate.Enabled(features.HugePageStorageMediumSize),
	}

the following can pass validation which seems not right

see discussion in #52936 (comment)

cynepco3hahue added a commit to cynepco3hahue/api that referenced this pull request Mar 23, 2020
As part of the telco effort we should provide the possibility
to use multiple sizes of huge pages on the node.

Kubernetes supports this feature as alpha in 1.18; to enable it
you should enable the feature gate `HugePageStorageMediumSize`.
See kubernetes/kubernetes#84051.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>