Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting custom volume sizes for Windows containers with containerd #109702

Closed
wants to merge 2 commits into from

Conversation

marosset
Copy link
Contributor

@marosset marosset commented Apr 27, 2022

Signed-off-by: Mark Rossetti marosset@microsoft.com

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

On Windows containers are created with a default volume size of 20Gb.
With docker this can be specified per-container with --storage-opt "size=xxGb" or globally by specifying storage-opts in the dockerd config file.

https://docs.microsoft.com/virtualization/windowscontainers/manage-containers/container-storage#storage-limits

This PR allows for the container volume size to be specified in pod specs and will pass that size to CRI container create calls.

Which issue(s) this PR fixes:

Related to #108894 and containerd/containerd#6694

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Allow setting custom root volume size for Windows containers with containerd v1.7+ when []containers.resources.limits.ephemeral-storage is specified.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig windows
/area kubelet

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Apr 27, 2022
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/windows Categorizes an issue or PR as relevant to SIG Windows. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 27, 2022
@marosset
Copy link
Contributor Author

/hold
I will add some tests but will need the changes to pick up this new value implemented in containerd first

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 27, 2022
@marosset
Copy link
Contributor Author

/cc @dcantah

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 27, 2022
@dcantah
Copy link
Member

dcantah commented Apr 28, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2022
@pacoxu pacoxu added this to Triage in SIG Node PR Triage May 5, 2022
@marosset marosset added this to In Progress (v1.25) in SIG-Windows May 5, 2022
@jsturtevant
Copy link
Contributor

/triage accepted
/priority important-soon
/milestone v1.25

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label May 11, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone May 11, 2022
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 11, 2022
@marosset marosset changed the title Allow setting custom volume sizes for Windows containers with contianerd WIP: Allow setting custom volume sizes for Windows containers with contianerd May 11, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2022
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 6, 2022
@marosset
Copy link
Contributor Author

marosset commented Jul 6, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 6, 2022
@marosset marosset moved this from In Progress (v1.25) to In Review (v1.25) in SIG-Windows Jul 6, 2022
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jul 7, 2022

@marosset: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-verify 39e84b1 link true /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

})

ginkgo.It("validate rootfs size can be set larger than 20Gb", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want a test that validates it can be smaller? The docs say Some users may want to override this default and configure the free space to a smaller or larger value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot at the moment be smaller than 20GB, shrinking from the default is a harder problem.. I've been thinking of some ways to do this in containerd, but currently in main that's not there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This.
I can update the comments in code to say it can be set larger than the default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, is there anybody that has a use case for it to be smaller that we know?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a limitation in containerd only? https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/container-storage#storage-limits seems to suggest that it is possible in docker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not in docker either iirc, I think the call succeeds but you stay at 20

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok good to know it isn't a missing feature, may need to get those docs updated then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we'll update docs for the release (once we settle on what fields in the pod-spec control this)

@@ -188,3 +196,94 @@ func testPodWithROVolume(podName string, source v1.VolumeSource, path string) *v
},
}
}

func getNodeContainerRuntimeAndVersion(n v1.Node) (string, semver.Version, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in our utils.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, probably :)
I'll update

Command: []string{
"powershell.exe",
"-Command",
"if (-not ((Get-PsDrive -Name C).Free -gt 21474836480)) { exit 1 } else { exit 0 }",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why greater than 21gb? is there a reason not test for 30 gb? I guess the scratch won't be the exact same size everytime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked this number because it was greater than the default (20Gb) so we wouldn't get a false positives.
With this powershell command I can only get free space, not total space and because of that I wanted wiggle room.
This e2e test is primarily validating that the field on the pod-spec gets passed all the way to the CRI.
I think other layers (containerd or HCS) can test to make sure the size matches exactly since that is much easier to do from outside of the container.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, works for me, especially if the other layers are doing the validation

// override this default.
// https://docs.microsoft.com/virtualization/windowscontainers/manage-containers/container-storage#scratch-space
// https://docs.microsoft.com/virtualization/windowscontainers/manage-containers/container-storage#storage-limits
ephemeralStorageLimit := container.Resources.Limits.StorageEphemeral().Value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

If the kubelet is managing local ephemeral storage as a resource, then the kubelet measures storage use in:

emptyDir volumes, except tmpfs emptyDir volumes
directories holding node-level logs
writeable container layers

If a Pod is using more ephemeral storage than you allow it to, the kubelet sets an eviction signal that triggers Pod eviction.

It seems that kubelet tracks more than just the rootfs with the Ephemeral Storage Limit value. The RootfssizeinBytes seems to be the "writable layer" here, but would this mean pods would be evicted if they added emtpydir volume as well as resize the rootsfs of the container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows I don't think it is possible to resize the rootfs of the container.

but would this mean pods would be evicted if they added emtpydir volume as well as resize the rootsfs of the container?

I believe this would be the case that
Also if the writable layer is nearly full and the container produced a lot of logs that could possibly trigger an eviction too.

I went back and forth trying to decide between using limit or request here.
limit made more sense to me when looking at the context of only the container's ephemeral-storage but I could maybe request makes sense so users can set limits higher than requests to allow for emptydir usage?

@dcantah - do the volumes created here use 'thin' provisioning?
One concern I have if they do not is that it could be possible to over-provision storage capacity on the node if we use request for the size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msau42 @ddebroy @mauriciopoppe - Do any of you (or anyone else from SIG-storage) have any thoughts here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am struggling a little bit with overloading the ephemeralstoragelimit concept and the RootfsSizeInBytes. I am not sure they are 1 to 1 but this is the first time I am looking at it.

I was looking around and it doesn't look like Linux sets the scratch layer sizes anywhere. Does linux have a similar concept or is this windows only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @liggitt
I remember discussing this with him (can't remember why exactly) but he asked if it would make sense to use ephemeral storage resources here. I think it can make sense but I'm not sure if/what the linux equivalent is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vhd itself for the volume is thin provisioned and will expand on data written, it's nowhere near 20GiB on the hosts disk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vhd itself for the volume is thin provisioned and will expand on data written, it's nowhere near 20GiB on the hosts disk

That was what I thought - thanks for confirming!

@k8s-ci-robot
Copy link
Contributor

@marosset: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2022
@jayunit100
Copy link
Member

thanks @marosset !!! in passing, id say things like matching parity/behaviour for density related things imo we can fix later - since we know this fixes a concrete problem people have.

@marosset
Copy link
Contributor Author

@jingxu97 @ddebroy @mauriciopoppe - Can someone from SIG-storage (preferably with some Windows knowledge :P) weigh in here?

On Windows the default size for the 'writable container layer' is 20Gb and this cannot be expanded after the VM is created.
We have added a CRI field to be able to specify the size of the 'writable container layer' and updated containerd to consume this value and are looking for how best to expose this to users.
Does anyone have opinions on if we should map this value to the ephemeral storage requests of limits?
Since this value cannot be expanded after the container is created limit might make sense. The problem I see with using limit is that limits are used for determining when pods should be evicted which could be confusing to a user that wants to use a large portion of the writable container layer and and use things like emptyDir volumes which are also used when seeing if pods should get evicted.

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 12, 2022
@jingxu97
Copy link
Contributor

@jingxu97 @ddebroy @mauriciopoppe - Can someone from SIG-storage (preferably with some Windows knowledge :P) weigh in here?

On Windows the default size for the 'writable container layer' is 20Gb and this cannot be expanded after the VM is created. We have added a CRI field to be able to specify the size of the 'writable container layer' and updated containerd to consume this value and are looking for how best to expose this to users. Does anyone have opinions on if we should map this value to the ephemeral storage requests of limits? Since this value cannot be expanded after the container is created limit might make sense. The problem I see with using limit is that limits are used for determining when pods should be evicted which could be confusing to a user that wants to use a large portion of the writable container layer and and use things like emptyDir volumes which are also used when seeing if pods should get evicted.

/sig storage

I also have concerns about overloading ephemeral storage request/limit for this usage. The request/limit is for the feature of local storage capacity isolation and have specific meaning about it. It already has some complexity on its own, I don't think we should overload more meaning on it. Right now this feature is beta and plan to promote to GA. But due to some system could not correctly get disk usage information (e.g., kind rootless), they have to disable this feature.

For allow setting customizable writable container layer, do you think it is useful to set per pod/container level, or just cluster level?

@marosset
Copy link
Contributor Author

@jingxu97 @ddebroy @mauriciopoppe - Can someone from SIG-storage (preferably with some Windows knowledge :P) weigh in here?
On Windows the default size for the 'writable container layer' is 20Gb and this cannot be expanded after the VM is created. We have added a CRI field to be able to specify the size of the 'writable container layer' and updated containerd to consume this value and are looking for how best to expose this to users. Does anyone have opinions on if we should map this value to the ephemeral storage requests of limits? Since this value cannot be expanded after the container is created limit might make sense. The problem I see with using limit is that limits are used for determining when pods should be evicted which could be confusing to a user that wants to use a large portion of the writable container layer and and use things like emptyDir volumes which are also used when seeing if pods should get evicted.
/sig storage

I also have concerns about overloading ephemeral storage request/limit for this usage. The request/limit is for the feature of local storage capacity isolation and have specific meaning about it. It already has some complexity on its own, I don't think we should overload more meaning on it. Right now this feature is beta and plan to promote to GA. But due to some system could not correctly get disk usage information (e.g., kind rootless), they have to disable this feature.

For allow setting customizable writable container layer, do you think it is useful to set per pod/container level, or just cluster level?

The best solution would allow for this to be set per-container but I think being able to set this per-pod would be acceptable.
I don't think that setting this per-cluster or per-node would be a good solution because the users who would want to set this might not have access to be able to do so.

@ddebroy
Copy link
Member

ddebroy commented Jul 15, 2022

Here are a couple of scenarios that may lead to unexpected outcomes as a result of wc.Resources.RootfsSizeInBytes := container.Resources.Limits.StorageEphemeral().Value() in this PR:

Unexpected pod eviction due to verbose logs:

  • User specifies a certain (tiny) ephemeral limit for a container to indicate a small rootfssize (assuming the change proposed will be documented).
  • The container generates lots of logs and the size of the logs exceeds rootfssize.
  • The above results in the eviction of the pod by the kubelet leaving the user confused.

Backward compatibility breaks:

  • User specified a really small ephemeral limit for several container based on existing docs/expectations
  • Upon upgrading to a version with the changes in the PR, the rootfssize is set based on the small limit.
  • Users' existing containers now fail to launch due to lack of space.

It feels like a distinct parameter (+ validations that it is < container ephemeral storage limit) to specify the rootfssize would be ideal. Today, emptydir limits can be specified as a distinct field (aside from overall pod ephemeral storage limit) as an example:

SizeLimit *resource.Quantity `json:"sizeLimit,omitempty" protobuf:"bytes,2,opt,name=sizeLimit"`

@hosseinsalahi
Copy link

Hi @marosset
Bug triage team here!
It looks like this PR has the pending review.
Just checking in to see if this is still on track for K8s 1.25.

@marosset
Copy link
Contributor Author

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.25 milestone Jul 27, 2022
@marosset
Copy link
Contributor Author

Hi @marosset Bug triage team here! It looks like this PR has the pending review. Just checking in to see if this is still on track for K8s 1.25.

We are still having some on-going discussions on the best way to expose this field to users.
I don't think we'll reach agreement before 1.25 code freeze so I've removed this PR from the milestone

@endocrimes endocrimes moved this from Triage to Archive-it in SIG Node CI/Test Board Aug 3, 2022
@marosset marosset removed this from In Review (v1.25) in SIG-Windows Sep 21, 2022
@marosset
Copy link
Contributor Author

/close
This needs more design/consideration to properly support.
Until then - the guidance SIG-Windows has for this is to use emptydir volumes.

SIG Node PR Triage automation moved this from Waiting on Author to Done Sep 22, 2022
@k8s-ci-robot
Copy link
Contributor

@marosset: Closed this PR.

In response to this:

/close
This needs more design/consideration to properly support.
Until then - the guidance SIG-Windows has for this is to use emptydir volumes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

10 participants