
SupportPodPidsLimit feature beta with tests #72076

Merged
merged 1 commit into kubernetes:master on Jan 10, 2019

Conversation

@derekwaynecarr (Member) commented Dec 15, 2018

/kind feature

What this PR does / why we need it:
Graduate the SupportPodPidsLimit feature to beta. This is needed to prevent a pod from starving the node of PIDs.

Special notes for your reviewer:
need to bump runc to fix this for systemd
opencontainers/runc#1917

Does this PR introduce a user-facing change?:

Administrators can configure the maximum number of PIDs for a pod on a node.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2018
@derekwaynecarr (Member Author)

/hold
need to bump runc prior to merge.

fyi @dims @dchen1107 @sjenning @mrunalp @rhatdan @smarterclayton

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2018
}

// this command takes the expected value and compares it against the actual value for the pod cgroup pids.max
command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)
A reviewer (Member) commented:

So we are just checking that pids.max is applied, and not actually trying something that fork-bombs, right?

@derekwaynecarr (Member Author) replied:

Correct. No need to test cgroups themselves, just that we set the right desired state. This is basically what we do for all other resources.
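
For reference, a minimal sketch of how a check like the one quoted above can be wrapped in a pod that the e2e test schedules; the helper name, image, and volume wiring are assumptions modeled on similar e2e_node helpers, not the exact code in this PR:

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makePodToVerifyPidsLimit (hypothetical name) returns a pod that exits
// non-zero when the pod cgroup's pids.max does not match the expected limit.
func makePodToVerifyPidsLimit(cgroupFsName string, pidsLimit resource.Quantity) *v1.Pod {
	command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "verify-pids-limit"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{{
				Name:    "checker",
				Image:   "busybox",
				Command: []string{"sh", "-c", command},
				// mount the host pids cgroup so the container can read pids.max
				VolumeMounts: []v1.VolumeMount{{Name: "sysfscgroup", MountPath: "/tmp/pids"}},
			}},
			Volumes: []v1.Volume{{
				Name: "sysfscgroup",
				VolumeSource: v1.VolumeSource{
					HostPath: &v1.HostPathVolumeSource{Path: "/sys/fs/cgroup/pids"},
				},
			}},
		},
	}
}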

@dims (Member) commented Dec 17, 2018

thanks @derekwaynecarr - looks like the verify job is busted. LGTM otherwise!

@derekwaynecarr derekwaynecarr added this to the v1.14 milestone Dec 17, 2018
@derekwaynecarr derekwaynecarr added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 17, 2018
@yujuhong (Contributor)

Was there a follow-up discussion about how to set the limits (see the original thread in #57973)?

@dims (Member) commented Dec 17, 2018

@yujuhong not that I recall.

@derekwaynecarr (Member Author)

@yujuhong - I have a hold so we can discuss in sig-node. I am fine adding a knob to protect across all pods in addition to per pod, as it simplifies configuring a PID reserve. I felt that big and small pod PID limits could be added in the future, as that is not inconsistent with a default pod PID limit enforced locally on the node.

@derekwaynecarr (Member Author)

Mapping to systemd:

  • the knob in this PR corresponds to DefaultTasksMax
  • future iterations could add a per-pod TasksMax value, restricted and/or defaulted via LimitRange or PSP; those iterations could sit behind different feature flags

@derekwaynecarr (Member Author) commented Dec 18, 2018

To capture discussion from the sig-node meeting:

  1. We want to provide a seat belt for users to protect the node.
  2. We will default this flag to rlimit.maxPid minus the configured eviction threshold for pid.available (see the sketch after this list). This sets a hard cap to prevent any pod from causing PID exhaustion. We could set that cap at the cgroup level for all pods, and for each individual pod.
  3. In the future, we want to support granular per-pod options for PID limits, but we want to do that via a policy option. This would be tracked under a different feature gate, e.g. GranularPidLimitsPerPod.
  4. The defaulting applied by this flag would only be used if a pod had no associated PID-limiting policy applied at a future date.
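
A hedged sketch of the defaulting rule from point 2; the function and parameter names are illustrative assumptions, not kubelet code:

import (
	"io/ioutil"
	"strconv"
	"strings"
)

// defaultPodPidsLimit derives the default per-pod PID cap: the kernel-wide
// PID ceiling minus the eviction threshold reserved for pid.available, so a
// single pod cannot exhaust PIDs on the node.
func defaultPodPidsLimit(evictionThresholdPids int64) (int64, error) {
	// /proc/sys/kernel/pid_max holds the node's PID capacity (rlimit.maxPid above).
	raw, err := ioutil.ReadFile("/proc/sys/kernel/pid_max")
	if err != nil {
		return 0, err
	}
	pidMax, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, err
	}
	return pidMax - evictionThresholdPids, nil
}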

@yujuhong (Contributor)

> 2. We will default this flag to rlimit.maxPid minus the configured eviction threshold for pid.available. This sets a hard cap to prevent any pod from causing PID exhaustion. We could set that cap at the cgroup level for all pods, and for each individual pod.

My slight concern before was that the per-pod limit was going to be hard to pick/set, and not so useful after all. Alternatively, a node-wide limit for all pods (similar to allocatable) would provide a safety net for the node and would be easier to roll out.

Setting a sensible default limit like this addresses my concern, so this looks good to me.

@dashpole (Contributor) commented Dec 18, 2018

@derekwaynecarr a couple more thoughts to consider:

We currently have just basic node-level protection through node-level eviction. The next step IMO is isolation between node daemons (kubelet, runtime, etc.) and user pods using Node Allocatable. This would confine user-caused PID exhaustion to user pods, preventing the node from being made unusable if eviction doesn't catch pressure in time. Achieving pod-to-pod PID isolation through PidLimitsPerPod would be the final step to prevent PID exhaustion from affecting any "innocent" workloads. It feels like we are trying to achieve protection for node daemons by using a pod-to-pod isolation mechanism, resulting in the "we need defaults for protection, but don't know how to set them well" issue.

Node Allocatable is a well-established pattern for isolating node daemons from user workloads using a combination of cgroups and eviction. It would be easy to implement, and the design should not be contentious. I think we should seriously consider adding it under this same feature gate.

We should definitely still move forward with the PidLimitsPerPod mechanism, but I think adding Node Allocatable will alleviate the concerns we have over not having a good default for it, and allow operators to use it as a knob to isolate pods from each other rather than as a pseudo-node-allocatable hard limit to protect daemons.

Edit: Realized this feature gate is just for pids per pod, so allocatable shouldn't be tied to it. We should still consider adding it though!

@sjenning (Contributor)

#72114 merged, bumping runc.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 19, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr (Member Author)

I updated the help text for pod-max-pids to state the following:

If -1, the kubelet defaults to the node allocatable pid capacity.

@dashpole - I agree we should have eviction threshold and node allocatable enforcement for pids in a follow-on PR. This affords us the opportunity for a future PR to restrict pid limiting at the node allocatable level while maintaining backward compatibility. FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further. PTAL at this PR.

@dashpole (Contributor) left a review:

lgtm overall

}

// enablePodPidsLimitInKubelet enables pod pid limit feature for kubelet with a sensible default test limit
func enablePodPidsLimitInKubelet(f *framework.Framework) *kubeletconfig.KubeletConfiguration {

You could use tempSetCurrentKubeletConfig here to eliminate the need for this function. It also handles reverting the config to the default for subsequent tests.
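
For reference, a sketch of that suggestion; tempSetCurrentKubeletConfig is the existing e2e_node helper the reviewer names, and the feature-gate key and limit value here are illustrative:

// enablePodPidsLimit flips the gate and sets a test limit through
// tempSetCurrentKubeletConfig, which also restores the original kubelet
// config for subsequent tests.
func enablePodPidsLimit(f *framework.Framework) {
	tempSetCurrentKubeletConfig(f, func(initialConfig *kubeletconfig.KubeletConfiguration) {
		if initialConfig.FeatureGates == nil {
			initialConfig.FeatureGates = map[string]bool{}
		}
		initialConfig.FeatureGates["SupportPodPidsLimit"] = true
		initialConfig.PodPidsLimit = 1024 // illustrative test value
	})
}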

@@ -560,7 +560,7 @@ func AddKubeletConfigFlags(mainfs *pflag.FlagSet, c *kubeletconfig.KubeletConfig
 	fs.Int32Var(&c.MaxPods, "max-pods", c.MaxPods, "Number of Pods that can run on this Kubelet.")
 	fs.StringVar(&c.PodCIDR, "pod-cidr", c.PodCIDR, "The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master. For IPv6, the maximum number of IP's allocated is 65536")
-	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "<Warning: Alpha feature> Set the maximum number of processes per pod.")
+	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "Set the maximum number of processes per pod. If -1, the kubelet defaults to the node allocatable pid capacity.")

Do we need to update the description for the Configuration type as well? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L222.
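
If so, the field there would presumably be documented along these lines; this is suggested wording mirroring the new flag help text, not the merged change:

type KubeletConfiguration struct {
	// ...
	// PodPidsLimit is the maximum number of PIDs in any pod.
	// If -1, the kubelet defaults to the node allocatable PID capacity.
	PodPidsLimit int64
	// ...
}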

@dashpole (Contributor) commented Jan 7, 2019

> FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further.

It wouldn't be too hard to write an actual fork-bomb test using the eviction framework. I'll try and add that in the next few weeks. I opened #72654 to track this.

@derekwaynecarr (Member Author)

@dashpole the requested updates have been made.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2019
@dashpole (Contributor) left a review:

Where does -1 translate to the node allocatable pid limit?

@@ -0,0 +1,150 @@
+/*
+Copyright 2017 The Kubernetes Authors.

2019?

@derekwaynecarr (Member Author)

@dashpole updated the copyright.

> Where does -1 translate to the node allocatable pid limit?

At the node allocatable level, we currently do not bound pids.

$ cat /sys/fs/cgroup/pids/kubepods/pids.max
max

This means that by default, pods are bound by the node allocatable PID limit, which is the /proc/sys/kernel/pid_max value.

At the pod cgroup level, we do not write to the pid cgroup unless the configured value is positive.
See: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/pod_container_manager_linux.go#L89

The moment we add support for setting a node allocatable pid limit, the pod pid limit in this PR will be bounded by that value in the cgroup hierarchy. I kept the help text in terms of node allocatable rather than host configuration since we know that follow-on PR will happen.
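
A hedged, self-contained illustration of that behavior; the types below are simplified stand-ins for the kubelet's internal cgroup config, not the real ones:

// ResourceConfig stands in for the kubelet's internal cgroup resource config.
type ResourceConfig struct {
	PidsLimit *int64
}

// applyPodPidsLimit records a pids.max value only when the configured limit
// is positive; otherwise PidsLimit stays nil and the pod cgroup keeps the
// inherited "max", i.e. it is bounded only by /proc/sys/kernel/pid_max.
func applyPodPidsLimit(rc *ResourceConfig, podPidsLimit int64) {
	if podPidsLimit > 0 {
		rc.PidsLimit = &podPidsLimit
	}
}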

@dashpole (Contributor) commented Jan 9, 2019

I see, thanks.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2019
@dashpole (Contributor) commented Jan 9, 2019

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments

@k8s-ci-robot k8s-ci-robot merged commit 0dbc997 into kubernetes:master Jan 10, 2019