SupportPodPidsLimit feature beta with tests #72076
Conversation
/hold fyi @dims @dchen1107 @sjenning @mrunalp @rhatdan @smarterclayton
// this command takes the expected value and compares it against the actual value for the pod cgroup pids.max
command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)
So we are just checking that pids.max is applied, and not actually trying something that fork-bombs, right?
Correct. No need to test cgroups, just that we set the right desired state. This is what we do for all other resources basically.
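For illustration, a minimal sketch of how the check command above could be exercised from a test pod, assuming a busybox image and a hostPath mount of the host's pids cgroup at /tmp/pids (names here are illustrative, not the exact test code):

```go
// Assumed imports: v1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
// The busybox container re-runs the shell check built above ("command") against the
// pod's own pids cgroup, which is made visible inside the pod via a hostPath volume.
pod := &v1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "pids-limit-checker"},
	Spec: v1.PodSpec{
		RestartPolicy: v1.RestartPolicyNever,
		Containers: []v1.Container{{
			Name:         "checker",
			Image:        "busybox",
			Command:      []string{"sh", "-c", command},
			VolumeMounts: []v1.VolumeMount{{Name: "pids-cgroup", MountPath: "/tmp/pids"}},
		}},
		Volumes: []v1.Volume{{
			Name: "pids-cgroup",
			VolumeSource: v1.VolumeSource{
				HostPath: &v1.HostPathVolumeSource{Path: "/sys/fs/cgroup/pids"},
			},
		}},
	},
}
```

If the pod exits non-zero, the configured limit was not written to the pod cgroup's pids.max, which is exactly the "desired state" assertion described above.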
thanks @derekwaynecarr - looks like the verify job is busted. LGTM otherwise!
Force-pushed from b2728bc to 819ed9c
Was there a follow-up discussion about how to set the limits (see the original thread in #57973)?
@yujuhong not that I recall.
@yujuhong - I have a hold so we can discuss in sig-node. I am fine with adding a knob to protect across all pods in addition to per pod, as it simplifies configuring a pid reserve. I felt that big and small pod pid limits could be added in the future, as that is not inconsistent with a default pod pid limit enforced locally on the node.
mapping to systemd

To capture discussion from sig-node meeting:
My slight concern before was that the per-pod limit was going to be hard to pick/set, and not so useful after all. Alternatively, a node-wide limit for all pods (similar to allocatable) would provide a safety net for the node and would be easier to roll out. Setting a sensible default limit like this addressed my concern, so this looks good to me.
@derekwaynecarr a couple more thoughts to consider: We currently have just basic node-level protection through node-level eviction. The next step, IMO, is isolation between node daemons (kubelet, runtime, etc.) and user pods using Node Allocatable. This would confine user-caused PID exhaustion to user pods, preventing the node from being made unusable if eviction doesn't catch pressure in time. Achieving pod-to-pod PID isolation through …

As Node Allocatable is a well-established pattern for isolating node daemons and user workloads using a combination of cgroups and eviction, it would be easy to implement, and the design should not be contentious. I think we should seriously consider adding it under this same feature gate. We should definitely still move forward with the per-pod pids limit in this PR.

Edit: Realized this feature gate is just for pids per pod, so allocatable shouldn't be tied to it. We should still consider adding it though!
#72114 merged, bumping runc.
Force-pushed from 819ed9c to 69c207b
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 69c207b to 6f8a8c0
I updated the help text for `--pod-max-pids`.
@dashpole - I agree we should have an eviction threshold and node allocatable enforcement for pids in a follow-on PR. This affords us the opportunity in a future PR to restrict pid limiting at the node allocatable level while maintaining backward compatibility. FWIW, I am not convinced right now that the PidPressure condition is actually working; we should investigate that further. PTAL at this PR.
lgtm overall
// enablePodPidsLimitInKubelet enables pod pid limit feature for kubelet with a sensible default test limit
func enablePodPidsLimitInKubelet(f *framework.Framework) *kubeletconfig.KubeletConfiguration {
You could use tempSetCurrentKubeletConfig here to eliminate the need for this function. It also handles reverting the config to the default for subsequent tests.
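For illustration, a sketch of that suggestion; the helper signature and the features.SupportPodPidsLimit gate name are assumptions based on the e2e_node framework, not copied from this PR:

```go
// Mutate the kubelet config only for the duration of this test context; the
// helper restores the original configuration for subsequent tests.
tempSetCurrentKubeletConfig(f, func(initialConfig *kubeletconfig.KubeletConfiguration) {
	if initialConfig.FeatureGates == nil {
		initialConfig.FeatureGates = map[string]bool{}
	}
	initialConfig.FeatureGates[string(features.SupportPodPidsLimit)] = true
	initialConfig.PodPidsLimit = 1024 // sensible default test limit
})
```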
@@ -560,7 +560,7 @@ func AddKubeletConfigFlags(mainfs *pflag.FlagSet, c *kubeletconfig.KubeletConfig
  fs.Int32Var(&c.MaxPods, "max-pods", c.MaxPods, "Number of Pods that can run on this Kubelet.")
  fs.StringVar(&c.PodCIDR, "pod-cidr", c.PodCIDR, "The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master. For IPv6, the maximum number of IP's allocated is 65536")
- fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "<Warning: Alpha feature> Set the maximum number of processes per pod.")
+ fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "Set the maximum number of processes per pod. If -1, the kubelet defaults to the node allocatable pid capacity.")
Do we need to update the description for the Configuration type as well? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L222.
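For reference, the change in question would look something like this on the internal config type; the comment wording here is illustrative, not taken from the repository:

```go
type KubeletConfiguration struct {
	// ...
	// PodPidsLimit is the maximum number of processes per pod.
	// If -1, the kubelet defaults to the node allocatable pid capacity.
	PodPidsLimit int64
	// ...
}
```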
It wouldn't be too hard to write an actual fork-bomb test using the eviction framework. I'll try and add that in the next few weeks. I opened #72654 to track this.
Force-pushed from 6f8a8c0 to aed0256
@dashpole the requested updates have been made. /hold cancel
Where does -1 translate to the node allocatable pid limit?
test/e2e_node/pids_test.go (Outdated)
@@ -0,0 +1,150 @@
/*
Copyright 2017 The Kubernetes Authors.
2019?
Force-pushed from aed0256 to bce9d5f
@dashpole updated the copyright.
At the node allocatable level, we currently do not bound pids:

cat /sys/fs/cgroup/pids/kubepods/pids.max
max

This means that by default, pods are bound to the node allocatable pid limit, which is `max`. At the pod cgroup level, we do not write to the pid cgroup unless the configured value is positive. The moment we add support for setting a node allocatable pid limit, the pod pid limit in this PR will be bounded by that value in the cgroup hierarchy. I kept the help text documentation in terms of node allocatable rather than host configuration, since we know that will happen in a follow-on PR.
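A minimal sketch of that conditional behavior (hypothetical helper, not the actual kubelet code):

```go
// podPidsLimit returns the value to write to the pod cgroup's pids.max, or nil
// when nothing should be written. Only a positive configured limit is applied;
// otherwise pids.max stays at the kernel default ("max"), so the pod inherits
// whatever bound the node allocatable hierarchy imposes.
func podPidsLimit(configured int64) *int64 {
	if configured > 0 {
		return &configured // written to the pod cgroup's pids.max
	}
	return nil // leave pids.max at "max"
}
```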
I see, thanks. /lgtm
/retest
/retest

Review the full test history for this PR. Silence the bot with an
3 similar comments
@derekwaynecarr this has caused the Validate Node Allocatable test to fail: https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Validate%20Node%20Allocatable

We need to check if
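One plausible shape for such a guard in the node allocatable validation test (hypothetical; the condition and names are assumptions, not code from this PR) is to assert on the pids cgroup only when the feature gate is on and a limit is actually configured:

```go
// Skip the pids.max assertion unless the SupportPodPidsLimit gate is enabled
// and the kubelet is configured with a positive pod pids limit.
if utilfeature.DefaultFeatureGate.Enabled(features.SupportPodPidsLimit) && kubeletCfg.PodPidsLimit > 0 {
	// ... validate the pids cgroup values here ...
}
```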
/kind feature

What this PR does / why we need it:
Graduate the SupportPodPidsLimit feature to beta. This is needed to prevent a pod from starving the pid resource.

Special notes for your reviewer:
need to bump runc to fix this for systemd: opencontainers/runc#1917

Does this PR introduce a user-facing change?: