
SupportPodPidsLimit feature beta with tests #72076

Merged
merged 1 commit into kubernetes:master on Jan 10, 2019

Conversation

@derekwaynecarr (Member) commented Dec 15, 2018

/kind feature

What this PR does / why we need it:
Graduate the SupportPodPidsLimit feature to beta. This is needed to prevent a pod from starving the node of PIDs.

Special notes for your reviewer:
need to bump runc to fix this for systemd
opencontainers/runc#1917

Does this PR introduce a user-facing change?:

Administrators can configure the maximum number of PIDs for a pod on a node.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2018
@derekwaynecarr (Member Author)

/hold
need to bump runc prior to merge.

fyi @dims @dchen1107 @sjenning @mrunalp @rhatdan @smarterclayton

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2018
}

// this command takes the expected value and compares it against the actual value for the pod cgroup pids.max
command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)
A reviewer (Member) commented:

So we are just checking that pids.max is applied, and not actually trying something that fork-bombs, right?

@derekwaynecarr (Member Author) replied:

Correct. No need to test cgroups themselves, just that we set the right desired state. This is basically what we do for all other resources.
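
For reference, a minimal sketch of how a check like the one quoted above can be wrapped in a pod that the e2e test schedules; the helper name, image, and volume wiring are assumptions modeled on similar e2e_node helpers, not the exact code in this PR:

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makePodToVerifyPidsLimit (hypothetical name) returns a pod that exits
// non-zero when the pod cgroup's pids.max does not match the expected limit.
func makePodToVerifyPidsLimit(cgroupFsName string, pidsLimit resource.Quantity) *v1.Pod {
	command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "verify-pids-limit"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{{
				Name:    "checker",
				Image:   "busybox",
				Command: []string{"sh", "-c", command},
				// mount the host pids cgroup so the container can read pids.max
				VolumeMounts: []v1.VolumeMount{{Name: "sysfscgroup", MountPath: "/tmp/pids"}},
			}},
			Volumes: []v1.Volume{{
				Name: "sysfscgroup",
				VolumeSource: v1.VolumeSource{
					HostPath: &v1.HostPathVolumeSource{Path: "/sys/fs/cgroup/pids"},
				},
			}},
		},
	}
}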

@dims (Member) commented Dec 17, 2018

thanks @derekwaynecarr - looks like the verify job is busted. LGTM otherwise!

@derekwaynecarr derekwaynecarr added this to the v1.14 milestone Dec 17, 2018
@derekwaynecarr derekwaynecarr added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 17, 2018
@yujuhong (Contributor)

Was there a follow-up discussion about how to set the limits (see the original thread in #57973)?

@dims (Member) commented Dec 17, 2018

@yujuhong not that I recall.

@derekwaynecarr (Member Author)

@yujuhong - I have a hold so we can discuss in sig-node. I am fine adding a knob to protect across all pods in addition to per pod, as it simplifies configuring a PID reserve. I felt that big and small pod PID limits could be added in the future, as that is not inconsistent with a default pod PID limit enforced locally on the node.

@derekwaynecarr (Member Author)

Mapping to systemd:

  • the knob in this PR corresponds to DefaultTasksMax
  • future iterations could add a per-pod TasksMax value, restricted and/or defaulted via LimitRange or PSP; those iterations could sit behind different feature flags

@derekwaynecarr (Member Author) commented Dec 18, 2018

To capture discussion from the sig-node meeting:

  1. We want to provide a seat belt for users to protect the node.
  2. We will default this flag to rlimit.maxPid minus the configured eviction threshold for pid.available (see the sketch after this list). This sets a hard cap to prevent any pod from causing PID exhaustion. We could set that cap at the cgroup level for all pods, and for each individual pod.
  3. In the future, we want to support granular per-pod options for PID limits, but we want to do that via a policy option. This would be tracked under a different feature gate, e.g. GranularPidLimitsPerPod.
  4. The defaulting applied by this flag would only be used if a pod had no associated PID-limiting policy applied at a future date.
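
A hedged sketch of the defaulting rule from point 2; the function and parameter names are illustrative assumptions, not kubelet code:

import (
	"io/ioutil"
	"strconv"
	"strings"
)

// defaultPodPidsLimit derives the default per-pod PID cap: the kernel-wide
// PID ceiling minus the eviction threshold reserved for pid.available, so a
// single pod cannot exhaust PIDs on the node.
func defaultPodPidsLimit(evictionThresholdPids int64) (int64, error) {
	// /proc/sys/kernel/pid_max holds the node's PID capacity (rlimit.maxPid above).
	raw, err := ioutil.ReadFile("/proc/sys/kernel/pid_max")
	if err != nil {
		return 0, err
	}
	pidMax, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, err
	}
	return pidMax - evictionThresholdPids, nil
}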

@yujuhong (Contributor)

> 2. We will default this flag to rlimit.maxPid minus the configured eviction threshold for pid.available. This sets a hard cap to prevent any pod from causing PID exhaustion. We could set that cap at the cgroup level for all pods, and for each individual pod.

My slight concern before was that the per-pod limit was going to be hard to pick/set, and not so useful after all. Alternatively, a node-wide limit for all pods (similar to allocatable) would provide a safety net for the node and would be easier to roll out.

Setting a sensible default limit like this addresses my concern, so this looks good to me.

@dashpole (Contributor) commented Dec 18, 2018

@derekwaynecarr a couple more thoughts to consider:

We currently have just basic node-level protection through node-level eviction. The next step IMO is isolation between node daemons (kubelet, runtime, etc.) and user pods using Node Allocatable. This would confine user-caused PID exhaustion to user pods, preventing the node from being made unusable if eviction doesn't catch pressure in time. Achieving pod-to-pod PID isolation through PidLimitsPerPod would be the final step to prevent PID exhaustion from affecting any "innocent" workloads. It feels like we are trying to achieve protection for node daemons by using a pod-to-pod isolation mechanism, resulting in the "we need defaults for protection, but don't know how to set them well" issue.

Node Allocatable is a well-established pattern for isolating node daemons from user workloads using a combination of cgroups and eviction. It would be easy to implement, and the design should not be contentious. I think we should seriously consider adding it under this same feature gate.

We should definitely still move forward with the PidLimitsPerPod mechanism, but I think adding Node Allocatable will alleviate the concerns we have over not having a good default for it, and allow operators to use it as a knob to isolate pods from each other rather than as a pseudo-node-allocatable hard limit to protect daemons.

Edit: Realized this feature gate is just for pids per pod, so allocatable shouldn't be tied to it. We should still consider adding it though!

@sjenning (Contributor)

#72114 merged, bumping runc.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 19, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr (Member Author)

I updated the help text for pod-max-pids to state the following:

If -1, the kubelet defaults to the node allocatable pid capacity.

@dashpole - I agree we should have eviction threshold and node allocatable enforcement for pids in a follow-on PR. This affords us the opportunity for a future PR to restrict pid limiting at the node allocatable level while maintaining backward compatibility. FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further. PTAL at this PR.

@dashpole (Contributor) left a review:

lgtm overall

}

// enablePodPidsLimitInKubelet enables pod pid limit feature for kubelet with a sensible default test limit
func enablePodPidsLimitInKubelet(f *framework.Framework) *kubeletconfig.KubeletConfiguration {

You could use tempSetCurrentKubeletConfig here to eliminate the need for this function. It also handles reverting the config to the default for subsequent tests.
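
For reference, a sketch of that suggestion; tempSetCurrentKubeletConfig is the existing e2e_node helper the reviewer names, and the feature-gate key and limit value here are illustrative:

// enablePodPidsLimit flips the gate and sets a test limit through
// tempSetCurrentKubeletConfig, which also restores the original kubelet
// config for subsequent tests.
func enablePodPidsLimit(f *framework.Framework) {
	tempSetCurrentKubeletConfig(f, func(initialConfig *kubeletconfig.KubeletConfiguration) {
		if initialConfig.FeatureGates == nil {
			initialConfig.FeatureGates = map[string]bool{}
		}
		initialConfig.FeatureGates["SupportPodPidsLimit"] = true
		initialConfig.PodPidsLimit = 1024 // illustrative test value
	})
}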

@@ -560,7 +560,7 @@ func AddKubeletConfigFlags(mainfs *pflag.FlagSet, c *kubeletconfig.KubeletConfig
 	fs.Int32Var(&c.MaxPods, "max-pods", c.MaxPods, "Number of Pods that can run on this Kubelet.")
 	fs.StringVar(&c.PodCIDR, "pod-cidr", c.PodCIDR, "The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master. For IPv6, the maximum number of IP's allocated is 65536")
-	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "<Warning: Alpha feature> Set the maximum number of processes per pod.")
+	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "Set the maximum number of processes per pod. If -1, the kubelet defaults to the node allocatable pid capacity.")

Do we need to update the description for the Configuration type as well? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L222.
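
If so, the field there would presumably be documented along these lines; this is suggested wording mirroring the new flag help text, not the merged change:

type KubeletConfiguration struct {
	// ...
	// PodPidsLimit is the maximum number of PIDs in any pod.
	// If -1, the kubelet defaults to the node allocatable PID capacity.
	PodPidsLimit int64
	// ...
}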

@dashpole (Contributor) commented Jan 7, 2019

> FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further.

It wouldn't be too hard to write an actual fork-bomb test using the eviction framework. I'll try and add that in the next few weeks. I opened #72654 to track this.

@derekwaynecarr (Member Author)

@dashpole the requested updates have been made.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2019
@dashpole (Contributor) left a review:

Where does -1 translate to the node allocatable pid limit?

@@ -0,0 +1,150 @@
+/*
+Copyright 2017 The Kubernetes Authors.

2019?

@derekwaynecarr (Member Author)

@dashpole updated the copyright.

> Where does -1 translate to the node allocatable pid limit?

At the node allocatable level, we currently do not bound pids.

$ cat /sys/fs/cgroup/pids/kubepods/pids.max
max

This means that by default, pods are bound by the node allocatable PID limit, which is the /proc/sys/kernel/pid_max value.

At the pod cgroup level, we do not write to the pid cgroup unless the configured value is positive.
See: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/pod_container_manager_linux.go#L89

The moment we add support for setting a node allocatable pid limit, the pod pid limit in this PR will be bounded by that value in the cgroup hierarchy. I kept the help text in terms of node allocatable rather than host configuration since we know that follow-on PR will happen.
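
A hedged, self-contained illustration of that behavior; the types below are simplified stand-ins for the kubelet's internal cgroup config, not the real ones:

// ResourceConfig stands in for the kubelet's internal cgroup resource config.
type ResourceConfig struct {
	PidsLimit *int64
}

// applyPodPidsLimit records a pids.max value only when the configured limit
// is positive; otherwise PidsLimit stays nil and the pod cgroup keeps the
// inherited "max", i.e. it is bounded only by /proc/sys/kernel/pid_max.
func applyPodPidsLimit(rc *ResourceConfig, podPidsLimit int64) {
	if podPidsLimit > 0 {
		rc.PidsLimit = &podPidsLimit
	}
}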

@dashpole (Contributor) commented Jan 9, 2019

I see, thanks.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2019
@dashpole (Contributor) commented Jan 9, 2019

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments

@k8s-ci-robot k8s-ci-robot merged commit 0dbc997 into kubernetes:master Jan 10, 2019