
SupportPodPidsLimit feature beta with tests #72076

Merged

merged 1 commit into kubernetes:master from derekwaynecarr:pid-limiting on Jan 10, 2019

Conversation

@derekwaynecarr
Member

derekwaynecarr commented Dec 15, 2018

/kind feature

What this PR does / why we need it:
Graduate the SupportPodPidsLimit feature to beta. This is needed to prevent a pod from starving the node of pid resources.

Special notes for your reviewer:
need to bump runc to fix this for systemd
opencontainers/runc#1917

Does this PR introduce a user-facing change?:

Administrators are able to configure the maximum number of pids for a pod on a node.
@derekwaynecarr

Member Author

derekwaynecarr commented Dec 15, 2018

/hold
need to bump runc prior to merge.

fyi @dims @dchen1107 @sjenning @mrunalp @rhatdan @smarterclayton

}

// this command takes the expected value and compares it against the actual value for the pod cgroup pids.max
command := fmt.Sprintf("expected=%v; actual=$(cat /tmp/pids/%v/pids.max); if [ \"$expected\" -ne \"$actual\" ]; then exit 1; fi; ", pidsLimit.Value(), cgroupFsName)


@dims

dims Dec 15, 2018

Member

So we are just checking that pids.max is applied, and not actually trying something that fork-bombs, right?


@derekwaynecarr

derekwaynecarr Dec 16, 2018

Author Member

Correct. No need to test cgroups, just that we set the right desired state. This is basically what we do for all other resources.

@dims

Member

dims commented Dec 17, 2018

thanks @derekwaynecarr - looks like verify job is busted. LGTM otherwise!

@derekwaynecarr derekwaynecarr force-pushed the derekwaynecarr:pid-limiting branch from b2728bc to 819ed9c Dec 17, 2018

@derekwaynecarr derekwaynecarr added this to the v1.14 milestone Dec 17, 2018

@yujuhong

Member

yujuhong commented Dec 17, 2018

Was there a follow-up discussion about how to set the limits (see the original thread in #57973)?

@dims

Member

dims commented Dec 17, 2018

@yujuhong not that I recall.

@derekwaynecarr

Member Author

derekwaynecarr commented Dec 18, 2018

@yujuhong - I have a hold on this so we can discuss it in sig-node. I am fine adding a knob to protect across all pods in addition to per pod, as it simplifies configuring a pid reserve. I felt that big and small pod pid limits could be added in the future, as that is not inconsistent with a default pod pid limit enforced locally on the node.

@derekwaynecarr

Member Author

derekwaynecarr commented Dec 18, 2018

Mapping to systemd:

  • knob in this PR is DefaultTasksMax
  • future iterations could add per pod TasksMax value. It could be restricted and/or defaulted via LimitRange or PSP. Those iterations could be different feature flags.
@derekwaynecarr

Member Author

derekwaynecarr commented Dec 18, 2018

To capture discussion from sig-node meeting:

  1. we want to provide a seat belt for users to protect the node
  2. we will default this flag to the rlimit.maxPid - configured eviction threshold for pid.available. this sets a hard cap to prevent any pod from doing pid exhaustion. we could set that cap at the cgroup level for all pods, and each individual pod.
  3. in future, we want to support granular options per pod for pid limits, but we want to do that via a policy option. this would be tracked under a different feature gate, i.e. GranularPidLimitsPerPod.
  4. the defaulting applied by this flag would only be used if a pod had no associated pid limiting policy applied at a future date.
@yujuhong

Member

yujuhong commented Dec 18, 2018

2. we will default this flag to the rlimit.maxPid - configured eviction threshold for pid.available. this sets a hard cap to prevent any pod from doing pid exhaustion. we could set that cap at the cgroup level for all pods, and each individual pod.

My slight concern before was that the per-pod limit was going to be hard to pick/set, and not so useful after all. Alternatively, a node-wide limit for all pods (similar to allocatable) will provide a safety net for the node, and would be easier to roll out.

Setting a sensible default limit like this addressed my concern, so this looks good to me.

@dashpole

Contributor

dashpole commented Dec 18, 2018

@derekwaynecarr a couple more thoughts to consider:

We currently have just basic node-level protection through node-level eviction. The next step IMO is isolation between node daemons (kubelet, runtime, etc.) and user pods using Node Allocatable. This would allow isolating user-caused PID exhaustion to user pods, preventing the node from being made unusable if eviction doesn't catch pressure in time. Achieving pod-to-pod PID isolation through PidLimitsPerPod would be the final step to prevent PID exhaustion from affecting any "innocent" workloads. It feels like we are trying to achieve protection for node daemons by using a pod-to-pod isolation mechanism, resulting in the "we need defaults for protection, but don't know how to set them well" issue.

Node Allocatable is a well-established pattern for isolating node daemons and user workloads using a combination of cgroups and eviction. It would be easy to implement, and the design should not be contentious. I think we should seriously consider adding it under this same feature gate.

We should definitely still move forward with the PidLimitsPerPod mechanism, but I think adding Node Allocatable will alleviate the concerns we have over not having a good default for it, and allow operators to use it as a knob to isolate pods from each other rather than as a pseudo-node-allocatable hard limit to protect daemons.

Edit: Realized this feature gate is just for pids per pod, so allocatable shouldn't be tied to it. We should still consider adding it though!

@sjenning

Contributor

sjenning commented Dec 19, 2018

#72114 merged bumping runc

@derekwaynecarr

Member Author

derekwaynecarr commented Dec 19, 2018

/hold

@derekwaynecarr derekwaynecarr force-pushed the derekwaynecarr:pid-limiting branch from 819ed9c to 69c207b Jan 7, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 7, 2019

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 7, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr derekwaynecarr force-pushed the derekwaynecarr:pid-limiting branch from 69c207b to 6f8a8c0 Jan 7, 2019

@derekwaynecarr

Member Author

derekwaynecarr commented Jan 7, 2019

I updated the help text for pod-max-pids to state the following:

If -1, the kubelet defaults to the node allocatable pid capacity.

@dashpole - I agree we should have eviction threshold and node allocatable enforcement for pids in a follow-on PR. This affords us the opportunity for a future PR to restrict pid limiting at the node allocatable level while maintaining backward compatibility. FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further. PTAL at this PR.

@dashpole
Contributor

dashpole left a comment

lgtm overall

}

// enablePodPidsLimitInKubelet enables pod pid limit feature for kubelet with a sensible default test limit
func enablePodPidsLimitInKubelet(f *framework.Framework) *kubeletconfig.KubeletConfiguration {


@dashpole

dashpole Jan 7, 2019

Contributor

You could use tempSetCurrentKubeletConfig here to eliminate the need for this function. It also handles reverting the config to the default for subsequent tests.

@@ -560,7 +560,7 @@ func AddKubeletConfigFlags(mainfs *pflag.FlagSet, c *kubeletconfig.KubeletConfig
 	fs.Int32Var(&c.MaxPods, "max-pods", c.MaxPods, "Number of Pods that can run on this Kubelet.")
 	fs.StringVar(&c.PodCIDR, "pod-cidr", c.PodCIDR, "The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master. For IPv6, the maximum number of IP's allocated is 65536")
-	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "<Warning: Alpha feature> Set the maximum number of processes per pod.")
+	fs.Int64Var(&c.PodPidsLimit, "pod-max-pids", c.PodPidsLimit, "Set the maximum number of processes per pod. If -1, the kubelet defaults to the node allocatable pid capacity.")


@dashpole

dashpole Jan 7, 2019

Contributor

Do we need to update the description for the Configuration type as well? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L222.

@dashpole

Contributor

dashpole commented Jan 7, 2019

FWIW, I am not convinced right now that PidPressure condition is actually working, we should investigate that further.

It wouldn't be too hard to write an actual fork-bomb test using the eviction framework. I'll try and add that in the next few weeks. I opened #72654 to track this.

@derekwaynecarr derekwaynecarr force-pushed the derekwaynecarr:pid-limiting branch from 6f8a8c0 to aed0256 Jan 8, 2019

@derekwaynecarr

Member Author

derekwaynecarr commented Jan 8, 2019

@dashpole the requested updates have been made.

/hold cancel

@dashpole
Contributor

dashpole left a comment

Where does -1 translate to the node allocatable pid limit?

@@ -0,0 +1,150 @@
/*
Copyright 2017 The Kubernetes Authors.


@dashpole

dashpole Jan 8, 2019

Contributor

2019?

@derekwaynecarr derekwaynecarr force-pushed the derekwaynecarr:pid-limiting branch from aed0256 to bce9d5f Jan 9, 2019

@derekwaynecarr

Member Author

derekwaynecarr commented Jan 9, 2019

@dashpole updated the copyright.

Where does -1 translate to the node allocatable pid limit?

At the node allocatable level, we currently do not bound pids.

cat /sys/fs/cgroup/pids/kubepods/pids.max
max

This means that by default, pods are bound by the node allocatable pid limit, which is the /proc/sys/kernel/pid_max value.

At the pod cgroup level, we do not write to the pid cgroup unless the configured value is positive.
See: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/pod_container_manager_linux.go#L89

The moment we add support for setting a node allocatable pid limit, the pod pid limit in this PR will be bounded by that value in the cgroup hierarchy. I kept the help text documentation in terms of node allocatable rather than host configuration since we know that will happen in a follow-on PR.

@dashpole

Contributor

dashpole commented Jan 9, 2019

I see, thanks.

/lgtm

@dashpole

Contributor

dashpole commented Jan 9, 2019

/retest

@fejta-bot


fejta-bot commented Jan 9, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar /retest comments from @fejta-bot followed on Jan 10, 2019.

@k8s-ci-robot k8s-ci-robot merged commit 0dbc997 into kubernetes:master Jan 10, 2019

18 of 19 checks passed:

  • pull-kubernetes-local-e2e-containerized: Job triggered.
  • cla/linuxfoundation: derekwaynecarr authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-godeps: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
  • tide: In merge pool.
@dashpole

Contributor

dashpole commented Jan 17, 2019
