rlimit support #3595

Open
thockin opened this issue Jan 18, 2015 · 107 comments


@thockin (Member) commented Jan 18, 2015

moby/moby#4717 (comment)

Now that this is in, we should define how we want to use it.

@bgrant0607 (Member) commented Jan 23, 2015

/cc @vishh @rjnagal @vmarmol

@rjnagal (Member) commented Jan 23, 2015

We can set a sane default for now. Do we want this to be exposed as a knob in the spec, or do we prefer a low/high toggle? The only advantage of a toggle is that we can possibly avoid too many jobs with high values landing on the same machine.


@thockin (Member, Author) commented Jan 23, 2015

Are there any downsides to setting high limits by default these days? I can't keep straight what bugs we have fixed internally that might not have been accepted upstream, especially regarding things like memcg accounting of kernel structs.


@vmarmol (Contributor) commented Jan 23, 2015

+1 to toggle, putting it in the spec is overkill IMO.

@thockin (Member, Author) commented Jan 23, 2015

If there is a toggle for "few" vs "many", everyone will choose "many". We need to understand and document why "few" is the better choice most of the time, and think about how to restrict who uses "many".


@vishh (Member) commented Jan 23, 2015

Kernel memory accounting seems to be disabled in our container VM image. The overall fd limit might also be a factor to consider. Given these constraints, providing a toggle option makes sense.


@rjnagal (Member) commented Jan 23, 2015

One way to restrict "many" would be to take the global machine limits into account and use them in scheduling. I don't think we have, or are planning to add, user-based capabilities.


@timcash commented Mar 4, 2015

For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the higher limit would be for our storage containers (stateful).

@timothysc (Member) commented Mar 4, 2015

+1 to toggle, but what exactly do "few" and "many" mean? Also, what are the implications for scheduling?

@bgrant0607 (Member) commented Mar 6, 2015

I don't think few and many are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.

@rjnagal (Member) commented Mar 6, 2015

I would assume that we would at best only do minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when the node is running low - more of an out-of-resource model.

For the "large" and "few" values, we can start with the typical Linux maximum for the resource as "large", and the typical default as "few".

@bgrant0607 what kind of model did you have in mind for representing these as resources?

@bgrant0607 (Member) commented Mar 7, 2015

I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource.

I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance.

What's the downside of just exposing numerical parameters?

I agree we should choose reasonable, modest defaults.
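
To make that arithmetic concrete (the numbers here are purely illustrative): a frontend expected to hold about 20,000 concurrent client sockets, plus roughly 100 descriptors for logs, config files, and upstream connections, needs an open-files limit of at least ~20,100, rounded up for headroom:

# purely illustrative sizing of RLIMIT_NOFILE for a hypothetical frontend
ulimit -n         # show the current soft limit on open file descriptors
# 20000 expected client sockets + ~100 fds of overhead => need >= 20100
ulimit -n 24576   # round up for headroom (cannot exceed the hard limit)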

@thockin (Member, Author) commented Mar 7, 2015

Small and large feels clunky, but I like the ability for site admins to define a few grades of service and then let users choose. I think it works pretty well internally - at least most users survive with the default of "small".


@bgrant0607 (Member) commented Apr 16, 2015

Re. admin-defined policies, see moby/moby#11187

@vishh (Member) commented Aug 10, 2015

@bgrant0607: Is this something that we can consider for v1.1?

@bgrant0607 (Member) commented Aug 10, 2015

We can, but I'll have 0 bandwidth to think about it for the next month, probably.

cc @erictune

@dchen1107 (Member) commented Sep 23, 2015

Docker's rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means:

  • The limit is applied to the container's root process, and all child processes inherit it
  • There is no control over how many child processes are created
  • Processes started by docker exec do not inherit the same limit

Based on the above, I don't think this is a very useful feature, or at least not an easy-to-use feature to specify and manage.
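
For reference, the Docker-side feature being discussed here is the --ulimit flag on docker run (with a daemon-wide --default-ulimit counterpart). A rough sketch of its process-based behaviour, matching the points above (image and values are just examples):

# set RLIMIT_NOFILE (soft:hard) on the container's init process
docker run --rm --ulimit nofile=1024:2048 busybox sh -c 'ulimit -n'
# prints 1024; processes forked by PID 1 inherit the limit, but a later
# `docker exec` into the container does not pick up the same limit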

@shaylevi2 commented Nov 10, 2015

Where does this stand? Is it available through any config?

@vishh (Member) commented Nov 11, 2015

@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.

@tobymiller1 commented Oct 25, 2018

@rewt I can't get your solution (https://github.com/rewt/elasticsearch-mlock) to work, because ulimit can't do this as a non-root user, and Elasticsearch won't run as root. K8s doesn't seem to respect the /etc/security/limits.conf file of the host (or the image - which makes sense), so I can't find a way to get this right for Elasticsearch. We have swap on, which is why it's a problem at all.

@rewt commented Oct 25, 2018 (comment minimized)

@tobymiller1 commented Oct 25, 2018

@rewt It's your point 2 that I'm struggling with. I have exactly the same dockerfile as you (elastic 5, but I doubt that's the problem), and I have the following in my pod definition:

securityContext:
  privileged: true
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE

Am I missing something?

@rewt commented Oct 25, 2018 (comment minimized)

@tobymiller1 commented Oct 25, 2018

@rewt Thanks for your help. That is the configuration that I have, but it doesn't seem to allow ulimit to act as non-root in this way, and I don't understand why it should. I don't want to take over this thread though - I think we'll look for an alternative solution.

@Shifter2600 commented Oct 30, 2018

Can't you pass in -e JAVA_OPTS="-Xmx4096m" with sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run -e JAVA_OPTS="-Xmx4096m" rancher/rancher-agent:v2.1.1? I am pretty sure this works around the issue, as I have seen it used as a solution with ES operators for Kubernetes.

@kesor commented Nov 6, 2018

@tobymiller1 in the solution by @rewt, for it to actually work you need to add a USER root stanza to the Dockerfile, and then drop to the nobody (or whatever) user after you execute the ulimit -l unlimited command in the alternate wrapper.
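
A minimal sketch of that wrapper pattern, assuming an Alpine-based image with the su-exec package installed (gosu works the same way; the target user and entrypoint path here are just placeholders):

#!/bin/sh
# runs as root because the Dockerfile ends with USER root
ulimit -l unlimited     # raise RLIMIT_MEMLOCK while still privileged
# drop to the unprivileged user; rlimits survive exec, so the app keeps the raised limit
exec su-exec nobody /usr/local/bin/docker-entrypoint.sh "$@"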

@jasongerard commented Nov 14, 2018

@tobymiller1 If you want to be able to run the image as a non-root user in Kubernetes, you can use setcap directly on the executable: RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe

You will need the libcap package installed in Alpine to get that command. Add that line before you switch down to your limited user.

In Linux, capabilities are lost when the UID changes from 0 (root) to non-zero (your limited user):

# Do other stuff
RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe
USER your_limited_user

The executable can now change user limits and lock memory.

Another item to note: aufs does NOT support capabilities because it doesn't support extended attributes. You must use another FS like overlay2. Make sure your Kubernetes provider is not using aufs.
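
As a sanity check while debugging, getcap (shipped in the same libcap package as setcap) can confirm the file capabilities actually landed on the binary before you switch users; the exact output format varies by libcap version:

RUN apk add --no-cache libcap \
 && setcap cap_ipc_lock,cap_sys_resource=+ep ./your_exe \
 && getcap ./your_exe   # should list cap_ipc_lock and cap_sys_resource on the file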

@wahaha2001 commented Jan 2, 2019

@jasongerard Just tried your solution but it doesn't work; anything I missed?

I am using Alpine as the base image and want to run "ulimit -l unlimited" in entrypoint.sh as the non-root user esuser. My Kubernetes cluster runs on Ubuntu 16.04 and uses the overlay2 FS.

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /entrypoint.sh && setcap cap_sys_resource=+ep /entrypoint.sh
USER esuser

Got an error message like this:
/entrypoint.sh: line 21: ulimit: max locked memory: cannot modify limit: Operation not permitted

@jasongerard commented Jan 2, 2019

@wahaha2001 you are calling setcap on a shell script. The executable that gets invoked requires the capability. In the case of a script, whatever executable is in your '#!' line will need the capability.

@wahaha2001 commented Jan 2, 2019

@jasongerard, thanks for the quick response. I am using #!/bin/bash in entrypoint.sh, so I changed it to:

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash
USER esuser

But I still get the same error :(

@jasongerard commented Jan 2, 2019

@wahaha2001 You will still need to add the IPC_LOCK capability to your pod.

apiVersion: v1
kind: Pod
metadata:
  name: somename
spec:
  containers:
  - name: somename
    image: somename:latest
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        add: ["IPC_LOCK"]
  restartPolicy: Never
@bgrant0607 (Member) commented Feb 27, 2019

Putting this here so I can find it again later: https://do-db2.lkml.org/lkml/2011/6/19/170

@manojtr commented Mar 6, 2019

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g.:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

securityContext:
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE

This only works if we use root to run the application. Any solution for non-root users?

@jasongerard commented Mar 6, 2019

@manojtr see my comments above about running as non-root. You need to use setcap.

@manojtr commented Mar 6, 2019

@jasongerard I tried the below and I am getting a standard_init_linux.go:207: exec user process caused "operation not permitted" error when I simply build the Docker image using docker build -t elasticsearch-2.4.6:dev --rm -f image/Dockerfile.v2 image:

RUN setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash

USER elasticsearch
CMD ["/bin/bash", "bin/es-docker"]

@mattayes commented Mar 26, 2019

For the Elasticsearch case, changing the rlimit/locking memory is only a concern if you can't disable swapping, correct? AFAIK swapping is disabled on Kubernetes nodes, so this shouldn't be an issue, right?

@guitmz commented Apr 30, 2019

btw I have tried the following for Cassandra:

      initContainers:
      - name: increase-memlock-ulimit
        image: busybox
        command: ["sh", "-c", "ulimit -l unlimited"]
        securityContext:
          privileged: true

with

      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE

but it is not working; I still see errors about RLIMIT_MEMLOCK, and the value inside the container is not changed to unlimited...

@jasongerard commented Apr 30, 2019

@guitmz I do not believe an initContainer will work. In that case, you are setting the ulimit for that container, not the pod as a whole.

Below are links to an example Dockerfile and pod.yaml for running a process successfully as non-root with the ability to change ulimit. Please note that the process in this example changes the ulimit with a syscall.

Dockerfile: https://github.com/jasongerard/mlockex/blob/master/Dockerfile
pod.yaml: https://github.com/jasongerard/mlockex/blob/master/pod.yaml
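
One quick way to confirm the per-process behaviour (the pod and container names below are placeholders): even after an init container like the one above has run, the main container's processes still report the default, because rlimits are per-process attributes and the init container's processes have already exited:

kubectl exec some-pod -c main-container -- sh -c 'ulimit -l'
# typically still prints the node default (e.g. 64), not "unlimited"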

@fejta-bot commented Jul 29, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@agolomoodysaada commented Jul 29, 2019

Can we remove the stale bot for this issue?

@dims (Member) commented Jul 29, 2019

/remove-lifecycle stale

johananl added a commit to kinvolk/cassandra-operator that referenced this issue Aug 5, 2019

Don't run ulimit inside Cassandra container
Running ulimit as non-root inside k8s pods is currently unsupported:
kubernetes/kubernetes#3595