Add support for --security-opt, --sysctl, --ulimit... to swarm mode #25209

Open
mostolog opened this Issue Jul 29, 2016 · 68 comments

@mostolog

mostolog commented Jul 29, 2016

Hi

It looks like docker service create doesn't have any kernel configuration options (e.g. --security-opt, --sysctl, --ulimit...), which are sometimes required.
This is stopping us from using swarm mode to deploy ELK 5 on our testing servers.

Could you at least add a --container-args option? For example:
--container-args="--security-opt seccomp=unconfined --ulimit memlock=-1 --ulimit nofile=102400"

If this can already be done somehow, sorry for the mistake. Please let me know how to do it.

Regards.

@gittycat

gittycat commented Sep 18, 2016

The --security-opt option is also needed by Elasticsearch. Currently, starting Elasticsearch gives this error: "unable to install syscall filter: seccomp unavailable: your kernel is buggy and you should upgrade".
The workaround given is to start containers using --security-opt=seccomp=unconfined but that's not available for services.

@thaJeztah

Member

thaJeztah commented Sep 19, 2016

ping @justincormack perhaps you have thoughts on this. As I commented on #25303 (comment) - one challenge will be "where to put the custom profile file" in a Swarm setup (unless the definition is stored in the Swarm service definition)

@mostolog

mostolog commented Sep 20, 2016

@thaJeztah Excuse me, but my limited English doesn't let me fully understand what you are talking about...
Regarding our needs, each service has its own requirements, so every parameter should be set per service (defined on service create/update) rather than at swarm level.

@thaJeztah

Member

thaJeztah commented Sep 20, 2016

@mostolog I was thinking about how a custom profile should be set (see https://docs.docker.com/engine/security/seccomp/#/passing-a-profile-for-a-container), because I think docker needs to have access to the file that contains that profile (on each node in the swarm).

@mostolog

mostolog commented Sep 20, 2016

Please, let me know if I understood properly, despite my far-too-brief description.

I guess that when you specify --security-opt for a container, it inherits the default profile and adds the given parameters at run time. I suppose the same would happen with services.

If services created under swarm are deployed on other swarm nodes, a "Dockerfile" has to be sent to those nodes in order to run them, so this template could be part of the Dockerfile, couldn't it?
That's what you mean when you say "stored in the Swarm service definition", right?

@thaJeztah

Member

thaJeztah commented Sep 20, 2016

@mostolog no, not a Dockerfile, but the (contents of the) profile.json from the example I linked above. Docker (in "swarm mode") stores the definition of services (what command they are running, which options are passed); the profile itself would have to be stored as part of that definition.

@mostolog

mostolog commented Sep 20, 2016

@thaJeztah Clear as water. Thanks a lot.
And yes, I agree with you: those parameters should also be part of the service definition sent to swarm nodes.

@justincormack

Contributor

justincormack commented Sep 20, 2016

@thaJeztah seccomp is not an issue: the file itself is not used; its JSON contents are passed by the client to the daemon.

However, this just seems to be a workaround for Elasticsearch trying to set its own seccomp profile and failing, which seems really odd. I will look into what the cause is; it looks like a bug in Elasticsearch.

@justincormack

Contributor

justincormack commented Sep 20, 2016

The seccomp error in Elasticsearch was fixed in elastic/elasticsearch@f77e8a5. We return EPERM because we already filter the unknown syscalls, whereas they were expecting ENOSYS. I don't think we are as crazy here as the comments suggest. It looks like the fix will be in 5.0.0 when it is released.
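
To make the EPERM versus ENOSYS point concrete, here is a minimal Python sketch of a tolerant probe check. It is only an illustration of the idea described above, not the code from the Elasticsearch commit; the names `ACCEPTED_PROBE_ERRNOS` and `probe_is_acceptable` are hypothetical.

```python
import errno

# Assumption for illustration: a probe of an unimplemented syscall may report
# ENOSYS on a bare kernel, or EPERM when a seccomp filter (such as the one
# Docker applies) already rejects unknown syscalls.
ACCEPTED_PROBE_ERRNOS = {errno.ENOSYS, errno.EPERM}


def probe_is_acceptable(probe_errno: int) -> bool:
    """Return True if the errno from the probe syscall should not be treated as a kernel bug."""
    return probe_errno in ACCEPTED_PROBE_ERRNOS


for code in (errno.ENOSYS, errno.EPERM, errno.EINVAL):
    print(errno.errorcode[code], probe_is_acceptable(code))
```

A check that accepts only ENOSYS would reject the EPERM case, which matches the failure reported earlier in this thread.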

@thaJeztah

Member

thaJeztah commented Sep 20, 2016

@justincormack ah, I was mistaken then. I assumed the file was needed on the daemon side, but thinking about it more, that would only be the case for a default profile. 😅

@PatrickLang

PatrickLang commented Jan 3, 2017

This is also going to be especially important for Windows containers that need to run as service accounts. We need the --security-opt "credentialspec=..." to be passed through without modifications for this to work.

CC @anweiss @friism

@PatrickLang

PatrickLang commented Jan 3, 2017

--isolation=... is also going to be important. When someone deploys a service, they may need to use --isolation=hyperv for compliance or compatibility reasons. This setting should also be service-specific and not host-wide.

@mostolog

mostolog commented Jan 13, 2017

Are --ulimit or --sysctl already implemented in 1.13.0-RC5 for docker service or docker stack? I'm not able to get them working...

@cpuguy83

Contributor

cpuguy83 commented Jan 14, 2017

@mostolog Nope.

@xiaohai2016

xiaohai2016 commented Jan 26, 2017

Are we expecting this issue to be fixed soon? It is really important!

@thaJeztah thaJeztah added this to backlog in maintainers-session Jan 26, 2017

@thaJeztah thaJeztah removed this from backlog in maintainers-session Jan 26, 2017

@macjl

macjl commented Feb 2, 2017

I also have this problem. I'm not able to run systemd-based containers without the security_opt option.

@ehazlett

Contributor

ehazlett commented Feb 10, 2017

FYI I've opened #30894 to address some of these and would love feedback. If that PR is agreed upon, I'm planning to do the same for "resources" which should address the other things (ulimits, isolation, pids-limit, etc).

@titpetric

titpetric commented Feb 26, 2017

I'd love to set --sysctl net.core.somaxconn=4096 on a swarm service somehow. The container that the swarm service starts gets a default value (128), which doesn't seem to be tunable. Redis, for example, tries to set it to 511 or so and gives a warning if it can't.

1.) I assume --sysctl will be "ported" to service create?
2.) Is there a workaround currently?

@brandonroyal

brandonroyal commented Mar 2, 2017

We're seeing lots of asks for use of domain identities using --security-opt "credentialspec=...". Not having this available will be a blocker for using integrated auth for SQL Server (significant blocker for a number of lift&shift .NET apps). Any chance this is being prioritized?

@aluzzardi

Member

aluzzardi commented Mar 2, 2017

@diogomonica

Contributor

diogomonica commented Mar 3, 2017

@ehazlett and I chatted, we think that this would be a good opportunity to introduce either a secret-type or a good use case for random blobs that have to be delivered to tasks.

For example, this could operate in the following manner:
echo "BLA" | docker secret create --type credential-spec my-cred-spec
and then we could:
docker service create --secrets=my-cred-spec
removing the need for this --security-opt.

We would have to switch on secret types, and then internally pass the contents of that secret to it.

Thoughts @cyli @aaronlehmann @aluzzardi

@aluzzardi

Member

aluzzardi commented Mar 3, 2017

Sorry I don't know what a credential spec is.

Is its content secret in the literal sense?

What's the problem with --security-opt?

@diogomonica

Contributor

diogomonica commented Mar 4, 2017

@aluzzardi I don't think we want to propagate any of the security flags of docker run to docker service create

@aluzzardi

Member

aluzzardi commented Mar 4, 2017

But here we are as well - except they're encapsulated into a secret which is even worse to deprecate?

I might be getting off topic, but I think we have to fix docker run rather than considering it totaled and trying to get a better docker service. 99.9% of our users are using docker run.

I think we should really fix docker run and just have a 1:1 mapping with docker service.

If we continue down this path:

  • docker run, used by the vast majority, has the wrong security model and there is no incentive to fix this
  • docker service lacks basic features that other orchestration platforms, docker run, and classic swarm have supported for years
  • docker run and docker service drift farther apart every time, while in fact we are trying to do the opposite with convergence
  • It leads to a subpar UX. You have to learn two products at once. First you experiment with docker run to get your container up and running, then when you want to run it for "real" as a service, you'll soon find the same flags don't work and you have to learn about a new way. Which is the worst of both worlds

I believe the number one advantage of built-in orchestration is it feels natural to go from dev (single machine) to prod (cluster) - same tools, same UI, same platform.

However, if we go ahead with this, we're basically creating a fracture where it's going to feel like using different tools.

Let's put ourselves in the shoes of an average user deploying SQL Server. You'll probably start by doing a docker run to get things going, tweaking the config, and so on. Then you move to a docker service create (or stack deploy), and you'll notice the CLI spitting out errors like --security-opt: no such flag. Then you have to spend some time on Google, only to find out it's not supported and you have to use an entirely different workflow. Then you flip the table :)

(╯°□°)╯︵ ┻━┻

Just to re-iterate, I think the way forward is:

  1. We fix stuff that is broken in docker run. Caps, security opts, privileged? Let's fix those.
  2. Docker service is a 1:1 copy of docker run. When we fix run, we fix service.
@mostolog

mostolog commented May 3, 2017

Thanks for the tip. I will test tomorrow and let you know.

@thepill

thepill commented May 9, 2017

We would also need to use --security-opt "credentialspec=...." within swarm, as mentioned before. If this becomes possible either via an argument or through a config file, I would vote for a config file.

@titpetric

titpetric commented May 9, 2017

Given how much software requires credentialspec, cap-add, or similar, I would be interested in what the pitfalls might be of providing these via the Dockerfile (i.e. additional commands like RUN). Obviously, software like nginx doesn't require heavy sysctl tuning, but having the option to tune it via the Dockerfile seems like a logical step. Inheritance with FROM would also let you set or override defaults that might be set by the parent image. It would solve at least the problems with Redis, which expects at least minimal tuning. Judging by this and other threads, the ELK stack basically requires some settings and will bug out without them. Wouldn't it be fine to offload security/capability concerns to the obviously smaller population of image maintainers, rather than to devs/users who would need to explicitly bolt these on with various --sysctl, --cap* or other arguments?

@cpuguy83

Contributor

cpuguy83 commented May 9, 2017

@thepill credential spec will be supported on services in 17.06 (I believe the API is there in 17.05).

@titpetric These are all host specific settings that really don't belong in the image format.
There is a proposal to introduce entitlements, #32801, which would be baked into the image, but I do not think things like sysctl would work for this.
Caps definitely would.

@mostolog

mostolog commented May 15, 2017

Hi

@dliappis Sorry for the delay!

Currently, our /etc/systemd/system/docker.service file contains:

LimitNOFILE=1048576
...
LimitNPROC=infinity
LimitCORE=infinity

Suggested command shows:

docker run --rm debian:8.8 /bin/bash -c 'ulimit -Hn && ulimit -Sn && ulimit -Hu && ulimit -Su'
1048576
1048576
unlimited
unlimited

Setting LimitNOFILE=unlimited:

65536
65536
unlimited
unlimited

Setting LimitNOFILE=infinity:

4096
4096
unlimited
unlimited

So...are you sure 8db6109 is working as expected?

@shashanktomar

shashanktomar commented Jul 6, 2017

Looks like this will take some time. Is there any workaround to set "net.ipv4.tcp_keepalive_time" in swarm mode, given that sysctl is not yet supported? This is blocking us from using it in production.

@dliappis

dliappis commented Jul 6, 2017

@mostolog Unfortunately I missed your message, sorry! I guess better late than never.

The docker systemd unit file you referred to (8db6109) is working all right, but there are a few subtle things about limits. See also this serverfault article.

For the sake of brevity, I will only address NOFILE.

The only allowed keyword mentioned in the systemd man page is infinity.
If you use unlimited you will see the following error message in systemctl status docker:

Failed to parse resource value, ignoring: unlimited

I guess in this case it inherits whatever is the system default for systemd (see below).

For infinity the man page reads:

Use the string infinity to configure no limit on a specific resource.

This means that the Docker service will not change anything and inherit whatever is currently active for systemd. Since systemd is running as PID 1 on modern distros, you can check the current value under /proc/1/limits. On a newly started ubuntu-16.04 vagrant box I see:

# cat /proc/1/limits | grep files
Max open files            65536                65536                files     

I then installed the latest docker-ce and got the same NOFILE value as you:

$ systemctl cat docker.service | grep NOFILE
LimitNOFILE=1048576

As expected the container reports the same nofile:

$ docker run --rm debian:8.8 /bin/bash -c 'ulimit -Hn && ulimit -Sn && ulimit -Hu && ulimit -Su'
1048576
1048576
unlimited
unlimited

Now, if I override the LimitNOFILE (I used systemctl edit docker.service to create an override file and then systemctl daemon-reload/restart docker), I can verify the change:

$ sudo systemctl cat docker
# /lib/systemd/system/docker.service
...
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd
LimitNOFILE=infinity

This makes the unit file not set limits and inherit systemd defaults. As expected I see:

$ docker run --rm debian:8.8 /bin/bash -c 'ulimit -Hn && ulimit -Sn && ulimit -Hu && ulimit -Su'
65536
65536
unlimited
unlimited

For the discrepancy on your system, I'd check what the defaults for systemd are and inspect systemctl cat docker to see whether your changes have really propagated. The low 4096 value when specifying infinity sounds like a PAM default, but without going into system-specific details I wouldn't be able to identify which limit is being picked up and why.

In general, though, the settings in 8db6109 should provide high enough defaults for NOFILE and NPROC.
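
The chain described in this comment (systemd to dockerd to container) is ordinary rlimit inheritance across fork/exec. The following is a minimal Python sketch of that mechanism, assuming a Unix host and the standard `resource` module. The helper name `child_nofile_soft_limit` is made up for illustration, and nothing here talks to Docker or systemd.

```python
import resource
import subprocess
import sys


def child_nofile_soft_limit(new_soft: int) -> int:
    """Temporarily set this process's soft RLIMIT_NOFILE and return the soft limit a child process sees."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may never exceed the hard limit (unless the hard limit is unlimited).
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    try:
        # The child inherits the parent's limits at fork/exec time.
        out = subprocess.check_output(
            [sys.executable, "-c",
             "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"]
        )
    finally:
        # Restore the original soft limit for the rest of this process.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return int(out.decode().strip())


if __name__ == "__main__":
    print("child sees soft nofile limit:", child_nofile_soft_limit(256))
```

The same rule is why a container started by a daemon with LimitNOFILE=1048576 reports 1048576, while a daemon that sets nothing passes on whatever it inherited itself.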

@n4ss

n4ss commented Jul 7, 2017

As mentioned earlier in this issue, we're currently working on entitlements to provide a high level security interface to users and the ability to create a security profile tied to images.

We're first looking at capabilities/seccomp/apparmor/API-access configuration support, but we'll definitely look into sysctl configuration (link to the opened issue right above).

Feel free to propose/discuss things there too; we're looking for use cases and needs in order to come up with the right granularity.

@bitgandtter

bitgandtter commented Jul 18, 2017

Any progress on the --ulimit flag for swarm stack deploy? Without it, Elasticsearch can't be deployed as part of a stack.

@imyoungyang

imyoungyang commented Jul 26, 2017

Hi @bitgandtter
@dliappis's comments give very clear instructions for adjusting the docker service ulimit.

You can refer to the Vagrant file to make the docker service's max locked memory unlimited, and to the Docker image to set up an Elasticsearch cluster.

@xificurC

xificurC commented Aug 15, 2017

@imyoungyang IIUC that's a workaround on how to set the ulimits for the docker daemon. Changing those settings changes them for every container. Just because elasticsearch needs e.g. 65k file descriptors doesn't mean we should let everyone have such fun.

I guess we need to wait for libentitlement to land? @n4ss any advance in the last month?

@n4ss

n4ss commented Aug 15, 2017

@xificurC yes, we're having more entitlements implemented and images such as nginx or dind are starting to work with it :)

@dnephin dnephin referenced this issue Aug 29, 2017

Closed

elasticsearch on swarm cluster, ulimits ignored #88

2 of 3 tasks complete
@dliappis

dliappis commented Sep 6, 2017

IIUC that's a workaround on how to set the ulimits for the docker daemon. Changing those settings changes them for every container. Just because elasticsearch needs e.g. 65k file descriptors doesn't mean we should let everyone have such fun.

@xificurC Since 8db6109 the Docker Engine has shipped with high default limits (for performance reasons). Therefore, with recent versions of docker-ce/ee etc., you don't need to change them to meet increased requirements such as those of Elasticsearch. However, you'd need to do the reverse, i.e. reduce the limits per container, if you feel that a specific one may abuse resources, so entitlements would be needed for this case.

@darklow

darklow commented Jan 22, 2018

It would be great if some workaround could be provided, at least at a low level or at least at the daemon.json level (btw setting default-ulimits in daemon.json still doesn't work on latest docker, docker daemon doesn't start). So many services have degraded performance because of multiple options missing when running in docker swarm mode. I am still having Elasticsearch issues because of memory lock and ulimit problems (I ended up removing the swap partition, which is not nice). I am having performance problems on load balancers and web servers because I couldn't find any way of increasing net.core.somaxconn beyond the default 128 (even though I increased it on the host machine and tried multiple other ideas without success). Almost every single performance issue I had came down to running in docker swarm mode. Unfortunately I'm already in production, wasn't aware of so many limitations, and am looking for workarounds; or maybe this issue could be prioritised. Thank you.

@eyz

eyz commented Jan 22, 2018

Additionally, there are also some cases where other non-Swarm flags like --privileged are required, such as running docker-in-docker for CI

@thaJeztah

Member

thaJeztah commented Jan 23, 2018

btw setting default-ulimits in daemon.json still doesn't work on latest docker, docker daemon doesn't start

Could you elaborate? This should work; for example:

{
	"default-ulimits": {
		"nofile": {
			"Name": "nofile",
			"Hard": 2048,
			"Soft": 1024
		}
	}
}
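
Since a syntax slip in daemon.json can stop the daemon from starting, it can help to generate the fragment rather than type it. Below is a minimal Python sketch that builds the same structure as the example above. The helper `default_ulimit` is hypothetical, and it assumes finite, non-negative values (it does not model an "unlimited" setting).

```python
import json
from typing import Dict


def default_ulimit(name: str, soft: int, hard: int) -> Dict[str, Dict[str, object]]:
    """Build one entry for the "default-ulimits" section, mirroring the key layout shown above."""
    # A soft limit above the hard limit is never valid, so reject it early.
    if soft > hard:
        raise ValueError(f"soft limit {soft} exceeds hard limit {hard} for {name!r}")
    return {name: {"Name": name, "Hard": hard, "Soft": soft}}


daemon_config = {"default-ulimits": default_ulimit("nofile", soft=1024, hard=2048)}
print(json.dumps(daemon_config, indent=2))
```

Writing the result with `json.dumps` guarantees well-formed JSON, which rules out the kind of copy-and-paste error mentioned in this exchange.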
@darklow

darklow commented Jan 23, 2018

@thaJeztah Sorry, I must have copied the wrong syntax. Yours does indeed work, thank you.

@jmarcos-cano

jmarcos-cano commented Jan 31, 2018

For anyone struggling with net.core.somaxconn in swarm, here is a workaround:

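redis:
    image: redis:3
    ports:
      - "6379"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      # Bind-mount /proc at a second, writable path (the container's own /proc/sys is read-only)
      - /proc:/writable-proc
    # Raise somaxconn through the writable mount, then hand over to the stock entrypoint
    entrypoint: [ "/bin/bash", "-c", "echo 1024 > /writable-proc/sys/net/core/somaxconn && exec docker-entrypoint.sh redis-server" ]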

I grabbed the idea from Stack Overflow.

Unfortunately, options are limited.
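
To confirm that a workaround like this took effect, the value can be read back from inside the container. Here is a minimal Python sketch, assuming a Linux procfs at the usual path. The helper names `read_somaxconn` and `meets_backlog` are hypothetical, and the 511 figure is only the Redis backlog mentioned earlier in this thread.

```python
from pathlib import Path
from typing import Optional

SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"


def read_somaxconn(path: str = SOMAXCONN_PATH) -> Optional[int]:
    """Return the current somaxconn value, or None if it cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def meets_backlog(required: int, path: str = SOMAXCONN_PATH) -> bool:
    """Return True if the kernel's listen backlog cap is at least `required`."""
    current = read_somaxconn(path)
    return current is not None and current >= required


if __name__ == "__main__":
    print("somaxconn:", read_somaxconn())
    print("enough for a backlog of 511:", meets_backlog(511))
```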

@raarts

raarts commented Mar 6, 2018

I am deeply worried by the fact that the moby/libentitlement repo (which is supposed to fix this issue) has been at a standstill for 3 months now...

@moby moby deleted a comment from 13428282016 Jun 1, 2018

@zicklag

zicklag commented Jun 19, 2018

I managed a very limited workaround that I used to run a Docker volume plugin container that needed to do a FUSE mount. I created a Docker image, kadimasolutions/docker-run-d, that is meant to run another container using the Docker CLI. You run this container as a swarm service and mount the Docker socket into it. You pass in a Docker run command and it will use the Docker CLI to run the command against the Docker socket mounted into the container. For example:

...
privileged-nginx:
    image: kadimasolutions/docker-run-d:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command:
      - "--privileged -p 80:80 nginx"
...

The docker-run-d container will start the nginx container when the swarm service is run and it will stop the nginx container when the service is stopped. This has a whole lot of limitations and nuances and is in no way a good workaround, but it was the only option for my use case.
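
As a rough illustration of the idea only (this is not how kadimasolutions/docker-run-d is actually implemented, which this thread does not show), here is a Python sketch of turning the single command string from the service definition into an argument vector for the Docker CLI. The function name `build_docker_run_argv` is hypothetical, and the sketch does not cover stopping the child container when the service stops.

```python
import shlex
from typing import List


def build_docker_run_argv(run_args: str) -> List[str]:
    """Split a 'docker run' argument string shell-style and prefix it with the CLI invocation."""
    return ["docker", "run", *shlex.split(run_args)]


argv = build_docker_run_argv("--privileged -p 80:80 nginx")
print(argv)  # ['docker', 'run', '--privileged', '-p', '80:80', 'nginx']
```

A wrapper along these lines would pass `argv` to something like `subprocess.run` with the Docker socket mounted, as described in the comment above.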

@thaJeztah

Member

thaJeztah commented Aug 23, 2018

WIP Pull request for setting sysctl for swarm services: #37701 / docker/swarmkit#2729
