Too many (>25) TLS certificates causes the load balancer to fail #2924

Closed
japsu opened this Issue Oct 10, 2017 · 11 comments

japsu commented Oct 10, 2017

There seems to be an implicit, undocumented limit on the number of certificates that can be attached to a load balancer. We hit this limit at 25 certificates. When we try to add the 26th one, deployment fails with this error message:

[done] Reading stack platform metadata from Kontena Master
[done] Upgrading stack platform
[done] Triggering deployment of stack platform
[done] Waiting for deployment to start
[fail] Deploying service lb1
Deployment of service plat2-grid/platform/lb1 failed:
- halting deploy of plat2-grid/platform/lb1, one or more instances failed
- GridServiceInstanceDeployer::StateError: Service instance is not running, but restarting (on node autumn-river-15)

Digging deeper, sudo docker logs platform.lb1-1 on autumn-river-15 gives me this:

goroutine 1 [running, locked to thread]:
panic(0x7eb2e0, 0xc820238460)
    /usr/lib/go1.6/src/runtime/panic.go:481 +0x3e6
github.com/urfave/cli.HandleAction.func1(0xc8201bd2e8)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:478 +0x38e
panic(0x7eb2e0, 0xc820238460)
    /usr/lib/go1.6/src/runtime/panic.go:443 +0x4e9
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization.func1(0xc8201bcbf8, 0xc82001a0a0, 0xc8201bcd08)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:259 +0x136
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization(0xc820058870, 0x7f81a4b55500, 0xc820238460)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:277 +0x5b1
main.glob.func8(0xc820076780, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main_unix.go:26 +0x68
reflect.Value.call(0x74fac0, 0x9012a0, 0x13, 0x847808, 0x4, 0xc8201bd268, 0x1, 0x1, 0x0, 0x0, ...)
    /usr/lib/go1.6/src/reflect/value.go:435 +0x120d
reflect.Value.Call(0x74fac0, 0x9012a0, 0x13, 0xc8201bd268, 0x1, 0x1, 0x0, 0x0, 0x0)
    /usr/lib/go1.6/src/reflect/value.go:303 +0xb1
github.com/urfave/cli.HandleAction(0x74fac0, 0x9012a0, 0xc820076780, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:487 +0x2ee
github.com/urfave/cli.Command.Run(0x84a6b8, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8e05e0, 0x51, 0x0, ...)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/command.go:191 +0xfec
github.com/urfave/cli.(*App).Run(0xc820001680, 0xc82000a100, 0x2, 0x2, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:240 +0xaa4
main.main()
    /build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main.go:137 +0xe24
panic: standard_init_linux.go:178: exec user process caused "argument list too long" [recovered]
    panic: standard_init_linux.go:178: exec user process caused "argument list too long"

The error goes away when I reduce the number of certificates below 26.

We are a software consultancy running a large number of websites and services on shared infrastructure. The implicit limit on TLS certificates significantly hampers our ability to scale using the Kontena load balancer.

Keywords for search & Google: SSL, HTTPS, HAproxy

japsu commented Oct 10, 2017

Apparently environment variables are used to smuggle the certificates into the LB container? There is a 128 KiB limit on the size of the environment that cannot be changed without recompiling the kernel.

http://man7.org/linux/man-pages/man2/execve.2.html
Search for: Limits on size of arguments and environment

http://elixir.free-electrons.com/linux/latest/source/include/uapi/linux/limits.h#L7

Contributor

SpComb commented Oct 10, 2017

Yes, the LB SSL certs are passed via the SSL_CERTS environment variable. All of the certs are joined together into a single blob, and it looks like the effective limit here is on the size of each separate env variable. From the man page: "Additionally, the limit per string is 32 pages (the kernel constant MAX_ARG_STRLEN)" (32 * 4KB = 128KB).

Unfortunately I don't know of any workaround for this limit with the current way that the SSL_CERTS env is implemented. Eventually we will need support for filesystem-based secrets; in the meantime, we may need to look at splitting up the certs across multiple env variables...

> There is a 128 KiB limit on the size of the environment that cannot be changed without recompiling the kernel.

It's not quite that drastic: that limit applies "On Linux prior to kernel 2.6.23 ...". You can create and run a container with more than 128KB of total envs just fine on CoreOS stable. The default RLIMIT_STACK on CoreOS stable for the Docker service/container seems to be 8MB, which implies a total env size limit of 2MB (a quarter of the stack limit) per the man page, and that seems to be the case: 2000 * 1KB env vars works, but 2048 * 1KB env vars fails with a similar panic.

irb(main):046:0> env = (1..2000).map{|x| "TEST#{x}=#{'a' * 1024}"}; true
=> true
irb(main):047:0> container = Docker::Container.create(name: 'env-test', 'Image' => 'alpine:3.5', 'Env' => env, 'Cmd' => ['env'])
=> #<Docker::Container:0x00560b0f85b5b8 @id="1fe036ae17ef090da28a44cd6087e767151535669898a4746ac805d4026fb12e", @info={"Warnings"=>nil, "id"=>"1fe036ae17ef090da28a44cd6087e767151535669898a4746ac805d4026fb12e"}, @connection=#<Docker::Connection:0x00560b0f93e958 @url="unix:///", @options={:socket=>"/var/run/docker.sock"}>>
irb(main):048:0> container.start!
=> #<Docker::Container:0x00560b0f85b5b8 @id="1fe036ae17ef090da28a44cd6087e767151535669898a4746ac805d4026fb12e", @info={"Warnings"=>nil, "id"=>"1fe036ae17ef090da28a44cd6087e767151535669898a4746ac805d4026fb12e"}, @connection=#<Docker::Connection:0x00560b0f93e958 @url="unix:///", @options={:socket=>"/var/run/docker.sock"}>>
irb(main):049:0> container.streaming_logs(stdout: true).size
=> 2066992
irb(main):050:0> puts container.streaming_logs(stderr: true)

=> nil

However, this test indeed fails with a single env var that is 128 * 1024 bytes long...

irb(main):040:0> env = (1..1).map{|x| "TEST#{x}=#{'a' * 1024 * 128}"}; true
=> true
irb(main):041:0> container = Docker::Container.create(name: 'env-test', 'Image' => 'alpine:3.5', 'Env' => env, 'Cmd' => ['env'])
=> #<Docker::Container:0x00560b0f842ef0 @id="3ed441b74185ff47f1304507510c9703e6a98b14fdb88e4564d375d87bacf254", @info={"Warnings"=>nil, "id"=>"3ed441b74185ff47f1304507510c9703e6a98b14fdb88e4564d375d87bacf254"}, @connection=#<Docker::Connection:0x00560b0f93e958 @url="unix:///", @options={:socket=>"/var/run/docker.sock"}>>
irb(main):042:0> container.start!
=> #<Docker::Container:0x00560b0f842ef0 @id="3ed441b74185ff47f1304507510c9703e6a98b14fdb88e4564d375d87bacf254", @info={"Warnings"=>nil, "id"=>"3ed441b74185ff47f1304507510c9703e6a98b14fdb88e4564d375d87bacf254"}, @connection=#<Docker::Connection:0x00560b0f93e958 @url="unix:///", @options={:socket=>"/var/run/docker.sock"}>>
irb(main):043:0> container.streaming_logs(stdout: true).size
=> 0
irb(main):044:0> puts container.streaming_logs(stderr: true)
panic: standard_init_linux.go:178: exec user process caused "argument list too long" [recovered]
	panic: standard_init_linux.go:178: exec user process caused "argument list too long"

goroutine 1 [running, locked to thread]:
panic(0x7eb2e0, 0xc8201f7490)
	/usr/lib/go1.6/src/runtime/panic.go:481 +0x3e6
github.com/urfave/cli.HandleAction.func1(0xc8201792e8)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:478 +0x38e
panic(0x7eb2e0, 0xc8201f7490)
	/usr/lib/go1.6/src/runtime/panic.go:443 +0x4e9
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization.func1(0xc820178bf8, 0xc82001a0c8, 0xc820178d08)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:259 +0x136
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization(0xc820051630, 0x7f8e1b64f728, 0xc8201f7490)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:277 +0x5b1
main.glob.func8(0xc82006ea00, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main_unix.go:26 +0x68
reflect.Value.call(0x74fac0, 0x9012a0, 0x13, 0x847808, 0x4, 0xc820179268, 0x1, 0x1, 0x0, 0x0, ...)
	/usr/lib/go1.6/src/reflect/value.go:435 +0x120d
reflect.Value.Call(0x74fac0, 0x9012a0, 0x13, 0xc820179268, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/lib/go1.6/src/reflect/value.go:303 +0xb1
github.com/urfave/cli.HandleAction(0x74fac0, 0x9012a0, 0xc82006ea00, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:487 +0x2ee
github.com/urfave/cli.Command.Run(0x84a6b8, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8e05e0, 0x51, 0x0, ...)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/command.go:191 +0xfec
github.com/urfave/cli.(*App).Run(0xc820001500, 0xc82000a100, 0x2, 0x2, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:240 +0xaa4
main.main()
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main.go:137 +0xe24
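
The two limits discussed in this comment can be sketched in a few lines of Ruby. This is only an illustration of the arithmetic from execve(2), assuming 4 KiB pages; the helper names (`total_env_limit`, `env_string_fits?`) are made up for this example, and the kernel applies further details that are ignored here.

```ruby
# Sketch of the two execve() limits discussed above (see execve(2)).
# Assumption: 4 KiB pages, as on typical x86-64 systems.
PAGE_SIZE = 4096
MAX_ARG_STRLEN = 32 * PAGE_SIZE # per-string limit: 131072 bytes

# Total argv + envp size limit: one quarter of the stack rlimit.
def total_env_limit(stack_rlimit_bytes)
  stack_rlimit_bytes / 4
end

# True if a single "NAME=value" string fits, counting the trailing NUL byte.
def env_string_fits?(name, value)
  "#{name}=#{value}".bytesize + 1 <= MAX_ARG_STRLEN
end

# Soft stack limit of the current process (may be "unlimited", i.e. a huge number).
stack_soft, _stack_hard = Process.getrlimit(Process::RLIMIT_STACK)

puts "per-string limit: #{MAX_ARG_STRLEN} bytes"
puts "total limit for an 8 MiB stack: #{total_env_limit(8 * 1024 * 1024)} bytes"
puts "total limit for this process: #{total_env_limit(stack_soft)} bytes"
puts "1 KiB env var fits: #{env_string_fits?('TEST1', 'a' * 1024)}"
puts "128 KiB env var fits: #{env_string_fits?('TEST1', 'a' * 1024 * 128)}"
```

With these numbers, the 2000 * 1KB test stays under the 2MB total, while the single 128 * 1024 byte value exceeds the per-string limit on its own.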

japsu commented Oct 10, 2017

We'll probably work around this in our next iteration by creating multiple load balancers (of 2 instances each) which each serve a subset of our services.

Contributor

SpComb commented Oct 10, 2017

I think this should also have some validation to fail earlier... if the per-env size limit is exceeded, then the deploy should fail before the working service instance gets taken down.

This is also going to be a more significant issue with the new TLS-SNI automation, where the challenge certs are also deployed to the LB services.
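
A pre-deploy check along these lines might look roughly like the following. This is a hypothetical sketch, not Kontena's actual validation code: the function name `oversized_envs` and the constant are invented for illustration, and the limit assumes 4 KiB pages.

```ruby
# Hypothetical pre-deploy check; names are illustrative, not Kontena's API.
# Assumption: MAX_ARG_STRLEN is 32 pages of 4 KiB each.
MAX_ENV_STRING_BYTES = 32 * 4096

# Returns the names of env entries ("NAME=value") that exceed the
# per-string limit, counting the trailing NUL byte.
def oversized_envs(env, limit = MAX_ENV_STRING_BYTES)
  too_big = env.select { |entry| entry.bytesize + 1 > limit }
  too_big.map { |entry| entry.split('=', 2).first }
end

env = ["SSL_CERTS=#{'a' * 128 * 1024}", 'FOO=bar']
bad = oversized_envs(env)

if bad.empty?
  puts 'env ok, safe to deploy'
else
  # Fail here, before the running service instance is stopped.
  puts "refusing to deploy, env too large: #{bad.join(', ')}"
end
```

The point of running this before the deploy starts is that the existing LB container keeps serving traffic when the new configuration could never start.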

@SpComb SpComb added agent bug lb labels Oct 10, 2017

@SpComb SpComb added this to the 1.4.1 milestone Oct 10, 2017

japsu commented Oct 10, 2017

I suggest you also mention the effective limit on LB certificates in the documentation. This is the most likely scenario in which anyone will try to use a huge environment variable.

Contributor

SpComb commented Oct 10, 2017

This should be relatively easy to fix by enhancing the kontena-loadbalancer entrypoint to also pick up individual certs from separate SSL_CERT_* envs and write them out to /etc/haproxy/certs/. This wouldn't even need any kontena server/agent changes, the LB stackfiles would just have to be changed to use an interpolated name: SSL_CERT_{{subject}} for each certificate/secret env.

That would be enough to increase the combined limit on SSL certificates from the current 128KB to 2MB with the CoreOS stable default ulimits, and bumping that up even further should just be a matter of adjusting the container stack rlimit (systemd docker.service LimitSTACK= or docker run --ulimit)?

EDIT: the RPC serializer for the new kontena certificate authorize --type tls-sni-01 domain authorization challenge certs also hardcodes the SSL_CERTS name - that would also need fixing.
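
The entrypoint idea above could be sketched like this. It is not the real kontena-loadbalancer entrypoint; `write_cert_envs` and the one-file-per-variable naming are assumptions made for illustration, and the sample value is a placeholder rather than a real certificate.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical sketch: write each SSL_CERT_* env value to its own PEM file.
# The legacy combined SSL_CERTS variable does not match the "SSL_CERT_"
# prefix, so it is left alone here.
def write_cert_envs(env, cert_dir)
  FileUtils.mkdir_p(cert_dir)
  certs = env.select { |name, _value| name.start_with?('SSL_CERT_') }
  certs.map do |name, pem|
    path = File.join(cert_dir, "#{name}.pem")
    # Bundles contain private keys, so keep the file owner-only.
    File.open(path, 'w', 0o600) { |f| f.write(pem) }
    path
  end
end

# Demo against a temporary directory with placeholder data.
Dir.mktmpdir do |dir|
  sample = {
    'SSL_CERT_example_com' => 'placeholder, not a real certificate',
    'SSL_CERTS' => 'legacy combined bundle',
    'PATH' => '/bin'
  }
  paths = write_cert_envs(sample, dir)
  puts paths.map { |p| File.basename(p) }
end
```

In a container, the call would presumably be `write_cert_envs(ENV, '/etc/haproxy/certs')`. Because each certificate bundle lives in its own variable, only the total env size limit applies to the set as a whole.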

Contributor

SpComb commented Oct 26, 2017

Status on this issue in checklist form, because there are several PRs needed to actually fix this:

  • kontena/docs#54 document the SSL_CERTS env size limits

  • kontena/kontena-loadbalancer#28 adds kontena/lb support for multiple SSL_CERT_* envs with separate cert bundles

    This doesn't need changes in Kontena, but docs and LB stacks have to be changed to use name: "SSL_CERT_{{ cert | replace: '.', '_' }}" instead of name: SSL_CERTS

  • #2951 validates the env size limits and should protect the running LB service containers by failing the deploy early if the env size limit is exceeded via SSL_CERTS secrets

  • Issue #2963 is vaguely related to tls-sni-01 challenge certs clogging up the SSL_CERTS env

    • the challenge certs should be exported by Kontena via separate SSL_CERT_* envs
    • #2994 they should be cleaned up automatically by Kontena once they are no longer needed for ACME domain authorizations
    • #2964 there may also need to be some mechanism to clean them up manually if cancelling domain authorizations
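
To illustrate what the interpolated name in the second item expands to, here is the same substitution written in Ruby. The function name `ssl_cert_env_name` and the example domain are made up for this sketch; it only mirrors the `replace: '.', '_'` filter from the stackfile snippet.

```ruby
# Mirrors the stackfile template "SSL_CERT_{{ cert | replace: '.', '_' }}":
# every dot in the certificate's domain becomes an underscore.
def ssl_cert_env_name(domain)
  "SSL_CERT_#{domain.gsub('.', '_')}"
end

puts ssl_cert_env_name('www.example.com') # prints SSL_CERT_www_example_com
```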

Contributor

SpComb commented Nov 22, 2017

Most of the important PRs here are now merged, so this is pretty close to being fixed. Still waiting for:

japsu commented Dec 21, 2017

Hi. @SpComb, any updates on this one? It's been a while since the last activity.

It seems the legacy API used by kontena certificate get has been removed in some point release, and thus our custom auto-renewal is currently broken. We're anxious to get the tls-sni-01 based auto-renewals working, which is currently blocked by this family of bugs.

Contributor

SpComb commented Dec 21, 2017

Closing this as kontena/docs#61 was merged, so the docs are now up to date on the limitations and workarounds around SSL_CERTS: https://kontena.io/docs/using-kontena/loadbalancer#using-kontena-load-balancer-for-ssl-termination

The only remaining limitations are on the number of pending domain authorization challenges (#2964 #3076), but I would estimate that to be at around ~50 domains, and neither successfully authorized domains nor expired challenges count towards that limit.

> It seems the legacy API used by kontena certificate get has been removed in some point release, and thus our custom auto-renewal is currently broken. We're anxious to get the tls-sni-01 based auto-renewals working, which is currently blocked by this family of bugs.

There was a regression in 1.4.1 (#3104) that broke the legacy POST /v1/certificates/... APIs, but that was fixed in 1.4.2 (#3107). If you noticed breakage in 1.4.1, then upgrade to a newer 1.4.x release and try again.

@SpComb SpComb closed this Dec 21, 2017

japsu commented Dec 21, 2017

Cool, thanks! Converted some of our domains to SSL_CERT_*, works fine. I'll keep tracking #2964 #3076 for the rest.
