Docker daemon crashes if fluentd daemon is gone and buffer is full #32567

Open
gasparch opened this Issue Apr 12, 2017 · 14 comments

@gasparch

gasparch commented Apr 12, 2017

Docker daemon crashes if:

  • a container is configured to log to fluentd
  • the fluentd server shuts down (network problem or planned shutdown)
  • the buffer overflows

Steps to reproduce:

  1. Configure Logstash with a fluentd input on some host.
  2. Run:

docker run -d --log-driver=fluentd --log-opt fluentd-address=172.17.X.X:4000 --log-opt tag="test" --log-opt fluentd-buffer-limit=10KB --log-opt fluentd-max-retries=2 --name test1 --rm busybox /bin/sh -c 'yes "crashme" '

Yes, the limits are intentionally low to expose the Docker behavior.

  3. Make sure that Logstash receives messages.
  4. Stop Logstash.
  5. On the Docker host, watch 'docker ps' shows within a matter of seconds that the docker daemon is no longer available.

Describe the results you received:

The dockerd process crashed with a panic:

....repeating lines
time="2017-04-12T17:01:47.079303891Z" level=error msg="Failed to log msg \"crashme\" for logger fluentd: fluent#appendBuffer: Buffer full, limit 10240"
time="2017-04-12T17:01:47.079324142Z" level=error msg="Failed to log msg \"crashme\" for logger fluentd: fluent#appendBuffer: Buffer full, limit 10240"
panic: fluent#reconnect: failed to reconnect!
goroutine 29272 [running]:
panic(0x1630320, 0xc421062010)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent.(*Fluent).reconnect(0xc42096a790)
/root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent/fluent.go:276 +0xf8
created by github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent.(*Fluent).send
/root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/fluent/fluent-logger-golang/fluent/fluent.go:290 +0x136
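
For context, the trace shows the panic coming from a reconnect goroutine spawned by the vendored fluent-logger-golang client's send path, so no caller can recover it. Below is a minimal, self-contained sketch of that failure mode; the struct fields and exact retry accounting are approximations for illustration, not the vendored source.

```go
package main

import (
	"net"
	"time"
)

// Minimal stand-in for the vendored fluent-logger-golang client.
// Field names are approximations for illustration, not the vendored source.
type Fluent struct {
	addr      string
	conn      net.Conn
	maxRetry  int
	retryWait time.Duration
}

func (f *Fluent) connect() error {
	c, err := net.DialTimeout("tcp", f.addr, 3*time.Second)
	if err != nil {
		return err
	}
	f.conn = c
	return nil
}

// reconnect mirrors the behaviour visible in the stack trace: retry the
// connection a bounded number of times, then panic. Because the real client
// runs this in a goroutine spawned by send(), the panic cannot be recovered
// by the caller and takes the whole process (dockerd) down.
func (f *Fluent) reconnect() {
	for i := 0; ; i++ {
		if err := f.connect(); err == nil {
			return // connection re-established; resume flushing the buffer
		}
		if i == f.maxRetry {
			// With --log-opt fluentd-max-retries=2 this is reached within
			// seconds of the destination going away.
			panic("fluent#reconnect: failed to reconnect!")
		}
		time.Sleep(f.retryWait)
	}
}

func main() {
	f := &Fluent{addr: "127.0.0.1:24224", maxRetry: 2, retryWait: 500 * time.Millisecond}
	f.reconnect() // panics if nothing is listening on the address
}
```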

Describe the results you expected:

The container should terminate, but dockerd should stay alive. Otherwise this looks like a very easy DoS: you can crash Docker on any host you have access to.

Additional information you deem important (e.g. issue happens only occasionally):
100% reproducible with settings above.

Output of docker version:

Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 17
 Running: 1
 Paused: 0
 Stopped: 16
Images: 3
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.69 GiB
ID: 5BXG:QZRZ:L2PI:QVJO:LTFG:JUPJ:ZABK:7LBA:D2G7:WR7K:EO65:SPSH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

GCE n1-standard-4 instance

@cpuguy83

Contributor

cpuguy83 commented Apr 12, 2017

It's not crashing because of the buffer overflow; it's crashing because it has tried to reconnect several times, failed, and so is panicking.

Ping @tagomoris

Ideally this should not ever panic.

@cpuguy83

Contributor

cpuguy83 commented Apr 12, 2017

@tagomoris Can we just have fluent continuously try to reconnect with a max wait time?
Once the buffer is full it should just block new writes, no?

@tagomoris

Contributor

tagomoris commented Apr 13, 2017

I wonder how you would handle the trouble when the destination Fluentd goes down, you don't notice it, and the Docker daemon keeps reconnecting forever without crashing. All logs from the containers will disappear; is that OK for you?
IMO, if you can detect the Fluentd outage, you can restart it before the docker daemon hits the retry threshold.

@cpuguy83

Contributor

cpuguy83 commented Apr 13, 2017

@tagomoris It depends on how fluentd is buffering. When the buffer is full, is it lossy or does it block?
Currently docker has a 1MB buffer for each container's stdout/stderr that blocks once it's full.

If the log fails to send (vs. just blocking), the entry will be dropped.
If it blocks, the buffers will start to fill and eventually block the container process from writing to stdout/stderr.

It's definitely undesirable to crash the daemon.
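
To make the "block when full" semantics concrete, here is a minimal sketch of a bounded per-stream buffer whose writer parks once the buffer is full, roughly the behaviour described above for the 1MB stdout/stderr buffer. This is an illustration only, not Docker's actual copier/logger code; all names are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// boundedLogger illustrates "block when full": Log parks until the consumer
// (standing in for a logging driver) drains the buffer.
type boundedLogger struct {
	buf chan string
}

func newBoundedLogger(capacity int) *boundedLogger {
	return &boundedLogger{buf: make(chan string, capacity)}
}

// Log blocks once `capacity` undelivered messages are queued, which in the
// real daemon eventually backs up into the container's stdout/stderr writes.
func (l *boundedLogger) Log(msg string) {
	l.buf <- msg
}

func main() {
	l := newBoundedLogger(3)

	// Slow consumer standing in for a driver that cannot reach its destination.
	go func() {
		for msg := range l.buf {
			time.Sleep(500 * time.Millisecond)
			fmt.Println("delivered:", msg)
		}
	}()

	for i := 0; i < 6; i++ {
		start := time.Now()
		l.Log(fmt.Sprintf("line %d", i))
		fmt.Printf("queued line %d after %v\n", i, time.Since(start)) // later lines visibly block
	}
	time.Sleep(4 * time.Second) // let the consumer drain before exiting
}
```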

@tagomoris

Contributor

tagomoris commented Apr 13, 2017

@cpuguy83 No, it doesn't depend on the Fluentd buffer configuration. Fluentd's forward input plugin will accept messages over the TCP connection even when its internal buffer is full. Of course, those messages will be lost, but that is not related to any client (including the logging driver).
In such cases Fluentd logs a warning about it. It's a problem with Fluentd, not with logging drivers.

But if Fluentd is not running, the logging driver should tell/notify/warn users in some way. For example, containers don't start if they are configured to use the Fluentd logging driver and the destination isn't running.

@gasparch

gasparch commented Apr 13, 2017

For me, the consistent behavior would be to stop the container, exactly the way it does on startup.

It seems that log loss is inevitable if fluentd disappears for a long time.

But in no way should it result in a dockerd crash, because that takes down all containers on the host.

@gasparch

gasparch commented Apr 13, 2017

@tagomoris the problem is that if I have just one container which uses fluentd and there is a problem with fluentd, the whole Docker daemon crashes.

That stops all the other containers which have nothing to do with fluentd and totally contradicts the promise of isolating containers from each other.

@tagomoris

Contributor

tagomoris commented Apr 13, 2017

IIRC, the panic just stopped the container's execution when I implemented this logging driver at first (because there was no docker-daemon involvement at that time).
Anyway, it sounds reasonable that crashing the whole Docker daemon is horrible (for any reason).

So, the ideal steps are:

  1. add an option to fluent-logger-golang to avoid calling panic (and retry forever; see the sketch after this comment)
  2. enable that option in the logging driver

I'll comment on further details for fluent-logger-golang in fluent/fluent-logger-golang#18
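
A rough sketch of what step 1 above could look like: a retry loop that, instead of panicking, either returns an error after a bounded number of attempts or retries forever with a capped backoff when the limit is negative (cf. the -1 suggestion later in this thread). This is an illustration of the idea, not the actual fluent-logger-golang change; the function and parameter names are invented.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// reconnectWithPolicy sketches the "don't panic" option: maxRetry >= 0 gives
// up with an error after that many failed attempts; a negative maxRetry
// (e.g. -1) retries forever with an exponential backoff capped at maxWait.
// Illustration only; not the fluent-logger-golang API.
func reconnectWithPolicy(connect func() error, maxRetry int, baseWait, maxWait time.Duration) error {
	wait := baseWait
	for attempt := 0; ; attempt++ {
		if err := connect(); err == nil {
			return nil
		}
		if maxRetry >= 0 && attempt >= maxRetry {
			// Report the failure instead of panicking, so the logging driver
			// can drop the message and log an error rather than kill dockerd.
			return errors.New("fluent#reconnect: failed to reconnect")
		}
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	attempts := 0
	err := reconnectWithPolicy(func() error {
		attempts++
		if attempts < 5 {
			return errors.New("connection refused")
		}
		return nil // pretend fluentd came back on the 5th attempt
	}, -1, 50*time.Millisecond, 2*time.Second)
	log.Printf("reconnected=%v after %d attempts", err == nil, attempts)
}
```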

@gasparch

gasparch commented Apr 13, 2017

But would this stop the container? Because I see value in the scenario where the container gets stopped/destroyed in the absence of fluentd (say, with a very small buffer) and autohealing then restarts the container somewhere else, thus minimizing the loss of logs.

If it keeps retrying forever, what happens to the container at that moment? Are logs silently discarded? Does the process writing to stdout block?

@cpuguy83

Contributor

cpuguy83 commented Apr 13, 2017

@gasparch If the call to Log() actually fails, the log will be dropped and an error will appear in the Docker daemon logs.

If the call to Log() blocks, the buffers will start filling up until the process can't write to the stream anymore.
There's also support for more of a ring buffer that just buffers until it's full and then starts dropping messages.

So the failure won't be silent.
It may help if, instead of just logging an error, we incremented a failure count in our metrics so it can be tracked more easily.
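
For contrast with the blocking path, here is a minimal sketch of the "buffer until full, then drop" (ring-buffer-style) behaviour mentioned above, including a drop counter in the spirit of the metrics suggestion. Illustration only, not Docker's actual ring logger; which message gets dropped there (oldest vs. newest) is not something this sketch claims to reproduce.

```go
package main

import "fmt"

// ringBuffer never blocks the writer: once full it evicts the oldest entry to
// make room and counts the drop so the loss is observable (cf. the metrics
// idea above).
type ringBuffer struct {
	items    []string
	capacity int
	dropped  int
}

func (r *ringBuffer) Log(msg string) {
	if len(r.items) == r.capacity {
		r.items = r.items[1:] // evict the oldest message
		r.dropped++
	}
	r.items = append(r.items, msg)
}

func main() {
	r := &ringBuffer{capacity: 3}
	for i := 0; i < 7; i++ {
		r.Log(fmt.Sprintf("line %d", i))
	}
	fmt.Println("kept:", r.items, "dropped:", r.dropped)
	// kept: [line 4 line 5 line 6] dropped: 4
}
```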

@gasparch

gasparch commented Apr 13, 2017

Thanks, monitoring Docker's own logs can solve that problem.

Still, I don't know whether I like such an architectural solution, or whether I prefer the software to stop working if one component fails (like Docker does when it cannot connect to fluentd at container creation).
The "no requested logging" => "no running software" approach looks more consistent to me, but Docker's user base may feel totally differently :) so I do not know what is good for the majority of users.

But any solution which does not crash Docker is better than the current one :)

@mirwan

mirwan commented Aug 2, 2017

Hi,
I am currently facing this problem on a multi-tenant Swarm cluster I'm operating.
How has your thinking evolved on this issue?
IMHO, the "no requested logging" => "no running software" approach cannot apply in my case, as each tenant sends its logs to services external to the cluster and I just cannot have the docker daemon crash because of that.
Until a satisfying approach is found, would it be possible to accept a -1 value for the fluentd-max-retries parameter (which is compliant with fluent-logger-golang)? That would let the operator decide the behaviour of the docker daemon in that case.

@gasparch

gasparch commented Aug 3, 2017

Make the retry count ridiculously high and monitor Docker's own logs for logging errors :)

The other way - stop using Docker altogether :P :P :P okay, just switch to another logging driver ;)

@archmangler

archmangler commented Jun 8, 2018

I've run into this on Docker 18+ and had to use these options to stop the docker daemon crashing:

```
"fluentd-buffer-limit": "4g",
"fluentd-max-retries": "144",
"fluentd-retry-wait": 600,
"fluentd-async-connect": "true"
```

Is there a better recommended solution thus far?
