vendor: github.com/fluent/fluent-logger-golang v1.8.0 #42979

Merged
merged 2 commits into from Dec 2, 2021

Conversation

akerouanton
Member

@akerouanton akerouanton commented Nov 1, 2021

Fixes #40063.

- What I did

Updates the fluent logger library to v1.8.0. The following commits/PRs were merged since the last bump:

- How to verify it

Before this change, running the following commands would lead to docker hanging on the kill command:

$ docker run --rm -it --name test --log-driver=fluentd --log-opt fluentd-address=fluentdhost:24224 --log-opt fluentd-async=true debian /bin/bash -c 'echo test; sleep infinity;'
# In another terminal:
$ docker kill test

- Description for the changelog

  • Forcefully stop the fluentd logger when a container stops and the logger is running in async mode.

@@ -51,6 +51,7 @@ const (
 	addressKey      = "fluentd-address"
 	asyncKey        = "fluentd-async"
 	asyncConnectKey = "fluentd-async-connect" // deprecated option (use fluentd-async instead)
+	forceStopAsyncSendKey = "force-stop-async-send"
Member

This option should probably be prefixed with fluentd- (as it's specific to this driver)

Member

Slightly curious though; would there be reasons for someone to disable this? (wondering if it needs to be configurable, or if using the default would be what most/all users would want)

Member Author

Oops, indeed I forgot the prefix.

Since dockerd could hang when this flag is disabled and fluentd is down, I think it's not really safe to disable it (unless #35716 gets merged and timeout handling is improved in fluent-logger-golang). IMHO most people want to keep this flag enabled, but a few users might never want to lose logs (that's why I made it configurable). However, AFAIK dockerd doesn't log anything when container logs can't be transmitted to fluentd, although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

From your point of view as maintainers, is it ok if logs are lost when fluentd is down? Would you like the container logs that can't be transmitted to be logged somewhere?

Contributor

Hey I'm from the Fluent community...

is it ok if logs are lost when fluentd is down?

Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long. Discarding them seems like the right behavior. Is there a way for users to configure how long it will buffer and keep trying to re-connect before it starts dropping logs? That seems ideal.

although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

What would the user experience look like if we used this?

Member Author

although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

What would the user experience look like if we used this?

It'd send those logs to journald, so it'd be possible to see them through journalctl -u docker.service.
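For concreteness, here is a rough sketch of how such a callback could be wired so that undeliverable messages end up in dockerd's own log (and hence journald when the daemon runs under systemd). This is not the moby code, and the AsyncResultCallback field name/signature is my reading of fluent/fluent-logger-golang#97, so treat it as an assumption:

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	f, err := fluent.New(fluent.Config{
		FluentHost: "fluentdhost",
		FluentPort: 24224,
		Async:      true,
		// Assumed hook from fluent/fluent-logger-golang#97: invoked when an
		// async message ultimately cannot be delivered. dockerd could forward
		// this to its own logger, which lands in journald when the daemon
		// runs under systemd.
		AsyncResultCallback: func(data []byte, err error) {
			if err != nil {
				log.Printf("fluentd: dropping %d bytes of container logs: %v", len(data), err)
			}
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	_ = f.Post("docker.test", map[string]string{"message": "hello"})
}
```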

Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long. Discarding them seems like the right behavior.

No, logs have nowhere to go and, currently, because of how the exponential back-off retry algorithm is configured, it might block the daemon forever; that's exactly what this PR tries to avoid.

Is there a way for users to configure how long it will buffer and keep trying to re-connect before it starts dropping logs? That seems ideal.

Short answer: yes, but it's not implemented in Docker; and no, as it's probably broken in sync mode. Long answer: Docker's fluentd integration doesn't expose a parameter to set the library's WriteTimeout. In any case, I believe the WriteTimeout mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under a relatively "high" load, deadlines will be pushed back forever and never reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00, then no logs will ever time out. See this comment (this needs proper testing and reporting).
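To make the deadline behaviour concrete, here is a standalone illustration. It is not code from the library; it only relies on standard net.Conn semantics, where a write deadline applies to the whole connection, so re-arming it before each new write also postpones a write that is already blocked:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, _ := net.Pipe() // nobody reads the other end, so writes block
	const writeTimeout = 200 * time.Millisecond
	start := time.Now()

	go func() {
		conn.SetWriteDeadline(time.Now().Add(writeTimeout))
		_, err := conn.Write([]byte("first log"))
		// With the resets below, this prints after ~600ms, not ~200ms.
		fmt.Printf("first write returned after %v: %v\n", time.Since(start).Round(10*time.Millisecond), err)
	}()

	// Mimic newer writes arriving every 100ms: re-arming the shared deadline
	// before each one keeps pushing back the deadline of the blocked write.
	for i := 0; i < 4; i++ {
		time.Sleep(100 * time.Millisecond)
		conn.SetWriteDeadline(time.Now().Add(writeTimeout))
	}

	time.Sleep(time.Second) // let the blocked write finally time out
}
```

Running this, the first write only fails after roughly three times its configured timeout, because the later deadline resets keep pushing it back.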


Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from Docker maintainers about that first (cc @thaJeztah).

Contributor

Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from Docker maintainers about that first

I agree with this. It sounds like no one would want the current option. I am curious whether some sort of configurable maximum number of retry attempts is possible?

If not, an option to dump logs to the Docker journald log stream sounds useful (default false, though). I think in practice very few of my users would use it, since they probably have a ton of container logs and wouldn't want to flood the Docker logs with that volume. If outputting to an optional file, or even falling back to the json-file log driver, were possible... I think that would be an ideal user experience. But I would not put any great urgency on implementing that. Fixing the current bug here is most critical.

Contributor

The following comment is kind of long and rambly, so I want to be clear that helping me understand is fully optional... I really appreciate the work you are doing here and trust you as a maintainer/owner of this code. I had meant to read your code and try to deeply understand what it was actually doing, but was always busy with other work. I'm not a maintainer here and I don't have the power to block or approve this, so again, helping me understand is fully optional.

Docker's fluentd integration doesn't expose a parameter to set the library's WriteTimeout. In any case, I believe the WriteTimeout mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under a relatively "high" load, deadlines will be pushed back forever and never reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00, then no logs will ever time out. See this comment (this needs proper testing and reporting).

Sorry, I don't follow this or the linked comment. To me, it feels like if there's a single connection, and that connection blocks for a while, it should be possible to have a global timeout on that connection. Or is that not how it works? Is there no single connection, with each batch of logs sent by a unique goroutine that can't safely have context on what the others are doing? And you need to tell Docker to give up at some point, and there's no way to coordinate around that? There's no way to have a "max async connection retries"? Are the only possible solutions to give up quickly, or to risk being blocked forever?

So the "Fluent" tool that I maintain is "Fluent Bit". Which is a low level C program that has multi-threading too. At a high level, this is how it works:

Log Producer => Fluent Bit (thing that has multiple threads accepting logs, and multiple threads outputting logs) => destination

And basically the way it works is that on the input side you can have timeouts on connections, and on the output side you can have a configurable maximum number of retries. So even though each thread has no global context, it won't retry its data more than N times. As far as shutting things down goes, you set it up so that you stop your producer and Fluent Bit together. And if something goes wrong, Fluent Bit can't block the log producer from being shut down.

In this case I think we have this diagram:

Container stdout/stderr => Fluentd log driver => Fluentd/Fluent Bit instance

I wonder if it's the same style of engineering problem, and thus whether the following should be possible:

  1. Log driver issues can never stop the container from shutting down. Why does that happen in Docker??
  2. You handle the destination being down with a maximum number of retries per chunk/log event (see the sketch below).
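On point 2, here is a generic sketch of a bounded-retry-per-chunk send loop. It is plain Go for illustration only, not code from the fluentd driver or Fluent Bit, and the send function is a stand-in for whatever actually ships a chunk:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errDestinationDown = errors.New("destination unreachable")

// send stands in for whatever actually ships a chunk to the destination.
func send(chunk []byte) error { return errDestinationDown }

// shipWithRetryLimit retries a chunk with exponential backoff at most
// maxRetries times, then gives up, so a dead destination can never block
// the producer forever.
func shipWithRetryLimit(chunk []byte, maxRetries int, baseWait time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = send(chunk); err == nil {
			return nil
		}
		time.Sleep(baseWait << attempt)
	}
	return fmt.Errorf("dropping chunk after %d retries: %w", maxRetries, err)
}

func main() {
	if err := shipWithRetryLimit([]byte("log line"), 3, 10*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```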

@thaJeztah
Member

Haven't fully read up on all of the comments, but w.r.t. "sending logs to journald", I think that should definitely be separate from this, and at least not be the default (it's somewhat unexpected that container logs would start to leak to journald).

Note that current versions of docker also keep a local buffer for reading back logs if a remote logging driver is used that doesn't support reading back logs (https://docs.docker.com/config/containers/logging/dual-logging/). So there's already a buffer of the last XX logging messages, even if the remote logging driver isn't able to handle them.

@PettitWesley
Contributor

@akerouanton @thaJeztah Can we get this moving? Lots of folks are blocked on this bug. (In addition to being a Fluent person, I'm also a logging person in general at AWS...)

And again, feel free to disregard the lengthy comments I made... helping me understand this is not something I'd ask to have priority; releasing this bug fix should be the priority.

@akerouanton
Member Author

@PettitWesley I'll update this PR this evening 😉

Updates the fluent logger library to v1.8.0. The following PRs/commits were
merged since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.8.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The flag ForceStopAsyncSend was added to the fluent logger lib in v1.5.0 (at
that time named AsyncStop) to tell the logger to abort sending logs
asynchronously as soon as possible when its Close() method is called.
However this flag was broken because of the way the lib handled it
(basically, the lib could be stuck in a retry-connect loop without
checking this flag).

Since fluent logger lib v1.7.0, calling Close() (when ForceStopAsyncSend
is true) really stops all ongoing send/connect procedures, wherever
they're stuck.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton changed the title vendor: github.com/fluent/fluent-logger-golang v1.7.0 vendor: github.com/fluent/fluent-logger-golang v1.8.0 Dec 2, 2021
@akerouanton
Member Author

akerouanton commented Dec 2, 2021

I removed the flag force-stop-async-send (it's now automatically true if fluentd-async-connect is true; a sketch of that wiring follows the list below) and I bumped fluent-logger-golang to v1.8.0, the latest version, released 3 days ago. This new version:

  1. Fixes a bug related to API consistency when running in sync mode from multiple threads (which is not how Docker uses it), see Do not allow writing events after fluent.Close();
  2. Introduces support for TLS connections, see TLS Support. However, this change only supports self-signed CA certs by passing the TlsInsecureSkipVerify parameter, so I don't think it's mature enough to advertise to Docker users.
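For reference, here is a minimal sketch of what the new wiring amounts to on the library side. It is an illustration, not the actual moby driver code, and it assumes only the Config fields named in this thread (Async, ForceStopAsyncSend):

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func newFluentLogger(host string, port int, async bool) (*fluent.Fluent, error) {
	return fluent.New(fluent.Config{
		FluentHost: host,
		FluentPort: port,
		Async:      async,
		// With the dedicated log-opt removed, force-stopping is simply tied
		// to async mode: Close() then aborts any pending send/connect loop
		// instead of blocking container shutdown.
		ForceStopAsyncSend: async,
	})
}

func main() {
	f, err := newFluentLogger("fluentdhost", 24224, true)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close() // returns promptly even if fluentd is unreachable

	_ = f.Post("docker.test", map[string]string{"message": "hello"})
}
```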

Member

@cpuguy83 cpuguy83 left a comment

LGTM

Member

@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah thaJeztah merged commit f6848ae into moby:master Dec 2, 2021
@thaJeztah thaJeztah added this to the 21.xx milestone Dec 2, 2021
@PettitWesley
Contributor

@thaJeztah I see this was added to the 21.xx milestone... will this come out in a patch release soon? This is a bug fix; it should come out in the next patch release ASAP.

@cpuguy83
Member

cpuguy83 commented Dec 3, 2021

@PettitWesley Definitely understand this fixes some issues for the built-in fluent logger.
We did discuss this today but didn't really decide on what to do since it's not just a bug fix.

I recommend using a logging plugin with the updated fluent library for the time being.

@PettitWesley
Contributor

@cpuguy83 Thanks for the quick response! Once you make a decision, if you can update here or on the original issue, we'd highly appreciate that! #40063

In the meantime, I will look into seeing how I can get a custom patch build of this out to AWS customers.

@akerouanton akerouanton deleted the bump-fluent-logger branch December 5, 2021 20:14
evol262 pushed a commit to evol262/moby that referenced this pull request Jan 12, 2022
 * When async is enabled, this option defines the interval (ms) at which the connection
to the fluentd-address is re-established. This option is useful if the address
may resolve to one or more IP addresses, e.g. a Consul service address.

While the change in moby#42979 resolves the issue where a Docker container can be stuck
if the fluentd-address is unavailable, this functionality adds an additional benefit
in that a new and healthy fluentd-address can be resolved, allowing logs to flow once again.

This adds a `fluentd-async-reconnect-interval` log-opt for the fluentd logging driver.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Conor Evans <coevans@tcd.ie>

Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Co-authored-by: Conor Evans <coevans@tcd.ie>
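For illustration, here is a hedged sketch of how such a log-opt could map onto the fluent client configuration. The AsyncReconnectInterval field name and its millisecond unit are assumptions inferred from the commit message above, not verified against the library source:

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	f, err := fluent.New(fluent.Config{
		FluentHost: "fluentd.service.consul",
		FluentPort: 24224,
		Async:      true,
		// Assumed field: re-resolve and re-establish the connection to the
		// fluentd address every 30s so a DNS name (e.g. a Consul service)
		// can start pointing at a new, healthy instance.
		AsyncReconnectInterval: 30000, // milliseconds (assumed unit)
	})
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
}
```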