vendor: github.com/fluent/fluent-logger-golang v1.8.0 #42979
Conversation
daemon/logger/fluentd/fluentd.go
```diff
@@ -51,6 +51,7 @@ const (
 	addressKey      = "fluentd-address"
 	asyncKey        = "fluentd-async"
 	asyncConnectKey = "fluentd-async-connect" // deprecated option (use fluentd-async instead)
+	forceStopAsyncSendKey = "force-stop-async-send"
```
This option should probably be prefixed with `fluentd-` (as it's specific to this driver).
Slightly curious though; would there be reasons for someone to disable this? (wondering if it needs to be configurable, or if using the default would be what most/all users would want)
Oops, indeed I forgot the prefix.
Since dockerd could hang when this flag is disabled and fluentd is down, I think it's not really safe to disable it (unless #35716 gets merged and timeout handling is improved in fluent-logger-golang). IMHO most people want to keep this flag enabled, but a few users might want to never lose logs (that's why I made it configurable). However, AFAIK dockerd doesn't log anything when container logs can't be transmitted to fluentd, although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).
From your maintainer's point of view, is it ok if logs are lost when fluentd is down? Would you like to log the container logs that can't be transmitted?
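For context, the error callback added by fluent/fluent-logger-golang#97 looks roughly like this from a caller's perspective. This is a minimal sketch, not Docker's integration: the `AsyncResultCallback` field name and signature are taken from the v1.8.0 `fluent.Config` (worth double-checking against the vendored version), and the address, tag, and message are illustrative.

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	logger, err := fluent.New(fluent.Config{
		FluentHost: "127.0.0.1",
		FluentPort: 24224,
		Async:      true,
		// Invoked when an async send completes; on failure, data is the
		// encoded record that could not be delivered, so the caller can
		// at least log that messages were dropped.
		AsyncResultCallback: func(data []byte, err error) {
			if err != nil {
				log.Printf("fluentd: failed to send %d bytes of logs: %v", len(data), err)
			}
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Close()

	_ = logger.Post("docker.test", map[string]string{"message": "hello"})
}
```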
Hey, I'm from the Fluent community...

> is it ok if logs are lost when fluentd is down?
Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long, so discarding them seems like the right behavior. Is there a way for users to configure how long it will buffer and keep trying to reconnect before it starts just dropping logs? That seems ideal.
> although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).
What would the user experience look like if we used this?
> > although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).
>
> What would the user experience look like if we used this?
It'd send logs to journald, so it'd be possible to see them through `journalctl -u docker.service`.
> Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long. Discarding them seems like the right behavior.
No, logs have nowhere to go, and currently, because of how the exponential back-off retry algorithm is configured, it might block the daemon forever; that's exactly what this PR tries to avoid.
> Is there a way for users to configure how long it will buffer and keep trying to reconnect before it starts just dropping logs? That seems ideal.
Short story: yes, but it's not implemented in Docker, and no, as it's probably broken in sync mode. Long story: Docker's fluentd integration doesn't provide a parameter to set the lib's `WriteTimeout` parameter. Anyway, I believe the `WriteTimeout` mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under some relatively "high" load, deadlines will be pushed back forever such that no deadline is ever reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00s, then no logs will time out. See this comment (this needs proper testing and reporting).
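To illustrate the failure mode described above, here is a minimal sketch of the per-write deadline pattern (not the library's actual code): because `net.Conn.SetWriteDeadline` applies to the whole connection rather than to a single call, each new write pushes the deadline back for any write that is already blocked.

```go
package fluentwrite

import (
	"net"
	"time"
)

// writeWithTimeout mimics a sync-mode send: the write deadline is reset on
// the shared connection before every write. Deadlines are per-connection,
// not per-call, so a write issued now also extends the deadline of any
// in-flight Write on conn. With a 2-minute timeout and at least one write
// every <2m00s, a blocked write never times out.
func writeWithTimeout(conn net.Conn, payload []byte, timeout time.Duration) error {
	if timeout > 0 {
		if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
			return err
		}
	}
	_, err := conn.Write(payload)
	return err
}
```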
Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from docker maintainers about that first (cc @thaJeztah).
> Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from docker maintainers about that first
I agree with this. Sounds like no one would want the current option. I am curious whether some sort of configurable max retry attempts is possible?
If not, an optional "dump logs to the docker journald log stream" sounds like a useful option (default false, though). I think in practice very few of my users would use that, since they probably have a ton of container logs and wouldn't want to risk the docker logs with that volume. If outputting to an optional file, or even falling back to the json-file log driver, were possible... I think that would be an ideal user experience. But I would not put any great urgency on implementing that. Fixing the current bug here is most critical.
The following comment is kind of long and rambly, so I want to be clear that helping me understand is fully optional... I really appreciate the work you are doing here and trust you as a maintainer/owner of this code. I had meant to read your code and try to deeply understand what it was actually doing, but was always busy with other work. Also, I'm not a maintainer here and I don't have the power to block or approve this.
> Docker's fluentd integration doesn't provide a parameter to set the lib's `WriteTimeout` parameter. Anyway, I believe the `WriteTimeout` mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under some relatively "high" load, deadlines will be pushed back forever such that no deadline is ever reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00s, then no logs will time out. See this comment (this needs proper testing and reporting).
Sorry, I don't follow this or the linked comment. To me, it feels like if there's a single connection, and that connection blocks for a while, it should be possible to have a global timeout on that connection. Or is that not how it works? Is there no single connection; is each batch of logs sent by a unique goroutine that can't safely have context on what the others are seeing? And you need to tell Docker to give up at some point, and there's no way to coordinate around that? There's no way to have a "max async connection retries"? Are the only possible solutions to give up quickly or to risk being blocked forever?
So the "Fluent" tool that I maintain is "Fluent Bit". Which is a low level C program that has multi-threading too. At a high level, this is how it works:
Log Producer => Fluent Bit (thing that has multiple threads accepting logs, and multiple threads outputting logs) => destination
And basically the way it works is that on the input side you can have timeouts on connections, and on the output side you can have a configurable max retries. So even though each thread has no global context, it won't retry its data more than N times. And as far as shutting things down goes, you set it up so that you stop your producer and Fluent Bit together. And if something goes wrong, Fluent Bit can't block the log producer from being shut down.
In this case I think we have this diagram:
Container stdout/stderr => Fluentd log driver => Fluentd/Fluent Bit instance
I wonder if it's the same style of engineering problem... and thus that the following should be possible (see the sketch after this list):
- Log driver issues can never stop the container from shutting down. Why does that happen in docker??
- You handle the destination being down with a max retries per chunk/log event
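For reference, fluent-logger-golang already exposes bounded-retry knobs on its config. A minimal sketch follows; the field names are taken from the v1.8.0 `fluent.Config`, while the address and values are illustrative, not what Docker actually sets.

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	logger, err := fluent.New(fluent.Config{
		FluentHost:         "127.0.0.1",
		FluentPort:         24224,
		Async:              true,
		ForceStopAsyncSend: true,  // Close() aborts in-flight sends instead of blocking
		MaxRetry:           10,    // stop reconnecting after 10 attempts
		RetryWait:          500,   // base back-off between attempts, in ms
		MaxRetryWait:       60000, // cap on the exponential back-off, in ms
	})
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Close()
}
```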
(force-pushed from `c0e458e` to `ecf21ac`)
Haven't fully read up on all of the comments, but w.r.t. "sending logs to journald", I think that should definitely be separate from this, and at least not be the default (it's somewhat unexpected that container logs would start to leak to journald). Note that current versions of docker also keep a local buffer for reading back logs if a remote logging driver is used that doesn't support reading back logs (https://docs.docker.com/config/containers/logging/dual-logging/). So there's already a buffer of the last XX logging messages, even if the remote logging driver isn't able to handle them.
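The dual-logging buffer mentioned above is tunable per container through `cache-*` log-opts. A sketch based on the options documented at the link above; the address, sizes, and container command are illustrative:

```console
$ docker run -d --log-driver fluentd \
    --log-opt fluentd-address=tcp://127.0.0.1:24224 \
    --log-opt fluentd-async=true \
    --log-opt cache-max-size=20m \
    --log-opt cache-max-file=5 \
    busybox sh -c 'while true; do echo hello; sleep 1; done'
```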
@akerouanton @thaJeztah Can we get this moving? Lots of folks are blocked on this bug. (In addition to being a Fluent person, I'm also a logging person in general at AWS...) And again, feel free to disregard the lengthy comments I made... helping me understand this is not something I'd ask to have priority; releasing this bug fix should be the priority.

@PettitWesley I'll update this PR this evening 😉
Updates the fluent logger library to v1.8.0. The following PRs/commits were merged since the last bump:

* [Add callback for error handling when using async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.8.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The flag ForceStopAsyncSend was added to the fluent logger lib in v1.5.0 (at that time named AsyncStop) to tell fluentd to abort sending logs asynchronously as soon as possible, when its Close() method is called. However this flag was broken because of the way the lib was handling it (basically, the lib could be stuck in a retry-connect loop without checking this flag).

Since fluent logger lib v1.7.0, calling Close() (when ForceStopAsyncSend is true) will really stop all ongoing send/connect procedures, wherever they're stuck.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
(force-pushed from `ecf21ac` to `bd61629`)
I removed the flag.
LGTM
LGTM, thanks!
@thaJeztah I see this was added to the 21.xx milestone... will this come out in a patch release soon? This is a bug fix; it should come out in the next patch release ASAP.
@PettitWesley Definitely understand this fixes some issues for the built-in fluent logger. I recommend using a plugin with the updated fluent library for the time being.
* When async is enabled, this option defines the interval (ms) at which the connection to the fluentd-address is re-established. This option is useful if the address may resolve to one or more IP addresses, e.g. a Consul service address.

While the change in moby#42979 resolves the issue where a Docker container can be stuck if the fluentd-address is unavailable, this functionality adds an additional benefit in that a new and healthy fluentd-address can be resolved, allowing logs to flow once again.

This adds a `fluentd-async-reconnect-interval` log-opt for the fluentd logging driver.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Conor Evans <coevans@tcd.ie>
Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Co-authored-by: Conor Evans <coevans@tcd.ie>
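A usage sketch of that log-opt; the address is illustrative, and the exact accepted value format (shown here as a Go-style duration) should be verified against the driver documentation:

```console
$ docker run -d --log-driver fluentd \
    --log-opt fluentd-address=tcp://fluentd.service.consul:24224 \
    --log-opt fluentd-async=true \
    --log-opt fluentd-async-reconnect-interval=10s \
    busybox sh -c 'while true; do echo hello; sleep 1; done'
```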
Fixes #40063.
- What I did
Update the fluent logger library to v1.8.0. The following commits/PRs were merged since the last bump:

* [Add callback for error handling when using async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)
- How to verify it
Before this change, running the following commands would lead to docker hanging on the `kill` command:
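The exact commands aren't preserved here; a plausible reproduction, assuming nothing is listening on the configured fluentd-address (the container name, address, and command are illustrative):

```console
$ docker run -d --name repro --log-driver fluentd \
    --log-opt fluentd-async=true \
    --log-opt fluentd-address=tcp://127.0.0.1:24224 \
    busybox sh -c 'while true; do echo hello; sleep 1; done'
$ docker kill repro   # hangs before this fix while the driver retries forever
```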
- Description for the changelog