vendor: github.com/fluent/fluent-logger-golang v1.8.0 #42979

Merged
merged 2 commits into from Dec 2, 2021

Conversation

akerouanton
Member

@akerouanton akerouanton commented Nov 1, 2021

Fixes #40063.

- What I did

Updates the fluent logger library to v1.8.0. The following commits/PRs were merged since the last bump:

- How to verify it

Before this change, running the following commands would lead to docker hanging on the kill command:

$ docker run --rm -it --name test --log-driver=fluentd --log-opt fluentd-address=fluentdhost:24224 --log-opt fluentd-async=true debian /bin/bash -c 'echo test; sleep infinity;'
# In another terminal:
$ docker kill test

- Description for the changelog

  • Forcefully stop the fluentd logger when a container stops and the logger is running in async mode.

@@ -51,6 +51,7 @@ const (
 	addressKey      = "fluentd-address"
 	asyncKey        = "fluentd-async"
 	asyncConnectKey = "fluentd-async-connect" // deprecated option (use fluentd-async instead)
+	forceStopAsyncSendKey = "force-stop-async-send"
Member

This option should probably be prefixed with fluentd- (as it's specific to this driver)

Member

Slightly curious though; would there be reasons for someone to disable this? (wondering if it needs to be configurable, or if using the default would be what most/all users would want)

Member Author

Oops, indeed I forgot the prefix.

Since dockerd could hang when this flag is disabled and fluentd is down, I think it's not really safe to disable it (unless #35716 gets merged and timeout handling is improved in fluent-logger-golang). IMHO most people want to keep this flag enabled, but a few users might never want to lose logs (that's why I made it configurable). However, AFAIK dockerd doesn't log anything when container logs can't be transmitted to fluentd, although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

From your point of view as maintainers, is it ok if logs are lost when fluentd is down? Would you like the container logs that can't be transmitted to be logged somewhere?

Contributor

Hey I'm from the Fluent community...

is it ok if logs are lost when fluentd is down?

Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long. Discarding them seems like the right behavior. Is there a way for users to configure how long it will buffer and keep trying to re-connect before it starts dropping logs? That seems ideal.

although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

What would the user experience look like if we used this?

Member Author

although this could be fixed by leveraging fluent/fluent-logger-golang#97 (which is vendored in this PR).

What would the user experience look like if we used this?

It'd send those logs to journald, so it'd be possible to see them through journalctl -u docker.service.
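For concreteness, here is a rough sketch of how such a callback could be wired so that undeliverable messages end up in dockerd's own log (and hence journald when the daemon runs under systemd). This is not the moby code, and the AsyncResultCallback field name/signature is my reading of fluent/fluent-logger-golang#97, so treat it as an assumption:

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	f, err := fluent.New(fluent.Config{
		FluentHost: "fluentdhost",
		FluentPort: 24224,
		Async:      true,
		// Assumed hook from fluent/fluent-logger-golang#97: invoked when an
		// async message ultimately cannot be delivered. dockerd could forward
		// this to its own logger, which lands in journald when the daemon
		// runs under systemd.
		AsyncResultCallback: func(data []byte, err error) {
			if err != nil {
				log.Printf("fluentd: dropping %d bytes of container logs: %v", len(data), err)
			}
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	_ = f.Post("docker.test", map[string]string{"message": "hello"})
}
```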

Is there any other option? If Fluentd/bit is down, then the logs have nowhere to go. You wouldn't want to keep buffering them for too long. Discarding them seems like the right behavior.

No, logs have nowhere to go and, currently, because of how the exponential back-off retry algorithm is configured, it might block the daemon forever; that's exactly what this PR tries to avoid.

Is there a way for users to configure how long it will buffer and keep trying to re-connect before it starts dropping logs? That seems ideal.

Short answer: yes, but it's not implemented in Docker; and no, as it's probably broken in sync mode. Long answer: Docker's fluentd integration doesn't expose a parameter to set the library's WriteTimeout. In any case, I believe the WriteTimeout mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under a relatively "high" load, deadlines will be pushed back forever and never reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00, then no logs will ever time out. See this comment (this needs proper testing and reporting).
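To make the deadline behaviour concrete, here is a standalone illustration. It is not code from the library; it only relies on standard net.Conn semantics, where a write deadline applies to the whole connection, so re-arming it before each new write also postpones a write that is already blocked:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, _ := net.Pipe() // nobody reads the other end, so writes block
	const writeTimeout = 200 * time.Millisecond
	start := time.Now()

	go func() {
		conn.SetWriteDeadline(time.Now().Add(writeTimeout))
		_, err := conn.Write([]byte("first log"))
		// With the resets below, this prints after ~600ms, not ~200ms.
		fmt.Printf("first write returned after %v: %v\n", time.Since(start).Round(10*time.Millisecond), err)
	}()

	// Mimic newer writes arriving every 100ms: re-arming the shared deadline
	// before each one keeps pushing back the deadline of the blocked write.
	for i := 0; i < 4; i++ {
		time.Sleep(100 * time.Millisecond)
		conn.SetWriteDeadline(time.Now().Add(writeTimeout))
	}

	time.Sleep(time.Second) // let the blocked write finally time out
}
```

Running this, the first write only fails after roughly three times its configured timeout, because the later deadline resets keep pushing it back.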


Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from Docker maintainers about that first (cc @thaJeztah).

Contributor

Seeing your comments, I believe the best option is to remove this flag. If forwarding to journald the logs that can't be sent to fluentd is a desirable feature, I should open a new issue/PR. But I'd prefer to receive some guidance from Docker maintainers about that first

I agree with this. It sounds like no one would want the current option. I am curious whether some sort of configurable maximum number of retry attempts is possible?

If not, an option to dump logs to the Docker journald log stream sounds useful (default false, though). I think in practice very few of my users would use it, since they probably have a ton of container logs and wouldn't want to flood the Docker logs with that volume. If outputting to an optional file, or even falling back to the json-file log driver, were possible... I think that would be an ideal user experience. But I would not put any great urgency on implementing that. Fixing the current bug here is most critical.

Contributor

The following comment is kind of long and rambly, so I want to be clear that helping me understand is fully optional... I really appreciate the work you are doing here and trust you as a maintainer/owner of this code. I had meant to read your code and try to deeply understand what it was actually doing, but was always busy with other work. I'm not a maintainer here and I don't have the power to block or approve this, so again, helping me understand is fully optional.

Docker's fluentd integration doesn't expose a parameter to set the library's WriteTimeout. In any case, I believe the WriteTimeout mechanism is broken in sync mode under some circumstances: e.g. trying to write a new log will change the write deadline of older logs, so under a relatively "high" load, deadlines will be pushed back forever and never reached. To be more explicit, if the write timeout is set to 2 minutes and at least one log is written every <2m00, then no logs will ever time out. See this comment (this needs proper testing and reporting).

Sorry, I don't follow this or the linked comment. To me, it feels like if there's a single connection, and that connection blocks for a while, it should be possible to have a global timeout on that connection. Or is that not how it works? Is there no single connection, with each batch of logs sent by a unique goroutine that can't safely have context on what the others are doing? And you need to tell Docker to give up at some point, and there's no way to coordinate around that? There's no way to have a "max async connection retries"? Are the only possible solutions to give up quickly, or to risk being blocked forever?

So the "Fluent" tool that I maintain is "Fluent Bit". Which is a low level C program that has multi-threading too. At a high level, this is how it works:

Log Producer => Fluent Bit (thing that has multiple threads accepting logs, and multiple threads outputting logs) => destination

And basically the way it works is that on the input side you can have timeouts on connections, and on the output side you can have a configurable maximum number of retries. So even though each thread has no global context, it won't retry its data more than N times. As far as shutting things down goes, you set it up so that you stop your producer and Fluent Bit together. And if something goes wrong, Fluent Bit can't block the log producer from being shut down.

In this case I think we have this diagram:

Container stdout/stderr => Fluentd log driver => Fluentd/Fluent Bit instance

I wonder if it's the same style of engineering problem, and thus whether the following should be possible:

  1. Log driver issues can never stop the container from shutting down. Why does that happen in Docker??
  2. You handle the destination being down with a maximum number of retries per chunk/log event (see the sketch below).
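On point 2, here is a generic sketch of a bounded-retry-per-chunk send loop. It is plain Go for illustration only, not code from the fluentd driver or Fluent Bit, and the send function is a stand-in for whatever actually ships a chunk:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errDestinationDown = errors.New("destination unreachable")

// send stands in for whatever actually ships a chunk to the destination.
func send(chunk []byte) error { return errDestinationDown }

// shipWithRetryLimit retries a chunk with exponential backoff at most
// maxRetries times, then gives up, so a dead destination can never block
// the producer forever.
func shipWithRetryLimit(chunk []byte, maxRetries int, baseWait time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = send(chunk); err == nil {
			return nil
		}
		time.Sleep(baseWait << attempt)
	}
	return fmt.Errorf("dropping chunk after %d retries: %w", maxRetries, err)
}

func main() {
	if err := shipWithRetryLimit([]byte("log line"), 3, 10*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```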

@thaJeztah
Member

Haven't fully read up on all of the comments, but w.r.t. "sending logs to journald", I think that should definitely be separate from this, and at least not be the default (it's somewhat unexpected that container logs would start to leak to journald).

Note that current versions of docker also keep a local buffer for reading back logs if a remote logging driver is used that doesn't support reading back logs (https://docs.docker.com/config/containers/logging/dual-logging/). So there's already a buffer of the last XX logging messages, even if the remote logging driver isn't able to handle them.

@PettitWesley
Contributor

@akerouanton @thaJeztah Can we get this moving? Lots of folks are blocked on this bug. (In addition to being a Fluent person, I'm also a logging person in general at AWS...)

And again, feel free to disregard the lengthy comments I made... helping me understand this is not something I'd ask to have priority; releasing this bug fix should be the priority.

@akerouanton
Member Author

@PettitWesley I'll update this PR this evening 😉

Updates the fluent logger library to v1.8.0. The following PRs/commits were
merged since the last bump:

* [Add callback for error handling when using
  async](fluent/fluent-logger-golang#97)
* [Fix panic when accessing unexported struct
  field](fluent/fluent-logger-golang#99)
* [Properly stop logger during (re)connect
  failure](fluent/fluent-logger-golang#82)
* [Support a TLS-enabled connection](fluent/fluent-logger-golang@e5d6aa1)

See https://github.com/fluent/fluent-logger-golang/compare/v1.6.1..v1.8.0

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The flag ForceStopAsyncSend was added to the fluent logger lib in v1.5.0 (at
that time named AsyncStop) to tell the logger to abort sending logs
asynchronously as soon as possible when its Close() method is called.
However this flag was broken because of the way the lib handled it
(basically, the lib could be stuck in a retry-connect loop without
checking this flag).

Since fluent logger lib v1.7.0, calling Close() (when ForceStopAsyncSend
is true) really stops all ongoing send/connect procedures, wherever
they're stuck.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton changed the title vendor: github.com/fluent/fluent-logger-golang v1.7.0 vendor: github.com/fluent/fluent-logger-golang v1.8.0 Dec 2, 2021
@akerouanton
Member Author

akerouanton commented Dec 2, 2021

I removed the flag force-stop-async-send (it's now automatically true if fluentd-async-connect is true; a sketch of that wiring follows the list below) and I bumped fluent-logger-golang to v1.8.0, the latest version, released 3 days ago. This new version:

  1. Fixes a bug related to API consistency when running in sync mode from multiple threads (which is not how Docker uses it), see Do not allow writing events after fluent.Close();
  2. Introduces support for TLS connections, see TLS Support. However, this change only supports self-signed CA certs by passing the TlsInsecureSkipVerify parameter, so I don't think it's mature enough to advertise to Docker users.
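For reference, here is a minimal sketch of what the new wiring amounts to on the library side. It is an illustration, not the actual moby driver code, and it assumes only the Config fields named in this thread (Async, ForceStopAsyncSend):

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func newFluentLogger(host string, port int, async bool) (*fluent.Fluent, error) {
	return fluent.New(fluent.Config{
		FluentHost: host,
		FluentPort: port,
		Async:      async,
		// With the dedicated log-opt removed, force-stopping is simply tied
		// to async mode: Close() then aborts any pending send/connect loop
		// instead of blocking container shutdown.
		ForceStopAsyncSend: async,
	})
}

func main() {
	f, err := newFluentLogger("fluentdhost", 24224, true)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close() // returns promptly even if fluentd is unreachable

	_ = f.Post("docker.test", map[string]string{"message": "hello"})
}
```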

Member

@cpuguy83 cpuguy83 left a comment

LGTM

Member

@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah thaJeztah merged commit f6848ae into moby:master Dec 2, 2021
@thaJeztah thaJeztah added this to the 21.xx milestone Dec 2, 2021
@PettitWesley
Contributor

@thaJeztah I see this was added to the 21.xx milestone... will this come out in a patch release soon? This is a bug fix; it should come out in the next patch release ASAP.

@cpuguy83
Member

cpuguy83 commented Dec 3, 2021

@PettitWesley Definitely understand this fixes some issues for the built-in fluent logger.
We did discuss this today but didn't really decide on what to do since it's not just a bug fix.

I recommend using a logging plugin with the updated fluent library for the time being.

@PettitWesley
Contributor

@cpuguy83 Thanks for the quick response! Once you make a decision, if you can update here or on the original issue, we'd highly appreciate that! #40063

In the meantime, I will look into seeing how I can get a custom patch build of this out to AWS customers.

@akerouanton akerouanton deleted the bump-fluent-logger branch December 5, 2021 20:14
evol262 pushed a commit to evol262/moby that referenced this pull request Jan 12, 2022
 * When async is enabled, this option defines the interval (ms) at which the connection
to the fluentd-address is re-established. This option is useful if the address
may resolve to one or more IP addresses, e.g. a Consul service address.

While the change in moby#42979 resolves the issue where a Docker container can be stuck
if the fluentd-address is unavailable, this functionality adds an additional benefit
in that a new and healthy fluentd-address can be resolved, allowing logs to flow once again.

This adds a `fluentd-async-reconnect-interval` log-opt for the fluentd logging driver.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Conor Evans <coevans@tcd.ie>

Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Co-authored-by: Conor Evans <coevans@tcd.ie>
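For illustration, here is a hedged sketch of how such a log-opt could map onto the fluent client configuration. The AsyncReconnectInterval field name and its millisecond unit are assumptions inferred from the commit message above, not verified against the library source:

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	f, err := fluent.New(fluent.Config{
		FluentHost: "fluentd.service.consul",
		FluentPort: 24224,
		Async:      true,
		// Assumed field: re-resolve and re-establish the connection to the
		// fluentd address every 30s so a DNS name (e.g. a Consul service)
		// can start pointing at a new, healthy instance.
		AsyncReconnectInterval: 30000, // milliseconds (assumed unit)
	})
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
}
```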