Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[25.0 backport] logger/journald: fix tailing logs with systemd 255 #47243

Merged

Conversation

corhere
Copy link
Contributor

@corhere corhere commented Jan 29, 2024

- What I did

  • Made journald read tests compatible with systemd version 255
  • Made journald reader compatible with systemd version 255

- How I did it
By fixing the journald reader to no longer rely upon an invalid assumption about the sd-journal reader API. See the individual commit messages for more details.

- How to verify it
The journald reader tests pass in the dev container / CI (systemd 253) and in an archlinux container with systemd 255 installed.

- Description for the changelog

  • Fixed an issue with the journald log driver which prevented container logs from being followed correctly in certain circumstances when used with systemd version 255.

- A picture of a cute animal (not mandatory but encouraged)

Following logs with a non-negative tail when the container log is empty
is broken on the journald driver when used with systemd 255. Add tests
which cover this edge case to our loggertest suite.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 9315680)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Synthesize a boot ID for journal entries fed into
systemd-journal-remote, as required by systemd 255.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 71bfffd)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The journald reader uses a timer to set an upper bound on how long to
wait for the final log message of a stopped container. However, the
timer channel is only received from in non-blocking select statements!
There isn't enough benefit of using a timer to offset the cost of having
to manage the timer resource. Setting a deadline and comparing the
current time is just as effective, without having to manage the
lifecycle of any runtime resources.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit e94ec80)
Signed-off-by: Cory Snider <csnider@mirantis.com>
While it doesn't really matter if the reader waits for an extra
arbitrary period beyond an arbitrary hardcoded timeout, it's also
trivial and cheap to implement, and nice to have.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d70fe88)
Signed-off-by: Cory Snider <csnider@mirantis.com>
errDrainDone is a sentinel error which is never supposed to escape the
package. Consequently, it needs to be filtered out of returns all over
the place, adding boilerplate. Forgetting to filter out these errors
would be a logic bug which the compiler would not help us catch. Replace
it with boolean multi-valued returns as they can't be accidentally
ignored or propagated.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 905477c)
Signed-off-by: Cory Snider <csnider@mirantis.com>
@thaJeztah
Copy link
Member

Moving this to draft, as there's a follow-up that should be included with this;

@thaJeztah thaJeztah marked this pull request as draft January 31, 2024 09:00
@thaJeztah thaJeztah modified the milestones: 25.0.2, 25.0.3 Jan 31, 2024
The Go race detector was detecting a data race when running the
TestLogRead/Follow/Concurrent test against the journald logging driver.
The race was in the test harness, specifically syncLogger. The waitOn
field would be reassigned each time a log entry is sent to the journal,
which is not concurrency-safe. Make it concurrency-safe using the same
patterns that are used in the log follower implementation to synchronize
with the logger.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 982e777)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Writing the systemd-journal-remote command output directly to os.Stdout
and os.Stderr makes it nearly impossible to tell which test case the
output is related to when the tests are not run in verbose mode. Extend
the journald sender fake to redirect output to the test log so they
interleave with the rest of the test output.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 5792bf7)
Signed-off-by: Cory Snider <csnider@mirantis.com>
- Check the return value when logging messages
- Log the stream (stdout/stderr) and list of messages that were not read
- Wait until the logger is closed before returning early (panic/fatal)

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 39c5c16)
Signed-off-by: Cory Snider <csnider@mirantis.com>
The journald reader test harness injects an artificial asynchronous
delay into the logging pipeline: a logged message won't be written to
the journal until at least 150ms after the Log() call returns. If a test
returns while log messages are still in flight to be written, the logs
may attempt to be written after the TempDir has been cleaned up, leading
to spurious errors.

The logger read tests which interleave writing and reading have to
include explicit synchronization points to work reliably with this delay
in place. On the other hand, tests should not be required to sync the
logger explicitly before returning. Override the Close() method in the
test harness wrapper to wait for in-flight logs to be flushed to disk.

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit d53b7d7)
Signed-off-by: Cory Snider <csnider@mirantis.com>
If a reader has caught up to the logger and is waiting for the next
message, it should stop waiting when the logger is closed. Otherwise
the reader will unnecessarily wait the full closedDrainTimeout for no
log messages to arrive.

This case was overlooked when the journald reader was recently
overhauled to be compatible with systemd 255, and the reader tests only
failed when a logical race happened to settle in such a way to exercise
the bugged code path. It was only after implicit flushing on close was
added to the journald test harness that the Follow tests would
repeatably fail due to this bug. (No new regression tests are needed.)

Signed-off-by: Cory Snider <csnider@mirantis.com>
(cherry picked from commit 987fe37)
Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere marked this pull request as ready for review February 3, 2024 00:12
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@thaJeztah thaJeztah merged commit 6cf694f into moby:25.0 Feb 3, 2024
126 checks passed
@corhere corhere deleted the backport-25.0/fix-journald-logs-systemd-255 branch February 8, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants