Upgrade to 20.10 breaks log tailing: unexpected EOF errors #41820
Comments
|
Related docker/for-mac#5145 |
|
I can also confirm this is happening. It worked fine before the upgrade of the Docker engine from 19 to 20. I have multiple containers (about 7 running), 3 of which produce generous output. Usually, logging fails with the EOF error for one or two containers during integration testing. The dockerd logs are filled with the following: Best regards, |
|
Hello! This issue is also affecting me on docker-for-mac version 3.0.3, with a very verbose container. I'm seeing the same error logs as in the other reports. I can currently reproduce this consistently, so if there's extra information I can provide, just ask me! Thank you, |
|
Any chance we can get a sample of the log file in question? This would be under |
|
Good morning @cpuguy83. I tried to isolate a sample of the file where the error happens, but it's always a different part. I extracted the -json.log file and loaded every line as JSON, and they're all fine, which means the content being written to this file is probably not the cause. My very wild guess is that there is a race condition between the writer and the reader of this file/stream, but that's just a guess. I noticed in the dockerd.log that I periodically get a single line like I can't really tell if it's the same container that causes the problem, but this appears alone and doesn't cause any problem. This leads me to believe that there is a resilient mechanism for handling the race condition, if that's the problem. But sometimes it just goes into a loop and retries up to 20000 times. This whole process takes ~2 seconds, during which the container still produces logs. Here's the output of one failure from the And here are the corresponding lines in the -json.log At the time of this failure, the -json.log file was 35 MB long |
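For anyone who wants to repeat that per-line check, here is a minimal sketch (not from the original report) that reads a container's json-file log and flags any line that does not parse as JSON; the path argument is assumed to be a <container-id>-json.log file under /var/lib/docker/containers/:

```go
// validate-jsonlog: checks that every line of a container's -json.log file
// is well-formed JSON. Usage: go run main.go /var/lib/docker/containers/<id>/<id>-json.log
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	// Allow very long log lines (the default token limit is 64 KB).
	scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)

	lineNo, bad := 0, 0
	for scanner.Scan() {
		lineNo++
		var entry map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			bad++
			fmt.Printf("line %d: invalid JSON: %v\n", lineNo, err)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("checked %d lines, %d invalid\n", lineNo, bad)
}
```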
|
Indeed we do handle the case where there can be a race between reading/writing a json entry. Working on getting more debugging in here (at least adding container ID to these error logs from the daemon). Do you have log rotation or anything else enabled? |
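For readers following along, below is a simplified, hypothetical illustration of that retry pattern; it is not moby's actual code, just a sketch of re-seeking and re-decoding when the reader races the writer and hits an unexpected EOF (maxRetries mirrors the retries=19999 seen in the daemon logs):

```go
// Illustrative only: a follower that retries the JSON decode when it reads a
// partially written entry, instead of failing on the first EOF.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
	"time"
)

// logEntry mirrors the json-file driver's on-disk format.
type logEntry struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

const maxRetries = 20000 // mirrors the retries=19999 seen in the daemon logs

// decodeWithRetry re-seeks to the same offset and retries the decode when the
// entry at that offset is incomplete (the writer has not finished it yet).
func decodeWithRetry(f *os.File, offset int64) (*logEntry, int64, error) {
	for retries := 0; retries < maxRetries; retries++ {
		if _, err := f.Seek(offset, io.SeekStart); err != nil {
			return nil, offset, err
		}
		dec := json.NewDecoder(f)
		var e logEntry
		err := dec.Decode(&e)
		if err == nil {
			return &e, offset + dec.InputOffset(), nil
		}
		// A half-written entry (or simply the current end of the file) shows
		// up as io.ErrUnexpectedEOF or io.EOF: back off briefly and retry.
		if errors.Is(err, io.ErrUnexpectedEOF) || errors.Is(err, io.EOF) {
			time.Sleep(time.Millisecond)
			continue
		}
		return nil, offset, err
	}
	return nil, offset, errors.New("got error while decoding json: unexpected EOF (gave up)")
}

func main() {
	f, err := os.Open(os.Args[1]) // path to a <container-id>-json.log file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var offset int64
	for {
		e, next, err := decodeWithRetry(f, offset)
		if err != nil {
			fmt.Fprintln(os.Stderr, "stopping:", err)
			return
		}
		offset = next
		fmt.Print(e.Log) // follow the file, like `docker logs -f`
	}
}
```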
|
Also, is this while following logs or just tailing existing ones? |
|
@cpuguy83 I'm currently using docker-for-mac, following logs. It usually takes a couple of minutes before triggering. This only affects a single container producing massive logging. It's actually a postgres container logging all the queries while running a big suite of tests, a couple of MBs of logs every test run. I've managed to get it to occur more often if there are more containers around with several docker logs --follow running. Our setup launches containers and --follows them, printing to a file. The stack I use to reproduce runs a dozen containers, so that's a dozen |
|
I'm getting this very often - I'm running on Ubuntu - and it happens whenever I'm tailing ( |
|
@SvenDowideit Is there any chance you could provide an image that repros it? |
|
Yup, I'm using a more complicated version of: and then building something non-trivial with. I don't think it happened the first time, while it's filling its cache, but after that, I get it every time. The extra complexity is that I don't have the |
|
We've hit the same bug in https://github.com/trinodb/trino while running our product tests (Testcontainers-based). When Trino is starting, it generates a lot of logs. During startup, 20% of runs fail with EOF. It's pretty straightforward to reproduce locally: |
|
I have the same issue when starting Docker Compose with ~10 WildFly services, each generating a rather normal amount of logs during WAR/EAR deployment. |
|
I have a local gitlab instance running. The problem happened right away when I upgraded. That's a great way to generate many megabytes per hour of logs, for anyone who needs to reproduce this. |
|
We are also seeing this issue in some integration tests that run against Docker and read logs from containers using the Go Docker API. |
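For context, a test like that typically follows container logs with the official Go client roughly as sketched below; this is an illustrative example rather than the reporter's code, and the container ID is a placeholder:

```go
// Minimal example of following a container's logs with the official Go client
// (github.com/docker/docker/client). "CONTAINER_ID" is a placeholder.
package main

import (
	"context"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
	"github.com/docker/docker/pkg/stdcopy"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	rc, err := cli.ContainerLogs(context.Background(), "CONTAINER_ID", types.ContainerLogsOptions{
		ShowStdout: true,
		ShowStderr: true,
		Follow:     true, // equivalent of `docker logs -f`
	})
	if err != nil {
		panic(err)
	}
	defer rc.Close()

	// For non-TTY containers the stream is multiplexed; stdcopy demuxes it.
	// When the daemon hits the decode bug described in this issue, the stream
	// ends early and the error text is written into the log output instead.
	if _, err := stdcopy.StdCopy(os.Stdout, os.Stderr, rc); err != nil {
		panic(err)
	}
}
```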
|
Okay, I have come up with an easy way to reproduce it: In my case, I get the output: |
|
I have found running more than one copy of this at once (with |
|
@pete-woods Anything special about the environment? I can't repro with this either :( I'm assuming this is all happening in real envs and not Docker Desktop specifically? |
|
I have now reproduced this with this technique on Docker for Mac, and in a Vagrant Ubuntu 18.04 VM. It definitely took more poking to reproduce in the VM than on the Mac, though. Running two copies of the Ubuntu seq container and tailing the logs from both did it fairly reliably for me. |
|
The other environment we're reproducing in is a GCP VM running Ubuntu 20.04 in our integration tests. (GCP is where CircleCI runs its VMs). |
|
My environment is a quad-core bare-metal Ubuntu 20.04 machine with 16 GB RAM (half of it in the avail state, and another 5 GB in the buff/cache state), running about two dozen containers. Load average is usually about 0.80 to 1.50. |
|
I wasn't able to reproduce using @pete-woods's approach either, but it always seems to happen within a few seconds if I run `docker run -it --name foo ubuntu sh -c "apt-get update && apt-get install -y nyancat && nyancat"` in one terminal window, and then in a separate terminal window. For what it's worth, it also happens if I tail the logs over Docker's HTTP API. Running Docker 20.10.2 on macOS on a MacBook Pro. |
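As an illustration of tailing logs over Docker's HTTP API, here is a minimal sketch that streams the logs endpoint over the local daemon socket; it assumes the default socket path and reuses the container name foo from the command above:

```go
// Tailing a container's logs over the raw Engine HTTP API on the local unix
// socket. Assumes the default /var/run/docker.sock path and a container "foo".
package main

import (
	"context"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	// Dial the local daemon socket instead of a TCP host.
	httpc := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	// The host part of the URL is ignored because we dial a unix socket.
	resp, err := httpc.Get("http://docker/containers/foo/logs?stdout=1&stderr=1&follow=1")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The body is the raw log stream (multiplexed when the container has no
	// TTY); for this illustration we just dump the bytes. When the daemon hits
	// the decode bug, the stream ends prematurely.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}
```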
|
@cpuguy83 Did you run `docker logs -f`? Do you use the json-file logging driver? |
|
Can you run a heavy mixed load on the worker and repeat the nyancat test? |
|
It breaks within 10 or so seconds for me as soon as you start requesting logs. |
|
+1 to @tobiasstadler's last comment; the issue is only present if you run |
|
@tobiasstadler Yes and yes. Looking at your suggested change. |
|
re: #41820 (comment) Yes, I think that's the culprit. |
|
#42104 should take care of this, but will need to be backported after it is merged. |
|
I don't know if my case contributes to the resolution at all, but all clients in our Nomad cluster are flooding the syslog with these error messages, just like @sorinvisan89 and @lindycoder reported. I'm still able to retrieve logs both via The issue was noticed because our Elasticsearch cluster alerted on low disk space. For reference: |
|
Thanks for the fix! When will this be included in a release?
|
I can't see it in the 20.10.6 milestone, where I think a lot of us would hope to see it.
|
It actually is in the 20.10.6 milestone: |
|
I am still encountering these errors while running the latest Docker version, 20.10.6. Can somebody else confirm? How to reproduce (tested on Ubuntu 20.04.2 LTS & macOS): The error shows up immediately. |
|
Docker Engine error ("json-file: fix sporadic unexpected EOF errors"):
level=warning msg="got error while decoding json" error="unexpected EOF" retries=19999
The Docker log driver has a bug and won't handle some log formatting; it then floods the system's log with retry messages, which generates a very annoying CPU load. The latest Docker engine is 20.10.7. The issue has been fixed in the Docker Engine 20.10.6 release: https://docs.docker.com/engine/release-notes/#20106 |
|
I checked a few of my instances at random and they are running version 20.10.6 where the bug should be fixed, but instead I still see lots of warning messages when
Maybe there's something else causing it too? |
|
I've upgraded the docker version on a test machine and still have the same warnings: |
|
Hello, we upgraded the engine to the latest version and we are still flooded with that error message, @cpuguy83: |
|
I can also confirm this flooding. Using |
|
Ditto: I can also confirm this flooding, using Server Version 20.10.11. Also running Nomad |


Description
Our E2E test suite uses testcontainers-java to run our microservices and their dependencies in containers for the duration of our tests. All of the tests were passing with 19.03.14 until we updated our build nodes to Docker 20.10.1, at which point we started repeatedly getting
STDOUT: $Error grabbing logs: unexpected EOF errors, and our Wait.forLogMessage calls started timing out. Note, however, that this issue seems sporadic, as about 1 out of 10 builds still passes. After we reverted to Docker 19.03.14 the issue went away. We're using the latest version of Testcontainers (1.15.1 at the time of reporting the issue).
After contacting the developers I was told this is most likely a Docker issue.
Steps to reproduce the issue:
Unfortunately our projects are not OSS, so I cannot give you the exact containers used. I'll try to describe a matching scenario.
1. Run an E2E test suite with Testcontainers (1.15.1).
2. Start containers and wait for their log output with Wait.forLogMessage().
3. Observe STDOUT: $Error grabbing logs: unexpected EOF errors, and the wait policy will not trigger even though the messages are present in the log.

Describe the results you received:
Error message:
STDOUT: $Error grabbing logs: unexpected EOF errors, wait policy for log message not triggering even though the message was present in the log.

Describe the results you expected:
No EOF error, wait policy for log message working.
Additional information you deem important (e.g. issue happens only occasionally):
As mentioned in the description, the issue seems sporadic, but it happens in roughly 80% or more of our CI runs.
Output of docker version:
Previous version where the tests were passing:
Current version where the issue is present:
Output of docker info:
Previous version where the tests were passing:
Current version where the issue is present:
The current version and info parts were copied from an Arch box, while the previous version and info parts were copied from an Ubuntu box, but we could also reproduce the issue on Ubuntu after upgrading to 20.10.
Additional environment details (AWS, VirtualBox, physical, etc.):