
Docker logs crashes every couple of hours #46699

Open
wouthoekstra opened this issue Oct 23, 2023 · 31 comments
Labels
area/logging kind/bug status/0-triage version/20.10

Comments

@wouthoekstra

Description

Every couple of hours docker compose logs crashes on my production server. In the system logs (/var/log/docker) I find something like this:
error unmarshalling log entry (size=108554): proto: LogEntry: illegal tag 0 (wire type 6)

This error is then followed by several errors that the log message is too large:
Error streaming logs: log message is too large (1937007727 > 1000000)

Note that the 'too large' log message in this case is almost 2 GB!

Manually running something like this also triggers the log message is too large error:
sudo docker logs abc123 --since "2023-10-16T18:49:10" --until "2023-10-16T18:50:00"

The environment

I am encountering this issue on two different production environments, both running on AWS EC2 instances:

  • A Kong API gateway, an application built on top of nginx. It is third-party software; we did not write any code for it.
  • Our own Node.js application, which runs in a container with a separate nginx container running alongside it.

Both seem to fail on the access logs. The Kong gateway crashes every couple of hours, while the environment running our Node.js application crashes about once a week. We use the Docker awslogs driver to send the logs to CloudWatch. CloudWatch seems to receive all logs, while manually running docker logs misses up to several hours of logs each time after the error occurs.

What I have tried

  • I have looked through the CloudWatch logs for 'weird' log entries around the time of the error, particularly large entries or unusual characters, but could not find any.
  • I ran docker logs manually in an open SSH session, which results in the same error as reported in the system logs.
  • I have downloaded the cached log files and gone through them using less, tail and other tools. Doing this I found that the cached logs miss entries that do exist in CloudWatch.
  • I've tried copying these logs and making them available in a dummy container, to get a more 'stable' environment for figuring out the bug, but I can't get it to work.

Reproduce

I have tried to create a reproducible situation, but have been unable to do so. I cannot pinpoint the log entry that crashes the system, nor can I import the cached logs into another container. But this is my situation:

  1. Run an application with nginx (this is quite an assumption)
  2. Log a lot of access logs
  3. Have the application crash

Expected behavior

Docker logs should not crash

docker version

Client:
 Version:           20.10.23
 API version:       1.41
 Go version:        go1.18.9
 Git commit:        7155243
 Built:             Tue Apr 11 22:56:36 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.23
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.9
  Git commit:       6051f14
  Built:            Tue Apr 11 22:57:17 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.19
  GitCommit:        1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc:
  Version:          1.1.7
  GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., 0.0.0+unknown)
  compose: Docker Compose (Docker Inc., v2.22.0)

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.326-245.539.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.781GiB
 Name: ip-172-31-17-252.eu-west-1.compute.internal
 ID: 5ELM:KYZM:7AEB:RH2I:FFL6:G2SA:4OOR:SGWE:6JOC:NXBS:U7HA:HA3Y
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

I am aware I do not have much information to go on, since I cannot provide reproduction steps. I can, however, access the production environment and run debugging steps if that helps.

Note: I was sent here after creating an issue in the docker CLI repo (docker/cli#4617). The commenter there suggested this version of Docker might be maintained by AWS. I did indeed open a ticket with them; their (business tier) support, however, remains unresponsive.

@wouthoekstra added the kind/bug and status/0-triage labels on Oct 23, 2023
@wouthoekstra
Author

In the other issue I was asked whether anything could be interfering with the log files. Up until this error we kept the environment itself very clean: we use AWS Elastic Beanstalk to deploy the instances. After the crashes started occurring, we made some customizations to the environment with the most crashes, such as configuring a CloudWatch agent. We only did this after the issues appeared on the environments in question, so I have to assume any conflicting tooling is an AWS default.

@samuelkarp
Member

I took a quick look and Amazon is applying some patches to Docker that change the log behavior. In particular, their patch docker-20.10.4-Limit-logger-errors-logged-into-daemon-logs.patch changes the behavior of the ring buffer and appears to attempt to log the entire failed message (2GB??) into the daemon log.

/cc @stewartsmith

How to see patches in Amazon's SRPM

Do this inside an amazonlinux:2 container

bash-4.2# yum install yum-utils -y
...
bash-4.2# amazon-linux-extras enable docker
...
bash-4.2# yum --showduplicates search docker | grep 20.10.23
docker-20.10.23-1.amzn2.0.1.x86_64 : Automates deployment of containerized
bash-4.2# yumdownloader --source docker-20.10.23-1.amzn2.0.1.x86_64
...
bash-4.2# rpm2cpio docker-20.10.23-1.amzn2.0.1.src.rpm > docker.src.cpio
bash-4.2# mkdir docker
bash-4.2# cpio -D docker -dimv < docker.src.cpio
...
bash-4.2#  cat docker/docker.spec | grep ^Patch
Patch2001:      docker-20.10.4-sysvinit-use-nohup.patch
Patch2002:      docker-20.10.4-sysvinit-add-storage-opts.patch
Patch2004:      docker-20.10.4-sysvinit-increase-daemon-maxfiles.patch
Patch2007:      docker-20.10.4-sysvinit-stop-before-network.patch
Patch2010:      docker-20.10.4-sysvinit-configurable-start-timeout.patch
Patch2103:      docker-20.10.4-Skip-devmapper-tests-that-don-t-work-in-a-buildroot.patch
Patch2104:      docker-20.10.4-Skip-mutating-vfs-tests.patch
Patch2108:      docker-20.10.4-Skip-pkg-sysinfo-tests-that-require-root.patch
Patch2109:      docker-20.10.4-Skip-mutating-volume-local-tests.patch
Patch2113:      docker-20.10.4-Skip-pkg-authorization-tests-that-create-sockets.patch
Patch2114:      docker-20.10.4-Skip-pkg-idtools-tests-that-require-root.patch
Patch2115:      docker-20.10.4-Skip-distribution-tests-that-network.patch
Patch2117:      docker-20.10.4-Skip-builder-tests-that-require-root.patch
Patch2118:      docker-20.10.7-Skip-overlay-tar-untar-test.patch
Patch2120:      no-private-mnt-namespace.patch
Patch2123:      runc-allow-git-sha-override.patch
Patch2125:      tini-allow-git-sha-override.patch
Patch2126:      docker-20.10.4-Skip-cli-tests-that-require-network.patch
Patch2130:      docker-20.10.4-Skip-auth-middleware-test.patch
Patch2133:      docker-20.10.4-Skip-quota-tests.patch
Patch2135:      docker-20.10.4-Skip-engine-layer-tests.patch
Patch2140:      docker-20.10.4-Skip-volume-store-tests.patch
Patch2141:      docker-20.10.4-systemd-unit-sysconfig.patch
Patch2143:     docker-20.10.4-Skip-get-source-mount.patch
Patch2144:     docker-20.10.4-systemd-unit-runtimes.d.patch
Patch2145:      docker-20.10.4-Skip-git-tests.patch
Patch2146:      docker-20.10.4-Skip-compression-tests-that-require-root.patch
Patch2148:      docker-20.10.4-Skip-btrfs-graphdriver-tests-if-not-root.patch
Patch2149:      docker-20.10.4-Restore-containerd-dependency-restart-policy-and-nof.patch
Patch3005:      docker-20.10.7-Skip-pkg-archive-tests-that-require-root.patch
Patch3008:      docker-20.10.7-Add-test-skip-helpers-to-testutil.patch
Patch3009:      docker-20.10.7-Add-test-skip-helpers-to-cli.patch
Patch3013:	docker-20.10.4-Limit-logger-errors-logged-into-daemon-logs.patch
Patch3014:      docker-20.10.17-Skip-kubeconfig-and-loadcontext-tests-that-require-root.patch

@wouthoekstra
Author

I've executed the same commands on my instance and indeed the Limit-logger-errors-logged-into-daemon-logs.patch is there (see below).

Regarding the 2GB log message, my gut feeling is that it is the result of many access log messages being concatenated after a serialization error, this one to be precise: error unmarshalling log entry (size=108554): proto: LogEntry: illegal tag 0 (wire type 6). This might explain the couple of hours of logs missing from the daemon logs that do exist in CloudWatch. But to be honest, I am out of my depth here.

[ec2-user@... ~]$ cpio -D docker -dimv < docker.src.cpio
README-docker-runtimes.d
cli-20.10.23.tar.gz
docker-20.10.17-Skip-kubeconfig-and-loadcontext-tests-that-require-root.patch
docker-20.10.23.tar.gz
docker-20.10.4-Limit-logger-errors-logged-into-daemon-logs.patch
docker-20.10.4-Restore-containerd-dependency-restart-policy-and-nof.patch
docker-20.10.4-Skip-auth-middleware-test.patch
docker-20.10.4-Skip-btrfs-graphdriver-tests-if-not-root.patch
docker-20.10.4-Skip-builder-tests-that-require-root.patch
docker-20.10.4-Skip-cli-tests-that-require-network.patch
docker-20.10.4-Skip-compression-tests-that-require-root.patch
docker-20.10.4-Skip-devmapper-tests-that-don-t-work-in-a-buildroot.patch
docker-20.10.4-Skip-distribution-tests-that-network.patch
docker-20.10.4-Skip-engine-layer-tests.patch
docker-20.10.4-Skip-get-source-mount.patch
docker-20.10.4-Skip-git-tests.patch
docker-20.10.4-Skip-mutating-vfs-tests.patch
docker-20.10.4-Skip-mutating-volume-local-tests.patch
docker-20.10.4-Skip-pkg-authorization-tests-that-create-sockets.patch
docker-20.10.4-Skip-pkg-idtools-tests-that-require-root.patch
docker-20.10.4-Skip-pkg-sysinfo-tests-that-require-root.patch
docker-20.10.4-Skip-quota-tests.patch
docker-20.10.4-Skip-volume-store-tests.patch
docker-20.10.4-systemd-unit-runtimes.d.patch
docker-20.10.4-systemd-unit-sysconfig.patch
docker-20.10.4-sysvinit-add-storage-opts.patch
docker-20.10.4-sysvinit-configurable-start-timeout.patch
docker-20.10.4-sysvinit-increase-daemon-maxfiles.patch
docker-20.10.4-sysvinit-stop-before-network.patch
docker-20.10.4-sysvinit-use-nohup.patch
docker-20.10.7-Add-test-skip-helpers-to-cli.patch
docker-20.10.7-Add-test-skip-helpers-to-testutil.patch
docker-20.10.7-Skip-overlay-tar-untar-test.patch
docker-20.10.7-Skip-pkg-archive-tests-that-require-root.patch
docker-setup-runtimes.sh
docker-storage.sysconfig
docker.spec
docker.sysconfig
go-md2man-2.0.1.tar.gz
libnetwork-05b93e0.tar.gz
no-private-mnt-namespace.patch
runc-allow-git-sha-override.patch
tini-allow-git-sha-override.patch
tini-de40ad0.tar.gz
v0.10.4.tar.gz
128367 blocks

@lbergesio

lbergesio commented Oct 26, 2023

I am seeing a similar problem, and my VM is not in AWS:
error from daemon in stream: Error grabbing logs: log message is too large (168194932 > 1000000)

I'm using the local logging driver and have rotation configured as shown below. In the container's directory I have several files with the rotated logs, but executing docker logs on the affected containers produces a huge output that suddenly ends with the error message, and the logs are truncated (the latest logs are lost).

version:

xxx@imm-gd-1xi8afbouye44-ins:~$ docker -v
Docker version 24.0.5, build ced0996

Error:

xxx@imm-gd-1xi8afbouye44-ins:~$ docker logs 38b3a20c1812
...
...
...
error from daemon in stream: Error grabbing logs: log message is too large (108229732 > 1000000)

Container logs dir:

xxx@imm-gd-1xi8afbouye44-ins:~$ sudo ls /var/lib/docker/containers/38b3a20c18122a8585a79cd3ef11b0b68a7e4481a3bc2f1d755946ab930c92a0/local-logs
container.log  container.log.1.gz  container.log.2.gz  container.log.3.gz  container.log.4.gz  container.log.5.gz  container.log.6.gz  container.log.7.gz

Docker logging conf:

xxx@imm-gd-1xi8afbouye44-ins:~$ sudo cat /etc/docker/daemon.json 
{
"log-driver": "local",
"log-opts": {
  "compress": "true",
  "max-size": "20m",
  "max-file": "8"
},
"iptables": false,
"bridge": "none"
}

@wouthoekstra
Author

@lbergesio have you checked /var/log/docker as well? I think that is the default place for docker errors to be logged, and for me the errors always start with something along the lines of Error streaming logs: error unmarshalling log entry (size=1), and then a bunch of the 'too large' errors like you have as well.

@lbergesio

lbergesio commented Oct 27, 2023

> @lbergesio have you checked /var/log/docker as well? I think that is the default place for docker errors to be logged, and for me the errors always start with something along the lines of Error streaming logs: error unmarshalling log entry (size=1), and then a bunch of the 'too large' errors like you have as well.

I don't have /var/log/docker; I grepped for "too large" in /var/log and found nothing. I remember seeing errors with "unmarshalling" when I was using the json-file log driver, but not with local.

@lbergesio

lbergesio commented Oct 27, 2023

@wouthoekstra but I do have this from sudo journalctl -fu docker.service (mind that this is for a different container, but the same issue):

Oct 26 23:20:05 imm-gd-1xi8afbouye44-ins dockerd[1227]: time="2023-10-26T23:20:05.515920863Z" level=error msg="Error streaming logs: log message is too large (1952739189 > 1000000)" container=d7f8c9792ea361fcd09d65edac9ac24e99d5bc63654d7457335c77623900e336 method="(*Daemon).ContainerLogs" module=daemon
Oct 26 23:20:16 imm-gd-1xi8afbouye44-ins dockerd[1227]: time="2023-10-26T23:20:16.009040105Z" level=error msg="Error streaming logs: log message is too large (1952739189 > 1000000)" container=d7f8c9792ea361fcd09d65edac9ac24e99d5bc63654d7457335c77623900e336 method="(*Daemon).ContainerLogs" module=daemon
Oct 27 10:43:33 imm-gd-1xi8afbouye44-ins dockerd[1227]: time="2023-10-27T10:43:33.165421282Z" level=error msg="Error streaming logs: log message is too large (1952739189 > 1000000)" container=d7f8c9792ea361fcd09d65edac9ac24e99d5bc63654d7457335c77623900e336 method="(*Daemon).ContainerLogs" module=daemon

@ArtCzer

ArtCzer commented Nov 9, 2023

Similar issue here.
The app is a Ruby on Rails web app configured via a docker-compose.yml, and uses the awslogs driver to log to Cloudwatch. It's deployed on Elastic Beanstalk.

The issue only started occurring after upgrading the Beanstalk image from Docker running on 64bit Amazon Linux 2/3.5.9 to Docker running on 64bit Amazon Linux 2/3.6.x
The biggest change I see between these versions is the update to docker compose V2.
I'm working around it by downgrading back to 3.5.9.

I've tried disabling colorized logs, as I could see some unexpected Unicode characters in the /var/lib/docker/containers/<container id>/container-cached.log file, but that didn't resolve the issue.

Relevant configs:
On the failing instance:

> docker-compose --version
Docker Compose version v2.23.0

> docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)
  compose: Docker Compose (Docker Inc., v2.23.0)

Server:
 Server Version: 20.10.25
 Logging Driver: json-file
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Plugins:
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Default Runtime: runc
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 Kernel Version: 4.14.326-245.539.amzn2.x86_64
 Operating System: Amazon Linux 2
 Labels:
 Experimental: false

Working instance:

> docker-compose --version
docker-compose version 1.29.2, build unknown

> docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., 0.0.0+unknown)

Server:
 Server Version: 20.10.23
 Logging Driver: json-file
 Plugins:
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 Kernel Version: 4.14.318-240.529.amzn2.x86_64
 Operating System: Amazon Linux 2
 Labels:
 Experimental: false

Update

  • looks like the docker-20.10.4-Limit-logger-errors-logged-into-daemon-logs.patch patch is installed on both the working and the failing instance
  • the issue occurs when a large number of logs are being sent to STDOUT
  • the error from daemon in stream: Error grabbing logs: error unmarshalling log entry (size=73482): proto: LogEntry: illegal tag 0 (wire type 6) log entry seems to coincide with the issue
  • Error streaming logs: log message is too large is also present in the logs when the log service crashes

@PeterCai7

PeterCai7 commented Nov 17, 2023

To sum up, two factors are making this issue so confusing.

  1. For environments running on AWS Elastic Beanstalk, Docker running on 64bit Amazon Linux 2/3.5.9 uses docker-compose version 1, which was written in Python and does not bubble up the docker logs error, so the docker compose logs service does not crash. Docker running on 64bit Amazon Linux 2/3.6.x uses docker-compose version 2, which catches the log streaming error from docker logs and crashes. Therefore, the issue was not introduced by docker compose V2; it existed before and is merely exposed by docker compose V2.

  2. Amazon Linux carries a docker-20.10.4-Limit-logger-errors-logged-into-daemon-logs.patch to lower the rate at which log driver errors are written to the daemon logs, which may ultimately cause the 'log message is too large' error when the buffer is full.

But the question here is why we get error unmarshalling log entry (size=73482): proto: LogEntry: illegal tag 0 (wire type 6) at all. Is there something in our applications breaking the protocol buffer serialization?

@wouthoekstra
Author

@PeterCai7 would it help if we removed the patch from our most problematic instance? On that particular instance compose logs crashes within a couple of hours most of the time. If it does not crash within a day after removing the patch, we have a stronger case for blaming the patch.

I would need some instructions for removing the patch if you want to go this route, either through this issue or through the case I created with AWS support.

@PeterCai7

@wouthoekstra
If you would like to test an Amazon Linux Docker build with this patch removed, you would have to rebuild the SRPM, replacing the original docker-20.10.23-1.amzn2.0.1.src.rpm with your newly built one, then build RPM binaries from that SRPM, uninstall the old Docker version, and install the new RPM files. Overall it takes quite a bit of extra effort, and given that this issue is also occurring on a non-Amazon Linux instance, I doubt this is an efficient way of digging in.

@wouthoekstra
Author

Good point. I was not aware of how much work this was going to be, and I'd rather not do all of that on our production environment. I'll just await your progress.

@PeterCai7

PeterCai7 commented Nov 22, 2023

@wouthoekstra

After taking a further look, my theory is that this issue could be caused by a mismatch between the encode/decode mechanisms of the awslogs log driver and the local log driver. As you mentioned, you configured the awslogs driver in your application, so the log messages are processed by cloudwatchlogs.go; that works fine and all your logs are available in CloudWatch. However, since awslogs does not provide a read method for local log reading, dual logging kicks in by default: your log entries are cached on the local machine, and the Docker daemon invokes the local log driver to read them when you run docker logs (and when Beanstalk executes docker compose logs as well). As we can see in awslogs, this log driver does not encode log entries with the protocol buffer format defined in moby, but the local driver reads the locally cached logs expecting that protocol buffer format. This could be the cause of error unmarshalling log entry (size=73482): proto: LogEntry: illegal tag 0 (wire type 6), and the error could then occur repeatedly and stack up until the rate limit configured by Amazon Linux is reached, at which point we see the other error, error from daemon in stream: Error grabbing logs: log message is too large.
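
To illustrate the kind of decode failure described above (a toy sketch, not moby's code; the sample bytes are made up): in the protobuf wire format the first varint of a message is a key whose low three bits are the wire type and whose remaining bits are the field number. If the reader's offset lands on bytes that are not a protobuf-encoded LogEntry, that key decodes to nonsense; for example a first byte of 0x06 yields field 0 with wire type 6, which is exactly the shape of the illegal tag 0 (wire type 6) error quoted above.

package main

import (
	"encoding/binary"
	"fmt"
)

// decodeKey splits a protobuf field key into field number and wire type.
// In the wire format the key is a varint: key = (fieldNumber << 3) | wireType.
// Valid wire types are 0-5; field numbers start at 1.
func decodeKey(b []byte) (field, wireType uint64) {
	key, _ := binary.Uvarint(b)
	return key >> 3, key & 0x7
}

func main() {
	// Hypothetical byte sequences, not taken from a real log file.
	valid := []byte{0x0a, 0x05, 'h', 'e', 'l', 'l', 'o'} // field 1, wire type 2 (length-delimited)
	garbage := []byte("GET / HTTP/1.1")                  // plain access-log text

	for _, b := range [][]byte{valid, garbage, {0x06}} {
		f, wt := decodeKey(b)
		fmt.Printf("first key: field %d, wire type %d (plausible LogEntry: %v)\n", f, wt, f > 0 && wt <= 5)
	}
}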

Could you try the following steps as a workaround?

  1. Switch the log driver from awslogs to local; this would help make sure the log encoding is compliant with the protocol buffer format.
  2. Turn on log streaming in Elastic Beanstalk; this would upload the local logs to CloudWatch (CW) through a Beanstalk-controlled path, and we should be able to see them in CW under /aws/elasticbeanstalk/environment_name/var/log/eb-docker/containers/eb-current-app/stdouterr.log.

If the issue is gone after this workaround, that would support my hypothesis and we would have a clearer picture of our next step.

@wouthoekstra
Author

@PeterCai7 I switched the driver from awslogs to local. It has been running for two hours now without issues, which seems promising.

I've enabled log streaming for the instance and set the required permissions to do so. The only logs that make it into CloudWatch currently are:
/aws/elasticbeanstalk/<<instance>>/environment-health.log

I am missing my access logs and error logs. That's no issue for a short period, and I guess it is not a problem for debugging this issue either? I did SSH into the machine just now and manually ran docker logs, and the logs are all there.

I will report back tomorrow and let you know if the instance remained healthy overnight.

@wouthoekstra
Author

About twenty hours later, docker compose logs is still running and the instance is still healthy.

@lbergesio

@PeterCai7 in my case above I was already using the local log driver, and my VM was not running on AWS.

@PeterCai7

@wouthoekstra According to the documentation, there should be a CloudWatch log group named /aws/elasticbeanstalk/{environment_name} (environment name, not instance name) if you configured everything right. But that is a different issue if you cannot find the log group in CW. For now, let's just make sure the log entry error no longer occurs after switching to the local driver.

@PeterCai7

@lbergesio Yes, your case could be different. I see you mentioned there was no Error streaming logs: error unmarshalling log entry (size=1) error in your case, so I believe log reading/decoding went fine; the log message is too large error may be related to your rotation config. Are you using an Amazon Linux OS, or another Linux distribution, since you mentioned your VM is not running on AWS? Just trying to clarify before we dive deep.

@lbergesio


I don't have access to the VM anymore :(. But it is an Ubuntu 20.04-based image running on IBM Cloud. This is my daemon.json with the log config:

{                        
"log-driver": "local",   
"log-opts": {            
  "compress": "true",    
  "max-size": "20m",     
  "max-file": "8"        
},                       
"iptables": false,       
"bridge": "none"         
}                        

Also, although I think it is unrelated, I am filtering out some docker logs with this in /etc/rsyslog.d/01-blocklist.conf:

if $msg contains "run-docker-runtime" and $msg contains ".mount: Succeeded." then {
    stop                                                                        
}                                                                               

@wouthoekstra
Author

@PeterCai7 Is there any progress on this issue? I can understand if the issue is hard to solve, but if there is no solution in sight I will discuss with my team moving several of our servers away from awslogs for now.

@cpuguy83
Member

cpuguy83 commented Jan 10, 2024

Can someone provide a sample of the logs in question? Log files are in /var/lib/docker/containers/<id>/local-logs.
I understand this can be sensitive information and may be difficult/impossible to provide.

I am trying to find a way to reproduce the issue in the meantime.

@wouthoekstra
Author

@cpuguy83 I can send you log files. I've just downloaded some recent logs, and they do look weird: less considers them a binary file. I see you have an email address in your GitHub profile; should I send them to you there?

@PeterCai7

> @PeterCai7 Is there any progress on this issue? I can understand if the issue is hard to solve, but if there is no solution in sight I will discuss with my team moving several of our servers away from awslogs for now.

Yes, the Beanstalk team did not develop the awslogs driver here in the Docker repo. Therefore, I think the first thing we need to figure out is whether we should change awslogs to follow the same encoding mechanism that the 'local' driver uses. Otherwise, I would say that running servers with the 'local' driver gives better integration with Beanstalk's Docker platform.

@jlian

jlian commented Jan 30, 2024

> Can someone provide a sample of the logs in question? Log files are in /var/lib/docker/containers/<id>/local-logs. I understand this can be sensitive information and may be difficult/impossible to provide.
>
> I am trying to find a way to reproduce the issue in the meantime.

@cpuguy83 in the (possibly related) IoT Edge issue, we provided some logs

Azure/iotedge#7074 (comment)

EDIT: also tagging Azure/iotedge#7177 as possibly related

@cpuguy83
Member

Thanks @jlian

I was referring to the log files of the container.

@wouthoekstra

Sure, you can send them there.

@wouthoekstra
Author

I've been away for a couple of days, and I just wanted to make a fresh export of logs for @cpuguy83 when I noticed that all my environments at AWS are now reported as healthy. Has there been an update?

@cpuguy83
Member

cpuguy83 commented Feb 5, 2024

Sadly no, I don't think there have been any changes to this logging code.

@wouthoekstra
Author

That is odd. Nothing changed on our side as far as I am aware. We had several environments with completely different tech stacks that were impacted by this issue, and all of them are currently reporting healthy. Nonetheless, I've sent you Docker container logs from one of the environments, from when we did have issues.

@PeterCai7

@wouthoekstra
Did you mean that nothing changed on your side, but the awslogs driver started working for your environments and they are now reporting healthy?

@wouthoekstra
Author

@PeterCai7 Yes, that is correct. We did not update our tech stacks, nor did we change our logging mechanism. We do have automatic updates on our AWS servers, and our environments are all at "Docker running on 64bit Amazon Linux 2/3.7.1" now.

@cpuguy83
Member

cpuguy83 commented Mar 28, 2024

The "log message too large" seems like its related to corruption in the log files.
This would have been possible before 7493342

Basically the (local) logs are encoded like so:

[msg length][msg as protobuf][msg length]

In the case of corrupted logs, I believe dockerd thinks it is reading the [msg length], but it is really in the middle of some message or some corrupted data, and as a result the number it reads is much larger than the maximum buffer size.
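
For illustration, here is a rough sketch of a reader for that framing (not moby's actual implementation; the 4-byte big-endian size prefix and the 1,000,000-byte cap are assumptions based on the description above and on the limit quoted in the errors). It shows how a size word read from the middle of a message, or from corrupted bytes, immediately trips the 'too large' check:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const maxMsgLen = 1000000 // matches the limit quoted in the errors above

// readEntry reads one [size][payload][size] frame. If the reader is not
// positioned exactly at a frame boundary, the "size" it reads is just
// arbitrary payload bytes, which is how bogus values like 1937007727 appear.
func readEntry(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.BigEndian.Uint32(hdr[:])
	if size > maxMsgLen {
		return nil, fmt.Errorf("log message is too large (%d > %d)", size, maxMsgLen)
	}
	payload := make([]byte, size)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	var ftr [4]byte // trailing copy of the size, used when reading backwards
	if _, err := io.ReadFull(r, ftr[:]); err != nil {
		return nil, err
	}
	// A real reader would now protobuf-unmarshal payload into a LogEntry;
	// non-protobuf bytes at that step are what produce "illegal tag 0 (wire type 6)".
	return payload, nil
}

func main() {
	// One well-formed frame followed by four bytes of ASCII text ("GET "),
	// standing in for corruption: the text gets interpreted as a huge size.
	var buf bytes.Buffer
	msg := []byte("ok")
	binary.Write(&buf, binary.BigEndian, uint32(len(msg)))
	buf.Write(msg)
	binary.Write(&buf, binary.BigEndian, uint32(len(msg)))
	buf.WriteString("GET ")

	for {
		if _, err := readEntry(&buf); err != nil {
			fmt.Println("stopped:", err) // log message is too large (1195725856 > 1000000)
			break
		}
	}
}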

Unfortunately we don't have any additional framing to determine the beginning or end of a message (and what we do have is not suitable for this purpose), so the log reader is unable to recover from this situation, e.g. by skipping ahead to the next message.
When this occurs the log file needs to be deleted.
It might be possible to scope your logs request to not include the file with corrupted data (by adding a since and/or until, or tail parameter).

If this is happening on >= moby v24 with only newly created logs, that would show there is still a problem.
I have been unable to reproduce such problems under various settings, like concurrent read/write with various levels of rotation (including ridiculously frequent rotations, single-file rotation, and a slew of other things).
