logspout fails to reconnect after connection reset by peer. #215

Closed
jeacott1 opened this Issue Jul 28, 2016 · 21 comments

jeacott1 commented Jul 28, 2016

I have 5 instances of logspout running in Marathon, talking to Loggly.
All was well until yesterday, when all 5 failed with errors like this:

#   ADAPTER     ADDRESS                 CONTAINERS      SOURCES OPTIONS
#   syslog+tcp  logs-01.loggly.com:514                          map[]
2016/07/09 03:39:42 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection reset by peer
2016/07/09 03:39:42 syslog: retrying tcp up to 11 times
2016/07/09 03:40:23 syslog: retry failed
2016/07/09 03:40:23 syslog: reconnecting up to 11 times
2016/07/10 23:33:50 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/10 23:33:50 syslog: retrying tcp up to 11 times
2016/07/10 23:34:31 syslog: retry failed
2016/07/10 23:34:31 syslog: reconnecting up to 11 times
2016/07/11 02:04:47 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/11 02:04:47 syslog: retrying tcp up to 11 times
2016/07/11 02:05:28 syslog: retry failed
2016/07/11 02:05:28 syslog: reconnecting up to 11 times
2016/07/11 02:32:28 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/11 02:32:28 syslog: retrying tcp up to 11 times
2016/07/11 02:33:09 syslog: retry failed
2016/07/11 02:33:09 syslog: reconnecting up to 11 times
2016/07/12 02:57:56 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/12 02:57:56 syslog: retrying tcp up to 11 times
2016/07/12 02:58:37 syslog: retry failed
2016/07/12 02:58:37 syslog: reconnecting up to 11 times
2016/07/13 23:51:10 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/13 23:51:10 syslog: reconnecting up to 11 times
2016/07/18 11:11:16 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/18 11:11:16 syslog: reconnecting up to 11 times
2016/07/22 05:51:14 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/22 05:51:14 syslog: reconnecting up to 11 times
2016/07/27 08:00:46 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection reset by peer
2016/07/27 08:00:46 syslog: retrying tcp up to 11 times
2016/07/27 08:01:27 syslog: retry failed
2016/07/27 08:01:27 syslog: reconnecting up to 11 times
2016/07/27 17:40:52 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/27 17:40:52 syslog: retrying tcp up to 11 times
2016/07/27 17:41:33 syslog: retry failed
2016/07/27 17:41:33 syslog: reconnecting up to 11 times
2016/07/28 00:56:24 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/28 00:56:24 syslog: retrying tcp up to 11 times
2016/07/28 00:57:05 syslog: retry failed
2016/07/28 00:57:05 syslog: reconnecting up to 11 times

It looks like it retried with a backoff but never succeeded in reconnecting, even though a manual connection to the same endpoint worked just fine.

Restarting the Docker containers fixed the issue, so two things follow from this:

  1. Something doesn't work as expected with the retry mechanism.
  2. If there were a useful healthcheck endpoint in logspout, Marathon could have shot/restarted the processes (a rough sketch of such an endpoint follows below).
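
For illustration, a minimal sketch of what such a healthcheck endpoint could look like, using only Go's standard library. This is not logspout's actual API; the path, port, threshold, and the lastSuccessfulWrite variable are assumptions made up for this example.

// Hypothetical sketch only, not logspout's actual API.
// A tiny HTTP healthcheck that turns unhealthy once writes have been failing
// for a while, so an orchestrator (Marathon, Docker, etc.) can restart the
// container. The path, port, and threshold are arbitrary assumptions.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastSuccessfulWrite would be updated by the adapter after each good write.
var lastSuccessfulWrite atomic.Value

func healthz(w http.ResponseWriter, r *http.Request) {
	last, _ := lastSuccessfulWrite.Load().(time.Time)
	if time.Since(last) > 2*time.Minute {
		http.Error(w, "no successful writes recently", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	lastSuccessfulWrite.Store(time.Now())
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":8080", nil) // error handling omitted for brevity
}

Marathon (or any scheduler) could then poll /healthz and replace the task when it returns a non-200 status.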

Halama commented Aug 5, 2016

Same here, my containers stopped logging to Papertrail:

2016/08/04 21:01:25 syslog: write tcp 172.17.0.2:33158->67.214.223.210:*****: write: connection reset by peer
2016/08/04 21:01:25 syslog: retrying tcp up to 11 times
2016/08/04 21:02:06 syslog: retry failed
2016/08/04 21:02:06 syslog: reconnecting up to 11 times

It started logging again after a manual restart of the container.

albertsun commented Aug 15, 2016

I'm seeing the same thing trying to write to Papertrail. Manual restart of the service fixes things.

Aug 03 14:20:17 ip-10-214-0-120.ec2.internal docker[30862]: 2016/08/03 14:20:17 syslog: write tcp 192.168.16.2:44954->173.247.107.217:49069: write: connection reset by peer
Aug 03 14:20:17 ip-10-214-0-120.ec2.internal docker[30862]: 2016/08/03 14:20:17 syslog: retrying tcp up to 11 times
Aug 03 14:20:58 ip-10-214-0-120.ec2.internal docker[30862]: 2016/08/03 14:20:58 syslog: retry failed
Aug 03 14:20:58 ip-10-214-0-120.ec2.internal docker[30862]: 2016/08/03 14:20:58 syslog: reconnecting up to 11 times
jacobdr commented Aug 26, 2016

I see the same behavior. It also looks like all of the logs get cached, and upon restart they all get flushed out again.

I think there is a bug in the retry logic, but the quickest way out might be to just crash the container on timeout/connection loss and let Docker's restart semantics take care of reconnection.
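
For illustration, a rough sketch of that crash-and-restart approach: once retries are exhausted the process exits non-zero, so a `docker run --restart=always` policy (or a scheduler) brings up a fresh container that reconnects from scratch. This is not logspout's actual code; the endpoint, message, and retry schedule below are made up.

// Rough sketch of the "crash and let Docker restart us" idea above.
package main

import (
	"log"
	"net"
	"os"
	"time"
)

func sendWithRetries(conn net.Conn, msg []byte, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = conn.Write(msg); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return err
}

func main() {
	conn, err := net.Dial("tcp", "logs.example.com:514") // hypothetical endpoint
	if err != nil {
		log.Fatal(err) // non-zero exit; Docker's restart policy takes over
	}
	if err := sendWithRetries(conn, []byte("hello\n"), 5); err != nil {
		log.Println("giving up:", err)
		os.Exit(1) // with --restart=always, Docker starts a fresh container
	}
}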

edude03 commented Sep 6, 2016

I'm having the same issue. Anyone interested in a bounty?

ryansch commented Sep 6, 2016

Honestly, after hitting this bug I spent a little time getting the Docker syslog driver working, and I've been pleased with the result.

yeasy commented Sep 14, 2016

Any solution to this problem?
Maybe we should enable reconnection?
Thanks!

adrien-candiotti commented Sep 21, 2016

Same here. I'm using Docker Cloud, so modifying the Docker syslog driver is not a good option for me.
Any ideas?

izakp commented Oct 7, 2016

+1, would really love this fixed, as it makes logspout unusable. It never gets past the message "reconnecting up to 11 times", which makes me think it's stuck in the reconnection loop and never exits.

gaul commented Oct 7, 2016

I suspect that the exponential retry logic cannot handle long-lived network failures. It retries 11 times with a maximum interval of 10240 milliseconds.
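
As a back-of-the-envelope check of those numbers: assuming the delay doubles from 10 ms up to the 10240 ms cap over 11 attempts (the exact schedule is an assumption, not a quote from the code), the whole retry window adds up to roughly 20 seconds, nowhere near enough to ride out an outage lasting minutes or hours.

// Rough arithmetic check under the assumed schedule above.
package main

import (
	"fmt"
	"time"
)

func main() {
	total := time.Duration(0)
	delay := 10 * time.Millisecond
	for attempt := 1; attempt <= 11; attempt++ {
		total += delay
		fmt.Printf("attempt %2d: wait %v (cumulative %v)\n", attempt, delay, total)
		delay *= 2
	}
	// Cumulative wait comes out to roughly 20 seconds in total.
}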

edude03 commented Oct 7, 2016

I think what we want is for the container to die when it can't reconnect (ideally with buffering, but let's not get ahead of ourselves). That way the scheduler can restart the container and/or notify someone when it's not working.

izakp commented Oct 7, 2016

I'd be happy if it tried one reconnection and, on error, exited, leaving it up to the scheduler to take further action.

yeasy commented Oct 8, 2016

One possible approach is to make the number of retries configurable.
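
A minimal sketch of what that could look like, reading the count from an environment variable. The name RETRY_COUNT matches the variable imriz's fork uses later in this thread, and the default of 11 mirrors the current retry messages, but treat both as assumptions rather than the project's actual API.

// Sketch of a configurable retry count via an environment variable.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func getRetryCount() int {
	const defaultRetries = 11 // assumed default
	v := os.Getenv("RETRY_COUNT")
	if v == "" {
		return defaultRetries
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 1 {
		return defaultRetries // fall back on invalid input
	}
	return n
}

func main() {
	fmt.Println("will retry up to", getRetryCount(), "times")
}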

gaul pushed a commit to gaul/logspout that referenced this issue Oct 14, 2016

rhuddleston commented Nov 29, 2016

After investigating this more, I can confirm I'm seeing the same thing: after "reconnecting up to 11 times" I'm losing messages. The interesting thing is that it only stops logging for some of the Docker containers; others continue logging, and the ones that stop logging I can fix by restarting logspout.

Also this looks like the same issue: #107

imriz commented Dec 14, 2016

For the benefit of everyone affected by this (huge) bug, and due to the lack of activity in this issue, I've forked @robcza's fork, which implements both the configurable retry and the exit on failure,
and created a public image on Docker Hub: https://hub.docker.com/r/imriz/logspout/

qiang0723 commented Dec 15, 2016

@imriz
I got some errors when testing your image from Docker Hub (https://hub.docker.com/r/imriz/logspout/).
Here are my steps:
1. Stop Logstash; logspout reports errors like this:

2016/12/15 11:03:51 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused
2016/12/15 11:03:53 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused
2016/12/15 11:03:54 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused
2016/12/15 11:03:54 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused

2. Restart Logstash; logspout reports the same errors:

2016/12/15 11:04:15 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused
2016/12/15 11:04:16 syslog: write udp 172.17.0.2:57539->192.168.7.60:5000: write: connection refused

Command:

$ sudo docker run --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    imriz/logspout \
    syslog+udp://logstash-host:5000

Is my setup wrong, or does the image not support this?

imriz commented Dec 15, 2016

@qiang0723 I would try to sniff the traffic with tcpdump. I suspect you are getting an ICMP unreachable packet or similar.

Anyway, this image supports whatever is supported in @robcza's fork.
I am using it like this:
-e ROUTE_URIS=syslog+tcp://logstash:5000?structured_data={{.Container.Config.Env}} -e RETRY_COUNT=16 -e INACTIVITY_TIMEOUT=1m

imriz commented Dec 19, 2016

I did some more digging and found the following items that, at least in my environment, caused logspout to stop logging. This may or may not apply in your environment. We are working with Docker 1.8 (a year old - apparently that is very old for Docker ;) ), so some of the points here, mostly regarding the JSON log driver, might not hold in later versions.

  1. In https://github.com/gliderlabs/logspout/blob/master/router/pump.go#L351 there is a 1-second timeout for the send method. This means that if the reconnect takes more than one second, the log stream will be closed and, from what I witnessed, the logs from that unlucky container will stop.

  2. I observed a case where a container had a large log file (2GB - 4 days' worth of logs), and when the tail option was set to all and a since value was given, the Docker daemon choked. I waited for the stream to start and gave up after 15 minutes. Changing the initial tail from all to 0 solved this for me. On a stream disconnect (inactivity or other), I set the tail to 2000. Do note that this means that whenever logspout starts it won't digest any backlog, even if that option is set. It also means that on any reconnection of the Docker stream (the Docker stream, not syslog), it will only look 2000 events back.

  3. Similar to the previous item: with tail set to a large value (I tried 6000), the Docker daemon might fail to parse the JSON logs. See moby/moby#29511 - I was told to try with a newer Docker version, but that is not possible for me ATM.

  4. I am no Go expert, but it seems to me that the retry mechanism in https://github.com/gliderlabs/logspout/blob/master/adapters/syslog/syslog.go#L104 will not try re-sending the log in that iteration if the error is not "temporary". According to my observations, a disconnect is not considered temporary. I fixed it by resending the log after reconnect() - https://github.com/imriz/logspout/blob/5db6a9fbc9d77e4635be62e86b7f98e1aa9d80bc/adapters/syslog/syslog.go#L143 (a rough sketch of this pattern follows below).
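
For readers following along, a rough sketch of the resend-after-reconnect pattern item 4 describes. The reconnect() method and struct fields here are hypothetical stand-ins, not logspout's actual internals; see the linked fork for the real change.

// If a write fails with an error that is not a temporary net.Error
// (e.g. connection reset), rebuild the connection and re-send the same
// message instead of dropping it.
package syslogretry

import (
	"net"
	"time"
)

type adapter struct {
	conn net.Conn
	addr string
}

func (a *adapter) reconnect() error {
	c, err := net.DialTimeout("tcp", a.addr, 10*time.Second)
	if err != nil {
		return err
	}
	a.conn = c
	return nil
}

func (a *adapter) writeWithRecover(msg []byte) error {
	_, err := a.conn.Write(msg)
	if err == nil {
		return nil
	}
	if nerr, ok := err.(net.Error); ok && nerr.Temporary() {
		// Transient error: retry once on the same connection.
		time.Sleep(time.Second)
		_, err = a.conn.Write(msg)
		return err
	}
	// Non-temporary error: reconnect, then re-send the message that failed.
	if rerr := a.reconnect(); rerr != nil {
		return rerr
	}
	_, err = a.conn.Write(msg)
	return err
}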

Anyway, I fixed all these issues in my fork, and the fixes are present in my Docker image. I have been running for about 24 hours without issues (before, I had 3-4 incidents per day). If you use it, YMMV - please understand the consequences of my changes.

robcza commented Dec 19, 2016

@imriz I gave up fast and am killing the container as soon as any problem occurs. It works for my use case, but it is too aggressive an approach in general.
Your investigation is much appreciated - will you open pull requests with your results? It would help a lot of users.

imriz commented Dec 19, 2016

gaul added a commit to gaul/logspout that referenced this issue Mar 28, 2017

michaelshobbs commented Mar 30, 2017

I just merged #240, which allows you to configure the number of retries. I'm still pondering the indefinite-retry method proposed in #253.

michaelshobbs commented Apr 14, 2017

Potential fix + tests were merged in f18e503
