
GELF Driver only does DNS resolution on startup #17904

Open
chiefy opened this Issue Nov 11, 2015 · 51 comments

chiefy commented Nov 11, 2015

Stay with me here, because I know next to nothing about UDP. We have a cluster of apps running Ubuntu 14.04.3 LTS with docker-compose 1.5.0. Our containers are configured to use the GELF log driver pointed at an internal DNS name, which for this case we'll call internal.logstash.mycompany.org.

Everything with our ELK stack had been running fine for days, but unfortunately Amazon e-mailed us and told us they had to kill our Logstash instance. I fired a new one up, changed the DNS record to point to the new private IP, and started tailing the logstash log to make sure logs were being forwarded.

Nothing.

I ssh'd into one of our app hosts, and noticed the following:

ubuntu@ip-10-10-10-1:~$ netstat | grep udp
udp        0      0 10.10.10.1:35293    10.10.10.2:12201    ESTABLISHED
udp        0      0 10.10.10.1:52176    10.10.10.2:12201    ESTABLISHED
udp        0      0 10.10.10.1:40769    10.10.10.2:12201    ESTABLISHED
udp        0      0 10.10.10.1:50376    10.10.10.2:12201    ESTABLISHED

10.10.10.2 is the IP of the old logstash instance that went down. It doesn't appear to be a DNS caching issue either: when I dig or ping internal.logstash.mycompany.org, it resolves to the new IP.

I just dug around a bit in the source, and it looks like net.Dial is only called once when the container is started, and then the writer is cached (which makes sense). The issue is that the only time the hostname gets resolved by DNS is during that net.Dial, so if the DNS entry changes, the log writer will still be logging to the old IP. Since UDP is connection-less, there is really no way for the writer to know that the connection is broken and re-dial the host.
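A minimal Go sketch (not the actual driver code) of why this happens: net.Dial resolves the hostname exactly once, and the resulting UDP socket stays bound to that IP for its lifetime.

package main

import (
	"fmt"
	"net"
)

func main() {
	// DNS resolution happens here, exactly once; later writes never re-resolve.
	conn, err := net.Dial("udp", "internal.logstash.mycompany.org:12201")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Even if the DNS record now points somewhere else, this write still goes
	// to the IP resolved at Dial time. UDP reports no error, so the caller has
	// no signal that the endpoint is gone.
	_, err = conn.Write([]byte(`{"version":"1.1","host":"app1","short_message":"hi"}`))
	fmt.Println("write err:", err) // usually nil for UDP
}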

docker version: 1.9.0
docker-compose: 1.5.0
host: ubuntu 14.04.3 LTS
cc: @mariussturm

LarsFronius commented Dec 22, 2015

Yep, I second that. Any suggestion on how to implement a fix here? Close the connection every now and then, or do a hostname lookup?

mariussturm (Contributor) commented Dec 22, 2015

We can't do a DNS lookup on every log message; that would be far too expensive. Doing it every Nth message would mean we have to keep state (a counter), which is also not nice. Sorry, no good idea at the moment.

LarsFronius commented Dec 22, 2015

How about time-based? Every 30s sounds like a good default, and we could make it a tunable setting.
Right now, in any non-static infrastructure environment, logging with GELF over UDP is pretty much guaranteed to break at some point.

chiefy commented Dec 22, 2015

Without some way to force the log driver to re-do the lookup, I'd say that GELF logging over UDP is unusable. If you have 10, 20, 100 docker containers running and your logstash instance goes down, then unless you have a static IP you have to restart all X of your containers to get logging going again.

I like @LarsFronius's idea, even if it's just once every 5-10 minutes.

mariussturm (Contributor) commented Dec 22, 2015

Not sure if DNS failover is the right approach at all. What really solves the problem is some kind of HA failover: a virtual IP, or a load balancer with a static IP. But that's all outside the scope of the logging driver.

chiefy commented Dec 22, 2015

AWS ELBs don't work w/ UDP 😢

mariussturm (Contributor) commented Dec 22, 2015

chiefy commented Dec 22, 2015

Not sure if that will work for our use case; we want to keep all traffic inside our VPC, and EIP uses public address space if I am not mistaken?

LarsFronius commented Dec 22, 2015

Let's not go into details about how to achieve a non-dynamic IP in AWS. This issue can happen anywhere (e.g. with DNS round-robin).

imho: DNS has TTLs for a good reason and those should be respected. Not sure how the gethostbyname lookup is implemented in Go; nevertheless, the system should cache the DNS result for as long as the TTL allows and query again once the TTL has expired. Pretty sure this is how DNS is meant to work.

mariussturm (Contributor) commented Dec 22, 2015

The problem is that currently neither the docker daemon nor the gelf library does any buffering. When we reconnect every x minutes, it's likely (haven't done any tests yet) that we lose one or two messages on a busy host during a reconnect. So what is better: losing one message every x minutes, or losing a bunch of messages when your endpoint is really down and you have to switch a virtual IP? :)
I think you can isolate EIPs via security groups so that no connection from the outside is possible.

LarsFronius commented Feb 26, 2016

So, is this now a won't-fix?
The current state is not really appealing to work with.
I'd expect my logger to ship the GELF UDP logs to the configured DNS address, with the lookup respecting the DNS TTL when caching the endpoint.

Here is another "what is better" question: is it better to restart all docker daemons when you change your logging infrastructure, or to have a self-healing system? ;)

Can we agree that we should respect DNS TTLs here and change the logger to reconnect accordingly when changes appear?

mariussturm (Contributor) commented Feb 26, 2016

No, it's not a won't-fix. But the only reasonable solution at this point is to implement TCP in the GELF library so that topology changes can be detected reliably. This is not a trivial change and needs some effort.
If this is not ok for you, we are happy to receive pull requests here: https://github.com/Graylog2/go-gelf

Do you have any references for applications that respect the DNS TTL? I was not able to find any. To my understanding, the TTL is part of the internal DNS protocol and is not used at the application level. I would not go and misuse a protocol to solve a single use case.

lennartkoopmann commented Feb 26, 2016

I agree with @mariussturm. The downsides of UDP (the tradeoff for not having to deal with connections) should be documented, and we'll look into providing TCP support in the future.

Thanks for the detailed report! 👍 :)

pmahoney commented Feb 26, 2016

Do you have any references for applications that respect DNS TTL?

@mariussturm See nginx resolver directive. The documentation states:

By default, nginx caches answers using the TTL value of a response.

Unfortunately, last I checked, Go's stdlib does not expose the TTL from a DNS query. A somewhat hacky solution would be to implement a user-provided timeout, much like nginx's valid parameter described in the linked documentation. Or use a better DNS client that does expose the TTL.

See also this discussion of Firefox, with comments from 2002 all the way to 2016. It's not clear to me whether Linux caches DNS queries itself (or only when going through glibc?).
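As an illustration of the "better DNS client" suggestion, here is a sketch using github.com/miekg/dns (which the docker tree happens to vendor) to read the TTL that net.LookupHost hides; the resolver address and hostname below are placeholders.

package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// lookupWithTTL returns the first A record for name plus the TTL the
// resolver reported for it, which the stdlib resolver does not expose.
func lookupWithTTL(name, resolver string) (string, uint32, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)

	c := new(dns.Client)
	r, _, err := c.Exchange(m, resolver)
	if err != nil {
		return "", 0, err
	}
	for _, ans := range r.Answer {
		if a, ok := ans.(*dns.A); ok {
			// Hdr.Ttl is the remaining TTL as reported by the resolver.
			return a.A.String(), a.Hdr.Ttl, nil
		}
	}
	return "", 0, fmt.Errorf("no A record for %s", name)
}

func main() {
	ip, ttl, err := lookupWithTTL("internal.logstash.mycompany.org", "10.0.0.2:53")
	fmt.Println(ip, ttl, err)
}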

toddlers commented Mar 9, 2016

+1 @LarsFronius

@mariussturm

HAProxy can perform DNS resolution at runtime, in an asynchronous way, and update server IPs on the fly. This is very convenient in environments like Docker or Amazon Web Services where server IPs can change at any time.

HAProxy 1.6 also resolves this!

mariussturm (Contributor) commented Mar 9, 2016

Please test this dev build with the patched GELF logger in your environment and provide feedback!
https://github.com/mariussturm/docker/releases/tag/1.11.0-dev

What it does:

  • Caches the GELF server IP address
  • Resolves the given hostname periodically in the background
  • If, and only if, an IP switch is detected, reconnect to the given GELF address

Considerations:

  • What happens when there is more than one DNS record for the GELF endpoint? Does that already work?
  • What happens when DNS is down?
  • Is it working with IPv6?
  • Why is it not working with /etc/hosts entries?

You need real DNS records to make it work; if this is a problem, use something like dnsmasq.
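For readers who can't run the dev build, this is roughly the strategy the bullet points above describe, as a self-contained sketch (illustrative only, not the actual patch; the address and type names are placeholders):

package main

import (
	"errors"
	"net"
	"sync"
	"time"
)

// reconnectingWriter caches the resolved IP and redials only when a
// background lookup sees the address change.
type reconnectingWriter struct {
	mu   sync.Mutex
	addr string // e.g. "internal.logstash.mycompany.org:12201"
	ip   string
	conn net.Conn
}

// watch re-resolves the hostname periodically. Because ip starts empty,
// the first tick also performs the initial dial.
func (w *reconnectingWriter) watch(interval time.Duration) {
	host, port, err := net.SplitHostPort(w.addr)
	if err != nil {
		return
	}
	for range time.Tick(interval) {
		ips, err := net.LookupHost(host)
		if err != nil || len(ips) == 0 {
			continue // DNS down: keep the cached connection
		}
		w.mu.Lock()
		if ips[0] != w.ip {
			// IP switch detected: reconnect to the new address.
			if c, err := net.Dial("udp", net.JoinHostPort(ips[0], port)); err == nil {
				if w.conn != nil {
					w.conn.Close()
				}
				w.conn, w.ip = c, ips[0]
			}
		}
		w.mu.Unlock()
	}
}

func (w *reconnectingWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.conn == nil {
		return 0, errors.New("gelf: no connection yet")
	}
	return w.conn.Write(p)
}

func main() {
	w := &reconnectingWriter{addr: "internal.logstash.mycompany.org:12201"}
	go w.watch(30 * time.Second) // the tunable default suggested earlier
	select {}                    // keep the demo running
}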

LarsFronius commented Mar 9, 2016

@mariussturm awesome, thank you for going the whole way with this. I think this will improve reliability for a bunch of people. I'll try your release soon, and will try to get to a setup that reproducibly breaks with the old version and self-heals with the new release.

Regarding your considerations:

  • I believe most implementations pick a random one. It would be smart to pick a random one that works when not all of them work.
  • Using a stale entry is better than no entry imho.
  • Will check if I can test that.
  • What OS are you on? I figure there are a bunch of different behaviors around /etc/hosts entries these days. Will check if it works for me; also, it seems Go doesn't check /etc/hosts under all circumstances, e.g. golang/go#6824.
mariussturm (Contributor) commented Mar 14, 2016

@chiefy @pmahoney @toddlers would you mind also testing the dev build?

mariussturm (Contributor) commented May 10, 2016

@thaJeztah no feedback has been provided till now; guess you can close it until it comes up again. Too bad that this was a waste of my spare time.

thaJeztah (Member) commented May 10, 2016

Thanks @mariussturm, and sorry about that. I didn't follow the complete discussion; were improvements made to the GELF driver in 1.11, or no changes?

binman-docker commented Nov 4, 2016

Hi @mariussturm and @thaJeztah

We're seeing this; it's a pretty nasty bug/feature for us. If there's a version of the fix for 1.12, we'd be happy to test it. And we're internal, so we will definitely follow up.

Thanks!

thaJeztah (Member) commented Nov 4, 2016

@mariussturm do you have that branch/commit still around?

mariussturm (Contributor) commented Nov 5, 2016

@binman-docker I have to rebase the changes onto the current master, which will take some time. I'll try to build another test binary next week.

binman-docker commented Nov 7, 2016

Great, thank you!

mariussturm (Contributor) commented Nov 17, 2016

@binman-docker I was testing the old go-gelf Dynamic-DNS-refresh branch against the current Docker master. Unfortunately, Docker now implements its own DNS and caching stack, so the proposed solution no longer works. DNS updates are now cached by the daemon and not exposed to the GELF library anymore. Some related issues exist: #23910

Sorry, can't help with this anymore.

thaJeztah (Member) commented Nov 17, 2016

@mariussturm the daemon should still use the DNS server that was configured through --dns directly (i.e. it does not use the embedded DNS that is used for service discovery on custom networks). I am not sure about a DNS cache (no change that I am aware of, but I can ask).

mariussturm (Contributor) commented Nov 17, 2016

@thaJeztah I tried the --dns option already; it didn't work out. Would be good if you could figure something out.

mavenugo (Contributor) commented Nov 17, 2016

@sanimej can you please help with this?

mariussturm (Contributor) commented Nov 17, 2016

In the end we are calling net.LookupHost() (https://github.com/Graylog2/go-gelf/blob/Dynamic-DNS-refresh/gelf/writer.go#L133) in the GELF library. Is there any chance that this call is somehow cached by systems like vendor/github.com/miekg/dns or vendor/github.com/docker/libnetwork? If not, I could do some deeper debugging of the GELF code.

cpuguy83 (Contributor) commented Nov 17, 2016

@mariussturm libnetwork only handles container DNS; any lookups the docker daemon does are handled with the normal semantics on the host.

Are you using docker4mac beta, by chance?

mariussturm (Contributor) commented Nov 17, 2016

Yes, I am on OSX. Docker 1.12.1 build 12133

cpuguy83 (Contributor) commented Nov 17, 2016

@mariussturm Can you upgrade to the latest?

mariussturm (Contributor) commented Nov 17, 2016

Sure, will do some more testing, and will also double-check on Linux. Thanks for the information so far!

cpuguy83 (Contributor) commented Nov 17, 2016

I know for a fact there are some DNS issues in beta30, which they just released a fix for today.
Normal resolution works, but resolution using the Go resolver did not.

cpuguy83 (Contributor) commented Jan 20, 2017

I'm going to re-open this because it is still a real issue.
As I see it, we need some way to healthcheck the gelf service at the resolved address, or to handle DNS resolution in the library.

cpuguy83 reopened this Jan 20, 2017

shouze (Contributor) commented Feb 15, 2017

Ok guys, we may be experiencing some UDP packet loss using the GELF log driver.

After reading the excellent article by @jpetazzo: to sum up, this UDP packet loss issue can have two possible origins.

1. UDP buffer

This could happen if your sender (Docker) is faster than your receiver (Logstash/Fluentd), which is why we mentioned a queue earlier: the queue will allow the receiver to drain the UDP buffer as fast as possible to avoid overflows. Combine that with a large enough UDP buffer, and you’ll be safe.

Can you guys confirm that this kind of kernel tweak is enough?

sudo sysctl -w net.core.rmem_max=26214400

And that I can check /proc/net/snmp: if no errors are reported in the SndbufErrors column, then the only other possibility is the second origin (the DNS one)?

Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 410 0 0 539 0 0 0 0
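For what it's worth, that check can be scripted; here is a small Go sketch (illustrative, unrelated to the driver itself) that prints each UDP counter from /proc/net/snmp by name:

package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

func main() {
	// /proc/net/snmp lists each protocol twice: one header line with
	// counter names and one line with the matching values.
	data, err := ioutil.ReadFile("/proc/net/snmp")
	if err != nil {
		panic(err)
	}
	var names, values []string
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Udp:") {
			if names == nil {
				names = strings.Fields(line)[1:]
			} else {
				values = strings.Fields(line)[1:]
			}
		}
	}
	if len(values) != len(names) {
		panic("unexpected /proc/net/snmp format")
	}
	for i, name := range names {
		fmt.Printf("%s = %s\n", name, values[i]) // look for SndbufErrors > 0
	}
}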

2. DNS

I haven't yet tried running a socat (or a logstash) locally.

jillesvangurp commented Apr 5, 2017

This is still broken; I currently have the shitty job of having to restart ALL of our docker hosts just because docker perpetually caches the logging endpoint.

cpuguy83 (Contributor) commented Apr 5, 2017

@jillesvangurp Would love your help/insight to fix this in the gelf library. It's a problem for all (golang) consumers of GELF.

jillesvangurp commented Apr 5, 2017

@cpuguy83 see also @LarsFronius's comments above (he was working with us at the time he made those comments).

We have our logstash nodes running in a CloudFormation stack with an autoscaling group and a Route 53 DNS entry "logs-internal.inbot.io" that holds one IP for each of the logstash nodes in the ASG.

Ideally, the driver should regularly fetch the list of IPs and round-robin over them. DNS TTL exists for a reason, as @LarsFronius pointed out, and IP addresses changing is expected behavior in any kind of microservices-based architecture, which is sort of a core use case for Docker.

cpuguy83 (Contributor) commented Apr 5, 2017

The problem isn't that DNS TTLs are not being respected (not directly, anyway). It's that once the connection to the gelf server is established, the connection is cached (which is a normal thing to do).

With a cached connection there needs to be a way to healthcheck the service. TCP tends to handle this sort of well, because if the server goes away you'll know the very next time you try to transmit something. UDP does not handle this case, so it needs to be handled at the application layer.

Now, explicitly checking DNS TTLs and re-resolving and rebuilding the connection after the TTL has expired is something that can help the situation, but it only really helps once the TTL has expired.

jillesvangurp commented Apr 5, 2017

@cpuguy83 in that case the problem is either that it doesn't recover from the connection timing out by reconnecting (pretty bad IMHO), or that it tries to connect to the no-longer-valid IP by not doing the DNS resolution when it times out, which would be a DNS caching issue.

In any case, we typically set the TTL pretty low.

mariussturm (Contributor) commented Apr 5, 2017

@jillesvangurp there is no such thing as a connection in a UDP transmission. The advantage of not blocking your docker daemon (which can happen with all TCP-based logging drivers) comes at the price of not doing a reconnect. The library would need to actively monitor the state of the endpoint, which is not trivial to implement. Like @cpuguy83 said, random reconnects may improve your situation, but they don't solve the real problem.

cpuguy83 (Contributor) commented Apr 5, 2017

Also, the lower the TTL, the more frequently one would have to do lookups, which is expensive. It might help if lookup handling could be shared across all containers using gelf.

jillesvangurp commented Apr 5, 2017

As @mariussturm points out, GELF is indeed UDP. So if there are no connections, there are no reconnects, nor any cached connections as @cpuguy83 suggested.

Which brings us back to the original analysis, and indeed the title of this ticket: "GELF Driver only does DNS resolution on startup". That's broken behavior in any kind of cloud environment, where the whole point is that things are constantly failing over or being replaced.

I'm not a Go programmer, but I'm sure it has decent facilities for UDP networking and for dealing with things like DNS. This is a minor headache in Java as well, where you have to explicitly choose not to cache forever via the security manager (a rookie mistake not to fix on any server setup). We typically set ours to a few seconds. The overhead of occasionally doing a DNS lookup is negligible; OSes are pretty good at that kind of thing.

cpuguy83 (Contributor) commented Apr 10, 2017

btw, support for logging plugins was just merged. Iterating on the right behavior via a plugin may be a simpler solution than changing docker, at least until the correct/acceptable behavior is reached and we can pull in the change.

acehege commented Apr 12, 2017

@jillesvangurp I had the exact same problem as you, and it ended with me just running a logstash container on each host, "fixing" the problem.

My environment is pretty static due to the fact that I run on Docker Enterprise.

In my opinion: do not use the GELF driver against endpoints outside the host. I have had way too many issues with it, and Docker support fails horribly when you have issues with it.

Links

https://jpetazzo.github.io/2017/01/20/docker-logging-gelf/

Quote

Let’s see how to solve our logging problem.
The easiest solution is to restart our containers whenever we need to “reconnect” (technically, resolve and reconnect). It works, but it is very annoying.

A slightly better solution is to send logs to 127.0.0.1:12201, and then run a packet redirector to “bounce” or “mirror” these packets to the actual logger; e.g.:

socat UDP-LISTEN:12201 UDP:gelf.container.church:12201
This needs to run on each container host. It is very lightweight, and whenever gelf.container.church is updated, instead of restarting your containers, you merely restart socat.

jillesvangurp commented Apr 12, 2017

@acehege I ended up using filebeat and downgrading docker logging to the more limited json-file driver (less metadata in the json). A nice side effect is that it doesn't destroy all the evidence if filebeat fails to deliver the logs, since the rotating log files it tails remain on the host. Also, the filebeat agent is a bit less memory-hungry than logstash.

There's also the docker beat agent, which you can run as a container. Unfortunately that is only for metrics and not for container output. Although it does make you wonder whether that isn't a viable option long term: if you can simply deploy a log agent container to each docker host, the host setup can be simplified a lot.

In any case, I'm looking forward to more plugin-friendly logging, as @cpuguy83 mentioned; the current infrastructure is a bit limited for me. IMHO, running docker without decent centralized log aggregation is unacceptable in any production environment. Our environments are quite dynamic, with very few hosts being around forever, and being able to drill down to whatever host and inspect what is happening in syslog, docker, and other places is crucial when there are issues to diagnose.

ketchoop commented May 11, 2017

I'm not sure this is the right topic to discuss my troubles with this DNS resolution behaviour, but I've got a lot of headaches when logstash fails. I have a really dynamic environment, and when logstash fails it's deregistered from Consul DNS; after that I can't do anything with other containers that use the gelf logging driver until I bring the logstash container back up. I have some thoughts about how to work around it, but I have no idea how to deal with it in a clean, good way.

chiefy commented May 13, 2017

@ket4yii the maintainers don't want to add an option to do a DNS lookup every nth message. Not saying it's the best solution, but it's something.

ketchoop commented May 13, 2017

@chiefy My problem is only with starting containers when the gelf log consumer has failed :). Why does a container have to fail to start when the DNS lookup fails? It would be better to write this error to the log and allow the container to start.

chiefy commented May 15, 2017

@ket4yii yeah, I believe that's a docker issue that's been brought up a few times, unfortunately.
