DNS: timeout on resolving address. #40

Closed
tomsommer opened this issue Mar 8, 2018 · 43 comments · Fixed by #45

Comments

@tomsommer

tomsommer commented Mar 8, 2018

After updating to 3.0.7, the plugin starts spamming DNS: timeout on resolving address. {:field=>"host", :value=>"172.20.54.3"} errors in the logs after some time.

If I do a manual lookup on the indexer, the IP resolves fine. So it seems like some kind of cache or memory buffer is breaking lookups permanently. It seems after 1 timeout, no new DNS queries are even made (I checked with tcpdump).
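(For reference, a capture along these lines is enough to confirm whether any DNS queries are leaving the host at all; the exact interface and filter used for the original check are not stated, so this is just an illustrative command:)

tcpdump -ni any port 53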

If I restart the indexer, the plugin works fine for a while until it breaks again.

Config:

    dns {
      reverse => ["host"]
      action => "replace"
    }

@tomsommer tomsommer reopened this Mar 8, 2018
@jsvd jsvd added the bug label Mar 8, 2018
@jsvd
Contributor

jsvd commented Mar 8, 2018

which version of logstash is being used with version 3.0.7?

also, are you pointing the filter to an internal or local dns server? do you have something like dnsmasq installed?

are there any other network related exceptions being logged?

@tomsommer
Author

tomsommer commented Mar 9, 2018

logstash-6.2.2-1.noarch

The filter just uses the system's resolvers, which point at PowerDNS running on "remote" internal servers.
There are no other exceptions at all.

This problem also clogs the pipeline, since all requests are suddenly marked as timed out and retried until they fail (adding +1 second per event).
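(For reference, the per-event penalty comes from the filter's timeout and retry settings; a minimal sketch with illustrative values, not the configuration actually in use:)

    dns {
      reverse => ["host"]
      action => "replace"
      timeout => 0.5      # seconds per lookup attempt (illustrative value)
      max_retries => 2    # additional attempts after a timeout (illustrative value)
    }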

andrewvc added a commit to andrewvc/logstash-filter-dns that referenced this issue Mar 9, 2018
@jake2184

I was seeing this behaviour on 3.0.7 (Logstash 6.2.3).

I have upgraded to v3.0.9, and am still seeing the behaviour - I get many DNS: timeout on resolving address errors for a valid, resolvable IP. This is causing backlogs, as Logstash cannot process the events as quickly as they are coming in (<100 a second).

I suspect that if the solution above was to introduce a cache, it would not solve the problem - I am using this to process proxy logs and have >100 IPs being resolved, so I can believe it is overflowing it.

I am also running PDNS servers, so I will investigate there as well.

@nithril

nithril commented Jun 8, 2018

Same issue with 3.0.9. It keeps timing out until I restart Logstash.

@nithril

nithril commented Jun 8, 2018

Logstash is using JRuby 9.1.13.

Could it relate to this issue:
jruby/jruby@ada05fa#diff-6d2422fab9324bf20363957b293b3658 ?

It may explain why the DNS resolution works properly for some time and then blocks.

@jsvd
Contributor

jsvd commented Jun 8, 2018

very good find @nithril !
Is there any chance you could apply that patch (https://github.com/jruby/jruby/commit/ada05fa601c29ff5491fb53434053aa597ccc97b.patch) and confirm the issue is gone?

[edit] I tested that the patch applies cleanly to logstash 6.2.4:

/tmp/logstash-6.2.4/vendor/jruby % wget https://github.com/jruby/jruby/commit/ada05fa601c29ff5491fb53434053aa597ccc97b.patch
/tmp/logstash-6.2.4/vendor/jruby % patch -p1 < ada05fa601c29ff5491fb53434053aa597ccc97b.patch
patching file lib/ruby/stdlib/resolv.rb
/tmp/logstash-6.2.4/vendor/jruby %

@nithril

nithril commented Jun 8, 2018

Patch applied. I will be able to check next week.

@nithril

nithril commented Jun 8, 2018

Unfortunately it does not solve the issue. The FD count remains stable, no leak.

@jsvd
Contributor

jsvd commented Jun 8, 2018

@nithril can you confirm the exact versions that trigger this? we know that >= 3.0.7 on logstash 6.2.2 triggers, but it'd be great if we could narrow this down.

For example, version 3.0.7 didn't add anything new, so the issue isn't coming from it. It'd be great if we, using the same Logstash version, could find a dns filter version < 3.0.7 where the timeouts don't happen.

Also, you may want to know that version 3.0.10 fixes a serious concurrency issue that greatly(!) speeds up the performance when the cache is used.

@jake2184

jake2184 commented Jun 8, 2018

I'm happy to try some older versions, but it's hard to trigger - it seems to start occurring after an indeterminate amount of time.

@nithril

nithril commented Jun 8, 2018

The dns filter version is 3.0.9.

All the timeout errors come from Timeout::Error and not ResolvTimeout.

The interesting part is that the CPU is stuck at 100% (one core), and jstack seems to confirm that the process is stuck in the DNS.allocate_request_id loop.
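(For context, the request-id allocation in the stdlib resolv.rb looks roughly like the following; this is a simplified sketch, not the exact vendored source. If ids for a given key are never released, the inner loop can never find a free 16-bit id and spins forever, which would match the 100% CPU symptom:)

# Simplified sketch of Resolv::DNS.allocate_request_id (not the vendored source).
# RequestID and RequestIDMutex are module-level tables defined elsewhere in resolv.rb.
# Each (host, port) key has a table of in-use 16-bit ids; the loop keeps drawing
# random ids until it finds one that is not already taken.
def self.allocate_request_id(host, port)
  id = nil
  RequestIDMutex.synchronize {
    h = (RequestID[[host, port]] ||= {})
    begin
      id = rand(0x10000)    # 65536 possible ids
    end while h[id]         # never terminates once every id is marked in use
    h[id] = true
  }
  id
end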

@jsvd
Contributor

jsvd commented Jun 8, 2018

is stuck in the DNS.allocate_request_id loop

if this is the case then I suggest attempting another monkeypatch to resolv.rb to see if the problem goes away. While the commit you mentioned addressed the leak, there were about 4 other commits that touched the structure of resolv.rb and also mention a leak.

@nithril if you could just replace the local logstash resolv.rb file with https://raw.githubusercontent.com/jruby/jruby/ruby-2.5/lib/ruby/stdlib/resolv.rb we'd be able to check if the latest jruby resolv.rb fixes this problem.

Thanks for your patience so far.

@nithril

nithril commented Jun 8, 2018

The resolution does not work, not sure why. Is there a way to display JRuby compilation/interpreter errors?

@colinsurprenant
Contributor

colinsurprenant commented Jun 8, 2018

I have yet to be able to reproduce that problem locally. I tried on:

  • master with latest dns filter (3.0.10)
  • 6.2.2 with 3.0.7
  • 6.2.2 with 3.0.9
  • 5.6.3 with 3.0.6

My test setup is this:

input {
  stdin {
    codec => line
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}"}
  }
  dns {
    reverse => ["clientip"]
  }
}
output {
  stdout { codec => dots }
}
and launching Logstash with:
$ cat apache_log | bin/logstash -f dns.conf

What I see is that there are a lot of errors like

 DNS: timeout on resolving address. {:field=>"clientip", :value=>"xxx"}

But I am assuming these are somewhat normal; more importantly, at no time do I see the pipeline stop.

@nithril @jake2184 it would be great if you could run the same test with the same data and report what you see here. On my side I will try with LS 5 to see if the same timeouts actually occur.

@colinsurprenant
Contributor

I added "5.6.3 with 3.0.6" above, which shows a similar behaviour.

@jake2184

The pipeline doesn't stop completely; it's still running and logging failures. The issue is that, due to the endless timeouts, it cannot keep up with the incoming data stream. In my case this is only around 20-30 events a second. The queue (I have the persistent queue configured for Logstash) then grows until I restart Logstash.

It's worth noting that I don't believe this is caused by an actual DNS timeout, as hinted at by @nithril . We run a high availability internal DNS system, and are only resolving internal IP addresses. If we had anywhere near as many 'timeouts' as Logstash is recording we would be having many other problems all over our system.

@colinsurprenant
Contributor

@jake2184 thanks for the clarifications - I was in fact looking for a pipeline stall.
So, concentrating on the timeouts only, note that I basically saw the same behaviour between LS 5.6.3+DNS 3.0.6 and LS 6.2.2+DNS 3.0.7/3.0.9.

To help correctly diagnose and/or test for potential solutions I would like to find a combination that does not show the timeout behaviour using my minimal setup + data as shared above.

Do you know of an LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally, but knowing it produces the same result on your side will help.)

@jake2184

I am also getting 'DNS: timeout on resolving address' when DNS returns NXDOMAIN. Does this mean it isn't accepting NXDOMAIN as a valid response, so it retries?

Apologies for not being too helpful with the debugging, I've had a look at the Ruby source for the filter & resolv library, but I don't know Ruby so can't understand if there are any problems.

@colinsurprenant
Contributor

@jake2184 I am trying locally and NXDOMAIN is handled correctly; it does not result in a timeout.

You really don't need to know Ruby to help here; the important thing for me at this point is to be able to reproduce the problem, and also to reproduce conditions where the problem does not occur. For that I'll repeat my question:

Do you know of an LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally, but knowing it produces the same result on your side will help.)

@jake2184

I've taken a bit of time to set up a test environment - sorry, I should have done this earlier given you're so willing to help.

I am using my own config file and data, editing the input/output to be stdin, and removing the DNS filter cache parameters. I am now able to reliably cause the problem. As indicated in the above bug report, I am almost certain that something is not being freed in resolv.rb. I am able to process just under 65536 events before the pipeline stops processing any more and prints error messages. The exact number (of dots from the output codec, which I assume is accurate) is around 64100 - 64400, so roughly 65536 - 1024?

I have tested the following versions:
LS 5.6.3 + DNS 3.0.6 - No problem
LS 6.2.1 + DNS 3.0.7 - Problem
LS 6.2.3 + DNS 3.0.7/3.0.9 - Problem

Manual fixes I have attempted to LS 6.2.3 and DNS 3.0.9 that have not worked:

  • Copying the resolv.rb from LS 5.6.3. This is for jruby 1.9
  • Copying the dns.rb from DNS 3.0.6. The differences between 3.0.6 and 3.0.9 appear to only be the failed cache, so I don't think the problem is within the Logstash filter, but in resolv.rb (or associated libraries).
  • Manually applying the one-line fix mentioned in the jruby patch above
  • Using the latest jruby version (2.5) of resolv.rb

@jake2184

I've had a quick look at the Apache log you are testing. It has <40,000 unique IP addresses, so it's possible you're not doing enough DNS lookups to trigger the error - do you have any form of caching enabled?

@colinsurprenant
Contributor

@jake2184 thanks a lot for your followup! This is super useful.

It has <40,000 unique IP addresses, so it's possible you're not doing enough DNS lookups to trigger the error

Oh, this is interesting. OK, I will build a file with >64k unique IP addresses, use that, and report back my findings.

if you have any form of caching enabled?

Not at the filter level - but I locally installed dnsmasq with the default configuration, which I assume does caching. I guess that should not be relevant, since the dns filter still has to call the resolver, which will hit dnsmasq.

@colinsurprenant
Contributor

This is really mysterious. I built a file with ~3M unique IPs, one per line and ran this test:

$ cat .uniq_ips.txt | bin/logstash -e 'filter{dns{reverse=>["message"]}} output{stdout{codec=>dots}}'

and at the same time measured the pipeline throughput using the measure_performance.rb script in https://github.com/elastic/logstash-benchmark-tools/tree/master/pq_blog and observed the following:

  • LS master + DNS 3.0.10 - continuous stream of between 125 and 500 EPS, with occasional drops to zero depending on the number of timeouts.
  • LS 6.2.3 + DNS 3.0.9 - similar throughput behaviour
  • LS 5.6.3 + DNS 3.0.6 - overall slower but still steady behaviour (more drops to zero)

All my tests processed a few hundred thousand events to make sure I am well over the 64k mark. Note that the config above does not use any caching.

This suggests that something else is really going on. Is it my usage of dnsmasq locally? Could it be OS related (I am running on an MBP + OSX 10.12.6)? Could it be the JVM version (I am running 1.8.0_144)?

Could it also be caching related? I will try another testing round with extreme caching enabled to make sure those caches grow huge. Something like:

hit_cache_size => 100000
hit_cache_ttl => 10000
failed_cache_size => 100000
failed_cache_ttl => 10000

@colinsurprenant
Contributor

Damn - I just realized that my IP dedupping was wrong and my IPs are not unique. Fixing that and re-running tests.

@colinsurprenant
Contributor

My test file now contains 68648 unique IPs.

My test config is:

cat ips.txt | bin/logstash -e 'filter{dns{reverse=>["message"] hit_cache_size => 0  failed_cache_size => 0}} output{stdout{codec=>dots}}'

So basically this is without any caching.

  • LS 5.6.3 + DNS 3.0.6 - quite slow, but steadily slow at 2166s to process the 68648 IPs, which is ~31 EPS.
  • LS 6.2.3 + DNS 3.0.9 - steadily slow at 895s for 68648 IPs, which is ~76 EPS.
  • LS 6.3.0 + DNS 3.0.9 - steadily slow at 665s for 68648 IPs, which is ~103 EPS.
  • LS master + DNS 3.0.10 - steadily slow at 864s for 68648 IPs, which is ~79 EPS.

Overall the EPS differences here are not really important, in the sense that the same behaviour was witnessed in all these scenarios: the pipeline was steadily processing, with many periods of 0 EPS because of the DNS timeouts, but it never completely stalled.

So my conclusion is similar to the one above: I can't seem to reproduce the stalling as something happening after some time.
@jake2184 you mentioned earlier that what you witness is not a complete stall but a heavy slowdown at some point. It seems that the numbers you gave are similar to the numbers I see in my tests. What confuses me is that all my tests are actually behaving the way you describe your problem, but for me it doesn't happen after some time - it is steadily very slow.

I can't reproduce a scenario that works fast for a while. Is that normal because many of my IPs are in fact "normally" timing out? Should I try with only IPs known to be valid? I will check that.

@jake2184

I ran my same tests again (LS 6.2.3 + DNS 3.0.9), and measured using the script above. The output was:
output.txt

As you can see, I can rapidly do many DNS lookups for the first ~60K events, and then there is a drastic drop-off. It looks like it does still process events in intermittent bursts, perhaps as whatever resource is tied up is freed? However, it can process 0 events for 30 seconds at a time.

Version Info:
CentOS 7.2
OpenJDK Java 1.8.0_111

I don't think it's an individual system config problem, as I see the issue on both our production system and the test one I am running. I get the feeling it is going to be tricky for you to help much if you cannot reproduce the problem! Very grateful for any and all time spent assisting.

@colinsurprenant
Contributor

I figured out last night how to test that - stay tuned 🤔

@colinsurprenant
Contributor

colinsurprenant commented Jun 15, 2018

BOOM 💥 I can reproduce.

Here's what I did:

  • reconfigured my local dnsmasq to create a synthetic domain for IPs in 10.0.0.0/8, which means that dnsmasq will resolve any 10.x.x.x address within my configured synthetic domain.
  • created a simple ipgenerator input plugin which generates a new event with the next IP address in sequence, starting from an initial IP address.

This setup allows me to generate an exact number of always-resolvable IP addresses.
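(For anyone wanting to replicate the dnsmasq side, something along the lines of the following should work; the domain name and prefix here are made up for illustration, not the exact settings used:)

# /etc/dnsmasq.conf - hypothetical example
# Answer A/PTR queries for the whole 10.0.0.0/8 range with generated names
# such as ip-10-1-2-3.dns.test, so every generated address resolves.
synth-domain=dns.test,10.0.0.0/8,ip-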

The result: after EXACTLY 65534 unique address resolutions, the resolver systematically times out.

Now that I have a reproduction setup, let's work on the solution. I will now officially re-open this issue.

@colinsurprenant
Contributor

Good news.

I can reproduce this problem outside of logstash using my dnsmasq setup above and this simple script:

require "resolv"
require "ipaddr"

resolv = Resolv.new
ip = IPAddr.new("10.0.0.1")

i = 0
loop do
  puts("#{i}, #{ip.to_s}, #{resolv.getname(ip.to_s)}")
  ip = ip.succ
  i += 1
end

It fails on JRuby versions 9.1.13.0 to 9.1.15.0 and is fixed in 9.1.16.0.

As suggested by @jsvd above, the fix IS indeed this patch, as seen in the JRuby diff on lib/ruby/stdlib/resolv.rb:

lib/ruby/stdlib/resolv.rb
@@ -776,7 +776,7 @@ def sender(msg, data, host, port=Port)
           sock = @socks_hash[host.index(':') ? "::" : "0.0.0.0"]
           return nil if !sock
           service = [IPAddr.new(host), port]
-          id = DNS.allocate_request_id(host, port)
+          id = DNS.allocate_request_id(service[0], service[1])
           request = msg.encode
           request[0,2] = [id].pack('n')
           return @senders[[service, id]] =

⚠️ Now, this patch cannot be applied to vendor/jruby/lib/ruby/stdlib/resolv.rb, because that is not the copy used by the Logstash runtime; the classes actually loaded come from logstash-core/lib/jars/jruby-complete-9.1.13.0.jar.

The correct fix will be to upgrade to a JRuby version >= 9.1.16.0, but until we can do that I will probably be looking at creating a monkey patch on resolv.rb.
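(A monkey patch here would essentially mean re-opening the stdlib class at plugin load time and redefining the method with the corrected line; a rough sketch of the idea, not the actual change that ended up in #45:)

# Rough sketch of the monkey-patch idea (not the actual patch shipped in #45).
# Re-open the UDP requester and redefine #sender with the one-line fix from the
# JRuby diff above, so the request id is allocated under the same key it is
# later released with.
require "resolv"
require "ipaddr"

class Resolv::DNS::Requester::UnconnectedUDP
  def sender(msg, data, host, port = Resolv::DNS::Port)
    sock = @socks_hash[host.index(':') ? "::" : "0.0.0.0"]
    return nil if !sock
    service = [IPAddr.new(host), port]
    id = Resolv::DNS.allocate_request_id(service[0], service[1]) # the fixed line
    request = msg.encode
    request[0, 2] = [id].pack('n')
    @senders[[service, id]] = Sender.new(request, data, sock, host, port)
  end
end

Because a patch like this redefines the method on the already-loaded Resolv classes at runtime, it does not matter that the original source file lives inside the jruby-complete jar.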

@nithril

nithril commented Jun 15, 2018

Now, this patch cannot be applied to vendor/jruby/lib/ruby/stdlib/resolv.rb, because that is not the copy used by the Logstash runtime; the classes actually loaded come from logstash-core/lib/jars/jruby-complete-9.1.13.0.jar.

All this effort just because we patched the wrong file. Thank you @colinsurprenant 👍

@colinsurprenant
Contributor

colinsurprenant commented Jun 15, 2018

submitted #45 and also elastic/logstash/issues/9750

@colinsurprenant
Contributor

colinsurprenant commented Jun 16, 2018

@nithril yeah really - I wish I had clued in earlier. At least now we have a solid setup to reproduce and stress-test any scenario with this filter, and it also gave me an idea on how to make the input generator plugin more flexible.

@jake2184

Brilliant! Thanks for all your help + time.

When can we expect the pull request to be merged and released (for the plugin)?

@jsvd
Contributor

jsvd commented Jun 18, 2018

@jake2184 the PR is being reviewed, progress is visible there

@colinsurprenant
Contributor

logstash-filter-dns v3.0.11 is published. @jake2184 @nithril would love to know if that works for you.

@nithril

nithril commented Jun 18, 2018 via email

@jake2184

Can confirm this fixes my test - I can happily resolve 5K+ events a second :)

Thanks very much

@colinsurprenant
Contributor

Great stuff! Thanks for the update @jake2184

@berglh

berglh commented Apr 24, 2019

@colinsurprenant I'm seeing this behaviour on logstash-filter-dns v3.0.12 in Logstash 6.7.1. Here are my DNS filter configurations; there are a few more instances of the plugin running in other config files.

2-general/400000-ad-smb.conf:      dns {
2-general/400000-ad-smb.conf-        resolve => ["[@metadata][destination][address]"]
2-general/400000-ad-smb.conf-        action => "replace"
2-general/400000-ad-smb.conf-        failed_cache_size => 1000
2-general/400000-ad-smb.conf-        failed_cache_ttl => 60
2-general/400000-ad-smb.conf-        hit_cache_size => 10000
2-general/400000-ad-smb.conf-        hit_cache_ttl => 300
2-general/400000-ad-smb.conf-      }
--
2-general/400000-ad-smb.conf:            dns {
2-general/400000-ad-smb.conf-              reverse => ["[@metadata][ip]"]
2-general/400000-ad-smb.conf-              action => "replace"
2-general/400000-ad-smb.conf-              failed_cache_size => 1000
2-general/400000-ad-smb.conf-              failed_cache_ttl => 60
2-general/400000-ad-smb.conf-              hit_cache_size => 10000
2-general/400000-ad-smb.conf-              hit_cache_ttl => 300
2-general/400000-ad-smb.conf-            }

I'm wondering if the multiple instances are somehow conflicting. After a certain time, almost all the lookups time out, slowing down the pipelines significantly. I'm also wondering whether the hit cache is getting so large that the 500 ms (default) timeout is hit before the filter actually tries our resolving name servers. Anyway, I'm going to try downgrading to logstash-filter-dns v3.0.11 to see if I can replicate it there. Another option would be to avoid using the caches altogether.

@colinsurprenant
Contributor

@berglh thanks for the heads-up. In fact we've seen other reports of this, which seems like a regression in 6.7+.
I am trying to reproduce, but without success so far. It does not look like it is necessarily related to the same 64k problem previously solved here.

Will report back when I find more.

@berglh

berglh commented Apr 25, 2019

@colinsurprenant I've tested some more this morning and I tried the following:

  • Removed failed & hit caches from the config
  • Downgraded the plugin to 3.0.11
  • Returned to Logstash 6.5.1

The behaviour continued with the first two changes on LS 6.7.1.
Things seem stable again back on LS 6.5.1, but I wonder if the random crash that was fixed in a newer version of the plugin might be what caused the occasional issue with my pipeline completely stopping (without a noticeable error that I could see).

Some other notes about my environment:

  • I'm using the official Logstash containers
  • I use a caching-only BIND server, configured at the Docker container level, to service Logstash's requests. It resolves the "failed" requests every time I double-check the service.

Do you want me to open a new issue or comment on an open one?

@colinsurprenant
Contributor

@berglh Thanks for the followup.

I believe I found the source of the problem. I will need to validate tomorrow, but there seems to be a leak (a request-id cache fills up, which leads to an infinite loop that is aborted by the timeout) in the new version of the resolver code that shipped starting with 6.7. For now, as you noticed, the only workaround is to downgrade to 6.6 (6.6.2 as of now); just downgrading the dns filter will not help.

@colinsurprenant
Contributor

@berglh Yeah, we should open a new issue and follow up there...
