Reply:
When I try the rhybar.cz name, it does not fail for me. It gives SERVFAIL with the validation failure, but not the TCP failure. For the other name, the response is too slow, and I get this log message (at high verbosity): debug: tcp took too long, dropped
The newer unbound releases have gradual TCP management: when TCP is overloaded, the TCP timeout is reduced. The second name has a nameserver that times out on queries, so lookups take too long and the TCP connection is dropped. Eventually that produces a SERVFAIL in the cache, which then answers quickly until the entry expires and unbound attempts to look up the data again.
Could it be that the nameserver for the cz name occasionally goes down, and then unbound has to look it up again, which takes a long time? TCP timeouts are shorter now than they were before (and configurable, I believe), and then the TCP connection is dropped.
A FIN 200 ms after the query looks like the TCP timeout configured with tcp-idle-timeout. Under load, that value is reduced to 1% and then 0.2% of the configured value, with a minimum of 200 ms. It looks like the server is full on TCP, so the shorter timeout is used, while the lookup itself is taking too long.
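If the reduced idle timeout under TCP load is the cause, the relevant unbound.conf knobs can be raised. A minimal sketch (the values are illustrative, not recommendations for this deployment):

```
server:
    # Milliseconds an incoming TCP connection may sit idle; under load
    # unbound scales this down (to 1%, then 0.2%, with a 200 ms floor).
    tcp-idle-timeout: 30000
    # Number of simultaneous incoming TCP buffers; raising it delays the
    # point at which the server is "full on TCP" and shortens timeouts.
    incoming-num-tcp: 50
```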
Version: 1.9.2
Short description: Sporadically, TCP-based lookups for records that result in SERVFAIL cause a FIN to be sent early to the client instead of a SERVFAIL followed by a FIN. Subsequent queries for the same record then receive the expected SERVFAIL response code and normal TCP flow until the cache entry times out.
Steps to reproduce (unbound on port 5000):
[root@test log]# kdig @127.0.0.1 -p 5000 +tcp +retry=1 rhybar.cz
;; WARNING: can't receive reply from 127.0.0.1@5000(TCP)
;; WARNING: failed to query server 127.0.0.1@5000(TCP)
[root@test log]#
[root@test log]# kdig @127.0.0.1 -p 5000 +tcp +retry=1 rhybar.cz
;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 22768
;; Flags: qr rd ra; QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 0
;; QUESTION SECTION:
;; rhybar.cz. IN A
;; Received 27 B
;; Time 2019-08-23 20:14:34 UTC
;; From 127.0.0.1@5000(TCP) in 0.1 ms
[root@test log]#
From unbound.log:
Aug 23 20:14:30 res310 unbound: [10242:1] info: validation failure rhybar.cz. A IN
Aug 23 20:14:30 res310 unbound: [10242:1] info: validation failure rhybar.cz. NS IN
Long description:
This has been observed with many other SERVFAIL responses, though rhybar.cz tends to fail much more predictably and has a validation failure. Another randomly picked SERVFAIL test name that has been used, with less consistently reproducible results: ns.hesjptt.net.cn
Packet captures were taken of these transactions, and they show a FIN sent by unbound roughly 200 ms after receipt of the query on the TCP session.
If this is a configuration issue, advice on what we should look at would be welcome.
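The early FIN can also be checked without kdig. Below is a minimal Python sketch that sends a raw DNS query over TCP (two-byte length prefix, per RFC 1035) and reports whether the server closed the connection before answering. The 127.0.0.1:5000 address matches the report; the query builder is a simplified illustration, not a full DNS implementation:

```python
import socket
import struct

def build_query(name, qtype=1, qid=0x1234):
    """Build a minimal DNS query packet: RD flag set, one question (A/IN)."""
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        struct.pack("B", len(label)) + label.encode()
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)

def tcp_lookup(host, port, name, timeout=5.0):
    """Send the query over TCP with the two-byte length prefix and report
    whether the server sent a FIN before delivering a complete reply."""
    pkt = build_query(name)
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(struct.pack(">H", len(pkt)) + pkt)
        prefix = s.recv(2)
        if len(prefix) < 2:
            return "early FIN: connection closed with no reply"
        (rlen,) = struct.unpack(">H", prefix)
        reply = b""
        while len(reply) < rlen:
            chunk = s.recv(rlen - len(reply))
            if not chunk:
                return "early FIN: connection closed mid-reply"
            reply += chunk
        rcode = reply[3] & 0x0F  # low 4 bits of header byte 3
        return "reply received, RCODE=%d" % rcode
```

For example, `tcp_lookup("127.0.0.1", 5000, "rhybar.cz")` returns either a string describing an early FIN or the RCODE of the reply (2 is SERVFAIL), so the two behaviors in the report can be told apart programmatically.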