Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIN without SERVFAIL on TCP lookup to failing hostnames #69

Open
johnhtodd opened this issue Aug 23, 2019 · 1 comment
Open

FIN without SERVFAIL on TCP lookup to failing hostnames #69

johnhtodd opened this issue Aug 23, 2019 · 1 comment
Assignees

Comments

@johnhtodd
Copy link

johnhtodd commented Aug 23, 2019

Version: 1.9.2

Short description: Sporadically, TCP-based lookups for records that result in SERVFAIL will result in a FIN being sent early to the client instead of SRVFAIL and then a FIN. Subsequent replies for the same record then result in the expected SRVFAIL response code and normal TCP flow until cache timeout.

Steps to reproduce (unbound on port 5000):

[root@test log]# kdig @127.0.0.1 -p 5000 +tcp +retry=1 rhybar.cz
;; WARNING: can't receive reply from 127.0.0.1@5000(TCP)
;; WARNING: failed to query server 127.0.0.1@5000(TCP)
[root@test log]#
[root@test log]# kdig @127.0.0.1 -p 5000 +tcp +retry=1 rhybar.cz
;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 22768
;; Flags: qr rd ra; QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 0

;; QUESTION SECTION:
;; rhybar.cz. IN A

;; Received 27 B
;; Time 2019-08-23 20:14:34 UTC
;; From 127.0.0.1@5000(TCP) in 0.1 ms
[root@test log]#

From unbound.log:
Aug 23 20:14:30 res310 unbound: [10242:1] info: validation failure rhybar.cz. A IN
Aug 23 20:14:30 res310 unbound: [10242:1] info: validation failure rhybar.cz. NS IN

Long description:
This has been observed with many other SRVFAIL responses, though rhybar.cz does tend to fail much more predictably and has a validation failure. Another randomly-picked servfail test element that has been used with less consistently reproducable results: ns.hesjptt.net.cn

Packet captures were done on these transactions, and there is a FIN sent by unbound around 200ms after the receipt of the query on TCP sessions.

If this is a configuration issue, advice would be welcome on what we should look at.

@wcawijngaards
Copy link
Member

When I try the rhybar.cz name, it does not fail for me. It gives servfail with the validation failure, but not the TCP failure. For the other name, I get it to respond too slowly, and give this log message (on high verbosity):
debug: tcp took too long, dropped
The newer unbound releases have gradual TCP management, where if the TCP is overloaded, the TCP timeout is reduced. The second name has a nameserver that produces timeouts for queries, and lookups take too long. The TCP connection is dropped. Eventually that would produce a SERVFAIL in cache that then answers quickly until it passes out of the cache, and unbound attempts to look up the data again.
Could it be that the nameserver for the cz name occasionally goes down, and then unbound wants to relookup, but it takes a long time. TCP timeouts are shorter now than they were before (and configurable, I believe), and then it drops the TCP connection.
But 200 ms after the query, that looks like it is the tcp timeout configured with tcp-idle-timeout. That is reduced to 1% and then 0.2% of the configured value, with a minimum of 200ms. Looks like the server is full on TCP and the shorter timeout is used. Meanwhile the lookup is taking long (too long).

@wcawijngaards wcawijngaards self-assigned this Aug 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants