Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation #9774

Open
ShadowJonathan opened this issue Apr 8, 2021 · 5 comments
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Apr 8, 2021

Originally discovered on #synapse:matrix.org by @LTangaF

On Joel's server, doing the following DNS query times out;

root@5d0681f56cda:/# dig _matrix._tcp.matrix.lion.fm SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.matrix.lion.fm SRV
;; global options: +cmd
;; connection timed out; no servers could be reached

While a valid SRV record doesn't time out;

root@5d0681f56cda:/# dig _matrix._tcp.jboi.nl SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.jboi.nl SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 560
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_matrix._tcp.jboi.nl.          IN      SRV

;; ANSWER SECTION:
_matrix._tcp.jboi.nl.   120     IN      SRV     0 0 443 matrix.jboi.nl.

;; Query time: 40 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Apr 08 20:50:06 UTC 2021
;; MSG SIZE  rcvd: 83

This is already odd, but synapse currently doesn't specify a timeout when looking up SRV records.

The offending snippet is this:

answers, _, _ = await make_deferred_yieldable(
self._dns_client.lookupService(service_name)
)

When the underlying DNS query times out, this does never complete, and it causes a federation transmission loop to "time out" the whole request, putting it on catchup.

twisted has the following interface for lookupService:

    def lookupService(name: str, timeout: Sequence[int]) -> "Deferred":
        """
        Perform an SRV record lookup.

        @param name: DNS name to resolve.
        @param timeout: Number of seconds after which to reissue the query.
            When the last timeout expires, the query is considered failed.

        @return: A L{Deferred} which fires with a three-tuple of lists of
            L{twisted.names.dns.RRHeader} instances.  The first element of the
            tuple gives answers.  The second element of the tuple gives
            authorities.  The third element of the tuple gives additional
            information.  The L{Deferred} may instead fail with one of the
            exceptions defined in L{twisted.names.error} or with
            C{NotImplementedError}.
        """

The optional parameter timeout defines that timeout, however, synapse isn't giving it any, so it never times out. Or synapse doesn't give it a strict enough timeout.

I propose adding a 15 second timeout by adding timeout=(15,) to the SrvResolver.resolve_service snippet.

Edit: The default resolver defines the timeouts of (1, 3, 11, 45), however, it adds these up with eachother, so it basically tries to resolve dns for exactly 60 seconds before giving up, and then it has a "timeout race" with the previously-established HTTP agent timeout (also of 60 seconds), which causes this DNS query to never promptly "time out" before it's overlaying "HTTP request timeout" could.

@ShadowJonathan ShadowJonathan changed the title synapse.http.federation.srv_resolver.SrvResolver.resolve_service is able to "timeout" and stall federation synapse.http.federation.srv_resolver.SrvResolver.resolve_service is able to "timeout" properly, and thus stalls federation Apr 8, 2021
@ShadowJonathan ShadowJonathan changed the title synapse.http.federation.srv_resolver.SrvResolver.resolve_service is able to "timeout" properly, and thus stalls federation synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation Apr 8, 2021
@LTangaF
Copy link

LTangaF commented Apr 8, 2021

Thank you, @ShadowJonathan, for all the hard work tracking this one down.

The offending issue was I had a DNS server in my ecosystem that got retired, but the server hadn't become aware of that (because I didn't know I had forgotten to tell it). Removing it from the resolv.conf caused the timeout issue to vanish.

@richvdh
Copy link
Member

richvdh commented Apr 8, 2021

If we're unable to confirm whether a SRV record does or does not exist, that server is unreachable, so it's correct that the federation loop times out the destination. See also #5882 (comment) which considers this from the point of view of a different error case.

What would be good would be if the logs made it a bit clearer why the HTTP request was failing in this case.

@LTangaF
Copy link

LTangaF commented Apr 8, 2021

One other possibility here would to make it clear when it gives up, leave a log message about what it was exactly looking up when it gave up and that by giving up, federation didn't finish. It really wasn't clear in the logs.

Having it still fail tells an admin, "Hey, dummy! Go fix your server config." 😄

@LTangaF
Copy link

LTangaF commented Apr 8, 2021

If we're unable to confirm whether a SRV record does or does not exist, that server is unreachable, so it's correct that the federation loop times out the destination. See also #5882 (comment) which considers this from the point of view of a different error case.

It doesn't necessarily mean that the homeserver is unreachable, just that it can't access the optional SRV record. I think not knowing that it DOESN'T exist either, though, is the real reason to fail the loop after timing out. Unless it isn't optional, in which case the federation tester that I was using to troubleshoot should complain about it, too.

What would be good would be if the logs made it a bit clearer why the HTTP request was failing in this case.

Jinx!

@ShadowJonathan
Copy link
Contributor Author

ShadowJonathan commented Apr 8, 2021

In that case, it would help if the timeout(s) come out to a total of 50 seconds, I'll add a PR that fixes that issue by changing the last timeout value (in that "default" tuple) from 45 to 35 seconds, and makes it properly bubble the DNS timeout exception.

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants