`synapse.http.federation.srv_resolver.SrvResolver.resolve_service` isn't able to "timeout" properly, and thus stalls federation #9774

ShadowJonathan · 2021-04-08T21:11:59Z

Originally discovered on #synapse:matrix.org by @LTangaF

On Joel's server, doing the following DNS query times out;

root@5d0681f56cda:/# dig _matrix._tcp.matrix.lion.fm SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.matrix.lion.fm SRV
;; global options: +cmd
;; connection timed out; no servers could be reached

While a valid SRV record doesn't time out;

root@5d0681f56cda:/# dig _matrix._tcp.jboi.nl SRV

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> _matrix._tcp.jboi.nl SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 560
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_matrix._tcp.jboi.nl.          IN      SRV

;; ANSWER SECTION:
_matrix._tcp.jboi.nl.   120     IN      SRV     0 0 443 matrix.jboi.nl.

;; Query time: 40 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Apr 08 20:50:06 UTC 2021
;; MSG SIZE  rcvd: 83

This is already odd, but synapse currently doesn't specify a timeout when looking up SRV records.

The offending snippet is this:

synapse/synapse/http/federation/srv_resolver.py

Lines 137 to 139 in c619253

    
           answers, _, _ = await make_deferred_yieldable( 
        
               self._dns_client.lookupService(service_name) 
        
           )

When the underlying DNS query times out, this does never complete, and it causes a federation transmission loop to "time out" the whole request, putting it on catchup.

twisted has the following interface for lookupService:

    def lookupService(name: str, timeout: Sequence[int]) -> "Deferred":
        """
        Perform an SRV record lookup.

        @param name: DNS name to resolve.
        @param timeout: Number of seconds after which to reissue the query.
            When the last timeout expires, the query is considered failed.

        @return: A L{Deferred} which fires with a three-tuple of lists of
            L{twisted.names.dns.RRHeader} instances.  The first element of the
            tuple gives answers.  The second element of the tuple gives
            authorities.  The third element of the tuple gives additional
            information.  The L{Deferred} may instead fail with one of the
            exceptions defined in L{twisted.names.error} or with
            C{NotImplementedError}.
        """

The optional parameter timeout defines that timeout, however, synapse isn't giving it any, so it never times out. Or synapse doesn't give it a strict enough timeout.

I propose adding a 15 second timeout by adding timeout=(15,) to the SrvResolver.resolve_service snippet.

Edit: The default resolver defines the timeouts of (1, 3, 11, 45), however, it adds these up with eachother, so it basically tries to resolve dns for exactly 60 seconds before giving up, and then it has a "timeout race" with the previously-established HTTP agent timeout (also of 60 seconds), which causes this DNS query to never promptly "time out" before it's overlaying "HTTP request timeout" could.

The text was updated successfully, but these errors were encountered:

LTangaF · 2021-04-08T21:30:55Z

Thank you, @ShadowJonathan, for all the hard work tracking this one down.

The offending issue was I had a DNS server in my ecosystem that got retired, but the server hadn't become aware of that (because I didn't know I had forgotten to tell it). Removing it from the resolv.conf caused the timeout issue to vanish.

richvdh · 2021-04-08T21:34:24Z

If we're unable to confirm whether a SRV record does or does not exist, that server is unreachable, so it's correct that the federation loop times out the destination. See also #5882 (comment) which considers this from the point of view of a different error case.

What would be good would be if the logs made it a bit clearer why the HTTP request was failing in this case.

LTangaF · 2021-04-08T21:37:11Z

One other possibility here would to make it clear when it gives up, leave a log message about what it was exactly looking up when it gave up and that by giving up, federation didn't finish. It really wasn't clear in the logs.

Having it still fail tells an admin, "Hey, dummy! Go fix your server config." 😄

LTangaF · 2021-04-08T21:54:08Z

If we're unable to confirm whether a SRV record does or does not exist, that server is unreachable, so it's correct that the federation loop times out the destination. See also #5882 (comment) which considers this from the point of view of a different error case.

It doesn't necessarily mean that the homeserver is unreachable, just that it can't access the optional SRV record. I think not knowing that it DOESN'T exist either, though, is the real reason to fail the loop after timing out. Unless it isn't optional, in which case the federation tester that I was using to troubleshoot should complain about it, too.

What would be good would be if the logs made it a bit clearer why the HTTP request was failing in this case.

Jinx!

ShadowJonathan · 2021-04-08T21:55:23Z

In that case, it would help if the timeout(s) come out to a total of 50 seconds, I'll add a PR that fixes that issue by changing the last timeout value (in that "default" tuple) from 45 to 35 seconds, and makes it properly bubble the DNS timeout exception.

ShadowJonathan changed the title ~~synapse.http.federation.srv_resolver.SrvResolver.resolve_service is able to "timeout" and stall federation~~ synapse.http.federation.srv_resolver.SrvResolver.resolve_service is able to "timeout" properly, and thus stalls federation Apr 8, 2021

ShadowJonathan mentioned this issue Apr 9, 2021

Add timeout for resolve_service #9776

Closed

4 tasks

erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 26, 2021

matrixbot mentioned this issue Dec 21, 2023

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation element-hq/synapse#9774

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`synapse.http.federation.srv_resolver.SrvResolver.resolve_service` isn't able to "timeout" properly, and thus stalls federation #9774

`synapse.http.federation.srv_resolver.SrvResolver.resolve_service` isn't able to "timeout" properly, and thus stalls federation #9774

ShadowJonathan commented Apr 8, 2021 •

edited

LTangaF commented Apr 8, 2021 •

edited

richvdh commented Apr 8, 2021

LTangaF commented Apr 8, 2021

LTangaF commented Apr 8, 2021

ShadowJonathan commented Apr 8, 2021 •

edited

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation #9774

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation #9774

Comments

ShadowJonathan commented Apr 8, 2021 • edited

LTangaF commented Apr 8, 2021 • edited

richvdh commented Apr 8, 2021

LTangaF commented Apr 8, 2021

LTangaF commented Apr 8, 2021

ShadowJonathan commented Apr 8, 2021 • edited

`synapse.http.federation.srv_resolver.SrvResolver.resolve_service` isn't able to "timeout" properly, and thus stalls federation #9774

`synapse.http.federation.srv_resolver.SrvResolver.resolve_service` isn't able to "timeout" properly, and thus stalls federation #9774

ShadowJonathan commented Apr 8, 2021 •

edited

LTangaF commented Apr 8, 2021 •

edited

ShadowJonathan commented Apr 8, 2021 •

edited