synapse.http.federation.srv_resolver.SrvResolver.resolve_service
isn't able to "timeout" properly, and thus stalls federation
#9774
Comments
Thank you, @ShadowJonathan, for all the hard work tracking this one down. The underlying problem was that a DNS server in my ecosystem had been retired, but the server hadn't been made aware of that (I didn't realize I had forgotten to tell it). Removing it from the |
If we're unable to confirm whether a SRV record does or does not exist, that server is unreachable, so it's correct that the federation loop times out the destination. See also #5882 (comment), which considers this from the point of view of a different error case. It would help if the logs made it clearer why the HTTP request was failing in this case. |
One other possibility here would be to make it clear when it gives up: leave a log message saying exactly what it was looking up when it gave up, and that because it gave up, federation didn't finish. It really wasn't clear in the logs. Having it still fail tells an admin, "Hey, dummy! Go fix your server config." 😄 |
It doesn't necessarily mean that the homeserver is unreachable, just that it can't access the optional SRV record. I think not knowing that it DOESN'T exist either, though, is the real reason to fail the loop after timing out. Unless it isn't optional, in which case the federation tester that I was using to troubleshoot should complain about it, too.
Jinx! |
In that case, it would help if the timeouts came out to a total of 50 seconds. I'll add a PR that fixes the issue by changing the last timeout value (in that "default" tuple) from 45 to 35 seconds, and that makes it properly bubble up the DNS timeout exception. |
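The arithmetic behind that proposal can be checked with a short sketch. It uses only the numbers quoted in this issue: the default timeouts `(1, 3, 11, 45)`, the proposed final value of 35, and the 60 second HTTP timeout. The names `HTTP_REQUEST_TIMEOUT`, `total_dns_wait` and `gives_up_before_http` are hypothetical and exist only for illustration; they are not part of synapse or twisted.

```python
# Hypothetical constant for illustration: the 60 second HTTP timeout
# mentioned in this issue.
HTTP_REQUEST_TIMEOUT = 60

DEFAULT_TIMEOUTS = (1, 3, 11, 45)   # defaults quoted in the issue description
PROPOSED_TIMEOUTS = (1, 3, 11, 35)  # last value lowered from 45 to 35


def total_dns_wait(timeouts):
    """Worst-case time spent on DNS, assuming each retry timeout is fully used."""
    return sum(timeouts)


def gives_up_before_http(timeouts, http_timeout=HTTP_REQUEST_TIMEOUT):
    """True if DNS resolution gives up strictly before the HTTP timeout fires."""
    return total_dns_wait(timeouts) < http_timeout


print(total_dns_wait(DEFAULT_TIMEOUTS), gives_up_before_http(DEFAULT_TIMEOUTS))
print(total_dns_wait(PROPOSED_TIMEOUTS), gives_up_before_http(PROPOSED_TIMEOUTS))
```

With the defaults the two timers tie at 60 seconds, so the DNS error has no head start. With the proposed values the DNS lookup gives up 10 seconds earlier.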
Originally discovered on #synapse:matrix.org by @LTangaF
On Joel's server, doing the following DNS query times out;
While a valid SRV record doesn't time out;
This is already odd, but synapse currently doesn't specify a timeout when looking up SRV records.
The offending snippet is this:
synapse/synapse/http/federation/srv_resolver.py
Lines 137 to 139 in c619253
When the underlying DNS query times out, this lookup never completes, which causes the federation transmission loop to "time out" the whole request and put the destination on catch-up.
twisted has the following interface for lookupService:
The optional parameter timeout defines that timeout; however, synapse isn't passing one, so the lookup never times out. Or synapse doesn't give it a strict enough timeout.
I propose adding a 15 second timeout by adding timeout=(15,) to the SrvResolver.resolve_service snippet.
Edit: The default resolver defines the timeouts (1, 3, 11, 45); however, these are added together, so it tries to resolve DNS for 60 seconds in total before giving up. It then has a "timeout race" with the previously established HTTP agent timeout (also 60 seconds), so the DNS query never promptly "times out" before the enclosing "HTTP request timeout" does.
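The "timeout race" can be illustrated with a small, self-contained sketch. It is not synapse or twisted code: it uses Python's `asyncio` as a stand-in, the helpers `slow_dns_lookup` and `request` are hypothetical, and the durations are scaled down from seconds to fractions of a second. It only shows that an inner (DNS) timeout must be clearly shorter than the outer (HTTP) timeout for the more specific error to surface.

```python
import asyncio


async def slow_dns_lookup(delay: float) -> str:
    """Hypothetical stand-in for a DNS query that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return "srv-record"


async def request(dns_timeout: float, http_timeout: float, dns_delay: float) -> str:
    """Run the lookup under an inner DNS timeout nested in an outer HTTP timeout."""

    async def resolve() -> str:
        try:
            return await asyncio.wait_for(slow_dns_lookup(dns_delay), dns_timeout)
        except asyncio.TimeoutError:
            # The inner timeout fired first: report the specific cause.
            raise RuntimeError("DNS lookup timed out") from None

    try:
        return await asyncio.wait_for(resolve(), http_timeout)
    except asyncio.TimeoutError:
        # The outer timeout fired first: the DNS cause is lost.
        return "HTTP request timed out"
    except RuntimeError as exc:
        return str(exc)


# Inner timeout shorter than the outer one: the DNS error is reported.
print(asyncio.run(request(dns_timeout=0.05, http_timeout=0.5, dns_delay=1.0)))
# Outer timeout shorter than the inner one: only the generic error is seen.
print(asyncio.run(request(dns_timeout=0.5, http_timeout=0.05, dns_delay=1.0)))
```

When both timeouts are equal, as with the 60 second totals described above, which error wins depends on scheduling, which is why the sketch deliberately avoids that case.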