Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

DNS failures connecting to IP address (!) for replication #9865

Open
richvdh opened this issue Apr 22, 2021 · 7 comments
Open

DNS failures connecting to IP address (!) for replication #9865

richvdh opened this issue Apr 22, 2021 · 7 comments
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Apr 22, 2021

We had a slew of this sort of error:

2021-04-22 10:12:31,676 - synapse.http.client - 453 - INFO - PUT-1883814-- - Error sending request to  POST http://10.103.0.1:8135/_synapse/replication/fed_send_edu/m.direct_to_device/cYKChokZoY: DNSLookupError Couldn't find the hostname '10.103.0.1'

There are two things of concern here:

  • we are doing DNS lookups for IP addresses. This is somewhat known, though highly unsatisfactory. It happens within HostnameEndpoint in twisted, which is used fairly widely, including the HTTP Agent. (Raised as https://twistedmatrix.com/trac/ticket/10181)
  • the fact the DNS lookup failed.
@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2021

Ah, I think what probably happened was that the request was cancelled after 60 seconds. HostnameEndpoint will, misleadingly, report a DNSLookupError if its connect method is cancelled before the DNS lookup completes (this is https://twistedmatrix.com/trac/ticket/9696).

Which leads us to... why did the DNS lookup take 60 seconds? I suspect the answer to this is that we got a bunch of other DNS lookups stacked up, to the point that the _LimitedHostnameResolver ended up taking more than 60seconds to get through its queue. In other words, this is basically the same as #7113.

@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2021

I think we should just rip out _LimitedHostnameResolver and hope it (a) doesn't break anything and (b) fixes this and #7113.

@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2021

I think we should just rip out _LimitedHostnameResolver and hope it (a) doesn't break anything and (b) fixes this and #7113.

I'm not actually sure that it would. We'd still be subject to the concurrency limits of GAIResolver (which is backed by the reactor's default threadpool, which is limited to 10 threads.)

@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2021

So I suspect the real solution is to fix https://twistedmatrix.com/trac/ticket/10181, and/or to install a nameResolver which spots literal IP addresses and fast-paths them.

@clokep
Copy link
Member

clokep commented Apr 22, 2021

So I suspect the real solution is to fix twistedmatrix.com/trac/ticket/10181, and/or to install a nameResolver which spots literal IP addresses and fast-paths them.

We should be sure to double check this doesn't skip the IP address blacklists if we do this...although that's why we have the BlacklsistingReactor, I think. 😄

richvdh added a commit that referenced this issue Jun 17, 2021
As I've written in various places in the past (#7113, #9865) I'm pretty sure this is doing nothing useful at all.
@erikjohnston
Copy link
Member

Do we think #10190 fixed this?

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 26, 2021
@richvdh
Copy link
Member Author

richvdh commented Jul 26, 2021

I don't really think so. The bottleneck is still the GAIResolver, which is limited to 10 concurrent lookups.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

3 participants