Pusher stops pushing when it falls behind, with lots of DNSLookupFailed errors #7113

Open
richvdh opened this issue Mar 20, 2020 · 16 comments

richvdh commented Mar 20, 2020

Our pusher process got behind. A couple of hours after it started to catch up, almost all pushes started failing:

[screenshot]

Inspection of logs showed that they were failing with DNSLookupFailed errors:

2020-03-19 11:11:27,083 - synapse.http.client - 281 - INFO - httppush.process-57923 - Sending request POST http://10.103.0.7/_matrix/push/v1/notify
2020-03-19 11:12:28,647 - synapse.http.client - 328 - INFO - httppush.process-57923 - Error sending request to  POST http://10.103.0.7/_matrix/push/v1/notify: DNSLookupError Couldn't find the hostname '10.103.0.7'
2020-03-19 11:12:28,647 - synapse.push.httppusher - 391 - WARNING - httppush.process-57923 - Failed to push event <redacted> to  <redacted>: <class 'twisted.internet.error.DNSLookupError'> DNS lookup failed: Couldn't find the hostname '10.103.0.7'.
2020-03-19 11:12:32,225 - synapse.push.httppusher - 276 - INFO - httppush.process-57923 - Push failed: delaying for 1s

richvdh commented Mar 20, 2020

(it's worth pointing out that the matrix-org-hotfixes branch has this substitution to make the pusher talk to the push gateway over plain HTTP:

self.url = self.url.replace(
"https://matrix.org/_matrix/push/v1/notify",
"http://10.103.0.7/_matrix/push/v1/notify",
)
)

richvdh commented Mar 20, 2020

So here's what's going on:

  • each HTTP hit has a timeout of 60 seconds
  • we limit the number of concurrent DNS lookups to 100
  • we end up doing so many DNS lookups that the queue grows so long that a given lookup does not complete within 60 seconds
  • so the request gets cancelled
  • but the DNS lookup does not
  • pretty soon we end up spending all our time doing DNS lookups whose results will get thrown away

And yes, we're doing DNS lookups despite the fact that it's an IP address.

It appears that we're using Twisted's GAIResolver, which works via getaddrinfo(3) calls from a threadpool. (It remains a mystery to me when we use the GAIResolver and when we use twisted.names).
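
A minimal standalone sketch of that failure mode (plain Twisted, not Synapse code): the Deferred wrapping a threadpool-backed lookup can be cancelled, but the blocking getaddrinfo(3) call underneath carries on regardless, and its eventual result is simply thrown away.

from twisted.internet import defer, reactor
from twisted.internet.threads import deferToThread
import socket

def blocking_lookup(host):
    # Runs on a threadpool thread; once getaddrinfo has started it cannot be
    # interrupted from the reactor thread.
    return socket.getaddrinfo(host, 80)

def main():
    d = deferToThread(blocking_lookup, "matrix.org")
    d.addErrback(lambda f: f.trap(defer.CancelledError))
    # The HTTP request gives up after its timeout and cancels its Deferred...
    reactor.callLater(0.1, d.cancel)
    # ...but cancel() only errbacks the Deferred: the worker thread carries on
    # with the lookup, and whatever it returns is discarded.
    reactor.callLater(2, reactor.stop)

reactor.callWhenRunning(main)
reactor.run()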

richvdh commented Mar 20, 2020

Some thoughts on what we could do about this:

  • obviously, IP addresses don't need a DNS lookup. We could optimise this (it seems like it belongs in HostnameEndpoint; see the sketch after this list), but it doesn't help the situation for anyone that's not matrix.org, and anyway I'd rather we weren't hardcoding the IP address of our HTTP gateway in the source code.
  • It would be nice if cancelling a request meant that the DNS lookup got cancelled. This would mean extending the IHostnameResolver interface in twisted.
  • It would be nice if the DNS resolution didn't delegate via a threadpool. One way to do this might be to use twisted.names.client (cf Replaced GAIResolver with twisted.names.client for the default resolver #5053). Another might be to wrap getaddrinfo_a on platforms that support it.
  • Possibly we should be limiting the number of pushes we attempt to do in parallel. One way to do this would be to keep track of the number of HttpPusher instances which are in their _unsafe_process loop, and once it reaches a certain level (100 or so?), stick them in a queue instead.
  • Can we just define a new push API that lets you request more than one push per http hit?
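
As a rough sketch of the first idea above (is_ip_literal and resolve_endpoint are made-up helpers, not existing Synapse or Twisted APIs): spot IP literals up front so that a request to http://10.103.0.7/_matrix/push/v1/notify never joins the getaddrinfo queue at all.

import ipaddress
import socket

def is_ip_literal(host: str) -> bool:
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def resolve_endpoint(host: str, port: int):
    if is_ip_literal(host):
        # Literal address: no DNS lookup, no queueing behind the threadpool.
        return [(host, port)]
    # Real hostname: fall back to an actual lookup (blocking here for brevity).
    return [(info[4][0], port) for info in socket.getaddrinfo(host, port)]

print(resolve_endpoint("10.103.0.7", 80))  # [('10.103.0.7', 80)] with no DNS traffic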

clokep added the A-Performance and A-Push labels on Apr 2, 2020

kegsay commented Apr 16, 2020

> Our pusher process got behind.

Isn't this embarrassingly parallelisable though? Each request does not depend on any other, so can't we just spin more processes up depending on load? Even if we did do the proposed fixes, how did it get so far behind in the first place? Slow push is also a huge problem.

richvdh commented Apr 16, 2020

> Isn't this embarrassingly parallelisable though? Each request does not depend on any other, so can't we just spin more processes up depending on load?

This is a bit simplistic. If you have multiple processes doing the pushing, you have to agree a way to make sure that each push gets processed by exactly one pusher. Obviously there are ways to solve that problem, but just spinning up more processes isn't likely to work currently.
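
Purely to illustrate what "agreeing a way" could look like (the worker names and helper below are hypothetical, not Synapse's actual worker machinery): shard deterministically, e.g. by hashing the user_id onto a fixed list of pusher workers, so that every process independently reaches the same answer about which one owns a given push.

import hashlib

# Hypothetical worker names, for illustration only.
PUSHER_WORKERS = ["pusher1", "pusher2", "pusher3"]

def worker_for_user(user_id: str) -> str:
    # Every process computes the same assignment, so exactly one worker ends up
    # handling the pushes for any given user.
    digest = hashlib.sha256(user_id.encode("utf8")).digest()
    return PUSHER_WORKERS[int.from_bytes(digest[:4], "big") % len(PUSHER_WORKERS)]

print(worker_for_user("@alice:example.com"))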

> Even if we did do the proposed fixes, how did it get so far behind in the first place?

Since you ask, #7075 originally (we got two weeks behind). But actually this can happen whenever we have a brief period of a high volume of pushing to do, which the pusher can't keep up with. Once we start failing, the pusher drowns itself in retries and failures.

> Slow push is also a huge problem.

It is, but I'd rather have pushes happening 10 seconds late than a flurry of pushes making the entire process fall over so that pushes don't happen at all until someone manually restarts the thing.

richvdh commented Apr 24, 2020

> Possibly we should be limiting the number of pushes we attempt to do in parallel. One way to do this would be to keep track of the number of HttpPusher instances which are in their _unsafe_process loop, and once it reaches a certain level (100 or so?), stick them in a queue instead.

This is definitely a thing. One of the things that can lead us into a death spiral is that, if the push gateway starts responding slowly for any reason, we can stack up thousands of concurrent connections, each of which takes a long time to service. Indeed, they can take so long to service that the push gateway starts timing out the requests, which leads to us retrying them, which leads to more load, and so on.
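
A hedged sketch of the queueing idea (gated_process and the cap are hypothetical, and this assumes _unsafe_process returns a coroutine or Deferred): gate entry to the processing loop behind a shared DeferredSemaphore, so that once 100 pushers are busy the rest wait their turn instead of opening yet more connections and DNS lookups.

from twisted.internet import defer

# Hypothetical cap, matching the "100 or so" suggested earlier in the thread.
MAX_CONCURRENT_PUSHERS = 100
_pusher_limiter = defer.DeferredSemaphore(MAX_CONCURRENT_PUSHERS)

def gated_process(pusher):
    # run() acquires a token (queueing this pusher if the cap is reached), calls
    # the function, and releases the token when the returned Deferred fires, so
    # a slow push gateway produces a backlog here rather than thousands of
    # concurrent, timing-out requests.
    return _pusher_limiter.run(
        lambda: defer.ensureDeferred(pusher._unsafe_process())
    )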

o0laxkilla0o commented Jul 18, 2020

I believe this is affecting my homeserver as well. Is there anything I can do to rectify the situation?
I don't see the DNS failures, but I do see several entries like this. Once the user opens the app, it appears to process.

2020-07-18 04:13:32,907 - synapse.push.httppusher - 193 - INFO - httppush.process-3162- Processing 1 unprocessed push actions for @blahblah:nerdnightlax.com/im.vector.app.ios.prod/RWPKtk8kNYt0W9EM6NTaoRjMxUG9qdLGD9doeG57wSc= starting at stream_ordering 789193

I believe my issue is related to this one:
element-hq/element-android#1723

richvdh commented Aug 28, 2020

> (It remains a mystery to me when we use the GAIResolver and when we use twisted.names).

I'm going to write this down here because this seems to be the most likely place future me will find it.

  • the reactor.nameResolver (and reactor.resolver) interfaces use GAIResolver (wrapped in our case by _LimitedHostnameResolver and in some cases IPBlacklistingResolver). This is installed in ReactorBase.__init__.
  • however, we use twisted.names from srv_resolver (since the IHostnameResolver interface doesn't support SRV record lookups). (Incidentally, the equivalent of getaddrinfo seems to be res_nquery.)
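
For reference, the twisted.names path looks roughly like this as a standalone script (an illustrative SRV lookup, not Synapse's srv_resolver, though that wraps the same client):

from twisted.internet import reactor
from twisted.names import client

def on_result(result):
    answers, _authority, _additional = result
    for rr in answers:
        # Each payload is a Record_SRV carrying priority/weight/port/target.
        print(rr.payload)
    reactor.stop()

def on_error(failure):
    print(failure)
    reactor.stop()

d = client.lookupService("_matrix._tcp.matrix.org")
d.addCallbacks(on_result, on_error)
reactor.run()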

richvdh commented Aug 28, 2020

As I've just written in 8027166, I don't really understand why we have a _LimitedHostnameResolver. (for links: that was added in #5037)

DMRobertson added the T-Defect label on Oct 14, 2021

reivilibre changed the title from "Pusher stops pushing, with lots of DNSLookupFailed errors" to "Pusher stops pushing when it falls behind, with lots of DNSLookupFailed errors" on Mar 9, 2022
StefanBalea commented

In my case, the same error occurred when the IP of the push server was included in the ip_range_blacklist config. Removing the IP from that list fixed the problem.

richvdh commented Jun 29, 2022

> In my case, the same error occurred when the IP of the push server was included in the ip_range_blacklist config.

Same error message, completely unrelated problem. Again: if your push gateway never worked at all then it is not due to this bug.
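
For anyone landing here with that symptom: a quick standalone check of whether your push gateway's address falls inside one of your configured ip_range_blacklist ranges (the ranges below are examples; compare against what is actually in your homeserver.yaml):

import ipaddress

# Example only: the RFC1918 ranges commonly found in ip_range_blacklist.
blacklist = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
gateway_ip = ipaddress.ip_address("10.103.0.7")  # your push gateway's address

if any(gateway_ip in ipaddress.ip_network(cidr) for cidr in blacklist):
    print("push gateway IP falls inside ip_range_blacklist")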
