Grafana 4.0.0 causes Prometheus denial of service #2238
Comments
That's unfortunately a Grafana problem. Can you file a bug report in their repository? @stuartnelson3 isn't that exactly the bug we fixed for them in a past Grafana version?
The reporter stated clearly that although there is an obvious problem with Grafana, it shouldn't be possible to DoS Prometheus. I was hit by this bug as well: all other data sources remained usable by Grafana, but Prometheus didn't, and it wasn't able to collect data either. THAT's the problem that needs to be addressed in Prometheus.
Apologies, you are right – I missed that part in my morning dizziness.
@fabxc The previous issue that we addressed was canceling in-flight requests if "refresh" is clicked multiple times.
The Grafana issue points to grafana/grafana@56b7e2d#diff-2ce5b8fd7c5c41aeaf1e8b12a3163b80. The keep-alive time used there seems reasonable. How many connections is your Grafana opening? Is it growing unbounded? It seems a general rate limiter would have to be configured quite aggressively (to the point where it impacts other clients) to mitigate this issue.
I don't have much Go experience (but I have some time, so I'm looking to contribute), but could this be because http.Server doesn't specify any timeouts? According to this article, that can cause the "too many open files" issue.
Yes, that's a good reference blog post on timeouts in general. So while we should probably add an upper bound, I'm not sure it solves this particular problem.
OK, second try: how about Transport.IdleConnTimeout in Go 1.7? But it seems to have a default value of about 90 seconds, so it's a weak one. I'll try to reproduce first.
Back to the first idea. It seems the HTTP read timeout does affect a connection's keep-alive time. This SO answer suggests it does: http://stackoverflow.com/a/29334926 I simply tried it by keeping a telnet session open and sending a request every now and then, and I can confirm it closes idle connections. Transport is for HTTP clients, it seems, so sorry if that caused any confusion; I'm learning on the way. Anyway, I guess it'd be better if there were a config option for it with a sensible default (30 seconds)?
This also specifies a timeout for idle client connections, which may cause "too many open files" errors. See prometheus#2238
Thanks, everyone, for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (i.e. build a wall, and someone brings a taller ladder). Some ideas I had while thinking about this:
Ideas 1 & 2 together would be a pretty robust solution.
Does #2169 factor in by any chance as a tool for enforcing a limit on number of client connections? |
Just a heads up that grafana 4.0.1 is out now. |
I'm not a fan of any magical introspection (be that file descriptors, memory, or anything else). It makes the system hard to reason about and is a pretty good indication that one went down the wrong path way earlier. So agreed on 1) and 2).

#2240 sets a read deadline. I think that alone is not sufficient to solve our problems. It kills off slow clients, but from my understanding, it is over once we have received the first proper request on a newly accepted connection. Thus, it doesn't apply to connections subsequently left open.

The Go HTTP server as it is uses a listener with a keep-alive period of 3 minutes. Depending on the platform, it will take that plus some more for idle connections to be killed for good. I think that's sane and probably the reason why net/http doesn't expose configuration beyond on/off.

I'd suggest using a LimitListener, which can be configured via flag. If you run Prometheus and specify the FD limit, you can easily and explicitly configure how many of those can be allocated for connections.
#2240 is not sufficient for complete DoS protection, for sure. But it solves the simpler problem of closing idle connections; by idle connections I mean connections with no HTTP conversation for some time. It does this by resetting the deadline on each new request on the same connection, so it affects all requests, first and subsequent. BTW, that's what I confirmed with the primitive telnet setup.

The 3-minute keep-alive period on the Go HTTP server's listener that you mentioned won't close such connections: that is a TCP keep-alive period, which simply checks whether the TCP connection is still usable. It won't close idle connections; it will close connections of lost clients, for example.

I'm not sure where Grafana limited itself. Are you referring to 4.0.1?

On the FD issue, I guess everyone knows the Linux default nofile limit is pretty low for a file-juggling daemon process like Prometheus. My suggestion would be to increase it to a more sensible value first, and then maybe suggest that large-scale users increase it even more. Provided that Prometheus is not leaking any.
Gotcha, that all makes sense. Thanks for the clarification.
@fabxc I have updated the PR to add LimitListener. I also used the govendor tool to add it under vendor/; I hope I got that right. A small note on LimitListener: it doesn't reject connections after reaching the maximum, it just stops accepting new ones. So it's similar to the backlog in POSIX listen(). I'm not sure about the default backlog size in Go or whether it works similarly to standard listen(), but in any case that should no longer be a Prometheus problem, since non-accepted connections have no fd.
👍 Thank you! |
Yes. Thanks a lot! |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
What did you do?
Upgraded to Grafana 4.0.0
What did you expect to see?
Snazzy charts as always.
What did you see instead? Under which circumstances?
Prometheus ran out of file handles; it didn't crash, but it stopped ingesting new data.
Grafana 4.0.0 introduces a bug in which each chart data fetch is issued on a new HTTP connection, and these connections appear to be kept alive by Grafana even though they are never used again. The issue is described here: grafana/grafana#6759
Clearly Grafana needs to fix this on their end, but it exposes a problem in that Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.
Environment
Alertmanager version: N/A
Prometheus configuration file: N/A
Alertmanager configuration file: N/A
Logs:
Thousands of lines like this: