Grafana 4.0.0 causes Prometheus denial of service #2238
Comments
That's unfortunately a Grafana problem. Can you file a bug report in their repository? @stuartnelson3 Isn't that exactly the bug we fixed for them in a past Grafana version?
fabxc closed this on Dec 1, 2016
hasso commented on Dec 1, 2016
The reporter stated clearly that although there is an obvious problem with Grafana, it shouldn't be possible to DoS Prometheus like this. I was hit by this bug as well: all other datasources remained usable for Grafana, but Prometheus didn't, and it wasn't able to collect data either. THAT's the problem that needs to be addressed in Prometheus.
Apologies, you are right – I missed that part in my morning dizziness.
fabxc reopened this on Dec 1, 2016
@fabxc The previous issue that we addressed was canceling in-flight requests if "refresh" is clicked multiple times.
The Grafana issue points to grafana/grafana@56b7e2d#diff-2ce5b8fd7c5c41aeaf1e8b12a3163b80. The keep-alive time used there seems reasonable. How many connections is your Grafana opening? Is it growing unbounded? It seems a general rate limiter would have to be configured quite aggressively (to the point where it impacts other clients) to mitigate this issue.
I don't have much Go experience (but I have some time, so I'm looking to contribute), but can this be because http.Server doesn't specify any timeouts? According to this article, that can lead to exactly this kind of "too many open files" issue.
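For reference, a minimal sketch (not Prometheus code) of what giving http.Server explicit timeouts looks like; the address and durations are illustrative, not recommendations:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":9090",
		// Bounds how long reading a request may take; before Go 1.8 this
		// also effectively bounds how long a keep-alive connection may idle.
		ReadTimeout: 30 * time.Second,
		// Bounds how long writing a response may take.
		WriteTimeout: 30 * time.Second,
		Handler:      http.DefaultServeMux,
	}
	log.Fatal(srv.ListenAndServe())
}
```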
Yes, that's a good reference blog post on timeouts in general. So while we should probably add an upper bound, I'm not sure it solves this particular problem.
OK, second try: how about Transport.IdleConnTimeout in Go 1.7? But it seems it has a default value of about 90 seconds, so it's a weak one. I'll try to reproduce first.
Back to the first idea. It seems the HTTP read timeout does affect a connection's keep-alive time; this SO answer suggests it does: http://stackoverflow.com/a/29334926. I simply tried it by keeping a telnet session open and sending a request every now and then, and I can confirm it closes the idle connections. Transport is for HTTP clients it seems, so sorry if that caused any confusion; I'm learning on the way. Anyway, I guess it'd be better if there was a config for it with some sensible default (30 seconds)?
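A rough sketch of that suggestion, assuming a hypothetical flag name and default (not necessarily what the eventual PR used): expose the read timeout as a flag with a 30-second default and wire it into the server.

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

// Flag name and default are illustrative assumptions.
var readTimeout = flag.Duration("web.read-timeout", 30*time.Second,
	"Maximum duration before timing out reads of the request; also bounds keep-alive idle time.")

func main() {
	flag.Parse()
	srv := &http.Server{
		Addr:        ":9090",
		ReadTimeout: *readTimeout,
	}
	log.Fatal(srv.ListenAndServe())
}
```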
agaoglu added a commit to agaoglu/prometheus that referenced this issue on Dec 1, 2016
Thanks everyone for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (i.e. build a wall, someone brings a taller ladder). Some ideas I had thinking about this:
Ideas 1 & 2 together would be a pretty robust solution.
Does #2169 factor in by any chance, as a tool for enforcing a limit on the number of client connections?
davidkarlsen commented on Dec 3, 2016
Just a heads-up that Grafana 4.0.1 is out now.
I'm not a fan of any magical introspection (be that file descriptors, memory, or anything else). It makes the system hard to reason about and is a pretty good indication that we went down a wrong path way earlier. So agreed on 1) and 2).

#2240 sets a read deadline. I think that alone is not sufficient to solve our problems. It kills off slow clients, but from my understanding it is over once we have received the first proper request from a newly accepted connection. Thus, it doesn't apply to connections that are subsequently left open.

The Go HTTP server as it is uses a listener with a keep-alive period of 3 minutes. Depending on the platform, it will take that plus some more for idle connections to be killed for good. I think that's sane and probably the reason why net/http doesn't expose configuration beyond on/off.

I'd suggest using a LimitListener, which can be configured via a flag. If you run Prometheus and specify the FD limit, you can easily and explicitly configure how many of those can be allocated for connections.
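A sketch of the LimitListener idea using golang.org/x/net/netutil, assuming a hypothetical flag name and limit: cap the number of simultaneously accepted connections so a misbehaving client cannot exhaust the process's file descriptors.

```go
package main

import (
	"flag"
	"log"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

// Flag name and default are illustrative assumptions.
var maxConnections = flag.Int("web.max-connections", 512,
	"Maximum number of simultaneously accepted connections.")

func main() {
	flag.Parse()
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}
	// Once the limit is reached, Accept blocks; excess connections wait in
	// the kernel backlog and consume no file descriptor in this process.
	ln = netutil.LimitListener(ln, *maxConnections)
	log.Fatal(http.Serve(ln, nil))
}
```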
#2240 is not sufficient for complete DoS protection, for sure. But it solves the simpler problem of closing idle connections, by which I mean connections having no HTTP conversation for some time. It does this by resetting the deadline on each new request on the same connection, so it affects all requests, first and subsequent. BTW, that's what I confirmed with the primitive telnet setup.

The 3-minute keep-alive period of the Go HTTP server listener you linked won't close such connections, as that is a TCP keep-alive period, which simply checks whether the TCP connection is still usable. It won't close idle connections; it will close connections of lost clients, for example.

I'm not sure where Grafana limited itself. Are you referring to 4.0.1?

On the FD issue, I guess everyone knows the Linux default nofile limit is pretty low for a file-juggling daemon process like Prometheus. My suggestion would be to increase it to a more sensible value first, and then maybe suggest that large-scale users increase it even more, provided that Prometheus is not leaking any.
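As an aside (not part of the original thread): Go 1.8, released shortly after this discussion, added http.Server.IdleTimeout, which closes keep-alive connections that sit idle between requests, the exact behavior being approximated with ReadTimeout here. A minimal sketch; durations are illustrative.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:        ":9090",
		ReadTimeout: 30 * time.Second,
		// Requires Go 1.8+: close a keep-alive connection if no new request
		// arrives within this window.
		IdleTimeout: 3 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}
```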
Gotcha, that all makes sense. Thanks for the clarification.
@fabxc I have updated the PR to add LimitListener. I also used the govendor tool to add it under vendor/; I hope I got that right. A small note on LimitListener: it doesn't reject connections after reaching the maximum, it just stops accepting new ones. So it's similar to the backlog in POSIX listen(). I'm not sure about the default backlog size in Go or whether it works similarly to the standard listen(), but in any case that should no longer be a Prometheus problem, since non-accepted connections hold no fd.
fabxc closed this in #2240 on Dec 6, 2016
Yes. Thanks a lot!
threemachines referenced this issue on Dec 16, 2016: "Too many files" - TCP connection exhaustion #25 (closed)
lock bot commented on Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
sodabrew commented on Dec 1, 2016 (edited)
What did you do?
Upgraded to Grafana 4.0.0
What did you expect to see?
Snazzy charts as always.
What did you see instead? Under which circumstances?
Prometheus ran out of file handles; it didn't crash, but it stopped ingesting new data.
Grafana 4.0.0 introduces a bug in which each chart data query is issued over a new HTTP connection, and these connections appear to be kept alive in Grafana even though they are never used again. The issue is described here: grafana/grafana#6759
Clearly Grafana needs to fix this on their end, but it exposes a problem in that Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.
Environment
Alertmanager version: N/A
Prometheus configuration file: N/A
Alertmanager configuration file: N/A
Logs:
Thousands of lines like this: