Grafana 4.0.0 causes Prometheus denial of service #2238

Closed
sodabrew opened this issue Dec 1, 2016 · 19 comments

sodabrew commented Dec 1, 2016

What did you do?

Upgraded to Grafana 4.0.0

What did you expect to see?

Snazzy charts as always.

What did you see instead? Under which circumstances?

Prometheus ran out of file handles; it didn't crash, but it stopped ingesting new data.

Grafana 4.0.0 introduces a bug in which each chart data fetch is issued over a new HTTP connection, and these connections appear to be kept alive by Grafana even though they are never used again. The issue is described here: grafana/grafana#6759

Clearly Grafana needs to fix this on their end, but it exposes a problem in that Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.

Environment

  • System information:
Linux 3.13.0-76-generic x86_64
  • Prometheus version:
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
  build user:       root@e685d23d8809
  build date:       20161128-09:59:22
  go version:       go1.7.3
  • Alertmanager version: N/A

  • Prometheus configuration file: N/A

  • Alertmanager configuration file: N/A

  • Logs:

Thousands of lines like this:

ERRO[2900] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] Error refreshing service xxx_exporter: Get http://localhost:8500/v1/catalog/service/xxx_exporter?index=47790194&wait=30000ms: dial tcp 127.0.0.1:8500: socket: too many open files  source=consul.go:252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/949dcb022d6cfb.db: too many open files  source=storage.go:1495
WARN[2900] Series quarantined.                           fingerprint=7795383ca895e3d7 metric=node_netstat_TcpExt_TCPHPAcks{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/77/95383ca895e3d7.db: too many open files source=storage.go:1646
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files  source=storage.go:1252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/94b35810940463.db: too many open files  source=storage.go:1495
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files  source=storage.go:1252
WARN[2900] Series quarantined.                           fingerprint=baf8faede9e50c85 metric=node_vmstat_nr_tlb_local_flush_all{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/ba/f8faede9e50c85.db: too many open files source=storage.go:1646

fabxc commented Dec 1, 2016

That's unfortunately a Grafana problem. Can you file a bug report in their repository?

@stuartnelson3 isn't that exactly the bug we fixed for them in a past Grafana version?

fabxc closed this as completed Dec 1, 2016

hasso commented Dec 1, 2016

The reporter stated clearly that, although there is an obvious problem with Grafana, it shouldn't be possible to DoS Prometheus. I was hit by this bug as well: all other datasources remained usable in Grafana, but Prometheus didn't, and it wasn't able to collect data either. THAT's the problem that needs to be addressed in Prometheus.


fabxc commented Dec 1, 2016

Apologies, you are right – I missed that part in my morning dizziness.

fabxc reopened this Dec 1, 2016

stuartnelson3 commented Dec 1, 2016

@fabxc The previous issue that we addressed was canceling in-flight requests if "refresh" is clicked multiple times


fabxc commented Dec 1, 2016

The Grafana issue points to: grafana/grafana@56b7e2d#diff-2ce5b8fd7c5c41aeaf1e8b12a3163b80

The keep-alive time used there seems reasonable. How many connections is your Grafana opening? Is it growing unbounded?
In general the Prometheus storage deals with a lot of files, and increasing the FD limit is necessary at some point. I'm not quite sure what we can reasonably do from the Prometheus side. If Grafana is not used as a proxy, the clients can come from anywhere, and going by source IP doesn't really help.

A general rate limiter would, it seems, have to be configured quite aggressively (to the point where it impacts other clients) to mitigate the issue.


agaoglu commented Dec 1, 2016

I don't have much Go experience (but I have some time, so I'm looking to contribute), but could this be because the http.Server doesn't specify any timeouts?

According to this article, missing timeouts can lead to exactly this "too many open files" issue:
https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
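
A minimal sketch (with illustrative values, not Prometheus's actual code) of what setting server-side timeouts on Go's net/http server looks like:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":9090",          // Prometheus's default web port, used here for illustration
		Handler:      nil,              // nil means http.DefaultServeMux
		ReadTimeout:  30 * time.Second, // deadline for reading a request; reset for each request on a kept-alive connection
		WriteTimeout: 30 * time.Second, // deadline for writing a response
	}
	// Blocks serving requests; returns a non-nil error if the listener fails.
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```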


fabxc commented Dec 1, 2016

Yes, that's a good reference blog post on timeouts in general.
But it refers to slow connections or disappearing clients, which doesn't seem to be the general issue with Grafana. The HTTP timeouts also don't affect the underlying connection's keep-alive time.

So while we should probably add an upper bound, I'm not sure it solves this particular problem.


agaoglu commented Dec 1, 2016

OK, second try: how about Transport.IdleConnTimeout in Go 1.7? But it seems to have a default value of about 90 seconds, so it's a weak one. I'll try to reproduce first.
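
For context, a minimal sketch showing that IdleConnTimeout (Go 1.7+) lives on the client-side http.Transport, i.e. it is a knob for a client such as Grafana rather than for the Prometheus server; the values and query URL are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:     30 * time.Second, // close idle keep-alive connections after 30s (http.DefaultTransport uses 90s)
			MaxIdleConnsPerHost: 2,                // illustrative cap on pooled connections per backend
		},
	}
	resp, err := client.Get("http://localhost:9090/api/v1/query?query=up")
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}
```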


agaoglu commented Dec 1, 2016

Back to the first idea. It seems the HTTP read timeout does affect the connection's keep-alive time.

This SO answer suggests it does: http://stackoverflow.com/a/29334926
https://golang.org/src/net/http/server.go#L749 confirms that the server sets a new deadline on each request.

I tried it by keeping a telnet session open and sending a request every now and then, and I can confirm it closes idle connections. Transport is for HTTP clients, it seems, so sorry if that caused any confusion; I'm learning on the way.

Anyway, I guess it'd be better if there were a config option for this with some sensible default (30 seconds)?
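
A minimal sketch of what such a config option could look like, using a hypothetical flag name and the suggested 30-second default:

```go
package main

import (
	"flag"
	"net/http"
	"time"
)

// Hypothetical flag name, for illustration only.
var readTimeout = flag.Duration("web.read-timeout", 30*time.Second,
	"Maximum duration before timing out reads of the request; also bounds how long an idle keep-alive connection survives.")

func main() {
	flag.Parse()
	srv := &http.Server{
		Addr:        ":9090",
		ReadTimeout: *readTimeout,
	}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```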

agaoglu added a commit to agaoglu/prometheus that referenced this issue Dec 1, 2016
This also specifies a timeout for idle client connections, which may
cause "too many open files" errors.
See prometheus#2238

sodabrew commented Dec 1, 2016

Thanks everyone for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (build a wall, someone brings a taller ladder). Some ideas I had while thinking about this:

  1. Idea: provide a config option for the maximum number of client connections allowed. If this is exhausted, clients are refused gracefully (HTTP 503, or even 429 "Too Many Requests").
  2. Idea: provide a config option to kill a keep-alive client connection if it does not issue any queries for some period.
  3. Idea: for a given FD limit (Prometheus could ask the environment with syscall.Getrlimit, for example; see the sketch after this list), figure out how many file handles are needed for the known set of data points, and allow only as many active connections as remain. I don't necessarily like this idea because it's magic and possibly platform-dependent.

Ideas 1 & 2 together would be a pretty robust solution.
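
The sketch referenced in idea 3: asking the environment for the FD limit with syscall.Getrlimit (Unix-only, which is part of the platform-dependence concern); how to split that budget between storage and connections would be the "magic" part:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Query the process's file-descriptor limit (RLIMIT_NOFILE).
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	// Any scheme for reserving part of this budget for chunk files and
	// handing the remainder to client connections would be heuristic.
	fmt.Printf("fd limits: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```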


sodabrew commented Dec 1, 2016

Does #2169 factor in by any chance as a tool for enforcing a limit on number of client connections?

@davidkarlsen

Just a heads up that grafana 4.0.1 is out now.


fabxc commented Dec 5, 2016

I'm not a fan of any magical introspection (be that file descriptors, memory, or anything else). It makes the system hard to reason about and is a pretty good indication that we went down a wrong path way earlier.

So agreed on 1) and 2).

#2240 sets a read deadline. I think that alone is not sufficient to solve our problems. It kills off slow clients, but from my understanding it is over once we have received the first proper request on a newly accepted connection; thus, it doesn't apply to connections subsequently left open.
As it's advisable to set in general, I'd set it to a generous 15 seconds – I don't think this needs to come with a flag until someone explicitly states a need for it.

The Go HTTP server, as it is, uses a listener with a keep-alive period of 3 minutes. Depending on the platform, it will take that plus a bit more for idle connections to be killed for good. I think that's sane and probably the reason why net/http doesn't expose configuration beyond on/off.
In the case of Grafana, the client even limited itself to 30 seconds – so tweaking the duration here doesn't seem helpful regardless.

I'd suggest using a LimitListener, which can be configured via a flag. If you run Prometheus and specify the FD limit, you can easily and explicitly configure how many of those descriptors may be allocated for connections.
That should avoid impact on storage and other components. Finding and removing the malicious client is not really avoidable at this point. If it's a recurring problem, one could think about disabling keep-alives in the server at runtime as the connections approach the limit. But that goes towards auto-magic once again.
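
A minimal sketch of the LimitListener approach (golang.org/x/net/netutil provides it), with a hypothetical flag name and default:

```go
package main

import (
	"flag"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

// Hypothetical flag name and default, for illustration only.
var maxConns = flag.Int("web.max-connections", 512,
	"Maximum number of simultaneously accepted web connections.")

func main() {
	flag.Parse()
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		panic(err)
	}
	// Once maxConns connections are in flight, LimitListener simply stops
	// accepting new ones; pending connections wait in the kernel backlog and
	// consume no file descriptors in the process, leaving the FD budget to
	// the storage layer.
	if err := http.Serve(netutil.LimitListener(ln, *maxConns), nil); err != nil {
		panic(err)
	}
}
```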


agaoglu commented Dec 5, 2016

#2240 is certainly not sufficient for complete DoS protection, but it solves the simpler problem of closing idle connections. By idle connections I mean connections with no HTTP conversation for some time. It does this by resetting the deadline on each new request on the same connection, so it affects all requests, first and subsequent. BTW, that's what I confirmed with the primitive telnet setup.

The Go HTTP server listener's 3-minute keep-alive period you linked won't close such connections: that is a TCP keep-alive period, which simply checks whether the TCP connection is still usable. It won't close idle connections; it will close connections of lost clients, for example.

I'm not sure where Grafana limited itself. Are you referring to 4.0.1?

On the FD issue, I guess everyone knows the Linux default nofile limit is pretty low for file-juggling daemon processes like Prometheus. My suggestion would be to increase it to a more sensible value first, and then maybe suggest that large-scale users increase it even more, provided that Prometheus is not leaking any.


fabxc commented Dec 5, 2016

Gotcha, that all makes sense. Thanks for the clarification.
Would you mind adding the LimitListener to your PR and making it configurable via flag too? That seems to be an important protection mechanism in any case.


agaoglu commented Dec 6, 2016

@fabxc I have updated the PR to add LimitListener. I also used the govendor tool to add it under vendor/; I hope I got that right.

A small note on LimitListener: it doesn't reject connections after reaching the maximum, it just stops accepting new ones. So it's similar to the backlog in POSIX listen(). I'm not sure about the default backlog size in Go or whether it works similarly to standard listen, but in any case that should no longer be a Prometheus problem, since non-accepted connections have no FD.


sodabrew commented Dec 6, 2016

👍 Thank you!


fabxc commented Dec 6, 2016

Yes. Thanks a lot!


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019