Grafana 4.0.0 causes Prometheus denial of service #2238

Closed
sodabrew opened this issue Dec 1, 2016 · 19 comments

sodabrew commented Dec 1, 2016

What did you do?

Upgraded to Grafana 4.0.0

What did you expect to see?

Snazzy charts as always.

What did you see instead? Under which circumstances?

Prometheus ran out of file handles; it didn't crash, but it stopped ingesting new data.

Grafana 4.0.0 introduces a bug in which each chart data fetch is issued over a new HTTP connection, and these connections appear to be kept alive by Grafana even though they are never used again. The issue is described here: grafana/grafana#6759

Clearly Grafana needs to fix this on their end, but it exposes a problem in that Prometheus can be DoSed simply by opening too many connections. Prometheus should be able to fend off a misbehaving frontend without stopping backend data ingestion.

Environment

  • System information:
Linux 3.13.0-76-generic x86_64
  • Prometheus version:
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
  build user:       root@e685d23d8809
  build date:       20161128-09:59:22
  go version:       go1.7.3
  • Alertmanager version: N/A

  • Prometheus configuration file: N/A

  • Alertmanager configuration file: N/A

  • Logs:

Thousands of lines like this:

ERRO[2900] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] http: Accept error: accept tcp [::]:9090: accept4: too many open files; retrying in 1s
ERRO[2898] Error refreshing service xxx_exporter: Get http://localhost:8500/v1/catalog/service/xxx_exporter?index=47790194&wait=30000ms: dial tcp 127.0.0.1:8500: socket: too many open files  source=consul.go:252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/949dcb022d6cfb.db: too many open files  source=storage.go:1495
WARN[2900] Series quarantined.                           fingerprint=7795383ca895e3d7 metric=node_netstat_TcpExt_TCPHPAcks{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/77/95383ca895e3d7.db: too many open files source=storage.go:1646
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files  source=storage.go:1252
ERRO[2900] Error dropping persisted chunks: open /prometheus/data/1b/94b35810940463.db: too many open files  source=storage.go:1495
ERRO[2900] Error while checkpointing: open /prometheus/data/heads.db.tmp: too many open files  source=storage.go:1252
WARN[2900] Series quarantined.                           fingerprint=baf8faede9e50c85 metric=node_vmstat_nr_tlb_local_flush_all{environment="production", host="xxx", instance="xxx", job="node_exporter", role="xxx", zone="xxx"} reason=open /prometheus/data/ba/f8faede9e50c85.db: too many open files source=storage.go:1646

fabxc commented Dec 1, 2016

That's unfortunately a Grafana problem. Can you file a bug report in their repository?

@stuartnelson3 isn't that exactly the bug we fixed for them in a past Grafana version?

fabxc closed this as completed Dec 1, 2016

hasso commented Dec 1, 2016

The reporter stated clearly that, although there is an obvious problem with Grafana, it shouldn't be possible to DoS Prometheus. I was hit by this bug as well: all other datasources remained usable in Grafana, but Prometheus didn't, and it wasn't able to collect data either. THAT's the problem that needs to be addressed in Prometheus.


fabxc commented Dec 1, 2016

Apologies, you are right – I missed that part in my morning dizziness.

fabxc reopened this Dec 1, 2016

stuartnelson3 commented Dec 1, 2016

@fabxc The previous issue that we addressed was canceling in-flight requests if "refresh" is clicked multiple times


fabxc commented Dec 1, 2016

The Grafana issue points to: grafana/grafana@56b7e2d#diff-2ce5b8fd7c5c41aeaf1e8b12a3163b80

The keep-alive time used there seems reasonable. How many connections is your Grafana opening? Is it growing unbounded?
In general the Prometheus storage deals with a lot of files, and increasing the FD limit is necessary at some point. I'm not quite sure what we can reasonably do from the Prometheus side. If Grafana is not used as a proxy, the clients can come from anywhere, and going by source IP doesn't really help.

A general rate limiter would, it seems, have to be configured quite aggressively (to the point where it impacts other clients) to mitigate the issue.


agaoglu commented Dec 1, 2016

I don't have much Go experience (but I have some time, so I'm looking to contribute), but could this be because the http.Server doesn't specify any timeouts?

According to this article, missing timeouts can lead to exactly this "too many open files" issue:
https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/
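
A minimal sketch (with illustrative values, not Prometheus's actual code) of what setting server-side timeouts on Go's net/http server looks like:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":9090",          // Prometheus's default web port, used here for illustration
		Handler:      nil,              // nil means http.DefaultServeMux
		ReadTimeout:  30 * time.Second, // deadline for reading a request; reset for each request on a kept-alive connection
		WriteTimeout: 30 * time.Second, // deadline for writing a response
	}
	// Blocks serving requests; returns a non-nil error if the listener fails.
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```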


fabxc commented Dec 1, 2016

Yes, that's a good reference blog post on timeouts in general.
But it refers to slow connections or disappearing clients, which doesn't seem to be the general issue with Grafana. The HTTP timeouts also don't affect the underlying connection's keep-alive time.

So while we should probably add an upper bound, I'm not sure it solves this particular problem.


agaoglu commented Dec 1, 2016

OK, second try: how about Transport.IdleConnTimeout in Go 1.7? But it seems to have a default value of about 90 seconds, so it's a weak one. I'll try to reproduce first.
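
For context, a minimal sketch showing that IdleConnTimeout (Go 1.7+) lives on the client-side http.Transport, i.e. it is a knob for a client such as Grafana rather than for the Prometheus server; the values and query URL are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:     30 * time.Second, // close idle keep-alive connections after 30s (http.DefaultTransport uses 90s)
			MaxIdleConnsPerHost: 2,                // illustrative cap on pooled connections per backend
		},
	}
	resp, err := client.Get("http://localhost:9090/api/v1/query?query=up")
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}
```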


agaoglu commented Dec 1, 2016

Back to the first idea. It seems the HTTP read timeout does affect the connection's keep-alive time.

This SO answer suggests it does: http://stackoverflow.com/a/29334926
https://golang.org/src/net/http/server.go#L749 confirms that the server sets a new deadline on each request.

I tried it by keeping a telnet session open and sending a request every now and then, and I can confirm it closes idle connections. Transport is for HTTP clients, it seems, so sorry if that caused any confusion; I'm learning on the way.

Anyway, I guess it'd be better if there were a config option for this with some sensible default (30 seconds)?
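
A minimal sketch of what such a config option could look like, using a hypothetical flag name and the suggested 30-second default:

```go
package main

import (
	"flag"
	"net/http"
	"time"
)

// Hypothetical flag name, for illustration only.
var readTimeout = flag.Duration("web.read-timeout", 30*time.Second,
	"Maximum duration before timing out reads of the request; also bounds how long an idle keep-alive connection survives.")

func main() {
	flag.Parse()
	srv := &http.Server{
		Addr:        ":9090",
		ReadTimeout: *readTimeout,
	}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```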

agaoglu added a commit to agaoglu/prometheus that referenced this issue Dec 1, 2016
This also specifies a timeout for idle client connections, which may
cause "too many open files" errors.
See prometheus#2238

sodabrew commented Dec 1, 2016

Thanks everyone for re-reading my suggestion that Prometheus should not stop collecting and writing data when it has a misbehaving client. Increasing the FD limit is basically a whack-a-mole non-solution: for any limit, someone will come along with an even-more-misbehaving client (build a wall, someone brings a taller ladder). Some ideas I had while thinking about this:

  1. Idea: provide a config option for the maximum number of client connections allowed. If this is exhausted, clients are refused gracefully (HTTP 503, or even 429 "Too Many Requests").
  2. Idea: provide a config option to kill a keep-alive client connection if it does not issue any queries for some period.
  3. Idea: for a given FD limit (Prometheus could ask the environment with syscall.Getrlimit, for example; see the sketch after this list), figure out how many file handles are needed for the known set of data points, and allow only as many active connections as remain. I don't necessarily like this idea because it's magic and possibly platform-dependent.

Ideas 1 & 2 together would be a pretty robust solution.
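
The sketch referenced in idea 3: asking the environment for the FD limit with syscall.Getrlimit (Unix-only, which is part of the platform-dependence concern); how to split that budget between storage and connections would be the "magic" part:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Query the process's file-descriptor limit (RLIMIT_NOFILE).
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	// Any scheme for reserving part of this budget for chunk files and
	// handing the remainder to client connections would be heuristic.
	fmt.Printf("fd limits: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```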


sodabrew commented Dec 1, 2016

Does #2169 factor in by any chance as a tool for enforcing a limit on number of client connections?

@davidkarlsen

Just a heads up that grafana 4.0.1 is out now.


fabxc commented Dec 5, 2016

I'm not a fan of any magical introspection (be that file descriptors, memory, or anything else). It makes the system hard to reason about and is a pretty good indication that we went down a wrong path way earlier.

So agreed on 1) and 2).

#2240 sets a read deadline. I think that alone is not sufficient to solve our problems. It kills off slow clients, but from my understanding it is over once we have received the first proper request on a newly accepted connection; thus, it doesn't apply to connections subsequently left open.
As it's advisable to set in general, I'd set it to a generous 15 seconds – I don't think this needs to come with a flag until someone explicitly states a need for it.

The Go HTTP server, as it is, uses a listener with a keep-alive period of 3 minutes. Depending on the platform, it will take that plus a bit more for idle connections to be killed for good. I think that's sane and probably the reason why net/http doesn't expose configuration beyond on/off.
In the case of Grafana, the client even limited itself to 30 seconds – so tweaking the duration here doesn't seem helpful regardless.

I'd suggest using a LimitListener, which can be configured via a flag. If you run Prometheus and specify the FD limit, you can easily and explicitly configure how many of those descriptors may be allocated for connections.
That should avoid impact on storage and other components. Finding and removing the malicious client is not really avoidable at this point. If it's a recurring problem, one could think about disabling keep-alives in the server at runtime as the connections approach the limit. But that goes towards auto-magic once again.
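
A minimal sketch of the LimitListener approach (golang.org/x/net/netutil provides it), with a hypothetical flag name and default:

```go
package main

import (
	"flag"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

// Hypothetical flag name and default, for illustration only.
var maxConns = flag.Int("web.max-connections", 512,
	"Maximum number of simultaneously accepted web connections.")

func main() {
	flag.Parse()
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		panic(err)
	}
	// Once maxConns connections are in flight, LimitListener simply stops
	// accepting new ones; pending connections wait in the kernel backlog and
	// consume no file descriptors in the process, leaving the FD budget to
	// the storage layer.
	if err := http.Serve(netutil.LimitListener(ln, *maxConns), nil); err != nil {
		panic(err)
	}
}
```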


agaoglu commented Dec 5, 2016

#2240 is certainly not sufficient for complete DoS protection, but it solves the simpler problem of closing idle connections. By idle connections I mean connections with no HTTP conversation for some time. It does this by resetting the deadline on each new request on the same connection, so it affects all requests, first and subsequent. BTW, that's what I confirmed with the primitive telnet setup.

The Go HTTP server listener's 3-minute keep-alive period you linked won't close such connections: that is a TCP keep-alive period, which simply checks whether the TCP connection is still usable. It won't close idle connections; it will close connections of lost clients, for example.

I'm not sure where Grafana limited itself. Are you referring to 4.0.1?

On the FD issue, I guess everyone knows the Linux default nofile limit is pretty low for file-juggling daemon processes like Prometheus. My suggestion would be to increase it to a more sensible value first, and then maybe suggest that large-scale users increase it even more, provided that Prometheus is not leaking any.


fabxc commented Dec 5, 2016

Gotcha, that all makes sense. Thanks for the clarification.
Would you mind adding the LimitListener to your PR and making it configurable via flag too? That seems to be an important protection mechanism in any case.


agaoglu commented Dec 6, 2016

@fabxc I have updated the PR to add LimitListener. I also used the govendor tool to add it under vendor/; I hope I got that right.

A small note on LimitListener: it doesn't reject connections after reaching the maximum, it just stops accepting new ones. So it's similar to the backlog in POSIX listen(). I'm not sure about the default backlog size in Go or whether it works similarly to standard listen, but in any case that should no longer be a Prometheus problem, since non-accepted connections have no FD.


sodabrew commented Dec 6, 2016

👍 Thank you!


fabxc commented Dec 6, 2016

Yes. Thanks a lot!


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019