too many goroutines #997

Closed · alpc40 opened this issue Jul 15, 2019 · 21 comments

alpc40 commented Jul 15, 2019

Hi,
I'm using this on Windows 10. When I use 200 threads to run nslookup and send UDP packets to the DNS server, I find that there are too many goroutines, and they occupy a large amount of heap memory.

So I reviewed the code and found that there is no limit on concurrency when processing UDP packets in server.go. I'm not sure whether this is intended. Can anybody help me?

my code:

func serveDNS(laddr string) error {
	serveMux := dns.NewServeMux()
	serveMux.HandleFunc(".", handleDnsRequest)
	glog.Errorf("serveDNS Begin...")
	e := make(chan error)
	for _, _net := range [...]string{"udp", "tcp"} {
		srv := &dns.Server{Addr: laddr, Net: _net, Handler: serveMux}
		go func(srv *dns.Server) {
			e <- srv.ListenAndServe()
		}(srv)
	}
	return <-e
}
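
For reference, handleDnsRequest isn't shown above; a minimal handler along these lines (illustrative only, not the original code) would be enough to exercise this setup:

func handleDnsRequest(w dns.ResponseWriter, r *dns.Msg) {
	// Illustrative only: reply with an empty answer so each query completes.
	m := new(dns.Msg)
	m.SetReply(r)
	_ = w.WriteMsg(m)
}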

dns code:

github.com/miekg/dns/server.go

for srv.isStarted() {
		m, s, err := reader.ReadUDP(l, rtimeout)
		if err != nil {
			if !srv.isStarted() {
				return nil
			}
			if netErr, ok := err.(net.Error); ok && netErr.Temporary() {
				continue
			}
			return err
		}
		if len(m) < headerSize {
			if cap(m) == srv.UDPSize {
				srv.udpPool.Put(m[:srv.UDPSize])
			}
			continue
		}
		wg.Add(1)
		go srv.serveUDPPacket(&wg, m, l, s)
	}

tmthrgd commented Jul 24, 2019

@miekg I think this is something we should fix. I'm thinking a simple opt-in/opt-out semaphore w/ optional timeout. We can just use a chan struct{} as our semaphore. This might help with #916 and the related CoreDNS issues.

Something like this perhaps:

type Server struct {
	// Controls the maximum number of UDP queries to process concurrently. If the value is
	// zero or negative, no limit is applied. Once the limit is reached, the server will block or
	// reject the query depending on UDPQueryTimeout.
	MaxConcurrentUDPQueries int
	// If this timeout is positive, the server will reject the query if it can't handle the request
	// within the timeout.
	UDPQueryTimeout time.Duration
}
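
A rough sketch of how the read loop in serveUDP might use such a semaphore (illustrative only; the field names above and the timeout handling are assumptions, not actual server.go code):

sem := make(chan struct{}, srv.MaxConcurrentUDPQueries) // counting semaphore

// In place of the unconditional `go srv.serveUDPPacket(...)`:
select {
case sem <- struct{}{}:
	wg.Add(1)
	go func() {
		defer func() { <-sem }() // release the slot when the query is done
		srv.serveUDPPacket(&wg, m, l, s)
	}()
case <-time.After(srv.UDPQueryTimeout): // assumes a positive timeout
	// No slot available in time: drop the query and recycle its buffer.
	if cap(m) == srv.UDPSize {
		srv.udpPool.Put(m[:srv.UDPSize])
	}
}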

This approach isn't perfect and has problems around choosing the correct value, but I think we'll keep seeing these problems otherwise.

I'm not sure whether it makes sense to try and respond with an error or just drop the query on the floor.

@miekg Thoughts?


miekg commented Jul 24, 2019 via email


tmthrgd commented Jul 25, 2019

@miekg I'm not sure it's possible to do that, though. Because we read packets in a single loop before dispatching, we'll only ever have one goroutine blocking while packets are buffered by the kernel.


miekg commented Jul 25, 2019 via email


miekg commented Sep 21, 2019

In CoreDNS we can (not enabled) throw away everything older than 5 seconds, because of client timeouts. If we can get hold of the packet's timestamp, we can implement that here.


miekg commented Oct 1, 2019

@tmthrgd could we bring back the workers? I think they were ripped out a bit hastily, although that code could be massively improved.


miekg commented Oct 2, 2019

This approach isn't perfect and has problems around choosing the correct value, but I think we'll keep seeing these problems otherwise.

Actually, reading your proposal correctly: what's that value going to be, and can it be determined dynamically? If not - I've seen this at Google - you're just passing the problem down to the operator (an SRE at Google), and they are left with the same question. To add to the difficulty, we compile for various platforms and CPU architectures.


tmthrgd commented Oct 3, 2019

The workers introduced bugs so I’d rather not just bring them back without thought.

That’s exactly what my earlier objections were about. If the value isn’t obvious or computable, you just shift the problem elsewhere. I’m not sure what the right approach is.

Just thinking aloud: perhaps we could add some sort of UDPDispatch func(...) to the server that would let people choose to run handlers sequentially, use a worker pool, or whatever they like. It's flexible in ways that might actually be useful; a sketch follows.
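
One possible shape, purely as a sketch (UDPDispatch does not exist in the library; the signature is an assumption):

type Server struct {
	// Hypothetical hook: if non-nil, the server would call this instead of
	// `go srv.serveUDPPacket(...)` for every accepted UDP packet. The serve
	// closure does the actual work; the callback decides when and where to
	// run it (inline, in a worker pool, or not at all).
	UDPDispatch func(serve func())
}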


miekg commented Oct 3, 2019

That still pushes the problem downwards - I'd rather do something sensible here.

I'm also load testing CoreDNS (localhost <-> localhost, so take it with a slight grain of salt), with the backend served by the erratic plugin (which can introduce delays and drops; so far I've only tested with delays). I'm doing this with dnsperf, which may be too smart. But I'm not seeing the problem: 300 goroutines, memory is sane, perf is 40K qps (the backend does 130K qps directly, so some optimization might be in order).


cretz commented Oct 28, 2019

Just thinking aloud: perhaps we could add some sort of UDPDispatch func(...) to the server that would let people choose to run handlers sequentially, use a worker pool, or whatever they like. It's flexible in ways that might actually be useful.

That still pushes the problem downwards - I'd rather do something sensible here.

Personally, as I've run into some of these problems myself, I would love an option to essentially use my own callback instead of go serveUDPPacket (and its TCP equivalent). I have some constraints that would allow me to use reasonable worker pools based on what's incoming (e.g. remote addr or header), and I wouldn't mind certain parts being synchronous. I would have no problem if, when this func was provided, it ignored DecorateWriter/MsgAcceptFunc (or at least put the onus on me) and let me use unpackMsgHdr when I wanted.


miekg commented Nov 1, 2019

This might be a way forward, but neither the original issue in CoreDNS nor the issue as initially posted above has been root-caused. If you don't close out (old) goroutines, you'll end up with a lot of them.


miekg commented Nov 6, 2019

an insightful comment on the coredns issue: coredns/coredns#2593 (comment)

In general, getting rid of your goroutines is what you want to do; usually this is fine, except when your backend is so slow that you can't. There are two ways out here:

  1. deal with it when you detect this situation and start SERVFAILing
  2. prevent getting into this situation

For (2), even with worker pools we got into the bad place (this might be down to the implementation we had at the time, but I'm not too sure about that).

Even if you think you've done (2), you still want (1). So I think focusing on (1) makes sense here.

WDYT @tmthrgd ?


szuecs commented Nov 6, 2019

@miekg for (1) you would need to respond in the same goroutine in which you received the connection, so before https://github.com/miekg/dns/blob/master/server.go#L434 and also before https://github.com/miekg/dns/blob/master/server.go#L479.

Detection and prevention need to know the payload size parsed from the request, the number of goroutines, the goroutine size, and the memory available to the process (cgroup v1: /sys/fs/cgroup/memory/memory.limit_in_bytes).
You can limit the number of goroutines; that can be used as the maximum.
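
For illustration, reading that cgroup v1 limit could look something like this (Linux only; the helper name is made up and this is not part of the library):

import (
	"io/ioutil"
	"strconv"
	"strings"
)

// cgroupMemoryLimit returns the cgroup v1 memory limit, in bytes, for the
// current process by reading the file mentioned above.
func cgroupMemoryLimit() (uint64, error) {
	b, err := ioutil.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}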


tmthrgd commented Nov 8, 2019

@miekg I would be very open to trying (1) if someone has a solid idea about how to do that.


miekg commented Nov 15, 2019

@szuecs getting the current memory allocated to a process is not portable (sadly). We could potentially make that the knob you need to tweak; i.e. I have 2 GB, please figure out how many things I can do with that, and SERVFAIL if I hit it.

@tmthrgd memory usage seems to be the overarching thing we can do something sensible about. We could start with the dumb thing of 2k * #goroutines < X -> OK, >= X -> SERVFAIL. runtime.NumGoroutine would make this trivial.

Follow-up question: should this be a core "feature" or left to the application? I.e. even in the case of CoreDNS, you don't want to make a cache plugin suffer from a slow backend used in the forward plugin.
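
A dumb version of that check, done at the top of a handler (a sketch only; the 2 KiB-per-goroutine estimate, the budget, and the function names are assumptions):

import (
	"runtime"

	"github.com/miekg/dns"
)

const perGoroutine = 2 << 10 // rough 2 KiB estimate per goroutine

// overloaded reports whether the estimated goroutine footprint exceeds budget bytes.
func overloaded(budget int) bool {
	return runtime.NumGoroutine()*perGoroutine >= budget
}

func handle(w dns.ResponseWriter, r *dns.Msg) {
	if overloaded(2 << 30) { // e.g. a 2 GB budget
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeServerFailure) // SERVFAIL
		_ = w.WriteMsg(m)
		return
	}
	// ... normal handling ...
}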


miekg commented Nov 17, 2019

One of the more interesting things we could do here is slow down; i.e. intentionally start sleeping in the loop that accepts packets once we detect that we're going to breach some limit in the next second or so. I think this pushes the queue of waiting packets out to the network interface, where kernel-level limits kick in, meaning you'll eventually reach a state where you (hopefully) send back an ICMP unreachable message.


miekg commented Nov 22, 2019

I'm thinking along these lines:

diff --git server.go server.go
index 3cf1a024..bd83511f 100644
--- server.go
+++ server.go
@@ -9,6 +9,7 @@ import (
 	"errors"
 	"io"
 	"net"
+	"runtime"
 	"strings"
 	"sync"
 	"time"
@@ -458,8 +459,25 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 
 	rtimeout := srv.getReadTimeout()
 	// deadline is not used here
+	last := time.Now()
+	pkts := uint64(0)
+	max := 1500
 	for srv.isStarted() {
 		m, s, err := reader.ReadUDP(l, rtimeout)
+		pkts++
+		if pkts%100 == 0 {
+			rate := 100.0 / time.Since(last).Seconds()
+			left := float64(max-runtime.NumGoroutine()) / rate
+			switch {
+			case left < 0.0:
+				time.Sleep(10 * time.Microsecond)
+			case left >= 0.0 && left < 1.0:
+				time.Sleep(5 * time.Microsecond)
+			case left >= 1.0 && left < 2.0:
+				time.Sleep(2 * time.Microsecond)
+			}
+		}
+
 		if err != nil {
 			if !srv.isStarted() {
 				return nil
@@ -477,6 +495,7 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 		}
 		wg.Add(1)
 		go srv.serveUDPPacket(&wg, m, l, s)
+		last = time.Now()
 	}
 
 	return nil


miekg commented Nov 25, 2019

Slightly better, I think:

diff --git server.go server.go
index 3cf1a024..cc1b46c9 100644
--- server.go
+++ server.go
@@ -9,6 +9,7 @@ import (
 	"errors"
 	"io"
 	"net"
+	"runtime"
 	"strings"
 	"sync"
 	"time"
@@ -458,8 +459,26 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 
 	rtimeout := srv.getReadTimeout()
 	// deadline is not used here
+	last := time.Now()
+	pkts := uint64(0)
+	max := 1500 // 1500 is a random number
 	for srv.isStarted() {
 		m, s, err := reader.ReadUDP(l, rtimeout)
+		pkts++
+		numgo := runtime.NumGoroutine()
+		if pkts%100 == 0 { // && numgo > max/2 {
+			rate := 100.0 / time.Since(last).Seconds()
+			left := float64(max-numgo) / rate
+			switch {
+			case left < 0.0:
+				// 50 is a random number.
+				time.Sleep(50 * time.Microsecond)
+			default:
+				sleep := time.Since(last).Nanoseconds() / 100
+				time.Sleep(time.Duration(sleep))
+			}
+		}
+
 		if err != nil {
 			if !srv.isStarted() {
 				return nil
@@ -477,6 +496,7 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 		}
 		wg.Add(1)
 		go srv.serveUDPPacket(&wg, m, l, s)
+		last = time.Now()
 	}
 
 	return nil

I'll factor this out into a user-callable function. I'm not sure what signature it should have or what it should be named. I also need to test it somehow.


cretz commented Nov 25, 2019

If possible, can you also make this configurable and/or opt-in/opt-out? (Ideally, give us the tools to build serveUDP ourselves so we can call serveUDPPacket in our own ways, but I understand that's unfortunately not a focus.)


miekg commented Nov 25, 2019 via email

miekg added a commit that referenced this issue Jan 8, 2020
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>
miekg added a commit that referenced this issue Jan 9, 2020
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>
miekg added a commit that referenced this issue Jan 9, 2020
Refuse packets when we're over a certain limit of goroutines.
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>

miekg commented Feb 17, 2020

You can add a limit reader or something, or keep track of concurrency yourself. After lots of back and forth, I don't think we should provide something out of the box.
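
Keeping track of concurrency yourself can be as simple as wrapping your handler with a semaphore (a sketch; the limit, the SERVFAIL-on-overload choice, and the wrapper name are up to you):

// limited caps how many queries run concurrently; excess queries get SERVFAIL.
func limited(next dns.HandlerFunc, max int) dns.HandlerFunc {
	sem := make(chan struct{}, max)
	return func(w dns.ResponseWriter, r *dns.Msg) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next(w, r)
		default:
			m := new(dns.Msg)
			m.SetRcode(r, dns.RcodeServerFailure)
			_ = w.WriteMsg(m)
		}
	}
}

// Usage: serveMux.HandleFunc(".", limited(handleDnsRequest, 1000))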
