too many goroutines #997

Closed · alpc40 opened this issue Jul 15, 2019 · 21 comments

alpc40 commented Jul 15, 2019

Hi,
I'm using this on Windows 10. When I use 200 threads to run nslookup and send UDP packets to the DNS server, I find that there are too many goroutines, and they occupy a large amount of heap memory.

So I reviewed the code and found that there is no limit on concurrency when processing UDP packets in server.go. I'm not sure whether this is intended. Can anybody help me?

my code:

func serveDNS(laddr string) error {
	serveMux := dns.NewServeMux()
	serveMux.HandleFunc(".", handleDnsRequest)
	glog.Errorf("serveDNS Begin...")
	e := make(chan error)
	for _, _net := range [...]string{"udp", "tcp"} {
		srv := &dns.Server{Addr: laddr, Net: _net, Handler: serveMux}
		go func(srv *dns.Server) {
			e <- srv.ListenAndServe()
		}(srv)
	}
	return <-e
}
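
For reference, handleDnsRequest isn't shown above; a minimal handler along these lines (illustrative only, not the original code) would be enough to exercise this setup:

func handleDnsRequest(w dns.ResponseWriter, r *dns.Msg) {
	// Illustrative only: reply with an empty answer so each query completes.
	m := new(dns.Msg)
	m.SetReply(r)
	_ = w.WriteMsg(m)
}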

dns code:

github.com/miekg/dns/server.go

for srv.isStarted() {
		m, s, err := reader.ReadUDP(l, rtimeout)
		if err != nil {
			if !srv.isStarted() {
				return nil
			}
			if netErr, ok := err.(net.Error); ok && netErr.Temporary() {
				continue
			}
			return err
		}
		if len(m) < headerSize {
			if cap(m) == srv.UDPSize {
				srv.udpPool.Put(m[:srv.UDPSize])
			}
			continue
		}
		wg.Add(1)
		go srv.serveUDPPacket(&wg, m, l, s)
	}

tmthrgd commented Jul 24, 2019

@miekg I think this is something we should fix. I'm thinking a simple opt-in/opt-out semaphore w/ optional timeout. We can just use a chan struct{} as our semaphore. This might help with #916 and the related CoreDNS issues.

Something like this perhaps:

type Server struct {
	// Controls the maximum number of UDP queries to process concurrently. If the value is
	// zero or negative, no limit is applied. Once the limit is reached, the server will block or
	// reject the query depending on UDPQueryTimeout.
	MaxConcurrentUDPQueries int
	// If this timeout is positive, the server will reject the query if it can't handle the request
	// within the timeout.
	UDPQueryTimeout time.Duration
}
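
A rough sketch of how the read loop in serveUDP might use such a semaphore (illustrative only; the field names above and the timeout handling are assumptions, not actual server.go code):

sem := make(chan struct{}, srv.MaxConcurrentUDPQueries) // counting semaphore

// In place of the unconditional `go srv.serveUDPPacket(...)`:
select {
case sem <- struct{}{}:
	wg.Add(1)
	go func() {
		defer func() { <-sem }() // release the slot when the query is done
		srv.serveUDPPacket(&wg, m, l, s)
	}()
case <-time.After(srv.UDPQueryTimeout): // assumes a positive timeout
	// No slot available in time: drop the query and recycle its buffer.
	if cap(m) == srv.UDPSize {
		srv.udpPool.Put(m[:srv.UDPSize])
	}
}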

This approach isn't perfect and has problems around choosing the correct value, but I think we'll keep seeing these problems otherwise.

I'm not sure whether it makes sense to try and respond with an error or just drop the query on the floor.

@miekg Thoughts?


miekg commented Jul 24, 2019 via email


tmthrgd commented Jul 25, 2019

@miekg I'm not sure it's possible to do that, though. Because we read packets in a single loop before dispatching, we'll only ever have one goroutine blocking while packets are buffered by the kernel.


miekg commented Jul 25, 2019 via email


miekg commented Sep 21, 2019

In CoreDNS we can (not enabled) throw away everything older than 5 seconds, because of client timeouts. If we can get hold of the packet's timestamp, we can implement that here.


miekg commented Oct 1, 2019

@tmthrgd could we bring back the workers? I think they were ripped out a bit hastily, although that code could be massively improved.


miekg commented Oct 2, 2019

This approach isn't perfect and has problems around choosing the correct value, but I think we'll keep seeing these problems otherwise.

Actually, reading your proposal correctly: what's that value going to be, and can it be determined dynamically? If not - I've seen this at Google - you're just passing the problem down to the operator (an SRE at Google), and they are left with the same question. To add to the difficulty, we compile for various platforms and CPU architectures.


tmthrgd commented Oct 3, 2019

The workers introduced bugs so I’d rather not just bring them back without thought.

That’s exactly what my earlier objections were about. If the value isn’t obvious or computable, you just shift the problem elsewhere. I’m not sure what the right approach is.

Just thinking aloud: perhaps we could add some sort of UDPDispatch func(...) to the server that would let people choose to run handlers sequentially, use a worker pool, or whatever they like. It's flexible in ways that might actually be useful; a sketch follows.
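
One possible shape, purely as a sketch (UDPDispatch does not exist in the library; the signature is an assumption):

type Server struct {
	// Hypothetical hook: if non-nil, the server would call this instead of
	// `go srv.serveUDPPacket(...)` for every accepted UDP packet. The serve
	// closure does the actual work; the callback decides when and where to
	// run it (inline, in a worker pool, or not at all).
	UDPDispatch func(serve func())
}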


miekg commented Oct 3, 2019

That still pushes the problem downwards - I'd rather do something sensible here.

I'm also load testing CoreDNS (localhost <-> localhost, so take it with a slight grain of salt), with the backend served by the erratic plugin (which can introduce delays and drops; so far I've only tested with delays). I'm doing this with dnsperf, which may be too smart. But I'm not seeing the problem: 300 goroutines, memory is sane, perf is 40K qps (the backend does 130K qps directly, so some optimization might be in order).


cretz commented Oct 28, 2019

Just thinking aloud: perhaps we could add some sort of UDPDispatch func(...) to the server that would let people choose to run handlers sequentially, use a worker pool, or whatever they like. It's flexible in ways that might actually be useful.

That still pushes the problem downwards - I'd rather do something sensible here.

Personally, as I've run into some of these problems myself, I would love an option to essentially use my own callback instead of go serveUDPPacket (and its TCP equivalent). I have some constraints that would allow me to use reasonable worker pools based on what's incoming (e.g. remote addr or header), and I wouldn't mind certain parts being synchronous. I would have no problem if, when this func was provided, it ignored DecorateWriter/MsgAcceptFunc (or at least put the onus on me) and let me use unpackMsgHdr when I wanted.


miekg commented Nov 1, 2019

This might be a way forward, but neither the original issue in CoreDNS nor the issue as initially posted above has been root-caused. If you don't close out (old) goroutines, you'll end up with a lot of them.


miekg commented Nov 6, 2019

an insightful comment on the coredns issue: coredns/coredns#2593 (comment)

In general, getting rid of your goroutines is what you want to do; usually this is fine, except when your backend is so slow that you can't. There are two ways out here:

  1. deal with it when you detect this situation and start SERVFAILing
  2. prevent getting into this situation

For (2), even with worker pools we got into the bad place (this might be down to the implementation we had at the time, but I'm not too sure about that).

Even if you think you've done (2), you still want (1). So I think focusing on (1) makes sense here.

WDYT @tmthrgd ?


szuecs commented Nov 6, 2019

@miekg for (1) you would need to respond in the same goroutine in which you received the connection, so before https://github.com/miekg/dns/blob/master/server.go#L434 and also before https://github.com/miekg/dns/blob/master/server.go#L479.

Detection and prevention need to know the payload size parsed from the request, the number of goroutines, the goroutine size, and the memory available to the process (cgroup v1: /sys/fs/cgroup/memory/memory.limit_in_bytes).
You can limit the number of goroutines; that can be used as the maximum.
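
For illustration, reading that cgroup v1 limit could look something like this (Linux only; the helper name is made up and this is not part of the library):

import (
	"io/ioutil"
	"strconv"
	"strings"
)

// cgroupMemoryLimit returns the cgroup v1 memory limit, in bytes, for the
// current process by reading the file mentioned above.
func cgroupMemoryLimit() (uint64, error) {
	b, err := ioutil.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}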


tmthrgd commented Nov 8, 2019

@miekg I would be very open to trying (1) if someone has a solid idea about how to do that.


miekg commented Nov 15, 2019

@szuecs getting the current memory allocated to a process is not portable (sadly). We could potentially make that the knob you need to tweak; i.e. I have 2 GB, please figure out how many things I can do with that, and SERVFAIL if I hit it.

@tmthrgd memory usage seems to be the overarching thing we can do something sensible about. We could start with the dumb thing of 2k * #goroutines < X -> OK, >= X -> SERVFAIL. runtime.NumGoroutine would make this trivial.

Follow-up question: should this be a core "feature" or left to the application? I.e. even in the case of CoreDNS, you don't want to make a cache plugin suffer from a slow backend used in the forward plugin.
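
A dumb version of that check, done at the top of a handler (a sketch only; the 2 KiB-per-goroutine estimate, the budget, and the function names are assumptions):

import (
	"runtime"

	"github.com/miekg/dns"
)

const perGoroutine = 2 << 10 // rough 2 KiB estimate per goroutine

// overloaded reports whether the estimated goroutine footprint exceeds budget bytes.
func overloaded(budget int) bool {
	return runtime.NumGoroutine()*perGoroutine >= budget
}

func handle(w dns.ResponseWriter, r *dns.Msg) {
	if overloaded(2 << 30) { // e.g. a 2 GB budget
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeServerFailure) // SERVFAIL
		_ = w.WriteMsg(m)
		return
	}
	// ... normal handling ...
}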


miekg commented Nov 17, 2019

One of the more interesting things we could do here is slow down; i.e. intentionally start sleeping in the loop that accepts packets once we detect that we're going to breach some limit in the next second or so. I think this pushes the queue of waiting packets out to the network interface, where kernel-level limits kick in, meaning you'll eventually reach a state where you (hopefully) send back an ICMP unreachable message.


miekg commented Nov 22, 2019

I'm thinking along these lines:

diff --git server.go server.go
index 3cf1a024..bd83511f 100644
--- server.go
+++ server.go
@@ -9,6 +9,7 @@ import (
 	"errors"
 	"io"
 	"net"
+	"runtime"
 	"strings"
 	"sync"
 	"time"
@@ -458,8 +459,25 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 
 	rtimeout := srv.getReadTimeout()
 	// deadline is not used here
+	last := time.Now()
+	pkts := uint64(0)
+	max := 1500
 	for srv.isStarted() {
 		m, s, err := reader.ReadUDP(l, rtimeout)
+		pkts++
+		if pkts%100 == 0 {
+			rate := 100.0 / time.Since(last).Seconds()
+			left := float64(max-runtime.NumGoroutine()) / rate
+			switch {
+			case left < 0.0:
+				time.Sleep(10 * time.Microsecond)
+			case left >= 0.0 && left < 1.0:
+				time.Sleep(5 * time.Microsecond)
+			case left >= 1.0 && left < 2.0:
+				time.Sleep(2 * time.Microsecond)
+			}
+		}
+
 		if err != nil {
 			if !srv.isStarted() {
 				return nil
@@ -477,6 +495,7 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 		}
 		wg.Add(1)
 		go srv.serveUDPPacket(&wg, m, l, s)
+		last = time.Now()
 	}
 
 	return nil


miekg commented Nov 25, 2019

Slightly better, I think:

diff --git server.go server.go
index 3cf1a024..cc1b46c9 100644
--- server.go
+++ server.go
@@ -9,6 +9,7 @@ import (
 	"errors"
 	"io"
 	"net"
+	"runtime"
 	"strings"
 	"sync"
 	"time"
@@ -458,8 +459,26 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 
 	rtimeout := srv.getReadTimeout()
 	// deadline is not used here
+	last := time.Now()
+	pkts := uint64(0)
+	max := 1500 // 1500 is a random number
 	for srv.isStarted() {
 		m, s, err := reader.ReadUDP(l, rtimeout)
+		pkts++
+		numgo := runtime.NumGoroutine()
+		if pkts%100 == 0 { // && numgo > max/2 {
+			rate := 100.0 / time.Since(last).Seconds()
+			left := float64(max-numgo) / rate
+			switch {
+			case left < 0.0:
+				// 50 is a random number.
+				time.Sleep(50 * time.Microsecond)
+			default:
+				sleep := time.Since(last).Nanoseconds() / 100
+				time.Sleep(time.Duration(sleep))
+			}
+		}
+
 		if err != nil {
 			if !srv.isStarted() {
 				return nil
@@ -477,6 +496,7 @@ func (srv *Server) serveUDP(l *net.UDPConn) error {
 		}
 		wg.Add(1)
 		go srv.serveUDPPacket(&wg, m, l, s)
+		last = time.Now()
 	}
 
 	return nil

I'll factor this out into a user-callable function. I'm not sure what signature it should have or what it should be named. I also need to test it somehow.


cretz commented Nov 25, 2019

If possible, can you also make this configurable and/or opt-in/opt-out? (Ideally, give us the tools to build serveUDP ourselves so we can call serveUDPPacket in our own ways, but I understand that's unfortunately not a focus.)


miekg commented Nov 25, 2019 via email

miekg added a commit that referenced this issue Jan 8, 2020
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>
miekg added a commit that referenced this issue Jan 9, 2020
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>
miekg added a commit that referenced this issue Jan 9, 2020
Refuse packets when we're over a certain limit of goroutines.
This includes code from #1052 and adds tests.

See for further discussion: #997

Signed-off-by: Miek Gieben <miek@miek.nl>

miekg commented Feb 17, 2020

You can add a limit reader or something, or keep track of concurrency yourself. After lots of back and forth, I don't think we should provide something out of the box.
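
Keeping track of concurrency yourself can be as simple as wrapping your handler with a semaphore (a sketch; the limit, the SERVFAIL-on-overload choice, and the wrapper name are up to you):

// limited caps how many queries run concurrently; excess queries get SERVFAIL.
func limited(next dns.HandlerFunc, max int) dns.HandlerFunc {
	sem := make(chan struct{}, max)
	return func(w dns.ResponseWriter, r *dns.Msg) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next(w, r)
		default:
			m := new(dns.Msg)
			m.SetRcode(r, dns.RcodeServerFailure)
			_ = w.WriteMsg(m)
		}
	}
}

// Usage: serveMux.HandleFunc(".", limited(handleDnsRequest, 1000))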
