net: mass connection spike leads to unpredictable amount of memory usage #35407

Open
szuecs opened this issue Nov 6, 2019 · 3 comments

Comments


szuecs commented Nov 6, 2019

A spike in TCP connections can lead to a spike in memory usage in the connection handlers (for example, an http.Handler's ServeHTTP), which makes memory usage unpredictable. The same happens for UDP servers.

Example projects that have this behavior:

Known cases when this can happen:

  • a specific DoS attack
  • reconnects from a fleet of API clients

While investigating connection spikes that caused an OOM kill of our HTTP proxy, skipper, I tried to understand the underlying issue.
The problem is caused by the unbounded number of goroutines created in the Accept() loop: see the function at https://golang.org/src/net/http/server.go#L2895, whose last line (line 2927) starts a new goroutine for every accepted connection.

I could reproduce a DoS-like situation with minimal Go code running in Docker containers.
In production, memory spikes to more than 2Gi, while normal memory usage in the same setup is less than 100Mi.

Below I show how to create spikes that cannot be handled with an unbounded number of goroutines.

What version of Go are you using (go version)?

$ go version
go1.13.3

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN="/home/sszuecs/go/bin"
GOCACHE="/home/sszuecs/.cache/go-build"
GOENV="/home/sszuecs/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/sszuecs/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/share/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/share/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build878802156=/tmp/go-build -gno-record-gcc-switches"

What did you do?

To show the impact, I created a test setup: [attack client] -> [backend]

backend:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

type proxy struct{}

func (*proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	time.Sleep(10 * time.Millisecond)
	fmt.Fprint(w, r.URL.String()) // important: use the request! (Fprint, so a '%' in the URL is not treated as a format verb)
}

func main() {
	proxy := &proxy{}
	srv := &http.Server{
		Addr:    ":9002",
		Handler: proxy,
	}
	log.Fatalf("%v", srv.ListenAndServe())
}

Create a Docker image (this assumes the backend was compiled into a binary named main in the build context):

FROM alpine
RUN mkdir -p /usr/bin
ADD main /usr/bin/
ENV PATH $PATH:/usr/bin
CMD ["/usr/bin/main"]

build:

% docker build .
Sending build context to Docker daemon  7.392MB
Step 1/5 : FROM alpine
 ---> 11cd0b38bc3c
Step 2/5 : RUN mkdir -p /usr/bin
 ---> Running in 8a0f489fd22c
Removing intermediate container 8a0f489fd22c
 ---> c0b549e856b9
Step 3/5 : ADD main /usr/bin/
 ---> 292e9a346dde
Step 4/5 : ENV PATH $PATH:/usr/bin
 ---> Running in de5e1c78ab94
Removing intermediate container de5e1c78ab94
 ---> 66832a5b3f90
Step 5/5 : CMD ["/usr/bin/main"]
 ---> Running in 83892cb8a768
Removing intermediate container 83892cb8a768
 ---> 5c11f1edbcd6
Successfully built 5c11f1edbcd6

Start the minimal Go backend with a 100 MiB memory limit:

docker run --rm --memory 100m -hostnetwork -p9002:9002 -it 5c11f1edbcd6 /usr/bin/main

Create an attack client that generates the connection spike:

package main

import (
	"fmt"
	"log"
	"net"
	"sync"
)

func main() {
	addr := "127.0.0.1:9002"
	numConns := 20000 // increase if you don't get the expected result 
	req := "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"
	raddr, err := net.ResolveTCPAddr("tcp", addr)
	if err != nil {
		log.Fatalf("Failed to resolve %s: %v", addr, err)
	}

	var wg, ready sync.WaitGroup
	wg.Add(numConns)
	ready.Add(numConns)
	for i := 0; i < numConns; i++ {
		go func() {
			defer wg.Done()
			ready.Done()
			ready.Wait() // all goroutines at the ~same time
			conn, err := net.DialTCP("tcp", nil, raddr)
			if err != nil {
				log.Printf("Failed to dial: %v", err)
				return
			}
			fmt.Fprint(conn, req) // send the raw request; no format interpretation
		}()
	}
	wg.Wait()
}
go run attackclient.go
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused

When the "connection refused" errors start, the backend shows:

% docker run --rm --memory 100m -hostnetwork -p9002:9002 -it a87c13d25e37 /usr/bin/main                                                                           
zsh: exit 137   docker run --rm --memory 100m -hostnetwork -p9002:9002 -it a87c13d25e37

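Exit code 137 means the process was killed by SIGKILL (128 + 9), which is the signal the OOM killer sends, so this is an OOM kill.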

What did you expect to see?

No OOM kill, but HTTP 5xx responses, refused connections, or similar errors.

What did you see instead?

An OOM kill.

Possible solution

http.Server could have a MaxConcurrency option that limits the number of goroutines it creates. An implementation could use a semaphore. This is possible without a breaking change if the zero value of the new option keeps today's behavior, an unbounded number of goroutines. Another idea would be to set this value automatically from the cgroup memory limit of the current process, because the relation should be roughly:

memory consumption ~= (sizeof(http.Request) + sizeof(goroutine)) * number(connections)
networkimprov commented Nov 7, 2019

@bradfitz

bradfitz (Member) commented Nov 7, 2019

@networkimprov, I don't see the relation? That they both involve network stuff?

networkimprov commented Nov 7, 2019

@bradfitz you proposed to eliminate goroutines for idle connections:

Actually, the more I think about this, I don't even want my idle HTTP/RPC goroutines to stick around blocked in a read call. In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.
