
net: mass connection spike leads to unpredictable amount of memory usage #35407

@szuecs

Description


A spike in TCP connections can lead to a spike in memory usage in TCP handlers (for example http.ServeHTTP), which makes overall memory usage unpredictable. The same happens for UDP servers.

Example projects that have this behavior:

Known cases when this can happen:

  • a targeted DoS attack
  • reconnects from a fleet of API clients

While investigating a problem with connection spikes that caused an OOM kill of our HTTP proxy skipper, I tried to understand the underlying issue.
The problem is caused by the unbounded number of goroutines spawned from the Accept() loop; see the function at https://golang.org/src/net/http/server.go#L2895, where the goroutine is created in line 2927, the last line of the loop.
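
For context, the accept loop in question looks roughly like this (an abridged paraphrase of (*http.Server).Serve from the Go 1.13 source, not the literal code):

// Abridged paraphrase of (*http.Server).Serve (net/http/server.go, Go 1.13);
// backoff on temporary errors and context plumbing are omitted.
func (srv *Server) Serve(l net.Listener) error {
	// ...
	for {
		rw, err := l.Accept()
		if err != nil {
			// temporary errors are retried with backoff; others are returned
			return err
		}
		c := srv.newConn(rw)
		c.setState(c.rwc, StateNew)
		go c.serve(ctx) // one goroutine per connection, with no upper bound
	}
}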

I could reproduce a DoS-like situation with minimal Go code running in Docker containers.
In production, memory spikes to more than 2Gi; normal memory usage in the same production setup is less than 100Mi.

Below I show how to create spikes that are not manageable with an unbounded number of goroutines.

What version of Go are you using (go version)?

$ go version
go1.13.3

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN="/home/sszuecs/go/bin"
GOCACHE="/home/sszuecs/.cache/go-build"
GOENV="/home/sszuecs/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/sszuecs/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/share/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/share/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build878802156=/tmp/go-build -gno-record-gcc-switches"

What did you do?

To show the impact I created a test setup: [attack client] -> [backend]

backend:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

type proxy struct{}

func (*proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	time.Sleep(10 * time.Millisecond)
	fmt.Fprint(w, r.URL.String()) // important: actually use the request
}

func main() {
	proxy := &proxy{}
	srv := &http.Server{
		Addr:    ":9002",
		Handler: proxy,
	}
	log.Fatalf("%v", srv.ListenAndServe())
}

Create a docker container:

FROM alpine
RUN mkdir -p /usr/bin
ADD main /usr/bin/
ENV PATH $PATH:/usr/bin
CMD ["/usr/bin/main"]

build:

% docker build .
Sending build context to Docker daemon  7.392MB
Step 1/5 : FROM alpine
 ---> 11cd0b38bc3c
Step 2/5 : RUN mkdir -p /usr/bin
 ---> Running in 8a0f489fd22c
Removing intermediate container 8a0f489fd22c
 ---> c0b549e856b9
Step 3/5 : ADD main /usr/bin/
 ---> 292e9a346dde
Step 4/5 : ENV PATH $PATH:/usr/bin
 ---> Running in de5e1c78ab94
Removing intermediate container de5e1c78ab94
 ---> 66832a5b3f90
Step 5/5 : CMD ["/usr/bin/main"]
 ---> Running in 83892cb8a768
Removing intermediate container 83892cb8a768
 ---> 5c11f1edbcd6
Successfully built 5c11f1edbcd6

Start the minimal Go backend:

docker run --rm --memory 100m -hostnetwork -p9002:9002 -it 5c11f1edbcd6 /usr/bin/main

Create the attack client, which generates the connection spike:

package main

import (
	"fmt"
	"log"
	"net"
	"sync"
)

func main() {
	addr := "127.0.0.1:9002"
	numConns := 20000 // increase if you don't get the expected result 
	req := "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"
	raddr, err := net.ResolveTCPAddr("tcp", addr)
	if err != nil {
		log.Fatalf("Failed to resolve %s: %v", addr, err)
	}

	var wg, ready sync.WaitGroup
	wg.Add(numConns)
	ready.Add(numConns)
	for i := 0; i < numConns; i++ {
		go func() {
			defer wg.Done()
			ready.Done()
			ready.Wait() // all goroutines at the ~same time
			conn, err := net.DialTCP("tcp", nil, raddr)
			if err != nil {
				log.Printf("Failed to dial: %v", err)
				return
			}
			fmt.Fprint(conn, req) // send the request and keep the connection open
		}()
	}
	wg.Wait()
}

Run the attack client:

go run attackclient.go
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused
2019/11/06 23:17:36 Failed to dial: dial tcp 127.0.0.1:9002: connect: connection refused

When the connection refused errors start, the backend shows:

% docker run --rm --memory 100m -hostnetwork -p9002:9002 -it a87c13d25e37 /usr/bin/main                                                                           
zsh: exit 137   docker run --rm --memory 100m -hostnetwork -p9002:9002 -it a87c13d25e37

Exit code 137 means the process was OOM-killed.

What did you expect to see?

No OOM kill; instead HTTP 5xx responses, connection refused, or similar errors.

What did you see instead?

OOM kill

Possible solution

http.Server{} could have a MaxConcurrency option that limits the number of goroutines that are created. An implementation could be done with a semaphore. It is possible to implement this without a breaking change, such that an unbounded number of goroutines remains the behavior for the zero value of the new option. Another idea would be to derive the value automatically from the cgroup memory limit of the current process, because the relation should roughly be:

memory consumption ~= (sizeof(http.Request) + sizeof(goroutine)) * number(connections)
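
As a rough sketch of the semaphore idea (not an existing net/http API; the names below are mine), concurrency can already be bounded today at the listener level, which is also the approach golang.org/x/net/netutil.LimitListener takes:

package main

import (
	"net"
	"net/http"
	"sync"
)

// limitListener wraps a net.Listener and uses a buffered channel as a
// semaphore so that at most cap(sem) connections are served concurrently.
type limitListener struct {
	net.Listener
	sem chan struct{}
}

func (l *limitListener) Accept() (net.Conn, error) {
	l.sem <- struct{}{} // acquire a slot; blocks once the limit is reached
	c, err := l.Listener.Accept()
	if err != nil {
		<-l.sem // release on accept error
		return nil, err
	}
	return &limitConn{Conn: c, release: func() { <-l.sem }}, nil
}

// limitConn releases its semaphore slot exactly once when closed.
type limitConn struct {
	net.Conn
	once    sync.Once
	release func()
}

func (c *limitConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(c.release)
	return err
}

func main() {
	ln, err := net.Listen("tcp", ":9002")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})}
	// Limit to 1000 concurrent connections (an arbitrary example value).
	srv.Serve(&limitListener{Listener: ln, sem: make(chan struct{}, 1000)})
}

A built-in MaxConcurrency option (or one derived from the cgroup memory limit) would make this the default behavior rather than something every server author has to add themselves.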
