Skip to content

net/http: graceful shutdown race condition #36819

Closed
@zerkms

Description

@zerkms

Update: I have found a super trivial way to reproduce it, see the step at the very bottom

Preamble: I think it's relevant to #23829

What version of Go are you using (go version)?

$ go _version_
go version go1.13.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOHOSTARCH="amd64"
GOHOSTOS="linux"

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ivan/.cache/go-build"
GOENV="/home/ivan/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/srv/work/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ivan/.local/share/umake/go/go-lang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ivan/.local/share/umake/go/go-lang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build903857222=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, os.Interrupt)

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
	}

	go func() {
		fmt.Println(srv.ListenAndServe())
	}()

	<-sig
	fmt.Println(srv.Shutdown(context.Background()))
}

With the following code it's possible to get a case when the new connection in accepted, then closed within milliseconds, which makes the shutdown not graceful

What did you expect to see?

A graceful shutdown

What did you see instead?

Screenshot from 2020-01-28 11-29-15

Screenshot from 2020-01-28 13-04-16

How to reproduce:

Just ab -t 20 -c 5 -A foo:foo http://localhost:8080/ is enough to capture the case. The former is much more rare, the latter is quite frequent.

What you can see on screenshots: the connection is accepted, then as per https://go-review.googlesource.com/c/go/+/121419/ it should be treated as "new" (but it's not due to race), and is closed.
In first case the client had chance to send a request.
In second - it was closed even before that.

In both cases it happens with the Connection reset by peer on the client side, which should never be the case with non-keepalived connections and graceful termination.

Both pcap files are available at: https://www.dropbox.com/s/2co7wkucfey90zz/bug.tgz?dl=0

My assumption on why it's the race and how it happens:

  1. the connection is accepted with rw, e := l.Accept() successfully
  2. A moment later a shutdown request is made
  3. Before the c.setState(c.rwc, StateNew) statement is reached the server completes shutdown
  4. Because the connection is not tracked - http.Server is not aware of it
  5. The process quits, network sockets are closed by kernel

UPDATE:

It can be easier reproducible by adding time.Sleep(3 * time.Second) right before c := srv.newConn(rw)

Steps:

  1. Patch the http/server.go with time.Sleep(3 * time.Second) before c := srv.newConn(rw)
  2. Start the app
  3. Make a request
  4. Send ctrl-c

Expected: the request should complete
Actual:

➜ curl -v --user foo:foo http://localhost:8080
* Rebuilt URL to: http://localhost:8080/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
* Server auth using Basic with user 'foo'
> GET / HTTP/1.1
> Host: localhost:8080
> Authorization: Basic Zm9vOmZvbw==
> User-Agent: curl/7.58.0
> Accept: */*
> 
* Recv failure: Connection reset by peer
* stopped the pause stream!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

And a corresponding pcap: https://www.dropbox.com/s/5vsvwoow4ydcs3f/with-timeout.tgz?dl=0

And a corresponding wireshark screenshot:

Screenshot from 2020-01-28 14-02-22

Metadata

Metadata

Assignees

Labels

FrozenDueToAgeNeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions