Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GracefulStop() doesn’t always drain the channel #4282

Closed
didrocks opened this issue Mar 19, 2021 · 6 comments
Closed

GracefulStop() doesn’t always drain the channel #4282

didrocks opened this issue Mar 19, 2021 · 6 comments

Comments

@didrocks
Copy link

didrocks commented Mar 19, 2021

(as requested on bug #4272, here is the same bug with a simpler reproducer without systemd)

What version of gRPC are you using?

    github.com/golang/protobuf v1.4.2
    google.golang.org/grpc v1.36.0
    google.golang.org/protobuf v1.25.0
    google.golang.org/grpc/cmd/protoc-gen-go-grpc v1.1.0 // indirect

What version of Go are you using (go version)?

go version go1.16 linux/amd64

What operating system (Linux, Windows, …) and version?

Linux (Ubuntu) hirsute (devel series, incoming 21.04)

What did you do?

Using systemd with an unix socket, after requesting a GracefulStop on our grpc server, we see that the service is sometimes
restarted through socket activation (like 50% of the time).

The cause seems to be that even if all clients are disconnected, the grpc server can write a ping on the socket
before disconnecting, without draining it. So, it leaves a message on the socket and triggers a restart of the daemon by systemd.

We have a reproducer available at https://github.com/ubuntu/grpcstop, updated to remove systemd from the equation as requested on bug #4272.

To run this test program execute the command:
GODEBUG=http2debug=2 go run . -logtostderr=true

The trace of a restarted behaviour:

I0319 10:14:34.909731 1788856 main.go:55] Server ready
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote SETTINGS len=0
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote SETTINGS len=6, settings: MAX_FRAME_SIZE=16384
2021/03/19 10:14:34 http2: Framer 0xc000234000: read SETTINGS len=0
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read SETTINGS len=6, settings: MAX_FRAME_SIZE=16384
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote SETTINGS flags=ACK len=0
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote SETTINGS flags=ACK len=0
2021/03/19 10:14:34 http2: Framer 0xc000234000: read SETTINGS flags=ACK len=0
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read SETTINGS flags=ACK len=0
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote HEADERS flags=END_HEADERS stream=1 len=67
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote DATA flags=END_STREAM stream=1 len=5 data="\x00\x00\x00\x00\x00"
2021/03/19 10:14:34 http2: Framer 0xc000234000: read HEADERS flags=END_HEADERS stream=1 len=67
2021/03/19 10:14:34 http2: decoded hpack field header field ":method" = "POST"
2021/03/19 10:14:34 http2: decoded hpack field header field ":scheme" = "http"
2021/03/19 10:14:34 http2: decoded hpack field header field ":path" = "/GrpcStop/Stop"
2021/03/19 10:14:34 http2: decoded hpack field header field ":authority" = "/tmp/grpcstop.socket"
2021/03/19 10:14:34 http2: decoded hpack field header field "content-type" = "application/grpc"
2021/03/19 10:14:34 http2: decoded hpack field header field "user-agent" = "grpc-go/1.36.0"
2021/03/19 10:14:34 http2: decoded hpack field header field "te" = "trailers"
2021/03/19 10:14:34 http2: Framer 0xc000234000: read DATA flags=END_STREAM stream=1 len=5 data="\x00\x00\x00\x00\x00"
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote WINDOW_UPDATE len=4 (conn) incr=5
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote HEADERS flags=END_HEADERS stream=1 len=14
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote DATA stream=1 len=5 data="\x00\x00\x00\x00\x00"
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=24
I0319 10:14:34.910670 1788856 main.go:79] GracefulStop requested
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read WINDOW_UPDATE len=4 (conn) incr=5
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read HEADERS flags=END_HEADERS stream=1 len=14
2021/03/19 10:14:34 http2: decoded hpack field header field ":status" = "200"
2021/03/19 10:14:34 http2: decoded hpack field header field "content-type" = "application/grpc"
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read DATA stream=1 len=5 data="\x00\x00\x00\x00\x00"
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read HEADERS flags=END_STREAM|END_HEADERS stream=1 len=24
2021/03/19 10:14:34 http2: decoded hpack field header field "grpc-status" = "0"
2021/03/19 10:14:34 http2: decoded hpack field header field "grpc-message" = ""
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2021/03/19 10:14:34 http2: Framer 0xc000234000: wrote PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
I0319 10:14:34.910840 1788856 main.go:70] Client Stopped
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote WINDOW_UPDATE len=4 (conn) incr=5
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: read PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
I0319 10:14:34.910924 1788856 main.go:81] GracefulStop done
I0319 10:14:34.910932 1788856 main.go:73] Server Stopped

We can see here that once the client stopped, there is a PING write that is drained by a read PING. However, there is
a second ping write (2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a")
that is never read.

However, other times, the socket is fully drained as expected:

I0319 10:13:43.302538 1786062 main.go:70] Client Stopped
2021/03/19 10:13:43 http2: Framer 0xc0001fe000: read PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2021/03/19 10:13:43 http2: Framer 0xc0000b2000: read PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
I0319 10:13:43.302610 1786062 main.go:81] GracefulStop done
I0319 10:13:43.302620 1786062 main.go:73] Server Stopped

What did you expect to see?

  • We expect the channel to be fully drained so that no message is left on the socket once the GRPC server shutdown.

What did you see instead?

  • The http2debug logs with our additional annotations seems to demonstrate that there is a missing read to fully drain the socket.
@menghanl
Copy link
Contributor

there is a second ping write (2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a") that is never read.

From my understanding, this doesn't seem like the root cause. The pings are not meant to be "read".
"write" is just sending a ping, "read" is just receiving a ping. They don't have to be in pairs.
(Those PING flags=ACK are kind of in pair with the PING, because they are to acknowledge a ping. But even for those, "pairing" or "draining" is not mandatory)

I think there is a knowledge gap.
I'm not familiar enough with services to know how/when this service will be restarted. Can you explain a bit?

And it's not clear to me what socket activation means.

  • Does it mean "there are bytes remaining in the UDS file"?
  • Or "a new connection is created using this file"?
  • Or "any read/write to the UDS file"?

@didrocks
Copy link
Author

Thanks for reaching out. There is indeed a knowledge gap on both sides and of course, the fact that the PING was the cause is just theorical (really hard to imagine a test case for this).

With systemd socket activation the following happens (with Accept=no, which is the default setting):

  • systemd listens and creates the socket (definition in the .socket file) on behalf on the associated service (.service)
  • when a first connection arrives on the socket (or anything writes to it), it starts the associated service (with fork) and passes the socket file(s) descriptor(s) to it as environment variables.
  • the service can then accept() and treat any messages and any new connections directly. systemd is out of the loop as long as the service is running.
  • then, when the service stops, systemd takes back ownership of the socket and monitor new connections, ready to start a new instance the service when something transit on it (which is the part which is hard to know if this is only new connections or any bytes remaining on the UDS file).

We tried to monitor what’s on the socket, it’s clear that systemd sees "Incoming traffic" (see logs) which restarts erronously the service once it has stopped.
However, no new client instance is ever created, this is why we had the theory of leftover packets on it.

The most interesting part is that if we define FlushPending=yes in the .socket definition, the service is never restarted back (with 200 stop/start tests in loop). From man systemd.socket:

       FlushPending=
           Takes a boolean argument. May only be used when Accept=no. If yes,
           the socket's buffers are cleared after the triggered service
           exited. This causes any pending data to be flushed and any pending
           incoming connections to be rejected. If no, the socket's buffers
           won't be cleared, permitting the service to handle any pending
           connections after restart, which is the usually expected behaviour.
           Defaults to no.

So, there really seems to be some leftover on buffer on the UDS once the unique client requesting GracefulStop is leaving, which is what then makes the service restarting.

As it’s hard to remove systemd out of the loop, we took back the example as of bug #4272, which is https://github.com/ubuntu/grpcstop at revision c62a9fce7ee31ccd420ecf85e9000bf9751f7daa.

Here are the logs of both service and client:

Service:

[11963.075473] adclient04 systemd[1]: grpcstop.socket: Incoming traffic
[11963.075782] adclient04 systemd[1]: grpcstop.service: Trying to enqueue job grpcstop.service/start/replace
[11963.075856] adclient04 systemd[1]: Added job grpcstop.service/start to transaction.
[11963.075916] adclient04 systemd[1]: Pulling in system.slice/start from grpcstop.service/start
[11963.076085] adclient04 systemd[1]: Pulling in sysinit.target/start from grpcstop.service/start
[11963.094969] adclient04 systemd[1]: Pulling in grpcstop.socket/start from grpcstop.service/start
[11963.095077] adclient04 systemd[1]: Added job grpcstop.socket/start to transaction.
[11963.095178] adclient04 systemd[1]: Pulling in sysinit.target/start from grpcstop.socket/start
[11963.095276] adclient04 systemd[1]: Pulling in system.slice/start from grpcstop.socket/start
[11963.095376] adclient04 systemd[1]: Pulling in -.mount/start from grpcstop.socket/start
[11963.095472] adclient04 systemd[1]: Pulling in shutdown.target/stop from grpcstop.socket/start
[11963.095566] adclient04 systemd[1]: Pulling in shutdown.target/stop from grpcstop.service/start
[11963.097762] adclient04 systemd[1]: Found redundant job grpcstop.socket/start, dropping from transaction.
[11963.101110] adclient04 systemd[1]: grpcstop.service: Installed new job grpcstop.service/start as 30957
[11963.101188] adclient04 systemd[1]: grpcstop.service: Enqueued job grpcstop.service/start as 30957
[11963.101266] adclient04 systemd[1]: grpcstop.socket: Changed listening -> running
[11963.101347] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18657 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.101427] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18658 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.101502] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18659 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.101569] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18660 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.101740] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18657 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.101970] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18658 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102051] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18659 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102138] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18660 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102287] adclient04 systemd[1]: grpcstop.service: Passing 1 fds to service
[11963.102372] adclient04 systemd[1]: grpcstop.service: About to execute /tmp/grpcstop server -logtostderr -v=2
[11963.102457] adclient04 systemd[1]: grpcstop.service: Forked /tmp/grpcstop as 215364
[11963.102533] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18662 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102611] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18662 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102689] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18663 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102776] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18663 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.102861] adclient04 systemd[1]: grpcstop.service: Changed failed -> start
[11963.103112] adclient04 systemd[1]: Starting GRPCStop daemon service...
[11963.103200] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18665 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.103301] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18665 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.103415] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18666 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.103544] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18666 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.103867] adclient04 systemd[215364]: grpcstop.service: Executing: /tmp/grpcstop server -logtostderr -v=2
[11963.583886] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote SETTINGS len=6, settings: MAX_FRAME_SIZE=16384
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: read SETTINGS len=0
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote SETTINGS flags=ACK len=0
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: read SETTINGS flags=ACK len=0
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: read HEADERS flags=END_HEADERS stream=1 len=67
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field ":method" = "POST"
[11963.584288] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field ":scheme" = "http"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field ":path" = "/GrpcStop/Stop"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field ":authority" = "/tmp/grpcstop.socket"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field "content-type" = "application/grpc"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field "user-agent" = "grpc-go/1.36.0"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: decoded hpack field header field "te" = "trailers"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: read DATA flags=END_STREAM stream=1 len=5 data="\x00\x00\x00\x00\x00"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote WINDOW_UPDATE len=4 (conn) incr=5
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote HEADERS flags=END_HEADERS stream=1 len=14
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote DATA stream=1 len=5 data="\x00\x00\x00\x00\x00"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=24
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: wrote PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
[11963.585009] adclient04 grpcstop[215364]: 2021/03/22 14:51:56 http2: Framer 0xc0001ba000: read PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
[11963.585796] adclient04 systemd[1]: Received SIGCHLD from PID 215364 (grpcstop).
[11963.585915] adclient04 systemd[1]: Child 215364 (grpcstop) died (code=exited, status=0/SUCCESS)
[11963.585972] adclient04 systemd[1]: grpcstop.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
[11963.586052] adclient04 systemd[1]: grpcstop.service: Child 215364 belongs to grpcstop.service.
[11963.586093] adclient04 systemd[1]: grpcstop.service: Main process exited, code=exited, status=0/SUCCESS
[11963.586148] adclient04 systemd[1]: grpcstop.service: Succeeded.
[11963.586190] adclient04 systemd[1]: grpcstop.service: Service will not restart (restart setting)
[11963.586232] adclient04 systemd[1]: grpcstop.service: Changed start -> dead
[11963.586311] adclient04 systemd[1]: grpcstop.service: Job 30957 grpcstop.service/start finished, result=done
[11963.586362] adclient04 systemd[1]: Finished GRPCStop daemon service.
[11963.586406] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18668 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.586474] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18669 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
^---- Here service process (PID 215364) stopped
v---- Then restarts immediately (PID 215378) due to incoming traffic
[11963.586550] adclient04 systemd[1]: grpcstop.socket: Changed running -> listening
[11963.586595] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18671 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.586633] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18672 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.586673] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18673 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.586708] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18674 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.586793] adclient04 systemd[1]: grpcstop.socket: Incoming traffic
[11963.586831] adclient04 systemd[1]: grpcstop.service: Trying to enqueue job grpcstop.service/start/replace
[11963.586870] adclient04 systemd[1]: Added job grpcstop.service/start to transaction.
[11963.586912] adclient04 systemd[1]: Pulling in system.slice/start from grpcstop.service/start
[11963.587055] adclient04 systemd[1]: Pulling in sysinit.target/start from grpcstop.service/start
[11963.602388] adclient04 systemd[1]: Pulling in grpcstop.socket/start from grpcstop.service/start
[11963.602466] adclient04 systemd[1]: Added job grpcstop.socket/start to transaction.
[11963.602546] adclient04 systemd[1]: Pulling in sysinit.target/start from grpcstop.socket/start
[11963.602640] adclient04 systemd[1]: Pulling in system.slice/start from grpcstop.socket/start
[11963.602715] adclient04 systemd[1]: Pulling in -.mount/start from grpcstop.socket/start
[11963.602786] adclient04 systemd[1]: Pulling in shutdown.target/stop from grpcstop.socket/start
[11963.602867] adclient04 systemd[1]: Pulling in shutdown.target/stop from grpcstop.service/start
[11963.605373] adclient04 systemd[1]: Found redundant job grpcstop.socket/start, dropping from transaction.
[11963.609256] adclient04 systemd[1]: grpcstop.service: Installed new job grpcstop.service/start as 31028
[11963.609344] adclient04 systemd[1]: grpcstop.service: Enqueued job grpcstop.service/start as 31028
[11963.609430] adclient04 systemd[1]: grpcstop.socket: Changed listening -> running
[11963.609515] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18675 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.609608] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18676 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.609688] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18677 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.609767] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18678 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610005] adclient04 systemd[1]: grpcstop.service: Passing 1 fds to service
[11963.610088] adclient04 systemd[1]: grpcstop.service: About to execute /tmp/grpcstop server -logtostderr -v=2
[11963.610167] adclient04 systemd[1]: grpcstop.service: Forked /tmp/grpcstop as 215378
[11963.610252] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18680 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610334] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18681 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610405] adclient04 systemd[1]: grpcstop.service: Changed dead -> start
[11963.610491] adclient04 systemd[1]: Starting GRPCStop daemon service...
[11963.610578] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18682 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610661] adclient04 systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18683 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610816] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18668 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.610953] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18669 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611119] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18671 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611190] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18672 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611268] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18673 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611346] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18674 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611418] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18675 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611610] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18676 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611733] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18677 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611813] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18678 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611893] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18680 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.611965] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18681 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.612041] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18682 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.612122] adclient04 systemd-logind[705]: Got message type=signal sender=:1.8 destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=18683 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
[11963.612300] adclient04 systemd[215378]: grpcstop.service: Executing: /tmp/grpcstop server -logtostderr -v=2
[11964.095700] adclient04 grpcstop[215378]: 2021/03/22 14:51:56 http2: Framer 0xc000166000: wrote SETTINGS len=6, settings: MAX_FRAME_SIZE=16384
[11964.095700] adclient04 grpcstop[215378]: 2021/03/22 14:51:56 http2: Framer 0xc000166000: read SETTINGS len=0
[11964.095774] adclient04 grpcstop[215378]: INFO: 2021/03/22 14:51:56 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
^---- it seems that there is still a closing message on the socket once the service is listening back

Client logs:

# /tmp/grpcstop stop
INFO: 2021/03/22 14:51:55 [core] parsed scheme: ""
INFO: 2021/03/22 14:51:55 [core] scheme "" not registered, fallback to default scheme
INFO: 2021/03/22 14:51:55 [core] ccResolverWrapper: sending update to cc: {[{/tmp/grpcstop.socket  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2021/03/22 14:51:55 [core] ClientConn switching balancer to "pick_first"
INFO: 2021/03/22 14:51:55 [core] Channel switches to new LB policy "pick_first"
INFO: 2021/03/22 14:51:55 [core] Subchannel Connectivity change to CONNECTING
INFO: 2021/03/22 14:51:55 [core] Subchannel picks a new address "/tmp/grpcstop.socket" to connect
INFO: 2021/03/22 14:51:55 [core] pickfirstBalancer: UpdateSubConnState: 0xc0000321e0, {CONNECTING <nil>}
INFO: 2021/03/22 14:51:55 [core] Channel Connectivity change to CONNECTING
INFO: 2021/03/22 14:51:56 [core] Subchannel Connectivity change to READY
INFO: 2021/03/22 14:51:56 [core] pickfirstBalancer: UpdateSubConnState: 0xc0000321e0, {READY <nil>}
INFO: 2021/03/22 14:51:56 [core] Channel Connectivity change to READY
## Connected, calling STOP
INFO: 2021/03/22 14:51:56 [core] Subchannel Connectivity change to CONNECTING
INFO: 2021/03/22 14:51:56 [core] Subchannel picks a new address "/tmp/grpcstop.socket" to connect
INFO: 2021/03/22 14:51:56 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
## Stop returned, shutting down
INFO: 2021/03/22 14:51:56 [core] Channel Connectivity change to SHUTDOWN
INFO: 2021/03/22 14:51:56 [core] Subchannel Connectivity change to SHUTDOWN

Hope this helps, if you need any additional logs, we are happy to provide them.

@menghanl
Copy link
Contributor

menghanl commented Mar 22, 2021

What I think happened

When the client receives a GOAWAY, it starts a new TCP connection to the same address (because the GOAWAY connection cannot be used for new RPCs).
Server's GracefulStop() also closes the listener, so this new connection will usually fail.

But this connection may arrive at the socket after the old server exits, triggering Incoming traffic.
(This is a race, and we have no control of the timing. This is probably also why you don't see the restarting behavior 100% of the time).

Something to try

The key is to stop the client from starting new connections after client stop.
One easy way is to intercept the dialer, and return early. Something like

var stopped uint32
grpc.Dial(...,
    grpc.WithContextDialer(func(ctx context.Context, addr string) (conn net.Conn, err error) {
		if atomic.LoadInt32(&stopped) != 0 {
			// Client was stopped
			return nil, errors.New("stopped")
		}
		// normal net.DialUnix()
	}),
)


switch os.Args[1] {
...
case "stop":
	atomic.StoreInt32(&stopped, 1) // Set 1 to stop the dialer.
	// normal c.Stop(context.Background(), &Empty{})
...
}

Please try this and see if it works.

@didrocks
Copy link
Author

Spot on! There are indeed 2 connections happening, the second being just after Stop() is called. So, that would explain why there are sometimes some unwanted connections on the socket triggering this activation.

However, as you said, this is racy:

  • sometimes, the client can close pretty quickly before receiving this second connection, and consequently, the GOAWAY message connection is left on the socket which will retrigger the service to restart.
  • in the general case, we have multiple clients connected to the daemon. Some are doing heavy processing and will finish their executions after the Stop() client will be called and ends. The original idea was to gracefully waits on all connections to be done (hence GracefulStop()) and then shutting down the daemon. However, those clients will receive the GOAWAY secondary connection and as they won’t know the Stop() command was requested.

didrocks added a commit to ubuntu/adsys that referenced this issue Mar 25, 2021
When stopping the daemon, the GOAWAY message is sent to connected
clients, which is not necessarily consumed by them, leaving a new connection
on the socket, which retriggers a restart from systemd.
Upstream discussion is at grpc/grpc-go#4282.

Co-authored-by: Jean-Baptiste Lallement <jean-baptiste@ubuntu.com>
@menghanl
Copy link
Contributor

menghanl commented Apr 7, 2021

Sorry for the late reply.
The root cause of this issue comes down to the missing support of IDLE in gRPC-Go: #4298
The client should not reconnect immediately after the GOAWAY (should wait for the next RPC)

Before that fix is ready, another option to consider is to use an RPC to control the client's Dial.
Same as the stopped approach above, but this time, keep a streaming RPC between all the clients and the server, and when server is shutting down, it will send a message to all the clients, to tell them to set stopped to true, which will stop future connections. Then the server can safely call GracefulStop().

I understand this is less then ideal, and will make the application unnecessarily complex.
Let me know what you think.

@didrocks
Copy link
Author

Indeed, that looks like related to the other bug. Thanks for the investigation @menghanl. The workaround is quite hackish though (and suppose all clients implements this "stop don’t redial" hack, which is fine for us, but less than ideal in a situation with an opened API).

Right now, we are going with the FlushPending=yes in the systemd .socket definition, which doesn’t require any additional code. We may lose a connection if the connect is exactly when the daemon stops, but in our case, this is not that blocking and we can implement some retry logic.

Looking foward to see bug #4298 fixed, and report back if things are indeed fixed this way. Thanks again!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants