Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on start gRPC server #6601

Closed
avgustinmm opened this issue Jan 14, 2020 · 2 comments
Closed

Deadlock on start gRPC server #6601

avgustinmm opened this issue Jan 14, 2020 · 2 comments
Labels
Milestone

Comments

@avgustinmm
Copy link

@avgustinmm avgustinmm commented Jan 14, 2020

What version of gRPC-Java are you using?

1.26.0

What is your environment?

Linux, Alpine
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (IcedTea 3.8.0) (Alpine 8.171.11-r0)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
Netty - 4.1.44.Final
Vertx & Vertx-gRPC - 3.8.4

What did you expect to see?

Proper start of gRPC Server

What did you see instead?

Start sometimes hangs with deadlock

Steps to reproduce the bug

int_threaddump.txt

I suppose it's a race condition related to synchronization in gRPC (ServerImpl), await in NetServer.start and vertx/nettty event loops (probably single threaded). Probably it could happened at any time if someone start gRPC server and concurrently open new client connection to that server.
In my case I stopped and started the gGPR server but I'm not sure if it is somehow related.

Analysis

What I see in the thread dump is the following 2 threads that stays in that state, seems, forever:

"vert.x-eventloop-thread-0" #39 prio=10 os_prio=0 tid=0x000055711e379000 nid=0x2d waiting for monitor entry [0x00007fb72abc8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at io.grpc.internal.ServerImpl$ServerListenerImpl.transportCreated(ServerImpl.java:379)
- waiting to lock <0x00000000c559f1a0> (a java.lang.Object)
at io.grpc.netty.NettyServer$1.initChannel(NettyServer.java:224)
- locked <0x00000000c559bfd8> (a io.grpc.netty.NettyServer)
at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:129)

"vert.x-worker-thread-12" #214 prio=10 os_prio=0 tid=0x000055711f2d1800 nid=0x418 in Object.wait() [0x00007fb720656000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:252)
- locked <0x00000000c589baa0> (a io.netty.util.concurrent.PromiseTask)
at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:35)
at io.grpc.netty.NettyServer.start(NettyServer.java:269)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
- locked <0x00000000c559f1a0> (a java.lang.Object)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)

From what I see in these thread dumps and the code I think that this could be the problem (deadlock):

  1. Vertx grpc starts server (ServerImpl.start) in vertx blocking thread
  2. ServerImpl synchronize on lock and then try (keeping lock) to start server (NetServer.start)
  3. NetServer.start opens a channel, binds to it, and since that moment it, I assume, may receive connections from remote clients
  4. It seems, at this time a remote client opens connection to this server (already bound)
  5. Then in channel's event loop (probably single threaded) is received initChannel which try to get ServerImpl.lock in ServerListenerImpl.transportCreated (coudln't because got by ServerImpl.start)
  6. NetServer.start then schedules runnable in channel's event loop and blocks with channelzFuture.await()
  7. Now, channelzFuture.await() waits for a runnable to be executed in channel's event loop (probably single threaded)
  8. At this point channelzFuture.await keeps ServerImpl.lock lock, while the ServerListenerImpl.transportCreated occupies/blocks (this is what I suppose) the single threaded channel's event loop thus making impossible to process further

I'm attaching file with thread dumps of the whole JVM

@avgustinmm avgustinmm added the bug label Jan 14, 2020
@dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Jan 15, 2020

Thank you @avgustinmm for the full thread dump. It seems it's a valid race.

@dapengzhang0 dapengzhang0 added this to the 1.27 milestone Jan 15, 2020
dapengzhang0 added a commit that referenced this issue Jan 16, 2020
A race condition was reported by user in #6601:

`ServerImpl.start()` calls `NettyServer.start()` while holding `ServerImpl.lock`. `NettyServer.start()` awaits a submitted runnable in eventloop. However, this pending runnable may never be executed because the eventloop might be executing some other task, like `ServerListenerImpl.transportCreated()`, that is trying to acquire `ServerImpl.lock` causing a deadlock.

This PR resolves the particular issue reported in #6601 for server with a single port, but `NettyServer` (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) and `ServerImpl` (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) in general still have the same potential risk of deadlock, which need further fix.
dapengzhang0 added a commit to dapengzhang0/grpc-java that referenced this issue Jan 22, 2020
A race condition was reported by user in grpc#6601:

`ServerImpl.start()` calls `NettyServer.start()` while holding `ServerImpl.lock`. `NettyServer.start()` awaits a submitted runnable in eventloop. However, this pending runnable may never be executed because the eventloop might be executing some other task, like `ServerListenerImpl.transportCreated()`, that is trying to acquire `ServerImpl.lock` causing a deadlock.

This PR resolves the particular issue reported in grpc#6601 for server with a single port, but `NettyServer` (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) and `ServerImpl` (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) in general still have the same potential risk of deadlock, which need further fix.
@dapengzhang0 dapengzhang0 removed this from the 1.27 milestone Jan 22, 2020
@dapengzhang0 dapengzhang0 added this to the 1.28 milestone Jan 22, 2020
dapengzhang0 added a commit that referenced this issue Jan 22, 2020
A race condition was reported by user in #6601:

`ServerImpl.start()` calls `NettyServer.start()` while holding `ServerImpl.lock`. `NettyServer.start()` awaits a submitted runnable in eventloop. However, this pending runnable may never be executed because the eventloop might be executing some other task, like `ServerListenerImpl.transportCreated()`, that is trying to acquire `ServerImpl.lock` causing a deadlock.

This PR resolves the particular issue reported in #6601 for server with a single port, but `NettyServer` (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) and `ServerImpl` (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) in general still have the same potential risk of deadlock, which need further fix.
@dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Jan 24, 2020

Fixed by #6610. Filed #6641 to track the deadlock for multiple-port server transport.

@dapengzhang0 dapengzhang0 removed this from the 1.28 milestone Jan 24, 2020
@dapengzhang0 dapengzhang0 added this to the 1.27 milestone Jan 24, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Apr 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants