New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

Closed
camlow325 opened this Issue Nov 13, 2017 · 4 comments

Comments

Projects
None yet
2 participants
@camlow325
Contributor

camlow325 commented Nov 13, 2017

Environment

Provide at least:

  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)
    9.1.13.0, 9.1.14.0, maybe others?

  • Operating system and platform (e.g. uname -a)
    Windows 10, x86_64

Expected Behavior

  • Describe your expectation of how JRuby should behave, perhaps by showing how CRuby/MRI behaves.

When a client interleaves blocking calls to read data and write data to the same TCP socket from different threads, the client is able to eventually complete the read and write calls with no errors. This is what I've seen with the script below with MRI Ruby 2.4.2, running the script indefinitely. On JRuby 9k, however, the process deadlocks after a few seconds, never seeming to recover.

  • Provide an executable Ruby script or a link to an example repository.

Here's the script I've used to reproduce this problem. I just ran jruby.exe .\socket_sample.rb to launch the script. The script opens a server socket and accepts a single connection from a client. The script performs interleaved read/write calls on the client side of the socket (with no synchronization). On the server side, reads and writes are performed as well, but all from a single thread.

require 'socket'

server = TCPServer.open(50000)

Thread.start(server) do |server|
  Thread.current.name = "ServerSocket"
  client = server.accept
  server_loops = 1
  while true
    begin
      client.read_nonblock(20000)
      server_write = "Server write #{server_loops}"
      puts(server_write)
      client.puts(server_write)
      server_loops += 1
      sleep(0.1)
    rescue IO::WaitReadable
      puts("Server: waiting for data to read...")
      IO.select([client])
    end
  end
end

client = TCPSocket.open('localhost', 50000)

Thread.start(client) do |client|
  Thread.current.name = "ClientReader"
  client_read_loops = 1
  while true
    data_read = client.gets
    puts("Client received #{client_read_loops}: #{data_read}")
    client_read_loops += 1
  end
end

Thread.current.name = "ClientWriter"
client_write_loops = 1
while true
  client_write = "Client write #{client_write_loops}"
  if client_write_loops % 10000 == 0
    puts(client_write)
  end
  client.puts(client_write)
  client_write_loops += 1
end

Actual Behavior

  • Describe or show the actual behavior.

Running with MRI Ruby 2.4.2, the script appears to run fine indefinitely, with messages like the following in the output:

Client received 901: Server write 901
Client write 950000
Server write 902
Client received 902: Server write 902
Server write 903
Client received 903: Server write 903

Running with JRuby 9.1.14.0, however, the client read/write threads appear to deadlock one another after only a few seconds of running. The console output looks like the following:

Client write 50000
Server write 9
Client received 9: Server write 9
Server write 10
Server write 11
Server write 12
...
Server write 54
Server: waiting for data to read...

Running jstack against the Java process for a JRuby 9.1.14.0 run after the 'waiting for data to read...' message above, I see the following:

Found one Java-level deadlock:
=============================
"Ruby-0-Thread-2@ClientReader: socket_sample.rb:1":
  waiting for ownable synchronizer 0x00000000fa52ca88, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "main"
"main":
  waiting to lock monitor 0x00000000034e9bf8 (object 0x00000000fa52c920, a java.lang.Object),
  which is held by "Ruby-0-Thread-2@ClientReader: socket_sample.rb:1"

Java stack information for the threads listed above:
===================================================
"Ruby-0-Thread-2@ClientReader: socket_sample.rb:1":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000fa52ca88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.jruby.util.io.OpenFile.lock(OpenFile.java:2672)
        at org.jruby.util.io.OpenFile.removeBlockingThread(OpenFile.java:2606)
        - locked <0x00000000fa528df8> (a org.jruby.util.io.OpenFile)
        at org.jruby.RubyThread.select(RubyThread.java:1811)
        - locked <0x00000000fa52c920> (a java.lang.Object)
        at org.jruby.RubyThread.select(RubyThread.java:1701)
        at org.jruby.util.io.OpenFile.readInternal(OpenFile.java:1368)
        at org.jruby.util.io.OpenFile.fillbuf(OpenFile.java:1264)
        at org.jruby.util.io.OpenFile.getlineFast(OpenFile.java:1594)
        at org.jruby.RubyIO.getlineInner(RubyIO.java:725)
        at org.jruby.RubyIO.getline(RubyIO.java:695)
        at org.jruby.RubyIO.getline(RubyIO.java:683)
        at org.jruby.RubyIO.gets(RubyIO.java:2233)
        at org.jruby.RubyIO$INVOKER$i$gets.call(RubyIO$INVOKER$i$gets.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:125)
        at C_3a_.socket_sample.invokeOther21:gets(C:\socket_sample.rb:30)
        at C_3a_.socket_sample.RUBY$block$\=C\-socket_sample\,rb$1(C:\socket_sample.rb:30)
        at java.lang.invoke.LambdaForm$DMH/468121027.invokeStatic_L8_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/1537458459.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:145)
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:71)
        at org.jruby.runtime.Block.call(Block.java:124)
        at org.jruby.RubyProc.call(RubyProc.java:289)
        at org.jruby.RubyProc.call(RubyProc.java:246)
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:104)
        at java.lang.Thread.run(Thread.java:748)
"main":
        at org.jruby.RubyThread.select(RubyThread.java:1757)
        - waiting to lock <0x00000000fa52c920> (a java.lang.Object)
        at org.jruby.util.io.OpenFile.ready(OpenFile.java:511)
        at org.jruby.util.io.OpenFile.waitWritable(OpenFile.java:452)
        at org.jruby.util.io.OpenFile.waitWritable(OpenFile.java:464)
        at org.jruby.util.io.OpenFile.binwrite(OpenFile.java:2200)
        at org.jruby.util.io.OpenFile.fwrite(OpenFile.java:2056)
        at org.jruby.RubyIO.write(RubyIO.java:1456)
        at org.jruby.RubyIO.write(RubyIO.java:1428)
        at org.jruby.RubyIO$INVOKER$i$1$0$write.call(RubyIO$INVOKER$i$1$0$write.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:153)
        at org.jruby.RubyIO.write(RubyIO.java:2584)
        at org.jruby.RubyIO.putsSingle(RubyIO.java:2562)
        at org.jruby.RubyIO.puts1(RubyIO.java:2478)
        at org.jruby.RubyIO.puts(RubyIO.java:2451)
        at org.jruby.RubyIO$INVOKER$i$puts.call(RubyIO$INVOKER$i$puts.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:153)
        at C_3a_.Users.socket_sample.invokeOther47:puts(C:\socket_sample.rb:43)
        at C_3a_.Users.socket_sample.RUBY$script(C:\socket_sample.rb:43)
        at java.lang.invoke.LambdaForm$DMH/1450495309.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$BMH/242355057.reinvoke(LambdaForm$BMH)
        at java.lang.invoke.LambdaForm$MH/171497379.invoker(LambdaForm$MH)
        at java.lang.invoke.LambdaForm$MH/434176574.invokeExact_MT(LambdaForm$MH)
        at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
        at org.jruby.ir.Compiler$1.load(Compiler.java:95)
        at org.jruby.Ruby.runScript(Ruby.java:828)
        at org.jruby.Ruby.runNormally(Ruby.java:747)
        at org.jruby.Ruby.runNormally(Ruby.java:765)
        at org.jruby.Ruby.runFromMain(Ruby.java:578)
        at org.jruby.Main.doRunFromMain(Main.java:417)
        at org.jruby.Main.internalRun(Main.java:305)
        at org.jruby.Main.run(Main.java:232)
        at org.jruby.Main.main(Main.java:204)

Found 1 deadlock.

The reader thread has taken a lock on the channel within the RubyThread.select method but is waiting to take a lock (which the writer holds) within the OpenFile.lock() method. The writer thread has taken the lock held by the OpenFile object but is waiting to take the lock on the channel (which the reader holds) within the RubyThread.select method.

I've seen this occur not just in cases where the writing thread encounters a waitWritable condition but also in cases where the writer thread is blocked on a call to the synchronized OpenFile.unread method, which cannot proceed because the reader thread is in the middle of executing the synchronized OpenFile.removeBlockingThread method. In this case, the reader thread is unable to obtain the reentrant lock within OpenFile, though, because the writer thread already holds it - leading to another form of deadlock. I haven't yet been able to get a more narrow code sample to reproduce this second issue as frequently, though.

If a mutex is synchronized around the client read and write calls, the deadlocks no longer occur (even for JRuby):

require 'thread'
...
socket_mutex = Mutex.new
...
  socket_mutex.synchronize do
    data_read = client.gets
  end
...
  socket_mutex.synchronize do
    client.puts(client_write)
  end

This seems like a reasonable workaround to the problem, although maybe with some performance implications. It may be that MRI Ruby is more immune to this kind of problem because of the GIL, not sure?

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Nov 14, 2017

Member

I suspect this is bad lock management by the socket subsystem. Most operations in IO have been audited to lock and unlock all appropriate locks, in the same order all the time. Socket has not had such a touch recently.

Diagnosis: Straight-up deadlock in the OpenFile read/write lock, probably due to improper locking in socket library.

Member

headius commented Nov 14, 2017

I suspect this is bad lock management by the socket subsystem. Most operations in IO have been audited to lock and unlock all appropriate locks, in the same order all the time. Socket has not had such a touch recently.

Diagnosis: Straight-up deadlock in the OpenFile read/write lock, probably due to improper locking in socket library.

@headius headius added this to the JRuby 9.1.15.0 milestone Nov 14, 2017

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Nov 14, 2017

Member

Workaround for you for now would be to lock around these IO objects. It may be enough to require 'jruby/synchronized' and then include JRuby::Synchronized into whatever classes or objects you're using.

Member

headius commented Nov 14, 2017

Workaround for you for now would be to lock around these IO objects. It may be enough to require 'jruby/synchronized' and then include JRuby::Synchronized into whatever classes or objects you're using.

headius added a commit that referenced this issue Dec 5, 2017

Avoid deadlocks in IO by only locking one thing. Fixes #4854
There are two changes here, with the first the more likely cause
of #4854:

1. While leaving the OpenFile lock locked, we proceeded to
   attempt to lock the Selector being selected. If the order of
   these locks happens differently elsewhere, it will cause a
   deadlock. Fixed by unlocking the OpenFile lock before locking
   the Selector lock.
2. OpenFile.removeBlockingThread synchronized on both OpenFile's
   lock and the OpenFile instance itself. I could find only a
   handful of other places that lock OpenFile and they do not
   appear to lock anything else, but there's no reason to lock
   OpenFile here anyway.
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 5, 2017

Member

Easy enough problem to find: we were locking two locks that might get locked in a different order elsewhere. I made sure to unlock one them first, so it should at least avoid the deadlock you found.

Member

headius commented Dec 5, 2017

Easy enough problem to find: we were locking two locks that might get locked in a different order elsewhere. I made sure to unlock one them first, so it should at least avoid the deadlock you found.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 5, 2017

Member

I am not sure how best to create a test for this, given the complexity of the original script and the difficulty of testing for deadlocks in CI. If you have ideas, please feel free to submit a PR against spec/ruby or test/jruby.

Member

headius commented Dec 5, 2017

I am not sure how best to create a test for this, given the complexity of the original script and the difficulty of testing for deadlocks in CI. If you have ideas, please feel free to submit a PR against spec/ruby or test/jruby.

@headius headius closed this Dec 5, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment