Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leak in select for a UNIXSocket (at least) #3588

Closed
headius opened this issue Jan 8, 2016 · 5 comments
Closed

Leak in select for a UNIXSocket (at least) #3588

headius opened this issue Jan 8, 2016 · 5 comments

Comments

@headius
Copy link
Member

headius commented Jan 8, 2016

The following code leaks like a sieve...up to 1.5GB of total memory in under a minute for me:

s = UNIXServer.new("foofoo")
Thread.new {
  loop {
    c1 = s.accept
    IO.select([c1])
    c1.read(5)
    c1.write("bubye")
    c1.close
  }
}

loop {
  c = UNIXSocket.new("foofoo")
  c.write("hello")
  IO.select([c])
  c.read(5)
  c.close
  print "."
}

It did not leak without the IO.select calls, so I suspect that something in jnr-enxio's implementation of selection there's native memory leaking.

@headius
Copy link
Member Author

headius commented Jan 8, 2016

This does not appear to affect 1.7, so it may actually be something we're doing wrong in JRuby.

@headius
Copy link
Member Author

headius commented Jan 20, 2016

This appears to be in jnr-enxio, as it affects selects against stdio too. Note I have only tested on OS X, which uses kqueue; Linux uses epoll and has a separate backend in jnr-enxio.

@headius
Copy link
Member Author

headius commented Jan 20, 2016

Ok, there's a few problems here, but I'm not sure which is the trigger.

  • ENXIO selectors are not pooled, mostly because we can't mix native selectors and JDK selectors when selecting for multiple heterogeneous channels. As a result, SelectorExector has complicated logic to open new, attach via pipe, and eventually shut down ENXIO selectors on the way into and out of IO.select.
  • The selector itself (KQueueSelector) allocates buffers for the kqueue call, but does not directly deallocate those same buffers. So at the very least it is relying on finalization to clear away dead memory pointers, which is unreliable. At worst, it is walking away from allocated memory.

The first point could be remedied by improving SelectorExecutor to just use the ENXIO selector when all the channels are ENXIO channels, rather than always using this complicated pipe-based connection between ENXIO and a native selector. In that case we could also pool the ENXIO selector. It may be possible to pool the ENXIO selector even with the pipe connection, but we'd still need to clean up that pipe.

The second point I believe I can fix by making ENXIO actively clean up those pointers when you close the selector. This seems to be the lowest impact right now, so I'll look into that direction.

@headius
Copy link
Member Author

headius commented Jan 20, 2016

Huzzah! I found a simple fix.

The way we're using SelectorPool is wrong. When requesting a Selector from SelectorPool.get, it assumes you're going to be pooling that selector with a later call to SelectorPool.put. Basically, it takes ownership of the selector's lifecycle. And to do that, it keeps a reference to the selector as an "open" or checked-out selector that will need to be cleaned up at some point.

Unfortunately in the ENXIO case, we only use SelectorPool.get as a shortcut for channel.provider().openSelector(), and never put it back. This means that SelectorPool holds a hard reference to the selector in its openSelectors list, and as a result, the selector never finalizes. Because it never finalizes, it never deallocates the memory associated with the aforementioned buffers.

Modifying our SelectorExecutor logic to directly open ENXIO selectors rather than going through SelectorPool appears to fix the issue. On a fast select loop, we're still relying on finalization to clean up buffers, but when GC/finalization eventually runs we do go back down to a base memory size. A simple select(timeout = 0.001) loop that previously grew without bounds now hovers between 190 and 300MB, depending on the timing of GC.

Fix coming. We still need to rework SelectExecutor (again) and accommodate the simple homogeneous cases, so we can pool ENXIO selectors. We use ENXIO a lot more in 9k.

@headius
Copy link
Member Author

headius commented Jan 20, 2016

Note that 1.7 is not affected because its IO.select implementation does not attempt to pool any selectors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant