Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File descriptor leak of type 'PIPE' when 'net/http' used to do a GET #2326

Closed
jsgoecke opened this issue Dec 15, 2014 · 15 comments
Closed

File descriptor leak of type 'PIPE' when 'net/http' used to do a GET #2326

jsgoecke opened this issue Dec 15, 2014 · 15 comments
Assignees
Milestone

Comments

@jsgoecke
Copy link

@jsgoecke jsgoecke commented Dec 15, 2014

It appears that under OSX when I open a socket and it then closes (either through a :read_timeout, or sccuessful read) it is leaking a PIPE file descriptor. When done on Linux it does both 'PIPE' and 'eventpoll'.

I am doing something like this in JRuby-1.7.17:

require 'net/http'
require 'uri'

uri = URI.parse("http://google.com/")
response = Net::HTTP.get_response(uri)

And then I get this before running the script:

lsof | grep ruby | grep PIPE
server-7:~ jsgoecke$ 

And then after running the script:

lsof | grep java | grep PIPE
java      31764 jsgoecke   17     PIPE 0x89af0ee9adc4b601      16384          ->0x89af0ee99b775441
java      31764 jsgoecke   18     PIPE 0x89af0ee99b775441      16384          ->0x89af0ee9adc4b601

I did find this issue:

#221

And tried running the Java GC manually with:

require 'jruby'

JRuby.gc

But the FDs seem to hang around.

@headius
Copy link
Member

@headius headius commented Dec 16, 2014

So the likely candidate for the pipe+eventpoll descriptor is the selection logic done for most IO operations. "Select" at the JVM level uses whatever's appropriate for the platform, e.g. kqueue or epoll, and then it also injects a pipe into the selected descriptors as a wake-up mechanism. That's these two descriptors.

If they are lingering it means some IO operation is opening a new selector and not closing it. That may be a leak...but we do also cache these selectors to avoid the overhead of closing and reopening them for every read or write.

I suspect you are seeing the cached selectors. If that's not the case, then you should see the number of file descriptors explode as you perform more IO operations (e.g. your script in a loop). The number of cached selector FD pairs should be roughly the same as the number of threads you have performing IO operations.

Confirm?

@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 16, 2014

I believe you are correct, that these are the cached descriptors. In the case of a single threaded application, I see two sets cached.

But, if I use multiple threads, I see each thread appears to get its own cache and then those linger. So if I have a script that uses three threads that each do a GET, I now have 6 descriptors (3 sets, 1 per thread) lingering, even though my threads have died.

So, for example this:

require 'net/http'
require 'uri'
require 'thread'
require 'jruby'

def get_data
  uri = URI.parse("http://google.com/")
  response = Net::HTTP.get_response(uri)
end

threads = []

3.times {
  threads << Thread.new {
    get_data
  }
}

threads.each { |t|
  t.join
}

JRuby.gc
p "Done check your lsof!"
sleep 1000

Results in this:

lsof | grep PIPE | grep java
java      42045 jsgoecke   17     PIPE 0x89af0ee99b773081      16384          ->0x89af0ee99b775391
java      42045 jsgoecke   18     PIPE 0x89af0ee99b775391      16384          ->0x89af0ee99b773081
java      42045 jsgoecke   20     PIPE 0x89af0ee99b772791      16384          ->0x89af0ee99b772e71
java      42045 jsgoecke   21     PIPE 0x89af0ee99b772e71      16384          ->0x89af0ee99b772791
java      42045 jsgoecke   23     PIPE 0x89af0ee99b7755a1      16384          ->0x89af0ee9adc4a4d1
java      42045 jsgoecke   24     PIPE 0x89af0ee9adc4a4d1      16384          ->0x89af0ee99b7755a1

And if I do 4 threads, then I get 8 and so on. This would appear to be a leak since they are no longer associated to a live thread and a forced GC, or even GCs over time (as I have observed on production systems) do not seem to clean them up.

@johntdyer
Copy link

@johntdyer johntdyer commented Dec 16, 2014

Is this related?

@headius
Copy link
Member

@headius headius commented Dec 16, 2014

The cached selectors are not per-thread...they are just a simple cache. Threads terminating would not eliminate unused selectors, nor should we have a selector per thread (which could lead to thousands of live-but-idle selectors). You're just seeing the effect of those N threads needing M concurrent selectors.

Now, perhaps they should be cleaned up, but it's not a leak in the sense of unbounded growth; the number of cached selectors will only grow to the level of concurrent select operations happening.

Cleaning them up would require modifying our selector cache to have some sort of weaker reference, probably a soft reference, so that periodically old selectors would be GCed and closed.

@headius
Copy link
Member

@headius headius commented Dec 16, 2014

@johntdyer Yes, it's the cache of which I speak. We have never heard about any problems caused by hard-caching the selectors, so we never took the next step to make them clean up after a certain amount of time.

@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 16, 2014

Our use case is a bit different for JRuby/JVM. In that we keep a ScriptContainer warm and allow additional scripts to be eval'd within it. If they use threads that do I/O, then of course those linger between evals and create a file descriptor leak.

Although this would also occur for any developer using short lived I/O threads in JRuby. Thread pools are always better of course, but there are certain contexts where that may not apply.

@headius
Copy link
Member

@headius headius commented Dec 16, 2014

The pooling of selectors is not done per-thread...it is just a list in memory from which all threads checkin/checkout selectors. If you send 100 threads into that container but none of them run at the same time, then you should only see a couple (or one) selectors cached. If that's not a case, then it would indeed be a bug.

The number of cached, open selectors at any given time should reflect the number of concurrent IO operations that have needed selectors at the same time.

I'm exploring what would be required to make these selectors lazily tidy up, but I am still not sure I've heard evidence that this is really a "leak".

@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 16, 2014

Indeed, if you launch the threads sequentially it does only produce one set of cached descriptors:

require 'net/http'
require 'uri'
require 'thread'
require 'jruby'

def get_data
  uri = URI.parse("http://google.com/")
  response = Net::HTTP.get_response(uri)
end

3.times {
  thread = Thread.new {
    get_data
  }
  thread.join
}

JRuby.gc
p "Done check your lsof!"
sleep 1000

lsof output:

lsof | grep java | grep PIPE
java      44933 jsgoecke   15     PIPE 0x89af0ee99b7745d1      16384          ->0x89af0ee9abeada81
java      44933 jsgoecke   16     PIPE 0x89af0ee9abeada81      16384          ->0x89af0ee99b7745d1

But it is reasonable that if a developer is spinning up a thread for each I/O, that they may happen concurrently over time and therefore cache additional descriptors over time. We are seeing what we believe is a 'leak' over time because of this. Of course, we may begin to push developer to use a proper queue/pool, but can't guarantee that.

@headius
Copy link
Member

@headius headius commented Dec 16, 2014

Yes, I agree we could be doing more to clean up selectors that are no longer in use, but implementing that is rather complicated.

The pool is structured like this:

  • A Map maps selector providers to lists of cached, open selectors.
  • The lists of selectors grow and are never shrunk until cleanup() is called.

In order to make it lazily clean up selectors:

  • The lists of pooled selectors would need to be weakly referencing those selectors.
  • Instead of referencing the selector directly, it would need to reference an intermediate wrapper that closes the selector on finalization.
  • The weak entries in the list would need to be periodically tidied up, probably on every insert of a new selector into the cache.

This is not a trivial datastructure. I'm playing with some options now.

@headius
Copy link
Member

@headius headius commented Dec 17, 2014

I think perhaps our best option is actually to provide a way to limit the size of that pool.

The main issue with cleaning it up is that we want to avoid having some background thread harvesting old selectors, which means leaning on JVM finalization. Finalization does not run predictable; obviously it will run in response to explicit GCs, but otherwise you might see selectors hang on a very long time. In addition to using finalization, we would need to softly reference the selectors so that they could dereference and feed into finalization. Add to that the fact that our cache is two-level -- per NIO provider -- and it starts to cross a line for me.

This will probably turn off @enebo, but the simpler option here would be to have a flag limiting how many selectors are pooled. I know, it's not a great option, since it will require individual tuning if you actually want to use it.

Another alternative for you would be to clear the SelectorPool (ruby.getSelectorPool() I believe) periodically. That will go through and close all pooled selectors for all providers.

Thoughts?

@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 17, 2014

The ruby.getSelectorPool() could be a good solution. The question I have on that one is that within a script is there a way to get the 'self' (aka - org.Ruby) object? Or is that possible only within Java that is managing the Ruby object?

@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 17, 2014

Ok, so doing this from Java on the Ruby object seems to do the cleanup, which appears to be a workable solution. I do this from within Java:

package com.tropo.jruby;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.jruby.Ruby;
import org.jruby.util.io.SelectorPool;

public class Playground {
    public static void main(String[] args) throws IOException, InterruptedException {       
        Ruby ruby = Ruby.newInstance();
        SelectorPool selectorPool = ruby.getSelectorPool(); 
        File script = new File("/Users/jsgoecke/Desktop/loop.rb");
        ruby.evalScriptlet(FileUtils.readFileToString(script));

        System.out.println("Done with script");
        Thread.sleep(10000);

        selectorPool.cleanup();
        System.out.println("Clean now");
        Thread.sleep(30000);
    }
}

This is the Ruby script it evals:

require 'net/http'
require 'uri'
require 'thread'

def get_data
  uri = URI.parse("http://google.com/")
  response = Net::HTTP.get_response(uri)
end

threads = []

5.times {
  threads << Thread.new {
    get_data
  }
}

threads.each { |t|
  t.join
}

p "Done check your lsof!"
sleep 10

Before I do:

selectorPool.cleanup();

I see these selector pools:

lsof | grep java | grep PIPE
java      46241 jsgoecke    0     PIPE 0x89af0ee9b83ea731      16384          ->0x89af0ee9adc497c1
java      46241 jsgoecke    1     PIPE 0x89af0ee99b772631      16384          ->0x89af0ee99b7745d1
java      46241 jsgoecke    2     PIPE 0x89af0ee99b773601      16384          ->0x89af0ee9b83eb4f1
java      46241 jsgoecke   20     PIPE 0x89af0ee9abeb0d61      16384          ->0x89af0ee99b773971
java      46241 jsgoecke   21     PIPE 0x89af0ee99b773971      16384          ->0x89af0ee9abeb0d61
java      46241 jsgoecke   23     PIPE 0x89af0ee99b773811      16384          ->0x89af0ee9adc49be1
java      46241 jsgoecke   24     PIPE 0x89af0ee9adc49be1      16384          ->0x89af0ee99b773811
java      46241 jsgoecke   26     PIPE 0x89af0ee99b775231      16384          ->0x89af0ee99b774d61
java      46241 jsgoecke   27     PIPE 0x89af0ee99b774d61      16384          ->0x89af0ee99b775231
java      46241 jsgoecke   29     PIPE 0x89af0ee99b773081      16384          ->0x89af0ee9b83ea891
java      46241 jsgoecke   30     PIPE 0x89af0ee9b83ea891      16384          ->0x89af0ee99b773081
java      46241 jsgoecke   32     PIPE 0x89af0ee99b773ad1      16384          ->0x89af0ee99b774f71
java      46241 jsgoecke   33     PIPE 0x89af0ee99b774f71      16384          ->0x89af0ee99b773ad1

Then after, it shows they are gone:

lsof | grep java | grep PIPE
java      46241 jsgoecke    0     PIPE 0x89af0ee9b83ea731      16384          ->0x89af0ee9adc497c1
java      46241 jsgoecke    1     PIPE 0x89af0ee99b772631      16384          ->0x89af0ee99b7745d1
java      46241 jsgoecke    2     PIPE 0x89af0ee99b773601      16384          ->0x89af0ee9b83eb4f1

I would still be curious if you may reference the Ruby object from within the script, but I am guessing this would break the security model.

@headius
Copy link
Member

@headius headius commented Dec 17, 2014

Within a script, the current JRuby runtime (org.jruby.Ruby) is available as JRuby.runtime. Normal Java integration niceties apply.

If you're satisfied with that solution, we'll close this as Won't Fix for now.

@headius headius added this to the Won't Fix milestone Dec 17, 2014
@headius headius self-assigned this Dec 17, 2014
@jsgoecke
Copy link
Author

@jsgoecke jsgoecke commented Dec 17, 2014

This works a charm:

require 'net/http'
require 'uri'
require 'thread'
require 'jruby'


def get_data
  uri = URI.parse("http://google.com/")
  response = Net::HTTP.get_response(uri)
end

threads = []

4.times {
  threads << Thread.new {
    get_data
  }
}

threads.each { |t|
  t.join
}

JRuby.runtime.getSelectorPool.cleanup
p "Done check your lsof!"
sleep 1000

Thank you for all of the attention and help on this one, much appreciated!

@jsgoecke jsgoecke closed this Dec 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants