New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChannelDescriptor is not removed from the map after ScriptContainer is terminated #4446

Closed
seanguo opened this Issue Jan 19, 2017 · 16 comments

Comments

Projects
None yet
3 participants
@seanguo

seanguo commented Jan 19, 2017

Environment

JRuby 1.7.21
Linux
Using as embedded JSR223 engine

I found a similar issue #2333 but that is already fixed in 1.7.21. And the code path is not the same. The ChannelDescriptor is not removed in our case.

Sample Code Snippet

ScriptContext context = new SimpleScriptContext(); ScriptEngine engine = ScriptEngineManager.getEngineByName(); context.setAttribute("org.jruby.embed.termination", true, ScriptContext.ENGINE_SCOPE); engine.eval(scriptFile, context);
We have a pool of these engines and will try to reuse them for each script run.

Unexpected Behavior

After the execution I logged the ChannelDescriptor.filenoDescriptorMap, it still has the following values inside:
{16=org.jruby.util.io.ChannelDescriptor@7771425e, 15=org.jruby.util.io.ChannelDescriptor@34022cb2}

I added some stacktrace to track where these two ChannelDescriptor are added, got the following:
15 is added java.lang.Exception at org.jruby.util.io.ChannelDescriptor.registerDescriptor(ChannelDescriptor.java:865) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:180) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:197) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:294) at org.jruby.RubyIO.<init>(RubyIO.java:138) at org.jruby.embed.jsr223.Utils.getRubyIO(Utils.java:179) at org.jruby.embed.jsr223.Utils.setWriter(Utils.java:150) at org.jruby.embed.jsr223.Utils.preEval(Utils.java:113) at org.jruby.embed.jsr223.JRubyEngine.eval(JRubyEngine.java:88) ... at java.lang.Thread.run(Thread.java:745) 16 is added java.lang.Exception at org.jruby.util.io.ChannelDescriptor.registerDescriptor(ChannelDescriptor.java:865) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:180) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:197) at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:294) at org.jruby.RubyIO.<init>(RubyIO.java:138) at org.jruby.embed.jsr223.Utils.getRubyIO(Utils.java:179) at org.jruby.embed.jsr223.Utils.setErrorWriter(Utils.java:171) at org.jruby.embed.jsr223.Utils.preEval(Utils.java:114) at org.jruby.embed.jsr223.JRubyEngine.eval(JRubyEngine.java:88)

Please suggest.
Thanks.

@seanguo seanguo changed the title from ChannelDescriptor is not removed from the map after ScriptContainer is removed to ChannelDescriptor is not removed from the map after ScriptContainer is terminated Jan 19, 2017

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 24, 2017

Member

Are you able to test this on a more recent JRuby 1.7? I don't know that it was fixed, but we do a lot of work on every point release.

On 9k this should be a non-issue since ChannelDescriptor and its registry are no longer used.

Member

headius commented Jan 24, 2017

Are you able to test this on a more recent JRuby 1.7? I don't know that it was fixed, but we do a lot of work on every point release.

On 9k this should be a non-issue since ChannelDescriptor and its registry are no longer used.

@headius headius added this to the JRuby 1.7.27 milestone Jan 24, 2017

@seanguo

This comment has been minimized.

Show comment
Hide comment
@seanguo

seanguo Feb 8, 2017

Thanks for your reply. I tried the latest which had the same behavior.

seanguo commented Feb 8, 2017

Thanks for your reply. I tried the latest which had the same behavior.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 9, 2017

Member

@seanguo Ok, here's a couple suggestions.

  1. You can clear the descriptors yourself by accessing those collections from either Ruby or Java. That can be a temporary workaround for you. You may or may not want to try to close the descriptors as you remove them.
  2. You may be able to find why the descriptors are not getting removed by adding some logging or printStackTrace to the various paths that unregister them. I believe we unregister all of them on teardown, but teardown leaves a runtime in an unusable state.
  3. Try forcing GC and confirming that the descriptors remain.
  4. Try pulling a heap dump after you've confirmed the descriptors are leaking and inspect all references to ChannelDescriptor instances. This will show if we're keeping them alive somewhere.

If you can come up with a simple way for us to reproduce this, we'll be able to help much more easily.

Member

headius commented Feb 9, 2017

@seanguo Ok, here's a couple suggestions.

  1. You can clear the descriptors yourself by accessing those collections from either Ruby or Java. That can be a temporary workaround for you. You may or may not want to try to close the descriptors as you remove them.
  2. You may be able to find why the descriptors are not getting removed by adding some logging or printStackTrace to the various paths that unregister them. I believe we unregister all of them on teardown, but teardown leaves a runtime in an unusable state.
  3. Try forcing GC and confirming that the descriptors remain.
  4. Try pulling a heap dump after you've confirmed the descriptors are leaking and inspect all references to ChannelDescriptor instances. This will show if we're keeping them alive somewhere.

If you can come up with a simple way for us to reproduce this, we'll be able to help much more easily.

@headius headius added the core label Feb 9, 2017

@seanguo

This comment has been minimized.

Show comment
Hide comment
@seanguo

seanguo Feb 10, 2017

@headius

  1. It's hard for me find the right time to clear that map since it's global and shared across all containers. We are running different container concurrently. I think we need to fix this in the JRuby container level.
  2. As you can see from my previous comment. I've added stacktrace at the registerDescriptor method and added map dump at the unregisterDescriptor. I confirmed that the ChannelDescriptor are still there after the evaluation even after full GC. Let me know if this is not the correct way to debug.

We found this issue by finding that one of our production environment is busy with full GC activity. And after a heap dump analysis we found the heap is full of the ChannelDescriptor in the filenoDescriptorMap.

I'll try to give you a sample code to reproduce later.

Thanks.

seanguo commented Feb 10, 2017

@headius

  1. It's hard for me find the right time to clear that map since it's global and shared across all containers. We are running different container concurrently. I think we need to fix this in the JRuby container level.
  2. As you can see from my previous comment. I've added stacktrace at the registerDescriptor method and added map dump at the unregisterDescriptor. I confirmed that the ChannelDescriptor are still there after the evaluation even after full GC. Let me know if this is not the correct way to debug.

We found this issue by finding that one of our production environment is busy with full GC activity. And after a heap dump analysis we found the heap is full of the ChannelDescriptor in the filenoDescriptorMap.

I'll try to give you a sample code to reproduce later.

Thanks.

@seanguo

This comment has been minimized.

Show comment
Hide comment
@seanguo

seanguo Feb 10, 2017

You can use the following code to reproduce:

package com.test.testMaven;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleScriptContext;

import org.jruby.util.io.ChannelDescriptor;

public class App {
  public static void main(String[] args) throws ScriptException, InterruptedException {
    ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
    ScriptEngine engine = scriptEngineManager.getEngineByName("jruby");
    ScriptContext scriptContext = new SimpleScriptContext();
    scriptContext.setAttribute("org.jruby.embed.termination", true, ScriptContext.ENGINE_SCOPE);

    engine.eval("print 'test'", scriptContext);

    System.gc();
    Thread.sleep(5000);
    
    System.out.println(ChannelDescriptor.getFilenoDescriptorMapReadOnly());
  }
}

Thanks.

seanguo commented Feb 10, 2017

You can use the following code to reproduce:

package com.test.testMaven;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleScriptContext;

import org.jruby.util.io.ChannelDescriptor;

public class App {
  public static void main(String[] args) throws ScriptException, InterruptedException {
    ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
    ScriptEngine engine = scriptEngineManager.getEngineByName("jruby");
    ScriptContext scriptContext = new SimpleScriptContext();
    scriptContext.setAttribute("org.jruby.embed.termination", true, ScriptContext.ENGINE_SCOPE);

    engine.eval("print 'test'", scriptContext);

    System.gc();
    Thread.sleep(5000);
    
    System.out.println(ChannelDescriptor.getFilenoDescriptorMapReadOnly());
  }
}

Thanks.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

I meant to add Thread.dumpStack or catch + printStackTrace to the paths that call unregisterDescriptor, to see if they are quietly failing to remove the descriptor.

In any case, your example looks simple enough. We will try to look into it. Any additional investigation you can do would be helpful, since we are trying to end-of-life JRuby 1.7 at this point.

Member

headius commented Feb 13, 2017

I meant to add Thread.dumpStack or catch + printStackTrace to the paths that call unregisterDescriptor, to see if they are quietly failing to remove the descriptor.

In any case, your example looks simple enough. We will try to look into it. Any additional investigation you can do would be helpful, since we are trying to end-of-life JRuby 1.7 at this point.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

Ugh...I think I've found it.

It appears that engine.eval registers new ChannelDescriptors every time. I modified your example to call eval ten times, and there's ten times as many descriptors left in the map.

So this may be a simple fix. Either the scripting container needs to unregister those descriptors every time, or it needs to register them once when initialized. Thoughts @enebo?

Member

headius commented Feb 13, 2017

Ugh...I think I've found it.

It appears that engine.eval registers new ChannelDescriptors every time. I modified your example to call eval ten times, and there's ten times as many descriptors left in the map.

So this may be a simple fix. Either the scripting container needs to unregister those descriptors every time, or it needs to register them once when initialized. Thoughts @enebo?

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

Here's a stack trace from one of the offending adds:

Registering descriptor 24
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1333)
	at org.jruby.util.io.ChannelDescriptor.registerDescriptor(ChannelDescriptor.java:876)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:182)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:199)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:296)
	at org.jruby.RubyIO.<init>(RubyIO.java:139)
	at org.jruby.embed.jsr223.Utils.getRubyIO(Utils.java:179)
	at org.jruby.embed.jsr223.Utils.setWriter(Utils.java:150)
	at org.jruby.embed.jsr223.Utils.preEval(Utils.java:113)
	at org.jruby.embed.jsr223.JRubyEngine.eval(JRubyEngine.java:88)
	at Blah.main(Blah.java:25)
Member

headius commented Feb 13, 2017

Here's a stack trace from one of the offending adds:

Registering descriptor 24
java.lang.Exception: Stack trace
	at java.lang.Thread.dumpStack(Thread.java:1333)
	at org.jruby.util.io.ChannelDescriptor.registerDescriptor(ChannelDescriptor.java:876)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:182)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:199)
	at org.jruby.util.io.ChannelDescriptor.<init>(ChannelDescriptor.java:296)
	at org.jruby.RubyIO.<init>(RubyIO.java:139)
	at org.jruby.embed.jsr223.Utils.getRubyIO(Utils.java:179)
	at org.jruby.embed.jsr223.Utils.setWriter(Utils.java:150)
	at org.jruby.embed.jsr223.Utils.preEval(Utils.java:113)
	at org.jruby.embed.jsr223.JRubyEngine.eval(JRubyEngine.java:88)
	at Blah.main(Blah.java:25)
@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Feb 13, 2017

Member

@headius strikes me in this case we invisibly ask for an IO but then do not record it and then close/cleanup. It sticking around seems like it would only make sense if this io persisted between evals. Is there a scenario where we would expect this to escape and be a valid channeldescriptor?

Member

enebo commented Feb 13, 2017

@headius strikes me in this case we invisibly ask for an IO but then do not record it and then close/cleanup. It sticking around seems like it would only make sense if this io persisted between evals. Is there a scenario where we would expect this to escape and be a valid channeldescriptor?

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

@enebo It looks like this may be a specific peculiarity of JSR-223. The code inside our implementation needs to grab readers/writers out of the passed in ScriptContext each time eval is called. In order for those to be accessible to Ruby, they get registered. And then they never get unregistered because eval isn't really meaningful as a lifecycle.

So like you say one possibility would be not registering them. I'm not sure we have a way to do that in the API but it could be added. All you'd lose is the ability to go from the (fake) fileno to the IO, as in IO.from_fd. That seems acceptable.

The other option is just unregistering them on the way out. It would add no more overhead and filenos could still be used to look up IO, but any code running asynchronously between calls might behave oddly. This also would not fix the issue that repeat calls with the SAME readers/writers would produce DIFFERENT filenos, which seems almost as bad as not registering filenos at all.

Member

headius commented Feb 13, 2017

@enebo It looks like this may be a specific peculiarity of JSR-223. The code inside our implementation needs to grab readers/writers out of the passed in ScriptContext each time eval is called. In order for those to be accessible to Ruby, they get registered. And then they never get unregistered because eval isn't really meaningful as a lifecycle.

So like you say one possibility would be not registering them. I'm not sure we have a way to do that in the API but it could be added. All you'd lose is the ability to go from the (fake) fileno to the IO, as in IO.from_fd. That seems acceptable.

The other option is just unregistering them on the way out. It would add no more overhead and filenos could still be used to look up IO, but any code running asynchronously between calls might behave oddly. This also would not fix the issue that repeat calls with the SAME readers/writers would produce DIFFERENT filenos, which seems almost as bad as not registering filenos at all.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

Ok, as far as I can see, 9k does not have this issue. We do have a mapping of internal pseudo-filenos but at least for this case the number of registered descriptors seems to remain constant.

It's possible this is just digging out the "real" fileno, which would not get registered. This would be the case if the default ScriptingContext's streams are just stdio, which I believe they are. Given completely artificial streams, it may still leak.

Member

headius commented Feb 13, 2017

Ok, as far as I can see, 9k does not have this issue. We do have a mapping of internal pseudo-filenos but at least for this case the number of registered descriptors seems to remain constant.

It's possible this is just digging out the "real" fileno, which would not get registered. This would be the case if the default ScriptingContext's streams are just stdio, which I believe they are. Given completely artificial streams, it may still leak.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

It looks like unregistering these may be a bit problematic. We also need to take a long hard look at the lifecycle for JSR-223.

The additional problem I ran into is that while it's setting up those streams on a per-eval basis, it also sets global variables like $stdout. Obviously this isn't really threadsafe, since multiple concurrent evals might want different streams. We need a better mapping from the context streams to the IO we use internally, and ideally we would keep them the same across calls.

But the problem with unregistering is that we'd need to maintain a mapping of context+streams to the IO objects we want to unregister. And that means more state in the containers and potential issues if two eval calls want to use the same streams. The second thread might unregister them while the first is still using them.

I think for now the best option would be to modify the code in the JSR-223 code to not register fileno for incoming streams, since they're not really "real" streams anyway and there's all sorts of problems trying to tidy them up.

Member

headius commented Feb 13, 2017

It looks like unregistering these may be a bit problematic. We also need to take a long hard look at the lifecycle for JSR-223.

The additional problem I ran into is that while it's setting up those streams on a per-eval basis, it also sets global variables like $stdout. Obviously this isn't really threadsafe, since multiple concurrent evals might want different streams. We need a better mapping from the context streams to the IO we use internally, and ideally we would keep them the same across calls.

But the problem with unregistering is that we'd need to maintain a mapping of context+streams to the IO objects we want to unregister. And that means more state in the containers and potential issues if two eval calls want to use the same streams. The second thread might unregister them while the first is still using them.

I think for now the best option would be to modify the code in the JSR-223 code to not register fileno for incoming streams, since they're not really "real" streams anyway and there's all sorts of problems trying to tidy them up.

headius added a commit to headius/jruby that referenced this issue Feb 13, 2017

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 13, 2017

Member

I just pushed #4491 that avoids registering these two fileno for each eval. Actually, it just immediately unregisters them, since we do not have a way to initialize a RubyIO without registering its fileno. This seems like the least invasive fix.

@seanguo Can you test out the headius:fix-fileno-leak branch with your app and see how it looks?

Member

headius commented Feb 13, 2017

I just pushed #4491 that avoids registering these two fileno for each eval. Actually, it just immediately unregisters them, since we do not have a way to initialize a RubyIO without registering its fileno. This seems like the least invasive fix.

@seanguo Can you test out the headius:fix-fileno-leak branch with your app and see how it looks?

@seanguo

This comment has been minimized.

Show comment
Hide comment
@seanguo

seanguo Feb 14, 2017

@headius Thank you for the quick fix. I've tested your branch build in the test app and it works. We will do load testing to further validation. By the way do we have a estimated date to release the 1.7.27 which includes the fix?
Thanks.

seanguo commented Feb 14, 2017

@headius Thank you for the quick fix. I've tested your branch build in the test app and it works. We will do load testing to further validation. By the way do we have a estimated date to release the 1.7.27 which includes the fix?
Thanks.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 14, 2017

Member

At the moment, no. We had hoped for 1.7.26 to be the last release in the 1.7 line, but we have not addressed all migration issues nor released the "1.7 killer" 9.2 yet. I'll confer with @enebo about this today.

Member

headius commented Feb 14, 2017

At the moment, no. We had hoped for 1.7.26 to be the last release in the 1.7 line, but we have not addressed all migration issues nor released the "1.7 killer" 9.2 yet. I'll confer with @enebo about this today.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 15, 2017

Member

We are looking at doing a 1.7.27 within the next week. I'll talk to @enebo about this fix.

Member

headius commented Mar 15, 2017

We are looking at doing a 1.7.27 within the next week. I'll talk to @enebo about this fix.

headius added a commit that referenced this issue Apr 19, 2017

Merge pull request #4491 from headius/fix-fileno-leak
Immediately unregister out and err for JSR223 evals. Fixes #4446.

@headius headius closed this Apr 19, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment