Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LinkedHashMap memory leak #3260

Closed
eduveks opened this issue Mar 6, 2021 · 25 comments
Closed

LinkedHashMap memory leak #3260

eduveks opened this issue Mar 6, 2021 · 25 comments
Assignees
Labels

Comments

@eduveks
Copy link

eduveks commented Mar 6, 2021

Hello,

I'm facing memory leak of many LinkedHashMap with GraalVM.

This behaviour is not new because in the last years I am using GraalVM always seeing memory issues, but now is more critical because we are using in situations with much more stress overall.

Anyway GraalVM is awesome, thank you a lot by all your work. 👏

Hope this report can help to improve.

Screenshot 2021-03-06 at 16 27 49

Screenshot 2021-03-06 at 16 28 09

Describe GraalVM and your environment :

  • GraalVM version or commit id if built from source: 21.0.0
  • CE or EE: CE
  • JDK version: 11
  • OS and OS Version: macOS Big Sur
  • Architecture: amd64
  • The output of java -Xinternalversion:
OpenJDK 64-Bit Server VM (11.0.10+8-jvmci-21.0-b06) for bsd-amd64 JRE (11.0.10+8-jvmci-21.0-b06), built on Jan 15 2021 00:28:57 by "graal1" with gcc 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)
@eduveks eduveks added the truffle label Mar 6, 2021
@eduveks
Copy link
Author

eduveks commented Mar 6, 2021

After more 20 minutes looks like this:

Screenshot 2021-03-06 at 16 48 04

Screenshot 2021-03-06 at 16 48 19

@munishchouhan
Copy link
Contributor

@eduveks thanks for reporting the issue
we will check it and get back to you

@munishchouhan munishchouhan self-assigned this Mar 10, 2021
@chumer
Copy link
Member

chumer commented Mar 29, 2021

@eduveks thanks for the report!

Are you reloading classes e.g. using a custom host class loader or some application server? It looks to me as if our host member/method lookup cache seems to be missed a lot.

If this is true then the current behavior might be not ideal but is expected. We currently do reference bound host methods strongly. A workaround is to recreate the Engine (or Context if you don't use an explicit Engine) to refresh that cache. Note that this also refreshes all the compiled code. If this turns out to be the case, then I could have another look on improving this, if you provide a small example of your usage pattern.

If you are not reloading classes in your host application then it would be immensely helpful if you could provide a sample application that reproduces this leak reliably or, less preferrably, a heap dump (hprof) file of the leak.

Thanks!

@eduveks
Copy link
Author

eduveks commented Mar 30, 2021

Hello @chumer,

I did another round of stress test, with screenshots below and here the hprof file made a few minutes ago:

Yes, it is an application server, the Netuno platform, with a Jetty server inside and runs over GraalVM.

You can check out Netuno install is straightforward, only need to execute one command in the terminal.

The installation will download and install GraalVM too inside the netuno folder.

Then here an example of how to create services.

NodeJS inside GraalVM execute the javascript server-side (services), then the script code like _out.json, _db.query and much more, they will call Java methods.

Each request executes multiples scripts (JavaScript) over the same Class that initializes the HostAccess, ContextBuilder, and then the Context, but in the same request may create more than one Context. But always all Contexts initialized will be closed in the end of the request with context.leave() and context.close(true).

Multi-threads do not use the same HostAccess and ContextBuilder, each thread/request has your own org.graalvm.polyglot.* instances.

If you need more details just say.

Thank you!

Screenshot 2021-03-30 at 17 04 50

@chumer
Copy link
Member

chumer commented Apr 6, 2021

I downloaded the heap dump and computed the retained size and the retained size looks like this:

Screenshot 2021-04-06 at 15 19 41

So it looks like if all of the references can actually be freed.
So I don't think there is anything leaking in that heap dump.
Why you are seeing the GC pressure that you are seeing, I cannot really explain.
Did you try forcing a full GC with System.gc form time to time?

Each request executes multiples scripts (JavaScript) over the same Class that initializes the HostAccess, ContextBuilder, and then the Context, but in the same request may create more than one Context. But always all Contexts initialized will be closed in the end of the request with context.leave() and context.close(true).

Two remarks here:

  • Do not use context.close(true) if you don't need to. Always prefer to use context.close() unless you really want to cancel the execution.
  • I am seeing a lot of memory pressure due to many contexts created that do a lot of host interop. A lot of this work can be cached by using code sharing between multiple contexts[1]. Maybe you should try that? Should give you a significant improvement, and probably also get rid of this issue. (although we should get to the root cause of this)

[1] https://www.graalvm.org/reference-manual/embed-languages/#code-caching-across-multiple-contexts

@eduveks
Copy link
Author

eduveks commented Apr 7, 2021

I think that I already tested the implementation with code caching across multiple contexts in the past.

But yesterday, I did some tests changing the implementation to supports the same Engine across all threads and with one Engine per thread.

The first impression in both cases was memory ate fast and very slow performance, scripts executed before in 300 milliseconds jumped to 20 seconds, CPU with high usage too.

I will try to do more tests within the next few days to try to give more details.

Yes, periodically, we force the garbage collector with:

System.gc();
System.runFinalization();

Sample of the first results with the Engine:

Screenshot 2021-04-06 at 22 47 17

@chumer
Copy link
Member

chumer commented Apr 7, 2021

Any chance you can give me something runnable that reproduces this problem?

@eduveks
Copy link
Author

eduveks commented Apr 7, 2021

Any chance you can give me something runnable that reproduces this problem?

Yes, I think. I don't have time today, but I will give you more news in the next few days.

@thurka
Copy link
Member

thurka commented Apr 8, 2021

I think I know what is going on here. I was investigating, why there are so many instances of java.lang.reflect.Method in the heap dump. I tracked almost all the references to various instances of java.lang.ClassValue$ClassValueMap and those instances does not have any references so it looks like they can be collected by GC, but instances of java.lang.ClassValue$ClassValueMap are references from java.lang.Class.classValueMap field so they are held by the particular class, but this information is not dumped to hprof file. :-(
The question is why java.lang.ClassValue$ClassValueMap holds so many java.lang.reflect.Method. I don't know. If you look at the source code of ClassValueMap, there is some caching, so it might be one reason. I also saw this: https://stackoverflow.com/questions/62593886/java-memory-leak-why-are-unreachable-objects-not-collected This looks very similar to our case and in addition it uses scripting API to call javascript.
I have netuno server running with some demo application and just randomly clicking pages, reloading pages etc., I am able to see steady rise of instances of Method.

@thurka
Copy link
Member

thurka commented Apr 8, 2021

This is example of references from Method to ClassValueMap.
refs

@thurka
Copy link
Member

thurka commented Apr 8, 2021

Small heap dump for investigation: GH-3260_heapdump.zip .

@thurka
Copy link
Member

thurka commented Apr 8, 2021

Steps to reproduce:

  1. get off Cisco VPN
  2. set JAVA_HOME to JDK 11 (installation below needs nashorn, so it will not work on JDK 15+)
  3. install netuno in /tmp
thurka$ cd /tmp
thurka$ mkdir -p netuno && cd netuno && curl https://www.netuno.org/netuno.jar > netuno.jar && java -jar netuno.jar install
  1. add node to PATH
thurka$ export PATH=/tmp/netuno/graalvm/bin/:$PATH
  1. start netuno server with the default application via ./netuno server
  2. fill in email address
  3. it will open web browser in Portuguese locale(???), you have to manually register the netuno server
  4. it will take some time, since NPM Install is running for the first time
  5. play with application
  6. start VisualVM and open notuno process
  7. switch to Sampler tab and select Memory sampling
  8. sort data by Live Object
  9. look for number of java.lang.reflect.Method instances
  10. the number is steadily climbing while I play with netuno app in browser
  11. you can use Show delta values button to see deltas in live object count
  12. you can invoke GC directly from Memory sampler view to free unreferenced instances

@chumer
Copy link
Member

chumer commented Apr 8, 2021

I did not yet run your code yet, but I think I found a problematic pattern in our code:

public class Test {
	
	/*
	 * Change to false to avoid the bug.
	 */
	private static boolean CYCLIC_REFERENCE = true;

	private static class HeavyValue {
		
		final Context context;
		final byte[] bytes = new byte[1024 * 16];
		
		HeavyValue(Context context) {
			// the context assignment
			if (CYCLIC_REFERENCE) {
				this.context = context;
			} else {
				this.context = null;
			}
		}
		
	}
	
	
	private static class Context {
		
		final ClassValue<HeavyValue> values = new ClassValue<Test.HeavyValue>() {
			@Override
			protected HeavyValue computeValue(Class<?> type) {
				return new HeavyValue(Context.this);
			}
		};
		
		
	}
	
	
	static class TestClass1 {
		
		}
	static class TestClass2 {
			
		}
	static class TestClass3 {
		
	}
	static class TestClass4 {
		
	}
	static class TestClass5 {
		
	}
	static class TestClass6 {
		
	}

	private static final Class<?>[] TEST_CLASSES = new Class[] {
			TestClass1.class,
			TestClass2.class,
			TestClass3.class,
			TestClass4.class,
			TestClass5.class,
			TestClass6.class
	};
	
	
	private static int test() {
		Context context =  new Context();
		
		int hash = 31;
		for (Class<?> testClass : TEST_CLASSES) {
			hash += context.values.get(testClass).bytes.length;	
		}
		return hash;
	}
	
	
	public static void main(String[] args) {
		int hash = 31;
		for (int i = 0; i < 100000; i++) {
			hash += test();
		}
		System.out.println(hash);
		
	}
	
}

Switching that cyclic reference flag to false fixes the problem (tested on latest JDK 8 on Mac). But this should really not happen, as such cyclic references should be resolvable by the GC. I will double check and maybe file a JDK bug. We probably need to deploy a workaround for this.

I need to do some more digging to find out what is going on.

@chumer
Copy link
Member

chumer commented Apr 8, 2021

Turns out this is a known issue:
https://bugs.openjdk.java.net/browse/JDK-8169425

@thurka
Copy link
Member

thurka commented Apr 8, 2021

It looks like every page reload in the web app adds several instances of com.oracle.truffle.polyglot.HostClassCache.

@chumer
Copy link
Member

chumer commented Apr 8, 2021

I could reproduce it with a simple example now locally.
It looks like there is a simple workaround. You need to reuse the same HostAccess instance for all contexts.
I am working on a fix that avoids a the Cyclic dependency.

@chumer
Copy link
Member

chumer commented Apr 8, 2021

Fix is in the pipeline. Will backport this to 21.1 and probably also 20.3 (need to double check)

@eduveks
Copy link
Author

eduveks commented Apr 9, 2021

Thank you, @thurka, for your inputs, and I'm happy you could found the issue with Netuno.

@chumer the workaround reusing the HostAccess across all contexts does not affect overall performance, but in this case, the memory leak still happens. See the print screen below:

Screenshot 2021-04-09 at 12 26 00
Screenshot 2021-04-09 at 12 26 10

To complement the @thurka tip to run locally, if install and start Netuno just with ./netuno server based on the default configs will run the demo app, then it is possible to execute Apache BenchMark (web stress) over one of these services:

ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/print-template
ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/date-format
ab -n 100 -c 5 http://127.0.0.1:9000/services/samples/javascript/query-result

The first one, print-template, seems to provoke a higher peak.

This test with Apache BenchMark and the demo application are very different from all tests I had reported before, but I think it is over the same internal issue.

My results with Apache BenchMark executed over 5 times:

Screenshot 2021-04-09 at 13 50 20

Screenshot 2021-04-09 at 13 50 33

@chumer
Copy link
Member

chumer commented Apr 9, 2021

@eduveks This is really weird, in my local tests using the same HostAccess instance for all contexts resolved the problem. As the cache is bound and reused for HostAccess instances. So it should be impossible to have the same method loaded twice. Are you sure you are not just building an equal HostAccess instance? To make sure, can you read the HostAccess instance from a static final? Maybe you can drop a snippet how you are doing that atm, so I can plug it into my test?

@eduveks
Copy link
Author

eduveks commented Apr 9, 2021

@chumer I did another test to ensure and is the same behavior with HostAccess as static.

This is the current full code of the integration with GraalVM:

package org.netuno.psamata.script;

import org.graalvm.polyglot.*;
import org.graalvm.polyglot.Context.Builder;

import java.util.*;

public class GraalRunner {
    private static boolean graal = false;
    
    //private static Engine engine = Engine.create();
    
    private static HostAccess hostAccess = null;

    private Builder contextBuilder = null;
    
    private Context context = null;
    
    private List<Context> oldContexts = new ArrayList<>();

    static {
        try {
            Context.create().close();
            
            hostAccess = HostAccess.newBuilder()
                .allowPublicAccess(true)
                .targetTypeMapping(
                        Value.class, Object.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, List.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, Collection.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                ).targetTypeMapping(
                        Value.class, Iterable.class,
                        (v) -> v.hasArrayElements(),
                        (v) -> transformArray(v)
                )
                .build();
            
            graal = true;
        } catch (Throwable t) {
            graal = false;
        }
    }

    public GraalRunner(String... permittedLanguages) {
        init(null, permittedLanguages);
    }
    
    public GraalRunner(Map<String, String> options, String... permittedLanguages) {
        init(options, permittedLanguages);
    }
    
    private void init(Map<String, String> options, String... permittedLanguages) {
        //engine = Engine.create();
        
        contextBuilder = Context.newBuilder(permittedLanguages)
                .allowExperimentalOptions(true)
                .allowIO(true)
                .allowAllAccess(true)
                .allowCreateThread(true)
                .allowHostAccess(hostAccess);
        
        //https://www.graalvm.org/truffle/javadoc/org/graalvm/polyglot/ResourceLimits.html
        //contextBuilder.resourceLimits(new ResourceLimits())
        
        if (options != null) {
            contextBuilder.options(options);
        }
        
        //contextBuilder.engine(engine);

        context = contextBuilder.build();
    }
    
    public void resetContext() {
        if (context != null) {
            oldContexts.add(context);
        }
        context = contextBuilder.build();
    }
    
    public void close() {
        for (Context oldContext : oldContexts) {
            try {
                oldContext.leave();
            } catch (Exception e) { }
            oldContext.close();
        }
        oldContexts.clear();
        oldContexts = null;
        try {
            context.leave();
        } catch (Exception e) { }
        context.close();
        context = null;
        //engine.close();
    }

    public static boolean isGraal() {
        return graal;
    }

    public GraalRunner set(String language, String var, Object value) {
        context.getBindings(language).putMember(var, value);
        return this;
    }

    public Set<String> keys(String language) {
        return context.getBindings(language).getMemberKeys();
    }

    public Object get(String language, String var) {
        return toObject(context.getBindings(language).getMember(var));
    }

    public boolean getBoolean(String language, String var) {
        return context.getBindings(language).getMember(var).asBoolean();
    }

    public byte getByte(String language, String var) {
        return context.getBindings(language).getMember(var).asByte();
    }

    public double getDouble(String language, String var) {
        return context.getBindings(language).getMember(var).asDouble();
    }

    public float getFloat(String language, String var) {
        return context.getBindings(language).getMember(var).asFloat();
    }

    public int getInt(String language, String var) {
        return context.getBindings(language).getMember(var).asInt();
    }

    public long getLong(String language, String var) {
        return context.getBindings(language).getMember(var).asLong();
    }

    public int getShort(String language, String var) {
        return context.getBindings(language).getMember(var).asShort();
    }

    public String getString(String language, String var) {
        return context.getBindings(language).getMember(var).asString();
    }

    public Value eval(String language, String code) {
        try {
            context.enter();
        } catch (Exception e) { }
        try {
            return context.eval(language, code);
        } finally {
            try {
                context.leave();
            } catch (Exception e) { }
        }
    }

    public static Map transformMembers(Value v) {
        Map map = new HashMap();
        for (String key : v.getMemberKeys()) {
            Value member = v.getMember(key);
            if (member.hasArrayElements() && !member.isHostObject()) {
                map.put(key, transformArray(member));
            } else if (member.hasMembers() && !member.isHostObject()) {
                map.put(key, transformMembers(member));
            } else {
                map.put(key, toObject(member));
            }
        }
        return map;
    }

    public static List transformArray(Value v) {
        List list = new ArrayList();
        for (int i = 0; i < v.getArraySize(); ++i) {
            Value element = v.getArrayElement(i);
            if (element.hasArrayElements() && !element.isHostObject()) {
                list.add(transformArray(element));
            } else if (element.hasMembers() && !element.isHostObject()) {
                list.add(transformMembers(element));
            } else {
                list.add(toObject(element));
            }
        }
        return list;
    }

    public static Object toObject(Value v) {
        if (v.isNull()) {
            return null;
        } else if (v.isHostObject()) {
            return v.asHostObject();
        } else if (v.isProxyObject()) {
            return v.asProxyObject();
        } else if (v.isBoolean()) {
            return v.asBoolean();
        } else if (v.isNumber()) {
            if (v.fitsInByte()) {
                return v.asByte();
            }
            if (v.fitsInShort()) {
                return v.asShort();
            }
            if (v.fitsInInt()) {
                return v.asInt();
            }
            if (v.fitsInLong()) {
                return v.asLong();
            }
            if (v.fitsInFloat()) {
                return v.asFloat();
            }
            if (v.fitsInDouble()) {
                return v.asDouble();
            }
        } else if (v.isString()) {
            return v.asString();
        } else {
            Value value = v.getMetaObject();
            if (value.fitsInByte()) {
                return value.asByte();
            }
            if (value.fitsInShort()) {
                return value.asShort();
            }
            if (value.fitsInInt()) {
                return value.asInt();
            }
            if (value.fitsInLong()) {
                return value.asLong();
            }
            if (value.fitsInFloat()) {
                return value.asFloat();
            }
            if (value.fitsInDouble()) {
                return value.asDouble();
            }
            try {
                return value.asBoolean();
            } catch (Exception e) {
                try {
                    return value.asString();
                } catch (Exception ex) {
                    return null;
                }
            }
        }
        return null;
    }
}

@chumer
Copy link
Member

chumer commented Apr 12, 2021

Any reason why you are doing:

try {
            context.enter();
        } catch (Exception e) { }
        try {
            return context.eval(language, code);
        } finally {
            try {
                context.leave();
            } catch (Exception e) { }
        }

With enter/leave? Any context operation should do an implicit enter/leave. In this case doing the enter and leave explicitely is actually slower.

Missing a leave on a particular thread can also potentially cause another leak. Is there any multi-threading happening for a GraalRunner instance?

@eduveks
Copy link
Author

eduveks commented Apr 12, 2021

@chumer, my apologies, I detect a maven malfunction that did not change my code version in all of my tests. 🤦‍♂️

Finally, it is correct. Your workaround sharing the HostAccess works very well!

Sorry for my mistake.

This is the result of many stress, my hardest test, rolling smooth: 🥳

Screenshot 2021-04-12 at 17 50 20

Thank you very much! 🍻

@eduveks
Copy link
Author

eduveks commented Apr 12, 2021

To clarify, I commented all context.enter() and context.leave(). I do not remember why but I think it was an attempt to avoid the leak.

Yes, just one instance of the GraalRunner per thread always.

And again, thank you for your support.

@eduveks
Copy link
Author

eduveks commented Apr 12, 2021

With this, I remembered making another test sharing the same Engine with all threads, as @chumer said.

And the performance jumped to another level, much less memory and CPU consumption overall, even with a lot of stress is like nothing is going on.

Then the key is to share Engine and HostAccess over all threads. 🚀

Finally, the killer result:

Screenshot 2021-04-12 at 19 08 40

@eduveks eduveks closed this as completed Apr 13, 2021
@chumer
Copy link
Member

chumer commented Apr 14, 2021

Happy to hear it all works out now as intended! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants