
Reworked the way allocated memory is tracked #1239

Closed

Conversation

joerg1985
Contributor

Replaced the WeakHashMap with a doubly linked list of WeakReferences to
improve performance by reducing the time spent in synchronized blocks.

Benchmark                           Branch            Java version  Mode   Cnt  Score        Error         Units
MemoryBenchmark.allocateWithMemory  master            1.8           thrpt  25   1420406.020  ±  46239.151  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  1.8           thrpt  25   2020881.615  ±  33498.284  ops/s
MemoryBenchmark.allocateWithMemory  master            11            thrpt  25   1526824.044  ± 106555.120  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  11            thrpt  25   2615836.523  ± 171918.750  ops/s
MemoryBenchmark.allocateWithMemory  master            14            thrpt  25   1877610.032  ± 108837.116  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  14            thrpt  25   2289198.917  ± 175390.679  ops/s
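For orientation, here is a minimal sketch of the approach. The names LinkedReference and HEAD match the PR, but the body is a simplified illustration rather than the actual Memory.java change:

    import java.lang.ref.WeakReference;

    // Illustrative sketch of the tracking scheme, simplified from the PR:
    // each tracked instance is wrapped in a node of a doubly linked list,
    // so insertion and unlinking are O(1) under a single short lock.
    class LinkedReference extends WeakReference<Object> {

        private static LinkedReference HEAD; // head of the doubly linked list
        private LinkedReference prev;
        private LinkedReference next;

        private LinkedReference(Object referent) {
            super(referent);
        }

        // wrap a new instance and insert its node at the head of the list
        static LinkedReference track(Object instance) {
            LinkedReference ref = new LinkedReference(instance);
            synchronized (LinkedReference.class) {
                if (HEAD != null) {
                    ref.next = HEAD;
                    HEAD.prev = ref;
                }
                HEAD = ref;
            }
            return ref;
        }

        // remove this node in O(1) when the instance is disposed
        void unlink() {
            synchronized (LinkedReference.class) {
                if (prev != null) {
                    prev.next = next;
                } else if (HEAD == this) {
                    HEAD = next;
                }
                if (next != null) {
                    next.prev = prev;
                }
                prev = null;
                next = null;
            }
        }
    }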


@dbwiddis dbwiddis left a comment


This performance improvement looks great! I've left a few comments/suggestions inline for your consideration.

}
}

private static LinkedReference HEAD; // the head of the doubly linked list used for instance tracking
Contributor


I'd generally prefer to see the variable definition before it's used in the code.

Contributor Author


This could be done easily, but LinkedReference would not be defined at that point. I think this is a chicken-and-egg situation; what is the best way to solve it?

Contributor


It's a matter of preference/style. I'm not sure there's a right answer.

In this case, LinkedReference is a class, and there's generally no need to define a class first (I suppose an import statement would normally do that) or in any specific order, whereas a field that's accessed does have an ordering impact on later fields. When reading code, if I see a class definition, I know that class is defined somewhere and I can eventually find it, either in this file or another one; when I see a variable referenced, I generally scroll up to see where it's defined.

Not a big deal, just expressing my preference! :)

@matthiasblaesing
Member

The suggested changes should be measured first.

The suggested finer grained locking adds additional complexity to the code and nested locks could lead to dead-locks, if not all callers follow the same lock order.

I would also not care too much about the performance of disposeAll. That method is a last-resort effort, and if it sits in a lock-free hot path, the application has bigger problems than performance. The nested synchronized block should not matter, as the object's monitor is already taken by the calling thread.

@dbwiddis
Contributor

dbwiddis commented Jul 29, 2020

The suggested finer grained locking adds additional complexity to the code and nested locks could lead to dead-locks, if not all callers follow the same lock order.

The nesting only goes one way, so I don't see how a deadlock is possible. It will be interesting to see whether this helps, hurts, or has no impact on performance.
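For illustration, "one way" nesting means every path that holds both locks acquires them in the same fixed order; the lock names below are hypothetical, not from the PR:

    // Hypothetical illustration. Deadlock needs a cycle: thread A holds L1
    // waiting for L2 while thread B holds L2 waiting for L1. If nesting only
    // ever goes LIST -> QUEUE, and no path takes QUEUE first and then LIST,
    // no such cycle can form.
    class LockOrderExample {
        private final Object LIST = new Object();
        private final Object QUEUE = new Object();

        void add() {
            synchronized (LIST) {
                synchronized (QUEUE) { // nested, but always LIST -> QUEUE
                    // mutate both structures
                }
            }
        }

        void drainQueue() {
            synchronized (QUEUE) { // takes QUEUE alone, never QUEUE -> LIST
                // mutate only the queue-guarded state
            }
        }
    }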

I would also not care too much about the performance of disposeAll. That method is a last-resort effort, and if it sits in a lock-free hot path, the application has bigger problems than performance.

I wasn't suggesting removing the locks, just the re-linking that happens only to then destroy the links. But since no new objects are involved, even that process would have a tiny impact, so I agree the current implementation is sufficient.

@dbwiddis
Contributor

Well, I did some benchmarking of my suggestion tonight and got results that, in retrospect, are probably exactly what one should expect. For context, this is an 8-core (16 hyperthreaded) machine.

  • With only 1 thread, obviously, the additional lock doesn't help, and hurts slightly.
  • With a small number of threads (2-3) the added benefit is visible in a benchmark. However, that doesn't translate to better system performance, as that's not the limiting case we're trying to fix.
  • Around 4-7 threads there's very little difference.
  • At 8 threads (fully utilized CPUs, where the allocation tracking becomes more limiting), the benefit disappears and maintaining multiple locks once again hurts performance.
  • Results for threads above 8 were very close to the results for 8 threads.

[chart: throughput vs. thread count]

@dbwiddis
Contributor

dbwiddis commented Aug 1, 2020

I played around with a slightly different variation: instead of a single linked list, I created an array of linked lists, incrementing an atomic integer for each new item added to cycle through the lists. Each list has its own lock. This also showed improved performance, maxing out at about 3-4 lists in the array. This somewhat matches the (unmeasured, but observed in a profiler) improvement from my earlier experiments with the hashmap, where I created multiple hashmaps to reduce lock contention.

These are all using 8 threads; the X axis is the number of linked lists/locks.
[chart: throughput vs. number of linked lists/locks]

@matthiasblaesing
Member

Thank you for the measurements.

For the first measurement (one vs. two locks): We should stay with the currently proposed version from @joerg1985.

The second measurement (array of locked lists): I assume this is what you mean:

  • when a new memory object is created, a static AtomicInteger is incrementAndGet()ed
  • the fetched value is taken modulo the number of lists in use
  • the result is used as a list index; that list is fetched and stored with the Memory object
  • the tracking of allocated memory is done in that local list; only disposeAll needs to lock all lists (see the sketch at the end of this comment)

This comes at a price though: the Memory object literally doubles in size, from a single long to a long and a reference.
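For illustration, a sketch of the striped variant just described (names are invented here; this variation was not merged):

    import java.lang.ref.WeakReference;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch of the striped variant: N independent lists, each
    // guarded by its own lock; new instances are spread over the lists
    // round-robin to reduce contention on insert.
    class StripedTracker {

        private static final int STRIPES = 4;
        private static final Node[] HEADS = new Node[STRIPES];
        private static final Object[] LOCKS = new Object[STRIPES];
        private static final AtomicInteger COUNTER = new AtomicInteger();

        static {
            for (int i = 0; i < STRIPES; i++) {
                LOCKS[i] = new Object();
            }
        }

        static class Node extends WeakReference<Object> {
            final int stripe; // remembered so unlink knows which lock/list to use
            Node prev, next;

            Node(Object referent, int stripe) {
                super(referent);
                this.stripe = stripe;
            }
        }

        static Node track(Object instance) {
            // floorMod keeps the index non-negative if the counter overflows
            int stripe = Math.floorMod(COUNTER.incrementAndGet(), STRIPES);
            Node node = new Node(instance, stripe);
            synchronized (LOCKS[stripe]) {
                node.next = HEADS[stripe];
                if (HEADS[stripe] != null) {
                    HEADS[stripe].prev = node;
                }
                HEADS[stripe] = node;
            }
            return node;
        }

        // disposeAll is the only operation that must lock every stripe
        static void disposeAll() {
            for (int i = 0; i < STRIPES; i++) {
                synchronized (LOCKS[i]) {
                    for (Node n = HEADS[i]; n != null; n = n.next) {
                        // dispose the referent here if it is still reachable
                    }
                    HEADS[i] = null;
                }
            }
        }
    }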

@dbwiddis
Contributor

dbwiddis commented Aug 1, 2020

For the first measurement (one vs. two locks): We should stay with the currently proposed version from @joerg1985.

I agree. I'm OK to merge it as is, although after playing with these classes for a bit I'm still unhappy with the style with regard to the ordering of things. But that's just me. :)

The second measurement (array of locked lists): I assume this is what you mean ...

All correct, except that I stored the list number in the WeakReference object as an int, alongside the prev and next pointers, all of which are ultimately associated with the Memory object.

I'm not suggesting that we merge the array version, though. This is mostly observational regarding the impact of the locking, which we really only need to enable disposeAll(). My preferred solution is to provide a configurable option that completely disables the memory tracking functionality (sketched below).
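A hypothetical sketch of such an opt-out; the property name "jna.memory.tracking" is invented here and is not an actual JNA option:

    // Hypothetical sketch: the property name "jna.memory.tracking" is invented
    // for illustration and is not an actual JNA option.
    private static final boolean TRACKING =
            Boolean.parseBoolean(System.getProperty("jna.memory.tracking", "true"));

    static LinkedReference track(Memory instance) {
        if (!TRACKING) {
            return null; // no bookkeeping at all; disposeAll() becomes a no-op
        }
        return insertAtHead(instance); // hypothetical helper: the usual linked-list insertion
    }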

@dbwiddis
Contributor

dbwiddis commented Aug 1, 2020

For fun, I tried to "improve" the performance by getting rid of the AtomicInteger increment, using some internal value of the object to choose which list to use. Neither the peer nor the hashcode helps; in fact, they make it worse. Results for 8 threads, array size 4:

Benchmark                                   Mode  Cnt        Score        Error  Units
BenchmarkLoop.linkedListLockArray          thrpt   25  1655971.711 ±  92967.766  ops/s
BenchmarkLoop.linkedListLockArrayHashcode  thrpt   25  1411952.577 ± 114018.178  ops/s
BenchmarkLoop.linkedListOneLock            thrpt   25  1514096.333 ±  35480.170  ops/s

So I think there is something to the round-robin nature of the "add to head" locks that is the limiting performance factor here, up to a certain point (4?). Anyway, I don't think we should hold up this PR; it's just an interesting experiment that may inform future attempts to tweak this.

@matthiasblaesing
Member

@dbwiddis I think I know why using peer or hashCode does not help. The memory is allocated aligned, and in my test the alignment was at least 32 bits; running such addresses through the modulo operation will always yield the same value. So the multi-list case degrades to the single-list case, with additional overhead.
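A quick illustration of that degenerate case (example addresses are made up):

    // With (at least) 4-byte-aligned allocations every peer address is a
    // multiple of 4, so peer % 4 selects the same list every time:
    long[] peers = { 0x7f001000L, 0x7f001004L, 0x7f001008L, 0x7f00100cL };
    for (long peer : peers) {
        System.out.println(peer % 4); // always 0: all entries land in list 0
    }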

@matthiasblaesing
Member

OK, if no one objects, I would merge in a day or so. Indeed we can bikeshed about the right place for the definition of the inner class, but in the end all options (extract the class to an external file, move it to the end, move it before the usage) have their benefits and drawbacks. @joerg1985, do you have changes in flight for this, or are you OK to merge as-is?

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

Please hold off on merging for a bit, I think I might have identified a memory leak.

Less importantly, a note on the earlier modulo conversation: I did bit-shift the peer value by 8 because of alignment expectations, and the hashcode calculation involves enough internal math (involving primes) that it distributes well across the values; e.g., hashcode % 4 yields the sequence 0, 3, 2, 0, 2, 1, 3, 1, 0, 1, 3, 2, 2, 0.

@joerg1985
Contributor Author

@matthiasblaesing I am done with this, unless a memory leak is confirmed by @dbwiddis.

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

a memory leak is confirmed

No, I haven't confirmed it. I just have a suspicion that I'm trying to confirm with data. I'm seeing iterations slow down a lot as the length of the benchmark increases, and I'm not sure the performance testing has properly accounted for the full impact of GC when max heap is a limiting factor.

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

Well, I'm not sure whether there's a memory leak or just an error in how I'm running the profiler. Both the existing method and the new method eventually cause a java.lang.OutOfMemoryError if allowed to run long enough. That "long enough" point arrives faster for the linked-list variant due to the additional tracking the linked list requires (prev and next pointers).

However, I am concerned that the benchmarking I have previously done was not in a memory-limited (GC) scenario and thus was heavily biased toward insertion (head) performance. By setting smaller heap limits (@Fork(value = 1, jvmArgs = { "-Xms500M", "-Xmx500M" })) I see iterations slow considerably when the limit is reached, and in that scenario, the added delay due to GC ends up causing worse performance on average.

@joerg1985 can you consider redoing your own benchmarks with longer iterations and/or more limiting heap size?

@joerg1985
Contributor Author

@joerg1985 can you consider redoing your own benchmarks with longer iterations and/or more limiting heap size?

I will rerun the benchmarks on Fedora 32 / OpenJDK 1.8.0_262 with the following settings and post the results here.

    @Benchmark
    @Fork(value = 1, jvmArgs = { "-Xms500M", "-Xmx500M" })
    @Timeout(time = 32, timeUnit = TimeUnit.MINUTES)
    @Measurement(time = 30, timeUnit = TimeUnit.MINUTES)

Dispose or unlink entries as early as possible to avoid OOM exceptions 
under high load with low memory.
@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

I've been playing around a bit more with the benchmarks. I think running close to max heap ends up creating too much garbage, which isn't realistic: in a real application many more objects than just the Memory instances will be created and disposed of. I tried to create a "middle ground" scenario that would still measure the impact of "a little more GC" while staying away from heap limits. I set the max to 1G but manually forced a GC at 500M, basically this:

    private final AtomicInteger counter = new AtomicInteger(); // allocation counter used below

    @Benchmark
    public void weakHashMap(Blackhole blackhole) {
        Memory foo = new Memory(1);
        blackhole.consume(foo);
        // force a GC every 5000 allocations once used heap passes ~500 MB
        if (counter.incrementAndGet() % 5000 == 0
                && (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) > 500_000_000) {
            System.gc();
        }
    }

It's not perfect but I think it's "fair". The new LinkedList still performs well. (JDK14, macOS)

Benchmark                         Mode  Cnt      Score     Error  Units
BenchmarkLoop.linkedListOneLock  thrpt   25  45210.650 ± 302.708  ops/s
BenchmarkLoop.weakHashMap        thrpt   25  36944.042 ± 216.455  ops/s

I see another push, though, and will test that as well.

@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

What is this magic you are working?

Benchmark                          Mode  Cnt       Score        Error  Units
BenchmarkLoop.linkedListOneLock   thrpt   25   45388.427 ±    285.035  ops/s
BenchmarkLoop.linkedListRefQueue  thrpt   25  459204.674 ± 266253.633  ops/s
BenchmarkLoop.weakHashMap         thrpt   25   37046.341 ±    159.251  ops/s
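Presumably the "magic" is the ReferenceQueue-based cleanup introduced in the latest push. A rough sketch of that pattern, assuming that is what the change does (illustrative, not the actual diff):

    import java.lang.ref.ReferenceQueue;

    // Rough sketch of the assumed pattern: the GC enqueues each cleared
    // WeakReference on QUEUE, and track() opportunistically drains the queue
    // and unlinks dead nodes, so no per-instance finalizer is needed.
    private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    static LinkedReference track(Object instance) {
        // drain the queue: these referents are already collected, so there is
        // nothing to dispose; only their list nodes need to be unlinked
        LinkedReference dead;
        while ((dead = (LinkedReference) QUEUE.poll()) != null) {
            dead.unlink();
        }
        // assumes a LinkedReference(referent, queue) constructor that calls
        // super(referent, queue) so the reference registers itself with QUEUE
        LinkedReference ref = new LinkedReference(instance, QUEUE);
        // insert ref at the head of the list under the usual lock, as before
        return ref;
    }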

@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

Here's the non-memory-limited case (shorter runs). Still better than the current implementation.

Benchmark                          Mode  Cnt        Score       Error  Units
BenchmarkLoop.linkedListOneLock   thrpt   25  1497384.617 ± 32869.763  ops/s
BenchmarkLoop.linkedListRefQueue  thrpt   25  1319160.163 ± 75901.557  ops/s
BenchmarkLoop.weakHashMap         thrpt   25  1267802.499 ± 78902.609  ops/s

@joerg1985
Contributor Author

@dbwiddis I was digging into why the WeakHashMap was performing better with limited memory. I can't even tell exactly why this is faster...

Well, I think there is one thing to check: if a lot of instances are released and no new ones are created, there might be references in the queue waiting to be processed. I need to check this; hopefully it will not destroy all the benefits.

@joerg1985
Contributor Author

Okay, I think the described effect of an unprocessed ReferenceQueue stealing memory doesn't exist. I found nothing online (except for PhantomReferences) and could not reproduce the effect on JDK 8 or 11.

This is how I unsuccessfully tried to get into this state:

public static void main(String[] args) throws InterruptedException {
    List<WeakReference<String>> refs = new ArrayList<>();
    ReferenceQueue<String> queue = new ReferenceQueue<>();

    for (long i = 0; i < 10000000; i++) {
        // keep a strong reference to the WeakReference
        refs.add(new WeakReference<>(new String("add some weight"), queue));

        if (i % 1000 == 0) {
            // release the strong references, like the unlink call does in the finalizer
            refs.clear();
        }
    }

    long found = 0;
    List<String> waste = new ArrayList<>();

    while (true) {
        if (queue.poll() != null) {
            if (++found % 1000 == 0) {
                System.out.println("FOUND " + found);
            }
        } else if (waste.size() >= 100000) {
            // drop the accumulated garbage to trigger GC activity
            waste.clear();
        } else {
            waste.add(found + " force some GC " + found);
        }
    }
}

@dbwiddis
Contributor

dbwiddis commented Aug 6, 2020

Thanks for this analysis! I'm not quite sure what it all means. Can you summarize what you think is the best path for this PR and why?

@joerg1985
Contributor Author

I think this PR is working fine, and no side effects are obvious.

The numbers from the benchmarks look good and should now be confirmed by some real-world numbers.

It would be great if the author of #1211 (@Karasiq) could build and benchmark master and this PR's branch to see if it helps there.

I don't have any concurrently running code using JNA to get real-world numbers myself.

Member

@matthiasblaesing matthiasblaesing left a comment


I left two inline comments. My numbers on OpenJDK 11:

Benchmark       Mode   Threads  Samples  Score           Score Error (99.9%)  Unit
5.6.0           thrpt  4        25       1418278.296963  79754.059403         ops/s
5.7.0-SNAPSHOT  thrpt  4        25       1639614.860428  301953.00327         ops/s

The benchmark is simple:

public class MemoryBenchmark {
    @Benchmark()
    public void allocateWithMemory(Blackhole blackhole) {
        blackhole.consume(new Memory(1));
    }
}

Run with:

java -jar target/benchmarks.jar -t max -gc true -rff result.csv

The big error margin for the updated code surprises me, though.

src/com/sun/jna/Memory.java
*/
static LinkedReference track(Memory instance) {
// use a different lock here to allow the finalizer to unlink elements too
synchronized (QUEUE) {
Member


Is this synchronization needed? While the class lacks documentation regarding threading, the OpenJDK implementation protects the reference queue. There are fixed bugs regarding multithreaded enqueue, so I assume that ReferenceQueue should be safe. From my POV it does not matter which thread unlinks, as the unlink will be protected again by the HEAD lock.

Contributor Author


I agree this lock is not needed to avoid concurrency effects with the current OpenJDK implementations, but removing it had a negative effect on performance while I was digging into why the WeakHashMap was performing better with low memory.

The references returned by ReferenceQueue.poll() will always return
null on get(). Just unlink, without trying to get the referent.
@matthiasblaesing
Copy link
Member

I remeasured on a desktop-class machine and it showed about a 25% improvement. Even in the worst case an improvement was visible. I also did a totally unscientific experiment with a smaller heap (512MB) and noticed that the OOME happened later with the new code than with the WeakHashMap.

This change was squashed and merged as:
59b052a

Thank you.
