Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISPN-6805 CacheMgmtInterceptor: use a simplified LongAdder #4442

Conversation

danberindei
Copy link
Member

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 4 times, most recently from 2a8aa7f to 9d7e3ef Compare July 6, 2016 11:59
@rvansa
Copy link
Member

rvansa commented Jul 7, 2016

@danberindei Motivation? Benchmark results?

Note that as the Stripes are allocated in a loop, these probably lie on the same cache line. You can separate them using a hierarchy of classes (A with padding, B with variables, C with padding), or through @sun.misc.Contended but that requires users to set -XX:-RestrictContended (which makes @Contended unusable in the end).

@wburns
Copy link
Member

wburns commented Jul 7, 2016

I am also a bit interested into what was behind this. I know the allocation for the stripes would be less than all of those long adders, but is this solving some additional contention you saw as well?

@wburns wburns added the Question label Jul 7, 2016
@danberindei
Copy link
Member Author

Motivation? Benchmark results?

The motivation is simple, JFR was showing a lot of CPU spent in LongAdder :) I didn't actually think I could do better, but apparently LongAdder is too slow to scale up the number of cells.

Some performance numbers for replicated mode reads, running on 8-way Opterons:

10 threads 110 threads 210 threads 310 threads
LongAdder 12162261 reqs/s 12441921 reqs/s 12399660 reqs/s 12286400 reqs/s
Stripes 14168233 reqs/s 14781110 reqs/s 13964451 reqs/s 13664435 reqs/s

Note that as the Stripes are allocated in a loop, these probably lie on the same cache line.
You can separate them using a hierarchy of classes (A with padding, B with variables, C with padding), or through @sun.misc.Contended but that requires users to set -XX:-RestrictContended (which makes @contended unusable in the end).

I'm pretty sure you only need to use multiple classes when you want to separate 2 fields in the same instance. Regardless of how the JVM reorders fields, it's going to use the same layout for all the instances of a class, so it doesn't matter where exactly the padding is.

Anyway, I don't think padding is needed because cache lines are 64 bytes almost everywhere, and a Stripe is 80 bytes + the object header. I can add a few extra longs, but I only really care about hits and hitTimes, so the others already serve as padding :)

I'm a little more worried about the distribution of the thread id... if it turns out that on Windows or some other OS thread ids are all multiples of 8, that's going to cause some trouble ;)

@rvansa
Copy link
Member

rvansa commented Jul 7, 2016

@danberindei Ok, so LongAdder does not generate enough stripes. You could rather isolate your code to another class than put it directly into CacheMgmtInterceptor.

Wrt padding - that means that one Stripe will be on line 0 and 1, and second on line 1 and 2. And you have a conflict on line 1. Could you try the benchmark with @Contended (activated with -XX:-RestrictContended)?

If you're worried about thread ids, you should use java.lang.Thread#threadLocalRandomProbe in the same way as LongAdder does (using Unsafe). That should be truly random.

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 3 times, most recently from b8d168f to 0affd6b Compare July 8, 2016 14:44
@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch from 0affd6b to 3f2561b Compare July 19, 2016 23:01
@slaskawi
Copy link
Contributor

LGTM. @wburns @rvansa are you ok with merging it?

@slaskawi
Copy link
Contributor

Retriggering CI

@rvansa
Copy link
Member

rvansa commented Jul 26, 2016

@slaskawi No, I am still not satisfied with the padding and I would rather use threadLocalRandomProbe than thread Id. Dan updated one of the commits but did not respond to my last comment (and did not address it in the updated commit).

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 3 times, most recently from 54cdaa3 to d7aaf6f Compare August 5, 2016 11:57
@tristantarrant
Copy link
Member

@danberindei are @rvansa 's comments addressed ?

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 2 times, most recently from 8226291 to 8d7abc0 Compare August 10, 2016 14:08
@danberindei
Copy link
Member Author

@danberindei Ok, so LongAdder does not generate enough stripes. You could rather isolate your code to another class than put it directly into CacheMgmtInterceptor.

I've updated the PR to add a generic StripedCounters class, but still, most of the code is CacheMgmtInterceptor-specific.

Wrt padding - that means that one Stripe will be on line 0 and 1, and second on line 1 and 2. And you have a conflict on line 1. Could you try the benchmark with @contended (activated with -XX:-RestrictContended)?

Yes, in theory you can have a conflict, but you usually don't, because a thread will only update 2 of the counters. Still, I read somewhere that Intel prefetches 2 cache lines at a time, so I've added some more fields for padding. I'm having some trouble with JMH at the moment, but I'll get back to you with some more benchmarks. (On my laptop the 2 versions seem much closer than I remember them, and a definitely a lot closer than on the Opterons...)

If you're worried about thread ids, you should use java.lang.Thread#threadLocalRandomProbe in the same way as LongAdder does (using Unsafe). That should be truly random.

That's not really an option, as we want to wean ourselves off of Unsafe for JDK9. Otherwise I could have used Unsafe for accessing the counter fields as well...

Apparently thread id allocation is platform-independent ATM, but I've added some bit twiddling for future-proofing.

@rvansa
Copy link
Member

rvansa commented Aug 10, 2016

Yes, in theory you can have a conflict, but you usually don't, because a thread will only update 2 of the counters. Still, I read somewhere that Intel prefetches 2 cache lines at a time, so I've added some more fields for padding. I'm having some trouble with JMH at the moment, but I'll get back to you with some more benchmarks. (On my laptop the 2 versions seem much closer than I remember them, and a definitely a lot closer than on the Opterons...)

You still have conflicts in theory there, as the field ordering is not guaranteed in the class. If the padding gets in the middle and some hot fields end up on the beginning and end, there will be conflicts (note that while you've made the class occupy at least 16x8 bytes, it is won't be aligned on the cache line). I don't see why you're so much against the padding > fields > padding class hierarchy. But I won't block this PR; I wash my hands on this.

That's not really an option, as we want to wean ourselves off of Unsafe for JDK9. Otherwise I could have used Unsafe for accessing the counter fields as well...

OK, haven't considered JDK 9, that's a beast to come. Fine then.

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 5 times, most recently from 8bc84cc to 53d1165 Compare August 29, 2016 10:14
@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 6 times, most recently from da52f05 to b207bad Compare September 8, 2016 17:04
@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 2 times, most recently from 657c7ee to 7d38c71 Compare September 19, 2016 06:52
@@ -18,7 +18,7 @@ fi
echo Processing $FILE

if [ -z "$FAILED_TESTS" ] ; then
if $CAT $FILE | head -100 | grep -q TestSuiteProgress ; then
if $CAT $FILE | head -10000 | grep -q TestSuiteProgress ; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it was not intentional :)

@danberindei danberindei force-pushed the ISPN-6805_CacheMgmtInterceptor branch 2 times, most recently from a29bccc to 2bc733d Compare September 26, 2016 07:55
@rvansa
Copy link
Member

rvansa commented Sep 26, 2016

I would use more meaningful name of the third commit, but otherwise I am satisfied with the padding.

The JVM is free to reorder fields inside a class, but not to mix
fields from subclasses and superclasses.
@danberindei
Copy link
Member Author

It took a while @rvansa, but I finally ran the local read benchmark with the extra padding you suggested and it seems to make a small, but visible difference:

80 threads, my laptop 80 threads, cluster10
LongAdder 20243520 reqs/s 17190979 reqs/s
Stripes 22011241 reqs/s 17338521 reqs/s
Extra-padding 22212625 reqs/s 17444039 reqs/s

TBH I'm a bit surprised that I'm seeing a 10% difference on my machine now and almost no difference on cluster10, I'm pretty sure it was the other way around when I first made the change :)

I also got the benchmark results from @rmacor, and they still show ~10% improvement in replicated reads throughput:

iteration 1 iteration 5
master 17536470 reqs/s 17853700 reqs/s
ISPN-6805 (without 3rd commit) 19913319 reqs/s 19860132 reqs/s

@tristantarrant
Copy link
Member

@rvansa unless you have more comments, I am going to merge this

@rvansa
Copy link
Member

rvansa commented Sep 27, 2016

@tristantarrant 👍

@tristantarrant
Copy link
Member

Pushed to master, thank.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants