
Sharded counter optimisation #109

Open · wants to merge 3 commits into base: main

Conversation

@JackThomson2 (Collaborator) commented Aug 14, 2022

This is picking up from the great work from @jimvdl in this PR: #101

I've added the crate here: https://github.com/JackThomson2/fast-counter. It uses a basic sharded concurrent counter whose shard is selected via a thread_local. In the worst case it seems slightly slower, but with more cores it starts to pull ahead.
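For a rough idea of the approach, here is a minimal sketch of a sharded counter selected via a thread_local (illustrative only; the shard count, naming, and shard-assignment scheme are assumptions, not the actual fast-counter code):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};

const NUM_SHARDS: usize = 16;

static NEXT_THREAD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread is assigned a shard index once; later lookups are a cheap Cell read.
    static SHARD: Cell<usize> =
        Cell::new(NEXT_THREAD.fetch_add(1, Ordering::Relaxed) % NUM_SHARDS);
}

/// A counter split across several atomic shards so that concurrent
/// updates from different threads mostly hit different cache lines.
pub struct ShardedCounter {
    shards: Box<[AtomicIsize]>,
}

impl ShardedCounter {
    pub fn new() -> Self {
        Self {
            shards: (0..NUM_SHARDS).map(|_| AtomicIsize::new(0)).collect(),
        }
    }

    pub fn add(&self, delta: isize) {
        let idx = SHARD.with(Cell::get);
        self.shards[idx].fetch_add(delta, Ordering::Relaxed);
    }

    /// Sum of all shards; may be transiently negative under concurrent updates.
    pub fn sum(&self) -> isize {
        self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}
```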

These are the results using the conc-bench repo on my 16 core machine:

[screenshot: benchmark results]



@codecov codecov bot commented Aug 20, 2022

Codecov Report

Merging #109 (c92dbf2) into master (85ac469) will decrease coverage by 6.39%.
The diff coverage is 90.00%.

Additional details and impacted files
| Impacted Files | Coverage Δ |
| --- | --- |
| src/map.rs | 75.72% <88.88%> (-4.86%) ⬇️ |
| src/node.rs | 73.84% <100.00%> (-4.69%) ⬇️ |
| src/serde_impls.rs | 83.33% <0.00%> (-7.98%) ⬇️ |
| src/rayon_impls.rs | 91.66% <0.00%> (-7.00%) ⬇️ |
| src/iter/traverser.rs | 82.89% <0.00%> (-6.95%) ⬇️ |
| src/iter/mod.rs | 100.00% <0.00%> (ø) |
| src/lib.rs | 16.12% <0.00%> (ø) |
| src/raw/mod.rs | 80.00% <0.00%> (+0.27%) ⬆️ |

@jonhoo (Owner) left a comment


Nice, thanks for pushing this forward!

I took a look at the code for fast-counter, and had a couple of thoughts:

  • I think THREAD_ID can be a Cell<usize> instead of an UnsafeCell. Should let you get rid of the unsafe block you have in there.
  • On nightly you can use ThreadId::as_u64 rather than keeping THREAD_ID yourself.
  • It's not entirely clear to me why you have stable and nightly separated? The use of #[thread_local] doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.

src/map.rs (review thread, outdated and resolved)
@JackThomson2 (Collaborator, Author) commented

> Nice, thanks for pushing this forward!
>
> I took a look at the code for fast-counter, and had a couple of thoughts:
>
>   • I think THREAD_ID can be a Cell<usize> instead of an UnsafeCell. Should let you get rid of the unsafe block you have in there.
>   • On nightly you can use ThreadId::as_u64 rather than keeping THREAD_ID yourself.
>   • It's not entirely clear to me why you have stable and nightly separated? The use of #[thread_local] doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.

Thanks for the feedback, all great points, I'll try to address these soon.

The nightly/stable split just happened as a result of me starting with the nightly version first, and originally the performance difference was greater. I left it as an isize since the previous PR used this approach; happy to update it to a usize 👍

@JackThomson2 (Collaborator, Author) commented

Hey @jonhoo, I've been looking into this and have a few thoughts and findings:

  • I tested using ThreadId::as_u64; this proved to be significantly slower. From a quick look in Godbolt it seems we have to deal with Arcs etc. The result was 2-3x slower consistently.
  • I tried using Cell over UnsafeCell. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular AtomicUsize at every core count.
  • I will look at removing the nightly option; the results are very close now, so it's probably not needed.
  • I looked at holding a usize; the issue I found is that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a usize and take a usize but use isize internally?

Thanks again for reviewing this

@jonhoo (Owner) commented Oct 8, 2022

Sorry for the super slow reply!

> I tested using ThreadId::as_u64; this proved to be significantly slower. From a quick look in Godbolt it seems we have to deal with Arcs etc. The result was 2-3x slower consistently.

Looking at the std source, it seems like the Arc is in the Thread type, whereas the ThreadId type is just a plain integer. I'm guessing this ends up slow because you're either calling thread::current() on every access (which is slow) or keeping the Thread around, which means going through the Arc each time to get to the numeric ID. I think the way to fix this is to store the ThreadId in a thread-local. That way you should have super cheap access to it (ThreadId::as_u64 is just a field access) without having to keep an ID counter yourself.
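
Something along these lines, perhaps (just a sketch; assumes the nightly thread_id_value feature that gates ThreadId::as_u64):

```rust
#![feature(thread_id_value)] // nightly-only: enables ThreadId::as_u64

use std::thread;

thread_local! {
    // Lazily initialised on first access in each thread, so thread::current()
    // (and the Arc inside Thread) is only touched once per thread.
    static TID: thread::ThreadId = thread::current().id();
}

fn shard_index(num_shards: usize) -> usize {
    // After the first call this is just a thread-local read plus a field access.
    TID.with(|id| id.as_u64().get() as usize % num_shards)
}
```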

> I tried using Cell over UnsafeCell. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular AtomicUsize at every core count.

That's super weird! The code for Cell is really just an UnsafeCell with #[repr(transparent)], and Cell::set just does mem::replace(unsafe { self.0.get() }, v), which should cost exactly as much as just setting the value, and certainly not as much as running atomic operations. Could you give the raw numbers you observed to see if we can figure out what's going on?
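
To illustrate, the two thread-local accessors below should compile down to essentially the same plain loads and stores, with no atomics involved either way (a standalone hypothetical example, not the fast-counter code):

```rust
use std::cell::{Cell, UnsafeCell};

thread_local! {
    static ID_UNSAFE: UnsafeCell<usize> = UnsafeCell::new(0);
    static ID_CELL: Cell<usize> = Cell::new(0);
}

// Post-increment via UnsafeCell, requiring an unsafe block.
fn bump_unsafe() -> usize {
    ID_UNSAFE.with(|c| unsafe {
        let v = *c.get();
        *c.get() = v + 1;
        v
    })
}

// The same operation via Cell: safe code, same cost in principle.
fn bump_cell() -> usize {
    ID_CELL.with(|c| {
        let v = c.get();
        c.set(v + 1);
        v
    })
}
```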

> I will look at removing the nightly option; the results are very close now, so it's probably not needed.

Nice!

> I looked at holding a usize; the issue I found is that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a usize and take a usize but use isize internally?

Ah, yes, that's a great point, and a totally valid reason for using an isize for the sharded counter. I think I'd still prefer for the API to return a usize because the sum should never be negative, which I think we can simply assert with a debug_assert! (so it doesn't affect release build performance). It's fine to make add take an isize though I think. It's a little weird, but not the end of the world.

> Thanks again for reviewing this

Thanks for digging so deep!

@jonhoo (Owner) commented Oct 8, 2022

Actually, on second thought, in a concurrent setting with lazily updated counters, it's entirely possible that the sum of the counters ends up being negative for some shorter period of time. So I think we should actually leave the return value as an isize 👍
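
A contrived, single-threaded illustration of that interleaving (the deltas are applied out of order on purpose, to mimic what two concurrent threads could make a reader observe):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

fn main() {
    // Two shards, both starting at zero: the map is empty.
    let shards = [AtomicIsize::new(0), AtomicIsize::new(0)];

    // Thread B (on shard 1) removes an entry whose insertion by thread A
    // (on shard 0) has not been counted yet.
    shards[1].fetch_add(-1, Ordering::Relaxed);

    // A reader summing the shards at this point sees a negative total...
    let sum: isize = shards.iter().map(|s| s.load(Ordering::Relaxed)).sum();
    assert_eq!(sum, -1);

    // ...which corrects itself once thread A's increment lands.
    shards[0].fetch_add(1, Ordering::Relaxed);
    let sum: isize = shards.iter().map(|s| s.load(Ordering::Relaxed)).sum();
    assert_eq!(sum, 0);
}
```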

Comment on lines +1158 to +1161

```rust
let _saw_bin_length = match resize_hint {
    Some(hint) => hint,
    None => return
};
```
@jonhoo (Owner) commented

nit: I'd probably make this an if let Some
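
i.e. something roughly like this (one possible reading of the suggestion; the wrapper function and the Option<usize> type are assumed for illustration, and the rest of the original function body would follow unchanged):

```rust
fn note_resize(resize_hint: Option<usize>) {
    // Bail out early when there is no hint, binding the value otherwise.
    let _saw_bin_length = if let Some(hint) = resize_hint {
        hint
    } else {
        return;
    };

    // ...rest of the logic that used to follow the match...
}
```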

@jonhoo (Owner) commented Dec 4, 2022

Looks like there are some CI failures now?

@jimvdl (Contributor) commented Dec 4, 2022

I've been following along somewhat closely and I'm happy to see that you guys have made good progress. 🎉 If either of you ever need assistance (with anything, really) you can always @ me :) (although I'm fairly certain you've both got this in the bag). Well done Jack & Jon!

@JackThomson2 (Collaborator, Author) commented

> Looks like there are some CI failures now?

I saw these; I'll take a look at resolving them when I get a chance! Thanks for running the workflow.
