
Added support to configure lower bound on per-thread cache size #1515

Conversation

coder-saab001
Contributor

Support to configure the per-thread cache size is added, along with the required unit tests.

Related issues: #1511

@coder-saab001 force-pushed the configure-per-thread-cache branch 2 times, most recently from aa6bde3 to dfb008a on May 17, 2024 at 23:41
@alk
Contributor

alk commented May 18, 2024

This is much better. A few notables:

  • Can we be consistent and clean on names? You use per_thread_cache_size and max_per_thread_cache_bytes in different contexts to refer to the same thing. This invites confusion and mistakes. And perhaps we can make it even clearer? (Yes, naming is hard, but at least let's avoid confusing people.)
  • Let's avoid repeating the min(kMinThreadCacheSize, ...) expression; just factor it into a variable. And let's make it explicitly atomic (volatile invites even more confusion and actual races, while atomic makes your memory loads explicit and clear).
  • I don't get the logic in ThreadCache::set_max_per_thread_cache_size and its interaction with the "set overall" thingy. Perhaps add more comments?
  • The test makes unsynchronized accesses to various variables that are being waited on, things like "ready", "startfill_counter", etc. They are updated unsynchronized and waited on synchronized, so you're risking a time-of-check/time-of-use race. Just make everything guarded by locks. Also, please avoid unnecessary unique_locks and explicit unlock() calls. You'll also make it simpler if you guard everything with one lock (separate condition variables are fine; and by the way, please consider naming them better, after what we're waiting for with each of them).
  • In the test's CurrentThreadCacheSize there is an EXPECT_TRUE, which is okay, but then consider initializing result to 0, given that it might fail. (I'd just make things crash if it fails.)
  • In the test's max_cache_size logic, why not have it after the loop?
  • Currently, when the test fails, it crashes due to failing to join threads. Either just leak them (with an explicit comment) or use tcmalloc::Cleanup, so that if an ASSERT fails and exits the test body early, your threads get joined and the std::thread destructors stay happy (a sketch of the RAII idea follows this list).
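
For illustration, a minimal sketch of the RAII-cleanup idea (the class name is hypothetical, and this is not necessarily tcmalloc::Cleanup's actual interface):

```cpp
#include <thread>
#include <vector>

// Joins all threads on scope exit, so an early return from a failed
// ASSERT still leaves the std::thread destructors happy.
class ThreadJoiner {
 public:
  explicit ThreadJoiner(std::vector<std::thread>* threads)
      : threads_(threads) {}
  ~ThreadJoiner() {
    for (auto& t : *threads_) {
      if (t.joinable()) t.join();
    }
  }

 private:
  std::vector<std::thread>* threads_;
};
```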

P.S. I am still somewhat skeptical of the overall direction and approach here. Please consider looking at this from a holistic perspective (whether this is the "fix" you're looking for). With that said, as promised, if the quality is good enough I'll still merge it.

@alk
Contributor

alk commented May 18, 2024

Ah, also, please update the copyright year in the test .cc you've added.

@coder-saab001
Contributor Author

Hey @alk, thanks for taking a deep look. Yes, this is the fix I am looking for; it will give the user more power to control the cache.
Acknowledged and answered your comments as follows:

  • Changed all the internal functions/variables to use per_thread_cache_size. But whatever we expose to the outside world has to be max_per_thread_cache; it makes things more obvious. The same is already being done for overall_thread_cache.
  • A new variable named min_possible_per_thread_cache_size_ is added.
  • Done
  • In C++, a condition variable only works with unique locks, so the unique_lock is needed. The explicit unlock() calls are needed to use the same unique_lock for both the cv_startfill and cv_filled condition variables. Alternatively, I can use two different unique_locks, each with its own scope, but this version looks cleaner to me; just let me know. And of course, I changed the variable names and all.
  • Done
  • I moved the logic after the loop. It was just to make sure we pick the maximum possible thread cache size.
  • Done

@alk
Contributor

alk commented May 25, 2024

I'll have a closer look sometime later, but I immediately see you're still making the same mistake with waiting and waking up. A "naked" counter bump and notify races with the waiting side, which takes the lock (which achieves nothing, since the notifying side ignores locking), checks the condition, may find the counter not yet ready to continue, and goes to sleep. In between that check and going to sleep there is a race. This race is precisely why condition variable APIs are paired with a mutex. Please fix.

And again, things will be simpler and nicer to read if you just use one mutex to guard everything.
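
A minimal sketch of the pattern being asked for, with the counter bumped under the same mutex the waiter uses (names here are illustrative, not the test's actual identifiers):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
int counter = 0;  // guarded by mtx

// Notifying side: bump the counter under the lock, then notify.
void Bump() {
  {
    std::lock_guard<std::mutex> lock(mtx);
    ++counter;
  }
  cv.notify_all();
}

// Waiting side: the predicate is re-checked under the same lock, so an
// update cannot slip in between the check and the sleep.
void WaitFor(int target) {
  std::unique_lock<std::mutex> lock(mtx);
  cv.wait(lock, [&] { return counter >= target; });
}
```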

@coder-saab001
Contributor Author

Thanks for the quick reply.

Can you please have a look again, @alk?

@alk
Contributor

alk commented May 25, 2024

This is better. Now you don't need those counters to be atomic. There is also no need to unlock and then immediately re-lock the mutex when waiting in the main function; I'd use block scope to avoid the explicit unlock.
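
For instance, a block scope releases the lock without any explicit unlock() (a sketch; the names follow the discussion above):

```cpp
{
  std::lock_guard<std::mutex> lock(mtx);
  ++startfill_counter;      // shared state is only touched under the lock
}                           // lock is released here, at end of scope
cv_startfill.notify_all();  // notify after the lock is gone
```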

Also please handle the join/cleanup thingy. Just artificially make the final assert fail and you'll see what I am talking about.

@coder-saab001
Contributor Author

Done @alk.
The join/cleanup thingy has already been taken care of using tcmalloc::Cleanup.

@alk
Contributor

alk commented May 27, 2024

Took a closer look. So now that the test is clearer, there is seemingly no value at all in those counters and condition variables: the filler threads start and finish without waiting for anyone, and the main thread can simply wait for them to finish by joining the fillers.

For the main change, the real problem appears to be that per_thread_cache_size_ is barely used. I did some git archaeology: tcmalloc once had a "static cache size" mode, and that is where it was used. What we have now is an incorrect comment that it is being read without synchronization (and thus volatile is unnecessary, and thus your change to atomic is unnecessary too). The resultant confusion of meanings is the real blocker for your change, and the naming unclarity is just a symptom. Something needs to be done about it.

Also, for the future: we don't initialize things with <type> <name>(value) syntax; it looks too much like a function declaration. And I also insist that all reads and writes of atomic variables be explicit .load(<memory order>) and .store(...) calls.
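
To illustrate both points (a sketch; the initial value is a placeholder):

```cpp
#include <atomic>
#include <cstddef>

// Brace initialization, not `size_t x(value)`, which reads too much
// like a function declaration.
static std::atomic<size_t> min_per_thread_cache_size_{512 << 10};

size_t min_per_thread_cache_size() {
  // An explicit load with an explicit memory order...
  return min_per_thread_cache_size_.load(std::memory_order_relaxed);
}

void set_min_per_thread_cache_size(size_t v) {
  // ...and an explicit store, instead of implicit conversions.
  min_per_thread_cache_size_.store(v, std::memory_order_relaxed);
}
```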

@alk
Contributor

alk commented May 27, 2024

Why not simply add a tcmalloc.minimal_per_thread_cache_size setting and the most trivial implementation of it?

And then I think RecomputePerThreadCacheSize should be changed to not use per_thread_cache_size_, and perhaps to do the ratio logic based on the threads' max sizes. This would amputate the per_thread_cache_size_ thingy entirely. But it can be done separately (and I'll be thankful if you handle it, but if not, I'll get to it eventually).

@coder-saab001 force-pushed the configure-per-thread-cache branch 2 times, most recently from db9c22d to fca78a9 on May 27, 2024 at 18:30
@coder-saab001
Contributor Author

Thanks @alk for all the deep analysis.
Sure, let's take that separately; I will pick it up after this one.

Refactored the whole CR from setting per_thread_cache_bytes to min_per_thread_cache_bytes; I hope that looks good.

Please have a look.

@alk
Contributor

alk commented May 27, 2024

Better. Still a few things:

  • Your documentation update refers to the non-existent tcmalloc.max_per_thread_cache_bytes setting. Was the intention to refer to max_total_thread_cache_bytes instead?
  • The "deferred" cleanup/joining of threads now makes the test's assertion race with the threads updating max_cache_size. You probably want to join the threads before asserting (see the sketch after this list)? In addition to that, I recommend you "test" your test by making sure it fails without your change.
  • Do we need the page heap lock when setting min_per_thread_cache_size? (I think not.)
  • Since we're accessing min_per_thread_cache_size unsynchronized, I recommend you make it atomic (but read it with std::memory_order_relaxed; we don't need any perf hit from the atomic access).
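
The join-before-assert point from the second bullet, as a sketch (the constant and variable names are placeholders):

```cpp
// Join all filler threads first, so the assertion below cannot race
// with a thread that is still updating max_cache_size.
for (auto& t : threads) {
  t.join();
}
ASSERT_GE(max_cache_size, kMinPerThreadCacheSize);
```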

@alk
Contributor

alk commented May 27, 2024

Ah, also: emplace_back instead of push_back(std::thread(...))?
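
That is (FillerThread here is a hypothetical thread body):

```cpp
std::vector<std::thread> threads;
// Constructs the std::thread in place from the arguments...
threads.emplace_back(FillerThread, i);
// ...instead of move-constructing it from a temporary:
// threads.push_back(std::thread(FillerThread, i));
```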

@alk
Contributor

alk commented May 27, 2024

Also, for set_min_per_thread_cache_size, why not inline it directly in the class definition? You have the getter inline but not the setter.

@coder-saab001
Contributor Author

Thanks @alk

  • Thanks for that; I forgot to remove it from the previous versions of the change.
  • Makes sense, I don't know how I missed it. And yeah, I tested it, and it fails without this change.
  • Yeah, it seems so. Removed.
  • Done
  • Much better ; )
  • Done

@alk
Contributor

alk commented May 27, 2024

Better. A couple more things, nearly all cosmetic:

  • The include has to be after config.h. In general, the order is config.h, then the "matching header" (if we're in a .cc), then C headers, then C++ headers, then project headers (generally sorted within each group). We don't 100% always follow this rule, but it is the rule (see also the Google C++ style guide; but note that its modern editions don't special-case config.h, which always needs to be first).
  • The page heap lock is still being held around the reading of tcmalloc.min_per_thread_cache_bytes.
  • Inside the test, I think having the mutex declared static right inside the locking block would be more idiomatic and easier to read, i.e. mtx is only ever used there. Also, updating max_cache_size via std::max rather than an explicit if would be more idiomatic: readers of the code then don't need to decipher "aha, this is taking a maximum" (see the sketch after this list).
  • In thread_cache.cc there are a few places that do "redundant" loads of min_per_thread_cache_size, causing possible races. I highly recommend you load it once per use, e.g. set SetMaxSize to the min size value and decrement the unclaimed space by that same amount (instead of by a freshly loaded min size value); this is also sketched after the list.
  • Ah, and also: in the test where you SetNumericProperty, it is worth checking that the property name was understood (i.e. that setting the property returned true).
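
Sketches of the third and fourth points, under stated assumptions (RecordCacheSize, max_cache_size, and the SetMaxSize call site are illustrative, not the actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>

static size_t max_cache_size;  // zero-initialized global

void RecordCacheSize(size_t current) {
  // The mutex is only ever used here, so declare it static inside the
  // block it guards.
  static std::mutex mtx;
  std::lock_guard<std::mutex> lock(mtx);
  // std::max says "take the maximum" directly; nothing to decipher.
  max_cache_size = std::max(max_cache_size, current);
}
```

And loading the atomic once per use:

```cpp
// One load per use: every decision in this scope sees the same value,
// instead of racing between two fresh loads.
const size_t min_size =
    min_per_thread_cache_size_.load(std::memory_order_relaxed);
heap->SetMaxSize(min_size);
unclaimed_cache_space_ -= min_size;  // decremented by that same amount
```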

@alk
Contributor

alk commented May 27, 2024

Ah, also: since we've changed the approach to only touch the min size, I recommend you update the commit subject line too.

@coder-saab001
Contributor Author

Thanks again @alk. Hoping this one looks fine.

@coder-saab001 changed the title from "Added support to configure per-thread cache" to "Added support to configure lower bound on per-thread cache size" on May 27, 2024
@alk
Contributor

alk commented May 27, 2024

Nearly perfect. But I just noticed that you still add a (seemingly unused) getter for per_thread_cache_size. Also, since you're changing things anyway: there is a small typo in the documentation you're adding, "takes affect"; I think what is meant is "takes effect". And I am not sure this statement is accurate (but let me know if I misunderstand): "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes."

@coder-saab001
Contributor Author

Hey @alk,
The statement is accurate in the sense that configuring tcmalloc.min_per_thread_cache_bytes will have little effect unless we also set tcmalloc.max_total_thread_cache_bytes, because it's the second one where we call RecomputePerThreadCacheSize(). The first one is just used to relax the lower bound while computing the per-thread cache size.

@coder-saab001
Contributor Author

Let me know if we want to change the behaviour; it's just that this one looks better to me.

@alk
Contributor

alk commented May 29, 2024

I anticipate that the main impact of the min cache size setting in practice is going to be via the IncreaseCacheLimit logic. That is, we expect the min cache size to be set quite early, when usually only the main thread exists and quite possibly hasn't yet increased its cache size cap either. Then, as threads are created and compete for the cache size limit, is when your new setting is going to do its job.

Let me know if you disagree.

@coder-saab001
Contributor Author

coder-saab001 commented May 29, 2024

Yeah, correct @alk.
Basically, everything will be the same as it was with the tcmalloc.max_total_thread_cache_bytes logic; it's just that the lower limit to which a per-thread cache can stoop can now be configured beforehand.

@coder-saab001
Contributor Author

Agreed that this is not as obvious a setting as the previous per_thread_cache one, which has been refactored.

@alk
Contributor

alk commented Jun 1, 2024

Hi. So, should I wait for you to update the comment?

@coder-saab001
Contributor Author

Hey @alk, which comment do you want me to update, specifically?

@alk
Contributor

alk commented Jun 3, 2024

I am referring to the comment "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes." in the docs. As noted above, I think it is factually incorrect and misleading, but do let me know if you disagree.

@coder-saab001
Contributor Author

coder-saab001 commented Jun 3, 2024

Ohh yeah, that is incorrect. Removed the comment, @alk :)

@coder-saab001
Contributor Author

coder-saab001 commented Jun 3, 2024

Thanks a lot @alk, I learned a lot as part of this.
Do you want to take a final look, and then maybe we can go forward with merging it?

@alk
Contributor

alk commented Jun 3, 2024

So, unfortunately, the test doesn't always fail if I comment out the min thread size setting. The best reproduction is to pin the test program to a single core (i.e. taskset -c 0 ./min_per_thread_cache_size_test). And it makes sense: nothing is being waited for in the filler threads, and they don't do much work, so one may easily finish before the next one is started.

Also, the way the max thread cache size is regulated, it isn't really guaranteed: one thread "stealing" max size bytes from another doesn't actually shrink the other thread's cache. So it isn't quite clear how best to test this.

Perhaps we could expose the threads' total max size aggregation to the test, instead of the actual free list sizes, and inspect that. Then each filler thread would simply stop at the end, waiting for a "go ahead and die" signal, and the main thread could do the stats inspection; there would be no need for locking around each thread inspecting the value and max-ing it.

Another minor thing (since this requires more iteration anyway): assigning the atomic min_per_thread_cache_size_ should be done with an explicit .store call, as I noted above. I'd have allowed that without an extra iteration, but since we need to iterate more, it shouldn't be much trouble.

@alk
Contributor

alk commented Jun 3, 2024

Alternatively, the filler threads can simply round-robin using some sort of synchronization facility. I'd maybe do that, but I am not sure how comfortable you are with arranging somewhat less common synchronization schemes.
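
A sketch of one such round-robin scheme, assuming each filler thread does a fixed number of passes (all names are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kNumThreads = 4;
constexpr int kNumPasses = 10;

std::mutex mtx;
std::condition_variable cv;
int turn = 0;  // id of the thread allowed to run its next pass

void FillerThread(int id) {
  for (int pass = 0; pass < kNumPasses; ++pass) {
    std::unique_lock<std::mutex> lock(mtx);
    // Sleep until it is this thread's turn; the predicate is checked
    // under mtx, so there is no check/sleep race.
    cv.wait(lock, [id] { return turn == id; });
    // ... do one pass of allocations/frees here ...
    turn = (turn + 1) % kNumThreads;  // hand the turn to the next thread
    cv.notify_all();
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < kNumThreads; ++i) threads.emplace_back(FillerThread, i);
  for (auto& t : threads) t.join();
  return 0;
}
```

Since every thread must take its turn in each round, all filler threads stay alive across all the passes and genuinely compete for cache, even when pinned to one core.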

@coder-saab001
Contributor Author

Agreed, @alk.
Tried implementing the round-robin thing. Happily, the test now always fails without the setting, even with taskset -c 0 ./min_per_thread_cache_size_test. Please check.

@alk
Contributor

alk commented Jun 4, 2024

Not bad. Seems to be correct. A few things:

  • I think the base/cleanup.h include is not used anymore.
  • Why do you lock and unlock the mutex on each of the ten "passes"? There's no need.
  • You chose kFooBar-style constants for the number of allocations and number of threads (I don't mind), but then hardcoded a "naked" literal 10 for the number of passes. Mixing such styles is a slight violation of the principle of least surprise. Can you please fix?
  • The special lock around getting the total thread cache usage and max-ing it isn't needed anymore; you already have enough serialization. Please fix.
  • Ah, and by the way, there is no need to explicitly initialize global variables to 0. Really minor, but it somehow catches my attention every time.
  • Ah, and another minor thing: it would be slightly more idiomatic to use a "new style" for loop to iterate over the vector of threads to join. No need for an index; just do for (auto& t : threads) { t.join(); } or something like that.

@coder-saab001
Contributor Author

Thanks @alk.
Addressed your comments.

@alk
Contributor

alk commented Jun 4, 2024

Thanks. One final thing I missed last time: can you please make the getter for the min per-thread size also use an explicit load? And please make both the store and the load use explicit relaxed memory order.

@coder-saab001
Contributor Author

Done @alk

@alk
Contributor

alk commented Jun 4, 2024

Applied. Thanks. I had to make a tiny change to your commit (added a signed-off line). You had added a spurious newline at the end of thread_cache.h, so I removed that addition.

@alk closed this on Jun 4, 2024