
Added support to configure lower bound on per-thread cache size #1515

Conversation

coder-saab001
Contributor

Support to configure the per-thread cache size is added, along with the required unit tests.

Related issues: #1511

@coder-saab001 force-pushed the configure-per-thread-cache branch 2 times, most recently from aa6bde3 to dfb008a on May 17, 2024 at 23:41
@alk
Contributor

alk commented May 18, 2024

This is much better. A few notables:

  • Can we be consistent and clean on names? You use per_thread_cache_size and max_per_thread_cache_bytes in different contexts to refer to the same thing. This invites confusion and mistakes. And perhaps we can make it even clearer? (Yes, naming is hard, but at least let's avoid confusing people.)
  • Let's avoid repeating the min(kMinThreadCacheSize, ...) expression; just factor it into a variable. And let's make it explicitly atomic (volatile invites even more confusion and actual races, while atomic makes your memory loads explicit and clear).
  • I don't get the logic in ThreadCache::set_max_per_thread_cache_size and its interaction with the "set overall" thingy. Perhaps add more comments?
  • The test makes unsynchronized accesses to various variables that are being waited on, things like "ready", "startfill_counter", etc. They are updated unsynchronized and waited on synchronized, so you're risking a time-of-check/time-of-use race. Just make everything guarded by locks. Also, please avoid unnecessary unique_locks and explicit unlock() calls. You'll also make it simpler if you guard everything with one lock (separate condition variables are fine; and by the way, please consider naming them better, after what we're waiting for with each of them).
  • In the test's CurrentThreadCacheSize there is an EXPECT_TRUE, which is okay, but then consider initializing result to 0, given that it might fail. (I'd just make things crash if it fails.)
  • In the test's max_cache_size logic, why not have it after the loop?
  • Currently, when the test fails, it crashes due to failing to join threads. Either just leak them (with an explicit comment) or use tcmalloc::Cleanup, so that if an ASSERT fails and exits the test body early, your threads get joined and the std::thread destructors stay happy (a sketch of the RAII idea follows this list).
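
For illustration, a minimal sketch of the RAII-cleanup idea (the class name is hypothetical, and this is not necessarily tcmalloc::Cleanup's actual interface):

```cpp
#include <thread>
#include <vector>

// Joins all threads on scope exit, so an early return from a failed
// ASSERT still leaves the std::thread destructors happy.
class ThreadJoiner {
 public:
  explicit ThreadJoiner(std::vector<std::thread>* threads)
      : threads_(threads) {}
  ~ThreadJoiner() {
    for (auto& t : *threads_) {
      if (t.joinable()) t.join();
    }
  }

 private:
  std::vector<std::thread>* threads_;
};
```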

P.S. I am still somewhat skeptical of the overall direction and approach here. Please consider looking at this from a holistic perspective (whether this is the "fix" you're looking for). With that said, as promised, if the quality is good enough I'll still merge it.

@alk
Contributor

alk commented May 18, 2024

Ah, also, please update the copyright year in the test .cc you've added.

@coder-saab001
Contributor Author

Hey @alk, thanks for taking a deep look. Yes, this is the fix I am looking for; it will give the user more power to control the cache.
Acknowledged and answered your comments as follows:

  • Changed all the internal functions/variables to use per_thread_cache_size. But whatever we expose to the outside world has to be max_per_thread_cache; it makes things more obvious. The same is already being done for overall_thread_cache.
  • A new variable named min_possible_per_thread_cache_size_ is added.
  • Done
  • In C++, a condition variable only works with unique locks, so the unique_lock is needed. The explicit unlock() calls are needed to use the same unique_lock for both the cv_startfill and cv_filled condition variables. Alternatively, I can use two different unique_locks, each with its own scope, but this version looks cleaner to me; just let me know. And of course, I changed the variable names and all.
  • Done
  • I moved the logic after the loop. It was just to make sure we pick the maximum possible thread cache size.
  • Done

@alk
Contributor

alk commented May 25, 2024

I'll have a closer look sometime later, but I immediately see you're still making the same mistake with waiting and waking up. A "naked" counter bump and notify races with the waiting side, which takes the lock (which achieves nothing, since the notifying side ignores locking), checks the condition, may find the counter not yet ready to continue, and goes to sleep. In between that check and going to sleep there is a race. This race is precisely why condition variable APIs are paired with a mutex. Please fix.

And again, things will be simpler and nicer to read if you just use one mutex to guard everything.
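
A minimal sketch of the pattern being asked for, with the counter bumped under the same mutex the waiter uses (names here are illustrative, not the test's actual identifiers):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
int counter = 0;  // guarded by mtx

// Notifying side: bump the counter under the lock, then notify.
void Bump() {
  {
    std::lock_guard<std::mutex> lock(mtx);
    ++counter;
  }
  cv.notify_all();
}

// Waiting side: the predicate is re-checked under the same lock, so an
// update cannot slip in between the check and the sleep.
void WaitFor(int target) {
  std::unique_lock<std::mutex> lock(mtx);
  cv.wait(lock, [&] { return counter >= target; });
}
```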

@coder-saab001
Contributor Author

Thanks for the quick reply.

Can you please have a look again, @alk?

@alk
Contributor

alk commented May 25, 2024

This is better. Now you don't need those counters to be atomic. There is also no need to unlock and then immediately re-lock the mutex when waiting in the main function; I'd use block scope to avoid the explicit unlock.
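
For instance, a block scope releases the lock without any explicit unlock() (a sketch; the names follow the discussion above):

```cpp
{
  std::lock_guard<std::mutex> lock(mtx);
  ++startfill_counter;      // shared state is only touched under the lock
}                           // lock is released here, at end of scope
cv_startfill.notify_all();  // notify after the lock is gone
```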

Also please handle the join/cleanup thingy. Just artificially make the final assert fail and you'll see what I am talking about.

@coder-saab001
Contributor Author

Done @alk.
The join/cleanup thingy has already been taken care of using tcmalloc::Cleanup.

@alk
Contributor

alk commented May 27, 2024

Took a closer look. So now that the test is clearer, there is seemingly no value at all in those counters and condition variables: the filler threads start and finish without waiting for anyone, and the main thread can simply wait for them to finish by joining the fillers.

For the main change, the real problem appears to be that per_thread_cache_size_ is barely used. I did some git archaeology: tcmalloc once had a "static cache size" mode, and that is where it was used. What we have now is an incorrect comment that it is being read without synchronization (and thus volatile is unnecessary, and thus your change to atomic is unnecessary too). The resultant confusion of meanings is the real blocker for your change, and the naming unclarity is just a symptom. Something needs to be done about it.

Also, for the future: we don't initialize things with <type> <name>(value) syntax; it looks too much like a function declaration. And I also insist that all reads and writes of atomic variables be explicit .load(<memory order>) and .store(...) calls.
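
To illustrate both points (a sketch; the initial value is a placeholder):

```cpp
#include <atomic>
#include <cstddef>

// Brace initialization, not `size_t x(value)`, which reads too much
// like a function declaration.
static std::atomic<size_t> min_per_thread_cache_size_{512 << 10};

size_t min_per_thread_cache_size() {
  // An explicit load with an explicit memory order...
  return min_per_thread_cache_size_.load(std::memory_order_relaxed);
}

void set_min_per_thread_cache_size(size_t v) {
  // ...and an explicit store, instead of implicit conversions.
  min_per_thread_cache_size_.store(v, std::memory_order_relaxed);
}
```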

@alk
Contributor

alk commented May 27, 2024

Why not simply add a tcmalloc.minimal_per_thread_cache_size setting and the most trivial implementation of it?

And then I think RecomputePerThreadCacheSize should be changed to not use per_thread_cache_size_, and perhaps to do the ratio logic based on the threads' max sizes. This would amputate the per_thread_cache_size_ thingy entirely. But it can be done separately (and I'll be thankful if you handle it, but if not, I'll get to it eventually).

@coder-saab001 force-pushed the configure-per-thread-cache branch 2 times, most recently from db9c22d to fca78a9 on May 27, 2024 at 18:30
@coder-saab001
Contributor Author

Thanks @alk for all the deep analysis.
Sure, let's take that separately; I will pick it up after this one.

Refactored the whole CR from setting per_thread_cache_bytes to min_per_thread_cache_bytes; I hope that looks good.

Please have a look.

@alk
Contributor

alk commented May 27, 2024

Better. Still a few things:

  • Your documentation update refers to the non-existent tcmalloc.max_per_thread_cache_bytes setting. Was the intention to refer to max_total_thread_cache_bytes instead?
  • The "deferred" cleanup/joining of threads now makes the test's assertion race with the threads updating max_cache_size. You probably want to join the threads before asserting (see the sketch after this list)? In addition to that, I recommend you "test" your test by making sure it fails without your change.
  • Do we need the page heap lock when setting min_per_thread_cache_size? (I think not.)
  • Since we're accessing min_per_thread_cache_size unsynchronized, I recommend you make it atomic (but read it with std::memory_order_relaxed; we don't need any perf hit from the atomic access).
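
The join-before-assert point from the second bullet, as a sketch (the constant and variable names are placeholders):

```cpp
// Join all filler threads first, so the assertion below cannot race
// with a thread that is still updating max_cache_size.
for (auto& t : threads) {
  t.join();
}
ASSERT_GE(max_cache_size, kMinPerThreadCacheSize);
```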

@alk
Contributor

alk commented May 27, 2024

Ah, also: emplace_back instead of push_back(std::thread(...))?
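
That is (FillerThread here is a hypothetical thread body):

```cpp
std::vector<std::thread> threads;
// Constructs the std::thread in place from the arguments...
threads.emplace_back(FillerThread, i);
// ...instead of move-constructing it from a temporary:
// threads.push_back(std::thread(FillerThread, i));
```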

@alk
Contributor

alk commented May 27, 2024

Also, for set_min_per_thread_cache_size, why not inline it directly in the class definition? You have the getter inline but not the setter.

@coder-saab001
Contributor Author

Thanks @alk

  • Thanks for that; I forgot to remove it from the previous versions of the change.
  • Makes sense, I don't know how I missed it. And yeah, I tested it, and it fails without this change.
  • Yeah, it seems so. Removed.
  • Done
  • Much better ; )
  • Done

@alk
Contributor

alk commented May 27, 2024

Better. A couple more things, nearly all cosmetic:

  • The include has to be after config.h. In general, the order is config.h, then the "matching header" (if we're in a .cc), then C headers, then C++ headers, then project headers (generally sorted within each group). We don't 100% always follow this rule, but it is the rule (see also the Google C++ style guide; but note that its modern editions don't special-case config.h, which always needs to be first).
  • The page heap lock is still being held around the reading of tcmalloc.min_per_thread_cache_bytes.
  • Inside the test, I think having the mutex declared static right inside the locking block would be more idiomatic and easier to read, i.e. mtx is only ever used there. Also, updating max_cache_size via std::max rather than an explicit if would be more idiomatic: readers of the code then don't need to decipher "aha, this is taking a maximum" (see the sketch after this list).
  • In thread_cache.cc there are a few places that do "redundant" loads of min_per_thread_cache_size, causing possible races. I highly recommend you load it once per use, e.g. set SetMaxSize to the min size value and decrement the unclaimed space by that same amount (instead of by a freshly loaded min size value); this is also sketched after the list.
  • Ah, and also: in the test where you SetNumericProperty, it is worth checking that the property name was understood (i.e. that setting the property returned true).
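
Sketches of the third and fourth points, under stated assumptions (RecordCacheSize, max_cache_size, and the SetMaxSize call site are illustrative, not the actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>

static size_t max_cache_size;  // zero-initialized global

void RecordCacheSize(size_t current) {
  // The mutex is only ever used here, so declare it static inside the
  // block it guards.
  static std::mutex mtx;
  std::lock_guard<std::mutex> lock(mtx);
  // std::max says "take the maximum" directly; nothing to decipher.
  max_cache_size = std::max(max_cache_size, current);
}
```

And loading the atomic once per use:

```cpp
// One load per use: every decision in this scope sees the same value,
// instead of racing between two fresh loads.
const size_t min_size =
    min_per_thread_cache_size_.load(std::memory_order_relaxed);
heap->SetMaxSize(min_size);
unclaimed_cache_space_ -= min_size;  // decremented by that same amount
```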

@alk
Contributor

alk commented May 27, 2024

Ah, also: since we've changed the approach to only touch the min size, I recommend you update the commit subject line too.

@coder-saab001
Contributor Author

Thanks again @alk. Hoping this one looks fine.

@coder-saab001 changed the title from "Added support to configure per-thread cache" to "Added support to configure lower bound on per-thread cache size" on May 27, 2024
@alk
Contributor

alk commented May 27, 2024

Nearly perfect. But I just noticed that you still add a (seemingly unused) getter for per_thread_cache_size. Also, since you're changing things anyway: there is a small typo in the documentation you're adding, "takes affect"; I think what is meant is "takes effect". And I am not sure this statement is accurate (but let me know if I misunderstand): "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes."

@coder-saab001
Contributor Author

Hey @alk,
The statement is accurate in the sense that configuring tcmalloc.min_per_thread_cache_bytes will have little effect unless we also set tcmalloc.max_total_thread_cache_bytes, because it's the second one where we call RecomputePerThreadCacheSize(). The first one is just used to relax the lower bound while computing the per-thread cache size.

@coder-saab001
Contributor Author

Let me know if we want to change the behaviour; it's just that this one looks better to me.

@alk
Contributor

alk commented May 29, 2024

I anticipate that the main impact of the min cache size setting in practice is going to be via the IncreaseCacheLimit logic. That is, we expect the min cache size to be set quite early, when usually only the main thread exists and quite possibly hasn't yet increased its cache size cap either. Then, as threads are created and compete for the cache size limit, is when your new setting is going to do its job.

Let me know if you disagree.

@coder-saab001
Contributor Author

coder-saab001 commented May 29, 2024

Yeah, correct @alk.
Basically, everything will be the same as it was with the tcmalloc.max_total_thread_cache_bytes logic; it's just that the lower limit to which a per-thread cache can stoop can now be configured beforehand.

@coder-saab001
Contributor Author

Agreed that this is not as obvious a setting as the previous per_thread_cache one, which has been refactored.

@alk
Contributor

alk commented Jun 1, 2024

Hi. So, should I wait for you to update the comment?

@coder-saab001
Contributor Author

Hey @alk, which comment do you want me to update, specifically?

@alk
Contributor

alk commented Jun 3, 2024

I am referring to the comment "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes." in the docs. As noted above, I think it is factually incorrect and misleading, but do let me know if you disagree.

@coder-saab001
Contributor Author

coder-saab001 commented Jun 3, 2024

Ohh yeah, that is incorrect. Removed the comment, @alk :)

@coder-saab001
Contributor Author

coder-saab001 commented Jun 3, 2024

Thanks a lot @alk, I learned a lot as part of this.
Do you want to take a final look, and then maybe we can go forward with merging it?

@alk
Contributor

alk commented Jun 3, 2024

So, unfortunately, the test doesn't always fail if I comment out the min thread size setting. The best reproduction is to pin the test program to a single core (i.e. taskset -c 0 ./min_per_thread_cache_size_test). And it makes sense: nothing is being waited for in the filler threads, and they don't do much work, so one may easily finish before the next one is started.

Also, the way the max thread cache size is regulated, it isn't really guaranteed: one thread "stealing" max size bytes from another doesn't actually shrink the other thread's cache. So it isn't quite clear how best to test this.

Perhaps we could expose the threads' total max size aggregation to the test, instead of the actual free list sizes, and inspect that. Then each filler thread would simply stop at the end, waiting for a "go ahead and die" signal, and the main thread could do the stats inspection; there would be no need for locking around each thread inspecting the value and max-ing it.

Another minor thing (since this requires more iteration anyway): assigning the atomic min_per_thread_cache_size_ should be done with an explicit .store call, as I noted above. I'd have allowed that without an extra iteration, but since we need to iterate more, it shouldn't be much trouble.

@alk
Contributor

alk commented Jun 3, 2024

Alternatively, the filler threads can simply round-robin using some sort of synchronization facility. I'd maybe do that, but I am not sure how comfortable you are with arranging somewhat less common synchronization schemes.
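
A sketch of one such round-robin scheme, assuming each filler thread does a fixed number of passes (all names are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kNumThreads = 4;
constexpr int kNumPasses = 10;

std::mutex mtx;
std::condition_variable cv;
int turn = 0;  // id of the thread allowed to run its next pass

void FillerThread(int id) {
  for (int pass = 0; pass < kNumPasses; ++pass) {
    std::unique_lock<std::mutex> lock(mtx);
    // Sleep until it is this thread's turn; the predicate is checked
    // under mtx, so there is no check/sleep race.
    cv.wait(lock, [id] { return turn == id; });
    // ... do one pass of allocations/frees here ...
    turn = (turn + 1) % kNumThreads;  // hand the turn to the next thread
    cv.notify_all();
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < kNumThreads; ++i) threads.emplace_back(FillerThread, i);
  for (auto& t : threads) t.join();
  return 0;
}
```

Since every thread must take its turn in each round, all filler threads stay alive across all the passes and genuinely compete for cache, even when pinned to one core.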

@coder-saab001
Contributor Author

Agreed, @alk.
Tried implementing the round-robin thing. Happily, the test now always fails without the setting, even with taskset -c 0 ./min_per_thread_cache_size_test. Please check.

@alk
Contributor

alk commented Jun 4, 2024

Not bad. Seems to be correct. A few things:

  • I think the base/cleanup.h include is not used anymore.
  • Why do you lock and unlock the mutex on each of the ten "passes"? There's no need.
  • You chose kFooBar-style constants for the number of allocations and number of threads (I don't mind), but then hardcoded a "naked" literal 10 for the number of passes. Mixing such styles is a slight violation of the principle of least surprise. Can you please fix?
  • The special lock around getting the total thread cache usage and max-ing it isn't needed anymore; you already have enough serialization. Please fix.
  • Ah, and by the way, there is no need to explicitly initialize global variables to 0. Really minor, but it somehow catches my attention every time.
  • Ah, and another minor thing: it would be slightly more idiomatic to use a "new style" for loop to iterate over the vector of threads to join. No need for an index; just do for (auto& t : threads) { t.join(); } or something like that.

@coder-saab001
Contributor Author

Thanks @alk.
Addressed your comments.

@alk
Contributor

alk commented Jun 4, 2024

Thanks. One final thing I missed last time: can you please make the getter for the min per-thread size also use an explicit load? And please make both the store and the load use explicit relaxed memory order.

@coder-saab001
Contributor Author

Done @alk

@alk
Contributor

alk commented Jun 4, 2024

Applied. Thanks. I had to make a tiny change to your commit (added a signed-off line). You had added a spurious newline at the end of thread_cache.h, so I removed that addition.

@alk closed this on Jun 4, 2024