Added support to configure lower bound on per-thread cache size #1515
Conversation
Force-pushed from aa6bde3 to dfb008a
This is much better. A few notables:

P.S. I am still somewhat skeptical about the overall direction and approach here. Please consider looking at this from a holistic perspective (whether this is the "fix" you're looking for). With that said, as I promised, if this is good enough quality, I'll still merge it.

ah, also, please update the copyright year in the test .cc you've added.
Force-pushed from dfb008a to 42c9fed
Hey @alk, thanks for taking a deep look. Yes, this is the fix I am looking for. It will give the user more power to control the cache.

I'll have a closer look sometime later, but immediately I see you're still making the same mistake with waiting and waking up. The "naked" counter bump and notify races with the waiting side, which takes the lock (which does nothing, since the notifying side ignores locking), checks the condition, may find the counter not yet ready to continue, and goes to sleep. In between the check and going to sleep there is a race. This race is precisely why condition variable APIs are paired with a mutex. Please fix. And again, things will be simpler and nicer to read if you just use one mutex to guard everything.
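The pattern the reviewer describes can be sketched as follows. This is an illustrative example, not the PR's actual code: the counter is bumped and the waiter notified under the same mutex the waiter holds, so the check-then-sleep in the waiting function cannot race with the notification.

```cpp
#include <condition_variable>
#include <mutex>

static std::mutex mu;
static std::condition_variable cv;
static int ready_count = 0;

void filler_done() {
  std::lock_guard<std::mutex> lock(mu);  // bump under the waiter's mutex
  ++ready_count;
  cv.notify_one();
}

int wait_for_fillers(int n) {
  std::unique_lock<std::mutex> lock(mu);
  // The predicate is re-checked under the lock after every wakeup,
  // closing the window between the check and going to sleep.
  cv.wait(lock, [n] { return ready_count >= n; });
  return ready_count;
}
```

Because the notifier holds the mutex while changing the counter, the waiter can never observe a stale counter value and then miss the wakeup.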
Force-pushed from 42c9fed to 80cec29
Thanks for the quick reply. Can you please have a look again @alk

This is better. Now you don't need those counters to be atomic. Also there is no need to unlock and then immediately re-lock the mutex when waiting in the main function; I'd use a block scope to avoid the explicit unlock. Also please handle the join/cleanup thingy. Just artificially make the final assert fail and you'll see what I am talking about.
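The block-scope suggestion can be sketched like this (names are illustrative): instead of calling unlock() and then lock() again, the critical section is limited by a brace scope so the RAII guard releases the mutex exactly where the explicit unlock would have gone.

```cpp
#include <mutex>

std::mutex value_mu;
int shared_value = 0;  // guarded by value_mu

int snapshot_then_work() {
  int snapshot;
  {
    std::lock_guard<std::mutex> lock(value_mu);  // released at the brace
    snapshot = shared_value;
  }
  // ... longer work happens here, outside the critical section ...
  return snapshot;
}
```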
Force-pushed from 80cec29 to 62f5dd2
Done @alk
Took a closer look. Now that the test is clearer, there is seemingly no value at all in those counters and condition variables. Filler threads start and finish without waiting for anyone, and the main thread can simply wait for them to finish by joining the fillers.

For the main change, the real problem appears to be that per_thread_cache_size_ is barely used. I did some git archaeology and tcmalloc had some "static cache size" mode where it was used. What we have now is an incorrect comment that it is being read without synchronization (and thus the volatile is unnecessary, and thus your change to atomic is unnecessary too). The resultant confusion of meanings is the real blocker for your change, and the naming unclarity is just a symptom. Something needs to be done about it. Also for the future: we don't initialize things with

Why not simply add a tcmalloc.minimal_per_thread_cache_size setting and the most trivial implementation of it? And then I think RecomputePerThreadCacheSize should be changed to not use per_thread_cache_size_, and perhaps do the ratio logic based on threads' max sizes. This will amputate the per_thread_cache_size_ thingy entirely. But this can be done separately (and I'll be thankful if you handle it, but if not I'll get to it eventually).
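The "most trivial implementation" suggested above might look roughly like this. All names here are illustrative, not gperftools' actual internals: a single atomic lower bound is stored, and the per-thread sizing logic simply clamps its computed size against it.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Hypothetical default; the real value would come from the project.
constexpr std::size_t kDefaultMinPerThreadCacheSize = 512 * 1024;

std::atomic<std::size_t> min_per_thread_cache_size{
    kDefaultMinPerThreadCacheSize};

void set_min_per_thread_cache_size(std::size_t bytes) {
  min_per_thread_cache_size.store(bytes, std::memory_order_relaxed);
}

// Stand-in for the ratio-based recompute step: whatever size it
// derives, the result never goes below the configured minimum.
std::size_t clamp_per_thread_size(std::size_t computed) {
  return std::max(
      computed, min_per_thread_cache_size.load(std::memory_order_relaxed));
}
```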
Force-pushed from db9c22d to fca78a9
Thanks @alk for all the deep analysis. Refactored the whole CR. Please have a look.

Better. Still a few things:

ah, also: emplace_back instead of push_back(thread(...))?

also, for set_min_per_thread_cache_size, why not inline it directly in the class definition? You have the getter inline but not the setter.
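Both cosmetic points can be sketched together (class and function names here are illustrative, not the PR's actual code): the trivial setter is defined inline in the class body just like the getter, and threads are constructed in place with emplace_back rather than moving a temporary via push_back.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

class CacheConfig {
 public:
  std::size_t min_per_thread_cache_size() const { return min_size_; }
  void set_min_per_thread_cache_size(std::size_t v) { min_size_ = v; }
 private:
  std::size_t min_size_ = 0;
};

void filler() { /* allocate to populate this thread's cache */ }

int spawn_fillers(int n) {
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i) {
    threads.emplace_back(filler);  // not push_back(std::thread(filler))
  }
  for (auto& t : threads) t.join();
  return static_cast<int>(threads.size());
}
```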
Force-pushed from fca78a9 to 86eaaad
Thanks @alk

Better. A couple more things, nearly all cosmetic.

ah, also, since we've changed the approach to only touch the min size, I also recommend updating the commit subject line.
Force-pushed from 86eaaad to f29f21d
Thanks again @alk. Hoping this one looks fine.
Force-pushed from f29f21d to 594f5b7
Nearly perfect. But I just noticed that you still add a (seemingly unused) getter for per_thread_cache_size. Also, since you're changing things anyway: there is a small typo in the documentation you're adding, "takes affect"; I think what is meant is "takes effect". Also, I am not sure this statement is accurate (but let me know if I misunderstand): "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes."
Force-pushed from 594f5b7 to d79227c
hey @alk,

Let me know if we want to change the behaviour. It's just that this one looks better to me.

I anticipate that the main impact of the min cache size setting in practice is going to be via the IncreaseCacheLimit logic. I.e. we expect the min cache size to be set quite early, when usually only the main thread exists and quite possibly it hasn't yet increased its cache size cap either. Then, as threads are created and compete for the cache size limit, is when your new setting is going to do its job. Let me know if you disagree.
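The competition dynamic described above can be illustrated with a toy model (this is not tcmalloc's actual IncreaseCacheLimit code): a growing thread takes bytes from a victim's cap, but the victim's cap never drops below the configured minimum, which is exactly where the new setting does its job.

```cpp
#include <algorithm>
#include <cstddef>

// Returns how many bytes could be taken from the victim's cap while
// respecting the configured per-thread minimum.
std::size_t steal_from(std::size_t& victim_cap, std::size_t want,
                       std::size_t min_cap) {
  std::size_t available = victim_cap > min_cap ? victim_cap - min_cap : 0;
  std::size_t taken = std::min(want, available);
  victim_cap -= taken;
  return taken;
}
```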
Yeah, correct @alk |
Agree that this is not as obvious a setting as the previous one for

Hi. So, should I wait for you to update the comment?

Hey @alk, which comment do you want me to update specifically?

I am referring to the comment "Also, make sure to set this property before tcmalloc.max_total_thread_cache_bytes." in the docs. As noted above, I think it is factually incorrect and misleading. But do let me know if you disagree.
Force-pushed from d79227c to d0f34a1
Ohh yeah, that is incorrect. Removed the comment @alk :)

Thanks a lot @alk, learned a lot as part of this.

So unfortunately the test doesn't always fail if I comment out the min thread size setting. The best reproduction is to pin the test program to a single core (i.e. taskset -c 0 ./min_per_thread_cache_size_test). And it makes sense, given that nothing is being waited for in the filler threads, and they don't do much work, so they may easily finish before the next one is started. Also, the way max thread cache size is regulated, it isn't really guaranteed: one thread "stealing" max size bytes from another doesn't really reduce the other thread's cache. So it isn't quite clear how to best test this. Perhaps we could expose the thread's total max size aggregation, instead of actual free list sizes, to the test and inspect like this. And then the filler thread will simply stop at the end, waiting for the "go ahead and die" signal. Then the main thread can do the stats inspection, and there is no need to have locking around each thread inspecting the value and max-ing it. Another minor thing (since this requires more iteration) is that assigning the atomic min_per_thread_cache_size_ should be done with an explicit .store call, as I noted above. I'd allow that without an extra iteration, but since we need to iterate more, it shouldn't be much trouble.

Alternatively, filler threads can simply round-robin through some sort of synchronization facility. I'd maybe do that. But I am not sure how comfortable you are with arranging somewhat less common synchronization schemes.
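The test shape proposed above might look roughly like this (a hypothetical sketch, not the PR's test): each filler populates its cache and then parks until a "go ahead and die" signal; the main thread inspects stats while every filler thread still exists, then releases them all and joins.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mu;
std::condition_variable cv;
bool go_ahead_and_die = false;

void filler() {
  // ... allocate enough to populate this thread's cache ...
  std::unique_lock<std::mutex> lock(mu);
  cv.wait(lock, [] { return go_ahead_and_die; });  // park until released
}

int run_fillers(int n) {
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i) threads.emplace_back(filler);
  // ... main thread would inspect aggregated max sizes here,
  //     while all filler threads are guaranteed to still exist ...
  {
    std::lock_guard<std::mutex> lock(mu);
    go_ahead_and_die = true;
  }
  cv.notify_all();
  for (auto& t : threads) t.join();
  return n;
}
```

Because the fillers only exit after the signal, the main thread's inspection runs against a stable set of thread caches, with no per-thread locking needed during the max-ing step.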
Force-pushed from d0f34a1 to 0d1dfd3
Agree @alk

Not bad. Seems to be correct. A few things:
Force-pushed from 0d1dfd3 to e810b7c
Thanks @alk
Force-pushed from e810b7c to f3d8542
Thanks. One final thing I missed last time: can you please make the getter for the min per-thread size also use an explicit load? And please make both the store and the load use explicit relaxed memory order.
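The requested accessor style can be sketched as follows; the member name mirrors the one discussed in the thread, while the class shape is illustrative.

```cpp
#include <atomic>
#include <cstddef>

class ThreadCacheConfig {
 public:
  void set_min_per_thread_cache_size(std::size_t v) {
    // Explicit store with relaxed ordering, not plain assignment.
    min_per_thread_cache_size_.store(v, std::memory_order_relaxed);
  }
  std::size_t min_per_thread_cache_size() const {
    // Explicit load with relaxed ordering, matching the store side.
    return min_per_thread_cache_size_.load(std::memory_order_relaxed);
  }
 private:
  std::atomic<std::size_t> min_per_thread_cache_size_{0};
};
```

Relaxed ordering is sufficient here because the value is a standalone tunable: no other memory is published or consumed through it.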
Force-pushed from f3d8542 to acc722e
Done @alk
Applied. Thanks. I had to make a tiny change to your commit (added a signed-off line). You added a spurious newline at the end of thread_cache.h, so I removed that addition.
Support to configure the per-thread cache is added, along with the required unit tests.
Related issues: #1511