Implement threads pin offset for PerClusterPools
#338
@@ -120,7 +120,8 @@ class PerClusterPools {
   // result in threads not running on their own core, we only allow for
   // *upper bounds* on the number of clusters and threads. The actual number of
   // clusters and threads are still limited by the detected topology.
-  PerClusterPools(size_t max_clusters, size_t max_threads)
+  PerClusterPools(size_t max_clusters, size_t max_threads,
+                  size_t pin_offset = 0)
       : have_threading_support_(hwy::HaveThreadingSupport()),
         cores_per_cluster_(DetectCoresPerCluster()),
         outer_pool_(CapIfNonzero(cores_per_cluster_.size(), max_clusters)) {
@@ -135,8 +136,11 @@ class PerClusterPools {
     inner_pools_.push_back(std::make_unique<hwy::ThreadPool>(num_threads));
     if (num_threads > 1) {
       inner_pools_.back()->Run(0, num_threads,
-                               [](uint64_t /*task*/, size_t thread) {
-                                 hwy::PinThreadToLogicalProcessor(thread);
+                               [pin_offset, num_threads](uint64_t /*task*/,
+                                                         size_t thread) {
+                                 auto lp =
+                                     (thread + pin_offset) % num_threads;
+                                 hwy::PinThreadToLogicalProcessor(lp);
                                });
     }
     return;
@@ -153,7 +157,7 @@ class PerClusterPools {
     // (the one calling inner.Run()) to the enabled cores in the cluster.
     outer_pool_.Run(
         0, outer_pool_.NumWorkers(),
-        [this](uint64_t outer, size_t outer_thread) {
+        [this, pin_offset](uint64_t outer, size_t outer_thread) {
           HWY_ASSERT(outer == outer_thread);  // each outer has one task
           hwy::ThreadPool& inner = *inner_pools_[outer];
@@ -163,9 +167,10 @@ class PerClusterPools {
           HWY_ASSERT(inner.NumWorkers() <= cores.size());

           inner.Run(0, inner.NumWorkers(),
-                    [&cores](uint64_t task, size_t thread) {
+                    [pin_offset, &cores](uint64_t task, size_t thread) {
                       HWY_ASSERT(task == thread);  // each inner has one task
-                      hwy::PinThreadToLogicalProcessor(cores[task]);
+                      auto lp = cores[(task + pin_offset) % cores.size()];
Member

I'm concerned about this logic. It's probably fine right now because we temporarily throw all CPUs into one basket, and then add + mod makes sense (as is done above). However, we'd soon restore the intended detection, where each AMD CCX's 8 cores go into one inner pool. Then it doesn't make sense for multiple processes to share one CCX's cores, because that would thrash their shared L3 cache. It would be better for multiple processes to split the 'outer' pools (clusters, i.e. CCX or sockets) among themselves. That seems difficult to express with just one "pin offset". Perhaps it would be better to pass in how many total processes we expect, and which of those we are? Then we could do the right thing here, as well as in the topology-detection-failed case above where we just have a set of CPUs.

Taking a step back, do we really want multiple processes? If this is basically to implement batching, wouldn't it be more efficient to switch to the batch interface? Then you'd get most of the benefits (especially as we further improve the threading model), without the complexity of multiple processes. Happy to discuss via a quick video call if it's helpful?
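For illustration only, a sketch of the alternative suggested above: pass the total number of cooperating processes and this process's index, and give each process a disjoint slice of the detected clusters. The helper name and parameters below are hypothetical and not part of this PR or of gemma.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: round-robin assignment of cluster indices
// (CCXs/sockets) to one of `num_processes` processes, so that two
// processes never share one cluster's L3 cache.
std::vector<size_t> ClustersForProcess(size_t num_clusters,
                                       size_t num_processes,
                                       size_t process_index) {
  std::vector<size_t> mine;
  for (size_t c = process_index; c < num_clusters; c += num_processes) {
    mine.push_back(c);  // this process owns every num_processes-th cluster
  }
  return mine;
}
```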
Contributor (Author)

I considered the batch interface, but I think it's not suitable for my scenario yet.
Member

If it were just the start_pos, we could move towards passing one per query. I see two possible ways forward:

I'd tend towards option 1 because pinning is anyway a no-op on Mac. Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes? Last time I checked, for our single-process use case it's something like 5-15%.
Contributor (Author)

Passing one

The 1st option is almost my previous solution; I did not call the old

Maybe tomorrow I can try running a simple benchmark on my Linux machine. :)
Contributor (Author)

I have tested it in my environment, and the speeds with and without pinning are almost the same; the difference is less than 1%. System: Ubuntu 20.04.
Member

Thanks, good to know. FYI, our previous heuristic was to enable pinning above 10 threads. How about I add a tri-state flag that allows disabling it? We'd welcome a pull request for adding per-query pos, preferably passed as a
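As a rough sketch of what such a tri-state flag could look like: the enum and function names below are illustrative assumptions, not the actual gemma.cpp implementation; only the 10-thread heuristic comes from the comment above.

```cpp
#include <cstddef>

// Hypothetical tri-state: let the library decide, or force pinning on/off.
enum class Pinning { kAuto, kEnabled, kDisabled };

bool ShouldPin(Pinning pinning, size_t num_threads) {
  switch (pinning) {
    case Pinning::kEnabled:  return true;
    case Pinning::kDisabled: return false;
    case Pinning::kAuto:     // previous heuristic: pin above 10 threads
      return num_threads > 10;
  }
  return false;  // unreachable
}
```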
Contributor (Author)

I think this is feasible, so that users can choose according to their needs.

I have the same idea as you, but I haven't looked into the code about interleaved queries yet; it feels like this is important for implementing per-query pos. :)
+                      hwy::PinThreadToLogicalProcessor(lp);
                     });
         });
   }
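Below is a usage sketch of the constructor signature added in this diff. The header path, the cap values, and the choice of offset per process are assumptions for illustration, not something prescribed by the PR.

```cpp
#include <cstddef>

#include "util/threading.h"  // assumed header that declares PerClusterPools

// Each cooperating process passes a different pin_offset, so its inner-pool
// threads are pinned to rotated logical processors
// ((thread + pin_offset) % pool size, per the diff above).
void MakePoolsForProcess(size_t process_index, size_t threads_per_process) {
  // Hypothetical choice: shift each process's pinning by its index.
  const size_t pin_offset = process_index * threads_per_process;
  // 0 for the caps presumably means "no explicit limit" (see CapIfNonzero).
  PerClusterPools pools(/*max_clusters=*/0, /*max_threads=*/0, pin_offset);
  // ... run inference using `pools` ...
}
```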