Implement threads pin offset for PerClusterPools #338
Conversation
```cpp
[pin_offset, &cores](uint64_t task, size_t thread) {
  HWY_ASSERT(task == thread);  // each inner has one task
  hwy::PinThreadToLogicalProcessor(cores[task]);
  auto lp = cores[(task + pin_offset) % cores.size()];
```
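The quoted hunk is truncated, so here is a minimal reconstruction of what the changed worker lambda appears to do. The surrounding `Run` call, the `pool`/`cores` names outside the quoted lines, and the trailing pin-to-`lp` call are assumptions for the sketch, not the exact PR code.

```cpp
// Reconstruction sketch (assumed surrounding code, not the exact PR diff):
// each inner-pool worker pins itself to a core chosen by rotating the core
// list by pin_offset, so processes with different offsets avoid pinning to
// the same logical processors.
#include <cstddef>
#include <cstdint>
#include <vector>

#include "hwy/base.h"                             // HWY_ASSERT
#include "hwy/contrib/thread_pool/thread_pool.h"  // hwy::ThreadPool
#include "hwy/contrib/thread_pool/topology.h"     // hwy::PinThreadToLogicalProcessor

void PinWithOffset(hwy::ThreadPool& pool, const std::vector<size_t>& cores,
                   size_t pin_offset) {
  pool.Run(0, cores.size(),
           [pin_offset, &cores](uint64_t task, size_t thread) {
             HWY_ASSERT(task == thread);  // each inner has one task
             const size_t lp = cores[(task + pin_offset) % cores.size()];
             hwy::PinThreadToLogicalProcessor(lp);
           });
}
```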
I'm concerned about this logic. It's probably fine right now because we temporarily throw all CPUs into one basket, and then add + mod makes sense (as is done above). However we'd soon restore the intended detection, where each AMD CCX's 8 cores go into one inner pool. Then it doesn't make sense for multiple processes to share one CCX's cores, because that would thrash their shared L3 cache.
It would be better for multiple processes to split the 'outer' (clusters, i.e. CCX or sockets) among themselves. That seems difficult to express with just one "pin offset". Perhaps it would be better to pass in how many total processes we expect, and which of those we are? Then we could do the right thing here, as well as in the topology-detection-failed case above where we just have a set of CPUs.
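For illustration, a rough sketch of that idea under hypothetical num_processes/my_process parameters; none of these names exist in the current API, and the cluster representation is assumed.

```cpp
// Hypothetical sketch: instead of rotating cores within one cluster, divide
// the outer clusters (CCXs/sockets) among processes so no two processes
// share a cluster's L3 cache. `clusters` is assumed to be one vector of
// logical-processor IDs per detected cluster.
#include <cstddef>
#include <vector>

std::vector<std::vector<size_t>> ClustersForProcess(
    const std::vector<std::vector<size_t>>& clusters, size_t num_processes,
    size_t my_process) {
  std::vector<std::vector<size_t>> mine;
  for (size_t i = 0; i < clusters.size(); ++i) {
    // Round-robin: process p owns clusters p, p + num_processes, ...
    if (i % num_processes == my_process) mine.push_back(clusters[i]);
  }
  return mine;
}
```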
Taking a step back, do we really want multiple processes? If this is basically to implement batching, wouldn't it be more efficient to switch to the batch interface? Then you'd get most of the benefits (especially as we further improve the threading model), without the complexity of multiple processes. Happy to discuss via quick video call if it's helpful?
I considered the batch interface, but I think it's not suitable for my scenario yet.
- In my scenario, I need to maintain multiple conversations with different users. They may have different contexts and different start_pos; the batch interface can pass in different KV caches for different contexts, but can only pass one start_pos.
- Currently, our number of concurrent requests is not large, and when user requests arrive is unpredictable, so it's difficult to wait for enough requests to batch. Once prefill and generation begin, subsequent requests stay pending until the previous processing completes, which significantly delays their initial response. So I prefer a suitable number of worker processes handling requests via the stream interface, which keeps the initial response fast as long as the load is not exceeded. It is a bad experience to let users wait without any response, so in my scenario response speed is more important than generation speed.
If it were just the start_pos, we could move towards passing one per query.
Makes sense that batching is harder with low QPS. So looks like multiple processes are required.
I see two possible ways forward:
- we add a flag to skip the pinning, and each process uses the max_outer and num_threads flags.
- we add awareness of multiple processes to our pinning logic, passing in new num_processes and my_process flags, then divide max_outer by num_processes and pin to those.
I'd lean toward option 1 because pinning is a no-op on Mac anyway. Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes? Last time I checked, for our single-process use case it's something like 5-15%.
> If it were just the start_pos, we could move towards passing one per query.
Passing one start_pos per query should improve the usability of the batch interface in many practical scenarios. I thought about this a few days ago, but haven't had enough time to look into the relevant code recently.
> I see two possible ways forward:
The first option is almost my previous solution: I did not call the old gcpp::PinWorkersToCores function in my application, and prefill and generation speeds are acceptable to me.
> Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes?
Maybe tomorrow I can try running a simple benchmark on my Linux machine. :)
I have tested it in my environment, and the speeds with and without pinning are almost the same; the difference is less than 1%.
System: Ubuntu 20.04
CPU: Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
Thanks, good to know. FYI our previous heuristic was to enable pinning above 10 threads. How about I add a tri-state flag that allows disabling it?
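For concreteness, one possible shape of such a tri-state setting; the names here are illustrative, not an actual gemma.cpp flag.

```cpp
// Hypothetical tri-state pinning setting; kAuto would keep the previous
// heuristic of pinning only above 10 threads.
#include <cstddef>

enum class Pinning { kAuto, kDisabled, kEnabled };

bool ShouldPin(Pinning pinning, size_t num_threads) {
  switch (pinning) {
    case Pinning::kDisabled: return false;
    case Pinning::kEnabled:  return true;
    case Pinning::kAuto:     return num_threads > 10;  // previous heuristic
  }
  return false;  // unreachable
}
```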
We'd welcome a pull request for adding per-query pos, preferably passed as a hwy::Span<size_t>.
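As an illustration only, passing per-query positions as a span might look roughly like this; the function name and surrounding parameters are made up for the sketch and are not the actual batch interface.

```cpp
// Hypothetical sketch of per-query start positions passed as a span.
// The caller keeps ownership of the vector; the span only views it.
#include <cstddef>
#include <vector>

#include "hwy/aligned_allocator.h"  // hwy::Span

void GenerateBatch(hwy::Span<const size_t> start_pos
                   /*, prompts, KV caches, ... */) {
  for (size_t q = 0; q < start_pos.size(); ++q) {
    // Query q would resume decoding from its own start_pos[q].
  }
}

// Usage sketch:
//   std::vector<size_t> pos = {0, 128, 42};
//   GenerateBatch(hwy::Span<const size_t>(pos.data(), pos.size()));
```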
> How about I add a tri-state flag that allows disabling it?
I think this is feasible so that users can choose according to their needs.
> preferably passed as a hwy::Span<size_t>
I have the same idea, but I haven't looked into the code for interleaving queries yet; it feels like that is important for implementing per-query pos. :)
PR #343 fixes the issue this PR was trying to fix, so this PR can be closed.
In my scenario, I use a multi-process application to handle user requests, and each process has limited threads to limit the computing resources each request uses. After adapting to the new PerClusterPools APIs, I found that threads in different processes were pinned to the same few processors, greatly hurting performance. So I added this parameter to the constructor of PerClusterPools to provide a way to avoid this issue.
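For illustration only, a per-process offset could be derived like this; the PerClusterPools constructor arguments shown in the comment are an assumption based on this description, not the actual signature.

```cpp
// Hypothetical sketch: each worker process derives a distinct pin offset
// from an index supplied by its launcher (e.g. an environment variable).
#include <cstddef>
#include <cstdlib>

size_t ProcessIndexFromEnv() {
  // Assumption: the launcher sets WORKER_INDEX=0..N-1 for each worker.
  const char* env = std::getenv("WORKER_INDEX");
  return env ? static_cast<size_t>(std::atoi(env)) : 0;
}

// Each process would then pass a distinct offset when building its pools,
// e.g. (constructor arguments are illustrative, not the real signature):
//   const size_t threads_per_process = 4;
//   PerClusterPools pools(/*max_clusters=*/1, threads_per_process,
//                         /*pin_offset=*/ProcessIndexFromEnv() * threads_per_process);
```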