
Conversation

@ufownl
Contributor

@ufownl ufownl commented Aug 6, 2024

In my scenario, I use a multi-process application to handle user requests, and each process has a limited number of threads to cap the computing resources each request uses. After adapting to the new PerClusterPools APIs, I found that threads in different processes were pinned to the same few processors, which greatly hurt performance. So I added this parameter to the PerClusterPools constructor to provide a way to avoid the issue.

[pin_offset, &cores](uint64_t task, size_t thread) {
  HWY_ASSERT(task == thread);  // each inner has one task
  auto lp = cores[(task + pin_offset) % cores.size()];
  hwy::PinThreadToLogicalProcessor(lp);
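For context, a caller-side sketch of how a multi-process deployment might derive the offset; ComputePinOffset, process_index, and threads_per_process are hypothetical names for illustration, not part of this PR:

```cpp
#include <stddef.h>

// Hypothetical sketch: offset each worker process's pins by its index so the
// processes do not all pin their threads to the same few logical processors.
size_t ComputePinOffset(size_t process_index, size_t threads_per_process) {
  return process_index * threads_per_process;  // value passed as pin_offset
}
```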
Member

I'm concerned about this logic. It's probably fine right now because we temporarily throw all CPUs into one basket, and then add + mod makes sense (as is done above). However, we'd soon restore the intended detection, where each AMD CCX's 8 cores go into one inner pool. Then it doesn't make sense for multiple processes to share one CCX's cores, because that would thrash their shared L3 cache.

It would be better for multiple processes to split the 'outer' (clusters, i.e. CCX or sockets) among themselves. That seems difficult to express with just one "pin offset". Perhaps it would be better to pass in how many total processes we expect, and which of those we are? Then we could do the right thing here, as well as in the topology-detection-failed case above where we just have a set of CPUs.
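To make that concrete, a minimal sketch of how clusters could be divided once num_processes and my_process are known; the function name and the round-robin assignment are illustrative assumptions, not code from this PR:

```cpp
#include <stddef.h>

#include <vector>

// Hypothetical sketch: assign whole clusters (CCXes or sockets) to processes
// so that no two processes share a cluster's L3 cache.
std::vector<size_t> ClustersForProcess(size_t num_clusters, size_t num_processes,
                                       size_t my_process) {
  std::vector<size_t> mine;
  for (size_t c = my_process; c < num_clusters; c += num_processes) {
    mine.push_back(c);  // round-robin over cluster indices
  }
  return mine;
}
```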

Taking a step back, do we really want multiple processes? If this is basically to implement batching, wouldn't it be more efficient to switch to the batch interface? Then you'd get most of the benefits (especially as we further improve the threading model), without the complexity of multiple processes. Happy to discuss via quick video call if it's helpful?

Contributor Author

I considered the batch interface, but I think it's not suitable for my scenario yet.

  • In my scenario, I need to maintain multiple conversations with different users; they may have different contexts and different start_pos. The batch interface can pass in a different KV cache for each context, but it can only pass one start_pos.
  • Currently, our number of concurrent requests is not large and the arrival time of user requests is unpredictable, so it is difficult to wait for enough requests to batch. Once prefill and generation begin, subsequent requests stay pending until the previous processing completes, so their initial response is significantly delayed. I therefore prefer a suitable number of worker processes handling requests via the stream interface, which keeps the initial response to a request fast as long as the load is not exceeded. Letting users wait without any response is a bad experience, so in my scenario response speed is more important than generation speed.

Member

If it were just the start_pos, we could move towards passing one per query.
Makes sense that batching is harder with low QPS, so it looks like multiple processes are required.

I see two possible ways forward:

  1. we add a flag to skip the pinning, and each process uses the max_outer and num_threads flags.
  2. we add awareness of multiple processes to our pinning logic, passing in new num_processes and my_process flags; we then divide max_outer by num_processes and pin to the resulting subset.

I'd tend towards option 1 because pinning is a no-op on Mac anyway. Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes? Last time I checked, for our single-process use case it's something like 5-15%.

Contributor Author

If it were just the start_pos, we could move towards passing one per query.

Passing one start_pos per query should improve the usability of the batch interface in many practical scenarios. I thought about this a few days ago, but haven't had enough time to look into the relevant code recently.

I see two possible ways forward:

The 1st option is almost my previous solution: I did not call the old gcpp::PinWorkersToCores function in my application. Prefill and generation speeds are acceptable for me.

Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes?

Maybe tomorrow I can try running a simple benchmark on my Linux machine. :)

Contributor Author

I have tested it in my environment, and the speeds with and without pinning are almost the same; the difference is less than 1%.

System: Ubuntu 20.04
CPU: Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz

Member

Thanks, good to know. FYI our previous heuristic was to enable pinning above 10 threads. How about I add a tri-state flag that allows disabling it?
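A minimal sketch of what such a tri-state flag could look like; the enum and helper are assumptions for illustration, not the actual flag:

```cpp
#include <stddef.h>

// Hypothetical tri-state pinning setting: keep the existing heuristic
// (pin when running with more than 10 threads), or force pinning on or off.
enum class Pinning { kAuto, kEnabled, kDisabled };

inline bool ShouldPin(Pinning pinning, size_t num_threads) {
  if (pinning == Pinning::kAuto) return num_threads > 10;  // prior heuristic
  return pinning == Pinning::kEnabled;
}
```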

We'd welcome a pull request for adding per-query pos, preferably passed as a hwy::Span<size_t>.
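For illustration, a per-query pos argument might look roughly like this; the function name and signature are hypothetical, not the actual gemma.cpp batch API:

```cpp
#include <stddef.h>

#include "hwy/aligned_allocator.h"  // hwy::Span (header may differ by Highway version)

// Hypothetical sketch of a batch entry point taking one start position per query.
void GenerateBatchSketch(hwy::Span<const size_t> start_pos) {
  for (size_t qi = 0; qi < start_pos.size(); ++qi) {
    // Query qi would resume from its own start_pos[qi] rather than all
    // queries sharing a single start_pos.
  }
}
```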

Contributor Author

How about I add a tri-state flag that allows disabling it?

I think this is feasible; it lets users choose according to their needs.

preferably passed as a hwy::Span<size_t>

I have the same idea, but I haven't looked into the query-interleaving code yet; it seems important for implementing per-query pos. :)

copybara-service bot pushed a commit that referenced this pull request Aug 9, 2024
PiperOrigin-RevId: 661293615
copybara-service bot pushed a commit that referenced this pull request Aug 9, 2024
PiperOrigin-RevId: 661389171
@ufownl
Contributor Author

ufownl commented Aug 10, 2024

PR #343 fixes the issue this PR was trying to address, so this PR can be closed.
