Implement threads pin offset for PerClusterPools #338
Conversation
```cpp
[pin_offset, &cores](uint64_t task, size_t thread) {
  HWY_ASSERT(task == thread);  // each inner has one task
  hwy::PinThreadToLogicalProcessor(cores[task]);
  auto lp = cores[(task + pin_offset) % cores.size()];
```
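The quoted hunk is truncated, so here is a minimal reconstruction of what the changed worker lambda appears to do. The surrounding `Run` call, the `pool`/`cores` names outside the quoted lines, and the trailing pin-to-`lp` call are assumptions for the sketch, not the exact PR code.

```cpp
// Reconstruction sketch (assumed surrounding code, not the exact PR diff):
// each inner-pool worker pins itself to a core chosen by rotating the core
// list by pin_offset, so processes with different offsets avoid pinning to
// the same logical processors.
#include <cstddef>
#include <cstdint>
#include <vector>

#include "hwy/base.h"                             // HWY_ASSERT
#include "hwy/contrib/thread_pool/thread_pool.h"  // hwy::ThreadPool
#include "hwy/contrib/thread_pool/topology.h"     // hwy::PinThreadToLogicalProcessor

void PinWithOffset(hwy::ThreadPool& pool, const std::vector<size_t>& cores,
                   size_t pin_offset) {
  pool.Run(0, cores.size(),
           [pin_offset, &cores](uint64_t task, size_t thread) {
             HWY_ASSERT(task == thread);  // each inner has one task
             const size_t lp = cores[(task + pin_offset) % cores.size()];
             hwy::PinThreadToLogicalProcessor(lp);
           });
}
```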
I'm concerned about this logic. It's probably fine right now because we temporarily throw all CPUs into one basket, and then add + mod makes sense (as is done above). However we'd soon restore the intended detection, where each AMD CCX's 8 cores go into one inner pool. Then it doesn't make sense for multiple processes to share one CCX's cores, because that would thrash their shared L3 cache.
It would be better for multiple processes to split the 'outer' (clusters, i.e. CCX or sockets) among themselves. That seems difficult to express with just one "pin offset". Perhaps it would be better to pass in how many total processes we expect, and which of those we are? Then we could do the right thing here, as well as in the topology-detection-failed case above where we just have a set of CPUs.
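For illustration, a rough sketch of that idea under hypothetical num_processes/my_process parameters; none of these names exist in the current API, and the cluster representation is assumed.

```cpp
// Hypothetical sketch: instead of rotating cores within one cluster, divide
// the outer clusters (CCXs/sockets) among processes so no two processes
// share a cluster's L3 cache. `clusters` is assumed to be one vector of
// logical-processor IDs per detected cluster.
#include <cstddef>
#include <vector>

std::vector<std::vector<size_t>> ClustersForProcess(
    const std::vector<std::vector<size_t>>& clusters, size_t num_processes,
    size_t my_process) {
  std::vector<std::vector<size_t>> mine;
  for (size_t i = 0; i < clusters.size(); ++i) {
    // Round-robin: process p owns clusters p, p + num_processes, ...
    if (i % num_processes == my_process) mine.push_back(clusters[i]);
  }
  return mine;
}
```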
Taking a step back, do we really want multiple processes? If this is basically to implement batching, wouldn't it be more efficient to switch to the batch interface? Then you'd get most of the benefits (especially as we further improve the threading model), without the complexity of multiple processes. Happy to discuss via quick video call if it's helpful?
I considered the batch interface, but I think it's not suitable for my scenario yet.
- In my scenario, I need to maintain multiple conversations with different users. They may have different contexts and different start_pos; the batch interface can pass in different KV caches for different contexts, but can only pass one start_pos.
- Currently, our number of concurrent requests is not large, and when user requests arrive is unpredictable, so it's difficult to wait for enough requests to batch. Once prefill and generation begin, subsequent requests stay pending until the previous processing completes, which significantly delays their initial response. So I prefer a suitable number of worker processes handling requests via the stream interface, which keeps the initial response fast as long as the load is not exceeded. It is a bad experience to let users wait without any response, so in my scenario response speed is more important than generation speed.
If it were just the start_pos, we could move towards passing one per query.
Makes sense that batching is harder with low QPS. So looks like multiple processes are required.
I see two possible ways forward:
- we add a flag to skip the pinning, and each process uses the max_outer and num_threads flags.
- we add awareness of multiple processes to our pinning logic, passing in new num_processes and my_process flags, then divide max_outer by num_processes and pin to those.
I'd lean toward option 1 because pinning is a no-op on Mac anyway. Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes? Last time I checked, for our single-process use case it's something like 5-15%.
> If it were just the start_pos, we could move towards passing one per query.
Passing one start_pos per query should improve the usability of the batch interface in many practical scenarios. I thought about this a few days ago, but haven't had enough time to look into the relevant code recently.
> I see two possible ways forward:
The first option is almost my previous solution: I did not call the old gcpp::PinWorkersToCores function in my application, and prefill and generation speeds are acceptable to me.
> Would you like to measure the performance benefit of pinning on your Linux system to see how helpful it is for multiple processes?
Maybe tomorrow I can try running a simple benchmark on my Linux machine. :)
I have tested it in my environment, and the speeds with and without pinning are almost the same; the difference is less than 1%.
System: Ubuntu 20.04
CPU: Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
Thanks, good to know. FYI our previous heuristic was to enable pinning above 10 threads. How about I add a tri-state flag that allows disabling it?
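For concreteness, one possible shape of such a tri-state setting; the names here are illustrative, not an actual gemma.cpp flag.

```cpp
// Hypothetical tri-state pinning setting; kAuto would keep the previous
// heuristic of pinning only above 10 threads.
#include <cstddef>

enum class Pinning { kAuto, kDisabled, kEnabled };

bool ShouldPin(Pinning pinning, size_t num_threads) {
  switch (pinning) {
    case Pinning::kDisabled: return false;
    case Pinning::kEnabled:  return true;
    case Pinning::kAuto:     return num_threads > 10;  // previous heuristic
  }
  return false;  // unreachable
}
```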
We'd welcome a pull request for adding per-query pos, preferably passed as a hwy::Span<size_t>.
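As an illustration only, passing per-query positions as a span might look roughly like this; the function name and surrounding parameters are made up for the sketch and are not the actual batch interface.

```cpp
// Hypothetical sketch of per-query start positions passed as a span.
// The caller keeps ownership of the vector; the span only views it.
#include <cstddef>
#include <vector>

#include "hwy/aligned_allocator.h"  // hwy::Span

void GenerateBatch(hwy::Span<const size_t> start_pos
                   /*, prompts, KV caches, ... */) {
  for (size_t q = 0; q < start_pos.size(); ++q) {
    // Query q would resume decoding from its own start_pos[q].
  }
}

// Usage sketch:
//   std::vector<size_t> pos = {0, 128, 42};
//   GenerateBatch(hwy::Span<const size_t>(pos.data(), pos.size()));
```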
> How about I add a tri-state flag that allows disabling it?
I think this is feasible so that users can choose according to their needs.
> preferably passed as a hwy::Span<size_t>
I have the same idea, but I haven't looked into the code for interleaving queries yet; it feels like that is important for implementing per-query pos. :)
PR #343 fixes the issue this PR was trying to fix, so this PR can be closed.
In my scenario, I use a multi-process application to handle user requests, and each process has limited threads to limit the computing resources each request uses. After adapting to the new PerClusterPools APIs, I found that threads in different processes were pinned to the same few processors, greatly hurting performance. So I added this parameter to the constructor of PerClusterPools to provide a way to avoid this issue.
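For illustration only, a per-process offset could be derived like this; the PerClusterPools constructor arguments shown in the comment are an assumption based on this description, not the actual signature.

```cpp
// Hypothetical sketch: each worker process derives a distinct pin offset
// from an index supplied by its launcher (e.g. an environment variable).
#include <cstddef>
#include <cstdlib>

size_t ProcessIndexFromEnv() {
  // Assumption: the launcher sets WORKER_INDEX=0..N-1 for each worker.
  const char* env = std::getenv("WORKER_INDEX");
  return env ? static_cast<size_t>(std::atoi(env)) : 0;
}

// Each process would then pass a distinct offset when building its pools,
// e.g. (constructor arguments are illustrative, not the real signature):
//   const size_t threads_per_process = 4;
//   PerClusterPools pools(/*max_clusters=*/1, threads_per_process,
//                         /*pin_offset=*/ProcessIndexFromEnv() * threads_per_process);
```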