CPU SHM based inference_all_reduce improve #5320

Merged
merged 36 commits into microsoft:master on Apr 4, 2024

Conversation

delock (Contributor) commented Mar 27, 2024

This PR improves SHM based inference_all_reduce on CPU:

  1. Optimize for larger message sizes, which affect the performance of first-token generation with long contexts. For example, for llama2 70b with a 1024-token input sequence length, the all_reduce message size is 32MB with a single batch.
    • Increased SHM buffer size from 1MB/worker to 32MB/worker
    • Each worker allocates its SHM buffer on its own NUMA node, instead of rank 0 allocating SHM buffers for all workers (see the sketch after this list)
    • For message sizes > 1MB, a more distributed algorithm is used so that memory bandwidth and computation are spread evenly among workers
  2. Decouple the SHM based collective code from the oneCCL based code, making it ready to integrate with other backends, e.g. a gloo backend
  3. Loosen the condition under which SHM based allreduce is used, i.e. the message size no longer has to be divisible by 32 bytes
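A minimal sketch of the per-worker, NUMA-local SHM buffer allocation described in item 1. The segment naming, buffer size constant, and the use of libnuma's numa_tonode_memory are illustrative assumptions, not the PR's actual implementation:

```cpp
// Hypothetical sketch: each rank creates its own SHM segment and binds it
// to its local NUMA node, instead of rank 0 creating buffers for everyone.
#include <fcntl.h>
#include <numa.h>       // libnuma; link with -lnuma
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <string>

constexpr size_t BUF_SIZE = 32 * 1024 * 1024;  // 32MB per worker (was 1MB)

void* create_numa_local_shm(int rank, int numa_node) {
    // Each rank owns one segment; peers can open it read/write by name.
    std::string name = "/allreduce_buf_rank" + std::to_string(rank);
    int fd = shm_open(name.c_str(), O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return nullptr; }
    if (ftruncate(fd, BUF_SIZE) != 0) { perror("ftruncate"); return nullptr; }

    void* buf = mmap(nullptr, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    if (buf == MAP_FAILED) { perror("mmap"); return nullptr; }

    // Ask the kernel to place the (not yet touched) pages on this worker's
    // NUMA node, so local reductions hit local memory bandwidth.
    if (numa_available() >= 0) {
        numa_tonode_memory(buf, BUF_SIZE, numa_node);
    }
    return buf;
}
```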

The new distributed algorithm, combined with the larger per-worker SHM buffer, brings a ~3x allreduce performance improvement for a 32MB message size on a 2-socket machine.
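For illustration, here is a sketch of the kind of distributed reduction the description refers to for messages > 1MB: the message is split into per-rank chunks, each rank reduces only its own chunk across all workers' SHM buffers, and every rank then gathers the other chunks back. Synchronization between phases is omitted, and the buffer layout and function name are assumptions rather than the PR's actual code:

```cpp
// Hypothetical sketch of a chunked, all-workers-participate allreduce over
// per-rank SHM buffers (float data, sum reduction). Barriers are elided.
#include <algorithm>
#include <cstring>
#include <vector>

void shm_allreduce_chunked(float* data, size_t count,
                           int rank, int world_size,
                           const std::vector<float*>& shm_bufs) {
    // 1. Publish: each rank copies its full input into its own SHM buffer.
    std::memcpy(shm_bufs[rank], data, count * sizeof(float));
    // ... barrier: wait until every rank has published its data ...

    // 2. Reduce-scatter: rank r sums only chunk r across all buffers, so
    //    memory bandwidth and compute are spread evenly across workers.
    size_t chunk = (count + world_size - 1) / world_size;
    size_t begin = std::min(static_cast<size_t>(rank) * chunk, count);
    size_t end   = std::min(begin + chunk, count);
    for (size_t i = begin; i < end; ++i) {
        float sum = 0.0f;
        for (int r = 0; r < world_size; ++r) sum += shm_bufs[r][i];
        shm_bufs[rank][i] = sum;  // reduced chunk lives in the owner's buffer
    }
    // ... barrier: wait until every chunk has been reduced ...

    // 3. All-gather: copy each rank's reduced chunk into the local output.
    for (int r = 0; r < world_size; ++r) {
        size_t b = std::min(static_cast<size_t>(r) * chunk, count);
        size_t e = std::min(b + chunk, count);
        if (b < e) std::memcpy(data + b, shm_bufs[r] + b, (e - b) * sizeof(float));
    }
}
```

Compared with a rank-0-only reduction, this spreads both the summation work and the memory traffic across sockets, which is where the reported ~3x gain for 32MB messages comes from.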

@loadams loadams added this pull request to the merge queue Apr 4, 2024
Merged via the queue into microsoft:master with commit 731fd68 Apr 4, 2024
12 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024