DeepSpeed Comm. Backend v1 #1985

Merged: 24 commits, Jun 10, 2022

Conversation

awan-10 (Contributor) commented May 27, 2022

This PR introduces the DeepSpeed Comm. Backend (v1).

Current advanced communication schemes rely on mixing Python-level communication packages (e.g. torch.distributed, and mpi4py for 1-bit Adam). To simplify comms prototypes, we're looking to add support for custom communication backends within DeepSpeed, built directly on top of their respective libraries (e.g. NCCL, MPI, etc.).

This PR completes the first phase towards this goal by introducing:

  • The new comms interface deepspeed.comm
  • A complete wrapper around torch.distributed called TorchBackend for backwards-compatibility
  • A rough skeleton for custom backends that we can use for phase 2
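The layering described above (one comms interface, a default wrapper backend, and a skeleton that custom backends fill in) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Backend`, `SingleProcessBackend`, `set_backend`, `get_backend_name`, and `all_reduce` are hypothetical names chosen for the example, and the stand-in backend runs in a single process with plain Python lists instead of wrapping torch.distributed.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class Backend(ABC):
    """Hypothetical skeleton that every communication backend would fill in."""

    def __init__(self, name: str, rank: int = 0, size: int = 1) -> None:
        self.name = name
        self.rank = rank
        self.size = size

    @abstractmethod
    def all_reduce(self, values: List[float]) -> List[float]:
        """Return the element-wise sum of `values` across all ranks."""

    @abstractmethod
    def barrier(self) -> None:
        """Block until every rank reaches this point."""


class SingleProcessBackend(Backend):
    """Stand-in for a wrapper backend; an assumption made for this sketch.

    With a single rank, the sum across ranks is just the input itself.
    """

    def __init__(self) -> None:
        super().__init__(name="single_process")

    def all_reduce(self, values: List[float]) -> List[float]:
        return list(values)

    def barrier(self) -> None:
        return None


# Module-level state: callers use the functions below and never touch
# a backend object directly, so backends can be swapped in one place.
_active_backend: Optional[Backend] = None


def set_backend(backend: Backend) -> None:
    """Select the backend that the module-level functions dispatch to."""
    global _active_backend
    _active_backend = backend


def get_backend_name() -> str:
    """Return the name of the active backend."""
    if _active_backend is None:
        raise RuntimeError("no backend selected; call set_backend() first")
    return _active_backend.name


def all_reduce(values: List[float]) -> List[float]:
    """Dispatch all_reduce to whichever backend is active."""
    if _active_backend is None:
        raise RuntimeError("no backend selected; call set_backend() first")
    return _active_backend.all_reduce(values)
```

In this sketch, calling `set_backend(SingleProcessBackend())` and then `all_reduce([1.0, 2.5])` goes through the module-level function only. The point of the design is that a backend built directly on NCCL or MPI could later be substituted without changing any call sites.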

Co-authored-by: Quentin Anthony qganthony@yahoo.com
Co-authored-by: Ammar Ahmad Awan ammar.awan@microsoft.com
Co-authored-by: Jeff Rasley jerasley@microsoft.com

deepspeed/__init__.py (review thread: outdated, resolved)
deepspeed/comm/backend.py (review thread: outdated, resolved)
deepspeed/moe/sharded_moe.py (review thread: outdated, resolved)
deepspeed/runtime/engine.py (review thread: outdated, resolved)
op_builder/comm.py (review thread: outdated, resolved)
tests/comm/test.py (review thread: outdated, resolved)

awan-10 (Contributor, Author) left a comment

I reviewed this PR with Quentin. He will take care of minor comments. @jeffra, please review this one.

jeffra (Contributor) commented May 31, 2022

@SeanNaren can you take a look at the Lightning test failure? We're seeing this on other PRs and in master right now as well. It appears to be a protobuf issue. Have you seen this on your side before?

Quentin-Anthony (Contributor) commented

@jeffra and @tjruwase -- Any further comments?

deepspeed/comm/comm.py (review thread: outdated, resolved)
deepspeed/comm/comm.py (review thread: outdated, resolved)
deepspeed/comm/comm.py (review thread: outdated, resolved)
deepspeed/comm/comm.py (review thread: outdated, resolved)
deepspeed/comm/comm.py (review thread: outdated, resolved)
@microsoft microsoft deleted a comment from rocm-mici Jun 9, 2022
deepspeed/comm/comm.py (review thread: outdated, resolved)

4 participants