ZeRO-Inference across multiple GPUs #2335
-
Hey everyone, awesome launch with ZeRO-Inference. Reading the [blog post](https://www.deepspeed.ai/2022/09/09/zero-inference.html#optimizations), specifically the part on parallelizing layer fetching across multiple GPUs, I was wondering: if a single layer still does not fit into one GPU, can the second phase of layer fetching be skipped, i.e. the assembling of all pieces of the given layer onto each GPU? The whole reason tensor-parallel inference exists is that, for these bigger models, some individual layers don't fit into one GPU, correct? So in practice, the question I'm asking is whether ZeRO-Inference integrates with DeepSpeed's model-parallel capabilities.
-
Thanks, this is a great question. In short, yes: ZeRO-Inference integrates with model-parallel capabilities, specifically tensor slicing, which splits each layer across multiple GPUs.
More broadly, ZeRO-Inference does not attempt to detect whether layer-by-layer computation fits in the available GPU memory, so out-of-memory (OOM) errors can occur. Note that OOMs can also be triggered by the batch size and the token cache, not just by layer size. There are therefore multiple remedies, including reducing the batch size and enabling tensor slicing.
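For concreteness, here is a rough sketch of the kind of setup being discussed: ZeRO stage 3 with CPU parameter offload for layer-by-layer streaming, with tensor slicing noted at the end as the fallback when a single layer is too large for one GPU. The model name, batch size, and exact config values are illustrative placeholders, not the official example script.

```python
# Rough sketch of a ZeRO-Inference-style setup (illustrative only; model name,
# batch size, and config values are placeholders, not the official example).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the blog post targets much larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# ZeRO stage 3 with parameter offload: weights are kept in CPU (or NVMe) memory
# and streamed to the GPU one layer at a time during the forward pass.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required config field, unused for inference
}

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

prompt = "ZeRO-Inference streams layers from host memory so that"
inputs = tokenizer(prompt, return_tensors="pt").to(engine.device)
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# If a single layer still does not fit on one GPU, the model can instead be
# wrapped with DeepSpeed's tensor-slicing inference engine, which splits each
# layer's weights across GPUs, e.g.:
#   model = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16,
#                                    replace_with_kernel_inject=True)
```

A script like this would be launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus 2 your_script.py`. If the OOM comes from the batch size or the token cache rather than the layer weights, shrinking the batch or the number of generated tokens is the simpler fix.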