ZeRO-Inference across multiple GPUs #2335
-
Hey everyone, awesome launch with ZeRO-Inference. Reading the [blog post](https://www.deepspeed.ai/2022/09/09/zero-inference.html#optimizations), specifically the part on parallelizing layer fetching across multiple GPUs, I was wondering: if a single layer still does not fit into one GPU, can the second phase of layer fetching be skipped, i.e. the assembling of all pieces of the given layer onto each GPU? The whole reason tensor-parallel inference exists is that, for these bigger models, some individual layers don't fit into one GPU, correct? So in practice, the question I'm asking is whether ZeRO-Inference integrates with DeepSpeed's model-parallel capabilities.
-
Thanks, this is a great question. In short, yes: ZeRO-Inference integrates with model-parallel capabilities, specifically tensor slicing, which splits each layer across multiple GPUs.
More broadly, ZeRO-Inference does not attempt to detect whether layer-by-layer computation fits in the available GPU memory, so out-of-memory (OOM) errors can occur. Note that OOMs can also be triggered by the batch size and the token cache, not just by layer size. There are therefore multiple remedies, including reducing the batch size and enabling tensor slicing.
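For concreteness, here is a rough sketch of the kind of setup being discussed: ZeRO stage 3 with CPU parameter offload for layer-by-layer streaming, with tensor slicing noted at the end as the fallback when a single layer is too large for one GPU. The model name, batch size, and exact config values are illustrative placeholders, not the official example script.

```python
# Rough sketch of a ZeRO-Inference-style setup (illustrative only; model name,
# batch size, and config values are placeholders, not the official example).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the blog post targets much larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# ZeRO stage 3 with parameter offload: weights are kept in CPU (or NVMe) memory
# and streamed to the GPU one layer at a time during the forward pass.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required config field, unused for inference
}

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

prompt = "ZeRO-Inference streams layers from host memory so that"
inputs = tokenizer(prompt, return_tensors="pt").to(engine.device)
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# If a single layer still does not fit on one GPU, the model can instead be
# wrapped with DeepSpeed's tensor-slicing inference engine, which splits each
# layer's weights across GPUs, e.g.:
#   model = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16,
#                                    replace_with_kernel_inject=True)
```

A script like this would be launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus 2 your_script.py`. If the OOM comes from the batch size or the token cache rather than the layer weights, shrinking the batch or the number of generated tokens is the simpler fix.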