
ZeRO-Inference across multiple GPUs #2335

Answered by tjruwase
joaopcm1996 asked this question in Q&A
Thanks, this is a great question. In short, yes: ZeRO-Inference integrates with model-parallel capabilities, specifically tensor slicing, which splits each layer across multiple GPUs.
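To make "tensor slicing splits layers across multiple GPUs" concrete, here is a minimal, framework-free sketch of a column-parallel linear layer: the weight matrix is split column-wise into shards (one per hypothetical GPU), each shard computes its partial output independently, and concatenating the partials reproduces the full result. Names and shapes are illustrative, not DeepSpeed's actual implementation.

```python
def matmul(a, b):
    # Naive dense matmul: a is m x k, b is k x n.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(w, parts):
    # Split the weight matrix column-wise into `parts` shards,
    # standing in for one shard per GPU.
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]                    # activations (batch of 1)
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]          # full 2 x 4 weight matrix

shards = split_columns(w, parts=2)            # each "GPU" holds a 2 x 2 shard
partials = [matmul(x, s) for s in shards]     # computed independently per shard
y_sliced = [sum((p[0] for p in partials), [])]  # concatenate column blocks
y_full = matmul(x, w)
assert y_sliced == y_full  # sliced result matches the unsplit layer
```

Because each shard's matmul touches only its slice of the weights, per-GPU memory for that layer shrinks roughly by the slicing degree.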

More broadly, ZeRO-Inference does not attempt to detect whether layer-by-layer computation fits in available GPU memory, so out-of-memory (OOM) errors can still occur. Note that OOM can also result from large batch sizes or token (KV) caching. There are therefore multiple remedies for such OOMs, including reducing the batch size and enabling tensor slicing.
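The trade-off above can be sketched with back-of-the-envelope arithmetic: per-GPU memory for one layer is roughly its weight shard plus its share of the KV cache, so either shrinking the batch or raising the tensor-slicing degree reduces the footprint. All numbers and the memory model below are hypothetical simplifications, not DeepSpeed's actual accounting.

```python
def layer_mem_gb(hidden, batch, seq_len, tp_degree, bytes_per_elem=2):
    # Rough per-GPU memory for one transformer layer (hypothetical model):
    # ~8 * hidden^2 MLP weight parameters, sharded by tensor slicing...
    weights = 8 * hidden * hidden * bytes_per_elem / tp_degree
    # ...plus K and V cache entries, which grow with batch and sequence length.
    kv_cache = 2 * batch * seq_len * hidden * bytes_per_elem / tp_degree
    return (weights + kv_cache) / 1e9

# Illustrative 175B-class hidden size with a large batch and long sequences.
baseline = layer_mem_gb(12288, batch=64, seq_len=2048, tp_degree=1)
smaller_batch = layer_mem_gb(12288, batch=8, seq_len=2048, tp_degree=1)
sliced = layer_mem_gb(12288, batch=64, seq_len=2048, tp_degree=8)
```

Under these assumptions the baseline needs several times more memory per GPU than either remedy, which is why both reducing batch size and enabling tensor slicing are listed as fixes.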

Answer selected by joaopcm1996