Hi, I read the DeepSpeed docs and have the following questions:
(1) What's the difference between these methods for LLM inference? (A rough sketch of each follows the list.)
a. `deepspeed.initialize`, then write my own code to generate text
b. `deepspeed.init_inference`, then write my own code to generate text
c. use DeepSpeed-MII for inference
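To make the question concrete, here is a minimal sketch of what I mean by each option. The model name is a placeholder, and the exact argument and API names (e.g. `replace_with_kernel_inject`, `mii.pipeline`) may vary across DeepSpeed/MII versions:

```python
# Sketch of the three call patterns I'm asking about. Model name is a
# placeholder; argument names may differ across DeepSpeed/MII versions.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint

def option_a_initialize():
    # (a) deepspeed.initialize: the general (training-oriented) engine;
    # I would then write my own generation loop against engine.module.
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
    engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config.json")
    engine.module.eval()
    return engine

def option_b_init_inference():
    # (b) deepspeed.init_inference: the dedicated inference engine
    # (kernel injection, tensor parallelism); generate() works as usual.
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
    return deepspeed.init_inference(
        model,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )

def option_c_mii():
    # (c) DeepSpeed-MII: a higher-level wrapper that handles loading and
    # serving; assumes the newer mii.pipeline API.
    import mii
    pipe = mii.pipeline(MODEL)
    return pipe(["What is DeepSpeed?"], max_new_tokens=64)
```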
(2) Which of them is memory-friendly? For example, I want to run inference on a 70B model; which of them support model parallelism that shards the model's parameters across GPUs? (A sketch of what I'd try is below.)
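For (2), my understanding is that `init_inference` can shard the weights across GPUs via tensor parallelism, so each GPU holds roughly 1/tp_size of the parameters. Here's roughly what I'd try, assuming a `deepspeed --num_gpus 2` launch and the `tensor_parallel` config argument (older releases use `mp_size` instead):

```python
# Launch with: deepspeed --num_gpus 2 infer.py
# Assumes the tensor_parallel argument; older versions use mp_size instead.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

# Shard the model's parameters across the visible GPUs (tensor parallelism).
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

device = torch.device(f"cuda:{torch.cuda.current_device()}")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(device)
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```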
(3) What's the current best practice for inference with a 70B Llama? (A config sketch for these options follows the list.)
a. ZeRO-3 + CPU offload (1× A100)
b. ZeRO-3 (2× A100)
...
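For option (a), this is the kind of ZeRO-3 config I have in mind; a minimal sketch, assuming inference goes through `deepspeed.initialize` with stage-3 parameter partitioning and CPU offload:

```python
# Sketch of option (a): ZeRO-3 with parameter offload to CPU on a single A100.
# Minimal config; exact fields per the DeepSpeed ZeRO config docs.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-70b-hf"  # placeholder

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required even for inference-only use
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                 # partition parameters (ZeRO-3)
        "offload_param": {          # option (a): keep params in CPU RAM
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
# For option (b), drop "offload_param" and launch on two GPUs instead:
#   deepspeed --num_gpus 2 infer_zero3.py

# NB: loading 70B weights before initialize() materializes them in CPU RAM;
# in practice one may need deepspeed.zero.Init() (or the HF integration)
# so parameters are partitioned as they are created.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()
```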
Thank you!