Feature request
Being able to split models across multiple GPUs, as the vLLM/Aphrodite engines do for LLMs.
Motivation
It would be extremely helpful to be able to split larger models across multiple GPUs.
Also, without TP, one GPU carries the model's entire VRAM footprint while the other sits mostly idle. That makes it impossible for another program to use tensor parallelism at the same time without giving up just as much VRAM on the under-utilized GPU.
Your contribution
Communicating the feature.
You typically do data-parallel style inference with sentence-transformers. TP is used when one GPU can't handle the desired batch size, or can't fit the model at all. Unless there are compelling benchmarks for bert-base, there is no need for tensor parallelism.
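To illustrate the data-parallel approach suggested above, here is a minimal, framework-free sketch: the input batch is sharded across devices and each device encodes its own shard independently. The `fake_encode` function is a stand-in for a real per-device `model.encode` call (sentence-transformers offers this pattern natively via `start_multi_process_pool()` / `encode_multi_process()`); the device names and helper functions are illustrative, not part of any library API.

```python
# Minimal sketch of data-parallel inference: shard the input batch across
# devices and let each device process its own shard independently.

def shard(items, n_shards):
    """Split items into n_shards near-equal contiguous chunks."""
    k, r = divmod(len(items), n_shards)
    out, start = [], 0
    for i in range(n_shards):
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return out

def fake_encode(batch, device):
    # Placeholder for a per-device model.encode(batch) call.
    return [f"{device}:{text}" for text in batch]

def data_parallel_encode(sentences, devices):
    # Each device gets one shard; results are concatenated in input order.
    # In practice the per-device calls would run in separate processes.
    results = []
    for device, batch in zip(devices, shard(sentences, len(devices))):
        results.extend(fake_encode(batch, device))
    return results

sentences = [f"sentence {i}" for i in range(5)]
print(data_parallel_encode(sentences, ["cuda:0", "cuda:1"]))
```

This keeps a full model replica on every GPU, which is exactly the trade-off the issue describes: it scales throughput, but does not reduce per-GPU VRAM the way tensor parallelism would.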