I tried HF OPT-13b on a 4-GPU machine with `tensor-parallel: 4`. One observation is that all GPUs used the same amount of memory (~25G), which is consistent with other users' reports. I also found the memory usage is the same as with `tensor-parallel: 2`. So my question is whether the model is split after it is loaded into CPU memory, as described in this thread? My understanding is that per-GPU memory should be a quarter of the full model when it is split with `tensor-parallel: 4` and half with `tensor-parallel: 2`.
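Here's the back-of-the-envelope math behind my expectation (assuming fp16 weights and ~13B parameters; activations and caches would add on top of this):

```python
# Expected per-GPU weight memory under tensor parallelism
# (assumption: fp16 weights, ~13B parameters for OPT-13b).
params = 13e9          # approximate parameter count of OPT-13b
bytes_per_param = 2    # fp16

total_gib = params * bytes_per_param / 1024**3
for tp in (1, 2, 4):
    print(f"tensor-parallel: {tp} -> ~{total_gib / tp:.1f} GiB of weights per GPU")
# tensor-parallel: 1 -> ~24.2 GiB
# tensor-parallel: 2 -> ~12.1 GiB
# tensor-parallel: 4 -> ~6.1 GiB
```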
By the way, I also didn't see any real latency reduction when increasing the tensor-parallel degree (the latency differs by only 2-3 ms).
Hi @larry-fuy, thanks for using MII! I'll assume you're checking memory with `nvidia-smi` - if that's the case, you are likely seeing that the total memory usage per GPU includes cached memory that can be freed. This is due to how we load and split the model across GPUs. I've created a PR that empties the torch cache after splitting the model, so the correct amount of memory usage is now reported. Please try it out: #121
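To illustrate the difference, here is a small sketch using standard `torch.cuda` introspection (not MII-specific code):

```python
import torch

# nvidia-smi reports reserved (cached) memory, which after sharding can
# be much larger than the memory actually allocated for the model shard.
alloc_gib = torch.cuda.memory_allocated() / 1024**3
reserved_gib = torch.cuda.memory_reserved() / 1024**3
print(f"allocated: {alloc_gib:.1f} GiB, reserved: {reserved_gib:.1f} GiB")

# Releasing unused cached blocks makes nvidia-smi reflect the true usage.
torch.cuda.empty_cache()
print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")
```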
As for the latency, could you share some more details to help me understand your setup?
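Something like the following timing loop would help us compare numbers (a minimal sketch; `query` is a hypothetical stand-in for however you invoke the deployment, not an actual MII API):

```python
import time

def measure_latency(query, prompt, warmup=3, iters=20):
    # Warm-up runs exclude one-time costs (CUDA kernel compilation, caches).
    for _ in range(warmup):
        query(prompt)
    start = time.perf_counter()
    for _ in range(iters):
        query(prompt)
    return (time.perf_counter() - start) / iters  # mean seconds per request
```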