I tried HF OPT-13b on a 4-GPU machine with `tensor-parallel: 4`. One observation is that all GPUs used the same amount of memory (~25G), which is consistent with other users' reports. I also found the memory usage is the same as with `tensor-parallel: 2`. So my question is whether the model is split after it is loaded into CPU memory, as described in this thread? My understanding is that per-GPU memory should be a quarter of the full model when it is split with `tensor-parallel: 4` and half with `tensor-parallel: 2`.
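Here's the back-of-the-envelope math behind my expectation (assuming fp16 weights and ~13B parameters; activations and caches would add on top of this):

```python
# Expected per-GPU weight memory under tensor parallelism
# (assumption: fp16 weights, ~13B parameters for OPT-13b).
params = 13e9          # approximate parameter count of OPT-13b
bytes_per_param = 2    # fp16

total_gib = params * bytes_per_param / 1024**3
for tp in (1, 2, 4):
    print(f"tensor-parallel: {tp} -> ~{total_gib / tp:.1f} GiB of weights per GPU")
# tensor-parallel: 1 -> ~24.2 GiB
# tensor-parallel: 2 -> ~12.1 GiB
# tensor-parallel: 4 -> ~6.1 GiB
```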
By the way, I also didn't see any real latency reduction when increasing the tensor-parallel degree (the latency differs by only 2-3 ms).
Hi @larry-fuy, thanks for using MII! I'll assume you're checking memory with `nvidia-smi` - if that's the case, you are likely seeing that the total memory usage per GPU includes cached memory that can be freed. This is due to how we load and split the model across GPUs. I've created a PR that empties the torch cache after splitting the model, so the correct amount of memory usage is now reported. Please try it out: #121
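To illustrate the difference, here is a small sketch using standard `torch.cuda` introspection (not MII-specific code):

```python
import torch

# nvidia-smi reports reserved (cached) memory, which after sharding can
# be much larger than the memory actually allocated for the model shard.
alloc_gib = torch.cuda.memory_allocated() / 1024**3
reserved_gib = torch.cuda.memory_reserved() / 1024**3
print(f"allocated: {alloc_gib:.1f} GiB, reserved: {reserved_gib:.1f} GiB")

# Releasing unused cached blocks makes nvidia-smi reflect the true usage.
torch.cuda.empty_cache()
print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")
```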
As for the latency, could you share some more details to help me understand your setup?
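Something like the following timing loop would help us compare numbers (a minimal sketch; `query` is a hypothetical stand-in for however you invoke the deployment, not an actual MII API):

```python
import time

def measure_latency(query, prompt, warmup=3, iters=20):
    # Warm-up runs exclude one-time costs (CUDA kernel compilation, caches).
    for _ in range(warmup):
        query(prompt)
    start = time.perf_counter()
    for _ in range(iters):
        query(prompt)
    return (time.perf_counter() - start) / iters  # mean seconds per request
```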