I have recently been trying to run OPT models on MII and ran into memory issues. The model I used is facebook/opt-13b. The mii-config and deployment parameters are like this:
The checkpoint is already downloaded into the model_path. Since the opt-13b checkpoint is around 26 GB, I assumed it would fit on a machine with 4 x V100 GPUs and 224 GB of memory. However, during loading (even before the server started), MII reported that the server crashed, and it exited quietly. I then checked memory usage and was surprised to find that MII had used up all 224 GB. So my question is: why does MII consume several times more memory than the checkpoint size? Is there any configuration to change this behavior?
@larry-fuy out of curiosity, have you tried adding load_with_sys_mem: True to the config? It may help, as it loads the model into system memory first and then lets deepspeed.init_inference take care of moving the model to GPU memory.
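For illustration, here is a minimal sketch of what that config might look like as a Python dict. Only the `load_with_sys_mem` key comes from this thread; the `tensor_parallel` key and its value of 4 (one per V100) are assumptions, since the original config was not captured here.

```python
# Hypothetical mii_config for this deployment. Only "load_with_sys_mem"
# is taken from the suggestion above; the rest is assumed.
mii_config = {
    "tensor_parallel": 4,       # assumption: one shard per V100
    "load_with_sys_mem": True,  # load into system memory before moving to GPU
}

# Assumption: the dict would be passed to the deployment call, e.g.
# mii.deploy(..., mii_config=mii_config), alongside the model_path
# that already holds the downloaded checkpoint.
print(mii_config)
```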
@aponte411 this is part of the solution. @larry-fuy you also have "dtype": "fp32", but the facebook/opt-13b checkpoint is stored in fp16, so you are doubling the size of the model by not running in half precision.
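The arithmetic behind that doubling can be sketched in a few lines. This is a rough estimate only: it assumes about 13 billion parameters for opt-13b, counts weights alone, and ignores any extra copies or temporary buffers made during loading. The helper name `weights_gb` is made up for this example.

```python
# Rough weight-memory estimate per precision (weights only, no overhead).
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}

def weights_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 13e9  # assumption: approximate parameter count of opt-13b

for dtype in ("fp16", "fp32"):
    print(f"{dtype}: ~{weights_gb(NUM_PARAMS, dtype):.0f} GB")
```

The fp16 figure matches the roughly 26 GB checkpoint mentioned in the question, and fp32 is twice that. This alone does not account for the full 224 GB, which is consistent with the dtype being only part of the solution.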
@mrwyattii Yes, facebook/opt-13b is stored in fp16, so I fixed that. @aponte411 I tried load_with_sys_mem, but the issue was still there. I eventually fixed it by upgrading transformers to the latest version, so I guess the issue is in transformers rather than MII.