Memory issue when loading OPT #117

larry-fuy · 2022-12-09T07:50:58Z

I recently tried to run OPT models on MII and ran into memory issues. The model is facebook/opt-13b. The MII config and deployment parameters are:

import mii

mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 4,
}

name = "facebook/opt-13b"

mii.deploy(task='text-generation',
           model=name,
           deployment_name=name + "_deployment",
           model_path='/root/ckpt/opt_13b/mii',
           mii_config=mii_configs)

The checkpoint is already downloaded to model_path. Since the opt-13b checkpoint is around 26 GB, I expected it to work on a machine with 4 x V100 GPUs and 224 GB of memory. But during loading (even before the server started), MII reported that the server had crashed, and it exited quietly. I then checked memory usage and was surprised to find that MII had used up all 224 GB. So my question is: why does MII consume several times more memory than the checkpoint size? Is there any configuration that changes this behavior?

aponte411 · 2022-12-09T18:59:23Z

@larry-fuy out of curiosity, have you tried adding load_with_sys_mem: True to the config? It may help, as it loads the model into system memory and then lets deepspeed.init_inference take care of moving the model to GPU memory.
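For concreteness, a minimal sketch of what that suggestion might look like when applied to the config from the original post. The key name is taken as written in this comment and has not been checked against a specific MII release; the other two values are copied unchanged from the post.

```python
# Config from the original post, plus the suggested flag.
mii_configs = {
    "dtype": "fp32",            # unchanged from the original post
    "tensor_parallel": 4,       # unchanged from the original post
    "load_with_sys_mem": True,  # suggested above: load into system memory first
}
```

The dict would then be passed to mii.deploy through mii_config, exactly as in the original call.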

mrwyattii · 2022-12-09T19:26:15Z

@aponte411 this is part of the solution. @larry-fuy you also have "dtype": "fp32", but the facebook/opt-13b checkpoint is stored in fp16. So you are doubling the size of the model by not running in half precision.
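A rough back-of-the-envelope sketch of the arithmetic behind this point. It assumes roughly 13e9 parameters and counts only raw weights. The last line is a hypothetical scenario (each of the 4 tensor-parallel ranks holding a full fp32 copy in host memory before sharding) that this thread does not confirm; it is included only to show how the total could approach 224 GB.

```python
# Back-of-the-envelope host-memory estimate for the weights of OPT-13B.
PARAMS = 13e9  # assumption: "13b" taken at face value
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}


def weights_gb(dtype: str, copies: int = 1) -> float:
    """Raw weight size in GB (10**9 bytes) for `copies` full replicas."""
    return PARAMS * BYTES_PER_PARAM[dtype] * copies / 1e9


print(weights_gb("fp16"))            # checkpoint as stored: 26.0
print(weights_gb("fp32"))            # after upcasting to fp32: 52.0
print(weights_gb("fp32", copies=4))  # hypothetical, one full copy per rank: 208.0
```

Activations, framework overhead, and temporary buffers during loading are not counted, so real usage would be higher than these figures.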

larry-fuy · 2022-12-12T17:05:50Z

@mrwyattii Yes, facebook/opt-13b is stored in fp16, so I fixed this. @aponte411 I tried load_with_sys_mem, but the issue was still there. I fixed it by upgrading transformers to the latest version, so I guess it is an issue in transformers rather than MII.

larry-fuy closed this as completed Dec 12, 2022
