fail to load checkpoints after zero3 initialize #3574

Stick-To · 2023-05-19T03:01:14Z

  File "/home/lxr/workspace/t5_train/esm5.py", line 66, in <module>
    engine.load_module_state_dict(torch.load("weight/mt5-small/pytorch_model.bin"))
  File "/home/lxr/workspace/t5_train/deepspeed/runtime/engine.py", line 2421, in load_module_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for MT5ForConditionalGeneration:
        size mismatch for shared.weight: copying a param with shape torch.Size([250112, 512]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([250112, 512]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.SelfAttention.v.weight: copying a param with shape torch.Size([384, 512]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.SelfAttention.o.weight: copying a param with shape torch.Size([512, 384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight: copying a param with shape torch.Size([32, 6]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for encoder.block.0.layer.0.layer_norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([0]).
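
Every message above reports the current shape as torch.Size([0]). That pattern is consistent with ZeRO stage 3 having already partitioned the parameters, which leaves a zero-size placeholder on each rank. The following is a minimal sketch for checking this, not a confirmed diagnosis of this report. It assumes that partitioned parameters expose their full shape in a `ds_shape` attribute; the helper name `find_partitioned_params` is hypothetical.

```python
def find_partitioned_params(named_params):
    """Return {name: full_shape} for parameters that look ZeRO-3 partitioned.

    Assumption: a partitioned parameter has local shape (0,) and keeps its
    full shape in a `ds_shape` attribute.
    """
    partitioned = {}
    for name, param in named_params:
        local_shape = tuple(param.shape)
        full_shape = getattr(param, "ds_shape", None)
        if local_shape == (0,) and full_shape is not None:
            partitioned[name] = tuple(full_shape)
    return partitioned
```

It could be called as `find_partitioned_params(engine.module.named_parameters())`. A non-empty result would suggest that a plain state dict cannot be copied into the model as-is.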

dittops · 2023-06-01T06:40:57Z

@Stick-To I'm also facing this issue. Could you please share how you resolved it?

tjruwase · 2023-06-01T17:45:08Z

@dittops, please re-open and share repro steps including a stack trace. Thanks!

Alchemy5 · 2023-06-11T18:50:48Z

I'm also facing this issue; any help would be great!

iamsile · 2023-06-13T15:15:51Z

I'm also facing this issue. Can anyone help?

Alchemy5 · 2023-06-13T16:20:15Z

I actually found a solution to my problem in the Saving and Loading section of this article: https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading

yix-chen · 2023-07-17T16:56:56Z

@Stick-To I'm also facing this issue. Could you please share how you resolved it?

Stick-To · 2023-08-26T01:43:31Z

I have not solved it.

Zx55 · 2023-11-29T05:02:33Z

One possible cause is a conflict between two initializations: the HF DeepSpeed integration and an explicit call to "deepspeed.zero.Init()". I solved this by following here.

ZeyuLing · 2024-05-13T14:18:33Z

I hit the same problem and solved it by reinitializing. You can keep a deepcopy of the original model before you wrap it with deepspeed.initialize. Load your checkpoint into that original model, then run deepspeed.initialize on the loaded model.
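
The ordering described above can be sketched as follows. This is a rough illustration, not a verified fix for the original report. It assumes the model was built without ZeRO-3 partitioning, so its parameters still have their full shapes when the checkpoint is loaded. The helper name `load_then_initialize` and the `initialize_fn` parameter are hypothetical, and `ds_config` stands for whatever DeepSpeed config you already use.

```python
def load_then_initialize(model, state_dict, ds_config, initialize_fn=None):
    """Load full weights into an unwrapped model, then hand it to DeepSpeed."""
    # Step 1: the model is still unpartitioned, so parameter shapes match
    # the checkpoint and load_state_dict can copy them directly.
    model.load_state_dict(state_dict)

    if initialize_fn is None:
        # Imported lazily so the helper can be exercised without DeepSpeed.
        import deepspeed
        initialize_fn = deepspeed.initialize

    # Step 2: DeepSpeed wraps (and, with stage 3, partitions) the model
    # that already holds the loaded weights.
    engine, _optimizer, _dataloader, _scheduler = initialize_fn(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine
```

With the paths from the original report, `state_dict` would come from `torch.load("weight/mt5-small/pytorch_model.bin")`, and the returned engine replaces the one that previously failed in `load_module_state_dict`.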

samadejacobs added the training label May 19, 2023

samadejacobs assigned GuanhuaWang May 19, 2023

Stick-To closed this as completed May 27, 2023

lucasjinreal mentioned this issue Mar 7, 2024

Ablation study on using just single path encoder? luogen1996/LLaVA-HR#1

Open
