Question: Can't Load ZeRO3 model with Engine.load_checkpoint() #1394

Closed · skpig opened this issue Sep 24, 2021 · 13 comments

skpig (Contributor) commented Sep 24, 2021

I used Engine.save_checkpoint to save my ZeRO3 model_engine, but when I load it with Engine.load_checkpoint() I get the runtime error below:

Traceback (most recent call last):
  File "train.py", line 329, in <module>
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1919, in load_checkpoint
    load_module_only=load_module_only)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1969, in _load_checkpoint
    strict=load_module_strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1819, in load_module_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BartForConditionalGeneration:
        size mismatch for model.encoder.embed_positions.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1026, 768]).
        ......
        size mismatch for model.decoder.layernorm_embedding.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).

I'm using DeepSpeed ZeRO3 to train my BART model (implemented by Hugging Face's transformers) with 4 GPUs (deepspeed --num_gpus=4 train.py --deepspeed --deepspeed_config config/ds_config.json).
Here is my code. (To simplify the question, I skip all the training code and only test the save & load functions.)

    # imports needed to run this snippet standalone
    import argparse
    import os

    import deepspeed
    from transformers import BartForConditionalGeneration

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=-1, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    args.local_rank = int(os.environ['LOCAL_RANK'])

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [{
        'params':
            [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        'weight_decay':
            0.01
    }, {
        'params':
            [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        'weight_decay':
            0.0
    }]

    model_engine, optimizer, _, scheduler = deepspeed.initialize(args=args, model=model,
                                                                 model_parameters=optimizer_grouped_parameters)
    model_engine.save_checkpoint("Model/tmp", tag="ckpt")
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")

And here is my ds_config.json

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-05,
            "weight_decay": 0.01
        }
    },

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-05,
            "warmup_num_steps": 400,
            "total_num_steps": 9000
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_clipping": 0.1,
    "train_micro_batch_size_per_gpu": 2,
    "train_batch_size": 32,
    "wall_clock_breakdown": false
}
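
(For reference: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, so with the 4-GPU launch above, 32 = 2 × 4 × 4, i.e. 4 gradient-accumulation steps per GPU.)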

I'm new to DeepSpeed and not familiar with every detail of ZeRO3, so please help me solve this problem. Thanks a lot!

tjruwase (Contributor) commented Oct 1, 2021

@skpig, sorry for the late response. Is this still an issue? Thanks.

tjruwase self-assigned this Oct 1, 2021

skpig (Contributor, Author) commented Oct 2, 2021

Yes, the issue still exists with the most recent commit (30965ea).

tjruwase (Contributor) commented Oct 2, 2021

@skpig, thanks for confirming. As you probably noticed, a number of related PRs are in the process of being merged by early next week. We can then evaluate whether the issue remains.

tjruwase (Contributor) commented Oct 4, 2021

@skpig, we have just merged a bunch of checkpoint PRs. Can you please check again? Thanks.

stas00 (Collaborator) commented Oct 4, 2021

@skpig, please follow the instructions here to set up HF transformers to do the right thing during training if you're not using the HF Trainer:
https://huggingface.co/transformers/master/main_classes/deepspeed.html#non-trainer-deepspeed-integration
Let me know if you run into any problems.

skpig (Contributor, Author) commented Oct 7, 2021

@tjruwase The issue still exists at the current commit (d8e9ef6). And thanks for the reminder, @stas00.

My test code:

import deepspeed
from transformers import BartForConditionalGeneration
from transformers.deepspeed import HfDeepSpeedConfig

# `args` comes from the same argparse setup as in the original post
ds_config = "config/ds_config.json"
dschf = HfDeepSpeedConfig(ds_config)
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [{
    'params':
        [p for n, p in param_optimizer],
    'weight_decay':
        0.01
}]
model_engine, optimizer, _, scheduler = deepspeed.initialize(args=args, model=model, config_params=ds_config,
                                                             model_parameters=optimizer_grouped_parameters)
model_engine.save_checkpoint("Model/tmp", tag="ckpt")
path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
Traceback:
Traceback (most recent call last):
  File "train.py", line 341, in <module>
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 1987, in load_checkpoint
    load_module_only=load_module_only)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 2029, in _load_checkpoint
    strict=load_module_strict)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 1887, in load_module_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BartForConditionalGeneration:
	size mismatch for model.encoder.embed_positions.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1026, 768]).
	size mismatch for model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for model.encoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).
	... (the same torch.Size([1]) size mismatch is reported for every remaining encoder and decoder parameter) ...
	size mismatch for model.decoder.layernorm_embedding.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).
ds_config.json: identical to the config in the original post above.

stas00 (Collaborator) commented Oct 7, 2021

"copying a param with shape torch.Size([1])"

means that ZeRO-3 hasn't gathered the param from the GPUs before using it.
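
To illustrate, here is a minimal sketch (not from the thread) of what stage 3 does to parameters, assuming model_engine is a ZeRO-3 engine like the one in your script:

import deepspeed

param = next(model_engine.module.parameters())
print(param.shape)     # torch.Size([1]) -- the per-rank placeholder after partitioning
print(param.ds_shape)  # the real shape DeepSpeed records for the parameter

# gathering temporarily restores the full tensor on each rank
with deepspeed.zero.GatheredParameters([param]):
    print(param.shape)  # full shape while inside the context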

Thank you for trying the HfDeepSpeedConfig setup.

Does the problem go away if you don't use param groups, i.e. just pass the parameters to the optimizer normally?

skpig (Contributor, Author) commented Oct 7, 2021

No, it doesn't. The traceback is the same as before with model_parameters=model.parameters().

stas00 (Collaborator) commented Oct 7, 2021

OK, I was able to reproduce the issue with the help of your script after adding some missing bits, and I will have a look now. Will keep you posted.

The full code to reproduce the issue is below (the config is from #1394 (comment)):

# test.py
from transformers import BartForConditionalGeneration
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed

ds_config = "ds_config.json"
mname = "sshleifer/bart-tiny-random"

dschf = HfDeepSpeedConfig(ds_config)
model = BartForConditionalGeneration.from_pretrained(mname)
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [{
    'params':
        [p for n, p in param_optimizer],
    'weight_decay':
        0.01
}]

model_engine, optimizer, _, scheduler = deepspeed.initialize(args=None, model=model, config_params=ds_config,
                                                             model_parameters=optimizer_grouped_parameters)
model_engine.save_checkpoint("Model/tmp", tag="ckpt")
path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")

and run with just:

deepspeed test.py

I have only 2 GPUs, and I'm using a tiny model to speed up debugging.

stas00 (Collaborator) commented Oct 7, 2021

OK, I figured it out.

It appears that the DeepSpeed engine wasn't designed to do save and load on the same engine instance. It was designed to save repeatedly in process A and then load in a new process when it restarts.

So load_checkpoint tries to load into a model that has already been partitioned/used, and it fails: the saved model is correct, with normal shapes, but during load the weights are copied into the placeholder ds params, which are of torch.Size([1]).

Here is a working workaround:

from transformers import BartForConditionalGeneration
import deepspeed

ds_config = "ds_config.json"
mname = "sshleifer/bart-tiny-random"

def my_init():
    model = BartForConditionalGeneration.from_pretrained(mname)
    param_optimizer = list(model.named_parameters())
    optimizer_grouped_parameters = [{
        'params':
            [p for n, p in param_optimizer],
        'weight_decay':
            0.01
    }]
    model_engine, optimizer, _, scheduler = deepspeed.initialize(args=None, model=model, config_params=ds_config,
                                                                 model_parameters=optimizer_grouped_parameters)
    return model_engine


model_engine = my_init()
model_engine.save_checkpoint("Model/tmp")

model_engine = my_init()
path, state = model_engine.load_checkpoint("Model/tmp")

The culprit is that model_engine.module is partitioned under ZeRO-3, and load_checkpoint wants a pristine model.

In theory this should have worked as a workaround:

with deepspeed.zero.GatheredParameters(list(model_engine.parameters(recurse=True))):
    print(model_engine.module.model.shared.weight.shape)
    print(model_engine.module.model.decoder.layers[1].fc1.bias.shape)
    path, state = model_engine.load_checkpoint("Model/tmp")
# prints torch.Size([50265, 24])
# prints torch.Size([16])

but it doesn't. You can see that the model's weights are gathered correctly, yet model_engine still sees the old partitioned model.

I suspect it is not the partitioning that is the issue but something else: first, because this second workaround didn't work, and second, because under zero.Init my_init already returns a partitioned model. So some object inside DeepSpeed needs to be reset for load_checkpoint to work once the model has been "used".

I will let @tjruwase comment on why this is so as I haven't designed this engine.

(Sidenote: since HF integrated DeepSpeed, the DeepSpeed team has been repeatedly surprised that their awesome tool gets used in dozens of ways they hadn't originally envisioned. So this expanding adoption is a blessing and a curse at the same time.)

tjruwase (Contributor) commented Oct 8, 2021

@stas00, thanks for resolving this. Your analysis is correct: the engine is currently not re-startable mid-flight, as we did not envision this use. We will put this on our TODO.

@samyam, @jeffra FYI

skpig (Contributor, Author) commented Oct 17, 2021

Sorry for the late response. I guess it is not an important feature, but maybe the documentation should highlight this issue and indicate the right way to use engine.load_checkpoint, since it is a little unclear for beginners like me. BTW, is the right way to load a checkpoint, then, to reinitialize another engine and use that for loading? @tjruwase @stas00

stas00 (Collaborator) commented Oct 17, 2021

I totally agree, @skpig; it indeed should be documented.

If you'd like, you could make a PR adding a note explaining this limitation somewhere in the docstring here:
https://github.com/microsoft/DeepSpeed/blob/1fc74cb9c81668b5ff0046446f8004d4cf8dc2d5/deepspeed/runtime/engine.py#L1997
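
A sketch of the kind of note that could go there (hypothetical wording, not the actual docstring; the signature is abbreviated from the engine code):

def load_checkpoint(self, load_dir, tag=None, load_module_strict=True,
                    load_optimizer_states=True, load_lr_scheduler_states=True,
                    load_module_only=False):
    """Load a training checkpoint.

    Note (ZeRO-3): load_checkpoint() expects a freshly initialized engine.
    Calling it on an engine whose module has already been partitioned and
    used in the same process (e.g. right after save_checkpoint()) fails
    with size-mismatch errors, because the live parameters are per-rank
    placeholders of torch.Size([1]). Re-create the engine with
    deepspeed.initialize() before loading.
    """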
