Question: Can't Load ZeRO3 model with Engine.load_checkpoint() #1394

Closed · skpig opened this issue Sep 24, 2021 · 13 comments

skpig (Contributor) commented Sep 24, 2021

I used Engine.save_checkpoint to save my ZeRO3 model_engine, but when I load it with Engine.load_checkpoint() I get the runtime error below:

Traceback (most recent call last):
  File "train.py", line 329, in <module>
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1919, in load_checkpoint
    load_module_only=load_module_only)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1969, in _load_checkpoint
    strict=load_module_strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1819, in load_module_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BartForConditionalGeneration:
        size mismatch for model.encoder.embed_positions.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1026, 768]).
        ......
        size mismatch for model.decoder.layernorm_embedding.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).

I'm using DeepSpeed ZeRO3 to train my BART model (implemented by Hugging Face's transformers) with 4 GPUs (deepspeed --num_gpus=4 train.py --deepspeed --deepspeed_config config/ds_config.json).
Here is my code. (To simplify the question, I skip all the training code and only test the save & load functions.)

    # imports needed to run this snippet standalone
    import argparse
    import os

    import deepspeed
    from transformers import BartForConditionalGeneration

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=-1, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    args.local_rank = int(os.environ['LOCAL_RANK'])

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [{
        'params':
            [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        'weight_decay':
            0.01
    }, {
        'params':
            [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        'weight_decay':
            0.0
    }]

    model_engine, optimizer, _, scheduler = deepspeed.initialize(args=args, model=model,
                                                                 model_parameters=optimizer_grouped_parameters)
    model_engine.save_checkpoint("Model/tmp", tag="ckpt")
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")

And here is my ds_config.json

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-05,
            "weight_decay": 0.01
        }
    },

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-05,
            "warmup_num_steps": 400,
            "total_num_steps": 9000
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_clipping": 0.1,
    "train_micro_batch_size_per_gpu": 2,
    "train_batch_size": 32,
    "wall_clock_breakdown": false
}
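
(For reference: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, so with the 4-GPU launch above, 32 = 2 × 4 × 4, i.e. 4 gradient-accumulation steps per GPU.)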

I'm new to DeepSpeed and not familiar with every detail of ZeRO3, so please help me solve this problem. Thanks a lot!

tjruwase (Contributor) commented Oct 1, 2021

@skpig, sorry for the late response. Is this still an issue? Thanks.

tjruwase self-assigned this Oct 1, 2021

skpig (Contributor, Author) commented Oct 2, 2021

Yes, the issue still exists with the most recent commit (30965ea).

tjruwase (Contributor) commented Oct 2, 2021

@skpig, thanks for confirming. As you probably noticed, a number of related PRs are in the process of being merged by early next week. We can then evaluate whether the issue remains.

tjruwase (Contributor) commented Oct 4, 2021

@skpig, we have just merged a bunch of checkpoint PRs. Can you please check again? Thanks.

stas00 (Collaborator) commented Oct 4, 2021

@skpig, please follow the instructions here to set up HF transformers to do the right thing during training if you're not using the HF Trainer:
https://huggingface.co/transformers/master/main_classes/deepspeed.html#non-trainer-deepspeed-integration
Let me know if you run into any problems.

skpig (Contributor, Author) commented Oct 7, 2021

@tjruwase The issue still exists at the current commit (d8e9ef6). And thanks for the reminder, @stas00.

My test code:

import deepspeed
from transformers import BartForConditionalGeneration
from transformers.deepspeed import HfDeepSpeedConfig

# `args` comes from the same argparse setup as in the original post
ds_config = "config/ds_config.json"
dschf = HfDeepSpeedConfig(ds_config)
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [{
    'params':
        [p for n, p in param_optimizer],
    'weight_decay':
        0.01
}]
model_engine, optimizer, _, scheduler = deepspeed.initialize(args=args, model=model, config_params=ds_config,
                                                             model_parameters=optimizer_grouped_parameters)
model_engine.save_checkpoint("Model/tmp", tag="ckpt")
path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
Traceback:
Traceback (most recent call last):
  File "train.py", line 341, in <module>
    path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 1987, in load_checkpoint
    load_module_only=load_module_only)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 2029, in _load_checkpoint
    strict=load_module_strict)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 1887, in load_module_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BartForConditionalGeneration:
	size mismatch for model.encoder.embed_positions.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1026, 768]).
	size mismatch for model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for model.encoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).
	... (the same torch.Size([1]) size mismatch is reported for every remaining encoder and decoder parameter) ...
	size mismatch for model.decoder.layernorm_embedding.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768]).
ds_config.json: identical to the config in the original post above.

stas00 (Collaborator) commented Oct 7, 2021

"copying a param with shape torch.Size([1])"

means that ZeRO-3 hasn't gathered the param from the GPUs before using it.
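
To illustrate, here is a minimal sketch (not from the thread) of what stage 3 does to parameters, assuming model_engine is a ZeRO-3 engine like the one in your script:

import deepspeed

param = next(model_engine.module.parameters())
print(param.shape)     # torch.Size([1]) -- the per-rank placeholder after partitioning
print(param.ds_shape)  # the real shape DeepSpeed records for the parameter

# gathering temporarily restores the full tensor on each rank
with deepspeed.zero.GatheredParameters([param]):
    print(param.shape)  # full shape while inside the context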

Thank you for trying the HfDeepSpeedConfig setup.

Does the problem go away if you don't use param groups, i.e. just pass the parameters to the optimizer normally?

skpig (Contributor, Author) commented Oct 7, 2021

No, it doesn't. The traceback is the same as before with model_parameters=model.parameters().

stas00 (Collaborator) commented Oct 7, 2021

OK, I was able to reproduce the issue with the help of your script after adding some missing bits, and I will have a look now. Will keep you posted.

The full code to reproduce the issue is below (the config is from #1394 (comment)):

# test.py
from transformers import BartForConditionalGeneration
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed

ds_config = "ds_config.json"
mname = "sshleifer/bart-tiny-random"

dschf = HfDeepSpeedConfig(ds_config)
model = BartForConditionalGeneration.from_pretrained(mname)
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [{
    'params':
        [p for n, p in param_optimizer],
    'weight_decay':
        0.01
}]

model_engine, optimizer, _, scheduler = deepspeed.initialize(args=None, model=model, config_params=ds_config,
                                                             model_parameters=optimizer_grouped_parameters)
model_engine.save_checkpoint("Model/tmp", tag="ckpt")
path, state = model_engine.load_checkpoint("Model/tmp", tag="ckpt")

and run with just:

deepspeed test.py

I have only 2 GPUs, and I'm using a tiny model to speed up debugging.

stas00 (Collaborator) commented Oct 7, 2021

OK, I figured it out.

It appears that the DeepSpeed engine wasn't designed to do save and load on the same engine instance. It was designed to save repeatedly in process A and then load in a new process when it restarts.

So load_checkpoint tries to load into a model that has already been partitioned/used, and it fails: the saved model is correct, with normal shapes, but during load the weights are copied into the placeholder ds params, which are of torch.Size([1]).

Here is a working workaround:

from transformers import BartForConditionalGeneration
import deepspeed

ds_config = "ds_config.json"
mname = "sshleifer/bart-tiny-random"

def my_init():
    model = BartForConditionalGeneration.from_pretrained(mname)
    param_optimizer = list(model.named_parameters())
    optimizer_grouped_parameters = [{
        'params':
            [p for n, p in param_optimizer],
        'weight_decay':
            0.01
    }]
    model_engine, optimizer, _, scheduler = deepspeed.initialize(args=None, model=model, config_params=ds_config,
                                                                 model_parameters=optimizer_grouped_parameters)
    return model_engine


model_engine = my_init()
model_engine.save_checkpoint("Model/tmp")

model_engine = my_init()
path, state = model_engine.load_checkpoint("Model/tmp")

The culprit is that model_engine.module is partitioned under ZeRO-3, and load_checkpoint wants a pristine model.

In theory this should have worked as a workaround:

with deepspeed.zero.GatheredParameters(list(model_engine.parameters(recurse=True))):
    print(model_engine.module.model.shared.weight.shape)
    print(model_engine.module.model.decoder.layers[1].fc1.bias.shape)
    path, state = model_engine.load_checkpoint("Model/tmp")
# prints torch.Size([50265, 24])
# prints torch.Size([16])

but it doesn't. You can see that the model's weights are gathered correctly, yet model_engine still sees the old partitioned model.

I suspect it is not the partitioning that is the issue but something else: first, because this second workaround didn't work, and second, because under zero.Init my_init already returns a partitioned model. So some object inside DeepSpeed needs to be reset for load_checkpoint to work once the model has been "used".

I will let @tjruwase comment on why this is so as I haven't designed this engine.

(Sidenote: since HF integrated DeepSpeed, the DeepSpeed team has been repeatedly surprised that their awesome tool gets used in dozens of ways they hadn't originally envisioned. So this expanding adoption is a blessing and a curse at the same time.)

tjruwase (Contributor) commented Oct 8, 2021

@stas00, thanks for resolving this. Your analysis is correct: the engine is currently not re-startable mid-flight, as we did not envision this use. We will put this on our TODO.

@samyam, @jeffra FYI

skpig (Contributor, Author) commented Oct 17, 2021

Sorry for the late response. I guess it is not an important feature, but maybe the documentation should highlight this issue and indicate the right way to use engine.load_checkpoint, since it is a little unclear for beginners like me. BTW, is the right way to load a checkpoint, then, to reinitialize another engine and use that for loading? @tjruwase @stas00

stas00 (Collaborator) commented Oct 17, 2021

I totally agree, @skpig; it indeed should be documented.

If you'd like, you could make a PR adding a note explaining this limitation somewhere in the docstring here:
https://github.com/microsoft/DeepSpeed/blob/1fc74cb9c81668b5ff0046446f8004d4cf8dc2d5/deepspeed/runtime/engine.py#L1997
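
A sketch of the kind of note that could go there (hypothetical wording, not the actual docstring; the signature is abbreviated from the engine code):

def load_checkpoint(self, load_dir, tag=None, load_module_strict=True,
                    load_optimizer_states=True, load_lr_scheduler_states=True,
                    load_module_only=False):
    """Load a training checkpoint.

    Note (ZeRO-3): load_checkpoint() expects a freshly initialized engine.
    Calling it on an engine whose module has already been partitioned and
    used in the same process (e.g. right after save_checkpoint()) fails
    with size-mismatch errors, because the live parameters are per-rank
    placeholders of torch.Size([1]). Re-create the engine with
    deepspeed.initialize() before loading.
    """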
