
Saved weights differ from the original model #30543

Open · 2 of 4 tasks
bezir opened this issue Apr 29, 2024 · 5 comments

bezir commented Apr 29, 2024

System Info

transformers 4.40.1
peft 0.10.0

Who can help?

@sanchit-gandhi @Rocketknight1 @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have fine-tuned a GPT-2 model using SFTTrainer and also extended the vocabulary. I merge the base model and the trained adapters with the code below.

from transformers import AutoTokenizer, GPT2LMHeadModel
from peft import PeftModel

peft_model_path = "checkpoint"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path)

base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")

# Resize the embeddings to match the extended vocabulary
base_model.resize_token_embeddings(len(tokenizer))

# Attach the trained adapters and merge them into the base weights
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="auto")
merged_model = model.merge_and_unload()

I test this model and the results are fine. Then I save merged_model using the code below.

tokenizer.save_pretrained(save_path)
merged_model.save_pretrained(save_path) 

Lastly, I load the saved model with the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path)

The model I load from save_path does not work well: it repeats the same token or produces random tokens from the base vocabulary.

Model before Save

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Loaded Model:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Now let's look at the weights.

Model Before Save

OrderedDict([('transformer.wte.weight',
              tensor([[-0.2500,  0.2324,  0.0162,  ...,  0.0013, -0.4432,  0.2431],
                      [ 0.3332, -0.1894, -0.2949,  ...,  0.2883, -0.0411,  0.3148],
                      [-0.2403, -0.1975,  0.4091,  ..., -0.3482,  0.5244,  0.1759],
                      ...,
                      [ 0.3374, -0.1371, -0.2627,  ..., -0.6586, -0.5067, -0.0226],
                      [-0.1031,  0.1453, -0.9022,  ..., -0.3682,  0.4504,  0.3242],
                      [-0.5442, -0.6574, -0.0881,  ..., -0.2370, -0.3048,  0.7317]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:1')), ...]

Loaded Model


OrderedDict([('transformer.wte.weight',
              tensor([[-0.0952, -0.0785,  0.0155,  ..., -0.1458, -0.0334,  0.0052],
                      [ 0.0234, -0.0811,  0.0049,  ...,  0.1172, -0.0847,  0.0343],
                      [-0.1060,  0.0711,  0.1621,  ..., -0.0243, -0.1103, -0.0732],
                      ...,
                      [-0.0358,  0.0035,  0.0351,  ..., -0.0633, -0.0200, -0.0084],
                      [-0.1555,  0.0488,  0.0125,  ..., -0.0582,  0.0440, -0.1661],
                      [-0.0095,  0.1273, -0.0158,  ...,  0.0115, -0.1641, -0.0303]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:0')), ...]

I only call two functions, save_pretrained and then from_pretrained, so why are the weights different? When I manually restore the weights after loading, the model works fine again. But if I then save that model, the same problem reappears: the saved model differs from the loaded model.
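
For reference, a minimal check along these lines (a sketch reusing the merged_model and save_path from the snippets above) makes the mismatch easy to see:

import torch
from transformers import AutoModelForCausalLM

# Reload the checkpoint that was just written and compare it
# parameter by parameter against the in-memory merged model.
reloaded = AutoModelForCausalLM.from_pretrained(save_path)

before = merged_model.state_dict()
after = reloaded.state_dict()

for name in before:
    if not torch.allclose(before[name].cpu().float(), after[name].cpu().float()):
        print("mismatch:", name)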

Expected behavior

The weights of the saved model should be identical to the weights of the model in memory before saving.


bezir commented Apr 29, 2024

Another way to look at it is this:

import torch
from transformers import GPT2LMHeadModel

# Raw state dict written with torch.save vs. the from_pretrained path
torch.save(merged_model.state_dict(), "save_path/pytorch_model.bin")
state_dict = torch.load("save_path/pytorch_model.bin")

model = GPT2LMHeadModel.from_pretrained("save_path")

I get two different weight tensors from these two loading methods.

amyeroberts (Collaborator) commented

cc @younesbelkada @pacman100 re PEFT

furkantrky commented

I have the same issue here: the model that is merged and unloaded before saving has different lm_head ("wte") weights than the model loaded from the saved merged checkpoint. I am using the save_pretrained() method for saving, and as far as I understand from my research, GPT-2's lm_head has tied weights.
So could there be a problem with the save_pretrained() method, such as not saving tied weights? If so, is there a parameter or option I can pass so that they are saved as well?
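
A quick way to check whether the two parameters actually share storage (a minimal sketch, assuming a merged_model like the one in the original snippet) is to compare their data pointers:

# If the pointers match, lm_head.weight and wte.weight are one tied tensor;
# if they differ, the weights have been untied somewhere along the way.
wte = merged_model.transformer.wte.weight
lm_head = merged_model.lm_head.weight
print("tied in memory:", wte.data_ptr() == lm_head.data_ptr())
print("tied per config:", merged_model.config.tie_word_embeddings)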

bezir closed this as completed May 2, 2024

bezir commented May 2, 2024

tie_word_embeddings=False solved it.
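
For anyone hitting the same thing, here is a minimal sketch of one place that flag can go, at load time (assuming the save_path from the snippets above; the thread does not say exactly where bezir set it):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Overriding tie_word_embeddings when loading updates the model config,
# so from_pretrained does not re-tie lm_head.weight to the wte embedding.
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path, tie_word_embeddings=False)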

amyeroberts (Collaborator) commented

@bezir Thanks for sharing a solution! I'm going to re-open, as it shouldn't be necessary to pass in this argument, but I'm glad there's a workaround that works.

amyeroberts reopened this May 2, 2024