
Saved weights differ from the original model #30543

Open · 2 of 4 tasks
bezir opened this issue Apr 29, 2024 · 5 comments

bezir commented Apr 29, 2024

System Info

transformers 4.40.1
peft 0.10.0

Who can help?

@sanchit-gandhi @Rocketknight1 @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have fine-tuned a GPT-2 model using SFTTrainer and also extended the vocabulary. I merge the base model and the trained adapters with the code below.

from transformers import AutoTokenizer, GPT2LMHeadModel
from peft import PeftModel

peft_model_path = "checkpoint"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path)

base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")

# Resize the embeddings to match the extended vocabulary
base_model.resize_token_embeddings(len(tokenizer))

# Attach the trained adapters and merge them into the base weights
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="auto")
merged_model = model.merge_and_unload()

I test this model and the results are fine. Then I save merged_model using the code below.

tokenizer.save_pretrained(save_path)
merged_model.save_pretrained(save_path) 

Lastly, I load the saved model with the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path)

The model I load from save_path does not work well: it repeats the same token or produces random tokens from the base vocabulary.

Model before Save

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Loaded Model:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Now let's look at the weights.

Model Before Save

OrderedDict([('transformer.wte.weight',
              tensor([[-0.2500,  0.2324,  0.0162,  ...,  0.0013, -0.4432,  0.2431],
                      [ 0.3332, -0.1894, -0.2949,  ...,  0.2883, -0.0411,  0.3148],
                      [-0.2403, -0.1975,  0.4091,  ..., -0.3482,  0.5244,  0.1759],
                      ...,
                      [ 0.3374, -0.1371, -0.2627,  ..., -0.6586, -0.5067, -0.0226],
                      [-0.1031,  0.1453, -0.9022,  ..., -0.3682,  0.4504,  0.3242],
                      [-0.5442, -0.6574, -0.0881,  ..., -0.2370, -0.3048,  0.7317]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:1')), ...]

Loaded Model


OrderedDict([('transformer.wte.weight',
              tensor([[-0.0952, -0.0785,  0.0155,  ..., -0.1458, -0.0334,  0.0052],
                      [ 0.0234, -0.0811,  0.0049,  ...,  0.1172, -0.0847,  0.0343],
                      [-0.1060,  0.0711,  0.1621,  ..., -0.0243, -0.1103, -0.0732],
                      ...,
                      [-0.0358,  0.0035,  0.0351,  ..., -0.0633, -0.0200, -0.0084],
                      [-0.1555,  0.0488,  0.0125,  ..., -0.0582,  0.0440, -0.1661],
                      [-0.0095,  0.1273, -0.0158,  ...,  0.0115, -0.1641, -0.0303]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:0')), ...]

I only call two functions, save_pretrained and then from_pretrained, so why are the weights different? When I manually restore the weights after loading, the model works fine again. But if I then save that model, the same problem reappears: the saved model differs from the loaded model.
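
For reference, a minimal check along these lines (a sketch reusing the merged_model and save_path from the snippets above) makes the mismatch easy to see:

import torch
from transformers import AutoModelForCausalLM

# Reload the checkpoint that was just written and compare it
# parameter by parameter against the in-memory merged model.
reloaded = AutoModelForCausalLM.from_pretrained(save_path)

before = merged_model.state_dict()
after = reloaded.state_dict()

for name in before:
    if not torch.allclose(before[name].cpu().float(), after[name].cpu().float()):
        print("mismatch:", name)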

Expected behavior

The weights of the saved model should be identical to the weights of the model in memory before saving.


bezir commented Apr 29, 2024

Another way to look at it is this:

import torch
from transformers import GPT2LMHeadModel

# Raw state dict written with torch.save vs. the from_pretrained path
torch.save(merged_model.state_dict(), "save_path/pytorch_model.bin")
state_dict = torch.load("save_path/pytorch_model.bin")

model = GPT2LMHeadModel.from_pretrained("save_path")

I get two different weight tensors from these two loading methods.

amyeroberts (Collaborator) commented

cc @younesbelkada @pacman100 re PEFT

furkantrky commented

I have the same issue here: the model that is merged and unloaded before saving has different lm_head ("wte") weights than the model loaded from the saved merged checkpoint. I am using the save_pretrained() method for saving, and as far as I understand from my research, GPT-2's lm_head has tied weights.
So could there be a problem with the save_pretrained() method, such as not saving tied weights? If so, is there a parameter or option I can pass so that they are saved as well?
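
A quick way to check whether the two parameters actually share storage (a minimal sketch, assuming a merged_model like the one in the original snippet) is to compare their data pointers:

# If the pointers match, lm_head.weight and wte.weight are one tied tensor;
# if they differ, the weights have been untied somewhere along the way.
wte = merged_model.transformer.wte.weight
lm_head = merged_model.lm_head.weight
print("tied in memory:", wte.data_ptr() == lm_head.data_ptr())
print("tied per config:", merged_model.config.tie_word_embeddings)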

bezir closed this as completed May 2, 2024

bezir commented May 2, 2024

tie_word_embeddings=False solved it.
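
For anyone hitting the same thing, here is a minimal sketch of one place that flag can go, at load time (assuming the save_path from the snippets above; the thread does not say exactly where bezir set it):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Overriding tie_word_embeddings when loading updates the model config,
# so from_pretrained does not re-tie lm_head.weight to the wte embedding.
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path, tie_word_embeddings=False)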

amyeroberts (Collaborator) commented

@bezir Thanks for sharing a solution! I'm going to re-open, as it shouldn't be necessary to pass in this argument, but I'm glad there's a workaround that works.

amyeroberts reopened this May 2, 2024