
size mismatch for weights and bias #3

Open
1e0ndavid opened this issue Apr 29, 2021 · 5 comments
Labels
help wanted (Extra attention is needed)

Comments

@1e0ndavid

1e0ndavid commented Apr 29, 2021

Hi there. After training the model, I ran "python serve.py" to test whether the model is usable; before that I changed run_uuid to the one for my model and checkpoint. Any idea why it raises the error "RuntimeError: Error(s) in loading state_dict for TransformerXLModel:"? Thanks.


(autocomplete) daijianbo@ubuntu18:~/python_autocomplete-master-old/python_autocomplete$ python serve.py

LABML WARNING
Not a valid git repository: /home/daijianbo/python_autocomplete-master-old

Prepare model...
Prepare n_tokens...
Prepare tokenizer...[DONE] 1.27ms
Prepare n_tokens...[DONE] 2.10ms
Prepare transformer...[DONE] 1.33ms
Prepare ffn...[DONE] 0.30ms
Prepare device...
Prepare device_info...[DONE] 23.29ms
Prepare device...[DONE] 23.51ms
Prepare model...[DONE] 107.18ms
Selected experiment = source_code run = b32da5eea23711eb982bccbbfe110075 checkpoint = 1744896
Loading checkpoint...[FAIL] 840.09ms
Traceback (most recent call last):
File "serve.py", line 18, in
predictor = get_predictor()
File "/home/daijianbo/python_autocomplete-master-old/python_autocomplete/evaluate/factory.py", line 39, in get_predictor conf = load_experiment()
File "/home/daijianbo/python_autocomplete-master-old/python_autocomplete/evaluate/factory.py", line 33, in load_experiment
experiment.start()
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/experiment.py", line 256, in start
return _experiment_singleton().start(run_uuid=_load_run_uuid, checkpoint=_load_checkpoint)
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/init.py", line 407, in start
global_step = self.__start_from_checkpoint(run_uuid, checkpoint)
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/init.py", line 312, in __start_from_check point
self._load_checkpoint(checkpoint_path)
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/init.py", line 280, in _load_checkpoint
self.checkpoint_saver.load(checkpoint_path)
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/init.py", line 118, in load
saver.load(checkpoint_path, info[name])
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/pytorch.py", line 66, in load self.model.load_state_dict(state)
File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(RuntimeError: Error(s) in loading state_dict for TransformerXLModel:
size mismatch for src_embed.weight: copying a param with shape torch.Size([1096, 512]) from checkpoint, the shape in current model is torch.Size([1097, 512]).
size mismatch for generator.weight: copying a param with shape torch.Size([1096, 512]) from checkpoint, the shape in current model is torch.Size([1097, 512]).
size mismatch for generator.bias: copying a param with shape torch.Size([1096]) from checkpoint, the shape in current model is torch.Size([1097]).

@vpj
Member

vpj commented Apr 30, 2021

Looks like the number of tokens is different from the number of tokens when it was trained. Did you change the dataset or run BPE again?
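For reference, a quick way to see which side changed is to compare parameter shapes between the saved state_dict and the freshly built model before calling load_state_dict. This is only a minimal sketch (not code from this repo); how you obtain the raw state_dict from the checkpoint on disk is left to your setup, and the names below are placeholders:

    import torch

    def report_shape_mismatches(model: torch.nn.Module, state_dict: dict) -> None:
        """Print every parameter whose shape differs between checkpoint and current model."""
        current = model.state_dict()
        for name, saved in state_dict.items():
            if name in current and tuple(current[name].shape) != tuple(saved.shape):
                print(f"{name}: checkpoint {tuple(saved.shape)} vs model {tuple(current[name].shape)}")

    # Hypothetical usage, assuming `model` and `state_dict` are already loaded:
    # report_shape_mismatches(model, state_dict)

If only src_embed and generator show up, the vocabulary size changed between training and serving.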

@1e0ndavid
Author

Looks like the number of tokens is different from the number of tokens when it was trained. Did you change the dataset or run BPE again?

No, I don't think I did. The weird thing is that my friend ran into this problem too, and the gap between the two dimensions was bigger than mine: he got [1084, 512] and [1092, 512] respectively. One way we got around it is to train again and pick another checkpoint; sometimes that works. I'm not sure what goes wrong here, maybe something in the "segment-level recurrence" part? I have no idea since I haven't reviewed the code carefully.

@vpj
Member

vpj commented Apr 30, 2021

This sounds like a bug. The dimensions of the embedding weights are the number of tokens and the number of embedding features (d_model).
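As a self-contained illustration (not the project's code) of why that matters: a vocabulary that grows by even one token between training and serving makes load_state_dict fail exactly as in the traceback above. The 512 and 1096/1097 match the shapes in the report; everything else is made up for the example:

    import torch
    import torch.nn as nn

    d_model = 512

    class TinyLM(nn.Module):
        # Stand-in for TransformerXLModel: just the two layers whose shapes depend on n_tokens.
        def __init__(self, n_tokens: int):
            super().__init__()
            self.src_embed = nn.Embedding(n_tokens, d_model)  # weight: (n_tokens, d_model)
            self.generator = nn.Linear(d_model, n_tokens)     # weight: (n_tokens, d_model), bias: (n_tokens,)

    trained = TinyLM(1096)   # vocabulary size at training time
    serving = TinyLM(1097)   # vocabulary rebuilt with one extra token at serving time

    try:
        serving.load_state_dict(trained.state_dict())
    except RuntimeError as e:
        print(e)  # size mismatch for src_embed.weight, generator.weight and generator.bias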

@vpj
Member

vpj commented Apr 30, 2021

I will give it a try and see if I can reproduce. Are you running the latest master? Did you make changes? Also is the dataset the same?

@1e0ndavid
Author

I will give it a try and see if I can reproduce. Are you running the latest master? Did you make changes? Also is the dataset the same?

Ok, give it a try and see what happens, lol. Yep, I downloaded the code several days ago, so I suppose I'm running the latest master. I haven't made any changes other than commenting out the data-downloading part; I used my own data and copied it in from another folder directly. At first I also suspected I had edited some key code, but I don't think so after comparing with the original version. I tried a few more times from the very beginning, from downloading the code to getting it running, and the same problem still exists. Btw, my friend also hit this, so maybe there is something wrong in the model?

And yeah, I always keep the dataset the same.

vpj added the help wanted (Extra attention is needed) label on May 3, 2021