Unable to load wt103 checkpoint, size mismatch #66

Closed
violet-zct opened this issue Sep 4, 2022 · 6 comments

Comments

@violet-zct

Hi Albert,

I tried to load your recently uploaded wikitext-103 checkpoint, but encountered the following error:

RuntimeError: Error(s) in loading state_dict for SequenceLightningModule:
size mismatch for encoder.0.emb_layers.3.weight: copying a param with shape torch.Size([67738, 16]) from checkpoint, the shape in current model is torch.Size([67737, 16]).
size mismatch for loss.out_layers_biases.3: copying a param with shape torch.Size([67738]) from checkpoint, the shape in current model is torch.Size([67737]).
size mismatch for loss_val.out_layers_biases.3: copying a param with shape torch.Size([67738]) from checkpoint, the shape in current model is torch.Size([67737]).

Do you know why this is? I used the wt103 data downloaded with the script from the transformer-xl repo: https://github.com/kimiyoung/transformer-xl/blob/master/getdata.sh.

Thanks!

@albertfgu
Contributor

Can you be more specific about the commands you ran? I followed the instructions in the README with

python -m generate experiment=lm/s4-wt103 checkpoint_path=checkpoints/s4-wt103.pt n_samples=1 l_sample=16384 l_prefix=8192 decode=text

and this worked fine.

@violet-zct
Author

Thanks for your reply! I want to evaluate the checkpoint's perplexity rather than use it for generation, so I run the train script but skip training and only do evaluation, with ckpt_path changed to the path of the checkpoint:
python -m train experiment=lm/s4-wt103 trainer.gpus=1

trainer.validate(model, ckpt_path='/home/projects/chunting/state-spaces/s4_wt103.ckpt')

@albertfgu
Contributor

I'm not running into the same error. My evaluation script loads the model with

model = SequenceLightningModule.load_from_checkpoint(ckpt_path, config=config)  # restore weights and hyperparameters from the .ckpt
trainer = create_trainer(config)  # build the PyTorch Lightning Trainer from the config
trainer.test(model)  # run the test loop

If this doesn't work, maybe there's something wrong with the dataset?

The final test ppl was 20.95 (updated in the arXiv paper). Looking at the logs, the final val loss is

Epoch 444, global step 701319: val/loss reached 2.97975

which is 19.68 ppl
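
(For reference, perplexity here is just the exponential of the per-token cross-entropy loss in nats, so the conversion can be checked with a couple of lines of Python:)

import math

val_loss = 2.97975           # reported val/loss, nats per token
print(math.exp(val_loss))    # ≈ 19.68, i.e. the validation perplexity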

@violet-zct
Author

Oh, that might be the reason.
I am using the wikitext-103 corpus from Transformer-XL with the following script: https://github.com/kimiyoung/transformer-xl/blob/master/getdata.sh

@albertfgu
Contributor

That's the right version. But the dataset loader caches the processed vocab, and it's possible the logic changed and you're using an outdated cache (your error message looks like an off-by-one in the vocab size: 67737 vs. 67738). It could be worth removing the cache folders inside data/wt103 and trying again.
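
If it helps, here is a rough cleanup sketch; the assumption is that the raw wiki.*.tokens splits from getdata.sh should be kept and everything else under data/wt103 is cached/processed data that the loader will rebuild:

import shutil
from pathlib import Path

data_dir = Path("data/wt103")   # dataset directory used by the wt103 experiment
for entry in data_dir.iterdir():
    # keep the raw WikiText-103 splits downloaded by getdata.sh
    if entry.name.startswith("wiki.") and entry.suffix == ".tokens":
        continue
    # everything else is assumed to be a cache (vocab, processed data) and is removed
    print(f"removing {entry}")
    if entry.is_dir():
        shutil.rmtree(entry)
    else:
        entry.unlink()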

But if all you want is the ppl numbers, those have already been reported.

@violet-zct
Author

Thanks, Albert! I'd like to run the evaluation myself and test the speed. I'll close this issue now.
