Unable to load wt103 checkpoint, size mismatch #66

Closed
violet-zct opened this issue Sep 4, 2022 · 6 comments

Comments

@violet-zct

Hi Albert,

I tried to load your recently uploaded wikitext-103 checkpoint, but encountered the following error:

RuntimeError: Error(s) in loading state_dict for SequenceLightningModule:
size mismatch for encoder.0.emb_layers.3.weight: copying a param with shape torch.Size([67738, 16]) from checkpoint, the shape in current model is torch.Size([67737, 16]).
size mismatch for loss.out_layers_biases.3: copying a param with shape torch.Size([67738]) from checkpoint, the shape in current model is torch.Size([67737]).
size mismatch for loss_val.out_layers_biases.3: copying a param with shape torch.Size([67738]) from checkpoint, the shape in current model is torch.Size([67737]).

Do you know why this is? I used the wt103 data downloaded with the script from the transformer-xl repo: https://github.com/kimiyoung/transformer-xl/blob/master/getdata.sh.

Thanks!

@albertfgu
Contributor

Can you be more specific about the commands you ran? I followed the instructions in the README with

python -m generate experiment=lm/s4-wt103 checkpoint_path=checkpoints/s4-wt103.pt n_samples=1 l_sample=16384 l_prefix=8192 decode=text

and this worked fine.

@violet-zct
Author

Thanks for your reply! I want to evaluate the checkpoint's perplexity rather than use it for generation, so I run the train script but skip training and only do evaluation, with ckpt_path changed to the path of the checkpoint:
python -m train experiment=lm/s4-wt103 trainer.gpus=1

trainer.validate(model, ckpt_path='/home/projects/chunting/state-spaces/s4_wt103.ckpt')

@albertfgu
Contributor

I'm not running into the same error. My evaluation script loads the model with

model = SequenceLightningModule.load_from_checkpoint(ckpt_path, config=config)  # restore weights and hyperparameters from the .ckpt
trainer = create_trainer(config)  # build the PyTorch Lightning Trainer from the config
trainer.test(model)  # run the test loop

If this doesn't work, maybe there's something wrong with the dataset?

The final test ppl was 20.95 (updated in the arXiv paper). Looking at the logs, the final val loss is

Epoch 444, global step 701319: val/loss reached 2.97975

which is 19.68 ppl
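
(For reference, perplexity here is just the exponential of the per-token cross-entropy loss in nats, so the conversion can be checked with a couple of lines of Python:)

import math

val_loss = 2.97975           # reported val/loss, nats per token
print(math.exp(val_loss))    # ≈ 19.68, i.e. the validation perplexity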

@violet-zct
Author

Oh, that might be the reason.
I am using the wikitext-103 corpus from Transformer-XL with the following script: https://github.com/kimiyoung/transformer-xl/blob/master/getdata.sh

@albertfgu
Contributor

That's the right version. But the dataset loader caches the processed vocab, and it's possible the logic changed and you're using an outdated cache (your error message looks like an off-by-one in the vocab size: 67737 vs. 67738). It could be worth removing the cache folders inside data/wt103 and trying again.
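
If it helps, here is a rough cleanup sketch; the assumption is that the raw wiki.*.tokens splits from getdata.sh should be kept and everything else under data/wt103 is cached/processed data that the loader will rebuild:

import shutil
from pathlib import Path

data_dir = Path("data/wt103")   # dataset directory used by the wt103 experiment
for entry in data_dir.iterdir():
    # keep the raw WikiText-103 splits downloaded by getdata.sh
    if entry.name.startswith("wiki.") and entry.suffix == ".tokens":
        continue
    # everything else is assumed to be a cache (vocab, processed data) and is removed
    print(f"removing {entry}")
    if entry.is_dir():
        shutil.rmtree(entry)
    else:
        entry.unlink()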

But if all you want is the ppl numbers, those have already been reported.

@violet-zct
Author

Thanks, Albert! I'd like to run the evaluation myself and test the speed. I'll close this issue now.
