Overflow error #1

Open
tomhosking opened this issue Oct 19, 2021 · 2 comments
@tomhosking

Hi,

During training, I get the following error:

Traceback (most recent call last):
  File "train.py", line 182, in <module>
    generation_save_path=args.generation_save_path)
  File "/disk/nfs/ostrom/s1717552/btmpg/utils/run.py", line 133, in __call__
    self.run()
  File "/disk/nfs/ostrom/s1717552/btmpg/utils/run.py", line 100, in run
    max_length=self.max_length)
  File "/disk/nfs/ostrom/s1717552/btmpg/model/VAE.py", line 206, in round
    out_embed = self.embed(self.GS(sentence[:, -1:, :]))
  File "/disk/nfs/ostrom/s1717552/btmpg/btmpgenv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk/nfs/ostrom/s1717552/btmpg/model/gumbleSoftmax.py", line 17, in forward
    sigma = min(self.tau_max, (self.tau_max ** (self.n / self.N)))
OverflowError: (34, 'Numerical result out of range')

This happens after a few days of training, around epoch 39 for MSCOCO and epoch 77 for Quora.

The command used was:

python train.py --cuda \
                --train_source ./data/qqp_train.src \
                --train_target ./data/qqp_train.tgt \
                --test_source  ./data/qqp_dev.src \
                --test_target  ./data/qqp_dev.tgt \
                --vocab_path ./checkpoints/qqp.vocab \
                --batch_size 8 \
                --epoch 100 \
                --num_rounds 2 \
                --max_length 50 \
                --clip_length 50 \
                --model_save_path ./checkpoints/qqp.model \
                --generation_save_path ./outputs/qqp/
@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

I am trying to reproduce this error and will reply to you soon.

@hahally

hahally commented Aug 11, 2022

Hi, this overflow happens because gumble_softmax is not configured the way the paper describes. In run.py, self.GS = gumble_softmax(3500, 100) sets N = 3500 and tau_max = 100. If you read the code carefully, you will see that n is incremented by 1 at every step (n += 1), so n keeps growing as the number of training steps increases, and eventually self.tau_max ** (self.n / self.N) raises an overflow error.
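
To make this concrete, here is a minimal sketch of the failure and one possible workaround. It is not the repository's code: the function names temperature_original and temperature_clamped are hypothetical, and only the expression from line 17 of gumbleSoftmax.py in the traceback is taken from the source. It assumes tau_max >= 1, so clamping the exponent at 1 gives the same value as the outer min() whenever the original expression does not overflow.

```python
import math
import sys


def temperature_original(n, N=3500, tau_max=100):
    # Same expression as in the traceback (gumbleSoftmax.py, line 17).
    # The power is evaluated before min(), so a large n overflows first.
    return min(tau_max, tau_max ** (n / N))


def temperature_clamped(n, N=3500, tau_max=100):
    # Hypothetical workaround: cap the exponent at 1.0 before taking the power.
    # For tau_max >= 1 this matches the original result wherever it is defined.
    return min(tau_max, tau_max ** min(1.0, n / N))


# Approximate step count at which tau_max ** (n / N) exceeds the largest float.
overflow_ratio = math.log(sys.float_info.max) / math.log(100)
approx_overflow_step = math.ceil(overflow_ratio * 3500)

if __name__ == "__main__":
    print("overflow expected near step", approx_overflow_step)
    print(temperature_clamped(600_000))  # stays finite
    try:
        temperature_original(600_000)
    except OverflowError as err:
        print("original expression fails:", err)
```

With N = 3500 and tau_max = 100 the power leaves float range somewhere past half a million calls, which would fit an error that only shows up after days of training. Whether clamping, or resetting n, matches the schedule intended in the paper is something the maintainer would have to confirm.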
