Dev branch: toy training stops after 2 epochs #174

Closed
me2beats opened this issue Oct 28, 2018 · 2 comments

me2beats commented Oct 28, 2018

I am testing the toy example on Google Colab.

On the master branch with the CPU, everything works fine.

But when I try to use the GPU and run:

TRAIN_PATH='data/toy_reverse/train/data.txt'
DEV_PATH='data/toy_reverse/dev/data.txt'
# Start training
!python examples/sample.py --train_path $TRAIN_PATH --dev_path $DEV_PATH

then I get:

/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
2018-10-28 15:59:34,815 root         INFO     Namespace(dev_path='data/toy_reverse/dev/data.txt', expt_dir='./experiment', load_checkpoint=None, log_level='info', resume=False, train_path='data/toy_reverse/train/data.txt')
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2018-10-28 15:59:37,915 seq2seq.trainer.supervised_trainer INFO     Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
), Scheduler: None
Traceback (most recent call last):
  File "examples/sample.py", line 129, in <module>
    resume=opt.resume)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 186, in train
    teacher_forcing_ratio=teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 103, in _train_epoches
    loss = self._train_batch(input_variables, input_lengths.tolist(), target_variables, model, teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 55, in _train_batch
    teacher_forcing_ratio=teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/models/seq2seq.py", line 48, in forward
    encoder_outputs, encoder_hidden = self.encoder(input_variable, input_lengths)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/models/EncoderRNN.py", line 68, in forward
    embedded = self.embedding(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/sparse.py", line 110, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 1110, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
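
That error looks like a device mismatch: the embedding weights are on the GPU while the input index tensor is still on the CPU. A minimal sketch of the same mismatch, using hypothetical names rather than this library's actual code:

import torch
import torch.nn as nn

device = torch.device("cuda")
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4).to(device)  # weights on the GPU
input_var = torch.tensor([[1, 2, 3]], dtype=torch.long)                  # index tensor still on the CPU

# embedding(input_var) raises the same RuntimeError as above
out = embedding(input_var.to(device))  # moving the indices to the model's device avoids it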


Then I tested the dev branch. That error is gone, but training stops after 2 epochs, and as a result I get wrong output sequences. Is this expected?

That is, I run:

!scripts/toy.sh

TRAIN_SOURCE='data/toy_reverse/train/src.txt'
TRAIN_TARGET='data/toy_reverse/train/tgt.txt'
DEV_SOURCE='data/toy_reverse/dev/src.txt'
DEV_TARGET='data/toy_reverse/dev/tgt.txt'

# Start training
!python examples/sample.py $TRAIN_SOURCE $TRAIN_TARGET $DEV_SOURCE $DEV_TARGET

And I get the following output:

2018-10-28 19:41:05,614:root:INFO: train_source: data/toy_reverse/train/src.txt
2018-10-28 19:41:05,615:root:INFO: train_target: data/toy_reverse/train/tgt.txt
2018-10-28 19:41:05,615:root:INFO: dev_source: data/toy_reverse/dev/src.txt
2018-10-28 19:41:05,615:root:INFO: dev_target: data/toy_reverse/dev/tgt.txt
2018-10-28 19:41:05,615:root:INFO: experiment_directory: ./experiment
2018-10-28 19:41:09,111:seq2seq.trainer.supervised_trainer:INFO: Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    initial_lr: 0.001
    lr: 0.001
    weight_decay: 0
), Scheduler: <torch.optim.lr_scheduler.StepLR object at 0x7f83459c6ef0>
Train Perplexity: 22.6210: 100% 20/20 [00:01<00:00, 15.11it/s]
2018-10-28 19:41:10,528:seq2seq.trainer.supervised_trainer:INFO: Finished epoch 1: Train Perplexity: 11.3105, Dev Perplexity: 14.0211, Accuracy: 0.2420
Train Perplexity: 8.4440: 100% 20/20 [00:01<00:00, 13.06it/s]
2018-10-28 19:41:12,152:seq2seq.trainer.supervised_trainer:INFO: Finished epoch 2: Train Perplexity: 9.3128, Dev Perplexity: 7.3888, Accuracy: 0.3452
2018-10-28 19:41:12,153:root:INFO: Training time: 3.00s
Type in a source sequence: 1 2 3 4 5
['5', '5', '1', '1', '<eos>']
Type in a source sequence: 5 4 3 2 1
['1', '1', '1', '1', '<eos>']
Type in a source sequence: 

Diego999 commented Nov 2, 2018

For the dev branch, you should change the parameters ;-) I ran into the same problem a couple of months ago. The batch size and number of epochs in master were 32 and 6; here they are 512 and 2. If you change them, it will work.


me2beats commented Nov 2, 2018

Working! Thanks.
In sample.py:
batch_size=512 ---> batch_size=32
n_epochs=2 ---> n_epochs=6
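
For reference, a sketch of that edit in examples/sample.py, showing only the two changed values (the surrounding code is omitted and the exact context in the dev branch may differ); this brings the dev-branch settings back in line with master's 32 / 6:

batch_size = 32   # was 512
n_epochs = 6      # was 2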

me2beats closed this as completed Nov 2, 2018