
Hyperparameters to reproduce paper results #32

Open
xianxl opened this issue Jul 11, 2019 · 11 comments

Comments

@xianxl

xianxl commented Jul 11, 2019

Could you share the full command (and default hyperparameters) you used? For example, I found --word_mask_keep_rand is "0.8,0.1,0.1" in MASS but "0,0,1" in MASS-fairseq.
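For context, the word_mask_keep_rand triple follows the BERT-style convention its name suggests: of the tokens selected for prediction, one fraction is replaced by the mask token, one is kept unchanged, and one is replaced by a random token. Below is a minimal sketch of that convention only, not the repository's actual code; mask_index and vocab_size are placeholder values.

import numpy as np

def apply_mask_keep_rand(tokens, positions, probs=(0.8, 0.1, 0.1),
                         mask_index=0, vocab_size=30000):
    # probs = (p_mask, p_keep, p_rand) for each selected position
    out = list(tokens)
    choices = np.random.choice(3, size=len(positions), p=list(probs))
    for pos, c in zip(positions, choices):
        if c == 0:
            out[pos] = mask_index                      # replace with the mask token
        elif c == 2:
            out[pos] = np.random.randint(vocab_size)   # replace with a random token
        # c == 1: keep the original token unchanged
    return out

Under this reading, "0.8,0.1,0.1" masks 80% of the selected tokens, while "0,0,1" would replace every selected token with a random one; whether MASS-fairseq interprets its default the same way is exactly the question here.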

@AndyELiu

I also had a hard time reproducing the unsupervised MT result. I ran exactly as suggested in the README on en-fr. At epoch 4, my "valid_fr-en_mt_bleu" is only a little above 1, but you reported "valid_fr-en_mt_bleu -> 10.55". I ran it a few times.

@StillKeepTry
Contributor

@xianxl Our fairseq-based implementation contains some methods that differ from the paper, so there are some slightly different experimental settings in fairseq.

@xianxl
Author

xianxl commented Jul 12, 2019

@StillKeepTry thanks for your reply. So what value do you recommend for word_mask_keep_rand in the MASS-fairseq implementation? The default is "0,0,1" (which means no masking?), and this argument is not set in the training command you shared.

@caoyuan

caoyuan commented Aug 20, 2019

I followed exactly the same settings as shown on the git page, running on a machine with 8 V100 GPUs as the paper describes:

[screenshot: training command and settings]

The git page claims that after 4 epochs, even without back translation the unsupervised BLEU should be close to the following numbers:

epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu -> 7.81
test_fr-en_mt_bleu -> 11.72
test_en-fr_mt_bleu -> 8.80

However, this is not what I got; my numbers are much worse at epoch 4:

[screenshot: BLEU scores at epoch 4]

Could you please let us know if any parameter is wrong, or whether there is some hidden recipe we're not aware of for reproducing the results?

On the other hand, I also loaded your pre-trained en-fr model, and the results are much better. So alternatively, could you share the settings you used to train the pre-trained model?
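One way to partially answer that last question yourself, assuming the released checkpoint follows the XLM-style format in which the training arguments are saved alongside the weights (an assumption, not confirmed in this thread; the filename below is a placeholder), is to load it and print whatever metadata it contains:

import torch

ckpt = torch.load("mass_enfr_1024.pth", map_location="cpu")  # placeholder filename
print(list(ckpt.keys()))          # see what the checkpoint actually stores

params = ckpt.get("params")       # XLM-style checkpoints often keep training args here
if params is not None:
    # params may be a dict or a namespace depending on the code version
    items = params.items() if isinstance(params, dict) else vars(params).items()
    for k, v in sorted(items):
        print(k, "=", v)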

@caoyuan

caoyuan commented Aug 21, 2019

After some investigation, it seems that the suggested epoch size (200000) is really small and not the one used to produce the paper results. Could you confirm this hypothesis?

@Bachstelze

Can we conclude that the results are not reproducible?

@StillKeepTry
Contributor

First, we used 50M monolingual data for each language during training; this may be one reason (pre-training usually needs more data). Second, epoch_size=200000 in this code means training on 200,000 sentences per epoch. In other words, nearly 62 epochs (size=200,000) is equivalent to one pass over the 50M data.
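As a back-of-the-envelope check of that bookkeeping, here is a plain Python sketch; whether epoch_size is counted per GPU process and whether both languages are counted in one pass are assumptions, since the thread does not pin this down.

epoch_size = 200_000            # sentences per "epoch" as defined by the training code
mono_per_lang = 50_000_000      # 50M monolingual sentences per language

# Reading 1: epoch_size is a global, single-language count.
print(mono_per_lang / epoch_size)                        # 250 epochs per full pass

# Reading 2 (assumption): epoch_size is counted per GPU process, and one pass
# covers the monolingual data of both languages.
n_gpus, n_langs = 8, 2
print(n_langs * mono_per_lang / (epoch_size * n_gpus))   # 62.5 epochs per full pass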

To illustrate this, I have uploaded some logs from my previous experiments with epoch_size = 200000 to this link. They reach 8.24/5.45 (at 10 epochs), 11.95/8.21 (at 50 epochs), 13.38/9.34 (at 100 epochs), and 14.26/10.06 (at 146 epochs). And that is only the result at 146 epochs, while we ran over 500 epochs of pre-training in our experiments.

And with the latest code, you can try these hyperparameters for pre-training, which result in better performance:

DATA_PATH=/data/processed/de-en

export NGPU=8; CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
        --exp_name unsupMT_ende                              \
        --dump_path ./models/en-de/                          \
        --exp_id test                                        \
        --data_path $DATA_PATH                               \
        --lgs 'en-de'                                        \
        --mass_steps 'en,de'                                 \
        --encoder_only false                                 \
        --emb_dim 1024                                       \
        --n_layers 6                                         \
        --n_heads 8                                          \
        --dropout 0.1                                        \
        --attention_dropout 0.1                              \
        --gelu_activation true                               \
        --tokens_per_batch 3000                              \
        --batch_size 32                                      \
        --bptt 256                                           \
        --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
        --epoch_size 200000                                  \
        --max_epoch 50                                       \
        --eval_bleu true                                     \
        --word_mass 0.5                                      \
        --min_len 4                                          \
        --save_periodic 10                                   \
        --lambda_span "8"                                    \
        --word_mask_keep_rand '0.8,0.05,0.15'

@k888

k888 commented Jan 28, 2020

@StillKeepTry could you also please provide the log for the BT (back-translation) steps of the en_de model you pretrained?

Logs for pretraining and BT for en_fr would be highly appreciated as well!

@LibertFan

@k888 have you found the hyperparameters for the en_de BT steps?

@Frankszc

Frankszc commented Dec 3, 2020

(Quoting @StillKeepTry's earlier reply and pre-training command in full.)

Hello, thanks for your great work. You said you used 50M monolingual data (50,000,000 sentences) for each language during training, and the epoch_size is 200,000, so why is one pass over the 50M data about 62 epochs? Why not 250? @StillKeepTry
