
Improvements in validation/test ppl of the transformer-xl with MoE on wt103 #75

Closed
latifisalar opened this issue Sep 21, 2021 · 4 comments

Comments

@latifisalar

Hi there,

I have been running some experiments with mixture-of-experts and transformer-xl, and I noticed that while the training ppl improves significantly with the inclusion of MoE, the same improvement is not reflected in the validation/test perplexities.

Here is an example with transformer-xl trained on the wt103 dataset (the numbers correspond to the best model saved after full training):

Baseline transformer-xl with 16 layers:

  • Training ppl: 20.48
  • Validation ppl: 22.835
  • Test ppl: 23.710

MoE transformer-xl with 16 layers and 16 experts:

  • Training ppl: 14.53
  • Validation ppl: 21.859
  • Test ppl: 22.682

While we do get a minor improvement with MoE, it seems to me that the MoE model is running into overfitting issues. I have tried adjusting the dropout rate of the expert layers as suggested by Switch Transformer (see the sketch below), and while that brought the training and evaluation performance closer together, it did not lead to a better test ppl. Have you faced similar issues?
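
For context, this is roughly the form of the expert-dropout adjustment I tried. It is only a minimal PyTorch sketch with hypothetical names (`ExpertFFN` and `expert_dropout` are my own, not the FastMoE API), assuming each expert is a standard two-layer FFN and following the Switch Transformer suggestion of using a higher dropout rate inside the experts than in the rest of the model:

```python
import torch.nn as nn

class ExpertFFN(nn.Module):
    """Hypothetical two-layer expert FFN with a larger internal dropout.

    Switch Transformer suggests a higher dropout inside the experts
    (e.g. 0.4) while keeping a smaller rate (~0.1) at non-expert layers.
    """

    def __init__(self, d_model, d_inner, expert_dropout=0.4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_inner)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(expert_dropout)  # expert-specific dropout rate
        self.fc2 = nn.Linear(d_inner, d_model)

    def forward(self, x):
        return self.fc2(self.drop(self.act(self.fc1(x))))
```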

@laekov
Owner

laekov commented Sep 23, 2021

@xptree @Sengxian any ideas?

@xptree
Collaborator

xptree commented Oct 2, 2021

@latifisalar It is not surprising. In your WT103 experiment, the MoE-Transformer-XL has 16x the parameters compared to the vanilla one, so it is much easier for the MoE model to overfit (given that WT103 is not a very hard dataset to fit).

On the other hand, it means the model capacity of the MoE model is much larger than that of the vanilla Transformer-XL, even with similar FLOPs.
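
To make that concrete, a rough back-of-envelope counting only the FFN/expert weights (the dimensions below are assumed transformer-xl base values, top-1 routing is assumed, and attention and the gating network are ignored; the exact numbers depend on your config):

```python
# Illustrative (assumed) dimensions, not taken from the actual run.
d_model, d_inner, n_experts, top_k = 410, 2100, 16, 1

ffn_params_dense = 2 * d_model * d_inner              # single shared FFN
ffn_params_moe   = n_experts * 2 * d_model * d_inner  # one FFN per expert

flops_per_token_dense = 2 * ffn_params_dense           # ~2 FLOPs per weight
flops_per_token_moe   = top_k * 2 * ffn_params_dense   # only k experts run per token

print(ffn_params_moe / ffn_params_dense)               # 16.0 -> 16x FFN parameters
print(flops_per_token_moe / flops_per_token_dense)     # 1.0  -> similar FLOPs per token
```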

@latifisalar
Author

@xptree That is a valid point; I was expecting the overfitting issue with a dataset as small as wt103. Would you mind telling me which dataset was used to train the GPT model in section 5.4? Are you using the latest Wikipedia dump? Also, I am guessing the loss curves in Figure 7 correspond to the pre-training phase. Did you check the test accuracy at the end of pre-training, or validate the model on any downstream tasks? I just wanted to see whether the overfitting issue goes away on larger datasets and MoE actually improves test accuracy in the end.

@xptree
Collaborator

xptree commented Oct 11, 2021

@latifisalar We report the training loss and ppl in our manuscript, and I agree that a validation loss/ppl curve is necessary here (perhaps we can add it in the next version). The dataset we used for pre-training is wiki.
