
Improvements in validation/test ppl of the transformer-xl with MoE on wt103 #75

Closed
latifisalar opened this issue Sep 21, 2021 · 4 comments

Comments

@latifisalar

Hi there,

I have been running some experiments with mixture-of-experts and transformer-xl, and I noticed that while the training ppl improves significantly with the inclusion of MoE, the same improvement is not reflected in the validation/test perplexities.

Here is an example with transformer-xl trained on the wt103 dataset (the numbers correspond to the best model saved after full training):

Baseline transformer-xl with 16 layers:

  • Training ppl: 20.48
  • Validation ppl: 22.835
  • Test ppl: 23.710

MoE transformer-xl with 16 layers and 16 experts:

  • Training ppl: 14.53
  • Validation ppl: 21.859
  • Test ppl: 22.682

While we do get a minor improvement with MoE, it seems to me that the MoE model is running into overfitting issues. I have tried adjusting the dropout rate of the expert layers as suggested by Switch Transformer (see the sketch below), and while that brought the training and evaluation performance closer together, it did not lead to a better test ppl. Have you faced similar issues?
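
For context, this is roughly the form of the expert-dropout adjustment I tried. It is only a minimal PyTorch sketch with hypothetical names (`ExpertFFN` and `expert_dropout` are my own, not the FastMoE API), assuming each expert is a standard two-layer FFN and following the Switch Transformer suggestion of using a higher dropout rate inside the experts than in the rest of the model:

```python
import torch.nn as nn

class ExpertFFN(nn.Module):
    """Hypothetical two-layer expert FFN with a larger internal dropout.

    Switch Transformer suggests a higher dropout inside the experts
    (e.g. 0.4) while keeping a smaller rate (~0.1) at non-expert layers.
    """

    def __init__(self, d_model, d_inner, expert_dropout=0.4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_inner)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(expert_dropout)  # expert-specific dropout rate
        self.fc2 = nn.Linear(d_inner, d_model)

    def forward(self, x):
        return self.fc2(self.drop(self.act(self.fc1(x))))
```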

@laekov
Owner

laekov commented Sep 23, 2021

@xptree @Sengxian any ideas?

@xptree
Collaborator

xptree commented Oct 2, 2021

@latifisalar It is not surprising. In your WT103 experiment, the MoE-Transformer-XL has 16x the parameters compared to the vanilla one, so it is much easier for the MoE model to overfit (given that WT103 is not a very hard dataset to fit).

On the other hand, it means the model capacity of the MoE model is much larger than that of the vanilla Transformer-XL, even with similar FLOPs.
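
To make that concrete, a rough back-of-envelope counting only the FFN/expert weights (the dimensions below are assumed transformer-xl base values, top-1 routing is assumed, and attention and the gating network are ignored; the exact numbers depend on your config):

```python
# Illustrative (assumed) dimensions, not taken from the actual run.
d_model, d_inner, n_experts, top_k = 410, 2100, 16, 1

ffn_params_dense = 2 * d_model * d_inner              # single shared FFN
ffn_params_moe   = n_experts * 2 * d_model * d_inner  # one FFN per expert

flops_per_token_dense = 2 * ffn_params_dense           # ~2 FLOPs per weight
flops_per_token_moe   = top_k * 2 * ffn_params_dense   # only k experts run per token

print(ffn_params_moe / ffn_params_dense)               # 16.0 -> 16x FFN parameters
print(flops_per_token_moe / flops_per_token_dense)     # 1.0  -> similar FLOPs per token
```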

@latifisalar
Author

@xptree That is a valid point; I was expecting the overfitting issue with a dataset as small as wt103. Would you mind telling me which dataset was used to train the GPT model in section 5.4? Are you using the latest Wikipedia dump? Also, I am guessing the loss curves in Figure 7 correspond to the pre-training phase. Did you check the test accuracy at the end of pre-training, or validate the model on any downstream tasks? I just wanted to see whether the overfitting issue goes away on larger datasets and MoE actually improves test accuracy in the end.

@xptree
Collaborator

xptree commented Oct 11, 2021

@latifisalar We report the training loss and ppl in our manuscript, and I agree that a validation loss/ppl curve is necessary here (perhaps we can add it in the next version). The dataset we used for pre-training is wiki.
