Improvements in validation/test ppl of the transformer-xl with MoE on wt103 #75
Hi there,

I have been running some experiments with mixture of experts and Transformer-XL. I noticed that while the training ppl improves significantly with the inclusion of MoE, the same improvement is not reflected in the validation/test ppl.

Here is an example with Transformer-XL trained on the wt103 dataset (the numbers correspond to the best model saved after full training):

Baseline Transformer-XL with 16 layers:
MoE Transformer-XL with 16 layers and 16 experts:

While we do get a minor improvement with MoE, it seems to me that training with MoE makes the model overfit. I have tried increasing the dropout rate of the expert layers, as suggested by Switch Transformer; that brought the training and evaluation performance closer together, but it has not led to a better test ppl. Have you faced similar issues?
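For reference, a minimal sketch of the expert-dropout adjustment described above, assuming a PyTorch-style feed-forward expert; the layer sizes, dropout rates, and class name are illustrative and not the exact configuration used in these experiments:

```python
import torch.nn as nn

class ExpertFFN(nn.Module):
    """One expert: a position-wise feed-forward block with its own dropout.

    Switch Transformer suggests a higher dropout rate inside the experts
    ("expert dropout") than in the rest of the model to curb overfitting
    on small datasets. Sizes below are illustrative.
    """
    def __init__(self, d_model=410, d_inner=2100, expert_dropout=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_inner),
            nn.ReLU(),
            nn.Dropout(expert_dropout),   # higher dropout inside the expert
            nn.Linear(d_inner, d_model),
            nn.Dropout(expert_dropout),
        )

    def forward(self, x):
        return self.net(x)

# Non-expert layers keep the usual (lower) dropout, e.g. 0.1,
# while each of the 16 experts uses a larger rate such as 0.4.
experts = nn.ModuleList([ExpertFFN(expert_dropout=0.4) for _ in range(16)])
```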
Comments

@latifisalar It is not surprising. In your WT103 experiment, the MoE Transformer-XL has 16x the parameters of the vanilla one, so it is much easier for the MoE model to overfit (given that WT103 is not a very hard dataset to fit). On the other hand, it means the model capacity of the MoE model is much larger than that of the vanilla Transformer-XL, even at similar FLOPs.
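For intuition, a rough back-of-the-envelope sketch of this point, assuming top-1 routing and illustrative layer sizes: with 16 experts the FFN parameter count grows roughly 16x, while the per-token FLOPs stay close to the dense model because each token only visits one expert.

```python
# Rough parameter/FLOPs comparison for a dense FFN vs. a 16-expert MoE FFN
# with top-1 routing (illustrative sizes, not the exact config used here).
d_model, d_inner, n_experts = 410, 2100, 16

ffn_params = 2 * d_model * d_inner           # two weight matrices, biases ignored
moe_params = n_experts * ffn_params          # every expert has its own FFN
gate_params = d_model * n_experts            # small routing layer

ffn_flops_per_token = 2 * ffn_params                    # ~2 FLOPs per weight (multiply-add)
moe_flops_per_token = 2 * (ffn_params + gate_params)    # top-1: one expert per token

print(f"params:      dense {ffn_params:,}  vs  MoE {moe_params + gate_params:,}")
print(f"flops/token: dense {ffn_flops_per_token:,}  vs  MoE {moe_flops_per_token:,}")
```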
@xptree That is a valid point. I was expecting the overfitting issue with a small dataset such as wt103. Would you mind if I ask which dataset was used to train the GPT model in Section 5.4? Are you using the latest Wikipedia dump? Also, I am guessing the loss curves in Figure 7 correspond to the pre-training phase. Did you check the test accuracy at the end of the pre-training phase, or validate the model on any downstream tasks? I just wanted to see whether the overfitting issue disappears on larger datasets and actually translates into better test accuracy in the end.
@latifisalar We report the train loss and ppl in our manuscript, and I agree that a validation loss/ppl curve is necessary here (perhaps we can add it in the next version). The dataset we used for pre-training is Wikipedia.
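A minimal sketch of how such a validation ppl curve could be tracked during training; the `model` and `val_iter` names are placeholders for whatever the training script provides, and the model is assumed to return a mean token-level loss:

```python
import math
import torch

@torch.no_grad()
def evaluate_ppl(model, val_iter):
    """Average token-level cross-entropy over the validation set, reported as ppl."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for data, target in val_iter:        # assumed (input, target) batches
        loss = model(data, target)       # assumed to return the mean token loss
        n_tokens = target.numel()
        total_loss += loss.item() * n_tokens
        total_tokens += n_tokens
    model.train()
    return math.exp(total_loss / total_tokens)
```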