
Can I use the Transformers pretraining script of T5 for mT5? #16571

Closed
StephennFernandes opened this issue Apr 3, 2022 · 7 comments

@StephennFernandes

@patrickvonplaten @lewtun @NielsRogge I am planning to pretrain a multilingual T5 small and/or medium model from scratch using the Hugging Face T5 pretraining script. I came across post #5079 and the Hugging Face implementation for T5. My question is: can I use the same T5 pretraining script by replacing T5Config with MT5Config? Would this work?

Also, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a block of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?

For the record, I am planning to pretrain mT5 on Indian languages using the OSCAR corpus and some additionally sourced text corpora.

@salrowili

salrowili commented Apr 4, 2022

Check out https://github.com/Shivanandroy/simpleT5
It is built on Transformers and PyTorch Lightning, and it supports mT5 training.
With PyTorch Lightning you can train mT5 on TPU with XLA support, but I think you would need to edit the code.
However, the T5 code with Flax (JAX) seems like your best option right now, since Flax is much faster than XLA with PyTorch.
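
A rough sketch of that Lightning route, assuming the Transformers and PyTorch Lightning APIs. The module name, learning rate, and batch layout here are illustrative assumptions, not code from simpleT5:

```python
# A rough sketch, not taken from simpleT5: wrapping a randomly initialized mT5 model
# in a PyTorch Lightning module so it can be pretrained from scratch, e.g. on TPU.
import pytorch_lightning as pl
import torch
from transformers import MT5Config, MT5ForConditionalGeneration

class MT5PretrainModule(pl.LightningModule):
    def __init__(self, config_name="google/mt5-small", lr=1e-4):
        super().__init__()
        config = MT5Config.from_pretrained(config_name)
        # Random initialization: pretraining from scratch, not fine-tuning.
        self.model = MT5ForConditionalGeneration(config)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch is assumed to contain input_ids, attention_mask and labels
        # produced by a span-corruption data collator.
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# On a TPU VM, recent Lightning versions accept something like:
# trainer = pl.Trainer(accelerator="tpu", devices=8, max_steps=100_000)
# trainer.fit(MT5PretrainModule(), train_dataloaders=train_loader)
```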

@StephennFernandes
Author

Cool, @salrowili, thanks a ton! By the way, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a block of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?

@patrickvonplaten
Contributor

Hey @StephennFernandes,

Yes, it should work just fine! Note that mT5 used more or less the same pretraining logic as T5v1.1, which is described here: https://huggingface.co/docs/transformers/model_doc/t5v1.1
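
In case it helps, a minimal sketch of what that swap might look like when setting up the model and tokenizer for span-corruption pretraining with Flax. Here `google/mt5-small` is used only as the source of the config and SentencePiece tokenizer; weights are initialized from scratch:

```python
# A minimal sketch, assuming the Transformers Flax classes: instantiate an
# mT5-small-sized model with random weights and reuse the multilingual tokenizer.
from transformers import AutoTokenizer, MT5Config, FlaxMT5ForConditionalGeneration

# The mT5 SentencePiece tokenizer (~250k vocab) already covers a wide range of
# languages, including many Indian languages.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# Load only the architecture hyperparameters; the weights are randomly initialized.
config = MT5Config.from_pretrained("google/mt5-small")
model = FlaxMT5ForConditionalGeneration(config, seed=42)

print(model.config.vocab_size, tokenizer.vocab_size)
```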

@StephennFernandes
Author

@patrickvonplaten What about sampling the text corpus? I have a text corpus that has been arranged sequentially by language, but the pretraining script randomly shuffles the data and packs it to max_seq_len so that batches have similar sequence lengths and train efficiently.

This works fine for T5, but for multilingual training the script would have the model train on randomly mixed data, where one sample sequence is French, the next Spanish, and so on. Is it okay for mT5 to randomly shuffle multilingual data and pretrain on it?
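
For context, multilingual setups do typically mix languages within batches, but the mix is usually controlled by sampling each language with a probability proportional to its corpus size raised to a power below 1 (the mT5 paper reports this kind of temperature-style sampling with alpha = 0.3), so low-resource languages are up-sampled rather than relying on a raw shuffle of the concatenated corpus. A hedged sketch with the `datasets` library, where the language codes and corpus sizes are placeholders:

```python
# A hedged sketch, not from this thread: building a multilingual pretraining mixture
# by interleaving per-language OSCAR streams with size-based sampling probabilities.
# The language codes and corpus sizes below are placeholders, not real statistics.
from datasets import load_dataset, interleave_datasets

languages = ["hi", "bn", "mr"]          # placeholder Indian-language subsets
sizes = [1.0e9, 6.0e8, 2.0e8]           # placeholder per-language corpus sizes
alpha = 0.3                             # alpha < 1 up-samples low-resource languages

streams = [
    load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split="train", streaming=True)
    for lang in languages
]

weights = [s ** alpha for s in sizes]
probabilities = [w / sum(weights) for w in weights]

# Examples from different languages end up interleaved, and can then be shuffled
# and packed to max_seq_len exactly as the monolingual T5 script already does.
mixed = interleave_datasets(streams, probabilities=probabilities, seed=42)
```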

@patrickvonplaten
Contributor

Hey @StephennFernandes,

Could you maybe ask such a question on the forum: https://discuss.huggingface.co/? :-) We try to keep Transformers issues for questions related just to the modeling code and bugs. Thanks!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@PiotrNawrot

We’ve released nanoT5, a minimal codebase that reproduces T5 (similar to BART) pre-training in PyTorch (not Flax), using Hugging Face.

You can take a look; it should be easy to modify it so that it works with multilingual data.
