Can I use the Transformers pretraining script of T5 for mT5? #16571
Comments
Check this: https://github.com/Shivanandroy/simpleT5
Cool, @salrowili, thanks a ton! By the way, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a sequence of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?
Hey @StephennFernandes, yes, it should work just fine! Note that mT5 used more or less the same pretraining logic as t5v1_1, which is described here: https://huggingface.co/docs/transformers/model_doc/t5v1.1
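A minimal sketch of what the config swap looks like, assuming the public `google/mt5-small` Hub checkpoint is used only as a source for the config and tokenizer; this is not the full pretraining script, just the model/tokenizer initialization it would need:

```python
# Minimal sketch: initialize an mT5-architecture model from scratch so the same
# T5 span-corruption pretraining loop can be reused. "google/mt5-small" is used
# here only to fetch the config and tokenizer, not the pretrained weights.
from transformers import MT5Config, AutoTokenizer, FlaxMT5ForConditionalGeneration

config = MT5Config.from_pretrained("google/mt5-small")         # mT5 architecture (t5v1_1-style, gated-GELU feed-forward)
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # 250k-piece multilingual SentencePiece vocabulary

# Build the model from the config alone (random weights), i.e. pretraining
# from scratch rather than continuing from the released checkpoint.
model = FlaxMT5ForConditionalGeneration(config, seed=0)
print(config.vocab_size)  # ~250k for mT5, vs ~32k for the English-only T5
```

The span-corruption objective itself is unchanged by the swap; only the config, tokenizer, and vocabulary size differ from the plain T5 setup.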
@patrickvonplaten, what about sampling the text corpus? I have a text corpus that has been arranged sequentially by language, but the pretraining script randomly shuffles the data and groups it by max_seq_len to yield batches of similar sequence length for efficient training. This works fine for T5, but for multilingual training the script would have the model train on randomly mixed samples, where one sequence is French, the next Spanish, and so on. Is it okay for mT5 to randomly shuffle multilingual data and pretrain?
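One common way to mix languages rather than feeding them sequentially is to interleave the per-language corpora with sampling probabilities derived from corpus sizes raised to an exponent (the mT5 paper uses an exponent of roughly 0.3 to up-sample low-resource languages). A sketch using `datasets.interleave_datasets`, with OSCAR subsets as illustrative placeholders:

```python
# Hedged sketch: mix several monolingual corpora so each batch contains a
# random blend of languages. The OSCAR config names below are examples; swap
# in whatever corpora you actually use.
from datasets import load_dataset, interleave_datasets

langs = {
    "hi": "unshuffled_deduplicated_hi",
    "ta": "unshuffled_deduplicated_ta",
    "mr": "unshuffled_deduplicated_mr",
}
raw = {code: load_dataset("oscar", name, split="train") for code, name in langs.items()}

# Exponent-smoothed sampling probabilities (alpha ~ 0.3, as in the mT5 paper),
# so smaller languages are sampled more often than their raw share of the data.
alpha = 0.3
weights = {code: len(ds) ** alpha for code, ds in raw.items()}
total = sum(weights.values())
probs = [weights[code] / total for code in langs]

# interleave_datasets draws each example's language according to `probabilities`,
# so the stream is already shuffled across languages before batching.
mixed = interleave_datasets(list(raw.values()), probabilities=probs, seed=42)
print(mixed[0]["text"][:200])
```

The resulting dataset can then be tokenized and grouped by sequence length as usual; the language mixing happens at the example level, not at the corpus level.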
Hey @StephennFernandes, could you maybe ask such a question on the forum: https://discuss.huggingface.co/? :-) We try to keep Transformers issues for questions related just to the modeling code and bugs. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
We’ve released nanoT5, a minimal codebase that reproduces T5-model (similar to BART) pre-training in PyTorch (not Flax), using Hugging Face. You can take a look; it should be easy to modify it so that it works with multilingual data.
@patrickvonplaten, @lewtun, @NielsRogge, I am planning to pretrain multilingual T5 small and/or medium from scratch using the Hugging Face T5 pre-training script. I came across this post #5079 and the Hugging Face implementation for T5. My question is: can I use the same pretraining script from T5 by replacing the T5Config with MT5Config? Would this work?
Also, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a sequence of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?
For the record, I am planning to pretrain mT5 on Indian languages using the OSCAR corpus and some additionally sourced text corpora.