
Can I use the Transformers pretraining script of T5 for mT5? #16571

Closed
StephennFernandes opened this issue Apr 3, 2022 · 7 comments

@StephennFernandes

@patrickvonplaten @lewtun @NielsRogge I am planning to pretrain a multilingual T5 small and/or medium model from scratch using the Hugging Face T5 pretraining script. I came across post #5079 and the Hugging Face implementation for T5. My question is: can I use the same T5 pretraining script by replacing T5Config with MT5Config? Would this work?

Also, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a block of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?

For the record, I am planning to pretrain mT5 on Indian languages using the OSCAR corpus and some additionally sourced text corpora.

@salrowili

salrowili commented Apr 4, 2022

Check out https://github.com/Shivanandroy/simpleT5
It is built on Transformers and PyTorch Lightning, and it supports mT5 training.
With PyTorch Lightning you can train mT5 on TPU with XLA support, but I think you would need to edit the code.
However, the T5 code with Flax (JAX) seems like your best option right now, since Flax is much faster than XLA with PyTorch.
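
A rough sketch of that Lightning route, assuming the Transformers and PyTorch Lightning APIs. The module name, learning rate, and batch layout here are illustrative assumptions, not code from simpleT5:

```python
# A rough sketch, not taken from simpleT5: wrapping a randomly initialized mT5 model
# in a PyTorch Lightning module so it can be pretrained from scratch, e.g. on TPU.
import pytorch_lightning as pl
import torch
from transformers import MT5Config, MT5ForConditionalGeneration

class MT5PretrainModule(pl.LightningModule):
    def __init__(self, config_name="google/mt5-small", lr=1e-4):
        super().__init__()
        config = MT5Config.from_pretrained(config_name)
        # Random initialization: pretraining from scratch, not fine-tuning.
        self.model = MT5ForConditionalGeneration(config)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch is assumed to contain input_ids, attention_mask and labels
        # produced by a span-corruption data collator.
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# On a TPU VM, recent Lightning versions accept something like:
# trainer = pl.Trainer(accelerator="tpu", devices=8, max_steps=100_000)
# trainer.fit(MT5PretrainModule(), train_dataloaders=train_loader)
```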

@StephennFernandes
Author

Cool, @salrowili, thanks a ton! By the way, how should the dataset be arranged for multilingual pretraining? Should all the languages be arranged in sequential order, where a block of one language is followed by another, e.g. [French, German, Italian], or should all the languages be randomly shuffled?

@patrickvonplaten
Contributor

Hey @StephennFernandes,

Yes, it should work just fine! Note that mT5 used more or less the same pretraining logic as T5v1.1, which is described here: https://huggingface.co/docs/transformers/model_doc/t5v1.1
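
In case it helps, a minimal sketch of what that swap might look like when setting up the model and tokenizer for span-corruption pretraining with Flax. Here `google/mt5-small` is used only as the source of the config and SentencePiece tokenizer; weights are initialized from scratch:

```python
# A minimal sketch, assuming the Transformers Flax classes: instantiate an
# mT5-small-sized model with random weights and reuse the multilingual tokenizer.
from transformers import AutoTokenizer, MT5Config, FlaxMT5ForConditionalGeneration

# The mT5 SentencePiece tokenizer (~250k vocab) already covers a wide range of
# languages, including many Indian languages.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# Load only the architecture hyperparameters; the weights are randomly initialized.
config = MT5Config.from_pretrained("google/mt5-small")
model = FlaxMT5ForConditionalGeneration(config, seed=42)

print(model.config.vocab_size, tokenizer.vocab_size)
```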

@StephennFernandes
Author

@patrickvonplaten What about sampling the text corpus? I have a text corpus that has been arranged sequentially by language, but the pretraining script randomly shuffles the data and packs it to max_seq_len so that batches have similar sequence lengths and train efficiently.

This works fine for T5, but for multilingual training the script would have the model train on randomly mixed data, where one sample sequence is French, the next Spanish, and so on. Is it okay for mT5 to randomly shuffle multilingual data and pretrain on it?
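
For context, multilingual setups do typically mix languages within batches, but the mix is usually controlled by sampling each language with a probability proportional to its corpus size raised to a power below 1 (the mT5 paper reports this kind of temperature-style sampling with alpha = 0.3), so low-resource languages are up-sampled rather than relying on a raw shuffle of the concatenated corpus. A hedged sketch with the `datasets` library, where the language codes and corpus sizes are placeholders:

```python
# A hedged sketch, not from this thread: building a multilingual pretraining mixture
# by interleaving per-language OSCAR streams with size-based sampling probabilities.
# The language codes and corpus sizes below are placeholders, not real statistics.
from datasets import load_dataset, interleave_datasets

languages = ["hi", "bn", "mr"]          # placeholder Indian-language subsets
sizes = [1.0e9, 6.0e8, 2.0e8]           # placeholder per-language corpus sizes
alpha = 0.3                             # alpha < 1 up-samples low-resource languages

streams = [
    load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split="train", streaming=True)
    for lang in languages
]

weights = [s ** alpha for s in sizes]
probabilities = [w / sum(weights) for w in weights]

# Examples from different languages end up interleaved, and can then be shuffled
# and packed to max_seq_len exactly as the monolingual T5 script already does.
mixed = interleave_datasets(streams, probabilities=probabilities, seed=42)
```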

@patrickvonplaten
Contributor

Hey @StephennFernandes,

Could you maybe ask such a question on the forum: https://discuss.huggingface.co/? :-) We try to keep Transformers issues for questions related just to the modeling code and bugs. Thanks!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@PiotrNawrot

We’ve released nanoT5, a minimal codebase that reproduces T5 (similar to BART) pre-training in PyTorch (not Flax), using Hugging Face.

You can take a look; it should be easy to modify it so that it works with multilingual data.
