How to just use the mixture functionality in seqio #244
Hey Stephen, you can remove the preprocessors you don't need from your seqio Task definition, e.g. the tokenization-related ones; see [1].

[1] https://seqio.readthedocs.io/en/latest/overview.html#preprocessors
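A minimal sketch of such a text-only Task. The task name `hindi_corpus`, the toy `FunctionDataSource`, and the `"text"` key are assumptions for illustration, not from the thread:

```python
import functools

import seqio
import tensorflow as tf


# Hypothetical in-memory source; in practice this would read your corpus.
def dataset_fn(split, shuffle_files, seed=None):
    del split, shuffle_files, seed  # Unused in this toy example.
    return tf.data.Dataset.from_tensor_slices(
        {"text": ["नमस्ते दुनिया", "एक और वाक्य"]})


seqio.TaskRegistry.add(
    "hindi_corpus",
    source=seqio.FunctionDataSource(dataset_fn=dataset_fn, splits=["train"]),
    preprocessors=[
        # Only rekey the raw field; seqio.preprocessors.tokenize and
        # append_eos are deliberately left out so examples stay raw text.
        functools.partial(seqio.preprocessors.rekey,
                          key_map={"targets": "text"}),
    ],
    output_features={
        # rank=0 / dtype=tf.string: see the later comments in this thread.
        "targets": seqio.Feature(
            vocabulary=seqio.PassThroughVocabulary(size=0),
            add_eos=False,
            dtype=tf.string,
            rank=0),
    },
)
```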
Hey @gauravmishra, thank you enormously for replying and helping out. I did as you instructed, but sadly I get an error. Here's the code for the same:

I have mentioned the ERROR above. Am I doing something wrong? I went through the docs to find a fix, but couldn't find anything.
Hey, setting the rank of your feature to 0 should work:
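For concreteness, a sketch of a rank-0 string feature; the `PassThroughVocabulary` and the `"targets"` name are carried over from the sketch above, not from the thread:

```python
import seqio
import tensorflow as tf

# rank=0 declares each example's "targets" to be a scalar string rather
# than a 1-D token sequence, so seqio skips sequence-length handling for it.
output_features = {
    "targets": seqio.Feature(
        vocabulary=seqio.PassThroughVocabulary(size=0),
        add_eos=False,
        dtype=tf.string,
        rank=0),
}
```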
@gauravmishra, thanks a ton once again, but sadly I am still facing issues here. I am really sorry to keep prompting you. After setting rank=0 and running:

I got a TypeError from `tf.data.Dataset.as_numpy_iterator()`, so I tried to iterate without calling it, like this:

And I get the output in byte representation, where each item is a bytes object.
Hey Stephen, you should remove the "inputs" key from your target_to_key preprocessor. This should get rid of the "inputs" field that is getting set to None.
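The snippet itself didn't survive extraction; here is a sketch of the intended preprocessors list, using `seqio.preprocessors.rekey` as a stand-in for the thread's `target_to_key` helper:

```python
import functools

import seqio

preprocessors = [
    # key_map contains only "targets", so no "inputs" field is produced
    # (and therefore none can end up set to None).
    functools.partial(seqio.preprocessors.rekey,
                      key_map={"targets": "text"}),
]
```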
@gauravmishra Hey Gaurav, I did as you instructed, but the output is still in bytes format. The following is the code for the same:
output:
Something like the following should work; or, if Hugging Face works with numpy, then you can pass the numpy values through directly.
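The suggested snippet was lost in extraction; a sketch of the decoding step, reusing the hypothetical `hindi_corpus` task from above:

```python
import seqio

ds = seqio.get_mixture_or_task("hindi_corpus").get_dataset(
    sequence_length=None, split="train", shuffle=True)

# tf.string tensors surface as Python `bytes` via as_numpy_iterator();
# decode them to `str` before handing the text to the HF tokenizer.
texts = [ex["targets"].decode("utf-8") for ex in ds.as_numpy_iterator()]
```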
@gauravmishra Hey Gaurav, it works fine! ... Cannot thank you enough for all the dedicated help. It really means a lot. Thanks a ton 🙏
@gauravmishra Hey Gaurav, I had a doubt: what is the best way to decide which mixture ratio is optimal? In the mT5 paper the alpha value of 0.3 gave the best balance between performance on high- and low-resource languages. However, I am pretraining mT5 on Indian languages, and I have a diverse multilingual corpus where Hindi has 60M+ samples and Kashmiri has around 100k samples. So I wanted to know whether I could somehow hyperparameter-tune this on t5x, or would just using alpha=0.3 work fine in my use case?
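Not an answer from the thread, but for reference: the mT5 scheme samples language l with probability proportional to n_l^alpha, and seqio mixture rates are relative, so the exponent can be applied directly to example counts. The task names below are hypothetical and assumed to be registered already:

```python
import seqio

# Approximate per-task example counts from the thread.
counts = {"hindi_task": 60_000_000, "kashmiri_task": 100_000}
alpha = 0.3  # mT5's sampling exponent.

# rate ∝ count**alpha. With alpha=0.3, Hindi:Kashmiri is sampled at
# roughly 6.8:1 instead of the raw 600:1 ratio of corpus sizes.
seqio.MixtureRegistry.add(
    "indic_mix",
    [(name, n ** alpha) for name, n in counts.items()],
)
```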
Hey there, I've been wanting to pretrain mT5 with the Hugging Face training script mentioned here: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py

But sadly the Hugging Face script doesn't support a mixture, so there is no way to pretrain mT5 such that the model generalises well on low-resource as well as high-resource languages.

Hence I've been wanting to use the mixture functionality of seqio, but upon using it I have to tokenize the data into the T5 SentencePiece vocabulary, and the seqio tasks do all the preprocessing.

The Hugging Face trainer takes care of the preprocessing, mapping the dataset through the tokenizer, etc.

My question is: is there a way I could use just the mixture functionality of seqio without doing any preprocessing on the incoming datasets?

I was wondering if there is a way to feed in multiple datasets and get back an output dataset (in text/str format) that is simply an appropriate mixture of the samples of those datasets, which I could then use to pretrain with the HF trainer, doing all the preprocessing there.
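Under those constraints, the end-to-end idea might look like the following sketch, which streams the seqio mixture into a Hugging Face `datasets` object. The mixture name and `"targets"` key follow the hypothetical sketches above, and `IterableDataset.from_generator` (available in recent `datasets` versions) avoids materializing the full mixture:

```python
import seqio
from datasets import IterableDataset


def text_examples():
    # sequence_length=None: no trimming or packing, examples stay raw strings.
    ds = seqio.get_mixture_or_task("indic_mix").get_dataset(
        sequence_length=None, split="train", shuffle=True)
    for ex in ds.as_numpy_iterator():
        yield {"text": ex["targets"].decode("utf-8")}


# Stream the mixed corpus into the HF trainer's preprocessing pipeline.
hf_dataset = IterableDataset.from_generator(text_examples)
```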