New models #18

Open
avostryakov opened this issue Aug 18, 2020 · 4 comments

Comments

@avostryakov

Thanks for all of these models! Sometimes they work comparably to Google Translate!

I noticed that you improved the model for French and several other languages. Do you have plans to do the same for the es-en, pt-en, da-en, and it-en pairs?

And what was the trick that improved results?

@jorgtied
Member

At the moment we are focusing on training models for the Tatoeba MT Challenge that we released recently (https://github.com/Helsinki-NLP/Tatoeba-Challenge). There will be some updated models there; check it out. Otherwise, we will continue updating existing language pairs, but progress may be slow as training requires a lot of resources and time. I cannot promise new models frequently.

@jorgtied
Member

And, yes, the trick to improve the models is to train more. SentencePiece-based segmentation is also useful, along with some other small improvements in data pre-processing.
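For reference, a minimal sketch of SentencePiece segmentation with the sentencepiece Python package (the file names, vocabulary size, and model type here are illustrative, not the actual OPUS-MT settings):

```python
import sentencepiece as spm

# Train a SentencePiece model on the source-side training text.
# (Corpus path and vocab size are placeholders, not OPUS-MT's real config.)
spm.SentencePieceTrainer.train(
    input="train.src.txt",
    model_prefix="spm.src",
    vocab_size=32000,
    model_type="unigram",
)

# Segment sentences with the trained model before feeding them to the MT system.
sp = spm.SentencePieceProcessor(model_file="spm.src.model")
pieces = sp.encode("This is a test sentence.", out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁a', '▁test', '▁sentence', '.']
```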

@avostryakov
Author

Oh, great! Thanks again for the Tatoeba-Challenge project! You recently published the Spanish-to-English and other models that we need!
By the way, about the pre-processing step for OPUS datasets: maybe you have read Facebook's paper https://arxiv.org/pdf/1907.06616.pdf (Facebook FAIR's WMT19 News Translation Task Submission). There are two important steps there:

  • applying language identification filtering, for example with the CLD2 library;
  • removing sentence pairs with a source/target length ratio exceeding 1.5 (a rough sketch of both filters is below).
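A rough sketch of both filters, assuming the pycld2 bindings for CLD2 and simple whitespace token counts (the language codes and the 1.5 threshold are just the values mentioned above):

```python
import pycld2

def keep_pair(src, tgt, src_lang="es", tgt_lang="en", max_ratio=1.5):
    """Return True if the sentence pair passes both filters."""
    # Language identification filtering (CLD2): keep the pair only if the
    # detected language of each side matches the expected language code.
    for text, expected in ((src, src_lang), (tgt, tgt_lang)):
        is_reliable, _, details = pycld2.detect(text)
        if not is_reliable or details[0][1] != expected:
            return False

    # Length-ratio filtering: drop pairs whose source/target length ratio
    # (in whitespace tokens) exceeds max_ratio in either direction.
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio

print(keep_pair("Hola, ¿cómo estás hoy?", "Hello, how are you today?"))
```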

And, of course, back-translation. I noticed that you already do something with back-translation. There is another Facebook paper with the details: https://arxiv.org/abs/1808.09381. This step alone improved their BLEU by 4 points.
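For illustration, one possible way to generate such back-translations with a released OPUS-MT model through the Hugging Face transformers wrappers (the model name and sentences are only examples, not the actual OPUS-MT back-translation setup):

```python
from transformers import MarianMTModel, MarianTokenizer

# Back-translation sketch: translate monolingual English text into Spanish with
# a reverse-direction model, then pair the output with the original English
# sentences as synthetic es-en training data.
model_name = "Helsinki-NLP/opus-mt-en-es"  # reverse direction of the es-en system
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

monolingual_en = ["The weather is nice today.", "She reads a book every week."]
batch = tokenizer(monolingual_en, return_tensors="pt", padding=True)
generated = model.generate(**batch)
synthetic_es = [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

# (synthetic_es[i], monolingual_en[i]) now form synthetic source-target pairs.
for src, tgt in zip(synthetic_es, monolingual_en):
    print(src, "->", tgt)
```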

@jorgtied
Member

Yes, I do apply language identification and some other basic filtering in the new Tatoeba-MT models. Length-ratio filtering has always been part of the pipeline; it has been well known since the old SMT days and the Moses tools. However, I am not as strict as the paper suggests, and there are a lot of hyper-parameters that can be optimized for each language pair. Back-translation is part of all models that include "+bt" in their name. I need to stress that the OPUS-MT models are not tuned towards news translation from the WMT test sets, so it is not surprising if there are performance differences, as simple domain adaptation boosts performance a lot. I will try to also include some fine-tuned models later; a fine-tuning framework is already integrated in OPUS-MT.

By the way, it's a bit funny that most people point to Facebook/Google papers when they refer to techniques developed and proposed by researchers in academia. I guess that universities have to improve their PR units ...
