Some single words are being duplicated in the translation in EN-RU combination #2

Open
acornestean opened this issue Sep 9, 2021 · 2 comments


@acornestean

[Affected versions]:
Firefox Nightly (94.0a1/20210908213905)

[Affected Platforms]:
Windows 10 x64

[Prerequisites]:
Access https://mozilla.github.io/translate/.

[Steps to reproduce]:

  1. On the translation website, set EN-RU as a language combination
  2. In the “From” field, type a single word, for example “mother”, “love”, or “people” (these were the only ones I could find that behave this way; other words such as “pear”, “pea”, and “carrot” produce only one occurrence of the translated word)
  3. Notice that in the translation field, the translated word appears 2 times.

NOTES:

  1. Adding a second word in the “From” field, after the original one, will cause the duplicate translated word to be replaced by the translation of the second one i.e. “mother” gets translated to “мать мать”, but “mother apple” gets translated to “мать яблоко”.
  2. Capitalizing the word to be translated results in only one occurrence of the translated word, e.g. “love” gets translated to “Любить любовь”, but “Love” gets translated to “Любовь”.

[Expected]:
Translating one word should result in only one appearance of the translated word and not duplicates.

[Actual]:
Translating one word causes the translated word to appear twice.
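
For anyone who wants to turn the steps above into an automated check, here is a minimal sketch in Python. It only inspects a translated string that has already been obtained; the function name `has_adjacent_duplicate` is hypothetical and not part of the translation tooling. It flags exact, case-insensitive repeats such as “мать мать”, so it would not catch near-duplicates like “Любить любовь”.

```python
def has_adjacent_duplicate(translation: str) -> bool:
    """Return True if a token is immediately followed by the same token.

    Comparison is case-insensitive and ignores surrounding punctuation.
    Hypothetical helper for illustration only.
    """
    punctuation = ".,!?;:\"'«»"
    tokens = [t.strip(punctuation).casefold() for t in translation.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    return any(a == b for a, b in zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Outputs quoted in this report.
    for text in ["мать мать", "мать яблоко", "Любовь"]:
        print(text, has_adjacent_duplicate(text))
```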

@eu9ene
Collaborator

eu9ene commented Sep 9, 2021

This is a very interesting finding. I updated the model to the latest version, trained on a bigger dataset and it became even funnier: mother is translated as "Мать и сестра" (mother and sister) now! I see similar wrong behaviour for many other single-word examples.

I assume it's because the model was trained only on long sentences and has never seen single-word ones (we have special cleaning rules that filter them out). That might make sense for web page translation, but it doesn't for a Google Translate-style user experience. What is weird is that we don't see this problem for other languages. We'll do another pass to improve the quality of the models; maybe it will be fixed then.
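
To illustrate the kind of cleaning rule meant here, below is a rough sketch in Python of a length filter that drops very short sentence pairs. The actual rules in the training pipeline are not shown in this thread, so the function name `filter_short_pairs` and the `min_words` threshold are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_short_pairs(pairs: Iterable[Pair], min_words: int = 2) -> Iterator[Pair]:
    """Yield only pairs where both sides have at least `min_words` words.

    Hypothetical rule: the real cleaning rules may differ.
    """
    for src, tgt in pairs:
        if len(src.split()) >= min_words and len(tgt.split()) >= min_words:
            yield src, tgt


corpus = [
    ("mother", "мать"),                       # single-word pair, dropped
    ("I love my mother", "Я люблю свою мать"),  # kept
]
kept = list(filter_short_pairs(corpus))
print(kept)
```

With a rule like this, a pair such as ("mother", "мать") never reaches training, which would be consistent with the model behaving oddly on single-word input.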

@kpu

kpu commented Oct 6, 2021

Clean corpora should be allowed to bypass the bicleaner rules. For better or worse, this means a manual mapping of corpus to cleanliness. It's happening for ru because, as I understand it, @eu9ene's current pipeline puts everything through bicleaner, whereas the consortium-provided models used the janky manual pipeline in which only some corpora are cleaned with bicleaner.
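
A minimal sketch of what such a manual mapping could look like, in Python. The corpus names, the `CORPUS_IS_CLEAN` table, and `select_pairs` are all hypothetical, and `keeps_multiword` is only a stand-in for the noisy-corpus filter; it does not call bicleaner or reproduce its behaviour.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

# Hypothetical manual mapping of corpus name to cleanliness.
CORPUS_IS_CLEAN: Dict[str, bool] = {
    "curated_corpus": True,   # trusted: bypasses the filter
    "crawled_corpus": False,  # noisy: goes through the filter
}


def select_pairs(
    corpus_name: str,
    pairs: Iterable[Pair],
    noisy_filter: Callable[[Pair], bool],
) -> List[Pair]:
    """Return all pairs for clean corpora, filtered pairs otherwise.

    Unknown corpora are treated as noisy, which is the conservative default.
    """
    if CORPUS_IS_CLEAN.get(corpus_name, False):
        return list(pairs)
    return [pair for pair in pairs if noisy_filter(pair)]


def keeps_multiword(pair: Pair) -> bool:
    """Stand-in filter (assumption): reject pairs with a single-word side."""
    src, tgt = pair
    return len(src.split()) > 1 and len(tgt.split()) > 1


single_word = [("mother", "мать")]
print(select_pairs("curated_corpus", single_word, keeps_multiword))  # pair survives
print(select_pairs("crawled_corpus", single_word, keeps_multiword))  # pair is dropped
```

The design choice is that short but correct pairs from trusted corpora survive, while crawled data still gets filtered.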
