-
Notifications
You must be signed in to change notification settings - Fork 777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tokenizer.perl: split final dots unconditionally #204
Conversation
Allow tokenization of non-breaking prefixes at end of sentences. This should be a fair compromise in many cases to construct a cleaner vocabulary. EN-old: So am I. EN-new: So am I . DE-old: ... schwer wie ein iPhone 5. DE-new: ... schwer wie ein iPhone 5 . FR-old: Des gens admirent une œuvre d' art. FR-new: Des gens admirent une œuvre d' art . CS-old: Dvě děti, které běží bez bot. CS-new: Dvě děti, které běží bez bot .
Shouldn't things like this be optional? The Moses tokenizer is pretty much a standard tool by now. You are introducing backwards-incompatibility. BPE or other subword methods are handling cases like this just fine, so a performance improvement is unlikely at the cost of predictability. |
The correct working of tokenizer should not be relying on post-fixes that may be done by tools like subword-nmt or sentencepiece, etc right? I noticed this when I was working completely on word-level. But you may be right, this is a behavior change. But in that case, Moses tokenizer should be frozen and no longer be modified. So let's hear from other people as well. Thanks for the comment :) |
I am actually for freezing it. Regardless of correctness. |
Also, isn't there a whole bunch of regression tests that would now blow up? This affects the entire downstream path. |
I would side with Ozan on this. If people want backward compatibility, they can add a flag --old-behaviour to the tokenizer script, or use a previous version of Moses. This change looks like it would result in better translation so I am minded to make it the default I don't think there is a test for the tokenizer. But in any case, regression tests are there to ensure that we don't accidently change something, not to prevent us from making a positive change. We will change the tests if needed Any other comments appreciated |
To add ' --old-behaviour' to your work-flow you would still need to figure out where the sudden change is coming from. This would happen without warning. |
Btw not only final fullstop has this problem, apostrophe too. https://github.com/alvations/sacremoses/issues/3 |
pulling as it looks like it will improve translation. Backward compatbility is a secondary priority imo |
@alvations did you fix it in sacremoses? |
It's night time here 😉 But I'll work on an equivalent patch with some backwards compatibility in sacremoses when I'm awake. Could we give some versioning to the Moses tokenizer? Then it'll be easy for ports and people to refer to which version they are using |
@alvations Is this a version of the sacremoses tokenizer? I was also on the market for a python moses tokenizer and added this to moses |
Hmmm, tricky, I was looking into too and didn't spot much difference until @PatDue raised the issue if the last token is one of the non-splitting string. https://github.com/alvations/sacremoses/issues/21 |
Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.
EN-old: So am I.
EN-new: So am I .
DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .
FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .
CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .