tokenizer.perl: split final dots unconditionally #204

ozancaglayan · 2018-11-07T10:12:53Z

Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .

Allow tokenization of non-breaking prefixes at end of sentences. This should be a fair compromise in many cases to construct a cleaner vocabulary. EN-old: So am I. EN-new: So am I . DE-old: ... schwer wie ein iPhone 5. DE-new: ... schwer wie ein iPhone 5 . FR-old: Des gens admirent une œuvre d' art. FR-new: Des gens admirent une œuvre d' art . CS-old: Dvě děti, které běží bez bot. CS-new: Dvě děti, které běží bez bot .

emjotde · 2018-11-07T10:15:36Z

Shouldn't things like this be optional? The Moses tokenizer is pretty much a standard tool by now. You are introducing backwards-incompatibility. BPE or other subword methods are handling cases like this just fine, so a performance improvement is unlikely at the cost of predictability.

ozancaglayan · 2018-11-07T10:21:15Z

The correct working of tokenizer should not be relying on post-fixes that may be done by tools like subword-nmt or sentencepiece, etc right? I noticed this when I was working completely on word-level. But you may be right, this is a behavior change. But in that case, Moses tokenizer should be frozen and no longer be modified. So let's hear from other people as well.

Thanks for the comment :)

emjotde · 2018-11-07T10:23:51Z

I am actually for freezing it. Regardless of correctness.

emjotde · 2018-11-07T10:27:23Z

Also, isn't there a whole bunch of regression tests that would now blow up? This affects the entire downstream path.

hieuhoang · 2018-11-07T11:44:51Z

I would side with Ozan on this. If people want backward compatibility, they can add a flag --old-behaviour to the tokenizer script, or use a previous version of Moses. This change looks like it would result in better translation so I am minded to make it the default

I don't think there is a test for the tokenizer. But in any case, regression tests are there to ensure that we don't accidently change something, not to prevent us from making a positive change. We will change the tests if needed

Any other comments appreciated

emjotde · 2018-11-07T11:47:32Z

To add ' --old-behaviour' to your work-flow you would still need to figure out where the sudden change is coming from. This would happen without warning.

alvations · 2018-11-07T13:25:15Z

Btw not only final fullstop has this problem, apostrophe too. https://github.com/alvations/sacremoses/issues/3

hieuhoang · 2018-11-07T14:40:03Z

pulling as it looks like it will improve translation. Backward compatbility is a secondary priority imo

hieuhoang · 2018-11-07T14:44:58Z

@alvations did you fix it in sacremoses?

alvations · 2018-11-07T15:24:38Z

It's night time here 😉 But I'll work on an equivalent patch with some backwards compatibility in sacremoses when I'm awake.

Could we give some versioning to the Moses tokenizer? Then it'll be easy for ports and people to refer to which version they are using

hieuhoang · 2018-11-09T13:09:10Z

@alvations Is this a version of the sacremoses tokenizer?
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py
It might be better to delete it from moses and tell people about sacremoses instead.

I was also on the market for a python moses tokenizer and added this to moses
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer/mosestokenizer
based on Luis Gomes' code
https://pypi.org/project/mosestokenizer/

alvations · 2019-01-25T07:07:42Z

Hmmm, tricky, I was looking into too and didn't spot much difference until @PatDue raised the issue if the last token is one of the non-splitting string. https://github.com/alvations/sacremoses/issues/21

hieuhoang merged commit 2217bc1 into moses-smt:master Nov 7, 2018

alvations mentioned this pull request Jan 25, 2019

"p.m." is not tokenized as in the original script. hplt-project/sacremoses#21

Open

alvations mentioned this pull request May 2, 2019

Handle trailing dot apostrophe. hplt-project/sacremoses#47

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tokenizer.perl: split final dots unconditionally #204

tokenizer.perl: split final dots unconditionally #204

ozancaglayan commented Nov 7, 2018

emjotde commented Nov 7, 2018

ozancaglayan commented Nov 7, 2018

emjotde commented Nov 7, 2018

emjotde commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

emjotde commented Nov 7, 2018

alvations commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

alvations commented Nov 7, 2018 •

edited

Loading

hieuhoang commented Nov 9, 2018

alvations commented Jan 25, 2019

tokenizer.perl: split final dots unconditionally #204

tokenizer.perl: split final dots unconditionally #204

Conversation

ozancaglayan commented Nov 7, 2018

emjotde commented Nov 7, 2018

ozancaglayan commented Nov 7, 2018

emjotde commented Nov 7, 2018

emjotde commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

emjotde commented Nov 7, 2018

alvations commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

hieuhoang commented Nov 7, 2018

alvations commented Nov 7, 2018 • edited Loading

hieuhoang commented Nov 9, 2018

alvations commented Jan 25, 2019

alvations commented Nov 7, 2018 •

edited

Loading