Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokenizer.perl: split final dots unconditionally #204

Merged
merged 1 commit into from
Nov 7, 2018
Merged

tokenizer.perl: split final dots unconditionally #204

merged 1 commit into from
Nov 7, 2018

Conversation

ozancaglayan
Copy link
Contributor

Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .

Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .
@emjotde
Copy link
Contributor

emjotde commented Nov 7, 2018

Shouldn't things like this be optional? The Moses tokenizer is pretty much a standard tool by now. You are introducing backwards-incompatibility. BPE or other subword methods are handling cases like this just fine, so a performance improvement is unlikely at the cost of predictability.

@ozancaglayan
Copy link
Contributor Author

The correct working of tokenizer should not be relying on post-fixes that may be done by tools like subword-nmt or sentencepiece, etc right? I noticed this when I was working completely on word-level. But you may be right, this is a behavior change. But in that case, Moses tokenizer should be frozen and no longer be modified. So let's hear from other people as well.

Thanks for the comment :)

@emjotde
Copy link
Contributor

emjotde commented Nov 7, 2018

I am actually for freezing it. Regardless of correctness.

@emjotde
Copy link
Contributor

emjotde commented Nov 7, 2018

Also, isn't there a whole bunch of regression tests that would now blow up? This affects the entire downstream path.

@hieuhoang
Copy link
Contributor

I would side with Ozan on this. If people want backward compatibility, they can add a flag --old-behaviour to the tokenizer script, or use a previous version of Moses. This change looks like it would result in better translation so I am minded to make it the default

I don't think there is a test for the tokenizer. But in any case, regression tests are there to ensure that we don't accidently change something, not to prevent us from making a positive change. We will change the tests if needed

Any other comments appreciated

@emjotde
Copy link
Contributor

emjotde commented Nov 7, 2018

To add ' --old-behaviour' to your work-flow you would still need to figure out where the sudden change is coming from. This would happen without warning.

@alvations
Copy link
Contributor

Btw not only final fullstop has this problem, apostrophe too. https://github.com/alvations/sacremoses/issues/3

@hieuhoang hieuhoang merged commit 2217bc1 into moses-smt:master Nov 7, 2018
@hieuhoang
Copy link
Contributor

pulling as it looks like it will improve translation. Backward compatbility is a secondary priority imo

@hieuhoang
Copy link
Contributor

@alvations did you fix it in sacremoses?

@alvations
Copy link
Contributor

alvations commented Nov 7, 2018

It's night time here 😉 But I'll work on an equivalent patch with some backwards compatibility in sacremoses when I'm awake.

Could we give some versioning to the Moses tokenizer? Then it'll be easy for ports and people to refer to which version they are using

@hieuhoang
Copy link
Contributor

@alvations Is this a version of the sacremoses tokenizer?
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/python-tokenizer/moses.py
It might be better to delete it from moses and tell people about sacremoses instead.

I was also on the market for a python moses tokenizer and added this to moses
https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer/mosestokenizer
based on Luis Gomes' code
https://pypi.org/project/mosestokenizer/

@alvations
Copy link
Contributor

Hmmm, tricky, I was looking into too and didn't spot much difference until @PatDue raised the issue if the last token is one of the non-splitting string. https://github.com/alvations/sacremoses/issues/21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants