Parity with perl normalize. #146
@NIXBLACK11 looks great to me! @alvations / @jelmervdl would you mind kindly reviewing? For context, we have been using the moses perl script
Ohhh thanks! I wanted to do this for a while and release it as a 2.0 breaking change, but happy to accept it behind a flag until then! You said if perl_parity:
self.NORMALIZE_UNICODE[11] = ("’", r'"')
self.FRENCH_QUOTES[0] = ("\u00A0«\u00A0", r' "')
self.FRENCH_QUOTES[3] = ("\u00A0»\u00A0", r'" ') … since I'm a bit scared that any update or refactor of those regex arrays will get these specific indexes out of sync. |
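One possible way to address the index-drift concern, sketched here under assumptions: look each substitution up by its pattern string instead of its list position, and fail loudly if the pattern is gone. The helper `override_substitution` is hypothetical and not part of sacremoses, and the sample list is only a stand-in: entries 0 and 3 come from this thread, the rest are illustrative.

```python
def override_substitution(substitutions, pattern, replacement):
    """Return a copy of `substitutions` with the entry for `pattern` replaced.

    Raises KeyError when the pattern is absent, so a refactor that renames or
    removes a rule breaks immediately instead of patching the wrong index.
    """
    if pattern not in [p for p, _ in substitutions]:
        raise KeyError(f"no substitution with pattern {pattern!r}")
    return [(p, replacement if p == pattern else r) for p, r in substitutions]


# Hypothetical stand-in for the real list (only entries 0 and 3 are from the PR).
FRENCH_QUOTES = [
    ("\u00A0«\u00A0", r'"'),
    ("«\u00A0", r'"'),
    ("«", r'"'),
    ("\u00A0»\u00A0", r'"'),
    ("\u00A0»", r'"'),
    ("»", r'"'),
]

# Apply the two perl-parity overrides by pattern rather than by index.
patched = override_substitution(FRENCH_QUOTES, "\u00A0«\u00A0", r' "')
patched = override_substitution(patched, "\u00A0»\u00A0", r'" ')
```

The original list is left untouched, so the default behaviour and the parity behaviour can coexist without either depending on list order.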
Just to confirm, this is the perl script you're comparing against? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl
Or is it @kpu's version: https://github.com/kpu/preprocess/blob/master/moses/tokenizer/normalize-punctuation.perl
There's only a small difference, but hey:
45,46c45,46
< s/‘/\'/g;
< s/‚/\'/g;
---
> s/‘/\"/g;
> s/‚/\"/g;
@jelmervdl it's compared to the "official" version in your first link: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl
Hi @jelmervdl! Sure, here are three example sentences from FLORES200 which contain at least one example of each of the above regexes (in order):
if perl_parity: |
We should probably add unit tests to https://github.com/hplt-project/sacremoses/blob/master/sacremoses/test/test_normalizer.py
that compare the normalizer's output to expected values with and without this flag.
In a comment to this PR, Kevin provided some examples of sentences that trigger each of the regexes, but for the test we can probably use even shorter phrases, to keep the test readable.
Thanks @avidale for your feedback and suggestions. I've added unit tests to the test_normalizer.py file, as per your recommendation here.
@jelmervdl, could you please take a look at the updates and provide your feedback?
I made the non-breaking spaces in the test a bit more explicit. But looks good to merge to me now.
In the modified Python script, several changes have been made to improve its functionality and maintain parity with the original Perl script.
Parity with perl normalize
The original Perl script has the regex (s/’/"/g) at line 47. It was ported as ("’", r"'") at line 43, which is incorrect, so I updated it to ("’", r'"').
The original Perl script has the regex (s/ « / "/g) at line 52. It was ported as ("\u00A0«\u00A0", r'"') at line 50, which is incorrect, so I updated it to ("\u00A0«\u00A0", r' "').
The original Perl script has the regex (s/ » /" /g) at line 55. It was ported as ("\u00A0»\u00A0", r'"') at line 53, which is incorrect, so I updated it to ("\u00A0»\u00A0", r'" ').
A new argument named perl_parity has been introduced in the constructor. This argument is optional and defaults to False. It doesn't affect the script's behavior when using the default arguments. However, if set to True, it enforces certain changes to align the Python script more closely with the original Perl script.
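To make the effect of the flag concrete, here is a minimal standalone sketch that applies only the three substitution pairs quoted above using plain `re.sub`. It does not call sacremoses; the function `apply_rules` and the sample sentence are assumptions made for illustration, and the real normalizer applies many more rules around these.

```python
import re

# The three rules as originally ported (default behaviour).
DEFAULT_RULES = [
    ("’", r"'"),
    ("\u00A0«\u00A0", r'"'),
    ("\u00A0»\u00A0", r'"'),
]

# The same three rules as corrected to match the Perl script.
PERL_PARITY_RULES = [
    ("’", r'"'),
    ("\u00A0«\u00A0", r' "'),
    ("\u00A0»\u00A0", r'" '),
]


def apply_rules(text, perl_parity=False):
    """Apply the three substitutions in order (hypothetical helper)."""
    rules = PERL_PARITY_RULES if perl_parity else DEFAULT_RULES
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


# Illustrative input: guillemets surrounded by non-breaking spaces (U+00A0).
sample = "Il dit\u00A0«\u00A0oui\u00A0»\u00A0et l’autre"

default_out = apply_rules(sample)                    # Il dit"oui"et l'autre
parity_out = apply_rules(sample, perl_parity=True)   # Il dit "oui" et l"autre
```

With the default rules the spaces around the quotes are dropped; with the parity rules a regular space is kept outside each quote and the right single quotation mark becomes a double quote, as in the Perl regexes.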
Test plan
The test plan below runs the normalize function with both the default arguments and the newly added argument perl_parity=True on the entire FLORES200 dataset (408 files, totalling 409836 lines), and compares the outputs to those of the original Perl script. With the default arguments, the output of 188 files differs from the Perl script's output. With the newly added argument, there are no differences.
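The per-file comparison described here could be sketched roughly as follows. This is not the script used for the PR: the function `differing_files` and the in-memory toy data are hypothetical, and a real run would read the normalized FLORES200 files from disk instead.

```python
def differing_files(python_outputs, perl_outputs):
    """Return the sorted names of files whose normalized lines differ.

    Both arguments map a file name to the list of normalized lines produced
    for that file (by the Python normalizer and the Perl script respectively).
    """
    return sorted(
        name
        for name, perl_lines in perl_outputs.items()
        if python_outputs.get(name) != perl_lines
    )


# Toy stand-ins for real normalizer output (file names are made up).
perl = {"a.devtest": ['say "hi"', "plain"], "b.devtest": ["plain"]}
python_default = {"a.devtest": ["say 'hi'", "plain"], "b.devtest": ["plain"]}
python_parity = {"a.devtest": ['say "hi"', "plain"], "b.devtest": ["plain"]}

mismatched_default = differing_files(python_default, perl)  # ['a.devtest']
mismatched_parity = differing_files(python_parity, perl)    # []
```

Counting files rather than lines mirrors how the results are reported above: a single differing line is enough to mark a file as different.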
Results produced from this code