Parity with perl normalize. #146

NIXBLACK11 · 2023-10-20T16:19:14Z

In the modified Python script, several changes have been made to improve its functionality and maintain parity with the original Perl script.

Parity with perl normalize

The original perl script had regex (s/’/"/g) at line 47 this was ported to ("’", r"'") at line 43 which was not correct,so I updated it to ("’", r'"').
The original perl script had regex (s/ « / "/g) at line 52 this was ported to ("\u00A0«\u00A0", r'"') at line 50 which was not correct,so I updated it to ("\u00A0«\u00A0", r' "').
The original perl script had regex (s/ » /" /g) at line 55 this was ported to ("\u00A0»\u00A0", r'"') at line 53 ,so I updated it to ("\u00A0»\u00A0", r'" ').
A new argument named perl_parity has been introduced in the constructor. This argument is optional and defaults to False. It doesn't affect the script's behavior when using the default arguments. However, if set to True, it enforces certain changes to align the Python script more closely with the original Perl script.

Test plan

The test plan below runs the normalize function with both the default and newly added argument perl_parity=True on all of the FLORES200 dataset (408 files, totalling 409836 lines), and compares the outputs to that of the original perl script. With the default arguments, there are 188 files which are different to the perl script output. With the newly added argument, there no differences.

from pathlib import Path
from sacremoses import MosesPunctNormalizer
import tempfile
from subprocess import check_output
import asyncio
import typing as tp
import numpy as np

async def check_for_differences(data, norm_default, norm_updated) -> tp.Tuple[bool, bool]:
    with tempfile.NamedTemporaryFile() as tmp_file:
        with open(tmp_file.name, "w") as f:
            f.write(data)
        perl_output = (
            check_output(
                f"cat {tmp_file.name} | perl normalize-punctuation.perl", shell=True
            ).decode("utf-8").strip()
        )
        default_diff = norm_default.normalize(data).strip() != perl_output
        updated_diff = norm_updated.normalize(data).strip() != perl_output
    return (default_diff, updated_diff)

async def main():
    flores_dir = Path("/path/to/flores200_dataset")
    all_files = [
        file.read_text().strip()
        for ext in ["dev", "devtest"]
        for file in flores_dir.glob(f"*/*.{ext}")
    ]
    norm_default = MosesPunctNormalizer()
    norm_updated = MosesPunctNormalizer(perl_parity=True)
    differences = await asyncio.gather(
        *[check_for_differences(file, norm_default, norm_updated) for file in all_files]
    )
    default_differences, updated_differences = np.asarray(differences).sum(axis=0)
    print(f"total files: {len(differences)}")
    print(f"differences with default args: {default_differences}")
    print(f"differences with 'perl_parity=True': {updated_differences}")

asyncio.run(main())

Results produced from this code

total files: 408
differences with default args: 188
differences with 'perl_parity=True': 0

sacremoses/normalize.py

heffernankevin · 2023-10-23T10:58:15Z

@NIXBLACK11 looks great to me! @alvations / @jelmervdl would you mind kindly reviewing? For context, we have been using the moses perl script normalize-punctuation.perl to do preprocessing for some time. When switching to the sacremoses MosesPunctNormalizer we noticed we're not getting exactly the same texts as the original script, which is causing us some performance issues. @NIXBLACK11's PR resolves this to ensure we have 100% parity with the perl script (tested on approx 0.5 Million sentences).

jelmervdl · 2023-10-24T14:20:06Z

Ohhh thanks! I wanted to do this for a while and release it as a 2.0 breaking change, but happy to accept it behind a flag until then!

You said differences with default args: 188. Can I have one of these examples so I can add a test for it? Ideally one that triggers all three of these rules:

if perl_parity:
    self.NORMALIZE_UNICODE[11] = ("’", r'"')
    self.FRENCH_QUOTES[0] = ("\u00A0«\u00A0", r' "')
    self.FRENCH_QUOTES[3] = ("\u00A0»\u00A0", r'" ')

… since I'm a bit scared that any update or refactor of those regex arrays will get these specific indexes out of sync.

jelmervdl · 2023-10-24T14:24:24Z

Just to confirm, this is the perl script you're comparing against? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

Or is it @kpu's version: https://github.com/kpu/preprocess/blob/master/moses/tokenizer/normalize-punctuation.perl

There's only a small difference, but hey:

45,46c45,46
<     s/‘/\'/g;
<     s/‚/\'/g;
---
>     s/‘/\"/g;
>     s/‚/\"/g;

heffernankevin · 2023-10-24T14:30:06Z

Just to confirm, this is the perl script you're comparing against? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

Or is it @kpu's version: https://github.com/kpu/preprocess/blob/master/moses/tokenizer/normalize-punctuation.perl

There's only a small difference, but hey:
45,46c45,46
<     s/‘/\'/g;
<     s/‚/\'/g;
---
>     s/‘/\"/g;
>     s/‚/\"/g;

@jelmervdl it's compared to the "official" version in your first link: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

heffernankevin · 2023-10-24T15:11:34Z

Ohhh thanks! I wanted to do this for a while and release it as a 2.0 breaking change, but happy to accept it behind a flag until then!

You said differences with default args: 188. Can I have one of these examples so I can add a test for it? Ideally one that triggers all three of these rules:
if perl_parity:
    self.NORMALIZE_UNICODE[11] = ("’", r'"')
    self.FRENCH_QUOTES[0] = ("\u00A0«\u00A0", r' "')
    self.FRENCH_QUOTES[3] = ("\u00A0»\u00A0", r'" ')
… since I'm a bit scared that any update or refactor of those regex arrays will get these specific indexes out of sync.

Hi @jelmervdl! Sure, here are three example sentences from FLORES200 which contain at least one example of each of the above regexes (in order):

    example_affected_sentences = [
        "ভার্জিন কেবল নর্দান রকের ‘গুড ব্যাংক’ কিনেছে, সম্পদ পরিচালন সংস্থা কিনেনি।", 
        "Ci 18i fan ci weeru Màrs, 1965 da fa def jëf bu njëkk genn ci woto bi (EVA), wala\xa0«\xa0dox ci jawwu ji», des moom kese ci biti fafalndawu asamaan lu toll ci fukki simili ak ñaar.", 
        "Les «\xa0wagonways\xa0»\xa0étaient construits en Angleterre dès le XVIème siècle."
    ]

avidale · 2023-10-25T08:20:55Z

sacremoses/normalize.py

        """
+
+        if perl_parity:


We should probably add unit tests into https://github.com/hplt-project/sacremoses/blob/master/sacremoses/test/test_normalizer.py
that compare the normalizer outputs to expected values with and without this flag.

In a comment to this PR, Kevin provided some examples of sentences that trigger each of the regexes, but for the test we can probably use even shorter phrases, to keep the test readable.

Thanks @avidale for your feedback and suggestions. I've added unit tests to the test_normalizer.py file, as per your recommendation here.
@jelmervdl , could you please take a look at the updates and provide your feedback?

jelmervdl · 2023-10-26T15:45:10Z

I made the non-breaking spaces in the test a bit more explicit.

But looks good to merge to me now.

NIXBLACK11 and others added 10 commits October 15, 2023 17:53

Resolved parity issue with normalize-punctuation.perl

1384222

to conert all doble quote before s to 's

ac07ddb

Fixes most of the differences only 4 left in dev and 8 left in devtest

a020b76

decreases diff 1 for dev and one for devtest

876635d

changed quotes

9411cca

option added to switch between parity and previous regex

6d28db5

changes according to comments

fa2d4b0

changes according to comments

1f74795

changes made to indentation

0b55f13

Update normalize.py

4903ad8

heffernankevin reviewed Oct 20, 2023

View reviewed changes

sacremoses/normalize.py Show resolved Hide resolved

NIXBLACK11 added 2 commits October 20, 2023 22:00

Added comments

06d181e

Added comments

a57726d

NIXBLACK11 marked this pull request as ready for review October 23, 2023 09:00

NIXBLACK11 requested a review from heffernankevin October 23, 2023 09:07

NIXBLACK11 marked this pull request as draft October 23, 2023 09:12

NIXBLACK11 marked this pull request as ready for review October 23, 2023 09:14

avidale mentioned this pull request Oct 23, 2023

MLH fellowship contribution: adding the laser_encoders module facebookresearch/LASER#249

Merged

10 tasks

NIXBLACK11 marked this pull request as draft October 23, 2023 13:18

NIXBLACK11 marked this pull request as ready for review October 23, 2023 13:22

avidale reviewed Oct 25, 2023

View reviewed changes

NIXBLACK11 and others added 2 commits October 26, 2023 17:01

Update test_normalizer.py to include test for perl parity

c57e777

Make non-breaking spaces that are in there more explicit

4164186

jelmervdl merged commit 8d32b0e into hplt-project:master Oct 26, 2023
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parity with perl normalize. #146

Parity with perl normalize. #146

NIXBLACK11 commented Oct 20, 2023 •

edited

Loading

heffernankevin commented Oct 23, 2023 •

edited

Loading

jelmervdl commented Oct 24, 2023

jelmervdl commented Oct 24, 2023

heffernankevin commented Oct 24, 2023

heffernankevin commented Oct 24, 2023 •

edited

Loading

avidale Oct 25, 2023

NIXBLACK11 Oct 26, 2023

jelmervdl commented Oct 26, 2023

Parity with perl normalize. #146

Parity with perl normalize. #146

Conversation

NIXBLACK11 commented Oct 20, 2023 • edited Loading