Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parity with perl normalize. #146

Merged
merged 14 commits into from
Oct 26, 2023

Conversation

NIXBLACK11
Copy link
Contributor

@NIXBLACK11 NIXBLACK11 commented Oct 20, 2023

In the modified Python script, several changes have been made to improve its functionality and maintain parity with the original Perl script.

Parity with perl normalize

  • The original perl script had regex (s/’/"/g) at line 47 this was ported to ("’", r"'") at line 43 which was not correct,so I updated it to ("’", r'"').

  • The original perl script had regex (s/ « / "/g) at line 52 this was ported to ("\u00A0«\u00A0", r'"') at line 50 which was not correct,so I updated it to ("\u00A0«\u00A0", r' "').

  • The original perl script had regex (s/ » /" /g) at line 55 this was ported to ("\u00A0»\u00A0", r'"') at line 53 ,so I updated it to ("\u00A0»\u00A0", r'" ').

  • A new argument named perl_parity has been introduced in the constructor. This argument is optional and defaults to False. It doesn't affect the script's behavior when using the default arguments. However, if set to True, it enforces certain changes to align the Python script more closely with the original Perl script.

Test plan

The test plan below runs the normalize function with both the default and newly added argument perl_parity=True on all of the FLORES200 dataset (408 files, totalling 409836 lines), and compares the outputs to that of the original perl script. With the default arguments, there are 188 files which are different to the perl script output. With the newly added argument, there no differences.

from pathlib import Path
from sacremoses import MosesPunctNormalizer
import tempfile
from subprocess import check_output
import asyncio
import typing as tp
import numpy as np

async def check_for_differences(data, norm_default, norm_updated) -> tp.Tuple[bool, bool]:
    with tempfile.NamedTemporaryFile() as tmp_file:
        with open(tmp_file.name, "w") as f:
            f.write(data)
        perl_output = (
            check_output(
                f"cat {tmp_file.name} | perl normalize-punctuation.perl", shell=True
            ).decode("utf-8").strip()
        )
        default_diff = norm_default.normalize(data).strip() != perl_output
        updated_diff = norm_updated.normalize(data).strip() != perl_output
    return (default_diff, updated_diff)

async def main():
    flores_dir = Path("/path/to/flores200_dataset")
    all_files = [
        file.read_text().strip()
        for ext in ["dev", "devtest"]
        for file in flores_dir.glob(f"*/*.{ext}")
    ]
    norm_default = MosesPunctNormalizer()
    norm_updated = MosesPunctNormalizer(perl_parity=True)
    differences = await asyncio.gather(
        *[check_for_differences(file, norm_default, norm_updated) for file in all_files]
    )
    default_differences, updated_differences = np.asarray(differences).sum(axis=0)
    print(f"total files: {len(differences)}")
    print(f"differences with default args: {default_differences}")
    print(f"differences with 'perl_parity=True': {updated_differences}")

asyncio.run(main())

Results produced from this code

total files: 408
differences with default args: 188
differences with 'perl_parity=True': 0

@NIXBLACK11 NIXBLACK11 marked this pull request as ready for review October 23, 2023 09:00
@NIXBLACK11 NIXBLACK11 marked this pull request as draft October 23, 2023 09:12
@NIXBLACK11 NIXBLACK11 marked this pull request as ready for review October 23, 2023 09:14
@heffernankevin
Copy link

heffernankevin commented Oct 23, 2023

@NIXBLACK11 looks great to me! @alvations / @jelmervdl would you mind kindly reviewing? For context, we have been using the moses perl script normalize-punctuation.perl to do preprocessing for some time. When switching to the sacremoses MosesPunctNormalizer we noticed we're not getting exactly the same texts as the original script, which is causing us some performance issues. @NIXBLACK11's PR resolves this to ensure we have 100% parity with the perl script (tested on approx 0.5 Million sentences).

@NIXBLACK11 NIXBLACK11 marked this pull request as draft October 23, 2023 13:18
@NIXBLACK11 NIXBLACK11 marked this pull request as ready for review October 23, 2023 13:22
@jelmervdl
Copy link
Collaborator

Ohhh thanks! I wanted to do this for a while and release it as a 2.0 breaking change, but happy to accept it behind a flag until then!

You said differences with default args: 188. Can I have one of these examples so I can add a test for it? Ideally one that triggers all three of these rules:

if perl_parity:
    self.NORMALIZE_UNICODE[11] = ("’", r'"')
    self.FRENCH_QUOTES[0] = ("\u00A0«\u00A0", r' "')
    self.FRENCH_QUOTES[3] = ("\u00A0»\u00A0", r'" ')

… since I'm a bit scared that any update or refactor of those regex arrays will get these specific indexes out of sync.

@jelmervdl
Copy link
Collaborator

Just to confirm, this is the perl script you're comparing against? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

Or is it @kpu's version: https://github.com/kpu/preprocess/blob/master/moses/tokenizer/normalize-punctuation.perl

There's only a small difference, but hey:

45,46c45,46
<     s/‘/\'/g;
<     s/‚/\'/g;
---
>     s/‘/\"/g;
>     s/‚/\"/g;

@heffernankevin
Copy link

Just to confirm, this is the perl script you're comparing against? https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

Or is it @kpu's version: https://github.com/kpu/preprocess/blob/master/moses/tokenizer/normalize-punctuation.perl

There's only a small difference, but hey:

45,46c45,46
<     s/‘/\'/g;
<     s/‚/\'/g;
---
>     s/‘/\"/g;
>     s/‚/\"/g;

@jelmervdl it's compared to the "official" version in your first link: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl

@heffernankevin
Copy link

heffernankevin commented Oct 24, 2023

Ohhh thanks! I wanted to do this for a while and release it as a 2.0 breaking change, but happy to accept it behind a flag until then!

You said differences with default args: 188. Can I have one of these examples so I can add a test for it? Ideally one that triggers all three of these rules:

if perl_parity:
    self.NORMALIZE_UNICODE[11] = ("’", r'"')
    self.FRENCH_QUOTES[0] = ("\u00A0«\u00A0", r' "')
    self.FRENCH_QUOTES[3] = ("\u00A0»\u00A0", r'" ')

… since I'm a bit scared that any update or refactor of those regex arrays will get these specific indexes out of sync.

Hi @jelmervdl! Sure, here are three example sentences from FLORES200 which contain at least one example of each of the above regexes (in order):

    example_affected_sentences = [
        "ভার্জিন কেবল নর্দান রকের ‘গুড ব্যাংক’ কিনেছে, সম্পদ পরিচালন সংস্থা কিনেনি।", 
        "Ci 18i fan ci weeru Màrs, 1965 da fa def jëf bu njëkk genn ci woto bi (EVA), wala\xa0«\xa0dox ci jawwu ji», des moom kese ci biti fafalndawu asamaan lu toll ci fukki simili ak ñaar.", 
        "Les «\xa0wagonways\xa0»\xa0étaient construits en Angleterre dès le XVIème siècle."
    ]

"""

if perl_parity:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add unit tests into https://github.com/hplt-project/sacremoses/blob/master/sacremoses/test/test_normalizer.py
that compare the normalizer outputs to expected values with and without this flag.

In a comment to this PR, Kevin provided some examples of sentences that trigger each of the regexes, but for the test we can probably use even shorter phrases, to keep the test readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avidale for your feedback and suggestions. I've added unit tests to the test_normalizer.py file, as per your recommendation here.
@jelmervdl , could you please take a look at the updates and provide your feedback?

@jelmervdl
Copy link
Collaborator

I made the non-breaking spaces in the test a bit more explicit.

But looks good to merge to me now.

@jelmervdl jelmervdl merged commit 8d32b0e into hplt-project:master Oct 26, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants