
Flag --protected from original Moses tokenizer #35

Closed
noe opened this issue Mar 7, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@noe

noe commented Mar 7, 2019

The original Moses tokenizer supports the --protected flag. Its effect is to accept a file containing a list of regular expressions that should be protected from tokenization.

Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.

Is this functionality in the roadmap of sacremoses?
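(Editor's note: the mask/tokenize/unmask flow described above can be sketched with the standard library alone. This is a toy illustration, not sacremoses' or Moses' actual implementation; the function and placeholder names are hypothetical.)

```python
import re

def tokenize_with_protection(text, patterns, tokenize):
    """Mask regex matches, tokenize, then restore the originals."""
    masked = {}
    counter = 0
    for pat in patterns:
        def repl(m):
            # Replace each match with a unique placeholder token.
            nonlocal counter
            key = "PROTECTED%03d" % counter
            masked[key] = m.group(0)
            counter += 1
            return key
        text = re.sub(pat, repl, text, flags=re.IGNORECASE)
    tokens = tokenize(text)
    # Unmask: swap each placeholder token back for its original text.
    return [masked.get(tok, tok) for tok in tokens]

# Toy whitespace tokenizer standing in for the real Moses tokenizer.
print(tokenize_with_protection(
    "see https://example.com/a now", [r"https?://\S+"], str.split))
```

Because the URL is masked before tokenization, it survives as a single token.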

@alvations
Contributor

Hmmm, looks easy to implement but tricky to test.

Do you have an example protected_patterns_file and related text containing those patterns that can be tested? If you do, I could easily code it up and write the test =)

@alvations alvations added the enhancement New feature or request label Mar 8, 2019
@noe
Author

noe commented Mar 8, 2019

Sorry, I don't have an example of such files. My intention was to avoid URLs being tokenized and I was planning to use a regex like this one:

    import re
    regex = re.compile(
        r'(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
        r'(?::\d+)?'  # optional port
        r'(?:/\w+)*'
        r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

Then I found out that Moses tokenizer supported it and checked sacremoses for it because it's what we use.

P.S.: the regex above is used in our code but I don't know where I took it from; my past self left a comment saying it's loosely based on Django's URL validators, but my present self can't see an evident connection.
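(Editor's note: for illustration, the pattern above matches typical http/ftp URLs, with or without a port and path. The sample strings below are hypothetical.)

```python
import re

# Same pattern as in the comment above.
url_re = re.compile(
    r'(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
    r'(?::\d+)?'  # optional port
    r'(?:/\w+)*'
    r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

for s in ["visit https://example.com/path today",
          "ftp://files.example.org:21/pub/readme.txt"]:
    print(url_re.search(s).group(0))
```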

alvations added a commit that referenced this issue Apr 12, 2019
@alvations
Contributor

alvations commented Apr 12, 2019

@noe Django's URL validators are a little too heavy to incorporate here.

I've incorporated the protected_patterns feature from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/basic-protected-patterns and added a unit test against the pattern you listed in the previous comment; see #46


On CLI:

 $ pip install -U "sacremoses>=0.0.19"

 $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns

 $ sacremoses tokenize -j 4 -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]

In Python:

from sacremoses import MosesTokenizer

moses = MosesTokenizer()
text = "this is a webpage https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl that kicks ass"
expected_tokens = ['this', 'is', 'a', 'webpage',
                   'https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl',
                   'that', 'kicks', 'ass']
assert moses.tokenize(text, protected_patterns=moses.BASIC_PROTECTED_PATTERNS) == expected_tokens

# Testing against pattern from https://github.com/alvations/sacremoses/issues/35
noe_patterns = [
    r'(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
    r'(?::\d+)?'  # optional port
    r'(?:/\w+)*'
    r'(?:(?:\.[a-z]+)|/?)',
]
assert moses.tokenize(text, protected_patterns=noe_patterns) == expected_tokens
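(Editor's note: the CLI's -p flag reads patterns from a file, one regex per line, as in the basic-protected-patterns file linked above. In Python you could load such a file yourself; load_protected_patterns is a hypothetical helper, not part of sacremoses' API.)

```python
import os
import tempfile

def load_protected_patterns(path):
    # One regex per line; blank lines are skipped
    # (same shape as Moses' basic-protected-patterns file).
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway two-pattern file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("<\\/?\\S+\\/?>\n\nhttps?:\\/\\/\\S+\n")
    path = f.name

patterns = load_protected_patterns(path)
os.remove(path)
print(patterns)
```

The resulting list can then be passed as the protected_patterns argument shown above.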

@alvations
Contributor

Added feature. Thanks again @noe ! c.f. #46

@ganeshvictory

Hi @alvations, the code snippet above only covers the case where there is text before and after the URL, and it only accepts URLs in a specific format. My question is: is there a way to handle any condition, e.g. not tokenizing the URL even when the text consists of only a URL, or when there is no text before or after it?

Thanks in advance!
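(Editor's note: protection is driven by regex search, which fires regardless of what surrounds the match, so an input that is only a URL should still be protected. A minimal stdlib check with a simplified pattern, not sacremoses' own, and hypothetical URL strings:)

```python
import re

url = re.compile(r"(?:http|ftp)s?://\S+", re.IGNORECASE)

samples = [
    "https://example.com/page",        # the whole input is a URL
    "see https://example.com/page",    # text before
    "https://example.com/page works",  # text after
]
# The same match is found in all three positions.
print([url.search(s).group(0) for s in samples])
```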
