
Flag --protected from original Moses tokenizer #35

Closed
noe opened this issue Mar 7, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@noe

noe commented Mar 7, 2019

The original Moses tokenizer supports the --protected flag. Its effect is to accept a file containing a list of regular expressions that should be protected from tokenization.

Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.

Is this functionality in the roadmap of sacremoses?
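(Editor's note: the mask/tokenize/unmask flow described above can be sketched with the standard library alone. This is a toy illustration, not sacremoses' or Moses' actual implementation; the function and placeholder names are hypothetical.)

```python
import re

def tokenize_with_protection(text, patterns, tokenize):
    """Mask regex matches, tokenize, then restore the originals."""
    masked = {}
    counter = 0
    for pat in patterns:
        def repl(m):
            # Replace each match with a unique placeholder token.
            nonlocal counter
            key = "PROTECTED%03d" % counter
            masked[key] = m.group(0)
            counter += 1
            return key
        text = re.sub(pat, repl, text, flags=re.IGNORECASE)
    tokens = tokenize(text)
    # Unmask: swap each placeholder token back for its original text.
    return [masked.get(tok, tok) for tok in tokens]

# Toy whitespace tokenizer standing in for the real Moses tokenizer.
print(tokenize_with_protection(
    "see https://example.com/a now", [r"https?://\S+"], str.split))
```

Because the URL is masked before tokenization, it survives as a single token.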

@alvations
Contributor

Hmmm, looks easy to implement but tricky to test.

Do you have an example protected_patterns_file and related text containing those patterns that can be tested? If you do, I could easily code it up and write the test =)

@alvations alvations added the enhancement New feature or request label Mar 8, 2019
@noe
Author

noe commented Mar 8, 2019

Sorry, I don't have an example of such files. My intention was to avoid URLs being tokenized and I was planning to use a regex like this one:

    import re
    regex = re.compile(
        r'(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
        r'(?::\d+)?'  # optional port
        r'(?:/\w+)*'
        r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

Then I found out that Moses tokenizer supported it and checked sacremoses for it because it's what we use.

P.S.: the regex above is used in our code but I don't know where I took it from; my past self left a comment saying it's loosely based on Django's URL validators, but my present self can't see an evident connection.
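(Editor's note: for illustration, the pattern above matches typical http/ftp URLs, with or without a port and path. The sample strings below are hypothetical.)

```python
import re

# Same pattern as in the comment above.
url_re = re.compile(
    r'(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
    r'(?::\d+)?'  # optional port
    r'(?:/\w+)*'
    r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

for s in ["visit https://example.com/path today",
          "ftp://files.example.org:21/pub/readme.txt"]:
    print(url_re.search(s).group(0))
```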

alvations added a commit that referenced this issue Apr 12, 2019
@alvations
Contributor

alvations commented Apr 12, 2019

@noe Django's URL validators are a little too heavy to incorporate here.

I've incorporated the protected_patterns feature from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/basic-protected-patterns and added a unit test against the pattern you listed in the previous comment; see #46


On CLI:

 $ pip install -U "sacremoses>=0.0.19"

 $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns

 $ sacremoses tokenize -j 4 -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]

In Python:

from sacremoses import MosesTokenizer

moses = MosesTokenizer()
text = "this is a webpage https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl that kicks ass"
expected_tokens = ['this', 'is', 'a', 'webpage',
                   'https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl',
                   'that', 'kicks', 'ass']
assert moses.tokenize(text, protected_patterns=moses.BASIC_PROTECTED_PATTERNS) == expected_tokens

# Testing against pattern from https://github.com/alvations/sacremoses/issues/35
noe_patterns = [
    r'(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
    r'(?::\d+)?'  # optional port
    r'(?:/\w+)*'
    r'(?:(?:\.[a-z]+)|/?)',
]
assert moses.tokenize(text, protected_patterns=noe_patterns) == expected_tokens
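(Editor's note: the CLI's -p flag reads patterns from a file, one regex per line, as in the basic-protected-patterns file linked above. In Python you could load such a file yourself; load_protected_patterns is a hypothetical helper, not part of sacremoses' API.)

```python
import os
import tempfile

def load_protected_patterns(path):
    # One regex per line; blank lines are skipped
    # (same shape as Moses' basic-protected-patterns file).
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway two-pattern file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("<\\/?\\S+\\/?>\n\nhttps?:\\/\\/\\S+\n")
    path = f.name

patterns = load_protected_patterns(path)
os.remove(path)
print(patterns)
```

The resulting list can then be passed as the protected_patterns argument shown above.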

@alvations
Contributor

Added feature. Thanks again @noe ! c.f. #46

@ganeshvictory

Hi @alvations, the code snippet above only covers the case where there is text before and after the URL, and it only accepts URLs in a specific format. My question is: is there a way to handle any condition, e.g. not tokenizing the URL even when the text consists of only a URL, or when there is no text before or after it?

Thanks in advance!
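(Editor's note: protection is driven by regex search, which fires regardless of what surrounds the match, so an input that is only a URL should still be protected. A minimal stdlib check with a simplified pattern, not sacremoses' own, and hypothetical URL strings:)

```python
import re

url = re.compile(r"(?:http|ftp)s?://\S+", re.IGNORECASE)

samples = [
    "https://example.com/page",        # the whole input is a URL
    "see https://example.com/page",    # text before
    "https://example.com/page works",  # text after
]
# The same match is found in all three positions.
print([url.search(s).group(0) for s in samples])
```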
