Flag --protected from original Moses tokenizer #35
Comments
Hmmm, looks easy to implement but tricky to test. Do you have any example
Sorry, I don't have an example of such files. My intention was to avoid URLs being tokenized and I was planning to use a regex like this one:
Then I found out that the Moses tokenizer supports it, and checked sacremoses for it because that's what we use. P.S.: the regex above is used in our code but I don't know where I took it from; my past self wrote in a comment that it's loosely based on Django's URL validators, but my present self can't see an evident connection with them.
@noe Django's URL validators are a little too heavy to incorporate here. I've tried incorporating the On CLI:
In Python:
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
text = "this is a webpage https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl that kicks ass"
expected_tokens = ['this', 'is', 'a', 'webpage',
                   'https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl',
                   'that', 'kicks', 'ass']
assert moses.tokenize(text, protected_patterns=moses.BASIC_PROTECTED_PATTERNS) == expected_tokens

# Testing against pattern from https://github.com/alvations/sacremoses/issues/35
# The adjacent raw strings are concatenated, so the list holds a single pattern.
noe_patterns = [r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
                r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
                r'(?::\d+)?'  # optional port
                r'(?:/\w+)*'
                r'(?:(?:\.[a-z]+)|/?)']
assert moses.tokenize(text, protected_patterns=noe_patterns) == expected_tokens
Hi @alvations, the snippet above only covers the case where there is text before and after the URL, and it only picks up URLs in one specific format. Is there a way to handle the general case, so that the URL is left untokenized even when the text consists of nothing but the URL, or when there is no text before or after it? Thanks in advance!
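A quick way to see how a pattern behaves on URL-only input is to run it through Python's `re` module directly, outside the tokenizer. The sketch below is only an illustration: `SIMPLE_URL` is a hypothetical, deliberately loose pattern (it is not one shipped with sacremoses), and `find_protected` is a helper name made up for this example.

```python
import re

# Hypothetical, deliberately loose pattern: a scheme followed by any run of
# non-whitespace characters. This is an assumption, not part of sacremoses.
SIMPLE_URL = r"(?:http|ftp)s?://\S+"


def find_protected(text, patterns):
    """Return every substring of `text` matched by any of `patterns`."""
    return [
        match.group()
        for pattern in patterns
        for match in re.finditer(pattern, text, re.IGNORECASE)
    ]


url = "https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl"

# The search is unanchored, so the URL is found with or without neighbours.
print(find_protected(url, [SIMPLE_URL]))
print(find_protected("see " + url + " for details", [SIMPLE_URL]))
```

The trade-off of a pattern this loose is that `\S+` also swallows punctuation glued to the URL, such as a trailing full stop or closing bracket, so a stricter pattern may be preferable for running text.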
The original Moses tokenizer supports the --protected flag. Its effect is to accept a file with a list of regular expressions that should be protected from tokenization. Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.
Is this functionality in the roadmap of sacremoses?
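For reference, the mask, tokenize, unmask idea can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the Moses or sacremoses implementation: `naive_tokenize` is a stand-in tokenizer, `tokenize_protected` and the `PROTECTEDSPAN...` placeholder format are made up for this example, and the placeholder is assumed never to occur in real input.

```python
import re


def naive_tokenize(text):
    """Stand-in tokenizer: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_protected(text, protected_patterns, tokenize=naive_tokenize):
    """Mask matches of the protected patterns, tokenize, then unmask."""
    if not protected_patterns:
        return tokenize(text)

    # Try all patterns in a single pass so that matches cannot overlap.
    combined = re.compile("|".join("(?:{})".format(p) for p in protected_patterns))
    saved = {}  # placeholder -> original substring

    def mask(match):
        if not match.group():  # ignore empty matches
            return ""
        # Hypothetical placeholder made only of word characters, so the
        # stand-in tokenizer keeps it in one piece.
        key = "PROTECTEDSPAN{}X".format(len(saved))
        saved[key] = match.group()
        # Padding with spaces is a choice of this sketch: it makes the
        # protected span a token of its own.
        return " {} ".format(key)

    masked = combined.sub(mask, text)
    # Unmask: swap each placeholder token back for the text it stands for.
    return [saved.get(token, token) for token in tokenize(masked)]


text = "visit https://example.com/a-b?c=d, then reply"
print(naive_tokenize(text))                            # URL is split apart
print(tokenize_protected(text, [r"https?://[^\s,]+"]))  # URL stays whole
```

In the original tokenizer the patterns would come from the file passed to `--protected`, one regex per line; here they are passed as a list to keep the sketch self-contained.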