
Web and basic protected patterns by default #138

Open: samirsalman wants to merge 1 commit into base: master

By default, the library does not apply any protected patterns, such as WEB_PROTECTED_PATTERNS, which contains patterns for URLs and email addresses, among others.

# Example
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses output
["http", ":",  "/", "/", "www.someurl.com"]
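To illustrate why protected patterns change the result, here is a minimal sketch of the general technique: spans matching a protected pattern are swapped for placeholders before tokenization and restored afterwards. This is not sacremoses' implementation. `URL_PATTERN`, `naive_tokenize`, and `tokenize_protected` are hypothetical names, and the naive tokenizer splits more aggressively than sacremoses does.

```python
import re

# Hypothetical stand-in for a URL pattern; not copied from sacremoses.
URL_PATTERN = r"(?:https?|ftp)://\S+"


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_protected(text, protected_patterns):
    """Tokenize text while keeping spans that match protected patterns whole."""
    stashed = {}

    def stash(match):
        # The placeholder is purely alphanumeric, so naive_tokenize keeps it
        # as one token. Assumption: the input never contains this literal.
        key = f"PROTECTED{len(stashed)}X"
        stashed[key] = match.group(0)
        return f" {key} "

    for pattern in protected_patterns:
        text = re.sub(pattern, stash, text)

    # Tokenize the masked text, then put the original spans back.
    return [stashed.get(token, token) for token in naive_tokenize(text)]


print(naive_tokenize("http://www.someurl.com"))
print(tokenize_protected("http://www.someurl.com", [URL_PATTERN]))
```

With an empty pattern list, `tokenize_protected` behaves exactly like `naive_tokenize`, which mirrors the situation described above where no patterns are applied by default.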

I suggest using WEB_PROTECTED_PATTERNS and BASIC_PATTERNS by default when the user does not specify protected patterns.
This lets users avoid URL tokenization issues when calling the tokenize function with default arguments. Users can still pass different protected patterns, or disable protection entirely by setting the protected_patterns parameter to an empty list:

tokenizer.tokenize("http://www.someurl.com", protected_patterns=[])
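The distinction between "not specified" and "empty list" could be handled with a `None` sentinel, roughly as sketched below. This is an assumption about how such a default might be resolved, not the code in this pull request. `DEFAULT_WEB_PATTERNS`, `DEFAULT_BASIC_PATTERNS`, and `resolve_protected_patterns` are hypothetical names, and the regexes are simplified placeholders rather than the library's actual lists.

```python
# Hypothetical, simplified stand-ins for the library's pattern lists.
DEFAULT_WEB_PATTERNS = [
    r"(?:https?|ftp)://\S+",          # URLs
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",  # email addresses
]
DEFAULT_BASIC_PATTERNS = [
    r"</?\w+/?>",                     # simple XML-like tags
]


def resolve_protected_patterns(protected_patterns=None):
    """Return the patterns to apply for a given protected_patterns argument."""
    # Compare against None rather than testing truthiness, so that an
    # explicit empty list still means "protect nothing".
    if protected_patterns is None:
        return DEFAULT_WEB_PATTERNS + DEFAULT_BASIC_PATTERNS
    return list(protected_patterns)


print(resolve_protected_patterns())            # defaults applied
print(resolve_protected_patterns([]))          # protection disabled
print(resolve_protected_patterns([r"#\w+"]))   # caller-supplied patterns only
```

A truthiness check such as `if not protected_patterns` would treat `[]` the same as `None` and silently re-enable the defaults, so the explicit `is None` comparison matters here.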
