
Issues with clean_up_tokenization() function? #16

Closed
proycon opened this issue May 5, 2020 · 2 comments


proycon commented May 5, 2020

I have some questions about the clean_up_tokenization() function; it looks like what it basically does is strip some whitespace? I agree that the tokens themselves should not contain leading or trailing whitespace, but I think this could be implemented more efficiently and more generically (i.e. language-independently).

https://github.com/guillaume-be/rust-tokenizers/blob/master/main/src/preprocessing/tokenizer/base_tokenizer.rs#L130

.replace(" do not", " don't")

I don't like this part because it changes the actual text and doesn't just concern whitespace.

@guillaume-be
Owner

Hello,

The clean_up_tokenization function indeed cleans up tokenization artifacts. This includes the fact that in English most punctuation is followed by a space but not preceded by one. It also rebuilds composite English forms such as I've.

This is indeed very English-specific. This library was designed as a port of the Transformers library and is expected to expose a very similar behaviour and API to the original Python bindings.

The clean_up_tokenization mirrors the behaviour of the original library: https://github.com/huggingface/transformers/blob/79b1c6966b2f0d63269eacbe87fade530ee4f05c/src/transformers/tokenization_utils.py#L2183

This tokenization clean-up can be turned off when decoding, and replaced by a custom method from the user.
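To make the behaviour concrete, here is a minimal sketch of what such a clean-up pass looks like, in the spirit of the Transformers function linked above. This is a hypothetical standalone version assuming a plain string in and out; it is not the library's actual implementation, and a user-supplied custom method would take the same shape.

```rust
/// Hypothetical sketch of a Transformers-style tokenization clean-up.
/// Removes the space inserted before punctuation during tokenization and
/// re-attaches split English contraction suffixes (e.g. " n't" -> "n't").
fn clean_up_tokenization(out_string: &str) -> String {
    out_string
        .replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
}

fn main() {
    // A decoded token sequence joined with single spaces.
    let decoded = "i ca n't believe it .";
    println!("{}", clean_up_tokenization(decoded)); // i can't believe it.
}
```

This is also why the operation is English-specific: the list of punctuation and contraction suffixes is hard-coded, which is the language-independence concern raised above.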


proycon commented May 5, 2020

I see; I didn't realize it was mirroring the original that closely, but that makes sense.
