
Issues with clean_up_tokenization() function? #16

Closed
proycon opened this issue May 5, 2020 · 2 comments


proycon commented May 5, 2020

I have some questions about the clean_up_tokenization() function; it looks like what it basically does is strip some whitespace? I agree that the tokens themselves should not contain leading or trailing whitespace, but I think this could be implemented more efficiently and more generically (i.e. language-independently).

https://github.com/guillaume-be/rust-tokenizers/blob/master/main/src/preprocessing/tokenizer/base_tokenizer.rs#L130

.replace(" do not", " don't")

I don't like this part because it changes the actual text and doesn't just concern whitespace.

@guillaume-be
Owner

Hello,

The clean_up_tokenization function indeed cleans up tokenization artifacts. This includes the fact that in English most punctuation is followed by a space but not preceded by one. It also rebuilds composite English forms such as I've.

This is indeed very English-specific. This library was designed as a port of the Transformers library and is expected to expose a very similar behaviour and API to the original Python bindings.

The clean_up_tokenization mirrors the behaviour of the original library: https://github.com/huggingface/transformers/blob/79b1c6966b2f0d63269eacbe87fade530ee4f05c/src/transformers/tokenization_utils.py#L2183

This tokenization clean-up can be turned off when decoding, and replaced by a custom method from the user.
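To make the behaviour concrete, here is a minimal sketch of what such a clean-up pass looks like, in the spirit of the Transformers function linked above. This is a hypothetical standalone version assuming a plain string in and out; it is not the library's actual implementation, and a user-supplied custom method would take the same shape.

```rust
/// Hypothetical sketch of a Transformers-style tokenization clean-up.
/// Removes the space inserted before punctuation during tokenization and
/// re-attaches split English contraction suffixes (e.g. " n't" -> "n't").
fn clean_up_tokenization(out_string: &str) -> String {
    out_string
        .replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
}

fn main() {
    // A decoded token sequence joined with single spaces.
    let decoded = "i ca n't believe it .";
    println!("{}", clean_up_tokenization(decoded)); // i can't believe it.
}
```

This is also why the operation is English-specific: the list of punctuation and contraction suffixes is hard-coded, which is the language-independence concern raised above.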


proycon commented May 5, 2020

I see; I didn't realize it was mirroring the original that closely, but that makes sense.
