I have some issues regarding the `clean_up_tokenization()` function: it looks like all it basically does is strip some whitespace? I agree that the tokens themselves should not contain leading or trailing whitespace, but I think it could be implemented more efficiently and generically (i.e. language-independently).
The `clean_up_tokenization` function indeed cleans up tokenization artifacts. This includes the fact that in English most punctuation marks have a trailing space but no leading space. It also rebuilds composite English forms such as `I've`.
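For reference, the corresponding function in the Python Transformers library is essentially a chain of hard-coded string replacements. A minimal Rust sketch of that same logic (a hypothetical standalone function for illustration, not the library's actual API) might look like:

```rust
// Sketch of the kind of clean-up that clean_up_tokenization performs:
// remove the space before punctuation and re-attach English contractions.
// The replacement list mirrors the Python Transformers implementation.
fn clean_up_tokenization(out_string: &str) -> String {
    out_string
        .replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
}

fn main() {
    // Example: a decoded token sequence with detached punctuation
    // and contractions is stitched back into natural English text.
    let decoded = "I 've seen it , have n't I ?";
    println!("{}", clean_up_tokenization(decoded));
}
```

As the hard-coded contraction list makes clear, this goes beyond whitespace normalization and is specific to English orthography.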
This is indeed very English-specific. This library was designed as a port of the Transformers library and is expected to have a very similar behaviour and API to the Python bindings.
> I have some issues regarding the `clean_up_tokenization()` function, it looks like what it basically does is strip some whitespace? I agree that the tokens itself should not contain leading or trailing whitespace. I think it could be implemented more efficiently and generically (= language independent)?

I don't like this part because it changes the actual text and doesn't just concern whitespace.