Yes, this is expected. There are roughly 150K Unicode characters, so if your vocabulary size is smaller than that, some Unicode characters have to be split into multiple tokens.
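To make this concrete, here is a minimal sketch using only the Python standard library (it does not call tiktoken). It assumes a byte-level BPE tokenizer that works on UTF-8 bytes, which is how tiktoken's encodings operate. The helper names `utf8_bytes` and `split_is_decodable` are hypothetical, made up for this illustration.

```python
# Sketch: why one character can end up spread across several byte-level tokens.
# Assumption: the tokenizer merges UTF-8 bytes, not whole characters.

def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 byte values that encode a single character."""
    return list(ch.encode("utf-8"))


def split_is_decodable(ch: str, cut: int) -> tuple[bool, bool]:
    """Split the character's UTF-8 bytes at `cut` and report whether
    each piece is valid UTF-8 on its own."""
    raw = ch.encode("utf-8")
    results = []
    for piece in (raw[:cut], raw[cut:]):
        try:
            piece.decode("utf-8")
            results.append(True)
        except UnicodeDecodeError:
            results.append(False)
    return (results[0], results[1])


if __name__ == "__main__":
    # ASCII takes 1 byte, many CJK characters take 3, emoji usually take 4.
    for ch in ["a", "é", "中", "😀"]:
        print(repr(ch), utf8_bytes(ch))

    # If no learned merge covers all three bytes of "中", the tokenizer has to
    # emit it as two or more tokens, none of which is a full character alone.
    print(split_is_decodable("中", 2))
```

For example, "中" is the three bytes E4 B8 AD in UTF-8. If the vocabulary contains a merge for only the first two bytes, the character comes out as two tokens, and decoding either token by itself does not yield valid text. Decoding the full token sequence together still round-trips correctly.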
I found that tiktoken splits a Chinese character into two tokens. Is this normal?