Tokenizers fall back to BPE if unregistered #231
Conversation
Now that the library has been tested more extensively, it's not so valuable to fail on tokenizers we haven't encountered before. We still need to check what happens with tokenizers that use a different model, or with those that haven't been ported from their original implementations. An alternative would be to expose a registration mechanism, as discussed in #63.
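For context, a registration mechanism like the one discussed in #63 could look roughly like the following self-contained sketch. All names here (`TokenizerRegistry`, `TokenizingModel`, the placeholder `BPETokenizer`) are illustrative, not the library's actual API; the point is only the shape of the API: register known tokenizer classes, and fall back to BPE with a warning for anything unregistered.

```swift
import Foundation

// Hypothetical protocol standing in for the library's tokenizer abstraction.
protocol TokenizingModel {
    func tokenize(text: String) -> [String]
}

// Placeholder fallback: real BPE would merge byte pairs according to learned ranks.
struct BPETokenizer: TokenizingModel {
    func tokenize(text: String) -> [String] {
        text.split(separator: " ").map(String.init)
    }
}

// Hypothetical registry: callers register factories for tokenizer class names.
enum TokenizerRegistry {
    private static var factories: [String: () -> TokenizingModel] = [:]

    static func register(_ name: String, factory: @escaping () -> TokenizingModel) {
        factories[name] = factory
    }

    /// Falls back to BPE (with a warning) when `name` was never registered.
    static func resolve(_ name: String) -> TokenizingModel {
        if let factory = factories[name] { return factory() }
        print("warning: tokenizer class \(name) not registered; falling back to BPE")
        return BPETokenizer()
    }
}
```

With this shape, downstream projects (such as the mlx-swift-examples registry mentioned later in this thread) could plug in their own tokenizers instead of maintaining a parallel mechanism.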
cc @FL33TW00D @mattt @xenova for opinions on whether to move forward with this. If so, we need to ensure we issue helpful error messages if the tokenizer ends up not working.
I think if we pair this with a warning it's much better behaviour!
I agree that falling back to BPE is a reasonable default, and an improvement over the current behavior of throwing an error 👍 Runtime warnings are an imperfect solution, but probably the best we have at the moment. In general, I'm not a fan of the […]
Going for a warning for now. How do you feel about using a […]
This is what strict mode (enabled by default) looks like. cc @davidkoski as we briefly discussed this long ago. You can opt out of strict mode when you load the tokenizer, as […]

My main question still is whether we go this way (strict mode), or just default everything to BPE and simply print a warning.
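To make the two options concrete, here is a minimal, self-contained sketch of the trade-off: strict mode throws on an unrecognized tokenizer class, while permissive mode prints a warning and falls back to BPE. `tokenizerClass(for:strict:)`, `TokenizerError`, and the hard-coded list of known classes are hypothetical stand-ins, not the library's implementation.

```swift
enum TokenizerError: Error {
    case unsupportedTokenizer(String)
}

// Returns the tokenizer class to instantiate for a given config name.
// strict == true  -> throw on unknown names (current behavior)
// strict == false -> warn and fall back to BPE (proposed behavior)
func tokenizerClass(for name: String, strict: Bool) throws -> String {
    let known = ["LlamaTokenizer", "GPT2Tokenizer"]  // stand-in for the real registry
    if known.contains(name) { return name }
    if strict {
        throw TokenizerError.unsupportedTokenizer(name)
    }
    print("warning: \(name) is not registered; falling back to BPETokenizer")
    return "BPETokenizer"
}
```

The two proposals in the thread differ only in the default value of `strict`: strict-by-default keeps failures loud, while fallback-by-default trades that for convenience plus a warning.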
I think the mlx-swift-examples registration mechanism could gladly be retired if it wasn't needed. Anyway, I don't normally like silent failures (permissive mode), but in this case the BPE tokenizer seems to be what most models end up using anyway, so I think it is a reasonable default. I wonder if a parameter in the call to indicate that you want strict vs fallback behavior would be appropriate?
Yes, that's what I finally ended up doing here, sorry for being unclear. This test shows that by default we still throw, but you can pass `strict: false` to opt in to the fallback:

```swift
func testNllbTokenizer() async throws {
    do {
        _ = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M")
        XCTFail("Expected AutoTokenizer.from to throw in strict mode")
    } catch {
        // Expected to throw in default (strict) mode
    }

    // With strict: false, loading proceeds using the fallback
    guard let tokenizer = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M", strict: false) as? PreTrainedTokenizer else {
        XCTFail()
        return
    }

    let ids = tokenizer.encode(text: "Why did the chicken cross the road?")
    let expected = [256047, 24185, 4077, 349, 1001, 22690, 83580, 349, 82801, 248130, 2]
    XCTAssertEqual(ids, expected)
}
```

Would this work for you?
For sure!
Ok, merging then. Thanks all for the reviews and opinions! |