
@pcuenca (Member) commented Sep 10, 2025

Proposal: now that the library has been tested more extensively, it's not so valuable to fail on tokenizer names we haven't encountered before.

We still need to check what happens with tokenizers that use a different model, or those that haven't been ported from their original implementations, to ensure the error message is helpful.

An alternative would be to expose a registration mechanism, as discussed in #63.
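
For illustration, the fallback could look something like the following minimal sketch. TokenizingModel, knownTokenizers, and BPETokenizer are simplified stand-ins here, not the library's actual types:

    // Minimal sketch of the proposed fallback. All names are simplified
    // stand-ins for illustration, not the swift-transformers API.
    protocol TokenizingModel {}
    struct BPETokenizer: TokenizingModel {}
    struct UnigramTokenizer: TokenizingModel {}

    // Hypothetical registry of tokenizer classes that have been ported.
    let knownTokenizers: [String: TokenizingModel.Type] = [
        "BPETokenizer": BPETokenizer.self,
        "UnigramTokenizer": UnigramTokenizer.self,
    ]

    // Before: throw on unknown names. After: warn and fall back to BPE.
    func tokenizerClass(for name: String) -> TokenizingModel.Type {
        if let cls = knownTokenizers[name] { return cls }
        print("Warning: tokenizer class \(name) is unknown; falling back to BPETokenizer")
        return BPETokenizer.self
    }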

@pcuenca (Member, Author) commented Sep 10, 2025

cc @FL33TW00D @mattt @xenova for opinions on whether to move forward with this. If so, we need to ensure we issue helpful error messages if the tokenizer ends up not working.

@FL33TW00D (Contributor):

I think if we pair this with a warning it's much better behaviour!

@mattt (Collaborator) commented Sep 11, 2025

I agree that falling back to BPE is a reasonable default, and an improvement over the current behavior of throwing an error 👍

Runtime warnings are an imperfect solution, but probably the best we have at the moment. In general, I'm not a fan of the AutoTokenizer.from_pretrained pattern from Python (too magical for my tastes), so I'd welcome anything that provides stronger guarantees without loss of convenience or familiarity. A registration mechanism could be nice.
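
For comparison, a registration mechanism along the lines of #63 could be as small as a name-to-factory table that downstream apps populate. The following is a hypothetical sketch, not an existing API in this library or in mlx-swift-examples:

    // Stand-in protocol, as in the earlier sketch.
    protocol TokenizingModel {}

    // Hypothetical registration mechanism, sketched for discussion only.
    enum TokenizerRegistry {
        // Maps a tokenizer class name (as it appears in tokenizer_config.json)
        // to a factory that builds an instance of that tokenizer.
        private static var factories: [String: () -> TokenizingModel] = [:]

        static func register(_ name: String, factory: @escaping () -> TokenizingModel) {
            factories[name] = factory
        }

        static func make(_ name: String) -> TokenizingModel? {
            factories[name]?()
        }
    }

    // A downstream app could then opt in explicitly, e.g.:
    // TokenizerRegistry.register("MyCustomTokenizer") { MyCustomTokenizer() }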

@pcuenca (Member, Author) commented Sep 11, 2025

Going for a warning for now. How do you feel about a strict argument that still raises an exception when enabled? Strict would be the default behaviour, but downstream users like mlx-swift-examples could opt out, since they already have registration mechanisms in place.

@pcuenca (Member, Author) commented Sep 11, 2025

This is what strict mode (enabled by default) looks like; see the sketch below.

cc @davidkoski, as we briefly discussed this a while ago. You can opt out of strict mode when you load the tokenizer, since mlx-swift-examples has its own registration mechanism.

My main question is still whether we go this way (strict mode) or just default everything to BPE and simply print a warning.
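
For reference, a rough sketch of the strict-mode shape, using the same stand-in names as the earlier sketch (the error type is an assumption, not necessarily the merged code):

    // Stand-ins, as in the earlier sketch.
    protocol TokenizingModel {}
    struct BPETokenizer: TokenizingModel {}
    let knownTokenizers: [String: TokenizingModel.Type] = ["BPETokenizer": BPETokenizer.self]

    // Strict mode: throw on unknown tokenizer classes by default; with
    // strict == false, warn and fall back to BPE instead.
    enum TokenizerError: Error {
        case unsupportedTokenizer(String)
    }

    func tokenizerClass(for name: String, strict: Bool = true) throws -> TokenizingModel.Type {
        if let cls = knownTokenizers[name] { return cls }
        if strict {
            throw TokenizerError.unsupportedTokenizer(name)
        }
        print("Warning: tokenizer class \(name) is unknown; falling back to BPETokenizer")
        return BPETokenizer.self
    }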

@davidkoski (Contributor):

> My main question is still whether we go this way (strict mode) or just default everything to BPE and simply print a warning.

I think the mlx-swift-examples registration mechanism could gladly be retired if it were no longer needed.

Anyway, I don't normally like silent failures (permissive mode), but in this case the BPE tokenizer seems to be what most models end up using anyway, so I think it's a reasonable default. I wonder if a parameter in the call to indicate whether you want strict vs. fallback behavior would be appropriate?

@pcuenca (Member, Author) commented Sep 11, 2025

> I wonder if a parameter in the call to indicate whether you want strict vs. fallback behavior

Yes, that's what I finally ended up doing here, sorry for being unclear. This test shows that by default we still throw, but you can pass strict: false to fall back to BPE:

    func testNllbTokenizer() async throws {
        do {
            _ = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M")
            XCTFail("Expected AutoTokenizer.from to throw for strict mode")
        } catch {
            // Expected to throw in normal (strict) mode
        }

        // With strict: false, loading proceeds and falls back to BPE
        guard let tokenizer = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M", strict: false) as? PreTrainedTokenizer else {
            XCTFail("Expected a PreTrainedTokenizer instance")
            return
        }

        let ids = tokenizer.encode(text: "Why did the chicken cross the road?")
        let expected = [256047, 24185, 4077, 349, 1001, 22690, 83580, 349, 82801, 248130, 2]
        XCTAssertEqual(ids, expected)
    }

Would this work for you?

@davidkoski (Contributor):

> Would this work for you?

For sure!

@pcuenca (Member, Author) commented Sep 11, 2025

Ok, merging then. Thanks all for the reviews and opinions!

@pcuenca merged commit 5059cd4 into main on Sep 11, 2025 (2 checks passed).
@pcuenca deleted the tokenizers-bpe-fallback branch on September 11, 2025 at 17:12.