
@pcuenca (Member) commented Sep 10, 2025

Proposal: now that the library has been tested more extensively, it's not so valuable to fail on tokenizer names we haven't encountered before.

We still need to check what happens with tokenizers that use a different model, or those that haven't been ported from their original implementations, to ensure the error message is helpful.

An alternative would be to expose a registration mechanism, as discussed in #63.
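
For illustration, the fallback could look something like the following minimal sketch. TokenizingModel, knownTokenizers, and BPETokenizer are simplified stand-ins here, not the library's actual types:

    // Minimal sketch of the proposed fallback. All names are simplified
    // stand-ins for illustration, not the swift-transformers API.
    protocol TokenizingModel {}
    struct BPETokenizer: TokenizingModel {}
    struct UnigramTokenizer: TokenizingModel {}

    // Hypothetical registry of tokenizer classes that have been ported.
    let knownTokenizers: [String: TokenizingModel.Type] = [
        "BPETokenizer": BPETokenizer.self,
        "UnigramTokenizer": UnigramTokenizer.self,
    ]

    // Before: throw on unknown names. After: warn and fall back to BPE.
    func tokenizerClass(for name: String) -> TokenizingModel.Type {
        if let cls = knownTokenizers[name] { return cls }
        print("Warning: tokenizer class \(name) is unknown; falling back to BPETokenizer")
        return BPETokenizer.self
    }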

@pcuenca (Member, Author) commented Sep 10, 2025

cc @FL33TW00D @mattt @xenova for opinions on whether to move forward with this. If so, we need to ensure we issue helpful error messages if the tokenizer ends up not working.

@FL33TW00D (Contributor):

I think if we pair this with a warning it's much better behaviour!

@mattt (Collaborator) commented Sep 11, 2025

I agree that falling back to BPE is a reasonable default, and an improvement over the current behavior of throwing an error 👍

Runtime warnings are an imperfect solution, but probably the best we have at the moment. In general, I'm not a fan of the AutoTokenizer.from_pretrained pattern from Python (too magical for my tastes), so I'd welcome anything that provides stronger guarantees without loss of convenience or familiarity. A registration mechanism could be nice.
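
For comparison, a registration mechanism along the lines of #63 could be as small as a name-to-factory table that downstream apps populate. The following is a hypothetical sketch, not an existing API in this library or in mlx-swift-examples:

    // Stand-in protocol, as in the earlier sketch.
    protocol TokenizingModel {}

    // Hypothetical registration mechanism, sketched for discussion only.
    enum TokenizerRegistry {
        // Maps a tokenizer class name (as it appears in tokenizer_config.json)
        // to a factory that builds an instance of that tokenizer.
        private static var factories: [String: () -> TokenizingModel] = [:]

        static func register(_ name: String, factory: @escaping () -> TokenizingModel) {
            factories[name] = factory
        }

        static func make(_ name: String) -> TokenizingModel? {
            factories[name]?()
        }
    }

    // A downstream app could then opt in explicitly, e.g.:
    // TokenizerRegistry.register("MyCustomTokenizer") { MyCustomTokenizer() }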

@pcuenca (Member, Author) commented Sep 11, 2025

Going for a warning for now. How do you feel about a strict argument that still raises an exception when enabled? Strict would be the default behaviour, but downstream users like mlx-swift-examples could opt out, since they already have registration mechanisms in place.

@pcuenca (Member, Author) commented Sep 11, 2025

This is what strict mode (enabled by default) looks like; see the sketch below.

cc @davidkoski, as we briefly discussed this a while ago. You can opt out of strict mode when you load the tokenizer, since mlx-swift-examples has its own registration mechanism.

My main question is still whether we go this way (strict mode) or just default everything to BPE and simply print a warning.
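
For reference, a rough sketch of the strict-mode shape, using the same stand-in names as the earlier sketch (the error type is an assumption, not necessarily the merged code):

    // Stand-ins, as in the earlier sketch.
    protocol TokenizingModel {}
    struct BPETokenizer: TokenizingModel {}
    let knownTokenizers: [String: TokenizingModel.Type] = ["BPETokenizer": BPETokenizer.self]

    // Strict mode: throw on unknown tokenizer classes by default; with
    // strict == false, warn and fall back to BPE instead.
    enum TokenizerError: Error {
        case unsupportedTokenizer(String)
    }

    func tokenizerClass(for name: String, strict: Bool = true) throws -> TokenizingModel.Type {
        if let cls = knownTokenizers[name] { return cls }
        if strict {
            throw TokenizerError.unsupportedTokenizer(name)
        }
        print("Warning: tokenizer class \(name) is unknown; falling back to BPETokenizer")
        return BPETokenizer.self
    }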

@davidkoski (Contributor):

> My main question is still whether we go this way (strict mode) or just default everything to BPE and simply print a warning.

I think the mlx-swift-examples registration mechanism could gladly be retired if it were no longer needed.

Anyway, I don't normally like silent failures (permissive mode), but in this case the BPE tokenizer seems to be what most models end up using anyway, so I think it's a reasonable default. I wonder if a parameter in the call to indicate whether you want strict vs. fallback behavior would be appropriate?

@pcuenca (Member, Author) commented Sep 11, 2025

> I wonder if a parameter in the call to indicate whether you want strict vs. fallback behavior

Yes, that's what I finally ended up doing here, sorry for being unclear. This test shows that by default we still throw, but you can pass strict: false to fall back to BPE:

    func testNllbTokenizer() async throws {
        do {
            _ = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M")
            XCTFail("Expected AutoTokenizer.from to throw for strict mode")
        } catch {
            // Expected to throw in normal (strict) mode
        }

        // With strict: false, loading proceeds and falls back to BPE
        guard let tokenizer = try await AutoTokenizer.from(pretrained: "Xenova/nllb-200-distilled-600M", strict: false) as? PreTrainedTokenizer else {
            XCTFail("Expected a PreTrainedTokenizer instance")
            return
        }

        let ids = tokenizer.encode(text: "Why did the chicken cross the road?")
        let expected = [256047, 24185, 4077, 349, 1001, 22690, 83580, 349, 82801, 248130, 2]
        XCTAssertEqual(ids, expected)
    }

Would this work for you?

@davidkoski (Contributor):

> Would this work for you?

For sure!

@pcuenca (Member, Author) commented Sep 11, 2025

Ok, merging then. Thanks all for the reviews and opinions!

@pcuenca merged commit 5059cd4 into main on Sep 11, 2025 (2 checks passed).
@pcuenca deleted the tokenizers-bpe-fallback branch on September 11, 2025 at 17:12.