Fix _set_model_specific_special_tokens to accept list-format extra_special_tokens #44781

Open
bensons wants to merge 5 commits into huggingface:main from bensons:fix/extra-special-tokens-list-format

bensons commented Mar 17, 2026

What does this PR do?

Some model repos provide extra_special_tokens as a list in their tokenizer_config.json, which caused an AttributeError: 'list' object has no attribute 'keys'. This converts list inputs to a dict mapping each token to itself before processing.
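As a rough illustration of the conversion described above (not the actual diff in this PR), the normalization could look like the sketch below. The helper name `normalize_extra_special_tokens` is hypothetical and does not exist in transformers; the real change lives inside `_set_model_specific_special_tokens`.

```python
from typing import Dict, List, Optional, Union


def normalize_extra_special_tokens(
    extra: Optional[Union[Dict[str, str], List[str]]],
) -> Dict[str, str]:
    """Hypothetical helper: return extra_special_tokens as a dict.

    A list such as ["<a>", "<b>"] becomes {"<a>": "<a>", "<b>": "<b>"},
    so later code that calls .keys() no longer fails on list input.
    """
    if extra is None:
        return {}
    if isinstance(extra, dict):
        # Already in the expected format; copy to avoid mutating the caller's dict.
        return dict(extra)
    if isinstance(extra, (list, tuple)):
        # Map each token to itself, as the PR description states.
        return {token: token for token in extra}
    raise TypeError(
        f"extra_special_tokens must be a dict or a list, got {type(extra).__name__}"
    )
```

Dict input passes through unchanged, so configs that already use the mapping format would behave as before under this sketch.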

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Cyrilvallez
@itazap
@ArthurZucker

…_special_tokens`

Some model repos (e.g. jedisct1/Qwen3-Embedding-8B-q8-mlx) provide
`extra_special_tokens` as a list in their tokenizer_config.json, which
caused an `AttributeError: 'list' object has no attribute 'keys'`. This
converts list inputs to a dict mapping each token to itself before
processing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Cyrilvallez
Member

Please provide the exact model where this happens before we review. Also, we do not want new test files.

@bensons
Author

bensons commented Mar 21, 2026

Example that caused me to find this behavior:
https://huggingface.co/jedisct1/Qwen3-VL-Embedding-8B-mlx/blob/main/tokenizer_config.json
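For anyone checking other repos, a small sketch like the following can tell which format a given tokenizer_config.json uses. It relies only on the standard library; the function name `describe_extra_special_tokens` is made up for illustration, and the sample JSON strings are not taken from the linked repo.

```python
import json


def describe_extra_special_tokens(config_text: str) -> str:
    """Hypothetical helper: report the JSON type of extra_special_tokens.

    Takes the raw text of a tokenizer_config.json and returns
    "list", "dict", "missing", or the Python type name otherwise.
    """
    config = json.loads(config_text)
    value = config.get("extra_special_tokens")
    if value is None:
        # Key absent or explicitly null.
        return "missing"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "dict"
    return type(value).__name__


# Illustrative inputs only.
print(describe_extra_special_tokens('{"extra_special_tokens": ["<tok>"]}'))
print(describe_extra_special_tokens('{"extra_special_tokens": {"image_token": "<img>"}}'))
```

A result of "list" indicates the format this PR is about.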

@itazap
Collaborator

itazap commented Mar 27, 2026

I'm able to load AutoTokenizer.from_pretrained("jedisct1/Qwen3-VL-Embedding-8B-mlx") without any errors. Can you please pull the latest main and share the full reproducer and the error you get?
