fix: handle dict vocab in CamembertTokenizer for tokenizer.json (#44488) by aayushbaluni · Pull Request #44800 · huggingface/transformers

aayushbaluni · 2026-03-17T17:20:35Z

Summary

CamembertTokenizer raised ValueError: too many values to unpack (expected 2) when loading models like cjvt/sleng-bert that provide vocab as a dict {token: id} from tokenizer.json (BPE format). The tokenizer expected a list of (token, score) tuples for Unigram.

Root cause

When AutoTokenizer.from_pretrained loads a model with tokenizer.json, convert_to_native_format passes vocab as a dict. CamembertTokenizer assumed list format and unpacked (tok, _) = token_string, causing the error.
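
The failure mode described above can be reproduced without transformers. The sketch below is a minimal illustration, assuming the list-format code path boils down to unpacking each vocab entry into two names; `unpack_tokens` is a hypothetical stand-in, not the actual CamembertTokenizer source. Iterating a dict yields its string keys, so any token that is not exactly two characters long fails to unpack.

```python
def unpack_tokens(vocab):
    """Hypothetical stand-in for a loop that assumes (token, score) pairs."""
    return [tok for (tok, _) in vocab]


# Unigram-style vocab: list of (token, score) tuples.
list_vocab = [("<s>", 0.0), ("hello", -1.5)]
# tokenizer.json-style vocab: dict mapping token -> id.
dict_vocab = {"<s>": 0, "hello": 1}

tokens_from_list = unpack_tokens(list_vocab)

error_message = None
try:
    # Iterating the dict yields "<s>", a 3-character string, which cannot
    # be unpacked into exactly two names.
    unpack_tokens(dict_vocab)
except ValueError as exc:
    error_message = str(exc)

print(tokens_from_list)
print(error_message)
```

Note that a two-character token would unpack silently into two single characters, so the error only surfaces on tokens of other lengths.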

Fix

Handle dict vocab by converting to list of (token, 0.0) tuples in id order before passing to Unigram.
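
A minimal sketch of the conversion described above, assuming the dict maps each token to an integer id. The helper name `dict_vocab_to_unigram_list` is hypothetical and is not taken from the PR diff; the placeholder score of 0.0 follows the description.

```python
def dict_vocab_to_unigram_list(vocab):
    """Return vocab as a list of (token, score) tuples.

    Hypothetical helper: a dict {token: id} is sorted by id and each token
    gets a placeholder score of 0.0; any other iterable is returned as a list.
    """
    if isinstance(vocab, dict):
        ordered = sorted(vocab.items(), key=lambda item: item[1])
        return [(token, 0.0) for token, _ in ordered]
    return list(vocab)


dict_vocab = {"hello": 2, "<s>": 0, "</s>": 1}
converted = dict_vocab_to_unigram_list(dict_vocab)
print(converted)
```

Sorting by id keeps each token at the list index matching its original id, provided the ids are contiguous and start at 0. The original scores are not recoverable from a {token: id} dict, which is why every entry receives the same placeholder.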

Testing

Added test_camembert_tokenizer_with_dict_vocab in test_tokenization_camembert.py
Manually verified CamembertTokenizer(vocab=dict_from_sleng_bert) loads successfully

Made with Cursor

github-actions · 2026-03-17T18:44:18Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: camembert

Rocketknight1 · 2026-03-18T15:37:54Z

No drive-by code agent PRs!

aayushbaluni added 2 commits March 17, 2026 22:50

fix: handle dict vocab in CamembertTokenizer for tokenizer.json (huggingface#44488)

8b2df19

style: split long line into multiple statements for readability

8233986

Rocketknight1 closed this Mar 18, 2026

Rocketknight1 added the Code agent slop label Mar 18, 2026

aayushbaluni wants to merge 2 commits into huggingface:main from aayushbaluni:fix/44488-camembert-dict-vocab