
fix: handle dict vocab in CamembertTokenizer for tokenizer.json (#44488) #44800

Closed
aayushbaluni wants to merge 2 commits into huggingface:main from aayushbaluni:fix/44488-camembert-dict-vocab


@aayushbaluni

Summary

Fixes #44488

CamembertTokenizer raised ValueError: too many values to unpack (expected 2) when loading models like cjvt/sleng-bert that provide vocab as a dict {token: id} from tokenizer.json (BPE format). The tokenizer expected a list of (token, score) tuples for Unigram.

Root cause

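When AutoTokenizer.from_pretrained loads a model that ships a tokenizer.json, convert_to_native_format passes vocab as a dict. CamembertTokenizer assumed the list format and unpacked each entry as (tok, _). Iterating over a dict yields its keys, so each entry was a plain token string, and unpacking a string longer than two characters into two names raises the error.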
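The failure mode can be reproduced in plain Python without transformers. This is a minimal sketch, not the tokenizer's actual code: `unpack_as_pairs` and both toy vocabularies are hypothetical stand-ins for code that assumes a list of (token, score) pairs.

```python
# Hypothetical toy vocabularies; the real ones come from the model files.
list_vocab = [("<s>", 0.0), ("<pad>", 0.0), ("hello", 0.0)]  # Unigram-style list
dict_vocab = {"<s>": 0, "<pad>": 1, "hello": 2}              # tokenizer.json-style dict


def unpack_as_pairs(vocab):
    """Stand-in for code that assumes every entry is a (token, score) pair."""
    return [tok for (tok, _) in vocab]


# The list form unpacks cleanly.
print(unpack_as_pairs(list_vocab))

# Iterating a dict yields its keys, so each "entry" is a token string.
# Unpacking the 3-character string "<s>" into two names fails.
try:
    unpack_as_pairs(dict_vocab)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

Note that a token of exactly two characters would unpack without an error and silently produce a wrong result, so the exception only surfaces on tokens of other lengths.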

Fix

Handle dict vocab by converting to list of (token, 0.0) tuples in id order before passing to Unigram.
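The conversion described above could look roughly like the following. This is a sketch under stated assumptions, not the code from this PR's diff: `normalize_vocab` is a hypothetical helper name, and it assumes the dict maps each token to an integer id.

```python
from typing import Dict, List, Tuple, Union

Vocab = Union[Dict[str, int], List[Tuple[str, float]]]


def normalize_vocab(vocab: Vocab) -> List[Tuple[str, float]]:
    """Return the vocab as a list of (token, score) pairs.

    Hypothetical helper: a dict {token: id} is sorted by id and every
    token gets a placeholder score of 0.0; a list is passed through.
    """
    if isinstance(vocab, dict):
        by_id = sorted(vocab.items(), key=lambda item: item[1])
        return [(token, 0.0) for token, _ in by_id]
    return list(vocab)


# Usage with a toy dict whose insertion order differs from id order.
print(normalize_vocab({"hello": 2, "<s>": 0, "<pad>": 1}))
```

Sorting by id keeps each token at the list position matching its id only if the ids are contiguous and start at 0; a vocab with gaps would need extra handling. The 0.0 score is a placeholder, since the dict carries no Unigram scores.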

Testing

  • Added test_camembert_tokenizer_with_dict_vocab in test_tokenization_camembert.py
  • Manually verified CamembertTokenizer(vocab=dict_from_sleng_bert) loads successfully

Made with Cursor

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: camembert

@Rocketknight1
Member

No drive-by code agent PRs!



Development

Successfully merging this pull request may close these issues.

Current version also does not load "cjvt/sleng-bert"

2 participants