Add unigram sampling (alpha, nbest_size) by kennethsible · Pull Request #1994 · huggingface/tokenizers

kennethsible · 2026-03-27T18:02:57Z

I noticed that models.Unigram doesn't support sampling, which enables subword regularization (arguably one of the main reasons to choose the Unigram model). I checked GitHub and there are multiple closed issues on this topic (#730, #849). In one of these issues, it was mentioned that the sampling code has already been implemented in lattice.rs and simply needs to be exposed through Python. I filled in the missing details and added a sample_nbest function for parity with Google's implementation. I also copied the interface for BPE dropout as closely as possible, including getters and setters for the new sampling parameters.
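To make the role of the two parameters concrete, here is a minimal, self-contained sketch of the idea behind n-best sampling for subword regularization: list the candidate segmentations, keep the `nbest_size` most probable, and draw one with probability proportional to P(segmentation)^alpha. This is a toy illustration only, not the code in lattice.rs and not the API added by this PR. The vocabulary, the helper names (`segmentations`, `sample_segmentation`), and the brute-force enumeration are all assumptions made for the example.

```python
import math
import random

# Hypothetical toy vocabulary: piece -> log-probability (not from a real model).
VOCAB = {
    "a": math.log(0.2),
    "b": math.log(0.2),
    "c": math.log(0.2),
    "ab": math.log(0.15),
    "bc": math.log(0.1),
    "abc": math.log(0.15),
}


def segmentations(text, vocab):
    """Yield (pieces, log_prob) for every split of `text` into vocab pieces."""
    if not text:
        yield [], 0.0
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest, score in segmentations(text[end:], vocab):
                yield [piece] + rest, vocab[piece] + score


def sample_segmentation(text, vocab, alpha, nbest_size, rng):
    """Sample one of the `nbest_size` best segmentations, weighted by P^alpha.

    Toy convention (an assumption of this sketch): nbest_size must be >= 1,
    and nbest_size == 1 always returns the single best segmentation.
    """
    if nbest_size < 1:
        raise ValueError("nbest_size must be >= 1 in this sketch")
    candidates = sorted(segmentations(text, vocab), key=lambda c: c[1], reverse=True)
    if not candidates:
        raise ValueError(f"cannot segment {text!r} with this vocabulary")
    candidates = candidates[:nbest_size]
    # Subtract the best score before exponentiating for numerical stability.
    best = candidates[0][1]
    weights = [math.exp(alpha * (score - best)) for _, score in candidates]
    pieces = [p for p, _ in candidates]
    return rng.choices(pieces, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_segmentation("abc", VOCAB, alpha=0.1, nbest_size=4, rng=rng))
```

In this sketch, a smaller `alpha` flattens the distribution so that less likely segmentations are drawn more often, while a larger `alpha` concentrates the draws on the best segmentation. A real implementation would compute the n-best list over a lattice rather than enumerating every split.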

HuggingFaceDocBuilderDev · 2026-03-30T07:31:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks! Do you mind adding a test on the Python side, in bindings/python/tests/bindings/test_models.py?

Also update the doc of PyUnigram to mention these new args and the new behavior!

I am not super super familiar with unigram / nbest etc but happy to have parity!

kennethsible · 2026-04-01T21:38:26Z

> Also update the doc of PyUnigram to mention these new args and the new behavior!

@ArthurZucker Is updating the docstring sufficient? I wasn't sure if the documentation is autogenerated.

ArthurZucker

Yep that's good

kennethsible and others added 3 commits March 16, 2026 14:54

Add unigram sampling

478150b

Add nbest_size for unigram sampling

08355c8

Merge branch 'main' into unigram-sampling

4f989ff

ArthurZucker reviewed Mar 30, 2026

ArthurZucker and others added 3 commits March 30, 2026 09:32

Merge branch 'main' into unigram-sampling

28922f3

Add unit test for unigram sampling

6292cff

Update docstring for PyUnigram

3d3b5e2

ArthurZucker added 2 commits April 8, 2026 07:45

style

4c2b612

fmt

e9a7c6e

ArthurZucker approved these changes Apr 8, 2026

ArthurZucker merged commit 2827113 into huggingface:main Apr 8, 2026
35 checks passed

ArthurZucker merged 8 commits into huggingface:main from kennethsible:unigram-sampling
