Commit 5c48360

itazap and ArthurZucker authored
Update MIGRATION_GUIDE_V5.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
1 parent bc4141f commit 5c48360

1 file changed: +2 -1 lines changed

MIGRATION_GUIDE_V5.md

Lines changed: 2 additions & 1 deletion
@@ -165,7 +165,8 @@ If you want something even higher up the stack, then `PreTrainedTokenizerBase` i
 - `save_pretrained`
 - among a few others

-**Note for implementing new tokenizers:** When creating a tokenizer class that loads from SentencePiece files, you can override the `convert_from_spm` class method in your converter to customize vocabulary structure during conversion. This is useful if the model requires specific token ordering or additional tokens. See existing converter classes in `convert_slow_tokenizer.py` for examples.
+**Note for implementing new tokenizers:** When creating a tokenizer class that loads from SentencePiece files, you can override the `convert_from_spm` class method in your converter to customize vocabulary structure, normalizers, regexes, and anything else you want passed to the tokenizer you are converting.
+This is useful if the model requires specific token ordering or special split regex patterns. See existing converter classes in `convert_slow_tokenizer.py` for examples.

 ### API Changes

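To illustrate the override pattern described in the added note, here is a minimal, self-contained sketch. It does not import `transformers`: `BaseSpmConverter`, `MyModelConverter`, the `vocab` argument, and the `split_regex` keyword are all hypothetical stand-ins, and the real `convert_from_spm` signature may differ. The existing converters in `convert_slow_tokenizer.py` remain the authoritative examples.

```python
import re


class BaseSpmConverter:
    """Hypothetical stand-in for a converter base class (not the transformers API)."""

    @classmethod
    def convert_from_spm(cls, vocab, **kwargs):
        # In this sketch the default simply passes the vocabulary and options through.
        return {"vocab": list(vocab), **kwargs}


class MyModelConverter(BaseSpmConverter):
    """Hypothetical converter that customizes token ordering and the split regex."""

    # Assumed for illustration: tokens the model expects at the start of the vocabulary.
    EXTRA_TOKENS = ["<pad>", "<mask>"]
    # Assumed for illustration: split digits one by one, keep other runs together.
    SPLIT_PATTERN = r"\d|[^\s\d]+|\s+"

    @classmethod
    def convert_from_spm(cls, vocab, **kwargs):
        # Put the extra tokens first and drop any duplicates further down the list.
        reordered = cls.EXTRA_TOKENS + [t for t in vocab if t not in cls.EXTRA_TOKENS]
        # Supply a model-specific split pattern unless the caller already gave one.
        kwargs.setdefault("split_regex", cls.SPLIT_PATTERN)
        return super().convert_from_spm(reordered, **kwargs)


result = MyModelConverter.convert_from_spm(["<unk>", "▁hello", "<pad>", "▁world"])
pieces = re.findall(result["split_regex"], "ab 12")
```

The subclass only adjusts its inputs and then defers to the base method, so any shared conversion logic stays in one place.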