Description
While working on the transformers.js integration, here is a list of features we should work on adding:
- model_max_length: NOTE: this is only available in tokenizer_config.json. Since tokenizers.js may not need tokenizer_config.json at all, we could move all logic that requires it to transformers.js.
- Special tokens like 'bos_token', 'eos_token', 'unk_token', 'sep_token', 'pad_token', 'cls_token', and 'mask_token', as well as their id counterparts ('pad_token_id', ...), also only exist in tokenizer_config.json, so once again they may only need to be exposed in transformers.js (via the PreTrainedTokenizer class).
- Remove any padding/truncation logic from tokenizers.js and move it to transformers.js. This actually makes sense, since the only purpose of padding/truncation is to create a fixed-shape structure that can be converted to a tensor. Since tokenizers.js has no concept of a Tensor, it doesn't need this logic.
- Type issue: Namespace '".../transformers.js/node_modules/@huggingface/tokenizers/types/index"' has no exported member 'Encoding'.ts(2694)
- We should expose the different components, maybe with separate exports like in the Rust/Python library: from tokenizers.pre_tokenizers import Metaspace. This is necessary because, for example, certain Llama tokenizers need to handle a legacy mode to ensure backwards compatibility (see here).
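To make the split concrete, here is a rough sketch of what the transformers.js side could look like: a wrapper class that owns the tokenizer_config.json-derived state (model_max_length, special tokens and their ids) and the padding/truncation step, so tokenizers.js stays config-free. All names here (PreTrainedTokenizer, padAndTruncate) are illustrative assumptions, not the actual API:

```javascript
// Hypothetical sketch of a transformers.js-side wrapper. It takes the parsed
// tokenizer_config.json and handles the pieces we would remove from tokenizers.js.
class PreTrainedTokenizer {
  constructor(tokenizerConfig) {
    // Fields that only exist in tokenizer_config.json:
    this.model_max_length = tokenizerConfig.model_max_length ?? Infinity;
    this.pad_token = tokenizerConfig.pad_token ?? null;
    this.pad_token_id = tokenizerConfig.pad_token_id ?? 0;
  }

  // Pad/truncate raw token-id sequences to a fixed shape, ready to be
  // converted to a tensor by the caller. This is the logic that has no
  // place in tokenizers.js, which never deals with tensors.
  padAndTruncate(sequences, maxLength = this.model_max_length) {
    const target = Math.min(maxLength, Math.max(...sequences.map(s => s.length)));
    return sequences.map(seq => {
      const truncated = seq.slice(0, target);
      while (truncated.length < target) truncated.push(this.pad_token_id);
      return truncated;
    });
  }
}
```

The point of the sketch is the boundary: tokenizers.js returns variable-length id sequences, and everything that depends on tokenizer_config.json or on a fixed output shape lives above it.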
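For the separate-exports point, a minimal sketch of what an exposed Metaspace pre-tokenizer might look like in JS. The class shape and the legacyMode flag are assumptions for illustration; the real component would mirror the Rust/Python behavior exactly:

```javascript
// Hypothetical sketch of an individually-exported Metaspace pre-tokenizer,
// analogous to `from tokenizers.pre_tokenizers import Metaspace` in Python.
// Metaspace replaces spaces with a replacement character (usually '▁') and
// splits on it; a legacy-mode flag is the kind of knob Llama tokenizers need.
class Metaspace {
  constructor({ replacement = '\u2581', addPrefixSpace = true, legacyMode = false } = {}) {
    this.replacement = replacement;
    this.addPrefixSpace = addPrefixSpace;
    this.legacyMode = legacyMode; // placeholder for backwards-compat behavior
  }

  preTokenize(text) {
    // Optionally prepend a space so the first word gets the marker too.
    if (this.addPrefixSpace && !text.startsWith(' ')) {
      text = ' ' + text;
    }
    const replaced = text.replaceAll(' ', this.replacement);
    // Split so each piece starts with the replacement character.
    return replaced.split(new RegExp(`(?=${this.replacement})`, 'u')).filter(Boolean);
  }
}
```

Exposing components like this (e.g. as named exports from a pre_tokenizers entry point) would let downstream code reconfigure a pipeline piece by piece instead of only loading a whole tokenizer.json.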