Expose additional features #6

@xenova

Description

While working on the transformers.js integration, I've compiled a list of features we should add:

  • `model_max_length`: NOTE: This is only available in tokenizer_config.json. Maybe tokenizers.js won't need tokenizer_config.json after all, and we can move all logic that requires it to transformers.js.
  • Special tokens like `bos_token`, `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, and `mask_token`, as well as their ID counterparts (`pad_token_id`, ...), also exist only in tokenizer_config.json, so once again they may only need to be exposed in transformers.js (via the `PreTrainedTokenizer` class).
  • Remove any padding/truncation logic from tokenizers.js and move it to transformers.js. This actually makes sense, since the only purpose of padding/truncation is to produce a fixed-shape structure that can be converted to a tensor. Since tokenizers.js has no concept of a Tensor, it doesn't need this logic.
  • Type issue: `Namespace '".../transformers.js/node_modules/@huggingface/tokenizers/types/index"' has no exported member 'Encoding'.` ts(2694)
  • We should expose the different components, perhaps with separate exports as in the Rust/Python library (`from tokenizers.pre_tokenizers import Metaspace`). This is necessary because, for example, certain Llama tokenizers need to handle a legacy mode to ensure backwards compatibility (see here).
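To make the first three points concrete, here is a minimal sketch of what the transformers.js side could look like if it owned the tokenizer_config.json fields and the padding/truncation logic. All names here (`TokenizerConfig`, the constructor shape, `padAndTruncate`) are hypothetical illustrations, not the actual tokenizers.js or transformers.js API:

```typescript
// Hypothetical sketch: a transformers.js-side PreTrainedTokenizer that layers
// tokenizer_config.json fields (model_max_length, special tokens) on top of a
// core tokenizer, so tokenizers.js itself never reads that file.

interface TokenizerConfig {
  model_max_length?: number;
  bos_token?: string;
  eos_token?: string;
  unk_token?: string;
  pad_token?: string;
}

class PreTrainedTokenizer {
  readonly model_max_length: number;
  readonly pad_token: string | null;
  readonly pad_token_id: number | null;

  constructor(
    private vocab: Map<string, number>,
    config: TokenizerConfig,
  ) {
    // Fall back to "effectively unlimited" when the config omits the field.
    this.model_max_length = config.model_max_length ?? Number.MAX_SAFE_INTEGER;
    this.pad_token = config.pad_token ?? null;
    this.pad_token_id =
      this.pad_token !== null ? this.vocab.get(this.pad_token) ?? null : null;
  }

  // Padding/truncation lives here, not in the core tokenizer: its only job is
  // to produce fixed-shape arrays that can be converted to a tensor.
  padAndTruncate(ids: number[], maxLength: number): number[] {
    const out = ids.slice(0, maxLength);
    while (out.length < maxLength) {
      out.push(this.pad_token_id ?? 0);
    }
    return out;
  }
}
```

With this split, the core tokenizer stays a pure string-to-ids mapping, and everything that exists only to feed a tensor (shapes, padding ids, max lengths) is concentrated in the wrapper.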
