Expose additional features #6

@xenova

Description

While working on the transformers.js integration, I've compiled a list of features we should add:

  • `model_max_length`: NOTE: This is only available in tokenizer_config.json. Maybe tokenizers.js won't need tokenizer_config.json after all, and we can move all logic that requires it to transformers.js.
  • Special tokens like `bos_token`, `eos_token`, `unk_token`, `sep_token`, `pad_token`, `cls_token`, and `mask_token`, as well as their ID counterparts (`pad_token_id`, ...), also exist only in tokenizer_config.json, so once again they may only need to be exposed in transformers.js (via the `PreTrainedTokenizer` class).
  • Remove any padding/truncation logic from tokenizers.js and move it to transformers.js. This actually makes sense, since the only purpose of padding/truncation is to produce a fixed-shape structure that can be converted to a tensor. Since tokenizers.js has no concept of a Tensor, it doesn't need this logic.
  • Type issue: `Namespace '".../transformers.js/node_modules/@huggingface/tokenizers/types/index"' has no exported member 'Encoding'.` ts(2694)
  • We should expose the different components, perhaps with separate exports as in the Rust/Python library (`from tokenizers.pre_tokenizers import Metaspace`). This is necessary because, for example, certain Llama tokenizers need to handle a legacy mode to ensure backwards compatibility (see here).
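To make the first three points concrete, here is a minimal sketch of what the transformers.js side could look like if it owned the tokenizer_config.json fields and the padding/truncation logic. All names here (`TokenizerConfig`, the constructor shape, `padAndTruncate`) are hypothetical illustrations, not the actual tokenizers.js or transformers.js API:

```typescript
// Hypothetical sketch: a transformers.js-side PreTrainedTokenizer that layers
// tokenizer_config.json fields (model_max_length, special tokens) on top of a
// core tokenizer, so tokenizers.js itself never reads that file.

interface TokenizerConfig {
  model_max_length?: number;
  bos_token?: string;
  eos_token?: string;
  unk_token?: string;
  pad_token?: string;
}

class PreTrainedTokenizer {
  readonly model_max_length: number;
  readonly pad_token: string | null;
  readonly pad_token_id: number | null;

  constructor(
    private vocab: Map<string, number>,
    config: TokenizerConfig,
  ) {
    // Fall back to "effectively unlimited" when the config omits the field.
    this.model_max_length = config.model_max_length ?? Number.MAX_SAFE_INTEGER;
    this.pad_token = config.pad_token ?? null;
    this.pad_token_id =
      this.pad_token !== null ? this.vocab.get(this.pad_token) ?? null : null;
  }

  // Padding/truncation lives here, not in the core tokenizer: its only job is
  // to produce fixed-shape arrays that can be converted to a tensor.
  padAndTruncate(ids: number[], maxLength: number): number[] {
    const out = ids.slice(0, maxLength);
    while (out.length < maxLength) {
      out.push(this.pad_token_id ?? 0);
    }
    return out;
  }
}
```

With this split, the core tokenizer stays a pure string-to-ids mapping, and everything that exists only to feed a tensor (shapes, padding ids, max lengths) is concentrated in the wrapper.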
