Issue with Qwen 1.5 tokenizer #572

@mjp0

System Info

transformers.js version 2.14.2.

Environment/Platform

  • Website/web-app

Description

When I try to use the tokenizer from https://huggingface.co/Qwen/Qwen1.5-14B-Chat by calling const t = new PreTrainedTokenizer(tok.tokenizer, tok.config).encode(text), I get the following error:

Uncaught (in promise) Error: SyntaxError: Invalid regular expression: /(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+/gu: Invalid group

Chrome's debugger gives me nothing more specific than that the error is thrown during the PreTrainedTokenizer call.
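One way to narrow this down without the tokenizer is to compile the pattern from the error message directly. The sketch below assumes the culprit is the inline case-insensitive group (?i:...), which JavaScript regular expressions only accept in runtimes that implement the newer regex modifiers feature. The tryCompile helper and the variable names are hypothetical, not part of transformers.js.

```javascript
// Hypothetical helper: reports whether a pattern compiles in this runtime.
function tryCompile(source, flags) {
  try {
    new RegExp(source, flags);
    return { ok: true, message: null };
  } catch (err) {
    return { ok: false, message: err.message };
  }
}

// The pattern copied from the error message above.
const fullPattern =
  "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";

// Same pattern with the inline modifier swapped for a plain non-capturing group.
// This changes matching behavior (no case-insensitivity); it is only a probe.
const withoutInlineGroup = fullPattern.replace("(?i:", "(?:");

// If the first result is not ok but the second is, the (?i:...) group is
// what this runtime rejects.
console.log(tryCompile(fullPattern, "gu"));
console.log(tryCompile(withoutInlineGroup, "gu"));
```

Whether the first call fails depends on the browser or Node version, so the output will differ between runtimes.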

All other tokenizers, like Mistral, Llama, etc., work perfectly, so I think this must be some sort of compatibility bug specific to Qwen.
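If the (?i:...) group is indeed the problem, a possible workaround is to rewrite the pattern before compiling it, spelling out both cases of each letter. This is only a sketch under that assumption: expandInlineIgnoreCase is a hypothetical helper, not a transformers.js API, and it only handles groups that contain plain letters and no nested parentheses or escapes.

```javascript
// Hypothetical helper (not part of transformers.js): rewrites an inline
// case-insensitive group such as (?i:'s|'t) into (?:'[sS]|'[tT]).
// Assumes the group body holds only literal characters and "|".
function expandInlineIgnoreCase(source) {
  return source.replace(/\(\?i:([^()]*)\)/g, (_, body) => {
    const expanded = body.replace(/[a-zA-Z]/g, (ch) =>
      `[${ch.toLowerCase()}${ch.toUpperCase()}]`
    );
    return `(?:${expanded})`;
  });
}

// The pre-tokenizer pattern from the error message above.
const qwenPattern =
  "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";

const patched = expandInlineIgnoreCase(qwenPattern);
const regex = new RegExp(patched, "gu");

console.log(patched.slice(0, 20)); // (?:'[sS]|'[tT]|'[rR]
console.log("I'LL go".match(regex)); // ["I", "'LL", " go"]
```

The rewritten pattern avoids the inline modifier entirely, so it should compile wherever the u flag and \p{...} property escapes are supported.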

Reproduction

  1. Download https://huggingface.co/Qwen/Qwen1.5-14B-Chat/blob/main/tokenizer.json and https://huggingface.co/Qwen/Qwen1.5-14B-Chat/blob/main/tokenizer_config.json
  2. Execute const t = new PreTrainedTokenizer(tokenizer, config).encode('foobar')


Labels: bug
