Circumvent Broken llama.cpp Pre-Tokenizer #892

Merged: 1 commit merged into dottxt-ai:main from lapp0:test-issue-820 on May 17, 2024

Conversation

@lapp0 (Contributor) commented on May 15, 2024

Fixes #820

Problem:

llama.cpp's pre-tokenizer doesn't handle unicode properly (draft PR: ggerganov/llama.cpp#5613). This results in tokens which are incompatible with Outlines byte-wise FSM and causes #820's error.
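
For reference, a minimal reproduction sketch (the regex pattern is arbitrary; the error surfaces during index construction, with the message from #820's title):

>>> from outlines import models, generate
>>> model = models.llamacpp("Qwen/Qwen1.5-0.5B-Chat-GGUF", "*q8*.gguf")
>>> generate.regex(model, r"[0-9]+")
Traceback (most recent call last):
  ...
RuntimeError: Cannot convert token � (29333) to bytes: �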

Solution:

    1. If models.llamacpp() specifies a LlamaHFTokenizer, populate the vocabulary used in index construction with tokenizer.get_vocab(). This takes advantage of huggingface's working pre-tokenizer.
    2. Warn users that they should pass a LlamaHFTokenizer:
>>> from outlines import models, generate
>>> model = models.llamacpp("Qwen/Qwen1.5-0.5B-Chat-GGUF", "*q8*.gguf")
/opt/conda/lib/python3.10/site-packages/outlines/models/llamacpp.py:294: UserWarning: llama.cpp pre-tokenizer is broken. You may receive an Outlines error during Regex index construction.
To avoid this error when using `models.llamacpp` you may pass `tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(<hf_repo_id>)` to `models.llamacpp()`
  warnings.warn(
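
Following the warning, a sketch of the suggested workaround (it assumes Qwen/Qwen1.5-0.5B-Chat is the HF repo matching the GGUF weights):

>>> import llama_cpp
>>> from outlines import models
>>> # Assumption: the GGUF weights were converted from Qwen/Qwen1.5-0.5B-Chat
>>> hf_tokenizer = llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
...     "Qwen/Qwen1.5-0.5B-Chat"
... )
>>> model = models.llamacpp(
...     "Qwen/Qwen1.5-0.5B-Chat-GGUF", "*q8*.gguf", tokenizer=hf_tokenizer
... )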

Debug Notes / Observations:

  • The problematic token in Qwen1.5 is 29333 (b' \xef\xbf\xbd')
  • The issue can be reproduced with models.llamacpp, but not with models.transformers
  • AutoTokenizer's get_vocab() is inconsistent with its encode / decode output (runnable transcript after this list):
    • get_vocab()[b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()] = 29333
    • get_vocab()[b' \xef\xbf\xbd'.decode()] -> KeyError
    • encode(b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()) = [144242, 37572, 30182, 26062]
    • encode(b' \xef\xbf\xbd'.decode()) = [29333]
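
The observations above as a runnable transcript (output values copied from the notes):

>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
>>> tokenizer.get_vocab()[b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()]
29333
>>> tokenizer.encode(b' \xef\xbf\xbd'.decode())
[29333]
>>> tokenizer.get_vocab()[b' \xef\xbf\xbd'.decode()]
Traceback (most recent call last):
  ...
KeyError: ' �'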

tokenizer.get_vocab() uses a different byte mapping than tokenizer.decode because of the pre-tokenizer:

>>> tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
>>> tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(b' \xef\xbf\xbd'.decode())[0][0].encode()
b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'

@lapp0 changed the title from "Tests to Reproduce Issue #820" to "Tests to Reproduce Issue #820 (RuntimeError: Cannot convert token � (29333) to bytes: � Using generate() with models.llamacpp" on May 15, 2024
@lapp0 changed the title to "Fix Issue #820 (RuntimeError: Cannot convert token � (29333) to bytes: � Using generate() with models.llamacpp" on May 15, 2024
@lapp0 force-pushed the test-issue-820 branch 2 times, most recently from 01b7390 to 496e0a6 on May 16, 2024 19:40
@lapp0 marked this pull request as ready for review on May 16, 2024 19:40
(4 review comments on outlines/models/llamacpp.py, outdated and resolved)
@lapp0 changed the title to "Circumvent Broken llama.cpp Pre-Tokenizer" on May 17, 2024
@lapp0 force-pushed the test-issue-820 branch 3 times, most recently from 2987784 to 4e84812 on May 17, 2024 14:30
@rlouf (Member) commented on May 17, 2024

Looks good, thank you for the fix. I think it will be valuable to many people in the community 🙏

@rlouf merged commit 315d531 into dottxt-ai:main on May 17, 2024 (5 checks passed)