Fix generation of multi-token unicode characters #738

Merged
merged 2 commits into from
Mar 14, 2024

ai-and-i

This PR fixes generation of unicode strings that can only be represented by sequences of multiple tokens (closes #725).

Currently, outlines prevents any such characters from being generated at all, even with the regex '.*'. This is a major problem when generating text in non-Latin scripts, or when generating special characters like emojis.

This PR addresses this problem by converting character-level regex FSMs to byte-level FSMs. More precisely, it augments the FSM by adding byte-by-byte transitions that can be triggered by sub-character tokens generated by the LLM. The full-character transitions are kept as-is, so the performance of generation for normal tokens isn't impacted.
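To illustrate the idea, here is a minimal sketch of augmenting a character-level transition table with byte-by-byte transitions. It is not the code from this PR: the dict-based FSM representation and the helper names `add_byte_transitions` and `walk_bytes` are assumptions made for the example.

```python
from itertools import count


def add_byte_transitions(char_fsm, num_states):
    """Return a copy of char_fsm with extra byte-level transitions.

    char_fsm maps (state, character) -> next_state. This layout is a
    hypothetical stand-in for a real FSM, chosen for illustration only.
    """
    byte_fsm = dict(char_fsm)  # full-character transitions are kept as-is
    fresh = count(num_states)  # ids for newly created intermediate states
    for (state, char), target in char_fsm.items():
        encoded = char.encode("utf-8")
        current = state
        for i, byte in enumerate(encoded):
            is_last = i == len(encoded) - 1
            key = (current, bytes([byte]))
            if key not in byte_fsm:
                # The last byte lands on the original target state;
                # earlier bytes lead to fresh intermediate states.
                byte_fsm[key] = target if is_last else next(fresh)
            current = byte_fsm[key]
    return byte_fsm


def walk_bytes(byte_fsm, state, data):
    """Follow byte transitions; return the final state or None if rejected."""
    for byte in data:
        state = byte_fsm.get((state, bytes([byte])))
        if state is None:
            return None
    return state


# Tiny FSM: from state 0, either "a" or the emoji leads to state 1.
char_fsm = {(0, "a"): 1, (0, "\U0001F600"): 1}
byte_fsm = add_byte_transitions(char_fsm, num_states=2)

emoji_bytes = "\U0001F600".encode("utf-8")  # four bytes: f0 9f 98 80
full = walk_bytes(byte_fsm, 0, emoji_bytes)
partial = walk_bytes(byte_fsm, 0, emoji_bytes[:1])
invalid = walk_bytes(byte_fsm, 0, emoji_bytes[1:2])
```

A token that carries only the first byte of the emoji moves the FSM to an intermediate state, while a token carrying the whole character can still use the original single-step transition.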

I considered other design choices, including keeping the logic for handling such tokens in the RegexGuide class. I don't think that can work, at least for GPT2-like tokenizers (used by gpt2, phi, qwen and other models). Such tokenizers have tokens that combine full UTF-8 characters with the leading bytes of the next character (for example, b'\x20\xf0'), and deciding whether to accept such a token requires walking the FSM.
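As a small illustration of why such tokens need an FSM walk, the sketch below takes the b'\x20\xf0' token from the description and checks it against a hand-written byte-level transition table for the regex " 😀". The table and the `walk` helper are assumptions for the example, not outlines internals.

```python
token = b"\x20\xf0"  # a space followed by the first byte of a 4-byte character

# On its own, the token is not valid UTF-8, so it cannot be mapped to a string.
try:
    token.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Once the remaining bytes arrive, the sequence decodes normally.
text = (token + b"\x9f\x98\x80").decode("utf-8")

# Hand-written byte-level transitions for the regex " 😀" (illustrative only).
transitions = {
    (0, 0x20): 1,
    (1, 0xF0): 2,
    (2, 0x9F): 3,
    (3, 0x98): 4,
    (4, 0x80): 5,
}


def walk(state, data):
    """Return the state reached after consuming data, or None if rejected."""
    for byte in data:
        state = transitions.get((state, byte))
        if state is None:
            return None
    return state


from_start = walk(0, token)      # accepted: ends in the middle of the emoji
after_space = walk(1, token)     # rejected: a second space is not allowed
```

The same token is accepted from one state and rejected from another, which is why a check on the token alone is not enough.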

@rlouf
Member

rlouf commented Mar 14, 2024

We knew we would need something like this at some point. Thank you so much for implementing it!

@rlouf rlouf merged commit 043117f into dottxt-ai:main Mar 14, 2024
5 checks passed
@ai-and-i
Author

Great, thanks for merging, it unblocks me from using outlines in my project. Thanks for the great tool!

Successfully merging this pull request may close these issues.

Multi-token unicode characters aren't supported in regexps