fixed parsing token vocabularies for gemma and gpt-sw3 models #763

Merged: 1 commit into outlines-dev:main on Mar 25, 2024

Conversation

ai-and-i

This PR closes #762

@saattrupdan
Contributor

Looks good to me! @rlouf?

rlouf merged commit c744e25 into outlines-dev:main on Mar 25, 2024
5 checks passed
@rlouf
Member

rlouf commented Mar 25, 2024

Thank you!

rlouf added the "structured generation" label on Mar 25, 2024
rlouf pushed a commit that referenced this pull request Jun 12, 2024
This PR builds on #763 by further extending the `re_replacement_seq` regex.

The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer
whose vocabulary contains the token `�.`, which triggers the same error
as the one described in the earlier issue #762.

This PR extends the fix from #763 to cover this case, adds a unit test
that exercises several tokenizers, and adds a comment explaining why the
regex needs both the prefix and the suffix.
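
To illustrate the kind of pattern the commit message describes, here is a minimal sketch. It is not the code from the PR: the exact pattern, the name `RE_REPLACEMENT_SEQ_SKETCH`, and the helper `is_replacement_seq` are assumptions made for illustration. In particular, the choice of `▁` (U+2581, the SentencePiece word-boundary marker) as the prefix is an assumption, and the `.` suffix is inferred from the `�.` token mentioned above.

```python
import re

# Hypothetical pattern, NOT copied from the PR:
#   - optional "▁" (U+2581) prefix: assumed to be what the Gemma / GPT-SW3 fix allows
#   - one or more U+FFFD replacement characters
#   - optional "." suffix: covers a token such as "�." from the NorwAI tokenizer
RE_REPLACEMENT_SEQ_SKETCH = re.compile(r"^\u2581*\ufffd+\.*$")


def is_replacement_seq(token: str) -> bool:
    """Return True if the token consists only of replacement characters,
    optionally wrapped in the assumed prefix and suffix."""
    return RE_REPLACEMENT_SEQ_SKETCH.match(token) is not None


# Tokens written with escapes so the example does not depend on file encoding.
candidates = ["\ufffd", "\u2581\ufffd", "\ufffd.", "hello", "\u2581hello"]
results = {token: is_replacement_seq(token) for token in candidates}
```

Under these assumptions, the first three candidates match and the last two do not, so ordinary tokens that merely start with `▁` are not treated as replacement sequences.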
lapp0 pushed a commit to lapp0/outlines that referenced this pull request Jun 12, 2024
lapp0 pushed a commit to lapp0/outlines that referenced this pull request Jun 12, 2024
fpgmaas pushed a commit to fpgmaas/outlines that referenced this pull request Jun 14, 2024
Linked issue (#762): Regex FSM fails with some tokenizers