fixed parsing token vocabularies for gemma and gpt-sw3 models #763

ai-and-i · 2024-03-20T19:47:07Z

This PR closes #762

saattrupdan · 2024-03-21T06:13:55Z

Looks good to me! @rlouf?

rlouf · 2024-03-25T07:37:08Z

Thank you!

This PR is an extension of #763 and extends the `re_replacement_seq` regex. The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that contains the token `�.`, which triggers the same error described in the earlier issue #762. This PR extends the fix from #763 to handle that case, adds a unit test covering various tokenizers, and adds a comment explaining why the regex needs both the prefix and the suffix.
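As a rough illustration of the idea, here is a minimal sketch of a pattern that accepts replacement-character tokens with an optional prefix and suffix. The pattern, the constant names, and the helper `is_replacement_seq` are assumptions made for this example, not the code merged in either PR. In particular, treating the prefix as the SentencePiece word-boundary marker `▁` and the suffix as trailing dots is a guess based on the token `�.` mentioned above.

```python
import re

# Hypothetical constants for this sketch (not taken from the library).
REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character "�"
WORD_BOUNDARY = "\u2581"     # "▁", assumed here to be the prefix in question

# Assumed pattern: optional "▁" prefix, one or more "�", optional "." suffix.
# The regex actually used in the project may differ.
RE_REPLACEMENT_SEQ_SKETCH = re.compile(
    f"{WORD_BOUNDARY}*{REPLACEMENT_CHAR}+\\.*"
)


def is_replacement_seq(token: str) -> bool:
    """Return True if the whole token is a run of replacement characters,
    optionally preceded by "▁" markers and followed by dots."""
    return RE_REPLACEMENT_SEQ_SKETCH.fullmatch(token) is not None
```

With this sketch, `is_replacement_seq("\ufffd.")` accepts the NorwAI-style token, while an ordinary token such as `"hello"` is rejected.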

fixed parsing token vocabularies for gemma and gpt-sw3 models

a8d50d7

rlouf merged commit c744e25 into outlines-dev:main Mar 25, 2024
5 checks passed

rlouf added the `structured generation` label Mar 25, 2024

saattrupdan mentioned this pull request Jun 10, 2024

Fix/extend re replacement seq #948

Merged
