
Fix/extend re replacement seq #948

Merged
rlouf merged 15 commits into outlines-dev:main from fix/extend-re-replacement-seq on Jun 12, 2024

Conversation

saattrupdan (Contributor)

This PR is an extension of #763, related to extending the `re_replacement_seq` regex.

The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that has the token `�.`, which leads to the same error as was described in the previous issue #762.

This PR extends the fix from #763 to deal with this case, adds a unit test covering various tokenizers, and adds a comment describing why we need the prefix and suffix in the regex.
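For illustration, a minimal sketch of the kind of regex extension this PR describes; the exact pattern and where it lives in the codebase are assumptions for the example, not the repository's final code:

```python
import re

# Sketch: match token strings consisting of the Unicode replacement character
# "�", optionally preceded by the SentencePiece whitespace marker "▁" and
# optionally followed by a ".", which covers NorwAI's "�." token. The real
# pattern in outlines may differ in detail.
re_replacement_seq = re.compile(r"^▁*�+\.*$")

assert re_replacement_seq.match("�")
assert re_replacement_seq.match("▁�")
assert re_replacement_seq.match("�.")  # the NorwAI token that triggered this PR
assert not re_replacement_seq.match("hello")
```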

saattrupdan (Contributor, Author)

@rlouf The tests are failing because we need a Hugging Face token to load the NorwAI tokeniser, which is the reason this PR is needed in the first place. Would it be possible to add a Hugging Face token as a GitHub secret to deal with this?

rlouf (Member) commented on Jun 11, 2024

`HUGGINGFACE_API_TOKEN` should now be available in Actions.

saattrupdan (Contributor, Author) commented on Jun 11, 2024

> `HUGGINGFACE_API_TOKEN` should now be available in Actions.

That's great, thanks. Can you please get access to the following models with that token?

  1. meta-llama/Meta-Llama-3-8B
  2. mistralai/Mistral-7B-v0.3
  3. google/gemma-2b
  4. NorwAI/NorwAI-Mistral-7B-instruct

Those are the culprits of the failing tests.

Also, just to be sure, the `HUGGINGFACE_API_TOKEN` is an access token and not a token for the Hugging Face inference API, right?

Alternatively, if that's too much of a hassle, I can simply include the failure cases manually, rather than accessing them from "real" tokenisers. Let me know what you think.
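For context, a sketch of how a test could load one of these gated tokenizers in CI once the secret is available, assuming it is exposed to the job as the `HUGGINGFACE_API_TOKEN` environment variable (the code below is illustrative and not part of this PR):

```python
import os

from transformers import AutoTokenizer

# Illustrative only: read the token exposed by the workflow secret and pass it
# to `from_pretrained`, which is required for gated repositories such as
# NorwAI/NorwAI-Mistral-7B-instruct.
hf_token = os.environ.get("HUGGINGFACE_API_TOKEN")
tokenizer = AutoTokenizer.from_pretrained(
    "NorwAI/NorwAI-Mistral-7B-instruct", token=hf_token
)
```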

rlouf (Member) commented on Jun 11, 2024

It's probably better to do this "manually" indeed.

saattrupdan (Contributor, Author)

@rlouf Changed the test to a "manual" one now, and all tests pass 🙂
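A sketch of what such a "manual" test can look like, with the problematic token strings hard-coded so that no gated tokenizer has to be downloaded; the test name, the token list, and the pattern are illustrative, not the exact code added in this PR:

```python
import re

import pytest

# Illustrative pattern, mirroring the sketch earlier in the thread.
re_replacement_seq = re.compile(r"^▁*�+\.*$")

# Failure cases observed in real tokenizers, listed directly so the test runs
# without any Hugging Face access token.
@pytest.mark.parametrize("token", ["�", "▁�", "��", "�."])
def test_re_replacement_seq_matches_known_tokens(token):
    assert re_replacement_seq.match(token) is not None
```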

rlouf merged commit a987159 into outlines-dev:main on Jun 12, 2024
7 checks passed
rlouf (Member) commented on Jun 12, 2024

Awesome, thank you!

saattrupdan deleted the fix/extend-re-replacement-seq branch on June 12, 2024 at 13:38
lapp0 pushed a commit to lapp0/outlines that referenced this pull request Jun 12, 2024
lapp0 pushed a commit to lapp0/outlines that referenced this pull request Jun 12, 2024
fpgmaas pushed a commit to fpgmaas/outlines that referenced this pull request Jun 14, 2024