documents not being applied in apply_chat_template #33421

Closed
1 of 4 tasks
selkordy opened this issue Sep 11, 2024 · 7 comments

selkordy commented Sep 11, 2024

System Info

  • transformers version: 4.44.2
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.5
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.4
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: parallel

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm trying to pass documents to the chat template as described in the chat templating guide, but they seem to be ignored: passing documents has no effect on the rendered output.

https://huggingface.co/docs/transformers/en/chat_templating

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."},
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."},
]

document1 = {
    "title": "The Moon: Our Age-Old Foe",
    "contents": "Man has always dreamed of destroying the moon. In this essay, I shall...",
}

document2 = {
    "title": "The Sun: Our Age-Old Friend",
    "contents": "Although often underappreciated, the sun provides several notable benefits...",
}

model_input = tokenizer.apply_chat_template(
    [chat1, chat2],
    tokenize=False,
    add_generation_prompt=False,
    documents=[document1, document2],
)

print(model_input)
```
model_input does not include the documents, so the model never sees them:
['<|user|>\nWhich is bigger, the moon or the sun?\n<|assistant|>\nThe sun.\n', '<|user|>\nWhich is bigger, a virus or a bacterium?\n<|assistant|>\nA bacterium.\n']

Expected behavior

I expect the chat template to include the documents

@selkordy selkordy added the bug label Sep 11, 2024
Contributor

A-Duss commented Sep 11, 2024

I've been trying to pinpoint where documents should be passed to the model. From what I gathered after exploring the Jinja templates, it seems they should be handled in the chat_template key within the tokenizer_config.json file. However, Zephyr-7B-beta's chat template doesn't currently support this.

Even if it did, I'm unsure how documents would be integrated into the chat template. The code snippets in the documentation (mentioned by @selkordy) don't clearly specify which model is used. I noticed 'NousResearch/Hermes-2-Pro-Llama-3-8B' was the last model loaded in the documentation code, but its tokenizer is currently broken due to a typo in the latest commit to its tokenizer_config.json file. In any case, I didn't find any reference to documents in the chat template for that model either.

@Rocketknight1, I saw you implemented this in #30621—thanks for the great work, must have been a heck of a headache!
Would you be able to provide any insights into how this feature is supposed to work?

@Rocketknight1
Member

Hi @selkordy @A-Duss, the cause of this problem is simply that documents is not supported by many models, and as a result, their chat templates discard this input. I should probably update the documentation to make this clearer, and maybe reduce the emphasis on documents because it's not widely supported.

However, Command-R and Command-R+ do support it, using the rag ("retrieval-augmented generation") template. You can see it used in the "grounded generation" examples in their model cards.
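As a rough way to tell in advance whether a template will discard documents, the template text can be scanned for a reference to that variable. The sketch below is a textual heuristic, not a real Jinja parse; the helper name `template_uses_documents` and the two example templates are hypothetical and much simpler than any real model's template.

```python
import re

# Looks for the word `documents` inside a Jinja expression ({{ ... }})
# or statement ({% ... %}). Heuristic only: it does not parse Jinja.
_DOCS_PATTERN = re.compile(r"\{[{%][^}]*\bdocuments\b")


def template_uses_documents(chat_template):
    """Return True if the template text appears to reference `documents`.

    `chat_template` may be a single template string, a dict mapping
    template names to template strings, or None.
    """
    if chat_template is None:
        return False
    if isinstance(chat_template, dict):
        return any(template_uses_documents(t) for t in chat_template.values())
    return bool(_DOCS_PATTERN.search(chat_template))


# Hypothetical, simplified templates for illustration only.
plain_template = (
    "{% for message in messages %}"
    "{{ '<|' + message['role'] + '|>' + message['content'] }}"
    "{% endfor %}"
)
rag_template = (
    "{% for doc in documents %}"
    "{{ doc['title'] + ': ' + doc['text'] }}"
    "{% endfor %}"
) + plain_template

print(template_uses_documents(plain_template))  # no reference to documents
print(template_uses_documents(rag_template))    # iterates over documents
```

With a loaded tokenizer, the same check could presumably be run as `template_uses_documents(tokenizer.chat_template)` before relying on the `documents` argument.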

Contributor

A-Duss commented Sep 11, 2024

@Rocketknight1 Thanks for the clarification! I think it might be helpful to use Command-R as the model in the example within the documentation then, while noting that not all models support this feature. I’m happy to assist with this if you’re short on time.

Member

Rocketknight1 commented Sep 11, 2024

@A-Duss sure! If you want to open a PR to update the chat template docs and tag me, that'd be great. However, we'd prefer to avoid apply_grounded_generation_template(), since it's very specific to Command-R. You can get the same effect for Command-R's models using the standard apply_chat_template() function like so:

tokenizer.apply_chat_template(messages, documents=documents, chat_template="rag")
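A slightly fuller sketch of that call, wrapped in a small helper so it can be reused, might look like the following. The helper name `build_rag_prompt` is hypothetical, the "title"/"text" document keys are an assumption based on the Command-R model card examples, and the model id in the commented usage assumes you have access to that checkpoint and that its tokenizer still exposes a template named "rag".

```python
def build_rag_prompt(tokenizer, question, documents, template_name="rag"):
    """Render a single-turn prompt with retrieved documents attached.

    `tokenizer` is expected to behave like a transformers tokenizer whose
    chat templates include one named `template_name`.
    """
    conversation = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        conversation,                 # passed positionally
        documents=documents,          # only used if the template reads it
        chat_template=template_name,  # select the named template
        tokenize=False,
        add_generation_prompt=True,
    )


# Document keys are an assumption; adjust them to what the template expects.
documents = [
    {
        "title": "The Moon: Our Age-Old Foe",
        "text": "Man has always dreamed of destroying the moon. In this essay, I shall...",
    },
    {
        "title": "The Sun: Our Age-Old Friend",
        "text": "Although often underappreciated, the sun provides several notable benefits...",
    },
]

# Usage (not executed here; requires downloading the tokenizer):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
# print(build_rag_prompt(tokenizer, "Which is bigger, the moon or the sun?", documents))
```

Passing the conversation positionally avoids depending on the parameter's name.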

@selkordy
Author

I see now: when I look at the tokenizer_config, the Jinja template never references documents. I have a better understanding of how the library works.

Thank you @A-Duss and @Rocketknight1

Contributor

A-Duss commented Sep 15, 2024

> @A-Duss sure! If you want to open a PR to update the chat template docs and tag me, that'd be great. However, we'd prefer to avoid apply_grounded_generation_template(), since it's very specific to Command-R. You can get the same effect for Command-R's models using the standard apply_chat_template() function like so:
>
> tokenizer.apply_chat_template(messages, documents=documents, chat_template="rag")

Noted, I'm working on it. I will open a PR once it's looking decent.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
