Tokenizer model_max_length #47
I have the same exact question; I've changed the
Same problem here!
I found that it comes from here: during initialization, the tokenizer does not read the max length from the model's config. As a quick hack, I was able to update it to 4096 and then reinstall alignment-handbook.
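A minimal sketch of that hack, assuming an editable install (`pip install -e .`) from a local clone of the repo; the file path comes from the link below, but the exact edit and the 4096 value are just what worked for this commenter:

```python
# src/alignment/model_utils.py -- edited fallback (illustrative)
# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 4096  # was 2048; raised to match longer training sequences
```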
I'm also curious about that, especially since zephyr-7b-beta has
There is
This also threw me off and caused a bug that was unnecessarily complicated to fix. This block

```python
# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048
```

should not be there if there is a config value in the yaml. It leads to confusing results.
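A sketch of the behavior being asked for here, assuming a hypothetical `max_seq_length` field on the parsed config object (the handbook's real argument names may differ), so that an explicit YAML value wins over the hard-coded fallback:

```python
# Hypothetical guard: prefer an explicit config value, and only clamp
# when the checkpoint reports a sentinel-sized limit.
max_seq_length = getattr(config, "max_seq_length", None)  # hypothetical config field
if max_seq_length is not None:
    tokenizer.model_max_length = max_seq_length
elif tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048  # fallback for checkpoints with no stored limit
```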
Hello,
I was seeing a warning while fine-tuning Mistral and tracked it to this line:
https://github.com/huggingface/alignment-handbook/blob/main/src/alignment/model_utils.py#L71
Because Mistral's tokenizer reports a very large model_max_length, this check resets it to 2048. However, my training data contains sequences longer than that, e.g. 4000 characters. Would this be a problem?
Thank you!
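For reference, a minimal way to see the value that triggers this clamp (the model id and the 4096 override are illustrative; Mistral's tokenizer stores a huge sentinel rather than a real limit):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tok.model_max_length)  # ~1e30 sentinel: no explicit limit in the checkpoint

# The handbook's guard clamps this to 2048, so longer training sequences
# would be truncated at tokenization time; an explicit override avoids that:
tok.model_max_length = 4096
```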