
Tokenizer model_max_length #47

Open
binarycrayon opened this issue Nov 24, 2023 · 6 comments

Comments

@binarycrayon

Hello,

I was seeing a warning while fine-tuning Mistral and tracked it down to this line:

https://github.com/huggingface/alignment-handbook/blob/main/src/alignment/model_utils.py#L71

Because Mistral's tokenizer ships with a very large model max length, model_max_length gets set to 2048. However, my training data contains sequences longer than that, e.g. 4000 characters. Would this be a problem?

Thank you!
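For context, the capping behaviour is easy to reproduce outside the handbook. A minimal sketch (the checkpoint name and the sentinel value are assumptions based on this thread):

    from transformers import AutoTokenizer

    # Mistral's tokenizer ships with a huge sentinel model_max_length,
    # which trips the handbook's "> 100_000" check.
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(tokenizer.model_max_length)  # prints a ~1e30 sentinel, i.e. > 100_000
    # ...so the linked line in model_utils.py resets it to 2048.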

@eek

eek commented Nov 28, 2023

I have the exact same question. I've changed max_seq_length in config_lora.yaml to 4096, but I still get these warnings:

Token indices sequence length is longer than the specified maximum sequence length for this model (2485 > 2048). Running this sequence through the model will result in indexing errors
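Until the default is fixed, a minimal workaround sketch (assuming you can get at the tokenizer object before the dataset is tokenized) is to raise the limit yourself:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    # Lift the capped limit to match max_seq_length from config_lora.yaml, so
    # long examples encode without the "Token indices sequence length" warning.
    tokenizer.model_max_length = 4096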

@ChenDRAG

Same problem here!

@mlmonk

mlmonk commented Dec 1, 2023

I found that it comes from here. During initialization, the tokenizer does not read the max_length from the model.

As a quick hack, I was able to update it to 4096 and then reinstall alignment-handbook by running:

cd ./alignment-handbook/
python -m pip install .
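The edited fallback in src/alignment/model_utils.py would then look roughly like this (a sketch; 4096 is hard-coded purely to match the desired max_seq_length):

    # Quick-hack version of the default in get_tokenizer():
    if tokenizer.model_max_length > 100_000:
        tokenizer.model_max_length = 4096  # was 2048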

@bugface

bugface commented Jan 15, 2024

@lewtun Would you be able to comment on why max_seq_len has been hard-coded instead of being read from the config? Any reason for this decision? Thanks

@eryk-mazus

I'm also curious about this, especially since zephyr-7b-beta has model_max_length set to 1000000000000000019884624838656, so those models were somehow exempt from the cap 😐

There is a max_length argument passed in config.yaml, so setting model_max_length to some hard-coded value inside the script seems pointless.

@Shiniri

Shiniri commented May 23, 2024

This also threw me off and caused a bug that was unnecessarily complicated to fix.
I second the notion that this snippet:

    # Set reasonable default for models without max length
    if tokenizer.model_max_length > 100_000:
        tokenizer.model_max_length = 2048

should not be there if there is a config value in the yaml. It leads to confusing results.
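A config-driven variant would keep the guard but take the fallback from the yaml instead of a literal. This is only a sketch: `data_args.max_seq_length` is an assumed field name, and the real dataclass attribute may differ:

    # Prefer the max_seq_length from config.yaml; keep 2048 only as a
    # last-resort default. `data_args.max_seq_length` is an assumed name.
    if tokenizer.model_max_length > 100_000:
        tokenizer.model_max_length = getattr(data_args, "max_seq_length", None) or 2048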
