
[Severe Bug] Performance Degradation Starting from v4.42.* #31890

Closed · terryyz opened this issue Jul 10, 2024 · 44 comments

@terryyz commented Jul 10, 2024

System Info

Hi @ArthurZucker and the team,

We have noticed a severe performance degradation when trying to reproduce the BigCodeBench results: bigcode-project/bigcodebench#21. I initially thought it was due to a vllm update, but it was not. The most affected models include the 01-ai/Yi-1.5, google/codegemma, and meta-llama/Meta-Llama-3 families. Specifically, I observed some weird extra spaces in the outputs of 01-ai/Yi-1.5-9B-Chat. These spaces only appear starting from v4.42.*, not before. I later also tried generating without vllm (not documented in that issue thread), and the problem persisted, so it does appear to be an issue in transformers rather than vllm.

cc @takkyu2 who reported the original issue.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The steps to reproduce the problems and the corresponding results are described in bigcode-project/bigcodebench#21.

Expected behavior

The model output quality should be similar to that of the pre-generated LLM outputs.

@ArthurZucker (Collaborator)

Hey! I'll try to skim through the resources, but I can't really think of something that could trigger the addition of spaces. I don't know where they are added, but is this related to #26678?

@ArthurZucker (Collaborator)

Can you share a small snippet or just a string where you saw extra spaces?

@ArthurZucker (Collaborator)

fyi @itazap

@terryyz (Author) commented Jul 10, 2024

Thanks for getting back to me @ArthurZucker! The extra spaces usually appear in the prompt part of the model outputs.

Here is an example of 01-ai/Yi-1.5-9B-Chat:

{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n    \"\"\"\n    Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n    for all permutations of a given list.  Each permutation is shuffled before calculating the differences.\n\n    Args:\n    - numbers (list): A list of numbers.  Default is numbers from 1  to 10.\n    \n    Returns:\n    float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n    Requirements:\n    - itertools\n    - random. shuffle\n\n    Example:\n    >>> result = task_func([1,  2,  3 ])\n    >>> isinstance(result,  float)\n    True\n    \"\"\"\n    sum_diffs = 0\n    permutations = list(itertools.permutations(numbers))\n    for perm in permutations:\n        shuffle(perm)\n        diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n        sum_diffs += sum(diffs)\n    return sum_diffs / len(permutations)"}

The ideal one should be:

{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n    \"\"\"\n    Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n    for all permutations of a given list. Each permutation is shuffled before calculating the differences.\n\n    Args:\n    - numbers (list): A list of numbers. Default is numbers from 1 to 10.\n    \n    Returns:\n    float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n    Requirements:\n    - itertools\n    - random.shuffle\n\n    Example:\n    >>> result = task_func([1, 2, 3])\n    >>> isinstance(result, float)\n    True\n    \"\"\"\n    sum_diffs = 0\n    permutations = list(itertools.permutations(numbers))\n    for perm in permutations:\n        shuffle(perm)\n        diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n        sum_diffs += sum(diffs)\n    return sum_diffs / len(permutations)"}

For example, random. shuffle should be random.shuffle.

I haven't checked other models, so I'm not sure whether it's a common pattern or not.

@ArthurZucker (Collaborator)

Could you share how you call the tokenizer (i.e. how you initialize it)?
The Yi model you shared uses the legacy behaviour, which is known to add some spaces after "added tokens". In this case, I don't think . is an added token; however, if you tokenize sentence by sentence, you could run into this issue.

#30964 and #31305 are the only "big" changes that happened. Could you try setting add_prefix_space to False?
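
For reference, a minimal sketch of how these flags can be passed at load time (the checkpoint and the comparison below are only illustrative, not taken from this thread):

from transformers import AutoTokenizer

name = "01-ai/Yi-1.5-9B-Chat"  # affected checkpoint discussed above

# Default behaviour vs. the suggested override (a sketch, not a confirmed fix)
tok_default = AutoTokenizer.from_pretrained(name)
tok_no_prefix = AutoTokenizer.from_pretrained(name, add_prefix_space=False)

text = "random.shuffle"
print(tok_default.tokenize(text))
print(tok_no_prefix.tokenize(text))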

@ArthurZucker (Collaborator)

It seems like:

  • after each ., there is sometimes a space
  • after some digits like 1 and 3 (which are sometimes added as special tokens), there is a space
  • after ,, there is an extra space

I don't know whether this happens in encoding or in decoding, which is important for us to know in order to fix it!
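
One way to narrow that down (a rough sketch, not taken from the thread): inspect the individual tokens produced by encoding and, separately, the string produced by decoding them.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")  # affected checkpoint from above

text = "random.shuffle([1, 2, 3])"
ids = tok(text, add_special_tokens=False).input_ids

# If the per-token strings already contain the stray spaces, the problem is in encoding;
# if the tokens look right but the reassembled string does not, it is in decoding.
print(tok.convert_ids_to_tokens(ids))
print(tok.decode(ids))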

@terryyz (Author) commented Jul 10, 2024

@ArthurZucker (Collaborator)

Actually, using legacy=False, from_slow=True is most probably going to work.

@ArthurZucker (Collaborator)

The issue could just as well be the chat template call, given that this is something that was touched in that timeframe, while the tokenizer.json and tokenizer_config were not (on the hub).

@terryyz (Author) commented Jul 10, 2024

The issue could just as well be the chat template call, given that this is something that was touched in that timeframe, while the tokenizer.json and tokenizer_config were not (on the hub).

Yeah, that matches my thinking. We've only tested chat models. However, Qwen/CodeQwen1.5-7B-Chat may not be affected. Do you have any ideas? bigcode-project/bigcodebench#21 (comment)

@ArthurZucker (Collaborator)

Yes, it uses the PreTrainedTokenizerFast class vs. the LlamaTokenizer class -> the tokenizer.json is different and does not use the legacy prepend normalizer. It uses:

[screenshot of the normalizer section of tokenizer.json]

and "add_prefix_space": false in the config.

@terryyz (Author) commented Jul 10, 2024

add_prefix_space=False: extra space exists.
legacy=False, from_slow=True: extra space exists.
AutoTokenizer.from_pretrained(name, add_prefix_space=False, legacy=False, from_slow=True, **kwargs) also won't work.
version: v4.42.3 😢

My bad, the extra space no longer exists.

I'll check the final results to see if the scores are similar to the reported ones.

@terryyz (Author) commented Jul 10, 2024

@ArthurZucker BTW, if add_prefix_space=False fixes the performance degradation, is it still considered a tokenizer bug? I mean, I didn't expect to have to pass add_prefix_space=False, for example 🤔

@ArthurZucker (Collaborator)

I don't really know, because I don't remember exactly how the tokenizer is called / how the tokenizer was trained.
Basically, for a lot of models the add_prefix_space value comes from the sentencepiece model. We just follow what is there, but this was not the case before.

What fixed your issue? add_prefix_space=False? Or legacy=False? (I would try to leave add_prefix_space alone and just set legacy to False.)

@terryyz (Author) commented Jul 10, 2024

Both add_prefix_space=False and legacy=False work for transformers only, not for vllm.

I'll run the full generation with legacy=False, which takes a bit of time with pure transformers.

@terryyz (Author) commented Jul 10, 2024

@ArthurZucker When generating with legacy=False, I noticed that some samples couldn't be generated correctly:

[screenshot of the incorrectly generated samples]

There must be some other issue, I guess?

@ArthurZucker (Collaborator)

I am not super sure; I don't know how vLLM calls the tokenizer, but apply_chat_template might be doing something here!
I'm sorry, it's a bit hard for me to tell. The best would be if you could get me the chat that you send through the model (if you call apply_chat_template)!
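
Something like the following would capture exactly what the template produces (a sketch; the message content is only a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
messages = [{"role": "user", "content": "Write a function that ..."}]  # placeholder prompt

# The rendered chat string that would be fed to the model
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(rendered))

# The token ids actually produced by the template call
ids = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(ids)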

@terryyz (Author) commented Jul 10, 2024

The first few prompts sent in the vLLM calls: yi_vllm.txt. They look correct, but the outputs are bad.

Do you also need the inputs for the transformers calls, to see why some outputs are missing?

@ArthurZucker (Collaborator)

It would be nice to have the input token ids and the input text / chat, yep.

@ArthurZucker (Collaborator)

Yes, for transformers only!

@ArthurZucker (Collaborator)

cc @Rocketknight1 if you have an idea as well

@terryyz (Author) commented Jul 10, 2024

Here are the top 10 pairs of prompts and token ids!
yi_hf.txt

@itazap (Collaborator) commented Jul 11, 2024

Hi @terryyz, can you please also share the outputs you are seeing for those prompts, to show the issue? Thanks!

@terryyz (Author) commented Jul 11, 2024

Hi @itazap, are there any specific files you want to see, or just the ones where the model degraded? If it's the latter, there are plenty of them in the original thread: bigcode-project/bigcodebench#21

@itazap (Collaborator) commented Jul 11, 2024

@terryyz Sorry, my understanding was that add_prefix_space=False and legacy=False fixed the degradation for transformers? Can you please clarify what you meant / the current issue I can look into? Thank you!

@terryyz (Author) commented Jul 11, 2024

@itazap Unfortunately, legacy=False hasn't fixed the model degradation. While I noticed that the extra spaces were somehow removed, the new issues shown in #31890 (comment) were unforeseen: a lot of the outputs just disappear after the import statements.

@terryyz (Author) commented Jul 12, 2024

@itazap @ArthurZucker Update:

from stop_sequencer import StopSequencer  # https://github.com/hyunwoongko/stop-sequencer

# Stop strings used in addition to the regular EOS tokens
EOS = [
    "<|endoftext|>",
    "<|endofmask|>",
    "</s>",
    "\nif __name__",
    "\ndef main(",
    "\nprint(",
]

# Wrap the model so that generation halts once any of the stop texts is produced
stop_sequencer = StopSequencer(
    self.model,
    model_type="causal",  # or seq2seq
    tokenizer=self.tokenizer,
)

model = stop_sequencer.register_stop_texts(
    stop_texts=self.eos,
    input_length=input_tokens.size(-1),
)
outputs = model.generate(
    input_tokens,
    max_new_tokens=self.max_new_tokens,
    do_sample=do_sample,
    num_return_sequences=min(self.batch_size, num_samples),
    pad_token_id=self.tokenizer.eos_token_id,
    **kwargs,
)
# self.eos: ['<|endoftext|>', '<|endofmask|>', '</s>', '\nif __name__', '\ndef main(', '\nprint(', '\n```\n']

It seems the above part caused the generation to stop abruptly. However, the EOS strings don't appear in the outputs. Do you know if this is expected? I have a similar setup for vLLM, for example:

vllm_outputs = self.llm.generate(
    [prompt] * batch_size,
    SamplingParams(
        temperature=self.temperature,
        max_tokens=self.max_new_tokens,
        top_p=0.95 if do_sample else 1.0,
        stop=self.eos,
    ),
    use_tqdm=False,
)

This is the codebase of stop_sequencer: https://github.com/hyunwoongko/stop-sequencer. It may no longer be compatible with recent transformers, IMO?

vLLM may have a different setup for the tokenizer; should I cross-post this issue there?
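
If stop_sequencer is indeed the problem, a rough alternative is transformers' built-in StoppingCriteria API; a sketch (assuming batch size 1 and the self.model / self.tokenizer / self.eos / input_tokens names from the snippet above):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTexts(StoppingCriteria):
    """Stop generation once any of the given strings appears in the newly generated text."""

    def __init__(self, stop_texts, tokenizer, prompt_length):
        self.stop_texts = stop_texts
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:], skip_special_tokens=False)
        return any(stop in new_text for stop in self.stop_texts)

# usage sketch, reusing the names from the snippet above:
# outputs = self.model.generate(
#     input_tokens,
#     max_new_tokens=self.max_new_tokens,
#     stopping_criteria=StoppingCriteriaList(
#         [StopOnTexts(self.eos, self.tokenizer, input_tokens.size(-1))]
#     ),
#     pad_token_id=self.tokenizer.eos_token_id,
# )

Newer transformers releases also expose a stop_strings argument on generate (passed together with tokenizer=...), which, if available in your version, might remove the need for the external package.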

@itazap (Collaborator) commented Jul 12, 2024

Sorry, I find it a bit hard to follow: which outputs are you looking for the EOS in?

@terryyz (Author) commented Jul 12, 2024

Oh, it's a further investigation of #31890 (comment). I removed the stop_sequencer part and checked against the ideal outputs.

@terryyz (Author) commented Jul 15, 2024

@itazap @ArthurZucker Is there going to be a PR to make the correct setup the default? 👀

And since vLLM possibly has a different tokenizer implementation, should we inform them?

@ArthurZucker (Collaborator)

Sorry, what do you mean by "correct setup"? 🤗

@terryyz (Author) commented Jul 15, 2024

I mean initializing with AutoTokenizer.from_pretrained(self.tokenizer_name), without add_prefix_space=False or legacy=False. Or are we supposed to pass these arguments from now on? 👀

@ArthurZucker (Collaborator)

I think the best is to open a PR on the hub! I can't merge it for you, but pinging the authors with this issue should be good already! WDYT?

@ArthurZucker (Collaborator)

We can't change the default on our side because it would break a lot of other models.

@terryyz (Author) commented Jul 16, 2024

@ArthurZucker Do you mean that setting legacy=False may break other models? Is there a way to determine which models degrade due to the current tokenizer change, or is the list of affected models entirely based on empirical results?

@terryyz (Author) commented Jul 16, 2024

@ArthurZucker Pinging again to check on the answers 👀

@itazap (Collaborator) commented Jul 17, 2024

legacy=False includes merges that fixed tokenization issues (Llama example here). However, many models were trained prior to this, so updating the default would cascade into those older models and, while not necessarily breaking them, change their behaviour.

In terms of which models this relates to, there is no list, but it can be determined by checking the add_prefix_space default value before and after a given model's release. If it's the same, then we can rule this parameter out and look into other legacy features affecting this.
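
A quick illustrative check of that value (a sketch; the checkpoint is just the one discussed above, and it would need to be run under each transformers version being compared):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

# Effective value on the loaded tokenizer (None means the attribute is not exposed)
print(getattr(tok, "add_prefix_space", None))

# Value passed through from tokenizer_config.json, if any
print(tok.init_kwargs.get("add_prefix_space"))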

@terryyz (Author) commented Jul 17, 2024

Thanks @itazap! Will this rule-based checking be added in the next transformers release? I can still use AutoTokenizer.from_pretrained(self.tokenizer_name, **kwargs) by default for now. I hope a future transformers release can resolve this issue automatically :)

@ArthurZucker (Collaborator)

But @terryyz IMO the best solution is to merge the fix into the model(s) that you are using.
T5, for example, should not have this.

We could indeed check the transformers versions and force legacy, but we can't really predict what the user actually wants.

New models like Llama3 or Gemma have this set to false.

@ArthurZucker (Collaborator) commented Jul 17, 2024

But we can do a deprecation cycle: if legacy is set to None, we warn that in the next release it will default to False instead of True. cc @itazap wdyt?
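
Something along these lines, purely as a sketch of the proposed cycle (not actual transformers code):

import warnings

def resolve_legacy(legacy=None):
    # Proposed deprecation cycle: warn when the user did not set `legacy` explicitly.
    if legacy is None:
        warnings.warn(
            "`legacy` was not set. It currently defaults to True, but will default to False "
            "in a future release. Pass legacy=True or legacy=False explicitly to silence this warning.",
            FutureWarning,
        )
        legacy = True  # keep the current behaviour for now
    return legacy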

@itazap (Collaborator) commented Jul 17, 2024

@ArthurZucker As you said, it's hard to predict which is more aligned with current users' expectations! With the warning it should be okay!

@terryyz (Author) commented Jul 17, 2024

Thanks @ArthurZucker and @itazap! That makes sense :) I've now added an optional argument to set legacy=True for when users notice unexpected behaviour.

@itazap (Collaborator) commented Jul 19, 2024

Thanks for your patience! Let us know if you have any further issues! 🤗


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
