
Padding side incorrect for Mistral DPO #1217

Closed
hengjiUSTC opened this issue Jan 11, 2024 · 9 comments · Fixed by #1290

Comments

@hengjiUSTC

hengjiUSTC commented Jan 11, 2024

Regarding the code here: https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L538C4-L550, does it always pad to the right?

I get an error when running DPO for Mistral 7B.

I am still seeing this error after setting tokenizer.padding_side = 'left'.
This is my demo notebook: https://colab.research.google.com/drive/1sVqbYEOqjJYl7CzNzXzviEBB6A984cMq?usp=sharing

Tokenizer already set with left padding
[Screenshot 2024-01-10 22:53: tokenizer configured with padding_side='left']

Train
[Screenshot 2024-01-11 10:28: training cell]

I still get: ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call tokenizer.padding_side = 'left' before tokenizing the input.

[Screenshot 2024-01-10 22:54: error traceback]

It seems https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L538C4-L550 or some other part of the code might be causing this bug?

transformers 4.36.2
trl 0.7.7
peft 0.6.0
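
For context, a stripped-down sketch of the kind of setup that triggers this (the checkpoint name and loading arguments here are illustrative; the full setup is in the linked notebook):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # set as the error message suggests

# The Flash Attention 2 path is the one that performs the padding_side check.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Training this model with trl's DPOTrainer still hits the ValueError above.
```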

@hengjiUSTC hengjiUSTC changed the title Padding side incorrect for Mixtral DPO Padding side incorrect for Mistral DPO Jan 11, 2024
@hengjiUSTC
Author

I did some debugging and am fairly confident that https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L538C4-L550 gives the incorrect padding side and leads to the crash in transformers.

File ~/learn-llm/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:899, in MistralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    897     is_padding_right = attention_mask[:, -1].sum().item() != batch_size
    898     if is_padding_right:
--> 899         raise ValueError(
    900             "You are attempting to perform batched generation with padding_side='right'"
    901             " this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to "
    902             " call `tokenizer.padding_side  = 'left'` before tokenizing the input. "
    903         )
    905 if self._use_flash_attention_2:
    906     # 2d mask is passed through the layers
    907     attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None

ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to  call `tokenizer.padding_side  = 'left'` before tokenizing the input. 
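
For reference, the check that raises here only looks at the last column of the attention mask: with right padding, shorter sequences end in zeros, so the column sum is smaller than the batch size. A toy sketch in plain PyTorch (not tied to Mistral) of the same condition:

```python
import torch

# Right-padded batch of three sequences with lengths 5, 3 and 4
# (1 = real token, 0 = padding).
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
])
batch_size = attention_mask.shape[0]

# Same condition as in MistralModel.forward: the last column must be all ones,
# which only holds when the padding sits on the left.
is_padding_right = attention_mask[:, -1].sum().item() != batch_size
print(is_padding_right)  # True -> the ValueError above would be raised
```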

@gchhablani

Can I take this up @vwxyzjn?

@arkapal3

I had the same issue and went down a debugging rabbit hole, but realised in the end I had forgotten to set use_cache to False for my reference model.

Remember you need to set use_cache = False for both the main and reference model. If you do so, the check on is_padding_right above won't trigger (which it shouldn't, because we are only using the forward pass to compute log probs, not doing generation).
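
A minimal sketch of what that looks like in code (the checkpoint name is illustrative; the rest is plain transformers API):

```python
from transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative; any Mistral checkpoint

model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)

# DPO only runs forward passes to compute log probs, so the KV cache is not
# needed; leaving use_cache=True is what sends Mistral's Flash Attention path
# into the padding_side check shown in the traceback above.
model.config.use_cache = False
ref_model.config.use_cache = False

# ...then pass both `model` and `ref_model` to DPOTrainer as usual.
```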

@hengjiUSTC
Author

> I had the same issue and went down a debugging rabbit hole, but realised in the end I had forgotten to set use_cache to False for my reference model.
>
> Remember you need to set use_cache = False for both the main and reference model. If you do so, the check on is_padding_right above won't trigger (which it shouldn't, because we are only using the forward pass to compute log probs, not doing generation).

I checked, and this solution works.

@vwxyzjn
Contributor

vwxyzjn commented Jan 12, 2024

Hi thanks for the issue. Maybe during training the tokenizer.padding_side should be set to right?

@hengjiUSTC
Author

> Hi thanks for the issue. Maybe during training the tokenizer.padding_side should be set to right?

Is there a reason training should use right padding? Most of the DPO scripts I have seen set the Mistral tokenizer padding to left.

@vwxyzjn
Contributor

vwxyzjn commented Jan 12, 2024

In both our repo and DPO's original repo, we do right padding. Out of curiosity, which DPO scripts have you seen that pad left? Left padding is primarily used for generation, such as in PPO, but it's unnecessary in DPO because DPO does not do generation.

[Screenshots: right padding used in the TRL and original DPO repositories]
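
To make the padding-side distinction concrete, a small sketch (using a GPT-2 tokenizer purely for illustration) of how tokenizer.padding_side changes where the pad tokens land:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this demo
tok.pad_token = tok.eos_token

batch = ["short", "a much longer example sentence"]

tok.padding_side = "right"   # typical for training / forward-only loss computation
right = tok(batch, padding=True, return_tensors="pt")

tok.padding_side = "left"    # needed for batched generation, e.g. in PPO
left = tok(batch, padding=True, return_tensors="pt")

print(right["attention_mask"])  # zeros at the end of the shorter row
print(left["attention_mask"])   # zeros at the start of the shorter row
```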

@hengjiUSTC
Author

hengjiUSTC commented Jan 12, 2024

@younesbelkada
Contributor

This should now be fixed on TRL main!
