Normalize negative_prompt in LongCatAudioDiT pipeline for CFG symmetry #13525

Open

Ricardo-M-L wants to merge 1 commit into huggingface:main from
Conversation

What this PR does
The LongCatAudioDiT pipeline pre-processes positive prompts through
`_normalize_text` before encoding:
```python
normalized_prompts = [_normalize_text(text) for text in prompt]
...
prompt_embeds, prompt_embeds_len = self.encode_prompt(normalized_prompts, device)
```
`_normalize_text` lowercases the string, strips ASCII and curly quote
characters (`"“”‘’`), and collapses whitespace. The upstream reference
implementation (`normalize_text` in `utils.py` of
`meituan-longcat/LongCat-AudioDiT`) applies the same normalization to
every piece of text it feeds to the UMT5 text encoder.
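For reference, the behavior described above amounts to roughly the following minimal sketch; the pipeline's actual `_normalize_text` may differ in implementation details:
```python
import re

def _normalize_text(text: str) -> str:
    # Sketch of the described normalization, not the pipeline's exact code.
    text = text.lower()                        # lowercase everything
    text = re.sub(r'["“”‘’]', "", text)        # strip ASCII and curly quotes
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text
```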
However, when a user passes a `negative_prompt`, the pipeline encodes
it raw:
```python
negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(
    negative_prompt, device
)
```
That creates a distribution mismatch between the conditional and
unconditional text embeddings used in CFG — the model was trained
against normalized text on both branches, so a raw negative prompt
lands off-distribution and weakens/distorts guidance rather than
guiding away from the described audio. Empty/`None` `negative_prompt`
is unaffected (that path uses a zero tensor to match the reference).
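To make the mechanism concrete, here is a generic sketch of the standard classifier-free guidance combination; the function and variable names are illustrative, not the pipeline's actual code:
```python
import torch

def cfg_mix(cond: torch.Tensor, uncond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    # Standard CFG update: push the prediction away from the negative branch.
    # If `uncond` was produced from un-normalized text while `cond` used
    # normalized text, the (cond - uncond) direction partly encodes
    # normalization noise (case, quote glyphs, whitespace) instead of a
    # clean "target - avoid" signal.
    return uncond + guidance_scale * (cond - uncond)
```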
Why this is a bug
The model was trained with normalized text on both branches, so a raw negative prompt lands off-distribution relative to the normalized positive prompt. In classifier-free guidance, the `(pos - neg)` delta no longer represents a clean "target − avoid" direction; it is partly absorbing normalization noise (case, quote glyphs, whitespace). That weakens or distorts guidance rather than guiding away from what the user described.

Examples that currently fail to behave symmetrically:

- `negative_prompt="Loud Noise"` encodes differently from the positive path's `"loud noise"`
- `negative_prompt='"speech"'` keeps the quote glyphs that the positive path would strip
- `negative_prompt="multiple  speakers"` keeps the extra whitespace
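Under the normalization behavior sketched earlier, all three inputs collapse onto the form the positive branch would produce (the outputs below assume that sketch, not the pipeline's exact code):
```python
_normalize_text("Loud Noise")          # -> "loud noise"
_normalize_text('"speech"')            # -> "speech"
_normalize_text("multiple  speakers")  # -> "multiple speakers"
```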
Fix

Apply the same `_normalize_text` pass to user-supplied negative prompts before encoding, restoring symmetry with the positive branch and with the reference pipeline:

```diff
 if negative_prompt is None or (isinstance(negative_prompt, str) and negative_prompt == ""):
     negative_prompt_embeds = torch.zeros(...)
     negative_prompt_embeds_len = torch.tensor([1] * batch_size, device=device)
 else:
-    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(negative_prompt, device)
+    normalized_negative_prompts = [_normalize_text(text) for text in negative_prompt]
+    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(
+        normalized_negative_prompts, device
+    )
```

Before submitting
Who can review?
@yiyixuxu @sayakpaul