Normalize negative_prompt in LongCatAudioDiT pipeline for CFG symmetry #13525

Open

Ricardo-M-L wants to merge 1 commit into huggingface:main from
Conversation

What this PR does
The LongCatAudioDiT pipeline pre-processes positive prompts through
`_normalize_text` before encoding:
```python
normalized_prompts = [_normalize_text(text) for text in prompt]
...
prompt_embeds, prompt_embeds_len = self.encode_prompt(normalized_prompts, device)
```
`_normalize_text` lowercases the string, strips ASCII and curly quote
characters (`"“”‘’`), and collapses whitespace. The upstream reference
implementation (`normalize_text` in `utils.py` of
`meituan-longcat/LongCat-AudioDiT`) applies the same normalization to
every piece of text it feeds to the UMT5 text encoder.
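For reference, the behavior described above amounts to roughly the following minimal sketch; the pipeline's actual `_normalize_text` may differ in implementation details:
```python
import re

def _normalize_text(text: str) -> str:
    # Sketch of the described normalization, not the pipeline's exact code.
    text = text.lower()                        # lowercase everything
    text = re.sub(r'["“”‘’]', "", text)        # strip ASCII and curly quotes
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text
```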
However, when a user passes a `negative_prompt`, the pipeline encodes
it raw:
```python
negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(
    negative_prompt, device
)
```
That creates a distribution mismatch between the conditional and
unconditional text embeddings used in CFG — the model was trained
against normalized text on both branches, so a raw negative prompt
lands off-distribution and weakens/distorts guidance rather than
guiding away from the described audio. Empty/`None` `negative_prompt`
is unaffected (that path uses a zero tensor to match the reference).
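To make the mechanism concrete, here is a generic sketch of the standard classifier-free guidance combination; the function and variable names are illustrative, not the pipeline's actual code:
```python
import torch

def cfg_mix(cond: torch.Tensor, uncond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    # Standard CFG update: push the prediction away from the negative branch.
    # If `uncond` was produced from un-normalized text while `cond` used
    # normalized text, the (cond - uncond) direction partly encodes
    # normalization noise (case, quote glyphs, whitespace) instead of a
    # clean "target - avoid" signal.
    return uncond + guidance_scale * (cond - uncond)
```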
Why this is a bug
The model was trained with normalized text on both branches, so a raw negative prompt lands off-distribution relative to the normalized positive prompt. In classifier-free guidance, the `(pos - neg)` delta no longer represents a clean "target − avoid" direction; it is partly absorbing normalization noise (case, quote glyphs, whitespace). That weakens or distorts guidance rather than guiding away from what the user described.

Examples that currently fail to behave symmetrically:

- `negative_prompt="Loud Noise"` encodes differently from the positive path's `"loud noise"`
- `negative_prompt='"speech"'` keeps the quote glyphs that the positive path would strip
- `negative_prompt="multiple  speakers"` keeps the extra whitespace
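Under the normalization behavior sketched earlier, all three inputs collapse onto the form the positive branch would produce (the outputs below assume that sketch, not the pipeline's exact code):
```python
_normalize_text("Loud Noise")          # -> "loud noise"
_normalize_text('"speech"')            # -> "speech"
_normalize_text("multiple  speakers")  # -> "multiple speakers"
```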
Fix

Apply the same `_normalize_text` pass to user-supplied negative prompts before encoding, restoring symmetry with the positive branch and with the reference pipeline:

```diff
 if negative_prompt is None or (isinstance(negative_prompt, str) and negative_prompt == ""):
     negative_prompt_embeds = torch.zeros(...)
     negative_prompt_embeds_len = torch.tensor([1] * batch_size, device=device)
 else:
-    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(negative_prompt, device)
+    normalized_negative_prompts = [_normalize_text(text) for text in negative_prompt]
+    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(
+        normalized_negative_prompts, device
+    )
```

Before submitting
Who can review?
@yiyixuxu @sayakpaul