[feat]: implement "local" caption upsampling for Flux.2 #12718

sayakpaul · 2025-11-26T06:50:35Z

What does this PR do?

Test code:

from diffusers import Flux2Pipeline 
import torch 

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
image = pipe(
    prompt=prompt,
    height=768,
    width=1360,
    generator=torch.Generator(device="cuda").manual_seed(42),
    caption_upsample_temperature=0.15,
).images[0]
image.save("upsampling.png")
print(f"{torch.cuda.max_memory_reserved() / (1024 ** 3)=}")

Generated upsampled prompt:

A serene and atmospheric forest scene, captured in a high-resolution photograph, showcases towering, ancient trees with thick, gnarled trunks and sprawling branches that create a dense canopy overhead. The forest floor is carpeted with a lush layer of moss, ferns, and fallen leaves, adding a sense of depth and texture to the image. Mist swirls gently around the tree trunks, creating a dreamy, ethereal atmosphere. The mist is illuminated by soft, diffused light that filters through the canopy, casting dappled shadows and highlighting the intricate details of the bark and foliage. The overall color palette is muted and natural, with shades of green, brown, and gray dominating the scene. In the foreground, the word "FLUX.2" is painted in bold, red brush strokes with visible texture, standing out against the natural backdrop. The paint appears wet and glossy, with visible brushstrokes and subtle drips, adding a dynamic and artistic element to the image. The text is centrally placed and slightly tilted, drawing the viewer's eye and adding a sense of movement to the otherwise tranquil scene.

Output

[Comparison images: No prompt upsampling | Prompt upsampling]

Notes

I decided to create a system_messages.py module under src/diffusers/pipelines/flux2 so that other pipelines derived from Flux2 can easily reuse it.
Caption upsampling runs only when caption_upsample_temperature is set. It defaults to None, in which case the prompt is used as-is.
The image processor changes accommodate this method from the original codebase. Open to other suggestions, of course, on how best to accommodate it.
The original codebase implements an OpenRouter client with a larger Pixtral model for doing prompt upsampling remotely. Could we do that through Inference Endpoints? (cc: @apolinario @ariG23498)
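
To make the gating described in the notes concrete, here is a minimal sketch of how the temperature argument could switch upsampling on and off. The helper names (maybe_upsample_prompt, fake_upsampler) are hypothetical and only illustrate the control flow; they are not the pipeline's actual internals, and the real upsampling step is a model generation call rather than string formatting.

```python
from typing import Callable, Optional


def maybe_upsample_prompt(
    prompt: str,
    upsample_fn: Callable[[str, float], str],
    caption_upsample_temperature: Optional[float] = None,
) -> str:
    """Return the prompt unchanged unless a temperature is provided."""
    # None (the default) means "skip caption upsampling entirely".
    if caption_upsample_temperature is None:
        return prompt
    # Otherwise delegate to the upsampler, forwarding the sampling temperature.
    return upsample_fn(prompt, caption_upsample_temperature)


def fake_upsampler(prompt: str, temperature: float) -> str:
    # Hypothetical stand-in for the model call; it only tags the prompt.
    return f"{prompt} [upsampled at T={temperature}]"


print(maybe_upsample_prompt("a cat", fake_upsampler))
print(maybe_upsample_prompt("a cat", fake_upsampler, 0.15))
```

Using None as the "off" value keeps the default behavior of the pipeline unchanged for existing callers.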

HuggingFaceDocBuilderDev · 2025-11-26T07:07:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul · 2025-11-26T09:59:41Z

src/diffusers/pipelines/flux2/system_messages.py

@@ -0,0 +1,29 @@
+"""
+These system prompts come from:


As discussed internally, changing the newline characters in these prompts hurts output quality a bit. Hence, I have decided to keep these system messages exactly the same as in the original implementation linked above.

If we run make style && make quality, this formatting will be completely destroyed. We can change the pyproject.toml to exclude this path from getting formatted. But before we do that, let's see if this is the best we have.

yiyixuxu

Thanks a lot for working on this!
I left some feedback!

yiyixuxu · 2025-11-26T19:23:33Z

src/diffusers/pipelines/flux2/image_processor.py


    @staticmethod
-    def _resize_to_target_area(image: PIL.Image.Image, target_area: int = 1024 * 1024) -> Tuple[int, int]:
+    def _resize_to_target_area(


Ohh, do you want to add a new method called something like _resize_if_exceeds_area? Or rename this one, if we only use it this way.

Yup.

I created _resize_if_exceeds_area() which is basically:

def _resize_if_exceeds_area(image, target_area=1024 * 1024) -> PIL.Image.Image:
    image_width, image_height = image.size
    pixel_count = image_width * image_height
    if pixel_count <= target_area:
        return image
    return Flux2ImageProcessor._resize_to_target_area(image, target_area)
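
For intuition about the size arithmetic, here is a small sketch that works on (width, height) tuples only. It assumes the downscale preserves aspect ratio by multiplying both sides by sqrt(target_area / pixel_count) and truncating to integers; the actual _resize_to_target_area() may round differently, so treat the numbers as illustrative. The function name size_if_exceeds_area is hypothetical.

```python
import math
from typing import Tuple


def size_if_exceeds_area(width: int, height: int, target_area: int = 1024 * 1024) -> Tuple[int, int]:
    """Return the size unchanged if it fits in target_area, else a downscaled size."""
    pixel_count = width * height
    # Images at or below the budget are left untouched.
    if pixel_count <= target_area:
        return width, height
    # Assumed rule: scale both sides equally so the new area is about target_area.
    scale = math.sqrt(target_area / pixel_count)
    return int(width * scale), int(height * scale)


print(size_if_exceeds_area(512, 512))
print(size_if_exceeds_area(2048, 2048))
print(size_if_exceeds_area(4096, 1024))
```

The point of the early return is that small reference images are never upscaled, only large ones are shrunk.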

yiyixuxu · 2025-11-26T19:28:29Z

src/diffusers/pipelines/flux2/pipeline_flux2.py

+
+# Adapted from
+# https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L49C5-L66C19
+def _validate_and_process_images(


Can we have a separate step to validate and process the images, and then run format_input?

Yes. We now first call _validate_and_process_images() and then pass the resulting images to format_input().

ariG23498 · 2025-11-27T04:21:40Z

@sayakpaul this is really nice. Do you want me to start working on an Endpoint? We can take this conversation to a private Slack thread and see how this works.

Update:

After inspecting, it turns out that the model needs upwards of 300 GB of GPU memory to run (from the official model card):

note that running this model on GPU requires over 300 GB of GPU RAM

Building a free Inference Endpoint does not seem feasible to @sayakpaul and me, so we are benching this project. Another option would be to route through Inference Providers, but we have not seen a need (other than this specific one) to let our providers host this model.

sayakpaul added 6 commits November 26, 2025 10:31

feat: implement caption upsampling for flux.2.

7350d07

doc

e6a0ab6

up

b4a8406

fix

0b1f884

up

ceb8a3a

Merge branch 'main' into flux2-upsample

b07bee3

sayakpaul requested review from dg845 and yiyixuxu November 26, 2025 06:59

fix system prompts 🤷‍

82685f2


up

6397a67


5 participants
