From d67491e7dcc58610625f3cd7a2e3f55c038c81d8 Mon Sep 17 00:00:00 2001 From: Nazar Kozak Date: Fri, 24 Apr 2026 23:34:53 -0700 Subject: [PATCH 1/2] docs(flux2): clarify image= is reference conditioning, not img2img --- docs/source/en/api/pipelines/flux2.md | 20 +++++++++++++++++++ .../pipelines/flux2/pipeline_flux2.py | 14 ++++++++----- .../pipelines/flux2/pipeline_flux2_klein.py | 14 ++++++++----- 3 files changed, 38 insertions(+), 10 deletions(-) diff --git a/docs/source/en/api/pipelines/flux2.md b/docs/source/en/api/pipelines/flux2.md index 2a2b39b95630..e62e877bf1c5 100644 --- a/docs/source/en/api/pipelines/flux2.md +++ b/docs/source/en/api/pipelines/flux2.md @@ -32,6 +32,26 @@ Flux.2 can potentially generate better better outputs with better prompts. We ca an input prompt by setting the `caption_upsample_temperature` argument in the pipeline call arguments. The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L140) recommends this value to be 0.15. +## Reference conditioning vs. img2img + +The `image` argument on `Flux2Pipeline` and `Flux2KleinPipeline` is **reference conditioning**, not +img2img. Reference images are encoded into additional attention tokens that flow through the +transformer alongside the text prompt — there is no noisy latent initialization, and so no `strength` +parameter to scale. + +This differs from `StableDiffusionImg2ImgPipeline`, `FluxImg2ImgPipeline`, and +`FluxKontextInpaintPipeline`, which add noise to a latent encoding of the input image and then +partially denoise it. If you port code from those pipelines and pass `strength=...` to a Flux.2 +pipeline, you will see: + +``` +TypeError: Flux2Pipeline.__call__() got an unexpected keyword argument 'strength' +``` + +Drop the `strength` kwarg and pass references via `image=` (a single image, or a list for multiple +references). 
For Flux.2 inpainting (which does add noise to a latent and therefore does take a
+`strength` parameter), use `Flux2KleinInpaintPipeline` instead.
+
 ## Flux2Pipeline
 
 [[autodoc]] Flux2Pipeline
diff --git a/src/diffusers/pipelines/flux2/pipeline_flux2.py b/src/diffusers/pipelines/flux2/pipeline_flux2.py
index 4b60c6042d4f..2602a49934a9 100644
--- a/src/diffusers/pipelines/flux2/pipeline_flux2.py
+++ b/src/diffusers/pipelines/flux2/pipeline_flux2.py
@@ -769,11 +769,15 @@ def __call__(
 Args:
 image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `list[torch.Tensor]`, `list[PIL.Image.Image]`, or `list[np.ndarray]`):
- `Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
- numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
- or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
- list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
- latents as `image`, but if passing latents directly it is not encoded again.
+ Reference image(s) used to condition generation. Flux.2 encodes them as additional attention tokens that
+ flow through the transformer alongside the text prompt; this is **reference conditioning**, not
+ SD/Flux.1-style img2img, so there is no companion `strength` argument. Pass a list to provide multiple
+ references.
+
+ For both numpy array and pytorch tensor, the expected value range is `[0, 1]`. If it's a tensor
+ or a list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy
+ array or a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. Can also accept
+ image latents directly, in which case they will not be re-encoded.
 prompt (`str` or `list[str]`, *optional*): The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. instead.
diff --git a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
index 1f3b5c3c4fde..6593e81dba5f 100644
--- a/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
+++ b/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
@@ -635,11 +635,15 @@ def __call__(
 Args:
 image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
- `Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
- numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
- or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
- list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
- latents as `image`, but if passing latents directly it is not encoded again.
+ Reference image(s) used to condition generation. Flux.2 encodes them as additional attention tokens that
+ flow through the transformer alongside the text prompt; this is **reference conditioning**, not
+ SD/Flux.1-style img2img, so there is no companion `strength` argument. Pass a list to provide multiple
+ references.
+
+ For both numpy array and pytorch tensor, the expected value range is `[0, 1]`. If it's a tensor
+ or a list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy
+ array or a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. Can also accept
+ image latents directly, in which case they will not be re-encoded.
 prompt (`str` or `List[str]`, *optional*): The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. instead.
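The shape and value-range conventions the docstrings above describe can be illustrated with a small numpy sketch (illustrative only; it does not call any diffusers pipeline):

```python
import numpy as np

# Numpy reference images are channels-last with values in [0, 1]:
# (H, W, C) for a single image, (B, H, W, C) for a batch.
rng = np.random.default_rng(0)
img_hwc = rng.random((64, 64, 3), dtype=np.float32)

# Tensors use channels-first instead: (C, H, W), or (B, C, H, W) with a
# leading batch axis. Converting is a transpose plus an added axis.
img_chw = np.transpose(img_hwc, (2, 0, 1))
batch_bchw = img_chw[None, ...]

print(img_hwc.shape)     # (64, 64, 3)
print(img_chw.shape)     # (3, 64, 64)
print(batch_bchw.shape)  # (1, 3, 64, 64)
```

A `(H, W, C)` array and its `(C, H, W)` transpose hold the same pixels; only the layout the pipeline expects differs between the numpy and tensor input paths.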
From 2db8a9e7330c61de8a1aa6649822223d28788f56 Mon Sep 17 00:00:00 2001
From: Nazar Kozak
Date: Mon, 27 Apr 2026 14:18:30 -0700
Subject: [PATCH 2/2] =?UTF-8?q?docs(flux2):=20address=20review=20=E2=80=94?= =?UTF-8?q?=20tighten=20wording,=20use=20doc-link=20refs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Apply @stevhliu's suggestions:

- Compress 'Reference conditioning vs. img2img' section.
- Use [`ClassName`] cross-reference syntax for Flux2Pipeline, Flux2KleinPipeline, FluxImg2ImgPipeline, Flux2KleinInpaintPipeline.
- Drop the 'This differs from...' paragraph; the TypeError example alone makes the point.
---
 docs/source/en/api/pipelines/flux2.md | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/docs/source/en/api/pipelines/flux2.md b/docs/source/en/api/pipelines/flux2.md
index e62e877bf1c5..9cd724aea0ef 100644
--- a/docs/source/en/api/pipelines/flux2.md
+++ b/docs/source/en/api/pipelines/flux2.md
@@ -34,23 +34,14 @@ The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a
 
 ## Reference conditioning vs. img2img
 
-The `image` argument on `Flux2Pipeline` and `Flux2KleinPipeline` is **reference conditioning**, not
-img2img. Reference images are encoded into additional attention tokens that flow through the
-transformer alongside the text prompt — there is no noisy latent initialization, and so no `strength`
-parameter to scale.
-
-This differs from `StableDiffusionImg2ImgPipeline`, `FluxImg2ImgPipeline`, and
-`FluxKontextInpaintPipeline`, which add noise to a latent encoding of the input image and then
-partially denoise it. If you port code from those pipelines and pass `strength=...` to a Flux.2
-pipeline, you will see:
+The `image` argument on [`Flux2Pipeline`] and [`Flux2KleinPipeline`] is *reference conditioning*. Reference images are encoded as additional attention tokens that flow through the
+transformer alongside the text prompt.
Unlike [`FluxImg2ImgPipeline`], Flux.2 doesn't add noise to the input image. Passing `strength` to [`Flux2Pipeline`] raises:
 
 ```
 TypeError: Flux2Pipeline.__call__() got an unexpected keyword argument 'strength'
 ```
 
-Drop the `strength` kwarg and pass references via `image=` (a single image, or a list for multiple
-references). For Flux.2 inpainting (which does add noise to a latent and therefore does take a
-`strength` parameter), use `Flux2KleinInpaintPipeline` instead.
+Drop the `strength` argument and pass references with `image`. For inpainting, use [`Flux2KleinInpaintPipeline`] instead.
 
 ## Flux2Pipeline
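The migration this patch documents can be sketched as a small porting helper. `port_img2img_kwargs` is a hypothetical name used only for illustration and is not part of diffusers; it captures the two changes the doc asks for:

```python
def port_img2img_kwargs(call_kwargs):
    """Adapt SD/Flux.1 img2img call kwargs for a Flux.2 pipeline call.

    Hypothetical helper (not a diffusers API): drops `strength`, which Flux.2
    pipelines reject with a TypeError, and wraps a single `image` in a list,
    since Flux.2 reads a list as multiple reference images.
    """
    kwargs = dict(call_kwargs)     # leave the caller's dict untouched
    kwargs.pop("strength", None)   # img2img-only; unsupported by Flux.2
    image = kwargs.get("image")
    if image is not None and not isinstance(image, (list, tuple)):
        kwargs["image"] = [image]
    return kwargs


print(port_img2img_kwargs({"prompt": "a cat", "image": "ref.png", "strength": 0.6}))
# {'prompt': 'a cat', 'image': ['ref.png']}
```

The returned dict can then be splatted into the Flux.2 pipeline call without tripping the `TypeError` shown above.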