
Joyai image #1392

Closed
mi804 wants to merge 6 commits into modelscope:main from mi804:joyai-image


Conversation

Collaborator

@mi804 mi804 commented Apr 14, 2026

No description provided.

@mi804 mi804 closed this Apr 14, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the JoyAI-Image model, featuring a new DiT architecture, a Qwen3-VL-based text encoder, and a dedicated inference pipeline. The review identifies several critical bugs, including a missing device parameter in the pipeline that causes a KeyError and logic in the image embedder that ignores input images during denoising. Additionally, feedback points out redundant code in the DiT model, unused utility functions, and inconsistencies in pipeline unit definitions that should be cleaned up to improve maintainability and performance.

Comment on lines +101 to +108
inputs_shared = {
    "cfg_scale": cfg_scale,
    "input_image": input_image, "edit_images": edit_images,
    "denoising_strength": denoising_strength,
    "height": height, "width": width,
    "seed": seed, "max_sequence_length": max_sequence_length,
    "tiled": tiled, "tile_size": tile_size, "tile_stride": tile_stride,
}

critical

The rand_device parameter is missing from the inputs_shared dictionary in the __call__ method. This will lead to a KeyError when JoyAIImageUnit_NoiseInitializer is executed, as it expects rand_device as an input parameter. You should add "rand_device": self.device (or a user-provided device) to the inputs_shared dictionary.
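A minimal sketch of the fix, with placeholder values standing in for the real __call__ arguments and a plain string standing in for self.device:

```python
# Hedged sketch: the inputs_shared dict with the missing key added.
# All values here are placeholders; in the real __call__ they come from
# the method's arguments, and "cpu" stands in for self.device.
device = "cpu"
inputs_shared = {
    "cfg_scale": 7.5,
    "input_image": None, "edit_images": None,
    "denoising_strength": 1.0,
    "height": 1024, "width": 1024,
    "seed": 0, "max_sequence_length": 512,
    "tiled": False, "tile_size": 128, "tile_stride": 64,
    "rand_device": device,  # added so the noise initializer unit finds it
}
```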


def _encode_text_only(self, pipe, prompt, max_sequence_length):
    # TODO: may support for text-only encoding in the future.
    raise NotImplementedError("Text-only encoding is not implemented yet. Please provide edit_images for now.")

high

The _encode_text_only method currently raises a NotImplementedError. This prevents the pipeline from being used for text-to-image generation, which is listed as a supported feature in the model description. Implementing this method would ensure the pipeline is fully functional for all advertised tasks.

Comment on lines +258 to +267
    def process(self, pipe: JoyAIImagePipeline, input_image, noise, tiled, tile_size, tile_stride):
        if input_image is None:
            return {"latents": noise}
        pipe.load_models_to_device(self.onload_model_names)
        if isinstance(input_image, Image.Image):
            input_image = [input_image]
        input_image = [pipe.preprocess_image(img).transpose(0, 1) for img in input_image]
        latents = pipe.vae.encode(input_image, device=pipe.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
        input_latents = rearrange(latents, "(b n) c 1 h w -> b n c 1 h w", n=(len(input_image)))
        return {"latents": noise, "input_latents": input_latents}

high

The JoyAIImageUnit_InputImageEmbedder does not correctly handle the initial latents for img2img mode. It always returns the pure noise as latents on line 267, effectively ignoring the input_image and the denoising_strength parameter. To support img2img, it should blend the encoded input_latents with noise based on the starting timestep determined by denoising_strength using the scheduler's add_noise method.

    def process(self, pipe: JoyAIImagePipeline, input_image, noise, tiled, tile_size, tile_stride):
        if input_image is None:
            return {"latents": noise}
        pipe.load_models_to_device(self.onload_model_names)
        if isinstance(input_image, Image.Image):
            input_image = [input_image]
        input_image = [pipe.preprocess_image(img).transpose(0, 1) for img in input_image]
        latents = pipe.vae.encode(input_image, device=pipe.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
        input_latents = rearrange(latents, "(b n) c 1 h w -> b n c 1 h w", n=(len(input_image)))
        if pipe.scheduler.denoising_strength < 1.0:
            latents = pipe.scheduler.add_noise(input_latents, noise, pipe.scheduler.timesteps[0])
        else:
            latents = noise
        return {"latents": latents, "input_latents": input_latents}
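For intuition, here is a toy numeric illustration of the blend. It assumes a rectified-flow style add_noise of the form x_t = (1 - t) * x0 + t * noise; the scheduler's actual formula may differ:

```python
def add_noise(x0, noise, t):
    # Assumed rectified-flow mix: t=1.0 gives pure noise, t=0.0 gives the
    # clean input latents. Plain lists stand in for latent tensors.
    return [(1 - t) * a + t * b for a, b in zip(x0, noise)]

# denoising_strength < 1.0 starts denoising part-way, from a blend that is
# mostly the encoded input image rather than pure noise:
blended = add_noise([1.0, 1.0], [0.0, 0.0], t=0.25)
print(blended)  # [0.75, 0.75]
```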

Comment on lines +127 to +135
        if activation_fn == "gelu-approximate":
            self.proj = nn.Linear(dim, inner_dim, bias=bias)
            self.act = lambda x: F.gelu(x, approximate="tanh")
        elif activation_fn == "gelu":
            self.proj = nn.Linear(dim, inner_dim, bias=bias)
            self.act = F.gelu
        else:
            self.proj = nn.Linear(dim, inner_dim, bias=bias)
            self.act = F.gelu

medium

The nn.Linear layer is initialized identically in all branches of the if-elif-else block. This can be refactored to improve maintainability and reduce code duplication by moving the linear layer initialization outside the conditional block.

        self.proj = nn.Linear(dim, inner_dim, bias=bias)
        if activation_fn == "gelu-approximate":
            self.act = lambda x: F.gelu(x, approximate="tanh")
        else:
            self.act = F.gelu
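For reference, the two activation variants are numerically very close; a self-contained sketch comparing exact GELU with the tanh approximation (the model code itself uses torch.nn.functional.gelu):

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The "gelu-approximate" (approximate="tanh") variant.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

print(abs(gelu_exact(1.0) - gelu_tanh(1.0)))  # on the order of 1e-4
```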

Comment on lines +154 to +165
def get_cu_seqlens(text_mask, img_len):
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
    for i in range(batch_size):
        s = text_len[i] + img_len
        s1 = i * max_len + s
        s2 = (i + 1) * max_len
        cu_seqlens[2 * i + 1] = s1
        cu_seqlens[2 * i + 2] = s2
    return cu_seqlens

medium

The function get_cu_seqlens is defined but not used anywhere in the file. Additionally, it contains a hardcoded device="cuda" on line 158, which would cause runtime errors in CPU or NPU environments. It is recommended to remove this unused function to keep the code clean and avoid potential device-related bugs.
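If the helper were kept instead, the device should at least become a parameter. As a pure-Python reference for what the function computes (packed sequence boundaries for text+image attention; plain lists stand in for the mask tensor):

```python
def cu_seqlens_reference(text_mask, img_len):
    # text_mask: rows of 0/1 ints; img_len: number of image tokens per item.
    # Returns [0, s1_0, s2_0, s1_1, s2_1, ...] where s1 marks the end of the
    # valid (text + image) tokens in each padded slot and s2 the slot end.
    batch_size = len(text_mask)
    max_len = len(text_mask[0]) + img_len
    cu = [0] * (2 * batch_size + 1)
    for i, row in enumerate(text_mask):
        s = sum(row) + img_len
        cu[2 * i + 1] = i * max_len + s
        cu[2 * i + 2] = (i + 1) * max_len
    return cu

print(cu_seqlens_reference([[1, 1, 0], [1, 0, 0]], img_len=4))  # [0, 6, 7, 12, 14]
```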

Comment on lines +624 to +625
x = torch.cat((img, txt), 1)
img = x[:, :img_len, ...]

medium

This concatenation and subsequent slicing operation is redundant. img is concatenated with txt and then immediately sliced back to its original length using img_len. This does not change the content of img and can be removed to improve efficiency.
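A toy illustration of why the pair of operations is a no-op on img (plain lists stand in for tensors):

```python
img, txt = [1, 2, 3], [4, 5]
x = img + txt               # analogue of x = torch.cat((img, txt), 1)
recovered = x[:len(img)]    # analogue of img = x[:, :img_len, ...]
print(recovered == img)     # True: img is unchanged, so both lines can go
```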

Comment on lines +215 to +217
input_params=("edit_images", "tiled", "tile_size", "tile_stride", "edit_image_basesize", "height", "width"),
output_params=("ref_latents", "num_items", "is_multi_item"),
onload_model_names=("wan_video_vae",),

medium

There are several inconsistencies in the unit parameter definitions. In JoyAIImageUnit_EditImageEmbedder, edit_image_basesize is listed as an input but is unused in the process method. Additionally, num_items and is_multi_item are listed as output parameters but are never returned by the unit. Similarly, in JoyAIImageUnit_PromptEmbedder, the positive parameter is defined but unused. These unused parameters should be removed to improve code clarity.
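A sketch of the cleaned-up EditImageEmbedder definition with the unused entries dropped (keyword names follow the snippet quoted above; whether num_items and is_multi_item should instead be computed and returned is a design decision for the author):

```python
# Hypothetical cleaned-up unit configuration; only entries that are actually
# consumed or produced by the process method remain.
unit_config = dict(
    input_params=("edit_images", "tiled", "tile_size", "tile_stride",
                  "height", "width"),    # "edit_image_basesize" removed
    output_params=("ref_latents",),      # "num_items", "is_multi_item" removed
    onload_model_names=("wan_video_vae",),
)
```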
