Conversation
Code Review
This pull request introduces support for the JoyAI-Image model, featuring a new DiT architecture, a Qwen3-VL-based text encoder, and a dedicated inference pipeline. The review identifies several critical bugs, including a missing device parameter in the pipeline that causes a KeyError and logic in the image embedder that ignores input images during denoising. Additionally, feedback points out redundant code in the DiT model, unused utility functions, and inconsistencies in pipeline unit definitions that should be cleaned up to improve maintainability and performance.
```python
inputs_shared = {
    "cfg_scale": cfg_scale,
    "input_image": input_image, "edit_images": edit_images,
    "denoising_strength": denoising_strength,
    "height": height, "width": width,
    "seed": seed, "max_sequence_length": max_sequence_length,
    "tiled": tiled, "tile_size": tile_size, "tile_stride": tile_stride,
}
```
The rand_device parameter is missing from the inputs_shared dictionary in the __call__ method. This will lead to a KeyError when JoyAIImageUnit_NoiseInitializer is executed, as it expects rand_device as an input parameter. You should add "rand_device": self.device (or a user-provided device) to the inputs_shared dictionary.
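The failure mode can be seen in a minimal, self-contained sketch of the unit pattern (the function and parameter names here are illustrative, not the actual DiffSynth unit API): each declared input is looked up by key in the shared dict, so an unpopulated key raises `KeyError` only when the unit runs.

```python
# Minimal sketch of the failure mode: a pipeline unit pulls its declared
# inputs from a shared dict by name. Names are hypothetical.
def run_unit(input_params, inputs_shared):
    # Each declared parameter is looked up by key; a missing key raises KeyError.
    return {name: inputs_shared[name] for name in input_params}

inputs_shared = {"seed": 0, "height": 1024, "width": 1024}

try:
    run_unit(("seed", "rand_device"), inputs_shared)
except KeyError as e:
    print("missing input:", e)  # rand_device was never added to inputs_shared

# The fix: populate the key before the units execute.
inputs_shared["rand_device"] = "cpu"  # or self.device in the real pipeline
resolved = run_unit(("seed", "rand_device"), inputs_shared)
```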
```python
def _encode_text_only(self, pipe, prompt, max_sequence_length):
    # TODO: may support text-only encoding in the future.
    raise NotImplementedError("Text-only encoding is not implemented yet. Please provide edit_images for now.")
```
The _encode_text_only method currently raises a NotImplementedError. This prevents the pipeline from being used for text-to-image generation, which is listed as a supported feature in the model description. Implementing this method would ensure the pipeline is fully functional for all advertised tasks.
```python
def process(self, pipe: JoyAIImagePipeline, input_image, noise, tiled, tile_size, tile_stride):
    if input_image is None:
        return {"latents": noise}
    pipe.load_models_to_device(self.onload_model_names)
    if isinstance(input_image, Image.Image):
        input_image = [input_image]
    input_image = [pipe.preprocess_image(img).transpose(0, 1) for img in input_image]
    latents = pipe.vae.encode(input_image, device=pipe.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
    input_latents = rearrange(latents, "(b n) c 1 h w -> b n c 1 h w", n=len(input_image))
    return {"latents": noise, "input_latents": input_latents}
```
The JoyAIImageUnit_InputImageEmbedder does not correctly handle the initial latents for img2img mode. It always returns the pure noise as latents on line 267, effectively ignoring the input_image and the denoising_strength parameter. To support img2img, it should blend the encoded input_latents with noise based on the starting timestep determined by denoising_strength using the scheduler's add_noise method.
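The blend itself is simple to state. Assuming a rectified-flow style scheduler, where `add_noise` interpolates linearly between the clean latent and pure noise at the chosen start sigma, the initial latents look like the following sketch (scalars stand in for latent tensors; the real code operates on torch tensors and takes sigma from the scheduler's timestep schedule):

```python
# Hedged sketch of the img2img starting point, assuming a rectified-flow
# scheduler. Scalars stand in for latent tensors for clarity.
def add_noise(x0, noise, sigma):
    # Rectified flow forward process: x_sigma = (1 - sigma) * x0 + sigma * noise
    return (1 - sigma) * x0 + sigma * noise

def initial_latents(input_latent, noise, denoising_strength):
    # denoising_strength == 1.0 -> start from pure noise (text-to-image).
    if denoising_strength >= 1.0:
        return noise
    # Otherwise start partway along the noising trajectory, so the input
    # image still constrains the result.
    return add_noise(input_latent, noise, denoising_strength)

print(initial_latents(0.5, 2.0, 1.0))  # pure noise: 2.0
print(initial_latents(0.5, 2.0, 0.4))  # 0.6 * 0.5 + 0.4 * 2.0 = 1.1
```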
```python
def process(self, pipe: JoyAIImagePipeline, input_image, noise, tiled, tile_size, tile_stride):
    if input_image is None:
        return {"latents": noise}
    pipe.load_models_to_device(self.onload_model_names)
    if isinstance(input_image, Image.Image):
        input_image = [input_image]
    input_image = [pipe.preprocess_image(img).transpose(0, 1) for img in input_image]
    latents = pipe.vae.encode(input_image, device=pipe.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
    input_latents = rearrange(latents, "(b n) c 1 h w -> b n c 1 h w", n=len(input_image))
    if pipe.scheduler.denoising_strength < 1.0:
        latents = pipe.scheduler.add_noise(input_latents, noise, pipe.scheduler.timesteps[0])
    else:
        latents = noise
    return {"latents": latents, "input_latents": input_latents}
```

```python
if activation_fn == "gelu-approximate":
    self.proj = nn.Linear(dim, inner_dim, bias=bias)
    self.act = lambda x: F.gelu(x, approximate="tanh")
elif activation_fn == "gelu":
    self.proj = nn.Linear(dim, inner_dim, bias=bias)
    self.act = F.gelu
else:
    self.proj = nn.Linear(dim, inner_dim, bias=bias)
    self.act = F.gelu
```
The nn.Linear layer is initialized identically in all branches of the if-elif-else block. This can be refactored to improve maintainability and reduce code duplication by moving the linear layer initialization outside the conditional block.
```python
self.proj = nn.Linear(dim, inner_dim, bias=bias)
if activation_fn == "gelu-approximate":
    self.act = lambda x: F.gelu(x, approximate="tanh")
else:
    self.act = F.gelu
```

```python
def get_cu_seqlens(text_mask, img_len):
    batch_size = text_mask.shape[0]
    text_len = text_mask.sum(dim=1)
    max_len = text_mask.shape[1] + img_len
    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
    for i in range(batch_size):
        s = text_len[i] + img_len
        s1 = i * max_len + s
        s2 = (i + 1) * max_len
        cu_seqlens[2 * i + 1] = s1
        cu_seqlens[2 * i + 2] = s2
    return cu_seqlens
```
The function get_cu_seqlens is defined but not used anywhere in the file. Additionally, it contains a hardcoded device="cuda" on line 158, which would cause runtime errors in CPU or NPU environments. It is recommended to remove this unused function to keep the code clean and avoid potential device-related bugs.
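If the function were kept instead, the device bug is fixable by allocating on the input's device (`device=text_mask.device`). For reference, what it computes can be rendered device-free in plain Python: per batch item, the cumulative end offset of the valid (text + image) tokens, then the end offset of the padded slot. Parameter names below are illustrative.

```python
# Pure-Python rendering of the cu_seqlens packing logic, to show what the
# torch function computes. A device-safe torch version would allocate with
# device=text_mask.device instead of the hardcoded "cuda".
def get_cu_seqlens(text_lens, pad_len, img_len):
    # text_lens: valid text tokens per batch item; pad_len: padded text length.
    max_len = pad_len + img_len
    cu_seqlens = [0] * (2 * len(text_lens) + 1)
    for i, text_len in enumerate(text_lens):
        cu_seqlens[2 * i + 1] = i * max_len + text_len + img_len  # end of valid tokens
        cu_seqlens[2 * i + 2] = (i + 1) * max_len                 # end of padded slot
    return cu_seqlens

print(get_cu_seqlens([3, 5], pad_len=5, img_len=4))  # [0, 7, 9, 18, 18]
```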
```python
x = torch.cat((img, txt), 1)
img = x[:, :img_len, ...]
```

```python
input_params=("edit_images", "tiled", "tile_size", "tile_stride", "edit_image_basesize", "height", "width"),
output_params=("ref_latents", "num_items", "is_multi_item"),
onload_model_names=("wan_video_vae",),
```
There are several inconsistencies in the unit parameter definitions. In JoyAIImageUnit_EditImageEmbedder, edit_image_basesize is listed as an input but is unused in the process method. Additionally, num_items and is_multi_item are listed as output parameters but are never returned by the unit. Similarly, in JoyAIImageUnit_PromptEmbedder, the positive parameter is defined but unused. These unused parameters should be removed to improve code clarity.