From 4219037d7b71bed0513ee164188b4b01a2e57241 Mon Sep 17 00:00:00 2001
From: "yueyang.hyy"
Date: Tue, 5 Aug 2025 15:26:21 +0800
Subject: [PATCH 1/3] update qwen image doc

---
 docs/tutorial.md    | 46 +++++++++++++++++++++++++++++++++++++++++++++
 docs/tutorial_zh.md | 45 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/docs/tutorial.md b/docs/tutorial.md
index 6b99c234..05e38337 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -88,6 +88,52 @@ We will continuously update DiffSynth-Engine to support more models. (Wan2.2 LoR
 After the model is downloaded, load the model with the corresponding pipeline and perform inference.
 
+
+### Image Generation (Qwen-Image)
+
+The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, cfg_scale, with a suggested cfg_scale of 4. If no negative_prompt is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, currently only cfg parallelism is supported (parallelism=2), with other optimization efforts underway.
+
+```python
+from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig
+
+config = QwenImagePipelineConfig.basic_config(
+    model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
+    encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
+    vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
+    parallelism=2,
+)
+pipe = QwenImagePipeline.from_pretrained(config)
+
+prompt = """
+    一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。
+    """
+negative_prompt = " "
+image = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    cfg_scale=4.0,
+    width=1104,
+    height=1472,
+    num_inference_steps=30,
+    seed=42,
+)
+image.save("image.png")
+```
+
+Please note that if some necessary modules, like text encoders, are missing from a model repository, the pipeline will automatically download the required files.
+
+#### Detailed Parameters (Qwen-Image)
+
+In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:
+
+* `prompt`: The prompt, used to describe the content of the generated image. It supports multiple languages (Chinese, English, Japanese, etc.), e.g., “一只猫” (Chinese), "a cat" (English), or "庭を走る猫" (Japanese).
+* `negative_prompt`:The negative prompt, used to describe content you do not want in the image, it defaults to a single space character (not an empty string), e.g., "ugly".
+* `cfg_scale`:The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
+* `height`: Image height. 
+* `width`: Image width. 
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality. 
+* `seed`: The random seed. A fixed seed ensures reproducible results. 
+
 ### Image Generation
 
 The following code calls `FluxImagePipeline` to load the [MajicFlus](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0) model and generate an image. 
To load other types of models, replace `FluxImagePipeline` and `FluxPipelineConfig` in the code with the corresponding pipeline and config. diff --git a/docs/tutorial_zh.md b/docs/tutorial_zh.md index 5248251e..33b52d10 100644 --- a/docs/tutorial_zh.md +++ b/docs/tutorial_zh.md @@ -88,6 +88,51 @@ Diffusion 模型包含多种多样的模型结构,每种模型由对应的流 模型下载完毕后,我们可以根据对应的模型类型选择流水线加载模型并进行推理。 +### 图像生成(Qwen-Image) + +以下代码可以调用 `QwenImagePipeline` 加载[Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image)模型生成一张图。推荐分辨率为928*1664 1104*1472 1328*1328 1472*1104 1664*928,cfg_scale为4,如果没有negative_prompt默认为一个空格而不是空字符串。多卡并行目前支持cfg并行(parallelism=2),其他优化工作正在进行中。 + +```python +from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig + +config = QwenImagePipelineConfig.basic_config( + model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"), + encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"), + vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"), + parallelism=2, +) +pipe = QwenImagePipeline.from_pretrained(config) + +prompt = """ + 一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。 + """ +negative_prompt = " " +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + cfg_scale=4.0, + width=1104, + height=1472, + num_inference_steps=30, + seed=42, +) +image.save("image.png") +``` + +请注意,某些模型库中缺乏必要的文本编码器等模块,我们的代码会自动补充下载所需的模型文件。 + +#### 详细参数(Qwen-Image) + +在图像生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制: + +* `prompt`: 提示词,用于描述生成图像的内容,支持多种语言(中文/英文/日文等),例如“一只猫”/"a cat"/"庭を走る猫"。 +* `negative_prompt`:负面提示词,用于描述不希望图像中出现的内容,例如“ugly”,默认为一个空格而不是空字符串, " "。 +* `cfg_scale`:[Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性,推荐值为4。 +* `height`:图像高度。 +* `width`:图像宽度。 +* `num_inference_steps`:推理步数,通常推理步数越多,计算时间越长,图像质量越高。 +* `seed`:随机种子,固定的随机种子可以使生成的内容固定。 + ### 图像生成 以下代码可以调用 `FluxImagePipeline` 加载[麦橘超然](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0)模型生成一张图。如果要加载其他结构的模型,请将代码中的 `FluxImagePipeline` 和 `FluxPipelineConfig` 替换成对应的流水线模块及配置。 From 7b7708f4ebaf7ef09d22895549eb9fe0943c3381 Mon Sep 17 00:00:00 2001 From: "yueyang.hyy" Date: Tue, 5 Aug 2025 15:35:39 +0800 Subject: [PATCH 2/3] fix tutorial_zh resolutions --- docs/tutorial_zh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorial_zh.md b/docs/tutorial_zh.md index 33b52d10..7c9373d8 100644 --- a/docs/tutorial_zh.md +++ b/docs/tutorial_zh.md @@ -90,7 +90,7 @@ Diffusion 模型包含多种多样的模型结构,每种模型由对应的流 ### 图像生成(Qwen-Image) -以下代码可以调用 `QwenImagePipeline` 加载[Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image)模型生成一张图。推荐分辨率为928*1664 1104*1472 1328*1328 1472*1104 1664*928,cfg_scale为4,如果没有negative_prompt默认为一个空格而不是空字符串。多卡并行目前支持cfg并行(parallelism=2),其他优化工作正在进行中。 +以下代码可以调用 `QwenImagePipeline` 加载[Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image)模型生成一张图。推荐分辨率为928×1664, 1104×1472, 1328×1328, 1472×1104, 1664×928,cfg_scale为4,如果没有negative_prompt默认为一个空格而不是空字符串。多卡并行目前支持cfg并行(parallelism=2),其他优化工作正在进行中。 ```python from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig From c89bc3affe84feb8be87c36f951ed0309956d8db Mon Sep 17 00:00:00 2001 From: "zhuguoxuan.zgx" Date: Tue, 5 Aug 2025 20:35:39 +0800 Subject: [PATCH 3/3] fix typo --- docs/tutorial.md | 56 ++++++++++++++++++++++----------------------- docs/tutorial_zh.md | 
52 ++++++++++++++++++++++++++--------------------------
 2 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/docs/tutorial.md b/docs/tutorial.md
index 05e38337..5371b204 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -91,7 +91,7 @@ After the model is downloaded, load the model with the corresponding pipeline an
 
 ### Image Generation (Qwen-Image)
 
-The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, cfg_scale, with a suggested cfg_scale of 4. If no negative_prompt is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, currently only cfg parallelism is supported (parallelism=2), with other optimization efforts underway.
+The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, with a suggested cfg_scale of 4. If no negative_prompt is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, currently only cfg parallelism is supported (parallelism=2), with other optimization efforts underway.
 
 ```python
 from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig
@@ -127,12 +127,12 @@ Please note that if some necessary modules, like text encoders, are missing from
 In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:
 
 * `prompt`: The prompt, used to describe the content of the generated image. It supports multiple languages (Chinese, English, Japanese, etc.), e.g., “一只猫” (Chinese), "a cat" (English), or "庭を走る猫" (Japanese).
-* `negative_prompt`:The negative prompt, used to describe content you do not want in the image, it defaults to a single space character (not an empty string), e.g., "ugly".
-* `cfg_scale`:The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
-* `height`: Image height. 
-* `width`: Image width. 
-* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality. 
-* `seed`: The random seed. A fixed seed ensures reproducible results. 
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". It defaults to a single space character (not an empty string).
+* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
+* `height`: Image height.
+* `width`: Image width.
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
+* `seed`: The random seed. A fixed seed ensures reproducible results.
 
 ### Image Generation
 
@@ -155,16 +155,16 @@ Please note that if some necessary modules, like text encoders, are missing from
 In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control:
 
-* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat". 
-* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". -* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content. -* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content. -* `input_image`: Input image, used for image-to-image generation. -* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved. -* `height`: Image height. -* `width`: Image width. -* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality. -* `seed`: The random seed. A fixed seed ensures reproducible results. +* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat". +* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". +* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content. +* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content. +* `input_image`: Input image, used for image-to-image generation. +* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved. +* `height`: Image height. +* `width`: Image width. +* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality. +* `seed`: The random seed. A fixed seed ensures reproducible results. #### Loading LoRA @@ -223,17 +223,17 @@ save_video(video, "video.mp4") In the video generation pipeline `pipe`, we can use the following parameters for fine-grained control: -* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat". -* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly". -* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content. -* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P). -* `input_video`: Input video, used for video-to-video generation. -* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved. -* `height`: Video frame height. -* `width`: Video frame width. -* `num_frames`: Number of video frames. -* `num_inference_steps`: The number of inference steps. 
Generally, more steps lead to longer computation time but higher video quality. 
-* `seed`: The random seed. A fixed seed ensures reproducible results. 
+* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
+* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
+* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content.
+* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
+* `input_video`: Input video, used for video-to-video generation.
+* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved.
+* `height`: Video frame height.
+* `width`: Video frame width.
+* `num_frames`: Number of video frames.
+* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher video quality.
+* `seed`: The random seed. A fixed seed ensures reproducible results.
 
 #### Loading LoRA
 
diff --git a/docs/tutorial_zh.md b/docs/tutorial_zh.md
index 7c9373d8..f2c8834a 100644
--- a/docs/tutorial_zh.md
+++ b/docs/tutorial_zh.md
@@ -126,12 +126,12 @@ image.save("image.png")
 在图像生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制:
 
 * `prompt`: 提示词,用于描述生成图像的内容,支持多种语言(中文/英文/日文等),例如“一只猫”/"a cat"/"庭を走る猫"。
-* `negative_prompt`：负面提示词,用于描述不希望图像中出现的内容,例如“ugly”,默认为一个空格而不是空字符串, " "。
-* `cfg_scale`：[Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性,推荐值为4。
-* `height`：图像高度。
-* `width`：图像宽度。
-* `num_inference_steps`：推理步数,通常推理步数越多,计算时间越长,图像质量越高。
-* `seed`：随机种子,固定的随机种子可以使生成的内容固定。
+* `negative_prompt`: 负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。默认为一个空格 " " 而不是空字符串。
+* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性,推荐值为4。
+* `height`: 图像高度。
+* `width`: 图像宽度。
+* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,图像质量越高。
+* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
 
 ### 图像生成
 
@@ -155,15 +155,15 @@ image.save("image.png")
 在图像生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制:
 
 * `prompt`: 提示词,用于描述生成图像的内容,例如“a cat”。
-* `negative_prompt`：负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。
-* `cfg_scale`：[Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性。
-* `clip_skip`：跳过 [CLIP](https://arxiv.org/abs/2103.00020) 文本编码器的层数,跳过的层数越多,生成的图像与文本的相关性越低,但生成的图像内容可能会出现奇妙的变化。
-* `input_image`：输入图像,用于图生图。
-* `denoising_strength`：去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入图像中的部分信息。
-* `height`：图像高度。
-* `width`：图像宽度。
-* `num_inference_steps`：推理步数,通常推理步数越多,计算时间越长,图像质量越高。
-* `seed`：随机种子,固定的随机种子可以使生成的内容固定。
+* `negative_prompt`: 负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。
+* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性。
+* `clip_skip`: 跳过 [CLIP](https://arxiv.org/abs/2103.00020) 文本编码器的层数,跳过的层数越多,生成的图像与文本的相关性越低,但生成的图像内容可能会出现奇妙的变化。
+* `input_image`: 输入图像,用于图生图。
+* `denoising_strength`: 去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入图像中的部分信息。
+* `height`: 图像高度。
+* `width`: 图像宽度。
+* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,图像质量越高。
+* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
 
 #### LoRA 加载
 
@@ -220,16 +220,16 @@ save_video(video, "video.mp4")
 在视频生成流水线 `pipe` 
中,我们可以通过以下参数进行精细的控制:
 
-* `prompt`: 提示词,用于描述生成图像的内容,例如“a cat”。
-* `negative_prompt`：负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。
-* `cfg_scale`：[Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性。
-* `input_image`：输入图像,只在图生视频模型中有效,例如 [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P)。
-* `input_video`：输入视频,用于视频生视频。
-* `denoising_strength`：去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入视频中的部分信息。
-* `height`：视频帧高度。
-* `width`：视频帧宽度。
-* `num_frames`：视频帧数。
-* `num_inference_steps`：推理步数,通常推理步数越多,计算时间越长,图像质量越高。
-* `seed`：随机种子,固定的随机种子可以使生成的内容固定。
+* `prompt`: 提示词,用于描述生成视频的内容,例如“a cat”。
+* `negative_prompt`: 负面提示词,用于描述不希望视频中出现的内容,例如“ugly”。
+* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文本与视频的相关性,但会降低生成内容的多样性。
+* `input_image`: 输入图像,只在图生视频模型中有效,例如 [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P)。
+* `input_video`: 输入视频,用于视频生视频。
+* `denoising_strength`: 去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入视频中的部分信息。
+* `height`: 视频帧高度。
+* `width`: 视频帧宽度。
+* `num_frames`: 视频帧数。
+* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,视频质量越高。
+* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
 
 #### LoRA 加载
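
To see how the documented Qwen-Image parameters fit together, here is a minimal sketch that sweeps the five recommended resolutions with a fixed seed. It reuses only the `QwenImagePipeline` API shown in the patch above; running on a single GPU with the `parallelism` argument omitted is an assumption about `basic_config`'s defaults, not something the tutorial states.

```python
# A minimal sketch, assuming the QwenImagePipeline API shown in the tutorial.
# Omitting `parallelism` for a single-GPU run is an assumption.
from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig

config = QwenImagePipelineConfig.basic_config(
    model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
    encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
    vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
)
pipe = QwenImagePipeline.from_pretrained(config)

# The five recommended resolutions from the tutorial text.
resolutions = [(928, 1664), (1104, 1472), (1328, 1328), (1472, 1104), (1664, 928)]
for width, height in resolutions:
    image = pipe(
        prompt="a cat",
        negative_prompt=" ",   # single space, per the documented default
        cfg_scale=4.0,         # suggested value from the doc
        width=width,
        height=height,
        num_inference_steps=30,
        seed=42,               # fixed seed, per the `seed` description above
    )
    image.save(f"image_{width}x{height}.png")
```

Holding `prompt`, `cfg_scale`, and `seed` constant while varying only `width` and `height` makes the resolution comparison reproducible, which is the behavior the `seed` parameter description promises.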