88 changes: 67 additions & 21 deletions docs/tutorial.md
@@ -88,6 +88,52 @@ We will continuously update DiffSynth-Engine to support more models. (Wan2.2 LoR

After the model is downloaded, load the model with the corresponding pipeline and perform inference.


### Image Generation (Qwen-Image)

The following code calls `QwenImagePipeline` to load the [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) model and generate an image. Recommended resolutions are 928×1664, 1104×1472, 1328×1328, 1472×1104, and 1664×928, with a suggested `cfg_scale` of 4. If no `negative_prompt` is provided, it defaults to a single space character (not an empty string). For multi-GPU parallelism, only CFG parallelism is currently supported (`parallelism=2`); other optimizations are in progress.

```python
from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig

config = QwenImagePipelineConfig.basic_config(
model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
parallelism=2,
)
pipe = QwenImagePipeline.from_pretrained(config)

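# The prompt below is written in Chinese to exercise Qwen-Image's
# in-image text rendering; prompts in other supported languages work too.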
prompt = """
一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。
"""
negative_prompt = " "
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
cfg_scale=4.0,
width=1104,
height=1472,
num_inference_steps=30,
seed=42,
)
image.save("image.png")
```

Please note that if some necessary modules, like text encoders, are missing from a model repository, the pipeline will automatically download the required files.

#### Detailed Parameters (Qwen-Image)

In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control; a usage sketch follows the list:

* `prompt`: The prompt, used to describe the content of the generated image. It supports multiple languages (Chinese, English, Japanese, etc.), e.g., “一只猫” (Chinese), "a cat" (English), or "庭を走る猫" (Japanese).
* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly". It defaults to a single space character (not an empty string).
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
* `height`: Image height.
* `width`: Image width.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
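
As a quick illustration, the sketch below reuses the `pipe` object from the example above and varies only the parameters documented in this list; the resolution and `cfg_scale` follow the recommendations at the top of this section, while the prompt and seed are arbitrary.

```python
# Reuse `pipe` from the Qwen-Image example above.
# 1664×928 is one of the recommended landscape resolutions.
image = pipe(
    prompt="a cat",           # English prompts are supported too
    negative_prompt=" ",      # default: a single space, not an empty string
    cfg_scale=4.0,            # suggested guidance scale
    width=1664,
    height=928,
    num_inference_steps=30,
    seed=7,                   # change the seed for a different image
)
image.save("cat.png")
```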

### Image Generation

The following code calls `FluxImagePipeline` to load the [MajicFlus](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0) model and generate an image. To load other types of models, replace `FluxImagePipeline` and `FluxPipelineConfig` in the code with the corresponding pipeline and config.
@@ -109,16 +155,16 @@ Please note that if some necessary modules, like text encoders, are missing from
Expand All @@ -109,16 +155,16 @@ Please note that if some necessary modules, like text encoders, are missing from

In the image generation pipeline `pipe`, we can use the following parameters for fine-grained control; an image-to-image sketch follows the list:

* `prompt`: The prompt, used to describe the content of the generated image, e.g., "a cat".
* `negative_prompt`: The negative prompt, used to describe content you do not want in the image, e.g., "ugly".
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the image but reduces the diversity of the generated content.
* `clip_skip`: The number of layers to skip in the [CLIP](https://arxiv.org/abs/2103.00020) text encoder. The more layers skipped, the lower the text-image correlation, but this can lead to interesting variations in the generated content.
* `input_image`: Input image, used for image-to-image generation.
* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input image is preserved.
* `height`: Image height.
* `width`: Image width.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher image quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
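
As a sketch of image-to-image use, the following assumes `pipe` is the `FluxImagePipeline` constructed in the elided text-to-image example above; the file name `input.png` and the concrete parameter values are illustrative assumptions, not values from the tutorial.

```python
from PIL import Image

# A minimal image-to-image sketch, assuming `pipe` is the FluxImagePipeline
# built earlier; the file name and values here are illustrative only.
init_image = Image.open("input.png").convert("RGB")
image = pipe(
    prompt="a cat, oil painting style",
    negative_prompt="ugly",
    cfg_scale=4.5,
    clip_skip=2,                 # skip CLIP layers for stylistic variation
    input_image=init_image,
    denoising_strength=0.6,      # < 1 keeps some information from the input
    width=768,
    height=768,
    num_inference_steps=30,
    seed=42,
)
image.save("image_i2i.png")
```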

#### Loading LoRA

@@ -177,17 +223,17 @@ save_video(video, "video.mp4")

In the video generation pipeline `pipe`, we can use the following parameters for fine-grained control; a call sketch follows the list:

* `prompt`: The prompt, used to describe the content of the generated video, e.g., "a cat".
* `negative_prompt`: The negative prompt, used to describe content you do not want in the video, e.g., "ugly".
* `cfg_scale`: The guidance scale for [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598). A larger value usually results in stronger correlation between the text and the video but reduces the diversity of the generated content.
* `input_image`: Input image, only effective in image-to-video models, such as [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P).
* `input_video`: Input video, used for video-to-video generation.
* `denoising_strength`: The denoising strength. When set to 1, a full generation process is performed. When set to a value between 0 and 1, some information from the input video is preserved.
* `height`: Video frame height.
* `width`: Video frame width.
* `num_frames`: Number of video frames.
* `num_inference_steps`: The number of inference steps. Generally, more steps lead to longer computation time but higher video quality.
* `seed`: The random seed. A fixed seed ensures reproducible results.
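
As a hypothetical call sketch, the code below reuses `pipe` and `save_video` from the elided Wan example above; the frame count, frame size, and other concrete values are illustrative assumptions, not values from the tutorial.

```python
# A hypothetical text-to-video call; `pipe` and `save_video` come from the
# elided example above, and the concrete values here are illustrative only.
video = pipe(
    prompt="a cat running on the grass",
    negative_prompt="ugly",
    cfg_scale=5.0,
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=30,
    seed=42,
)
save_video(video, "video.mp4")
```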

#### Loading LoRA

83 changes: 64 additions & 19 deletions docs/tutorial_zh.md
@@ -88,6 +88,51 @@ Diffusion 模型包含多种多样的模型结构,每种模型由对应的流

模型下载完毕后,我们可以根据对应的模型类型选择流水线加载模型并进行推理。

### 图像生成(Qwen-Image)

以下代码可以调用 `QwenImagePipeline` 加载 [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) 模型生成一张图。推荐分辨率为 928×1664、1104×1472、1328×1328、1472×1104、1664×928,建议 `cfg_scale` 为 4;如果未提供 `negative_prompt`,默认为一个空格而不是空字符串。多卡并行目前仅支持 cfg 并行(`parallelism=2`),其他优化工作正在进行中。

```python
from diffsynth_engine import fetch_model, QwenImagePipeline, QwenImagePipelineConfig

config = QwenImagePipelineConfig.basic_config(
model_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="transformer/*.safetensors"),
encoder_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="text_encoder/*.safetensors"),
vae_path=fetch_model("MusePublic/Qwen-image", revision="v1", path="vae/*.safetensors"),
parallelism=2,
)
pipe = QwenImagePipeline.from_pretrained(config)

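# 下方提示词使用中文,以展示 Qwen-Image 的图内文字渲染能力;
# 其他受支持语言的提示词同样可用。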
prompt = """
一副典雅庄重的对联悬挂于厅堂之中,房间是个安静古典的中式布置,桌子上放着一些青花瓷,对联上左书“思涌如泉万类灵感皆可触”,右书“智启于问千机代码自天成”,横批“AI脑洞力”,字体飘逸灵动,兼具传统笔意与未来感。中间挂着一幅中国风的画作,内容是岳阳楼,云雾缭绕间似有数据流光隐现,古今交融,意境深远。
"""
negative_prompt = " "
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
cfg_scale=4.0,
width=1104,
height=1472,
num_inference_steps=30,
seed=42,
)
image.save("image.png")
```

请注意,某些模型库中缺乏必要的文本编码器等模块,我们的代码会自动补充下载所需的模型文件。

#### 详细参数(Qwen-Image)

在图像生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制(列表后附使用示例):

* `prompt`: 提示词,用于描述生成图像的内容,支持多种语言(中文/英文/日文等),例如“一只猫”/"a cat"/"庭を走る猫"。
* `negative_prompt`: 负面提示词,用于描述不希望图像中出现的内容,例如“ugly”;默认为一个空格 " " 而不是空字符串。
* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性,推荐值为4。
* `height`: 图像高度。
* `width`: 图像宽度。
* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,图像质量越高。
* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
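
下面给出一个简单的使用示例,复用上文示例中的 `pipe` 对象,仅调整上述已列出的参数;分辨率与 `cfg_scale` 取值遵循本节开头的推荐值,提示词与随机种子为任意示例。

```python
# 复用上文 Qwen-Image 示例中构建的 pipe;1664×928 为推荐的横版分辨率之一
image = pipe(
    prompt="a cat",           # 也支持英文等其他语言的提示词
    negative_prompt=" ",      # 默认为一个空格,而不是空字符串
    cfg_scale=4.0,            # 推荐的引导系数
    width=1664,
    height=928,
    num_inference_steps=30,
    seed=7,                   # 更换随机种子可得到不同的图像
)
image.save("cat.png")
```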

### 图像生成

以下代码可以调用 `FluxImagePipeline` 加载[麦橘超然](https://www.modelscope.cn/models/MAILAND/majicflus_v1/summary?version=v1.0)模型生成一张图。如果要加载其他结构的模型,请将代码中的 `FluxImagePipeline` 和 `FluxPipelineConfig` 替换成对应的流水线模块及配置。
@@ -110,15 +155,15 @@ image.save("image.png")
Expand All @@ -110,15 +155,15 @@ image.save("image.png")
在图像生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制(列表后附图生图示例):

* `prompt`: 提示词,用于描述生成图像的内容,例如“a cat”。
* `negative_prompt`: 负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。
* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文图相关性,但会降低生成内容的多样性。
* `clip_skip`: 跳过 [CLIP](https://arxiv.org/abs/2103.00020) 文本编码器的层数,跳过的层数越多,生成的图像与文本的相关性越低,但生成的图像内容可能会出现奇妙的变化。
* `input_image`: 输入图像,用于图生图。
* `denoising_strength`: 去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入图像中的部分信息。
* `height`: 图像高度。
* `width`: 图像宽度。
* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,图像质量越高。
* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
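
下面是一个图生图的示意代码,假设 `pipe` 为教程前文构建的 `FluxImagePipeline`;其中文件名与具体取值仅作示意,并非教程中的内容。

```python
from PIL import Image

# 图生图示意:假设 pipe 为前文构建的 FluxImagePipeline,
# 文件名与参数取值仅作示意
init_image = Image.open("input.png").convert("RGB")
image = pipe(
    prompt="a cat, oil painting style",
    negative_prompt="ugly",
    cfg_scale=4.5,
    clip_skip=2,                 # 跳过部分 CLIP 层,可能带来风格变化
    input_image=init_image,
    denoising_strength=0.6,      # 小于 1 时保留输入图像的部分信息
    width=768,
    height=768,
    num_inference_steps=30,
    seed=42,
)
image.save("image_i2i.png")
```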

#### LoRA 加载

@@ -175,16 +220,16 @@ save_video(video, "video.mp4")
在视频生成流水线 `pipe` 中,我们可以通过以下参数进行精细的控制(列表后附调用示例):

* `prompt`: 提示词,用于描述生成视频的内容,例如“a cat”。
* `negative_prompt`: 负面提示词,用于描述不希望图像中出现的内容,例如“ugly”。
* `cfg_scale`: [Classifier-free guidance](https://arxiv.org/abs/2207.12598) 的引导系数,通常更大的引导系数可以达到更强的文本与视频的相关性,但会降低生成内容的多样性。
* `input_image`: 输入图像,只在图生视频模型中有效,例如 [Wan-AI/Wan2.1-I2V-14B-720P](https://modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P)。
* `input_video`: 输入视频,用于视频生视频。
* `denoising_strength`: 去噪力度,当设置为 1 时,执行完整的生成过程,当设置为 0 到 1 之间的值时,会保留输入视频中的部分信息。
* `height`: 视频帧高度。
* `width`: 视频帧宽度。
* `num_frames`: 视频帧数。
* `num_inference_steps`: 推理步数,通常推理步数越多,计算时间越长,视频质量越高。
* `seed`: 随机种子,固定的随机种子可以使生成的内容固定。
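
下面给出一个文生视频调用的示意,`pipe` 与 `save_video` 来自上文省略的示例;帧数与分辨率等具体取值仅作示意,并非教程中的取值。

```python
# 文生视频调用示意:pipe 与 save_video 来自上文省略的示例,
# 具体取值仅作示意
video = pipe(
    prompt="a cat running on the grass",
    negative_prompt="ugly",
    cfg_scale=5.0,
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=30,
    seed=42,
)
save_video(video, "video.mp4")
```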

#### LoRA 加载
