# Customize Qwen-Image with DiffSynth-Studio

This tutorial explores the capabilities of the Qwen-Image series, a collection of models with roughly 86 billion parameters in total, and explains how to fine-tune it efficiently on AMD hardware using DiffSynth-Studio, a framework for fast and efficient diffusion model inference and training.

It demonstrates how the high memory capacity of the AMD Instinct™ MI300X GPU enables loading multiple large models simultaneously for complex workflows involving inference, editing, and training.

## Key components
Hardware: AMD Instinct MI300X GPU

Software: DiffSynth-Studio and ROCm

Models: Qwen-Image, Qwen-Image-Edit, and Custom LoRA adapters

## Prerequisites
Before starting, ensure your environment meets the following requirements:

**Operating system**: Linux (Ubuntu 22.04 recommended). See the official requirements for supported operating systems.

**Hardware**: AMD Instinct MI300X GPU

**Software**: ROCm 6.0 or later, Docker, and Python 3.10 or later

**Note**: Install and verify ROCm by following the ROCm install guide.

<a id="step1"></a>

## Step 1: Environment setup

### Verify the hardware availability

The AMD Instinct MI300X GPU is designed to deliver peak performance for Generative AI workloads. Before you begin, verify that your GPU is correctly detected and ready for use.



In [None]:
!amd-smi
#For ROCm 6.4 and earlier, run rocm-smi instead.
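
If you prefer to run this check from Python, the following minimal sketch looks for whichever monitoring tool is on the PATH and runs it. The helper name find_gpu_tool is hypothetical; it relies only on the amd-smi and rocm-smi command names mentioned above.

```python
import shutil
import subprocess


def find_gpu_tool(candidates=("amd-smi", "rocm-smi")):
    """Return the path of the first monitoring tool found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


tool = find_gpu_tool()
if tool is None:
    print("Neither amd-smi nor rocm-smi was found; check the ROCm installation.")
else:
    # Run the tool without arguments, as in the cell above.
    result = subprocess.run([tool], capture_output=True, text=True)
    print(result.stdout or result.stderr)
```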

### Install DiffSynth-Studio from source
To ensure full compatibility with AMD ROCm, install DiffSynth-Studio directly from the source.

**Note**: The cell below also appends the repository directory to sys.path so that the notebook can import the library immediately, without a kernel restart.

In [None]:
import os
import sys

# 1. Clone the repository
!git clone https://github.com/modelscope/DiffSynth-Studio.git

# 2. Navigate into the directory
os.chdir("DiffSynth-Studio")

# 3. Checkout the specific commit for reproducibility
!git checkout afd101f3452c9ecae0c87b79adfa2e22d65ffdc3

# 4. Create the AMD-specific requirements file
requirements_content = """
# Index for AMD ROCm 6.4 wheels (Prioritized)
--index-url https://download.pytorch.org/whl/rocm6.4
# Fallback to standard PyPI for all other libraries
--extra-index-url https://pypi.org/simple
# Core PyTorch libraries
torch>=2.0.0
torchvision
transformers>=4.37.0
# Install the DiffSynth-Studio project and its other dependencies
-e .
""".strip()

with open("requirements-amd.txt", "w") as f:
    f.write(requirements_content)

# 5. Install using the custom requirements
!pip install -r requirements-amd.txt

# 6. Force the current notebook to see the installed package
sys.path.append(os.getcwd())
print(f"Added {os.getcwd()} to system path to enable immediate import.")

# 7. Return to root directory
os.chdir("..")
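
After the installation finishes, a quick check can confirm that a ROCm build of PyTorch was picked up. ROCm builds expose the GPU through the torch.cuda namespace, and torch.version.hip is set only on ROCm builds. This is a minimal sketch; describe_torch_backend is a hypothetical helper name.

```python
import importlib.util


def describe_torch_backend():
    """Summarize the installed PyTorch build without failing if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None on non-ROCm builds, a version string on ROCm builds.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_visible": torch.cuda.is_available(),
    }


print(describe_torch_backend())
```

If hip_version is None or gpu_visible is False, the wheel index in requirements-amd.txt or the ROCm installation is worth re-checking before continuing.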

<a id="step2"></a>

## Step 2: Basic model inference

This section demonstrates how to run inference with the model.

Qwen-Image is a large-scale image generation model. Configure the pipeline and load the model components (Transformer, Text Encoder, and VAE) onto the GPU.

**Note**: DiffSynth-Studio can download weights from ModelScope, but the cell below sets skip_download=True and reads them from a pre-populated local cache at model_path.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch
from PIL import Image
import pandas as pd
import numpy as np

model_path="/root/.cache/huggingface/"

# Load models from the local cache at model_path (no download needed)
qwen_image = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="tokenizer/"),
)

# enable_lora_magic(), together with hotload=True in load_lora(), enables dynamic LoRA swapping without reloading the base model.
qwen_image.enable_lora_magic()

### Generate a baseline image
Generate your first image using a simple prompt: "a portrait of an Asian woman".

In [None]:
prompt = "a portrait of an Asian woman"

# num_inference_steps=40 is the total number of iterative refinement
# steps the model takes to turn noise into an image, so the model
# performs 40 denoising iterations during inference.
image = qwen_image(prompt, seed=0, num_inference_steps=40)
image.resize((512, 512))
# Error messages might be printed during generation, but they can be ignored.

<a id="step3"></a>

## Step 3: Enhancing quality with LoRA

You might notice that the baseline image lacks fine details.

To improve the image, load Qwen-Image-LoRA-ArtAug-v1, an aesthetic-enhancement LoRA that improves visual fidelity and artistic detail in the generated image.

In [None]:
model_path="/root/.cache/huggingface/"

qwen_image.load_lora(
    qwen_image.dit,
    ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-LoRA-ArtAug-v1", local_model_path=model_path, skip_download=True, origin_file_pattern="model.safetensors"),
    hotload=True,
)

Rerun the same prompt to see the improvement.

In [None]:
prompt = "a portrait of an Asian woman"
image = qwen_image(prompt, seed=0, num_inference_steps=40)
image.save("image_face.jpg")
image.resize((512, 512))

<a id="step4"></a>

## Step 4: Multilingual and multi-image editing

The Qwen-Image text encoder is robust enough to understand prompts in languages it wasn’t explicitly trained on. To try this out, generate the same character from prompts in English, Korean, and Kannada.

First, generate an image using English.

In [None]:
qwen_image.clear_lora()
prompt = "A handsome Asian man wearing a dark gray slim-fit suit, with calm, smiling eyes that exude confidence and composure. He is seated at a table, holding a bouquet of red flowers in his hands."
image = qwen_image(prompt, seed=2, num_inference_steps=40)
image.resize((512, 512))

Then use the same prompt in Korean to determine whether the model understands it.

In [None]:
qwen_image.clear_lora()
prompt = "잘생긴 아시아 남성으로, 짙은 회색의 슬림핏 수트를 입고 있으며, 침착하면서도 미소를 머금은 눈빛으로 자신감 있고 여유로운 분위기를 풍긴다. 그는 책상 앞에 앉아 붉은 꽃다발을 손에 들고 있다."
image = qwen_image(prompt, seed=2, num_inference_steps=40)
image.resize((512, 512))

Now try the same prompt in Kannada to determine whether the model understands it.

In [None]:
qwen_image.clear_lora()
prompt = "ಗಾಢ ಬೂದು ಬಣ್ಣದ ಸ್ಲಿಮ್-ಫಿಟ್ ಸೂಟ್ ಧರಿಸಿದ, ಆತ್ಮವಿಶ್ವಾಸ ಮತ್ತು ಶಾಂತತೆಯನ್ನು ಹೊರಹಾಕುವ ಶಾಂತ, ನಗುತ್ತಿರುವ ಕಣ್ಣುಗಳನ್ನು ಹೊಂದಿರುವ ಒಬ್ಬ ಸುಂದರ ಏಷ್ಯನ್ ವ್ಯಕ್ತಿ. ಅವನು ಮೇಜಿನ ಬಳಿ ಕುಳಿತಿದ್ದಾನೆ, ಕೈಯಲ್ಲಿ ಕೆಂಪು ಹೂವುಗಳ ಪುಷ್ಪಗುಚ್ಛವನ್ನು ಹಿಡಿದಿದ್ದಾನೆ."
image = qwen_image(prompt, seed=2, num_inference_steps=40)
image.save("image_man.jpg")
image.resize((512, 512))

Although Qwen-Image wasn’t trained on Korean or Kannada text, the foundational capabilities of its text encoder still provide multilingual understanding.
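
The three cells above repeat the same call with a different prompt. A compact way to run such a comparison is to loop over the prompts while holding the seed and step count fixed, so that differences between images come from the prompt alone. This is a minimal sketch; generate_for_prompts is a hypothetical helper, and generate stands for any callable with the same keyword arguments as the pipeline call used above.

```python
def generate_for_prompts(generate, prompts, seed=2, num_inference_steps=40):
    """Call `generate` once per prompt with identical sampling settings.

    `prompts` maps a label (for example a language name) to a prompt string.
    Returns a dict mapping each label to whatever `generate` returned.
    """
    results = {}
    for label, prompt in prompts.items():
        results[label] = generate(
            prompt, seed=seed, num_inference_steps=num_inference_steps
        )
    return results


# Usage in this notebook (assumes qwen_image is loaded as above):
# images = generate_for_prompts(qwen_image, {"english": "...", "korean": "..."})
# images["korean"].resize((512, 512))
```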

Next, clean up the models that are no longer needed. Before doing so, count how many parameters the loaded pipeline contains.

In [None]:
def count_parameters(model):
    return sum([p.numel() for p in model.parameters()])

qwen_image_params = count_parameters(qwen_image)
print(qwen_image_params)

In [None]:
del qwen_image

# ROCm builds of PyTorch expose the GPU through the torch.cuda API, so this call works unchanged on AMD GPUs.
torch.cuda.empty_cache()
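
A note on why both statements are needed: del only removes the Python name, and the memory is released once no other references to the object remain. torch.cuda.empty_cache() then returns PyTorch's cached, unused blocks to the driver. The sketch below wraps these steps and reports how much memory tensors still occupy; release_gpu_memory is a hypothetical helper, not part of DiffSynth-Studio.

```python
import gc
import importlib.util


def release_gpu_memory():
    """Collect unreachable objects, then release PyTorch's cached GPU blocks.

    Returns the number of bytes still allocated by tensors on the current
    device, or None when PyTorch or a GPU is not available.
    """
    gc.collect()
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated()


remaining = release_gpu_memory()
if remaining is not None:
    print(f"Still allocated: {remaining / 1e9:.2f} GB")
```

If the reported value stays high after deleting a pipeline, some other variable is probably still holding a reference to it.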

<a id="step5"></a>

## Step 5: Advanced image editing

This section describes some advanced techniques for producing more complex images.

### Load the editing pipeline

The Qwen-Image series includes specialized models for different tasks. Next, load Qwen-Image-Edit, a model designed specifically for image editing and in-painting tasks.

In [None]:
# Load editing pipeline from pre-downloaded HuggingFace cache
model_path="/root/.cache/huggingface/"

qwen_image_edit = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image-Edit", local_model_path=model_path, skip_download=True, origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="tokenizer/"),
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", local_model_path=model_path, skip_download=True, origin_file_pattern="processor/"),
)
qwen_image_edit.enable_lora_magic()

### Outpainting with consistency

You can perform an outpainting task by taking the portrait you just generated and extending it into a long-shot image with a forest background.

Outpainting: extending an image beyond its original borders to add more context.

Inpainting: filling in missing parts of an image.

In [None]:
prompt = "Realistic photography of a beautiful woman wearing a long dress. The background is a forest."
negative_prompt = "Make the character's fingers mutilated and distorted, enlarge the head to create an unnatural head-to-body ratio, turning the figure into a short-statured big-headed doll. Generate harsh, glaring sunlight and render the entire scene with oversaturated colors. Twist the legs into either X-shaped or O-shaped deformities."

# The negative prompt lists features we do NOT want. During sampling, the output is steered away from them; the model itself is not retrained.

image = qwen_image_edit(prompt, negative_prompt=negative_prompt, edit_image=Image.open("image_face.jpg"), seed=1, num_inference_steps=40)
image.resize((512, 512))
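
Negative prompts are commonly implemented with classifier-free guidance: at each denoising step, the prediction for the negative prompt is used as a baseline, and the result is pushed from it toward the prediction for the positive prompt. The sketch below shows that standard combination on plain Python lists. It illustrates the general technique only; the function name and the guidance_scale values are assumptions and are not taken from DiffSynth-Studio.

```python
def guided_prediction(positive, negative, guidance_scale):
    """Combine two predictions as: negative + scale * (positive - negative)."""
    return [n + guidance_scale * (p - n) for p, n in zip(positive, negative)]


positive = [1.0, 2.0, 3.0]   # toy prediction conditioned on the prompt
negative = [0.5, 2.0, 4.0]   # toy prediction conditioned on the negative prompt

# A scale of 1.0 reproduces the positive prediction exactly.
print(guided_prediction(positive, negative, guidance_scale=1.0))
# Larger scales move further away from the negative prediction.
print(guided_prediction(positive, negative, guidance_scale=4.0))
```

Where the two predictions agree (the middle element here), the guidance has no effect, which is why a negative prompt only changes the features it actually describes.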

The face in this image might not match the reference portrait. To fix this, load DiffSynth-Studio/Qwen-Image-Edit-F2P, a specialized LoRA that generates consistent images based on a facial reference.

In [None]:
model_path="/root/.cache/huggingface/"

qwen_image_edit.load_lora(
    qwen_image_edit.dit,
    ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Edit-F2P", local_model_path=model_path, skip_download=True, origin_file_pattern="model.safetensors"),
    hotload=True,
)
prompt = "Realistic photography of a beautiful woman wearing a long dress. The background is a forest."
negative_prompt = "Make the character's fingers mutilated and distorted, enlarge the head to create an unnatural head-to-body ratio, turning the figure into a short-statured big-headed doll. Generate harsh, glaring sunlight and render the entire scene with oversaturated colors. Twist the legs into either X-shaped or O-shaped deformities."
image = qwen_image_edit(prompt, negative_prompt=negative_prompt, edit_image=Image.open("image_face.jpg"), seed=1, num_inference_steps=40)
image.save("image_fullbody.jpg")
image.resize((512, 512))

Again, clean up the models that are no longer needed. Before doing so, count how many parameters the loaded pipeline contains.

In [None]:
def count_parameters(model):
    return sum([p.numel() for p in model.parameters()])

qwen_image_edit_params = count_parameters(qwen_image_edit)
print(qwen_image_edit_params)

In [None]:
del qwen_image_edit
torch.cuda.empty_cache()

### Merging subjects with Qwen-Image-Edit-2509

You now have two images: the woman in the forest and the man with flowers. Using Qwen-Image-Edit-2509 (a later release from September 2025), which supports multi-image editing, you can merge these two independent images into a single cohesive scene where the characters are interacting.

In [None]:
model_path="/root/.cache/huggingface/"

qwen_image_edit_2509 = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image-Edit-2509", local_model_path=model_path, skip_download=True, origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="tokenizer/"),
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", local_model_path=model_path, skip_download=True, origin_file_pattern="processor/"),
)
qwen_image_edit_2509.enable_lora_magic()
print("✅ Loaded Qwen-Image-Edit-2509 from cache (no download)")

Now, generate a photo of these two people together.

In [None]:
# In English, the prompt says: Please generate a photo of this loving couple.

prompt = "이 사랑스러운 커플의 사진을 생성해 주세요."
image = qwen_image_edit_2509(prompt, edit_image=[Image.open("image_fullbody.jpg"), Image.open("image_man.jpg")], seed=3, num_inference_steps=40)
image.save("image_merged.jpg")
image.resize((512, 512))

In [None]:
def count_parameters(model):
    return sum([p.numel() for p in model.parameters()])

qwen_image_edit_2509_params = count_parameters(qwen_image_edit_2509)
print(qwen_image_edit_2509_params)

In [None]:
del qwen_image_edit_2509
torch.cuda.empty_cache()

In [None]:
# Total parameters from all 3 models:

total_params = qwen_image_params + qwen_image_edit_params + qwen_image_edit_2509_params
print("Total Parameters: ", total_params)

<a id="step6"></a>

## Step 6: The power of the Instinct MI300X

**Total Parameters**: ~86 Billion

**Important note**: All three models can be loaded simultaneously on a single AMD Instinct MI300X GPU, which has 192 GB of VRAM. For the simplicity of the workshop, however, each participant uses a fixed, dedicated share of VRAM.

Keeping all three models resident at once would exceed the memory of most single GPUs. The AMD Instinct MI300X GPU can keep all of them in memory for seamless switching between inference, editing, and training tasks!
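
A back-of-the-envelope calculation shows why the capacity matters. The pipelines in this tutorial are loaded in bfloat16, which uses 2 bytes per parameter. The sketch below estimates the memory needed for the weights alone, using the approximate 86B total from this tutorial; activations, LoRA weights, and framework overhead come on top and are not modeled here.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total_params = 86_000_000_000   # approximate total reported in this tutorial
capacity_gb = 192               # MI300X memory capacity stated above

needed_gb = weight_memory_gb(total_params)
print(f"Weights in bfloat16: {needed_gb:.0f} GB of {capacity_gb} GB")
print(f"Headroom for everything else: {capacity_gb - needed_gb:.0f} GB")
```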

<a id="step7"></a>

## Step 7: Training a custom LoRA

Finally, it’s time to move from inference to training. Train a custom LoRA adapter to teach the model a specific concept, in this case, a specific dog.

### Prepare the dataset

Download a small dataset containing five images of a dog and the associated metadata.


In [None]:
#!pip install datasets
from modelscope import dataset_snapshot_download

dataset_snapshot_download("Artiprocher/dataset_dog", allow_file_pattern=["*.jpg", "*.csv"], local_dir="dataset")
images = [Image.open(f"dataset/{i}.jpg") for i in range(1, 6)]
Image.fromarray(np.concatenate([np.array(image.resize((256, 256))) for image in images], axis=1))

This is the metadata for this dataset, including annotated image descriptions.

In [None]:
pd.read_csv("dataset/metadata.csv")

### Run the training script

Download the official training script and launch it using the accelerate command.

In [None]:
!wget https://github.com/modelscope/DiffSynth-Studio/raw/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3/examples/qwen_image/model_training/train.py

Run the training task.

In [None]:
# Do NOT run this cell until you have edited train.py.

# To use the cached models, add the following two lines to train.py:
# model_path="/root/.cache/huggingface/"
# local_model_path=model_path, skip_download=True,

cmd = rf"""
accelerate launch train.py \
  --dataset_base_path dataset \
  --dataset_metadata_path dataset/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 1 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "lora_dog" \
  --lora_base_model "dit" \
  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
  --lora_rank 32 \
  --dataset_num_workers 2 \
  --find_unused_parameters
""".strip()
os.system(cmd)
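
Note that os.system returns the exit status of the command, and the cell above discards it, so a failed training run can go unnoticed. One alternative, sketched here, is to convert the same command string into an argument list and run it with subprocess so that a non-zero exit code raises an error. The helper names to_argv and run_checked are hypothetical.

```python
import shlex
import subprocess


def to_argv(command):
    """Turn a shell-style command with backslash line breaks into an argv list."""
    # Join backslash-continued lines first, then split like a POSIX shell.
    joined = command.replace("\\\n", " ")
    return shlex.split(joined)


def run_checked(command):
    """Run the command without a shell and fail loudly on a non-zero exit code."""
    completed = subprocess.run(to_argv(command))
    if completed.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {completed.returncode}")
    return completed.returncode


# Usage (assumes cmd is the string defined in the cell above):
# run_checked(cmd)
```

Because no shell is involved, the * characters in the --model_id_with_origin_paths value are passed to train.py unchanged.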










Key parameters:

--dataset_base_path dataset
--dataset_metadata_path dataset/metadata.csv
Data location and annotations.


--max_pixels 1048576
A maximum of about 1 megapixel per image (1024×1024 = 1,048,576 pixels). Higher = better quality but slower.

--dataset_repeat 50
Repetition: each image is seen 50 times per epoch. With 5 images, that is 250 training samples per epoch.

--learning_rate 1e-4
Adam optimizer step size. Too high = unstable, too low = slow.

--num_epochs 1
One pass through all data. With repeat=50, that's 50 passes per image.

--lora_base_model "dit"
Apply LoRA only to the diffusion transformer (DIT), not text encoder or VAE. Why?
- Text encoder: Already generalizes well
- VAE: Task-agnostic (encoding/decoding doesn't change)
- DIT: Where visual concepts are learned

--lora_target_modules "to_q,to_k,to_v,add_q_proj,..."
Specific layers to adapt. This targets:
- Query, Key, Value projections (attention mechanism)
- MLP layers (feed-forward networks)
- Modulation layers (adaptive normalization)

**Why these modules?**
- Attention: Controls what the model focuses on
- MLPs: Feature transformations
- Modulation: Conditioning mechanisms

--lora_rank 32
A critical hyperparameter. With rank 32, for a 4096×4096 weight matrix:
- Original: about 16.8M parameters (4096×4096)
- LoRA: about 262K trainable parameters (4096×32 + 32×4096)
- Reduction: about 98.4% fewer trainable parameters

Higher rank = more expressivity but more parameters.
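
The rank arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch for a single square weight matrix; lora_param_count is a hypothetical helper, and the 4096×4096 shape is the illustrative size used above rather than a measured layer size from Qwen-Image.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the two low-rank factors: (d_in x rank) and (rank x d_out)."""
    return d_in * rank + rank * d_out


full = 4096 * 4096
lora = lora_param_count(4096, 4096, rank=32)
reduction = 1 - lora / full

print(full, lora, f"{reduction:.1%}")   # 16777216 262144 98.4%

# Doubling the rank doubles the number of trainable parameters.
print(lora_param_count(4096, 4096, rank=64))   # 524288
```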

--dataset_num_workers 2
Parallel data loading. Limited by CPU cores and I/O.

--find_unused_parameters
For distributed training: detect unused parameters in backward pass.


<a id="step8"></a>

## Step 8: Inference with the custom LoRA

Now that training is complete, load the model again, inject the newly trained lora_dog, and verify that the model recognizes your specific dog.

In [None]:
# Reload base model from pre-downloaded HuggingFace cache
model_path="/root/.cache/huggingface/"

qwen_image = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", local_model_path=model_path, skip_download=True, origin_file_pattern="tokenizer/"),
)
qwen_image.enable_lora_magic()
print("✅ Loaded Qwen-Image from cache for LoRA inference (no download)")

Next, load the trained LoRA and generate images of the dog.

In [None]:
qwen_image.load_lora(
    qwen_image.dit,
    "lora_dog/epoch-0.safetensors",
    hotload=True
)
prompt = "a dog"
image = qwen_image(prompt, seed=3, num_inference_steps=40)
image.resize((512, 512))
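
The path above is hard-coded to the checkpoint of the first epoch. If you train for more epochs, a small helper can pick the most recent checkpoint instead. This is a minimal sketch: latest_checkpoint is a hypothetical helper, and the epoch-N.safetensors naming pattern is an assumption inferred from the single file name used above.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the epoch-N.safetensors file with the highest N, or None."""
    pattern = re.compile(r"^epoch-(\d+)\.safetensors$")
    found = []
    for path in Path(output_dir).glob("epoch-*.safetensors"):
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    # Compare by epoch number only, so epoch-10 sorts after epoch-2.
    return max(found, key=lambda item: item[0])[1]


checkpoint = latest_checkpoint("lora_dog")
print(checkpoint if checkpoint else "No checkpoint found; run the training step first.")
```

The result can be converted with str(checkpoint) and passed to load_lora in place of the hard-coded path.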

Generate another image of the dog.

In [None]:
prompt = "a dog is jumping."
image = qwen_image(prompt, seed=3, num_inference_steps=40)
image.resize((512, 512))



<a id="Conclusion"></a>

## Conclusion

This tutorial demonstrated the end-to-end capabilities of the AMD Instinct MI300X.

You performed inference with models totaling roughly 86B parameters, edited images with high consistency, and trained a custom adapter, all on a single GPU.

Want to try this yourself with full GPU power? Sign up for AMD Dev Cloud and create a droplet.

**Show us what you build on our GPUs**

Try the following next steps yourself:

- If you wanted to create a LoRA for generating images in the style of a specific artist, what would be your training approach?
- Try prompts mixing languages or using language-specific cultural references
- What safeguards would you implement if deploying Image Merging as a product?

**Next steps**: Don't forget to explore the real power of AMD GPUs with your free cloud credit when you sign up for the [AMD AI Developer Program](https://www.amd.com/en/developer/ai-dev-program.html).