
Support GLM-Image model quantization #1512

Merged
chensuyue merged 17 commits into main from lvl/support_glm_image on Mar 21, 2026

Conversation

@lvliang-intel
Contributor

Description

Quantize, save, and evaluate the zai-org/GLM-Image model in w4a16 format.

Model: https://huggingface.co/zai-org/GLM-Image
Target dtypes: w4a16

Save the quantized model for vllm-omni.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

#1509

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings March 8, 2026 13:24
pre-commit-ci bot and others added 3 commits March 8, 2026 13:26
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…-round into lvl/support_glm_image

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds GLM-Image (zai-org/GLM-Image) support for w4a16 quantization workflows, including pipeline-style (model_index.json) loading and correct export layout for vLLM-Omni-style directories.

Changes:

  • Add GLM-Image multimodal block discovery and register glm_image in MLLM handling/template registries.
  • Support loading models from diffusers-style pipeline directories (local + remote) and propagate pipeline subfolder metadata for downstream export/sharding.
  • Update export paths to save model weights under the pipeline’s model component subfolder while saving processor/tokenizer assets in the appropriate location.
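The pipeline-directory loading in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the actual auto_round implementation; the function name and the default `transformer` component are assumptions:

```python
import json
import os


def find_pipeline_model_subfolder(pipeline_dir, component="transformer"):
    """Resolve the subfolder holding a pipeline component's weights.

    A diffusers-style pipeline directory contains a model_index.json
    mapping component names (e.g. "transformer", "vae") to
    (library, class) pairs; each component's weights live in a
    subfolder of the same name. Returns the subfolder name, or None
    if the directory is not a pipeline or lacks that component.
    """
    index_path = os.path.join(pipeline_dir, "model_index.json")
    if not os.path.isfile(index_path):
        return None  # plain (non-pipeline) model directory
    with open(index_path) as f:
        index = json.load(f)
    if component not in index:
        return None
    return component
```

The resolved subfolder would then be passed as `subfolder=` to `from_pretrained` and recorded on the model so export and shard writing can mirror the same layout.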

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/test_cpu/models/test_glm_image.py | Adds unit tests for GLM-Image helpers and pipeline subfolder discovery (but currently imports a missing module). |
| auto_round/utils/model.py | Adds local/remote pipeline subfolder resolution and uses it in mllm_load_model; passes subfolder to from_pretrained; tags loaded models with pipeline component metadata. |
| auto_round/utils/common.py | Adds vqmodel to multimodal key list. |
| auto_round/special_model_handler.py | Registers glm_image in limited-batch and only-text quantization lists; adds GLM-Image block-name helper to SPECIAL_MULTIMODAL_BLOCK. |
| auto_round/export/utils.py | Adds pipeline directory detection and export layout resolution; copies pipeline artifacts to output layout. |
| auto_round/export/export_to_autoround/export.py | Saves tokenizer/processor to processor output dir; saves model weights to model component dir when exporting pipeline models. |
| auto_round/export/export_to_autogptq/export.py | Same as above for AutoGPTQ export paths. |
| auto_round/compressors/shard_writer.py | Writes shard outputs under the pipeline model component subfolder when applicable. |
| auto_round/compressors/mllm/utils.py | Adds vqmodel to VISUAL_KEYS. |
| auto_round/compressors/mllm/template.py | Registers glm_image template using the HF processor. |
| auto_round/compressors/mllm/compressor.py | Ensures image_processor is forwarded into export/save path. |
| auto_round/autoround.py | Switches to MLLM mode when processor/image_processor are provided via kwargs. |
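The last row's mode switch amounts to a small predicate. A hedged sketch of the idea (names and signature are illustrative, not the actual autoround.py code):

```python
def should_use_mllm_mode(kwargs, explicit_mllm=False):
    """Decide whether to route into the multimodal (MLLM) pipeline.

    If the caller passes a processor or image_processor via kwargs,
    treat the model as multimodal even when MLLM mode was not
    requested explicitly.
    """
    if explicit_mllm:
        return True
    return (
        kwargs.get("processor") is not None
        or kwargs.get("image_processor") is not None
    )
```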

@lvliang-intel
Contributor Author

Inference test with quantized model using transformers:

1. T2I test

```
CUDA_VISIBLE_DEVICES=5 python run_glm_image.py --model-dir tmp_glm_image_w4a16 --prompt "A watercolor fox reading a book"
Loading GLM-Image pipeline from /mnt/disk1/lvl/auto-round-main/tmp_glm_image_w4a16 ...
Loading checkpoint shards: 100%|████████████████████████████████████████| 3/3 [00:01<00:00, 1.60it/s]
Loading weights: 100%|████████████████████████████████████████████| 111/111 [00:00<00:00, 1417.02it/s]
Loading pipeline components...: 86%|██████████████████████████████ | 6/7 [00:05<00:00, 2.04it/s]2026-03-08 21:08:39 INFO device.py L1643: [Memory Monitor] Before Applying general replacements: 'peak_ram': 1.62GB
2026-03-08 21:08:39 WARNING modeling_utils.py L4430: loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-03-08 21:08:39 INFO device.py L1646: [Memory Monitor] After Applying general replacements: 'peak_ram': 1.62GB
2026-03-08 21:08:39 WARNING backend.py L1084: Better backend is found, please install all the following requirements to enable it.
2026-03-08 21:08:39 WARNING backend.py L1084: pip install -v "gptqmodel>=2.0" --no-build-isolation
Loading weights: 100%|██████████████████████████████████████████| 1491/1491 [00:00<00:00, 1522.83it/s]
Loading pipeline components...: 100%|███████████████████████████████████| 7/7 [00:07<00:00, 1.07s/it]
Pipeline loaded.

Prompt: A watercolor fox reading a book
Mode: T2I
Reference images: 0
Height: 1152
Width: 768
Steps: 50
Guidance scale: 1.5
Seed: 42
Device map: cuda
100%|█████████████████████████████████████████████████████████████████| 50/50 [00:33<00:00, 1.49it/s]
Saved rendered image to glm_image_output.png
```

2. I2I test

```
CUDA_VISIBLE_DEVICES=5 python run_glm_image.py --model-dir tmp_glm_image_w4a16 --i2i-demo
Loading GLM-Image pipeline from /mnt/disk1/lvl/auto-round-main/tmp_glm_image_w4a16 ...
Loading checkpoint shards: 100%|████████████████████████████████████████| 3/3 [00:01<00:00, 1.62it/s]
Loading weights: 100%|████████████████████████████████████████████| 111/111 [00:00<00:00, 1424.23it/s]
Loading pipeline components...: 86%|██████████████████████████████ | 6/7 [00:04<00:00, 1.21it/s]2026-03-08 21:21:30 INFO device.py L1643: [Memory Monitor] Before Applying general replacements: 'peak_ram': 1.69GB
2026-03-08 21:21:30 WARNING modeling_utils.py L4430: loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-03-08 21:21:30 INFO device.py L1646: [Memory Monitor] After Applying general replacements: 'peak_ram': 1.69GB
2026-03-08 21:21:30 WARNING backend.py L1084: Better backend is found, please install all the following requirements to enable it.
2026-03-08 21:21:30 WARNING backend.py L1084: pip install -v "gptqmodel>=2.0" --no-build-isolation
Loading weights: 100%|██████████████████████████████████████████| 1491/1491 [00:00<00:00, 1520.15it/s]
Loading pipeline components...: 100%|███████████████████████████████████| 7/7 [00:07<00:00, 1.04s/it]
Pipeline loaded.

No reference image provided; generating a synthetic condition image ...
Prompt: Replace the background of the snow forest with an underground station featuring an automatic escalator.
Mode: I2I
Reference images: 1
Height: 1056
Width: 1024
Steps: 50
Guidance scale: 1.5
Seed: 42
Device map: cuda
100%|█████████████████████████████████████████████████████████████████| 50/50 [00:43<00:00, 1.14it/s]
Saved rendered image to glm_image_output_i2i.png
```

@lvliang-intel
Contributor Author

Currently only the autoregressive part is quantized; I'm working on designing a hybrid mode that quantizes both the autoregressive and diffusion parts.
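Restricting quantization to the autoregressive part can be sketched as a name filter over the model's blocks. The prefixes below are illustrative (vqmodel comes from this PR's key lists; the others are hypothetical), and real GLM-Image module names may differ:

```python
# Hypothetical non-text prefixes; only "vqmodel" is taken from this PR.
DIFFUSION_OR_VISION_PREFIXES = ("vqmodel", "vision_tower", "diffusion", "vae")


def select_quantizable_blocks(all_block_names):
    """Keep only autoregressive (text) blocks for quantization.

    Blocks whose names start with a known diffusion/vision prefix are
    skipped, so only the language-model transformer blocks are tuned.
    """
    return [
        name
        for name in all_block_names
        if not name.startswith(DIFFUSION_OR_VISION_PREFIXES)
    ]
```

A hybrid mode would instead partition the names into two groups and apply a separate quantization recipe to each.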

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@chensuyue chensuyue added this to the 0.12.0 milestone Mar 16, 2026
lvliang-intel and others added 5 commits March 17, 2026 16:29
@lvliang-intel
Contributor Author

Verified with vllm-omni (based on PR vllm-project/vllm-omni#1777 and some adaptations):

1. Original GLM-Image model

```
cd vllm-omni/examples/offline_inference/glm_image
CUDA_VISIBLE_DEVICES=5 python end2end.py --model-path /mnt/disk5/lvl/GLM-Image --config-path /mnt/disk1/lvl/vllm-omni/vllm_omni/model_executor/stage_configs/glm_image.yaml --prompt "A cat sitting on the table" --output cat_orig.png
```

cat_orig

2. Quantized GLM-Image model

```
CUDA_VISIBLE_DEVICES=5 python end2end.py --model-path /mnt/disk1/lvl/auto-round-main/tmp_glm_image_w4a16/ --config-path /mnt/disk1/lvl/vllm-omni/vllm_omni/model_executor/stage_configs/glm_image.yaml --prompt "A cat sitting on the table" --output cat_quantized.png
```

cat_quantized

end2end.py

Contributor

@yiliu30 yiliu30 left a comment


LGTM

@chensuyue chensuyue merged commit 62f92d1 into main Mar 21, 2026
30 checks passed
@chensuyue chensuyue deleted the lvl/support_glm_image branch March 21, 2026 09:08