<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# NVIDIA ModelOpt

[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have `nvidia_modelopt` installed.

```bash
pip install -U "nvidia_modelopt[hf]"
```

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below quantizes only the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig
```

> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found in the [ModelOpt examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts the following parameters:

- `quant_type`: A string naming one of the quantization types listed below.
- `modules_to_not_convert`: A list of full or partial module names that should not be quantized. For example, to skip quantization of the [`SD3Transformer2DModel`]'s `pos_embed` projection blocks, specify `modules_to_not_convert=["pos_embed.proj.weight"]`.
- `disable_conv_quantization`: A boolean that, when set to `True`, disables quantization for all convolutional layers in the model. This is useful because channel and block quantization (used with INT4, NF4, and NVFP4) generally don't work well with convolutional layers. To disable quantization for specific convolutional layers only, use `modules_to_not_convert` instead.
- `algorithm`: The algorithm used to determine the scale. Defaults to `"max"`. See the ModelOpt documentation for other algorithms and details.
- `forward_loop`: The forward loop function used to calibrate activations during quantization. If not provided, quantization relies on static scale values computed from the weights only.
- `kwargs`: A dict of keyword arguments passed to the underlying quantization method, which is invoked based on `quant_type`.
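
As a rough illustration of how full or partial names in `modules_to_not_convert` could select what to skip, the sketch below assumes plain substring matching. The helper `should_quantize` and the module names are hypothetical and are not part of Diffusers or ModelOpt; the library's actual matching rules may differ.

```python
def should_quantize(module_name, modules_to_not_convert):
    """Return False if any full or partial name in the skip list occurs in module_name.

    Assumption: matching is a plain substring test. This is an illustrative
    stand-in, not the matching logic used by Diffusers or ModelOpt.
    """
    return not any(skip in module_name for skip in modules_to_not_convert)


# Hypothetical parameter names, loosely modeled on a transformer checkpoint.
module_names = [
    "pos_embed.proj.weight",
    "transformer_blocks.0.attn.to_q.weight",
    "transformer_blocks.0.ff.net.0.proj.weight",
]

skip_list = ["pos_embed.proj.weight"]
to_quantize = [name for name in module_names if should_quantize(name, skip_list)]
print(to_quantize)
```

Under this assumption, a shorter entry such as `"pos_embed"` would skip every module whose name contains that fragment, which is why partial names are convenient for excluding whole sub-blocks.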

## Supported quantization types

ModelOpt supports weight-only, channel, and block quantization for the int8, fp8, int4, nf4, and nvfp4 data types. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
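
To make the idea concrete, here is a minimal pure-Python sketch of symmetric weight-only quantization with a max-based scale. It is not ModelOpt's implementation: the function names, the int8 range of [-127, 127], and the single per-tensor scale are assumptions chosen for illustration.

```python
def quantize_int8_max(weights):
    """Quantize a list of floats to int8 values using one max-based scale.

    Assumption: symmetric per-tensor quantization, scale = max(|w|) / 127.
    """
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights for higher-precision computation."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8_max(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Only the integers and the scale need to be stored. Because the weights are converted back to floats before they are used, the activations keep their original precision and memory footprint.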

The supported quantization methods are as follows:

| Quantization type | Supported schemes | Required kwargs | Additional notes |
|---|---|---|---|
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | Only `channel_quantize = -1` and `scale_channel_quantize = -1` are supported for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |

Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
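
The block quantization schemes listed above can be pictured with a small pure-Python sketch in which each block of weights gets its own max-based scale. This is an illustration under stated assumptions, not ModelOpt's kernel: `quantize_blocks`, `block_size`, and the 4-bit integer range of [-7, 7] are hypothetical choices, and `block_size` only loosely mirrors what `block_quantize` suggests.

```python
def quantize_blocks(row, block_size, qmax=7):
    """Quantize a 1-D list of weights block by block, one scale per block."""
    quantized, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([max(-qmax, min(qmax, round(w / scale))) for w in block])
    return quantized, scales


def dequantize_blocks(quantized, scales):
    """Multiply every quantized value by the scale of its block."""
    return [q * s for block, s in zip(quantized, scales) for q in block]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# The first four weights are much smaller than the last four.
row = [0.01, -0.02, 0.03, 0.015, 2.0, -1.5, 0.7, 1.1]

per_tensor = dequantize_blocks(*quantize_blocks(row, block_size=len(row)))
per_block = dequantize_blocks(*quantize_blocks(row, block_size=4))

print(max_abs_error(row[:4], per_tensor[:4]))
print(max_abs_error(row[:4], per_block[:4]))
```

With a single scale, the large weights dominate and the small ones collapse toward zero. Giving each block its own scale keeps the small weights distinguishable, at the cost of storing one extra scale per block.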

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing
```

docs/source/en/tutorials/autopipeline.md (29 additions, 85 deletions)

@@ -12,112 +12,56 @@ specific language governing permissions and limitations under the License.

# AutoPipeline

-Diffusers provides many pipelines for basic tasks like generating images, videos, audio, and inpainting. On top of these, there are specialized pipelines for adapters and features like upscaling, super-resolution, and more. Different pipeline classes can even use the same checkpoint because they share the same pretrained model! With so many different pipelines, it can be overwhelming to know which pipeline class to use.
+[AutoPipeline](../api/models/auto_model) is a *task-and-model* pipeline that automatically selects the correct pipeline subclass based on the task. It handles the complexity of loading different pipeline subclasses, so you don't need to know the specific pipeline subclass name.

-The [AutoPipeline](../api/pipelines/auto_pipeline) class is designed to simplify the variety of pipelines in Diffusers. It is a generic *task-first* pipeline that lets you focus on a task ([`AutoPipelineForText2Image`], [`AutoPipelineForImage2Image`], and [`AutoPipelineForInpainting`]) without needing to know the specific pipeline class. The [AutoPipeline](../api/pipelines/auto_pipeline) automatically detects the correct pipeline class to use.
+This is unlike [`DiffusionPipeline`], a *model-only* pipeline that automatically selects the pipeline subclass based on the model.

-For example, let's use the [dreamlike-art/dreamlike-photoreal-2.0](https://hf.co/dreamlike-art/dreamlike-photoreal-2.0) checkpoint.
-
-Under the hood, [AutoPipeline](../api/pipelines/auto_pipeline):
-
-1. Detects a `"stable-diffusion"` class from the [model_index.json](https://hf.co/dreamlike-art/dreamlike-photoreal-2.0/blob/main/model_index.json) file.
-2. Depending on the task you're interested in, it loads the [`StableDiffusionPipeline`], [`StableDiffusionImg2ImgPipeline`], or [`StableDiffusionInpaintPipeline`]. Any parameter (`strength`, `num_inference_steps`, etc.) you would pass to these specific pipelines can also be passed to the [AutoPipeline](../api/pipelines/auto_pipeline).
-
-<hfoptions id="autopipeline">
-<hfoption id="text-to-image">
+[`AutoPipelineForImage2Image`] returns a specific pipeline subclass (for example, [`StableDiffusionXLImg2ImgPipeline`]), which can only be used for image-to-image tasks.
Notice how the [dreamlike-art/dreamlike-photoreal-2.0](https://hf.co/dreamlike-art/dreamlike-photoreal-2.0) checkpoint is used for both text-to-image and image-to-image tasks? To save memory and avoid loading the checkpoint twice, use the [`~DiffusionPipeline.from_pipe`] method.
Loading the same model with [`DiffusionPipeline`] returns the [`StableDiffusionXLPipeline`] subclass. It can be used for text-to-image, image-to-image, or inpainting tasks depending on the inputs.
Check the [mappings](https://github.com/huggingface/diffusers/blob/130fd8df54f24ffb006d84787b598d8adc899f23/src/diffusers/pipelines/auto_pipeline.py#L114) to see whether a model is supported or not.

-</hfoption>
-</hfoptions>
-
-## Unsupported checkpoints
-
-The [AutoPipeline](../api/pipelines/auto_pipeline) supports [Stable Diffusion](../api/pipelines/stable_diffusion/overview), [Stable Diffusion XL](../api/pipelines/stable_diffusion/stable_diffusion_xl), [ControlNet](../api/pipelines/controlnet), [Kandinsky 2.1](../api/pipelines/kandinsky.md), [Kandinsky 2.2](../api/pipelines/kandinsky_v22), and [DeepFloyd IF](../api/pipelines/deepfloyd_if) checkpoints.
-
-If you try to load an unsupported checkpoint, you'll get an error.
+Trying to load an unsupported model returns an error.

```
"ValueError: AutoPipeline can't find a pipeline linked to ShapEImg2ImgPipeline for None"
```
+
+There are three types of [AutoPipeline](../api/models/auto_model) classes: [`AutoPipelineForText2Image`], [`AutoPipelineForImage2Image`], and [`AutoPipelineForInpainting`]. Each of these classes has a predefined mapping that links a pipeline to its task-specific subclass.
+
+When [`~AutoPipelineForText2Image.from_pretrained`] is called, it extracts the class name from the `model_index.json` file and selects the appropriate pipeline subclass for the task based on the mapping.
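
The lookup described above can be sketched roughly as follows. This is a simplified stand-in, not the Diffusers source: the mappings are heavily abbreviated, they hold class names as strings instead of classes, and the helper names `find_model_name` and `select_pipeline` are hypothetical. The `_class_name` key is the field that `model_index.json` uses to record the pipeline class.

```python
import json

# Abbreviated, illustrative mappings from a model family to a task-specific pipeline.
TEXT2IMAGE_MAPPING = {
    "stable-diffusion": "StableDiffusionPipeline",
    "stable-diffusion-xl": "StableDiffusionXLPipeline",
}
IMAGE2IMAGE_MAPPING = {
    "stable-diffusion": "StableDiffusionImg2ImgPipeline",
    "stable-diffusion-xl": "StableDiffusionXLImg2ImgPipeline",
}
ALL_MAPPINGS = [TEXT2IMAGE_MAPPING, IMAGE2IMAGE_MAPPING]


def find_model_name(class_name):
    """Return the model family whose mappings contain class_name, or None."""
    for mapping in ALL_MAPPINGS:
        for model_name, pipeline_name in mapping.items():
            if pipeline_name == class_name:
                return model_name
    return None


def select_pipeline(model_index_json, task_mapping):
    """Pick the task-specific pipeline name for the class recorded in model_index.json."""
    class_name = json.loads(model_index_json)["_class_name"]
    model_name = find_model_name(class_name)
    if model_name in task_mapping:
        return task_mapping[model_name]
    raise ValueError(
        f"AutoPipeline can't find a pipeline linked to {class_name} for {model_name}"
    )


model_index = '{"_class_name": "StableDiffusionXLPipeline"}'
print(select_pipeline(model_index, IMAGE2IMAGE_MAPPING))
```

A class name that appears in none of the mappings resolves to `None` as its model family, which is how an error message ending in "for None" can arise for an unsupported model.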