feat: add per-model FP8 layerwise casting for VRAM reduction#8945

Draft
Pfannkuchensack wants to merge 10 commits into invoke-ai:main from Pfannkuchensack:feature/fp8-layerwise-casting

Conversation

Pfannkuchensack (Collaborator) commented Mar 6, 2026

FP8 Layerwise Casting - Implementation

Summary

Add a per-model fp8_storage option to model default settings. When enabled, diffusers' enable_layerwise_casting() stores weights in FP8 (float8_e4m3fn) and casts them to fp16/bf16 layer by layer during inference. This reduces VRAM usage by roughly 50% per model with minimal quality loss.

Supported: SD1/SD2/SDXL/SD3, Flux, Flux2, CogView4, Z-Image, VAE (diffusers-based), ControlNet, T2IAdapter.
Not applicable: Text Encoders, LoRA, GGUF, BnB, custom classes.

Related Issues / Discussions

QA Instructions

  1. Set fp8_storage: true in a model's default_settings (via API or Model Manager UI)
  2. Load the model and generate an image
  3. Verify VRAM usage is reduced compared to normal loading
  4. Verify image quality is acceptable (minimal degradation expected)
  5. Verify Text Encoders are NOT affected (excluded by submodel type filter)
  6. Verify non-CUDA devices gracefully ignore the setting

Test Matrix

  • SD1.5 Diffusers with fp8_storage=true - load and generate
  • SDXL Diffusers with fp8_storage=true - load and generate
  • SDXL Safetensors with fp8_storage=true - load and generate
  • Flux1 Diffusers with fp8_storage=true - load and generate
  • Flux1 Safetensors with fp8_storage=true - load and generate
  • Flux2klein Diffusers with fp8_storage=true - load and generate
  • Flux2klein Safetensors with fp8_storage=true - load and generate
  • CogView4 with fp8_storage=true - load and generate
  • VAE with fp8_storage=true - check quality
  • ControlNet with fp8_storage=true - load and generate
  • VRAM comparison: with vs. without fp8_storage
  • Image quality comparison: FP8 vs fp16/bf16
  • MPS/CPU: verify fp8_storage is silently ignored
  • Text Encoder submodels: verify FP8 is NOT applied
  • GGUF/BnB models: verify FP8 is gracefully skipped
  • Z-Image Turbo Diffusers with fp8_storage=true - load and generate (currently not working)

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Add per-model FP8 storage toggle in Model Manager default settings for
both main models and control adapter models. When enabled, model weights
are stored in FP8 format in VRAM (~50% savings) and cast layer-by-layer
to compute precision during inference via diffusers' enable_layerwise_casting().

Backend: add fp8_storage field to MainModelDefaultSettings and
ControlAdapterDefaultSettings, apply FP8 layerwise casting in all
relevant model loaders (SD, SDXL, FLUX, CogView4, Z-Image, ControlNet,
T2IAdapter, VAE). Gracefully skips non-ModelMixin models (custom
checkpoint loaders, GGUF, BnB).

Frontend: add FP8 Storage switch to model default settings panels with
InformationalPopover, translation keys, and proper form handling.
github-actions bot added labels python (PRs that change python files), backend (PRs that change backend files), frontend (PRs that change frontend files) on Mar 6, 2026
JPPhoto (Collaborator) left a comment:

In my quantized Krea dev setup, your code was never called - is this by design or an overlooked class?

I'd also like the UI to be tweaked so the fp8 setting appears as a single slider under Settings like CPU-only for text encoders rather than as a dual-slider in the model defaults section.

def _should_use_fp8(self, config: AnyModelConfig, submodel_type: Optional[SubModelType] = None) -> bool:
"""Check if FP8 layerwise casting should be applied to a model."""
# FP8 storage only works on CUDA
if self._torch_device.type != "cuda":
Collaborator:

Shouldn't this check self._get_execution_device() to make sure the model is to be executed on cuda?

Pfannkuchensack (Author):

It checks "does the system even have CUDA?", not "does this model run on CUDA?". Both lead to the same result, but _torch_device is semantically a better fit for a hardware capability check. The only difference is that _get_execution_device() requires config and submodel_type as parameters, while _torch_device does not.
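The behavior being discussed can be illustrated with a simplified, hypothetical version of the gate (the real method also inspects the model config and submodel type):

```python
import torch

def should_use_fp8(torch_device: torch.device, fp8_storage: bool) -> bool:
    """Hypothetical, simplified gate: FP8 layerwise casting is treated as a
    CUDA-only capability, so non-CUDA devices silently ignore the setting."""
    if torch_device.type != "cuda":
        return False
    return fp8_storage

# On a CPU-only (or MPS) machine the per-model setting is a no-op:
print(should_use_fp8(torch.device("cpu"), True))  # False
print(should_use_fp8(torch.device("mps"), True))  # False
```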

local_files_only=True,
)

model = self._apply_fp8_layerwise_casting(model, config, submodel_type)
Collaborator:

Does this only apply to the v2 VAE and diffusers models? What about GGUF?

Pfannkuchensack (Author):

GGUF and BnB models are intentionally excluded — they already use their own quantization (typically Q4/Q8), so applying FP8 layerwise casting on top would be redundant and likely conflict with their dequantization logic during inference.

FluxCheckpointModel and Flux2CheckpointModel were missing the
_apply_fp8_layerwise_casting call. Additionally, the FP8 casting
only worked for diffusers ModelMixin models. Add manual layerwise
casting via forward hooks for plain nn.Module (custom Flux class).

Also simplify FP8 UI toggle from dual-slider to single switch,
matching the CPU-only toggle pattern per review feedback on invoke-ai#8945.

Z-Image's transformer has dtype mismatches with diffusers'
enable_layerwise_casting: skipped modules (t_embedder, cap_embedder)
stay in bf16 while hooked modules cast to fp16, causing crashes in
attention layers. Also hide the FP8 toggle in the UI for Z-Image models.

Models like Flux are loaded in bf16 but the global torch dtype is fp16,
causing dtype mismatches during FP8 layerwise casting. Detect the
model's actual parameter dtype and use it as compute_dtype for both
diffusers ModelMixin and plain nn.Module models.
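Detecting the model's real parameter dtype instead of trusting the global default is a one-liner; the helper name here is hypothetical:

```python
import torch
import torch.nn as nn

def detect_compute_dtype(
    model: nn.Module, fallback: torch.dtype = torch.float16
) -> torch.dtype:
    """Return the dtype the model was actually loaded in (e.g. bf16 for Flux),
    falling back if it has no floating-point parameters."""
    for p in model.parameters():
        if p.dtype.is_floating_point:
            return p.dtype
    return fallback

model = nn.Linear(4, 4).to(torch.bfloat16)
print(detect_compute_dtype(model))  # torch.bfloat16
```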

Labels

backend (PRs that change backend files), frontend (PRs that change frontend files), python (PRs that change python files), v6.13.x

Projects

Status: 6.13.x

3 participants