feat: stream model conversion #1581

Draft
shikaku2 wants to merge 1 commit into leejet:master from shikaku2:feat/streaming-convert
@shikaku2

Split out from draft PR #1573.

Summary

Changes --convert to stream converted tensors instead of allocating the entire converted model in one ggml_context before writing the output file.

This PR intentionally only covers the regular conversion memory/threading path. RMSE-guided conversion is not included here and will be handled separately after this is reviewed.

What changed

  • Collect output tensor metadata first without loading tensor data.
  • Write GGUF or safetensors metadata/header up front.
  • Load, convert, and write tensors in batches instead of keeping every converted tensor resident until the end.
  • Parallelize tensor loading/conversion within each batch.
  • Cap each batch by output tensor bytes, so large tensors still stream with bounded peak memory while smaller tensors can use available CPU threads.
  • Reuse the existing convert(input_path, vae_path, output_path, output_type, tensor_type_rules, convert_name) API and CLI behavior.
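
A minimal sketch of the byte-capped batching described in the list above, assuming a greedy grouping over tensors in file order. `TensorMeta` and `plan_batches` are hypothetical names for illustration and are not taken from this PR; the actual batching code may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical metadata record (not a type from the PR): only the size of
// the converted tensor matters for planning.
struct TensorMeta {
    uint64_t out_bytes;  // bytes the converted tensor occupies in the output
};

// Greedily group consecutive tensors into batches whose total output size
// stays within max_batch_bytes. A tensor larger than the cap ends up in a
// batch of its own, so it still streams through with only itself resident.
std::vector<std::vector<size_t>> plan_batches(const std::vector<TensorMeta>& tensors,
                                              uint64_t max_batch_bytes) {
    std::vector<std::vector<size_t>> batches;
    std::vector<size_t> current;
    uint64_t current_bytes = 0;
    for (size_t i = 0; i < tensors.size(); ++i) {
        const uint64_t sz = tensors[i].out_bytes;
        // Close the current batch if adding this tensor would exceed the cap.
        if (!current.empty() && current_bytes + sz > max_batch_bytes) {
            batches.push_back(current);
            current.clear();
            current_bytes = 0;
        }
        current.push_back(i);
        current_bytes += sz;
    }
    if (!current.empty()) {
        batches.push_back(current);
    }
    return batches;
}
```

With this shape, peak memory for converted data is bounded by the larger of the cap and the largest single tensor, while batches of many small tensors leave room for several threads to work at once.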

What is not included

  • No RMSE option or RMSE type selection.
  • No AIO/separate text encoder/diffusion/VAE packaging changes.
  • No --lazy-load runtime behavior changes.

Validation

  • cmake --build build -j16
  • git diff --check
  • Tiny safetensors -> GGUF conversion: build/bin/sd-cli -M convert -m /tmp/sdcpp-convert-tiny.safetensors -o /tmp/sdcpp-convert-tiny-final.gguf --type f16
  • Full SD3.5 Medium conversion: time build/bin/sd-cli -M convert -m /home/aaron/models/sd3.5-medium/sd3.5_medium.safetensors -o /tmp/sd3.5_medium_streaming_convert.gguf
    • Output: /tmp/sd3.5_medium_streaming_convert.gguf, 4.8G
    • Completed successfully in about 3.7s wall time on my machine

Notes

This is a draft because the new streaming writer path should get review and broader testing across output formats and platforms before being marked ready.

@wbruna (Contributor) left a comment

First, about the coding style: this is placing format-specific logic inside convert.cpp. The format-specific code should go to the appropriate files inside model_io/, likely with a separate "write tensor" per file type. Note you should also avoid opening and closing the model files for each tensor, so some kind of "opened model file" abstraction will probably be needed. A "read the tensor at the specified offset" abstraction would probably make sense, too.
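
One possible shape for the "opened model file" and "read the tensor at the specified offset" abstractions, as a rough sketch only. `OpenedModelFile` and `read_at` are hypothetical names, not existing classes in model_io/, and a real version would likely sit behind the format-specific readers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical reader (not a class from the project): the file is opened
// once and reused for every tensor instead of being reopened per tensor.
class OpenedModelFile {
public:
    explicit OpenedModelFile(const std::string& path)
        : in_(path, std::ios::binary) {}

    bool is_open() const { return in_.is_open(); }

    // Read n_bytes starting at the given byte offset into out.
    // Returns false if the full range could not be read.
    bool read_at(uint64_t offset, size_t n_bytes, std::vector<char>& out) {
        out.resize(n_bytes);
        in_.clear();  // reset eof/fail bits left by a previous short read
        in_.seekg(static_cast<std::streamoff>(offset));
        in_.read(out.data(), static_cast<std::streamsize>(n_bytes));
        return in_.gcount() == static_cast<std::streamsize>(n_bytes);
    }

private:
    std::ifstream in_;
};
```

A single instance like this is not safe to share between threads, because the stream position is shared state; one instance per worker thread would sidestep that.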

I tried this on a .safetensors -> Q4_K .gguf conversion. On my machine it never saturated all CPU cores, so it ran much slower than the normal conversion (around 1/2 to 1/3 of the speed). I/O didn't seem to be the bottleneck: system and wait times remained low.

Looking at the code, my guess is the batching calculation: consistently using only 1 or 2 threads would explain this behavior. (The number of threads should also respect the --threads parameter, by the way.) The batching division also looks suboptimal: work is split between threads, then everything stops, everything is written, and the threads are started again. That prevents any overlap between conversion and writing; in addition, a thread that finishes much sooner than the others stays idle until the next batch.

I would avoid the fixed batching, and use a true pipeline instead: either n read+convert threads + 1 write thread, or n read+convert+write threads, controlling for the memory budget with a condition variable. I would bet on the second option: if writing is the bottleneck, you'd naturally parallelize it as well.
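
To make the second option concrete, here is a rough sketch of n read+convert+write workers gated by a memory budget and a condition variable. `MemoryBudget`, `run_pipeline`, and the `process` callback are hypothetical names for illustration; none of them come from the PR or the project, and error handling (for example, a `process` call that throws) is left out.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical byte budget shared by all workers.
class MemoryBudget {
public:
    explicit MemoryBudget(uint64_t limit) : limit_(limit) {}

    // Block until `bytes` fits. A request larger than the whole budget is
    // admitted once nothing else is in flight, so a huge tensor cannot
    // wait forever.
    void acquire(uint64_t bytes) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return used_ == 0 || used_ + bytes <= limit_; });
        used_ += bytes;
        peak_ = std::max(peak_, used_);
    }

    void release(uint64_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            used_ -= bytes;
        }
        cv_.notify_all();  // wake waiters so they can re-check the budget
    }

    uint64_t peak() const {
        std::lock_guard<std::mutex> lock(mu_);
        return peak_;
    }

private:
    mutable std::mutex mu_;
    std::condition_variable cv_;
    uint64_t limit_;
    uint64_t used_ = 0;
    uint64_t peak_ = 0;
};

// Each worker claims the next tensor index, waits for budget, then runs
// process(i), which stands in for read + convert + write of tensor i.
void run_pipeline(const std::vector<uint64_t>& tensor_bytes,
                  unsigned n_threads,
                  MemoryBudget& budget,
                  const std::function<void(size_t)>& process) {
    std::atomic<size_t> next{0};
    auto worker = [&] {
        for (;;) {
            const size_t i = next.fetch_add(1);
            if (i >= tensor_bytes.size()) return;
            budget.acquire(tensor_bytes[i]);
            process(i);
            budget.release(tensor_bytes[i]);
        }
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < std::max(1u, n_threads); ++t) {
        threads.emplace_back(worker);
    }
    for (auto& th : threads) {
        th.join();
    }
}
```

There is no batch barrier here: a worker that finishes early immediately claims the next tensor, and `n_threads` could be taken from the --threads parameter.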

Note you are not forced to write sequentially, either: you have offsets for each tensor, so they could be written as soon as they are ready, with each thread using its own open file object (I'd recommend preallocating the file at the beginning, to give the filesystem a better chance to avoid fragmentation issues). An out-of-order approach could also help with models with huge tensors, since you can try to overlap them with smaller ones.
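
A sketch of the out-of-order idea using only the C++17 standard library, assuming each tensor's output offset is already known from the header pass. `Chunk` and `write_out_of_order` are hypothetical names. Note that `std::filesystem::resize_file` only sets the file size and may produce a sparse file on many filesystems; truly reserving blocks would need a platform call such as `posix_fallocate`, which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical unit of output: converted bytes plus their final file offset.
struct Chunk {
    uint64_t offset;
    std::vector<char> data;
};

// Size the file up front, then let each thread write its share of chunks
// at their own offsets through its own stream, opened once per thread.
bool write_out_of_order(const std::string& path,
                        const std::vector<Chunk>& chunks,
                        uint64_t total_bytes,
                        unsigned n_threads) {
    { std::ofstream create(path, std::ios::binary | std::ios::trunc); }
    std::filesystem::resize_file(path, total_bytes);

    n_threads = std::max(1u, n_threads);
    std::vector<char> ok(n_threads, 1);  // one status slot per thread
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t] {
            // in|out keeps existing contents; out alone would truncate.
            std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
            if (!f) { ok[t] = 0; return; }
            // Chunks are assigned round-robin; ranges must not overlap.
            for (size_t i = t; i < chunks.size(); i += n_threads) {
                f.seekp(static_cast<std::streamoff>(chunks[i].offset));
                f.write(chunks[i].data.data(),
                        static_cast<std::streamsize>(chunks[i].data.size()));
            }
            f.flush();
            if (!f) ok[t] = 0;
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return std::all_of(ok.begin(), ok.end(), [](char c) { return c != 0; });
}
```

The round-robin assignment is only there to keep the sketch short; in a pipeline, each worker would write whichever tensor it has just finished converting.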
