feat: stream model conversion#1581
Conversation
wbruna
left a comment
There was a problem hiding this comment.
First, about the coding style: this is placing format-specific logic inside convert.cpp. The format-specific code should go to the appropriate files inside model_io/, likely with a separate "write tensor" per file type. Note you should also avoid opening and closing the model files for each tensor, so some kind of "opened model file" abstraction will probably be needed. A "read the tensor at the specified offset" abstraction would probably make sense, too.
I gave this a try for a .safetensors -> Q4_K .gguf. On my machine, it was never able to saturate all CPU cores, so it got much slower than the normal conversion (around 1/2 - 1/3 speed). I/O didn't seem to be the bottleneck: system and wait times remained low.
Looking at the code, my guess would be the batching calculation: it would explain this behavior if for some reason it consistently used only 1 or 2 threads (the number of threads should also respect the --threads parameter by the way). The batching division also looks sub-optimal: you split up work between threads, then stop everything, write everything, then open threads again. So you are not allowing an overlap between the conversion and the writing; plus, a thread could finish much sooner than the others, and would stay idle until the next batch.
I would avoid the fixed batching, and use a true pipeline instead: either n read+convert threads + 1 write thread, or n read+convert+write threads, controlling for the memory budget with a condition variable. I would bet on the second option: if writing is the bottleneck, you'd naturally parallelize it as well.
Note you are not forced to write sequentially, either: you have offsets for each tensor, so they could be written as soon as they are ready, with each thread using its own open file object (I'd recommend preallocating the file at the beginning, to give the filesystem a better chance to avoid fragmentation issues). An out-of-order approach could also help with models with huge tensors, since you can try to overlap them with smaller ones.
Split out from draft PR #1573: #1573
Summary
Changes
--convertto stream converted tensors instead of allocating the entire converted model in oneggml_contextbefore writing the output file.This PR intentionally only covers the regular conversion memory/threading path. RMSE-guided conversion is not included here and will be handled separately after this is reviewed.
What changed
convert(input_path, vae_path, output_path, output_type, tensor_type_rules, convert_name)API and CLI behavior.What is not included
--lazy-loadruntime behavior changes.Validation
cmake --build build -j16git diff --checkbuild/bin/sd-cli -M convert -m /tmp/sdcpp-convert-tiny.safetensors -o /tmp/sdcpp-convert-tiny-final.gguf --type f16time build/bin/sd-cli -M convert -m /home/aaron/models/sd3.5-medium/sd3.5_medium.safetensors -o /tmp/sd3.5_medium_streaming_convert.gguf/tmp/sd3.5_medium_streaming_convert.gguf, 4.8GNotes
This is a draft because the new streaming writer path should get review and broader testing across output formats and platforms before being marked ready.