## Summary

Integrate trueno-zram GPU compression for 10x faster model loading and support for 3x larger models.
## Problem

The current model loading path is bottlenecked by disk read speed:

```
Disk (SSD) → CPU RAM → GPU VRAM
  500 MB/s    16 GB/s
```

Llama-70B (140 GB): 280 seconds to load 😢
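The 280-second figure follows directly from dividing model size by the slowest link's bandwidth. A quick sketch of that arithmetic (assuming the 500 MB/s SSD read is the bottleneck):

```python
def load_time_seconds(model_gb: float, bottleneck_gbps: float) -> float:
    """Time to stream a model through the slowest link in the pipeline."""
    return model_gb / bottleneck_gbps

# Llama-70B weights are ~140 GB; the SSD reads ~0.5 GB/s.
print(load_time_seconds(140, 0.5))  # → 280.0
```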
## Solution

With trueno-zram, data stays compressed until it reaches the GPU, which decompresses it on the way into VRAM:

```
Disk → ZRAM (compressed) → GPU decompress → VRAM
        2 GB/s               60 GB/s (GPU)
```

Llama-70B: 27 seconds to load 🚀
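When the stages overlap, the slowest stage dominates total load time. A rough pipeline model of the numbers above (the ~3:1 compression ratio is an assumption, not a measured figure from this proposal):

```python
def compressed_load_time(model_gb: float, ratio: float,
                         disk_gbps: float, gpu_decomp_gbps: float) -> float:
    """Overlapped pipeline: disk streams compressed bytes while the GPU
    decompresses already-staged chunks; the slower stage sets the pace."""
    disk_s = (model_gb / ratio) / disk_gbps   # compressed bytes read off disk
    decomp_s = model_gb / gpu_decomp_gbps     # uncompressed bytes produced by the GPU
    return max(disk_s, decomp_s)

# 140 GB model, assumed ~3:1 ratio, 2 GB/s disk stage, 60 GB/s GPU decompress:
# disk stage ≈ 23.3 s, GPU stage ≈ 2.3 s, so ≈ 23 s plus overheads —
# in the ballpark of the quoted 27 s.
print(compressed_load_time(140, 3, 2, 60))
```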
## Benefits

| Benefit | Impact |
|---|---|
| 10x faster loading | Cold start 280s → 27s |
| 3x larger models | Fit 405B in 256 GB RAM |
| Zero-copy GPU | Decompress → VRAM direct |
| Lower memory pressure | More RAM for KV cache |
## Implementation

```rust
// Model loading with GPU-accelerated ZRAM decompression
let model = Model::load("llama-70b.safetensors")
    .with_compression(trueno_zram::GpuBackend::auto())
    .to_device(Device::Cuda(0))?;
```
## API Changes

```rust
pub struct ModelLoader {
    compression: Option<trueno_zram::GpuBackend>,
}

impl ModelLoader {
    /// Select a GPU backend for decompression during loading.
    pub fn with_compression(mut self, backend: GpuBackend) -> Self;

    /// Read and compress a model into ZRAM ahead of time.
    pub fn preload_compressed(&self, path: &Path) -> Result<CompressedModel>;
}
```
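To illustrate the preload-then-decompress flow behind `preload_compressed`, here is a CPU-only Python sketch that uses zlib as a stand-in for the GPU codec. The chunking mirrors why a batch-friendly layout matters: independently compressed chunks can be decompressed in parallel (on the GPU in the real design). All names and the chunk size here are hypothetical, not part of the proposed API:

```python
import zlib

CHUNK = 4 << 20  # 4 MiB chunks, so decompression can be batched/overlapped

def preload_compressed(data: bytes) -> list[bytes]:
    """Stand-in for preload_compressed(): split the model file into chunks
    and compress each one independently."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def load_to_device(chunks: list[bytes]) -> bytes:
    """Stand-in for GPU decompress-to-VRAM: decompress each chunk and
    concatenate the results."""
    return b"".join(zlib.decompress(c) for c in chunks)

# Round-trip a fake weights buffer through the pipeline.
weights = bytes(range(256)) * 1024
assert load_to_device(preload_compressed(weights)) == weights
```

In the real API the decompressed chunks would land directly in VRAM rather than host memory, which is what the "zero-copy GPU" row in the benefits table refers to.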
## Related

- trueno-zram#1: GPU batch compression
- trueno-zram#2: WGPU cross-platform backend