Integrate trueno-zram for GPU-accelerated model weight loading #145

@noahgift

Summary

Integrate trueno-zram GPU compression for 10x faster model loading and support for 3x larger models.

Problem

Current model loading path:
Disk (SSD) → CPU RAM → GPU VRAM
     500 MB/s    16 GB/s

Llama-70B (140 GB): 280 seconds to load 😢 (the ~500 MB/s SSD read dominates: 140 GB ÷ 0.5 GB/s ≈ 280 s).

Solution

With trueno-zram:
Disk → ZRAM (compressed) → GPU decompress → VRAM
   2 GB/s      60 GB/s (GPU)

Llama-70B: 27 seconds to load 🚀
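
A minimal sketch of the arithmetic behind both figures. The ~2.6:1 compression ratio and the ~2 GB/s compressed read are assumptions chosen to be consistent with the numbers above, not measured results:

// Back-of-the-envelope timings for both paths. Constants are the figures
// quoted in this issue plus the assumptions noted below, not benchmarks.
fn main() {
    let model_gb = 140.0;            // Llama-70B weights
    let ssd_gbps = 0.5;              // current path: ~500 MB/s SSD read
    let nvme_gbps = 2.0;             // compressed path: disk → ZRAM (assumed)
    let gpu_decompress_gbps = 60.0;  // ZRAM → VRAM decompression on the GPU
    let compression_ratio = 2.6;     // assumed ratio for the weight tensors

    let baseline_secs = model_gb / ssd_gbps;                     // ≈ 280 s
    let read_secs = (model_gb / compression_ratio) / nvme_gbps;  // ≈ 27 s
    let decompress_secs = model_gb / gpu_decompress_gbps;        // ≈ 2.3 s
    // With the read and decompress stages overlapped, the slower one wins.
    println!("baseline ≈ {baseline_secs:.0} s, compressed ≈ {:.0} s",
             read_secs.max(decompress_secs));
}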

Benefits

Benefit                  Impact
10x faster loading       Cold start 280 s → 27 s
3x larger models         Fit 405B in 256 GB RAM
Zero-copy GPU            Decompress → VRAM direct
Lower memory pressure    More RAM for KV cache

Implementation

// Model loading with GPU ZRAM
let model = Model::load("llama-70b.safetensors")
    .with_compression(trueno_zram::GpuBackend::auto())
    .to_device(Device::Cuda(0))?;
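
GpuBackend::auto() does the backend selection here. A rough sketch of the order it could probe, assuming the CUDA batch backend (trueno-zram#1), the WGPU backend (trueno-zram#2), and a CPU fallback; the probe helpers below are hypothetical placeholders, not an existing trueno-zram API:

// Hypothetical selection logic for GpuBackend::auto(). Probe helpers are
// illustrative stubs only.
pub enum GpuBackend {
    Cuda, // native CUDA batch decompression (trueno-zram#1)
    Wgpu, // cross-platform WGPU backend (trueno-zram#2)
    Cpu,  // fallback: decompress on the host, copy to VRAM as today
}

impl GpuBackend {
    pub fn auto() -> Self {
        if cuda_device_available() {
            GpuBackend::Cuda
        } else if wgpu_adapter_available() {
            GpuBackend::Wgpu
        } else {
            GpuBackend::Cpu
        }
    }
}

fn cuda_device_available() -> bool { false }  // placeholder probe
fn wgpu_adapter_available() -> bool { false } // placeholder probe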

API Changes

pub struct ModelLoader {
    /// GPU decompression backend; None keeps today's uncompressed path.
    compression: Option<trueno_zram::GpuBackend>,
}

impl ModelLoader {
    /// Choose the GPU backend used to decompress weights during load.
    pub fn with_compression(mut self, backend: GpuBackend) -> Self;
    /// Read weights from disk into a compressed, GPU-ready representation.
    pub fn preload_compressed(&self, path: &Path) -> Result<CompressedModel>;
}
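
A hedged usage sketch of the two methods together, pre-compressing the weights once and reusing the compressed form on later cold starts. CompressedModel::to_device, Model, Device, and the Result alias are assumptions for illustration, not settled API:

// Possible warm-cache flow on top of the proposed ModelLoader API (sketch).
fn cold_start(loader: ModelLoader, weights: &Path) -> Result<Model> {
    // One-time: read the safetensors file and keep a ZRAM-compressed copy,
    // so later starts skip the slow uncompressed read.
    let compressed: CompressedModel = loader
        .with_compression(trueno_zram::GpuBackend::auto())
        .preload_compressed(weights)?;

    // Per start: stream compressed pages to the GPU and decompress them
    // directly into VRAM (zero-copy on supported backends).
    compressed.to_device(Device::Cuda(0))
}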

Related

  • trueno-zram#1: GPU batch compression
  • trueno-zram#2: WGPU cross-platform backend
