GPU-CR: GPU Checkpoint & Restore


GPU-CR is a system that supports efficient Checkpoint and Restore (C/R) for GPU-accelerated applications. Its key advantage is that it completely yields the checkpointed application's GPU memory (reducing its VRAM usage to zero), seamlessly freeing space for other workloads to swap in and execute.

CLI Demonstration

A quick demonstration of executing the GPU-CR tool via the command-line interface.

I. Features

  • Cross-Vendor Support: Experimental support for both NVIDIA and AMD GPUs.
  • Transparent C/R: Uses LD_PRELOAD to inject a vGPU library that intercepts memory allocations and resource management.
  • Client CLI: Simple command-line interface (cr_client) to trigger checkpoint and restore operations.
  • Performance Optimization: Support for Huge Pages to accelerate memory saving.

II. TODO

We are actively working on expanding GPU-CR's capabilities:

  • 🚀 Broader Hardware Support: Extending compatibility to more architectures, such as Huawei Ascend.

III. Performance Evaluation

We compare GPU-CR with existing GPU checkpoint solutions on four LLM workloads:

  • Llama-8B
  • Phi-4-mini-instruct
  • pythia-1b
  • Qwen3-1.7B

For GPU-CR, the latency is split into:

  • Data — GPU data buffers
  • Control — GPU control states

Total latency = Data + Control

1. NVIDIA (CUDA Checkpoint vs GPU-CR)

  • GPU: NVIDIA A100-PCIE-40GB
  • Driver Version: 580.95.05
  • CUDA Version: 13.0
  • vLLM Version: 0.14.1

Performance Comparison

2. AMD (CRIU vs GPU-CR)

  • GPU: AMD Instinct MI100
  • ROCm Version: 6.4.3
  • vLLM Version: 0.11.1-rc7

Performance Comparison

IV. Prerequisites

  • Operating System: Linux (Tested on Ubuntu 22.04).
  • Build Tools: CMake, GCC/G++, Make.
  • Checkpoint Backend & Drivers:
    • NVIDIA:
      • Requires CUDA Toolkit 12.x or later.
      • Uses cuda-checkpoint (Included in this repository).
      • Note: If cuda-checkpoint needs to be updated, update the parameters within the source code manually. [cuda-checkpoint]
    • AMD:
      • Requires ROCm 6.x or later.
      • Requires a custom-built criu with the AMD plugin enabled. (Manual Compilation Required).
      • Note: This custom CRIU is not included in this repository; users must compile and install CRIU with the AMD plugin before using GPU-CR. [CRIU AMDGPU Plugin Documentation]

V. Building

This project utilizes CMake for building. Please choose ONE of the following build options based on your target GPU vendor. Do not build both simultaneously in the same environment.

Option 1: Build for NVIDIA (CUDA)

mkdir build && cd build
export GPU_VENDOR=NVIDIA
cmake ..
make -j$(nproc)

This generates vGPU-NVIDIA.so and cr_client.

Option 2: Build for AMD (ROCm)

mkdir build && cd build
export GPU_VENDOR=AMD
cmake ..
make -j$(nproc)

This generates vGPU-AMD.so and cr_client.

VI. Usage

1. Environment Configuration

Before running, configure the necessary environment variables.

(1) General Configuration (Both NVIDIA & AMD)

  • VRAM Storage Strategy: By default, GPU memory is saved to Huge Pages. You can optionally save it to a file system path using EXPORT_FILE_PATH.

# Optional: path to save VRAM content as a file.
# If NOT set, VRAM is saved to Huge Pages by default.
export EXPORT_FILE_PATH=/path/to/save/vram_dump_path

  • Huge Pages (Recommended for Acceleration): Huge pages can significantly accelerate the save process for both vendors.

# Example: reserve 80 GiB of huge pages (40960 pages x 2 MiB default huge page size)
sudo bash -c "echo 40960 > /proc/sys/vm/nr_hugepages"

sudo mkdir -p /mnt/huge-ckpt
sudo mount -t hugetlbfs nodev /mnt/huge-ckpt
sudo chmod -R 777 /mnt/huge-ckpt

(2) AMD-Specific Configuration

If you are using AMD GPUs, you must specify the directory where CRIU will store its checkpoint files.

export AMD_CKPT_DIR=/path/to/save/criu_files

2. Running an Application

Launch the target application (e.g., a Python script using PyTorch/vLLM or a C++ binary) using LD_PRELOAD.

(1) Example (NVIDIA):

LD_PRELOAD=/path/to/build/vGPU-NVIDIA.so python3 ./apps/vllm/serving_vllm_nvidia.py

(2) Example (AMD):

LD_PRELOAD=/path/to/build/vGPU-AMD.so ./apps/vllm/serving_vllm_amd.sh

3. Checkpointing

Use the cr_client tool to trigger a checkpoint.

# -i: initialization mode
# -c: Checkpoint mode
# -p: Target PID
# -m: (Optional) PID of the original parent (master) process that CRIU must control (AMD/CRIU mode only)
./cr_client -c -p <TARGET_PID>
# or
./cr_client -c -p <GPU_CHILD_PID> -m <PARENT_PID>

4. Restoring

Restore the process from the checkpoints.

# -r: Restore mode
# -p: Target PID (the original PID)
./cr_client -r -p <TARGET_PID>

VII. Directory Structure

  • src/: Source code for the vGPU library and cr_client.
    • GPUs/NVIDIA/: NVIDIA-specific implementation (CUDA hooks).
    • GPUs/AMD/: AMD-specific implementation (HIP hooks).
    • cr_client.cpp: Control client implementation.
  • apps/: Example scripts and applications (e.g., vLLM examples).

VIII. Citation

This project is based on our paper:

@inproceedings{GCR,
  author    = {Shaoxun Zeng and Tingxu Ren and Jiwu Shu and Youyou Lu},
  title     = {GPU Checkpoint/Restore Made Fast and Lightweight},
  booktitle = {24th USENIX Conference on File and Storage Technologies (FAST'26)},
  year      = {2026},
  address   = {Santa Clara, CA},
  month     = feb,
  publisher = {USENIX Association},
  url       = {https://www.usenix.org/conference/fast26/presentation/zeng}
}

And the implementation of the paper is in: https://github.com/thustorage/GCR
