GPU-CR: GPU Checkpoint & Restore


GPU-CR is a system that supports efficient Checkpoint and Restore (C/R) for GPU-accelerated applications. Its key advantage is that it completely yields the checkpointed application's GPU memory (reducing its VRAM usage to zero), seamlessly freeing space for other workloads to swap in and execute.

CLI Demonstration

A quick demonstration of executing the GPU-CR tool via the command-line interface.

I. Features

  • Cross-Vendor Support: Experimental support for both NVIDIA and AMD GPUs.
  • Transparent C/R: Uses LD_PRELOAD to inject a vGPU library that intercepts memory allocations and resource management.
  • Client CLI: Simple command-line interface (cr_client) to trigger checkpoint and restore operations.
  • Performance Optimization: Support for Huge Pages to accelerate memory saving.

II. TODO

We are actively working on expanding GPU-CR's capabilities:

  • 🚀 Broader Hardware Support: Extending compatibility to more architectures, such as Huawei Ascend.

III. Performance Evaluation

We compare GPU-CR with existing GPU checkpoint solutions on four LLM workloads:

  • Llama-8B
  • Phi-4-mini-instruct
  • pythia-1b
  • Qwen3-1.7B

For GPU-CR, the latency is split into:

  • Data — GPU data buffers
  • Control — GPU control states

Total latency = Data + Control

1. NVIDIA (CUDA Checkpoint vs GPU-CR)

  • GPU: NVIDIA A100-PCIE-40GB
  • Driver Version: 580.95.05
  • CUDA Version: 13.0
  • vLLM Version: 0.14.1

Performance Comparison

2. AMD (CRIU vs GPU-CR)

  • GPU: AMD Instinct MI100
  • ROCm Version: 6.4.3
  • vLLM Version: 0.11.1-rc7

Performance Comparison

IV. Prerequisites

  • Operating System: Linux (Tested on Ubuntu 22.04).
  • Build Tools: CMake, GCC/G++, Make.
  • Checkpoint Backend & Drivers:
    • NVIDIA:
      • Requires CUDA Toolkit 12.x or later.
      • Uses cuda-checkpoint (Included in this repository).
      • Note: If cuda-checkpoint needs to be updated, update the parameters within the source code manually. [cuda-checkpoint]
    • AMD:
      • Requires ROCm 6.x or later.
      • Requires a custom-built criu with the AMD plugin enabled. (Manual Compilation Required).
      • Note: This custom CRIU is not included in this repository; users must compile and install CRIU with the AMD plugin before using GPU-CR. [CRIU AMDGPU Plugin Documentation]

V. Building

This project utilizes CMake for building. Please choose ONE of the following build options based on your target GPU vendor. Do not build both simultaneously in the same environment.

Option 1: Build for NVIDIA (CUDA)

mkdir build && cd build
export GPU_VENDOR=NVIDIA
cmake ..
make -j$(nproc)

This generates vGPU-NVIDIA.so and cr_client.

Option 2: Build for AMD (ROCm)

mkdir build && cd build
export GPU_VENDOR=AMD
cmake ..
make -j$(nproc)

This generates vGPU-AMD.so and cr_client.

VI. Usage

1. Environment Configuration

Before running, configure the necessary environment variables.

(1) General Configuration (Both NVIDIA & AMD)

  • VRAM Storage Strategy: By default, GPU memory is saved to Huge Pages. You can optionally save it to a file system path using EXPORT_FILE_PATH.

# Optional: path to save VRAM content as a file.
# If NOT set, VRAM is saved to Huge Pages by default.
export EXPORT_FILE_PATH=/path/to/save/vram_dump_path

  • Huge Pages (Recommended for Acceleration): Huge pages can significantly accelerate the save process for both vendors.

# Example: reserve 80 GiB of huge pages (40960 pages x 2 MiB default huge page size)
sudo bash -c "echo 40960 > /proc/sys/vm/nr_hugepages"

sudo mkdir -p /mnt/huge-ckpt
sudo mount -t hugetlbfs nodev /mnt/huge-ckpt
sudo chmod -R 777 /mnt/huge-ckpt

(2) AMD-Specific Configuration

If you are using AMD GPUs, you must specify the directory where CRIU will store its checkpoint files.

export AMD_CKPT_DIR=/path/to/save/criu_files

2. Running an Application

Launch the target application (e.g., a Python script using PyTorch/vLLM or a C++ binary) using LD_PRELOAD.

(1) Example (NVIDIA):

LD_PRELOAD=/path/to/build/vGPU-NVIDIA.so python3 ./apps/vllm/serving_vllm_nvidia.py

(2) Example (AMD):

LD_PRELOAD=/path/to/build/vGPU-AMD.so ./apps/vllm/serving_vllm_amd.sh

3. Checkpointing

Use the cr_client tool to trigger a checkpoint.

# -i: initialization mode
# -c: Checkpoint mode
# -p: Target PID
# -m: (Optional) PID of the original parent (master) process that CRIU must control (AMD/CRIU mode only)
./cr_client -c -p <TARGET_PID>
# or
./cr_client -c -p <GPU_CHILD_PID> -m <PARENT_PID>

4. Restoring

Restore the process from the checkpoints.

# -r: Restore mode
# -p: Target PID (the original PID)
./cr_client -r -p <TARGET_PID>

VII. Directory Structure

  • src/: Source code for the vGPU library and cr_client.
    • GPUs/NVIDIA/: NVIDIA-specific implementation (CUDA hooks).
    • GPUs/AMD/: AMD-specific implementation (HIP hooks).
    • cr_client.cpp: Control client implementation.
  • apps/: Example scripts and applications (e.g., vLLM examples).

VIII. Citation

This project is based on our paper:

@inproceedings{GCR,
  author    = {Shaoxun Zeng and Tingxu Ren and Jiwu Shu and Youyou Lu},
  title     = {GPU Checkpoint/Restore Made Fast and Lightweight},
  booktitle = {24th USENIX Conference on File and Storage Technologies (FAST'26)},
  year      = {2026},
  address   = {Santa Clara, CA},
  month     = feb,
  publisher = {USENIX Association},
  url       = {https://www.usenix.org/conference/fast26/presentation/zeng}
}

And the implementation of the paper is in: https://github.com/thustorage/GCR
