Disclaimer: I'm not an LLM/GPU expert at all, so I used AI (Claude Opus 4.6) to debug and collect data. I've done an initial evaluation and the thrust of the analysis seems reasonable, but apologies in advance if I missed something.
## Summary

The `llama-cpp` gRPC backend crashes with a general protection fault in `libamdhip64.so.7.2.70201` on AMD Strix Halo (gfx1151) during the first GPU compute dispatch. Stock llama.cpp (`llama-cli` and `llama-server`) works perfectly on the same hardware, same ROCm, same Docker container.
## Environment

- Hardware: AMD Ryzen AI MAX+ 395 w/ Radeon 8060S (gfx1151), 128GB DDR5, 32GB VRAM (UMA)
- OS: Ubuntu 24.04 LTS
- Kernel: `6.17.0-1017-oem` (installed via `linux-oem-24.04c`)
- ROCm: 7.2.1 (installed via `amdgpu-install --usecase=rocm --no-dkms`)
- LocalAI: upstream master @ `84870586`
- llama.cpp: `fae3a28070fe4026f87bd6a544aba1b2d1896566` (as vendored by LocalAI)
- Backend built with: `rocm/dev-ubuntu-24.04:7.2.1`, `BUILD_TYPE=hipblas`, `AMDGPU_TARGETS=gfx1151`
## What works vs what crashes

| Scenario | Binary | GPU Offload | Result |
|---|---|---|---|
| Stock `llama-cli` on host | ggerganov/llama.cpp | ngl 10 | PASS — 268 t/s prompt, 82 t/s gen |
| Stock `llama-cli` inside Docker | ggerganov/llama.cpp | ngl 10 | PASS — 263 t/s prompt, 83 t/s gen |
| Stock `llama-server` inside Docker | ggerganov/llama.cpp | ngl 10 | PASS — 248 t/s prompt, 82 t/s gen |
| LocalAI gRPC `llama-cpp-fallback` | LocalAI `grpc-server.cpp` | ngl 10 | CRASH — GPF in `libamdhip64.so` |
| LocalAI gRPC `llama-cpp-fallback` | LocalAI `grpc-server.cpp` | ngl 0 (CPU) | CRASH — same GPF |
| HIP test (hipBLAS SGEMM/HGEMM) | custom test program | GPU compute | PASS |
## Crash details

The backend process crashes with a general protection fault inside the HIP runtime library. The crash is consistent and reproducible — always at the same offset.

Kernel log:

```
traps: grpcpp_sync_ser[619967] general protection fault ip:7ef567c999df sp:7ef3ed7e39d0 error:0 in libamdhip64.so.7.2.70201[2b09df,7ef567a0c000+481000]
amdgpu: Freeing queue vital buffer 0x..., queue evicted
```
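As a quick sanity check on the kernel trap line, the faulting `ip` can be located inside the reported mapping with shell arithmetic. (My assumption about the trap format: the bracketed `2b09df` is the file offset of `ip`, while `7ef567a0c000+481000` is the mapping's start and size; the in-mapping offset differs from the file offset by the segment's file offset.)

```shell
# Values copied from the kernel trap line above.
ip=0x7ef567c999df          # faulting instruction pointer
map_start=0x7ef567a0c000   # start of the libamdhip64 mapping
map_size=0x481000          # size of the mapping

printf 'offset into mapping: 0x%x\n' $(( ip - map_start ))
printf 'inside mapping: %d\n' $(( (ip - map_start) < map_size ))

# The kernel's bracketed 0x2b09df is (by my reading) the file offset of ip,
# so the faulting code can be disassembled without debug symbols
# (library path assumed, adjust as needed):
#   objdump -d --start-address=0x2b0980 --stop-address=0x2b0a40 \
#       /opt/rocm/lib/libamdhip64.so.7.2.70201
```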
Crash point in LocalAI logs:

The gRPC backend successfully completes all these steps:

- HIP device initialization: `ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32`
- Model tensor loading: `load_tensors: offloaded 10/17 layers to GPU`
- GPU VRAM allocation: `ROCm0 model buffer size = 521.18 MiB`
- KV cache allocation: `ROCm0 KV buffer size = 36.00 MiB`
- Compute graph reservation: `sched_reserve: ROCm0 compute buffer size = 254.50 MiB`
- Graph planning: `sched_reserve: graph nodes = 503, graph splits = 87 (with bs=512), 2 (with bs=1)`
- Slot initialization: `srv load_model: initializing slots, n_slots = 1`
- CRASH — process exits with GPF, gRPC connection EOF

The crash happens during slot initialization, which is the first actual GPU kernel dispatch.
## Reproduction

Working (stock `llama-server` inside Docker):

```sh
# Build llama.cpp with ROCm
cd llama.cpp   # https://github.com/ggerganov/llama.cpp @ 408225bb1a63faef725685515cfe583fcc964d09
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target llama-server

# Run in Docker with GPU access
docker run --rm -p 8080:8080 \
  --device /dev/kfd --device /dev/dri \
  --group-add video --group-add $(stat -c '%g' /dev/dri/renderD128) \
  -v $(pwd)/build/bin:/native:ro \
  -v /path/to/model.gguf:/model.gguf:ro \
  -v /opt/rocm:/opt/rocm:ro \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  -e ROCBLAS_TENSILE_LIBPATH=/opt/rocm/lib/rocblas/library \
  --entrypoint /bin/sh ubuntu:24.04 -c '
    export LD_LIBRARY_PATH=/native:/opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/llvm/lib
    /native/llama-server -m /model.gguf -ngl 10 -c 2048 --host 0.0.0.0 --port 8080
  '
# Result: works at 248 t/s prompt, 82 t/s generation
```
Crashing (LocalAI gRPC backend):

```sh
# Build the LocalAI backend with ROCm 7.2.1
docker buildx build -f backend/Dockerfile.llama-cpp \
  --build-arg BUILD_TYPE=hipblas \
  --build-arg BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 \
  --build-arg GRPC_BASE_IMAGE=ubuntu:24.04 \
  --build-arg CMAKE_ARGS="-DAMDGPU_TARGETS=gfx1151" \
  --platform linux/amd64 --output type=docker \
  -t localai-backend-rocm7:latest .

# Use it in an AIKit/LocalAI model image and run:
docker run --rm -p 8080:8080 \
  --device /dev/kfd --device /dev/dri \
  --group-add video --group-add $(stat -c '%g' /dev/dri/renderD128) \
  -e LOCALAI_FORCE_META_BACKEND_CAPABILITY=amd \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  my-localai-model
# Result: GPF in libamdhip64.so.7.2.70201 at offset +0x2b09df
```
## Analysis

The only difference between the working and crashing scenarios is the LocalAI gRPC server wrapper (`backend/cpp/llama-cpp/grpc-server.cpp`). Same llama.cpp version, same ROCm libraries, same Docker container, same model — the stock llama.cpp binaries work while the gRPC-wrapped version crashes.

Key differences in the gRPC wrapper vs stock llama.cpp:

- gRPC thread pool: `BuildAndStart()` spawns worker threads before any HIP initialization
- Model loading on a worker thread: the `LoadModel` RPC handler calls `llama_backend_init()` and loads the model on a gRPC worker thread
- Static gRPC linkage: the backend statically links gRPC, which may interfere with HIP's signal handling or memory management
- Process name: the kernel identifies the process as `grpcpp_sync_ser` (gRPC synchronous server)

In stock llama.cpp, all GPU initialization and inference happen on the main thread or a single controlled worker thread. The gRPC wrapper introduces a multi-threaded context that appears to conflict with HIP's runtime on gfx1151.
## Things I've already tried that did NOT fix it

- Moving `llama_backend_init()` and `common_init()` to `main()` before `BuildAndStart()`
- Adding explicit `ggml_backend_dev_get()` GPU enumeration on the main thread before gRPC starts
- Disabling fit (`fit:false` option)
- Disabling warmup (`warmup:false` option)
- Using host ROCm libraries via bind mount instead of the bundled ones
- Bypassing the bundled `ld.so` dynamic linker
- Different model architectures (Llama, GraniteHybrid) — same crash
- Different VRAM sizes (512MB, 32GB) — same crash pattern after fixing OOM
## Suggested investigation areas

- Does gRPC's thread pool / signal handling conflict with HIP's context management?
- Is there a gRPC build flag that could avoid this (e.g., async server instead of sync)?
- Could the static linkage of gRPC interfere with `libamdhip64.so`'s dynamic initialization?
- On other ROCm GPUs (gfx1100, gfx90a), does the gRPC backend work? This might be gfx1151-specific due to stricter HIP context requirements on APUs.