Description
Environment
- LocalAI Version: v3.7.0 (commit 9ecfdc5)
- Backend: cuda12-llama-cpp
- OS: Ubuntu 24.04
- CUDA Version: 12.6 (upgraded from 11.5)
- NVIDIA Driver: 580.95.05
- Hardware Setup:
  - Main Server: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB)
  - Worker 1: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.177
  - Worker 2: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.229
- Installation Method: Binary installation
- Model: GPT-OSS-20B (Q4_K_M quantization, ~11GB)
Problem Description
LocalAI fails to connect to RPC workers despite:
- ✅ RPC workers running correctly on both machines
- ✅ Network connectivity confirmed (telnet/nc successfully connects)
- ✅ Environment variable LLAMACPP_GRPC_SERVERS set correctly
- ✅ Model loads successfully in local-only mode (non-RPC)
- ❌ Workers show no connection attempts when LocalAI tries to load model
The LLAMACPP_GRPC_SERVERS environment variable appears not to be passed to the llama-cpp backend, causing LocalAI to attempt local-only loading even when RPC is configured.
Steps to Reproduce
1. Start RPC Workers
On Worker 1 (192.168.0.177):
~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8
Output:
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Starting RPC server v3.0.0
endpoint : 0.0.0.0:50052
local cache : n/a
Devices:
CUDA0: NVIDIA GeForce RTX 3060 (11907 MiB, 11495 MiB free)
On Worker 2 (192.168.0.229):
~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8
Same output as Worker 1.
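For reference, both workers can be started from the main server in one pass. This is only a convenience sketch; it assumes passwordless SSH and the same backend path on each worker (adjust user and paths as needed):
# Convenience sketch only: launch the RPC server on both workers via SSH.
# Assumes passwordless SSH and identical backend paths on each worker.
for host in 192.168.0.177 192.168.0.229; do
  ssh "$host" 'nohup ~/backends/cuda12-llama-cpp/llama-cpp-rpc-server \
    -H 0.0.0.0 -p 50052 -t 8 > ~/llama-rpc.log 2>&1 &'
done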
2. Verify Network Connectivity
From Main Server:
nc -zv 192.168.0.177 50052
nc -zv 192.168.0.229 50052
Result:
Connection to 192.168.0.177 50052 port [tcp/*] succeeded!
Connection to 192.168.0.229 50052 port [tcp/*] succeeded!
Workers show:
Accepted client connection
Client connection closed
✅ Network connectivity is working.
3. Start LocalAI with RPC Configuration
Method 1: Export then run
export LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
echo $LLAMACPP_GRPC_SERVERS # Confirms: 192.168.0.177:50052,192.168.0.229:50052
local-ai run --models-path ~/models/
Method 2: Inline environment variable
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
local-ai run --models-path ~/models/
Both methods produce the same result.
4. Attempt to Load Model
Open chat interface at http://localhost:8080/chat/gpt-oss-20b and send message "Hello".
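The same load can also be triggered without the web UI through LocalAI's OpenAI-compatible API, which makes the failure easier to reproduce from a script:
# Equivalent request through the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}]}'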
Expected Behavior
Main Server should:
- Detect the LLAMACPP_GRPC_SERVERS environment variable
- Pass the RPC configuration to the cuda12-llama-cpp backend
- Connect to both RPC workers
- Distribute model layers across 3 GPUs (1 local + 2 remote)
- Load model successfully
Workers should show:
Accepted client connection
Receiving tensors...
Model loaded successfully
[Connection maintained for inference]
Main Server logs should show:
Loading model: gpt-oss-20b
Registered RPC devices: 192.168.0.177:50052, 192.168.0.229:50052
llm_load_tensors: offloaded 33/33 layers to GPU
Model loaded (distributed across 3 GPUs)
Actual Behavior
Main Server logs:
5:24AM INF BackendLoader starting backend=cuda12-llama-cpp modelID=gpt-oss-20b o.model=gpt-oss-20b-Q4_K_M.gguf
5:24AM ERR Failed to load model gpt-oss-20b with backend cuda12-llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling" modelID=gpt-oss-20b
5:24AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling
Workers show:
[Complete silence - no connection attempts]
Critical observation: Workers show no "Accepted client connection" message, indicating LocalAI never attempts to connect to them.
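This can be cross-checked from the main server with a generic packet capture (nothing LocalAI-specific) while the load is triggered; complete silence on this filter confirms no connection attempt is ever made:
# Capture any traffic to/from the two RPC workers during a load attempt
sudo tcpdump -i any -nn 'port 50052 and (host 192.168.0.177 or host 192.168.0.229)'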
Model Configuration
~/models/gpt-oss-20b.yaml:
name: gpt-oss-20b
backend: cuda12-llama-cpp
parameters:
  model: gpt-oss-20b-Q4_K_M.gguf
  gpu_layers: 99
  f16: true
  mmap: true
  context_size: 8192
  threads: 8
  batch: 512
  temperature: 0.7
  top_k: 40
  top_p: 0.9
stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>
template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>
Note: Initially tried adding options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"] to the YAML, but this had no effect (and appears to be unsupported).
Verification: Local-Only Mode Works
To confirm the model file and backend are functional, created a local-only configuration:
~/models/gpt-oss-20b-local.yaml:
name: gpt-oss-20b-local
backend: cuda12-llama-cpp
parameters:
  model: gpt-oss-20b-Q4_K_M.gguf
  gpu_layers: 25
  f16: true
  mmap: true
  context_size: 2048
stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>
template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>
Started LocalAI WITHOUT RPC:
unset LLAMACPP_GRPC_SERVERS
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
local-ai run --models-path ~/models/
Result: Model loads and responds successfully (partially offloaded to single GPU).
✅ Confirms:
- Model file is valid
- cuda12-llama-cpp backend works
- GPU is functional
- Issue is specifically with RPC configuration not being applied
Troubleshooting Attempted
1. Backend Version Mismatch (Resolved)
Initially, the workers were running cuda11-llama-cpp while the main server used cuda12-llama-cpp. All systems were then upgraded to use cuda12-llama-cpp consistently.
2. CUDA Toolkit Upgrade (Completed)
Upgraded main server from CUDA 11.5 to CUDA 12.6 (workers don't need CUDA toolkit, only drivers).
3. Firewall Configuration (Verified)
sudo ufw allow from 192.168.0.0/24
Network connectivity confirmed working via telnet/nc.
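For completeness, the active rule set can be re-checked on each worker with ufw's status output:
# Confirm ufw is active and the allow rule is present (run on each worker)
sudo ufw status numbered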
4. Environment Variable Formats Tested
# Tried all these formats:
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMACPP_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMA_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
# Also tried YAML options:
options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"]
options: ["grpc_servers:192.168.0.177:50052,192.168.0.229:50052"]None resulted in RPC connections being attempted.
5. Debug Logging Enabled
DEBUG=true LOCALAI_LOG_LEVEL=debug \
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
local-ai run --models-path ~/models/
No additional RPC-related debug output observed.
Analysis
The issue appears to be that the LLAMACPP_GRPC_SERVERS environment variable is:
- Set correctly in the shell environment (verified with echo)
- Not being detected or passed to the cuda12-llama-cpp backend during model loading
- Never triggering RPC connection attempts (workers show no activity)
The backend appears to be attempting local-only loading and failing with "RPC handling" error, possibly because it's trying to use RPC features without proper configuration.
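One way to check whether the variable actually reaches the backend process (a sketch; it assumes the backend is spawned as a child process whose command line contains "cuda12-llama-cpp") is to dump that process's environment while a load attempt is in flight:
# During a model load attempt, inspect the backend process environment.
# The pgrep pattern is an assumption about the backend process name.
BACKEND_PID=$(pgrep -f 'cuda12-llama-cpp' | head -n1)
tr '\0' '\n' < "/proc/$BACKEND_PID/environ" | grep -i 'GRPC_SERVERS' \
  || echo "LLAMACPP_GRPC_SERVERS not present in backend environment"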
Questions
- Is LLAMACPP_GRPC_SERVERS the correct environment variable name for LocalAI v3.7.0?
- Does the cuda12-llama-cpp backend support RPC, or is there a separate RPC-enabled backend variant?
- How does LocalAI pass environment variables to backends? Is there a subprocess isolation issue?
- Should RPC configuration be in YAML instead? If so, what's the correct format?
- Are there any prerequisites for RPC mode (config files, flags, etc.) beyond the environment variable?
Expected Fix
Either:
- Fix environment variable propagation so LLAMACPP_GRPC_SERVERS reaches the backend process
- Add YAML configuration support for RPC servers (e.g., rpc_servers: ["192.168.0.177:50052", "192.168.0.229:50052"])
- Document the correct method if the current approach is wrong
- Add debug logging to show when RPC configuration is detected and applied
Additional Context
- This setup worked briefly during testing when using nc to connect manually, which proves the network and RPC servers work; a cross-check with plain llama.cpp is sketched below
- LocalAI's P2P mode might be an alternative, but it requires dynamic discovery, which isn't ideal for static LAN setups
- Other users may face this issue when trying to configure RPC with fixed IP addresses
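For completeness, the RPC link itself could be exercised independently of LocalAI using an upstream llama.cpp build with RPC support compiled in. This has not been run here; it is only a sketch (paths and the assumption of a -DGGML_RPC=ON build are hypothetical) to help isolate where the problem lies:
# Hypothetical cross-check with upstream llama.cpp (assumes a local build
# configured with -DGGML_RPC=ON). If this distributes layers across the
# workers, the RPC servers are fine and the issue is in LocalAI's plumbing.
./llama-cli -m ~/models/gpt-oss-20b-Q4_K_M.gguf -ngl 99 -p "Hello" \
  --rpc 192.168.0.177:50052,192.168.0.229:50052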
Related Documentation
- [LocalAI Distributed Inference Docs](https://localai.io/features/distribute/)
- [llama.cpp RPC Documentation](https://github.com/ggerganov/llama.cpp/blob/master/tools/rpc/README.md)
- LocalAI PR #2324 (feat(llama.cpp): add distributed llama.cpp inferencing) - Original distributed inference implementation
System Information:
$ local-ai --version
LocalAI version: v3.7.0 (9ecfdc593827942488daa6c4027047f0ac04bd6d)
$ nvcc --version
Cuda compilation tools, release 12.6, V12.6.77
$ nvidia-smi
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
$ ls -la ~/backends/cuda12-llama-cpp/
drwxr-x--- 3 rnr rnr 4096 Nov 20 10:13 .
-rwxr-xr-x 1 rnr rnr 458325176 Oct 31 21:45 llama-cpp-grpc
-rwxr-xr-x 1 rnr rnr 396851480 Oct 31 21:45 llama-cpp-rpc-server