RPC Distributed Inference: Environment Variable Not Being Passed to Backend #7355

@lvnvpreet

Description

Environment

  • LocalAI Version: v3.7.0 (commit 9ecfdc5)
  • Backend: cuda12-llama-cpp
  • OS: Ubuntu 24.04
  • CUDA Version: 12.6 (upgraded from 11.5)
  • NVIDIA Driver: 580.95.05
  • Hardware Setup:
    • Main Server: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB)
    • Worker 1: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.177
    • Worker 2: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.229
  • Installation Method: Binary installation
  • Model: GPT-OSS-20B (Q4_K_M quantization, ~11GB)

Problem Description

LocalAI fails to connect to RPC workers despite:

  • ✅ RPC workers running correctly on both machines
  • ✅ Network connectivity confirmed (telnet/nc successfully connects)
  • ✅ Environment variable LLAMACPP_GRPC_SERVERS set correctly
  • ✅ Model loads successfully in local-only mode (non-RPC)
  • ❌ Workers show no connection attempts when LocalAI tries to load model

The LLAMACPP_GRPC_SERVERS environment variable appears not to be passed to the llama-cpp backend, causing LocalAI to attempt local-only loading even when RPC is configured.

Steps to Reproduce

1. Start RPC Workers

On Worker 1 (192.168.0.177):

~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8

Output:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Starting RPC server v3.0.0
  endpoint       : 0.0.0.0:50052
  local cache    : n/a
Devices:
  CUDA0: NVIDIA GeForce RTX 3060 (11907 MiB, 11495 MiB free)

On Worker 2 (192.168.0.229):

~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8

Same output as Worker 1.

2. Verify Network Connectivity

From Main Server:

nc -zv 192.168.0.177 50052
nc -zv 192.168.0.229 50052

Result:

Connection to 192.168.0.177 50052 port [tcp/*] succeeded!
Connection to 192.168.0.229 50052 port [tcp/*] succeeded!

Workers show:

Accepted client connection
Client connection closed

✅ Network connectivity is working.

3. Start LocalAI with RPC Configuration

Method 1: Export then run

export LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
echo $LLAMACPP_GRPC_SERVERS  # Confirms: 192.168.0.177:50052,192.168.0.229:50052
local-ai run --models-path ~/models/

Method 2: Inline environment variable

LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
local-ai run --models-path ~/models/

Both methods produce the same result.
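As an additional sanity check (a generic Linux /proc inspection, assuming the main process is named local-ai), the variable can be confirmed to be present in the environment of the running local-ai process and not just the shell:

# Inspect the environment of the running local-ai process (Linux only)
tr '\0' '\n' < /proc/$(pgrep -xn local-ai)/environ | grep LLAMACPP_GRPC_SERVERS

If the variable shows up here but the backend still ignores it, the problem lies between LocalAI and the backend subprocess rather than in the shell.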

4. Attempt to Load Model

Open the chat interface at http://localhost:8080/chat/gpt-oss-20b and send the message "Hello".
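For scripting the same failure, the request can also be sent against LocalAI's OpenAI-compatible API (same host and port as above); this triggers the same model load:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}]}'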

Expected Behavior

Main Server should:

  1. Detect LLAMACPP_GRPC_SERVERS environment variable
  2. Pass RPC configuration to cuda12-llama-cpp backend
  3. Connect to both RPC workers
  4. Distribute model layers across 3 GPUs (1 local + 2 remote)
  5. Load model successfully

Workers should show:

Accepted client connection
Receiving tensors...
Model loaded successfully
[Connection maintained for inference]

Main Server logs should show:

Loading model: gpt-oss-20b
Registered RPC devices: 192.168.0.177:50052, 192.168.0.229:50052
llm_load_tensors: offloaded 33/33 layers to GPU
Model loaded (distributed across 3 GPUs)

Actual Behavior

Main Server logs:

5:24AM INF BackendLoader starting backend=cuda12-llama-cpp modelID=gpt-oss-20b o.model=gpt-oss-20b-Q4_K_M.gguf
5:24AM ERR Failed to load model gpt-oss-20b with backend cuda12-llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling" modelID=gpt-oss-20b
5:24AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling

Workers show:

[Complete silence - no connection attempts]

Critical observation: Workers show no "Accepted client connection" message, indicating LocalAI never attempts to connect to them.
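One suggestion for narrowing this down further: inspect the environment of the spawned backend process itself while a load attempt is in flight. Per the backend directory listing below, the binary is llama-cpp-grpc; this is a generic /proc check and assumes the backend process survives long enough to be inspected:

# While a model load is in progress, check whether the variable reached the backend
BACKEND_PID=$(pgrep -fn llama-cpp-grpc)
tr '\0' '\n' < /proc/$BACKEND_PID/environ | grep -i grpc_servers

If nothing is printed, the variable is being dropped when LocalAI spawns the backend.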

Model Configuration

~/models/gpt-oss-20b.yaml:

name: gpt-oss-20b
backend: cuda12-llama-cpp

parameters:
  model: gpt-oss-20b-Q4_K_M.gguf

gpu_layers: 99
f16: true
mmap: true
context_size: 8192
threads: 8
batch: 512

temperature: 0.7
top_k: 40
top_p: 0.9

stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>

template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>

Note: Initially tried adding options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"] to YAML, but this had no effect (and appears to be unsupported).

Verification: Local-Only Mode Works

To confirm that the model file and backend are functional, I created a local-only configuration:

~/models/gpt-oss-20b-local.yaml:

name: gpt-oss-20b-local
backend: cuda12-llama-cpp
parameters:
  model: gpt-oss-20b-Q4_K_M.gguf
gpu_layers: 25
f16: true
mmap: true
context_size: 2048

stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>

template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>

Started LocalAI WITHOUT RPC:

unset LLAMACPP_GRPC_SERVERS
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
local-ai run --models-path ~/models/

Result: Model loads and responds successfully (partially offloaded to a single GPU).

✅ Confirms:

  • Model file is valid
  • cuda12-llama-cpp backend works
  • GPU is functional
  • Issue is specifically with RPC configuration not being applied
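To double-check the partial-offload claim above, GPU memory usage can be sampled while the local-only model is loaded (standard nvidia-smi query, nothing LocalAI-specific):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv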

Troubleshooting Attempted

1. Backend Version Mismatch (Resolved)

Initially, the workers were running cuda11-llama-cpp while the main server used cuda12-llama-cpp. Upgraded all systems to use cuda12-llama-cpp consistently.

2. CUDA Toolkit Upgrade (Completed)

Upgraded the main server from CUDA 11.5 to CUDA 12.6 (the workers only need the NVIDIA driver, not the CUDA toolkit).

3. Firewall Configuration (Verified)

sudo ufw allow from 192.168.0.0/24

Network connectivity confirmed working via telnet/nc.

4. Environment Variable Formats Tested

# Tried all these formats:
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMACPP_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMA_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"

# Also tried YAML options:
options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"]
options: ["grpc_servers:192.168.0.177:50052,192.168.0.229:50052"]

None resulted in RPC connections being attempted.

5. Debug Logging Enabled

DEBUG=true LOCALAI_LOG_LEVEL=debug \
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
local-ai run --models-path ~/models/

No additional RPC-related debug output observed.
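To make any RPC-related lines easier to spot, the full debug output can be captured and filtered afterwards (plain shell redirection, nothing LocalAI-specific):

DEBUG=true LOCALAI_LOG_LEVEL=debug \
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
local-ai run --models-path ~/models/ 2>&1 | tee localai-debug.log

grep -iE 'rpc|grpc_servers' localai-debug.log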

Analysis

The issue appears to be that the LLAMACPP_GRPC_SERVERS environment variable is:

  1. Set correctly in the shell environment (verified with echo)
  2. Not being detected or passed to the cuda12-llama-cpp backend during model loading
  3. Never triggering RPC connection attempts (workers show no activity)

The backend appears to be attempting local-only loading and failing with "RPC handling" error, possibly because it's trying to use RPC features without proper configuration.
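One way to confirm that no connection attempt ever leaves the main server would be to trace outbound connect() calls from the LocalAI process tree during a load attempt (requires strace and root; assumes the main process is named local-ai):

# Attach before triggering a chat request; -f follows the backend child process
sudo strace -f -e trace=connect -p "$(pgrep -xn local-ai)" 2>&1 | grep 50052

With the current behavior, no connect() to port 50052 would be expected to appear, matching the silence on the workers.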

Questions

  1. Is LLAMACPP_GRPC_SERVERS the correct environment variable name for LocalAI v3.7.0?
  2. Does cuda12-llama-cpp backend support RPC, or is there a separate RPC-enabled backend variant?
  3. How does LocalAI pass environment variables to backends? Is there a subprocess isolation issue?
  4. Should RPC configuration be in YAML instead? If so, what's the correct format?
  5. Are there any prerequisites for RPC mode (config files, flags, etc.) beyond the environment variable?

Expected Fix

Either:

  1. Fix environment variable propagation so LLAMACPP_GRPC_SERVERS reaches the backend process
  2. Add YAML configuration support for RPC servers (e.g., rpc_servers: ["192.168.0.177:50052", "192.168.0.229:50052"])
  3. Document the correct method if the current approach is wrong
  4. Add debug logging to show when RPC configuration is detected and applied

Additional Context

  • The workers accept manual connections (e.g. via nc) and log "Accepted client connection" when they do, so the network path and the RPC servers themselves are known to work
  • LocalAI's P2P mode might be an alternative, but it relies on dynamic discovery, which isn't ideal for a static LAN setup with fixed IPs
  • Similar users may face this issue when trying to configure RPC with fixed IP addresses
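For completeness: the same workers can be driven directly from a standalone llama.cpp build via its --rpc flag, bypassing LocalAI entirely. This is only an illustration (it assumes a separate llama.cpp checkout built with RPC support; the binary name and model path are placeholders), but it would isolate whether the problem is on the LocalAI side:

# Illustrative only: plain llama.cpp (built with RPC support), not part of the LocalAI install
./llama-cli -m ~/models/gpt-oss-20b-Q4_K_M.gguf -ngl 99 \
  --rpc 192.168.0.177:50052,192.168.0.229:50052 \
  -p "Hello"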

System Information:

$ local-ai --version
LocalAI version: v3.7.0 (9ecfdc593827942488daa6c4027047f0ac04bd6d)

$ nvcc --version
Cuda compilation tools, release 12.6, V12.6.77

$ nvidia-smi
NVIDIA-SMI 580.95.05    Driver Version: 580.95.05    CUDA Version: 13.0

$ ls -la ~/backends/cuda12-llama-cpp/
drwxr-x--- 3 rnr rnr      4096 Nov 20 10:13 .
-rwxr-xr-x 1 rnr rnr 458325176 Oct 31 21:45 llama-cpp-grpc
-rwxr-xr-x 1 rnr rnr 396851480 Oct 31 21:45 llama-cpp-rpc-server
