Description
Environment
- LocalAI Version: v3.7.0 (commit 9ecfdc5)
- Backend: cuda12-llama-cpp
- OS: Ubuntu 24.04
- CUDA Version: 12.6 (upgraded from 11.5)
- NVIDIA Driver: 580.95.05
- Hardware Setup:
  - Main Server: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB)
  - Worker 1: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.177
  - Worker 2: AMD Ryzen 5 3600, 1x NVIDIA RTX 3060 (12GB) at 192.168.0.229
- Installation Method: Binary installation
- Model: GPT-OSS-20B (Q4_K_M quantization, ~11GB)
Problem Description
LocalAI fails to connect to RPC workers despite:
- ✅ RPC workers running correctly on both machines
- ✅ Network connectivity confirmed (telnet/nc successfully connects)
- ✅ Environment variable LLAMACPP_GRPC_SERVERS set correctly
- ✅ Model loads successfully in local-only mode (non-RPC)
- ❌ Workers show no connection attempts when LocalAI tries to load model
The LLAMACPP_GRPC_SERVERS environment variable appears not to be passed to the llama-cpp backend, causing LocalAI to attempt local-only loading even when RPC is configured.
Steps to Reproduce
1. Start RPC Workers
On Worker 1 (192.168.0.177):
~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8
Output:
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Starting RPC server v3.0.0
endpoint : 0.0.0.0:50052
local cache : n/a
Devices:
CUDA0: NVIDIA GeForce RTX 3060 (11907 MiB, 11495 MiB free)
On Worker 2 (192.168.0.229):
~/backends/cuda12-llama-cpp/llama-cpp-rpc-server -H 0.0.0.0 -p 50052 -t 8
Same output as Worker 1.
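For reference, both workers can be started from the main server in one pass. This is only a convenience sketch; it assumes passwordless SSH and the same backend path on each worker (adjust user and paths as needed):
# Convenience sketch only: launch the RPC server on both workers via SSH.
# Assumes passwordless SSH and identical backend paths on each worker.
for host in 192.168.0.177 192.168.0.229; do
  ssh "$host" 'nohup ~/backends/cuda12-llama-cpp/llama-cpp-rpc-server \
    -H 0.0.0.0 -p 50052 -t 8 > ~/llama-rpc.log 2>&1 &'
done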
2. Verify Network Connectivity
From Main Server:
nc -zv 192.168.0.177 50052
nc -zv 192.168.0.229 50052
Result:
Connection to 192.168.0.177 50052 port [tcp/*] succeeded!
Connection to 192.168.0.229 50052 port [tcp/*] succeeded!
Workers show:
Accepted client connection
Client connection closed
✅ Network connectivity is working.
3. Start LocalAI with RPC Configuration
Method 1: Export then run
export LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
echo $LLAMACPP_GRPC_SERVERS # Confirms: 192.168.0.177:50052,192.168.0.229:50052
local-ai run --models-path ~/models/
Method 2: Inline environment variable
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
local-ai run --models-path ~/models/
Both methods produce the same result.
4. Attempt to Load Model
Open chat interface at http://localhost:8080/chat/gpt-oss-20b and send message "Hello".
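The same load can also be triggered without the web UI through LocalAI's OpenAI-compatible API, which makes the failure easier to reproduce from a script:
# Equivalent request through the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}]}'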
Expected Behavior
Main Server should:
- Detect the LLAMACPP_GRPC_SERVERS environment variable
- Pass the RPC configuration to the cuda12-llama-cpp backend
- Connect to both RPC workers
- Distribute model layers across 3 GPUs (1 local + 2 remote)
- Load model successfully
Workers should show:
Accepted client connection
Receiving tensors...
Model loaded successfully
[Connection maintained for inference]
Main Server logs should show:
Loading model: gpt-oss-20b
Registered RPC devices: 192.168.0.177:50052, 192.168.0.229:50052
llm_load_tensors: offloaded 33/33 layers to GPU
Model loaded (distributed across 3 GPUs)
Actual Behavior
Main Server logs:
5:24AM INF BackendLoader starting backend=cuda12-llama-cpp modelID=gpt-oss-20b o.model=gpt-oss-20b-Q4_K_M.gguf
5:24AM ERR Failed to load model gpt-oss-20b with backend cuda12-llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling" modelID=gpt-oss-20b
5:24AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unknown desc = Unexpected error in RPC handling
Workers show:
[Complete silence - no connection attempts]
Critical observation: Workers show no "Accepted client connection" message, indicating LocalAI never attempts to connect to them.
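This can be cross-checked from the main server with a generic packet capture (nothing LocalAI-specific) while the load is triggered; complete silence on this filter confirms no connection attempt is ever made:
# Capture any traffic to/from the two RPC workers during a load attempt
sudo tcpdump -i any -nn 'port 50052 and (host 192.168.0.177 or host 192.168.0.229)'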
Model Configuration
~/models/gpt-oss-20b.yaml:
name: gpt-oss-20b
backend: cuda12-llama-cpp
parameters:
  model: gpt-oss-20b-Q4_K_M.gguf
  gpu_layers: 99
  f16: true
  mmap: true
  context_size: 8192
  threads: 8
  batch: 512
  temperature: 0.7
  top_k: 40
  top_p: 0.9
stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>
template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>
Note: Initially tried adding options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"] to the YAML, but this had no effect (and appears to be unsupported).
Verification: Local-Only Mode Works
To confirm the model file and backend are functional, created a local-only configuration:
~/models/gpt-oss-20b-local.yaml:
name: gpt-oss-20b-local
backend: cuda12-llama-cpp
parameters:
  model: gpt-oss-20b-Q4_K_M.gguf
  gpu_layers: 25
  f16: true
  mmap: true
  context_size: 2048
stopwords:
  - <|im_end|>
  - <dummy32000>
  - </s>
  - <|endoftext|>
  - <|return|>
template:
  chat: |-
    <|start|>system<|message|>You are a helpful assistant.<|end|>{{- .Input -}}<|start|>assistant
  chat_message: |-
    <|start|>{{ .RoleName }}<|message|>{{ .Content }}<|end|>
Started LocalAI WITHOUT RPC:
unset LLAMACPP_GRPC_SERVERS
export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
local-ai run --models-path ~/models/
Result: Model loads and responds successfully (partially offloaded to single GPU).
✅ Confirms:
- Model file is valid
- cuda12-llama-cpp backend works
- GPU is functional
- Issue is specifically with RPC configuration not being applied
Troubleshooting Attempted
1. Backend Version Mismatch (Resolved)
Initially, the workers were running cuda11-llama-cpp while the main server used cuda12-llama-cpp. All systems were then upgraded to use cuda12-llama-cpp consistently.
2. CUDA Toolkit Upgrade (Completed)
Upgraded main server from CUDA 11.5 to CUDA 12.6 (workers don't need CUDA toolkit, only drivers).
3. Firewall Configuration (Verified)
sudo ufw allow from 192.168.0.0/24
Network connectivity confirmed working via telnet/nc.
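For completeness, the active rule set can be re-checked on each worker with ufw's status output:
# Confirm ufw is active and the allow rule is present (run on each worker)
sudo ufw status numbered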
4. Environment Variable Formats Tested
# Tried all these formats:
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMACPP_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
LLAMA_RPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052"
# Also tried YAML options:
options: ["rpc_servers:192.168.0.177:50052,192.168.0.229:50052"]
options: ["grpc_servers:192.168.0.177:50052,192.168.0.229:50052"]None resulted in RPC connections being attempted.
5. Debug Logging Enabled
DEBUG=true LOCALAI_LOG_LEVEL=debug \
LLAMACPP_GRPC_SERVERS="192.168.0.177:50052,192.168.0.229:50052" \
local-ai run --models-path ~/models/
No additional RPC-related debug output observed.
Analysis
The issue appears to be that the LLAMACPP_GRPC_SERVERS environment variable is:
- Set correctly in the shell environment (verified with echo)
- Not being detected or passed to the cuda12-llama-cpp backend during model loading
- Never triggering RPC connection attempts (workers show no activity)
The backend appears to be attempting local-only loading and failing with "RPC handling" error, possibly because it's trying to use RPC features without proper configuration.
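One way to check whether the variable actually reaches the backend process (a sketch; it assumes the backend is spawned as a child process whose command line contains "cuda12-llama-cpp") is to dump that process's environment while a load attempt is in flight:
# During a model load attempt, inspect the backend process environment.
# The pgrep pattern is an assumption about the backend process name.
BACKEND_PID=$(pgrep -f 'cuda12-llama-cpp' | head -n1)
tr '\0' '\n' < "/proc/$BACKEND_PID/environ" | grep -i 'GRPC_SERVERS' \
  || echo "LLAMACPP_GRPC_SERVERS not present in backend environment"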
Questions
- Is LLAMACPP_GRPC_SERVERS the correct environment variable name for LocalAI v3.7.0?
- Does the cuda12-llama-cpp backend support RPC, or is there a separate RPC-enabled backend variant?
- How does LocalAI pass environment variables to backends? Is there a subprocess isolation issue?
- Should RPC configuration be in YAML instead? If so, what's the correct format?
- Are there any prerequisites for RPC mode (config files, flags, etc.) beyond the environment variable?
Expected Fix
Either:
- Fix environment variable propagation so LLAMACPP_GRPC_SERVERS reaches the backend process
- Add YAML configuration support for RPC servers (e.g., rpc_servers: ["192.168.0.177:50052", "192.168.0.229:50052"])
- Document the correct method if the current approach is wrong
- Add debug logging to show when RPC configuration is detected and applied
Additional Context
- This setup worked briefly during testing when using nc to connect manually, which proves the network and RPC servers work; a cross-check with plain llama.cpp is sketched below
- LocalAI's P2P mode might be an alternative, but it requires dynamic discovery, which isn't ideal for static LAN setups
- Other users may face this issue when trying to configure RPC with fixed IP addresses
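For completeness, the RPC link itself could be exercised independently of LocalAI using an upstream llama.cpp build with RPC support compiled in. This has not been run here; it is only a sketch (paths and the assumption of a -DGGML_RPC=ON build are hypothetical) to help isolate where the problem lies:
# Hypothetical cross-check with upstream llama.cpp (assumes a local build
# configured with -DGGML_RPC=ON). If this distributes layers across the
# workers, the RPC servers are fine and the issue is in LocalAI's plumbing.
./llama-cli -m ~/models/gpt-oss-20b-Q4_K_M.gguf -ngl 99 -p "Hello" \
  --rpc 192.168.0.177:50052,192.168.0.229:50052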
Related Documentation
- [LocalAI Distributed Inference Docs](https://localai.io/features/distribute/)
- [llama.cpp RPC Documentation](https://github.com/ggerganov/llama.cpp/blob/master/tools/rpc/README.md)
- LocalAI PR #2324 (feat(llama.cpp): add distributed llama.cpp inferencing) - Original distributed inference implementation
System Information:
$ local-ai --version
LocalAI version: v3.7.0 (9ecfdc593827942488daa6c4027047f0ac04bd6d)
$ nvcc --version
Cuda compilation tools, release 12.6, V12.6.77
$ nvidia-smi
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
$ ls -la ~/backends/cuda12-llama-cpp/
drwxr-x--- 3 rnr rnr 4096 Nov 20 10:13 .
-rwxr-xr-x 1 rnr rnr 458325176 Oct 31 21:45 llama-cpp-grpc
-rwxr-xr-x 1 rnr rnr 396851480 Oct 31 21:45 llama-cpp-rpc-server