Multi-model LLM inference on a single machine (AMD RX 7900, 20GB VRAM) with secure Cloudflare tunnel exposure.
- Qwen 7B Instruct: `qwen.yourdomain.com` (port 8000)
- Llama 2 8B Instruct: `llama.yourdomain.com` (port 8001)
- Both models served via vLLM with an OpenAI-compatible API
- Exposed via Cloudflare Tunnel for secure public access
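As a rough sketch, the `docker-compose.yml` orchestration described above could look like the following; the service names, build contexts, device mappings, and volume name here are assumptions for illustration, not the repository's actual file:

```yaml
services:
  qwen:
    build: ./qwen-7b              # assumed build context
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd                  # AMD ROCm device nodes
      - /dev/dri
    volumes:
      - hf-cache:/root/.cache/huggingface
  llama:
    build: ./llama8b
    ports:
      - "8001:8001"
    devices:
      - /dev/kfd
      - /dev/dri
    volumes:
      - hf-cache:/root/.cache/huggingface
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${CLOUDFLARE_TUNNEL_TOKEN}

volumes:
  hf-cache:                       # shared HuggingFace model cache
```

Sharing one named volume for the HuggingFace cache keeps both models from re-downloading weights on every container rebuild.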
- Docker with AMD GPU support
- HuggingFace account (for model downloads, if not cached)
- Cloudflare tunnel token
```powershell
$env:CLOUDFLARE_TUNNEL_TOKEN = "your-tunnel-token"
docker-compose build
docker-compose up
```
```powershell
python3.10 -m venv venv
.\venv\Scripts\Activate.ps1
pip install requests
python test_client.py
```
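A minimal sketch of what a compatibility test client like `test_client.py` might do; the endpoint map and the Llama model identifier are assumptions based on the setup above, not the repository's actual code:

```python
"""Check both local vLLM servers for OpenAI-style /v1/completions compatibility."""
import requests

# Assumed endpoint/model pairs; the real test_client.py may differ.
ENDPOINTS = {
    "qwen": ("http://localhost:8000", "Qwen/Qwen-7B-Instruct"),
    "llama": ("http://localhost:8001", "llama-8b-instruct"),  # placeholder model id
}


def build_request(base_url: str, model: str, prompt: str, max_tokens: int = 50):
    """Return the URL and JSON body for an OpenAI-style completion call."""
    url = f"{base_url}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return url, payload


if __name__ == "__main__":
    for name, (base_url, model) in ENDPOINTS.items():
        url, payload = build_request(base_url, model, "Hello")
        resp = requests.post(url, json=payload, timeout=60)
        resp.raise_for_status()
        # OpenAI-compatible responses put generated text under choices[0].text
        print(name, resp.json()["choices"][0]["text"][:80])
```

Because both servers expose the same OpenAI-compatible schema, the same request body works against either port.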
- Configure Cloudflare Tunnel
  - Update `cloudflare/config.yml` with your tunnel ID and domain
  - Replace the `<your-tunnel-id>` and `yourdomain.com` placeholders
- Performance Tuning (if needed)
  - Adjust `gpu_memory_utilization` in the `server.py` files if OOM errors occur
  - Consider model quantization for better concurrency
- Production Deployment
  - Set up persistent model caching in Docker volumes
  - Add an authentication layer (API keys) on top of the vLLM servers
  - Monitor GPU temperature and memory under load
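For the Cloudflare Tunnel configuration step, a plausible `cloudflare/config.yml` ingress layout is sketched below; it reuses the same `<your-tunnel-id>` and `yourdomain.com` placeholders, and the exact structure of the repository's file is an assumption:

```yaml
tunnel: <your-tunnel-id>
credentials-file: /etc/cloudflared/<your-tunnel-id>.json

ingress:
  - hostname: qwen.yourdomain.com
    service: http://localhost:8000
  - hostname: llama.yourdomain.com
    service: http://localhost:8001
  # cloudflared requires a final catch-all rule; anything else returns 404
  - service: http_status:404
```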
For faster iteration without containers:
```powershell
$env:VLLM_ATTENTION_BACKEND = "xformers"
pip install vllm fastapi uvicorn
python qwen-7b/server.py     # terminal 1
python llama8b/server.py     # terminal 2 (each server blocks its shell)

# Smoke-test the Qwen endpoint (curl.exe avoids PowerShell's curl alias for Invoke-WebRequest)
curl.exe -X POST http://localhost:8000/v1/completions `
  -H "Content-Type: application/json" `
  -d '{"model":"Qwen/Qwen-7B-Instruct","prompt":"Hello","max_tokens":50}'
```
- `qwen-7b/server.py`: Qwen 7B inference with OpenAI API wrapper
- `llama8b/server.py`: Llama 8B inference with OpenAI API wrapper
- `docker-compose.yml`: Orchestration of both models + Cloudflare tunnel
- `test_client.py`: Compatibility test client
- `.github/copilot-instructions.md`: AI agent development guide