Multi-model LLM inference on a single machine (AMD RX 7900, 20GB VRAM) with secure Cloudflare tunnel exposure.
- Qwen 7B Instruct: `qwen.yourdomain.com` (port 8000)
- Llama 2 8B Instruct: `llama.yourdomain.com` (port 8001)
- Both models served via vLLM with an OpenAI-compatible API
- Exposed via Cloudflare Tunnel for secure public access
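As a rough sketch, the `docker-compose.yml` orchestration described above could look like the following; the service names, build contexts, device mappings, and volume name here are assumptions for illustration, not the repository's actual file:

```yaml
services:
  qwen:
    build: ./qwen-7b              # assumed build context
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd                  # AMD ROCm device nodes
      - /dev/dri
    volumes:
      - hf-cache:/root/.cache/huggingface
  llama:
    build: ./llama8b
    ports:
      - "8001:8001"
    devices:
      - /dev/kfd
      - /dev/dri
    volumes:
      - hf-cache:/root/.cache/huggingface
  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${CLOUDFLARE_TUNNEL_TOKEN}

volumes:
  hf-cache:                       # shared HuggingFace model cache
```

Sharing one named volume for the HuggingFace cache keeps both models from re-downloading weights on every container rebuild.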
- Docker with AMD GPU support
- HuggingFace account (for model downloads, if not cached)
- Cloudflare tunnel token
```powershell
$env:CLOUDFLARE_TUNNEL_TOKEN = "your-tunnel-token"
docker-compose build
docker-compose up
```
```powershell
python3.10 -m venv venv
.\venv\Scripts\Activate.ps1
pip install requests
python test_client.py
```
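A minimal sketch of what a compatibility test client like `test_client.py` might do; the endpoint map and the Llama model identifier are assumptions based on the setup above, not the repository's actual code:

```python
"""Check both local vLLM servers for OpenAI-style /v1/completions compatibility."""
import requests

# Assumed endpoint/model pairs; the real test_client.py may differ.
ENDPOINTS = {
    "qwen": ("http://localhost:8000", "Qwen/Qwen-7B-Instruct"),
    "llama": ("http://localhost:8001", "llama-8b-instruct"),  # placeholder model id
}


def build_request(base_url: str, model: str, prompt: str, max_tokens: int = 50):
    """Return the URL and JSON body for an OpenAI-style completion call."""
    url = f"{base_url}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return url, payload


if __name__ == "__main__":
    for name, (base_url, model) in ENDPOINTS.items():
        url, payload = build_request(base_url, model, "Hello")
        resp = requests.post(url, json=payload, timeout=60)
        resp.raise_for_status()
        # OpenAI-compatible responses put generated text under choices[0].text
        print(name, resp.json()["choices"][0]["text"][:80])
```

Because both servers expose the same OpenAI-compatible schema, the same request body works against either port.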
- Configure Cloudflare Tunnel
  - Update `cloudflare/config.yml` with your tunnel ID and domain
  - Replace the `<your-tunnel-id>` and `yourdomain.com` placeholders
- Performance Tuning (if needed)
  - Adjust `gpu_memory_utilization` in the `server.py` files if OOM errors occur
  - Consider model quantization for better concurrency
- Production Deployment
  - Set up persistent model caching in Docker volumes
  - Add an authentication layer (API keys) on top of the vLLM servers
  - Monitor GPU temperature and memory under load
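For the Cloudflare Tunnel configuration step, a plausible `cloudflare/config.yml` ingress layout is sketched below; it reuses the same `<your-tunnel-id>` and `yourdomain.com` placeholders, and the exact structure of the repository's file is an assumption:

```yaml
tunnel: <your-tunnel-id>
credentials-file: /etc/cloudflared/<your-tunnel-id>.json

ingress:
  - hostname: qwen.yourdomain.com
    service: http://localhost:8000
  - hostname: llama.yourdomain.com
    service: http://localhost:8001
  # cloudflared requires a final catch-all rule; anything else returns 404
  - service: http_status:404
```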
For faster iteration without containers:
```powershell
$env:VLLM_ATTENTION_BACKEND = "xformers"
pip install vllm fastapi uvicorn
python qwen-7b/server.py     # terminal 1
python llama8b/server.py     # terminal 2 (each server blocks its shell)

# Smoke-test the Qwen endpoint (curl.exe avoids PowerShell's curl alias for Invoke-WebRequest)
curl.exe -X POST http://localhost:8000/v1/completions `
  -H "Content-Type: application/json" `
  -d '{"model":"Qwen/Qwen-7B-Instruct","prompt":"Hello","max_tokens":50}'
```
- `qwen-7b/server.py`: Qwen 7B inference with OpenAI API wrapper
- `llama8b/server.py`: Llama 8B inference with OpenAI API wrapper
- `docker-compose.yml`: Orchestration of both models + Cloudflare tunnel
- `test_client.py`: Compatibility test client
- `.github/copilot-instructions.md`: AI agent development guide