# Model Host: AMD GPU LLM Inference Server

Multi-model LLM inference on a single machine (AMD RX 7900, 20GB VRAM) with secure Cloudflare tunnel exposure.

## Architecture

- **Qwen 7B Instruct** at `qwen.yourdomain.com` (local port 8000)
- **Llama 8B Instruct** at `llama.yourdomain.com` (local port 8001)
- Both models served via vLLM with an OpenAI-compatible API
- Exposed via Cloudflare Tunnel for secure public access
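The hostname-to-port mapping above can be sketched in a few lines of Python. The helper below is hypothetical (the dictionary, function name, and the Llama model ID are assumptions for illustration, not taken from the repo's `server.py` files):

```python
# Hypothetical map from public hostname to the local vLLM server behind it.
# Hostnames use the yourdomain.com placeholder from this README; the Llama
# model ID is a guess and should match whatever llama8b/server.py registers.
ROUTES = {
    "qwen.yourdomain.com": {"port": 8000, "model": "Qwen/Qwen-7B-Instruct"},
    "llama.yourdomain.com": {"port": 8001, "model": "llama-8b-instruct"},
}

def local_base_url(hostname: str) -> str:
    """OpenAI-compatible base URL of the local server behind a hostname."""
    return f"http://localhost:{ROUTES[hostname]['port']}/v1"

print(local_base_url("qwen.yourdomain.com"))  # http://localhost:8000/v1
```

The Cloudflare Tunnel's job is then just to forward each hostname to its local port; the models themselves never need to know they are publicly exposed.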

## Quick Start

### Prerequisites

- Docker with AMD GPU support
- Hugging Face account (for model downloads, if not already cached)
- Cloudflare tunnel token

### Build & Run

```powershell
# Set your tunnel token
$env:CLOUDFLARE_TUNNEL_TOKEN = "your-tunnel-token"

# Build and start services
docker-compose build
docker-compose up
```

### Test the Servers

```powershell
# Create and activate a venv
python3.10 -m venv venv
.\venv\Scripts\Activate.ps1

# Install test client dependencies
pip install requests

# Run tests
python test_client.py
```
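`test_client.py` itself is not reproduced in this README; a minimal compatibility check along the same lines might look like the sketch below. The endpoint, model ID, and response shape follow the OpenAI completions convention the servers expose; the function names are made up here:

```python
def build_completion_request(model: str, prompt: str, max_tokens: int = 50) -> dict:
    """Assemble an OpenAI-style /v1/completions payload."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def extract_text(response_json: dict) -> str:
    """Pull the first generated completion out of an OpenAI-style response."""
    return response_json["choices"][0]["text"]

if __name__ == "__main__":
    import requests  # installed above via `pip install requests`

    payload = build_completion_request("Qwen/Qwen-7B-Instruct", "Hello")
    resp = requests.post("http://localhost:8000/v1/completions",
                         json=payload, timeout=60)
    resp.raise_for_status()
    print(extract_text(resp.json()))
```

Pointing the same script at port 8001 (with the Llama model ID) exercises the second server, since both expose the same API surface.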

## Next Steps

1. **Configure Cloudflare Tunnel**
   - Update `cloudflare/config.yml` with your tunnel ID and domain
   - Replace the `<your-tunnel-id>` and `yourdomain.com` placeholders
2. **Performance tuning** (if needed)
   - Adjust `gpu_memory_utilization` in the `server.py` files if OOM errors occur
   - Consider model quantization for better concurrency
3. **Production deployment**
   - Set up persistent model caching in Docker volumes
   - Add an authentication layer (API keys) on top of the vLLM servers
   - Monitor GPU temperature and memory under load
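For the API-key item above, one common pattern is a small check applied before requests reach the vLLM app. The sketch below is an assumption about how that could look, not the repo's design; the header name and key store are placeholders:

```python
# Minimal API-key gate; in practice this would run as FastAPI middleware or a
# dependency in front of the vLLM servers. The key here is a placeholder:
# load real keys from an environment variable or a secret store.
VALID_KEYS = {"demo-key-1"}

def is_authorized(headers: dict) -> bool:
    """Accept a request only if its X-API-Key header matches a known key."""
    return headers.get("x-api-key") in VALID_KEYS

print(is_authorized({"x-api-key": "demo-key-1"}))  # True
print(is_authorized({}))                           # False
```

Because the servers sit behind a Cloudflare Tunnel, an alternative is to enforce access at the edge with Cloudflare Access instead of in application code.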

## Local Development

For faster iteration without containers:

```powershell
# Set environment for AMD GPU
$env:VLLM_ATTENTION_BACKEND = "xformers"

# Install dependencies
pip install vllm fastapi uvicorn

# Run a single server
python qwen-7b/server.py
# or
python llama8b/server.py

# Test the endpoint (use curl.exe in Windows PowerShell, where curl
# is an alias for Invoke-WebRequest)
curl -X POST http://localhost:8000/v1/completions `
  -H "Content-Type: application/json" `
  -d '{"model":"Qwen/Qwen-7B-Instruct","prompt":"Hello","max_tokens":50}'
```

## Key Files

- `qwen-7b/server.py`: Qwen 7B inference with an OpenAI API wrapper
- `llama8b/server.py`: Llama 8B inference with an OpenAI API wrapper
- `docker-compose.yml`: orchestration of both models plus the Cloudflare tunnel
- `test_client.py`: compatibility test client
- `.github/copilot-instructions.md`: AI agent development guide

## About

Server setup for hosting qwen7b-instruct and llama8b for sentiment analysis.
