A production-ready, scalable inference service for Large Language Models (LLMs) built on Kubernetes. This service leverages vLLM for high-performance inference and implements an asynchronous, queue-based architecture for handling concurrent requests efficiently.
- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- Deployment
- Usage
- Configuration
- Testing
- Monitoring
- Scaling
- Troubleshooting
- Contributing
- License
This project provides a cloud-native LLM inference service designed for production workloads. It handles the complexities of serving large language models at scale, including:
- High-Performance Inference: Powered by vLLM for up to 24x higher throughput than standard HuggingFace Transformers
- Asynchronous Processing: Non-blocking request handling via RabbitMQ
- Horizontal Scalability: Add workers to handle increased load
- GPU Optimization: Efficient GPU memory management with PagedAttention
- Production Ready: Comprehensive error handling, logging, and testing
Default Model: Llama-2-7b-chat-hf (easily configurable to other models)
The system consists of four microservices deployed on Kubernetes:
```
                ┌─────────────┐
                │   Client    │
                └──────┬──────┘
                       │ HTTP
                       ▼
┌─────────────────┐      ┌────────────────┐      ┌─────────────┐
│   API Server    │─────▶│    RabbitMQ    │─────▶│ LLM Service │
│   (FastAPI)     │      │ (Queue/Broker) │      │   (vLLM)    │
└────────┬────────┘      └────────────────┘      └──────┬──────┘
         │                                              │
         │               ┌────────────────┐             │
         └──────────────▶│   PostgreSQL   │◀────────────┘
                         │   (Database)   │
                         └────────────────┘
```
| Component | Technology | Purpose |
|---|---|---|
| API Server | FastAPI | Receives user requests, returns job IDs, queries status |
| Message Queue | RabbitMQ | Distributes inference tasks to workers asynchronously |
| LLM Service | vLLM + CUDA | Processes inference requests using GPU-accelerated models |
| Database | PostgreSQL | Stores job results and maintains request history |
- Client → POST request to API Server with prompt
- API Server → Generates job ID, publishes task to RabbitMQ
- API Server → Returns job ID to client immediately
- LLM Service → Consumes task from queue, runs inference
- LLM Service → Stores result in PostgreSQL
- Client → Polls API Server with job ID to retrieve result
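The flow above can be sketched end to end in a single process. This is an illustrative stand-in only, not the service's actual code: `queue.Queue` plays RabbitMQ and a plain dict plays PostgreSQL.

```python
import queue
import threading
import uuid

# In-process stand-ins for RabbitMQ (task queue) and PostgreSQL (results table).
tasks: "queue.Queue" = queue.Queue()
results: dict = {}

def submit(prompt: str) -> str:
    """API server: generate a job ID, publish the task, return immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    tasks.put((job_id, prompt))
    return job_id

def worker() -> None:
    """LLM service: consume one task, run 'inference', store the result."""
    job_id, prompt = tasks.get()
    results[job_id] = {"status": "completed", "result": f"echo: {prompt}"}

def get_status(job_id: str) -> dict:
    """API server: the status endpoint the client polls."""
    return results.get(job_id, {"status": "not_found"})

job = submit("Write a haiku about clouds")  # returns before any inference runs
t = threading.Thread(target=worker)
t.start()
t.join()
print(get_status(job)["status"])  # completed
```

The key property is step 3: `submit` returns before any inference happens, so the API server never blocks on the GPU.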
- 24x Faster: vLLM with PagedAttention vs. standard transformers
- GPU Accelerated: NVIDIA CUDA 11.8 support
- Batching: Continuous batching for optimal throughput
- Memory Efficient: 2x reduction in GPU memory usage
- Async Processing: Non-blocking request handling
- Auto-Retry: Automatic reconnection and retry logic
- Error Handling: Comprehensive exception handling
- Message Durability: No data loss on worker failure
- Horizontally Scalable: Add workers as needed
- Configurable: Environment variables and YAML config
- Well Tested: Comprehensive test suite included
- Documented: Complete setup and usage documentation
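The auto-retry behaviour can be pictured as an exponential-backoff loop. The helper below is a hedged sketch, not the service's actual reconnection code:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff on ConnectionError."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Demo: a connection that fails twice, then succeeds.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker not ready")
    return "connected"

result = with_retries(flaky_connect, base_delay=0.01)
print(result, "after", calls["n"], "attempts")  # connected after 3 attempts
```

Message durability, by contrast, comes from RabbitMQ itself: with durable queues and per-message acknowledgements, a task whose worker dies is redelivered to another worker.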
Before deployment, ensure you have:
- Google Cloud Platform (GCP) account with billing enabled
- gcloud CLI installed and authenticated
- kubectl installed and configured
- Docker installed for building images
- GPU Quota on GCP (at least 1x NVIDIA T4 or better)
For local testing and development:
- Python 3.9+
- CUDA 11.8+ (for GPU inference)
- PostgreSQL
- RabbitMQ
- 16GB+ RAM
- NVIDIA GPU with 16GB+ VRAM
git clone https://github.com/yourusername/llm-inference-service.git
cd llm-inference-service

# Create GKE cluster with GPU nodes
gcloud container clusters create llm-cluster \
--region us-central1 \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-t4,count=1 \
--num-nodes 3
# Authenticate
gcloud container clusters get-credentials llm-cluster \
--region us-central1 \
--project your-project-id

# Install the NVIDIA device plugin for GPU scheduling
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

# Deploy database
kubectl apply -f services/db-service/deployment-db.yaml
kubectl apply -f services/db-service/service-db.yaml
# Deploy RabbitMQ
kubectl apply -f services/pub-sub-service/deployment/deployment-pub-sub.yaml
kubectl apply -f services/pub-sub-service/deployment/service-pub-sub.yaml
# Wait for services to be ready
kubectl wait --for=condition=ready pod -l app=db --timeout=300s
kubectl wait --for=condition=ready pod -l app=rabbitmq --timeout=300s
# Deploy LLM service
kubectl apply -f services/llm-service/deployment/deployment-llm-service.yaml
kubectl apply -f services/llm-service/deployment/service-llm-service.yaml
# Deploy API server
kubectl apply -f services/api-server/deployment/deployment-api-server.yaml
kubectl apply -f services/api-server/deployment/service-api-server.yaml
kubectl apply -f services/api-server/deployment/ingress-api-server.yaml

# Check all pods are running
kubectl get pods
# Expected output:
# NAME READY STATUS RESTARTS AGE
# api-server-xxx 1/1 Running 0 2m
# db-xxx 1/1 Running 0 5m
# inference-xxx 1/1 Running 0 3m
# rabbitmq-xxx 1/1 Running 0 5m
# Get API server external IP
kubectl get svc api-server

# Submit a request
export API_URL=$(kubectl get svc api-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST http://${API_URL}:8000/chat \
-H "Content-Type: application/json" \
-d '{"text": "Explain quantum computing in simple terms"}'
# Output: {"job_id": "550e8400-e29b-41d4-a716-446655440000"}
# Check status
curl http://${API_URL}:8000/status/550e8400-e29b-41d4-a716-446655440000

If you need to customize the services:
# Build LLM service
cd services/llm-service
docker build -t your-dockerhub-username/llm-service:latest .
docker push your-dockerhub-username/llm-service:latest
# Build API server
cd ../api-server
docker build -t your-dockerhub-username/api-server:latest .
docker push your-dockerhub-username/api-server:latest
# Update deployment files with your image names
# Then apply the deployments

Llama-2 requires a HuggingFace access token:
- Request access at HuggingFace Llama-2
- Get your token from HuggingFace Settings
- Create Kubernetes secret:
```
kubectl create secret generic huggingface-secret \
  --from-literal=token=hf_your_token_here
```

- Uncomment the `HF_TOKEN` section in `services/llm-service/deployment/deployment-llm-service.yaml`
Use the provided script for automated deployment:
```
./scripts/deploy_gke.sh
```

```python
import requests
import time

API_URL = "http://your-api-server-ip:8000"

def submit_request(prompt: str) -> str:
    """Submit inference request and return job ID."""
    response = requests.post(
        f"{API_URL}/chat",
        json={"text": prompt},
    )
    return response.json()["job_id"]

def get_result(job_id: str, timeout: int = 120) -> str:
    """Poll for result until completed or timeout."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        response = requests.get(f"{API_URL}/status/{job_id}")
        data = response.json()
        if data["status"] == "completed":
            return data["result"]
        print("Processing...", end="\r")
        time.sleep(2)
    raise TimeoutError("Request timed out")

# Usage
job_id = submit_request("Write a haiku about clouds")
print(f"Job ID: {job_id}")
result = get_result(job_id)
print(f"Result: {result}")
```

Use the provided example client:
# Basic usage
python example_client.py --prompt "Explain machine learning"
# Custom endpoint
python example_client.py \
--url http://your-ip:8000 \
--prompt "Write a story about AI" \
--max-wait 300
# Submit without waiting
python example_client.py --prompt "Hello" --no-wait

# Submit request
curl -X POST http://api-url:8000/chat \
-H "Content-Type: application/json" \
-d '{"text": "What is the meaning of life?"}'
# Check status
curl http://api-url:8000/status/{job_id}
# View API documentation
curl http://api-url:8000/docs

Edit `services/llm-service/config/config.yaml`:
```yaml
model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  tensor_parallel_size: 1
  max_model_len: 2048
  gpu_memory_utilization: 0.9

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 512
```

Configure via deployment YAML or environment:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `meta-llama/Llama-2-7b-chat-hf` | HuggingFace model ID |
| `TENSOR_PARALLEL_SIZE` | `1` | Number of GPUs for model parallelism |
| `MAX_MODEL_LEN` | `2048` | Maximum sequence length |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0.0-1.0) |
| `TEMPERATURE` | `0.7` | Sampling temperature |
| `TOP_P` | `0.9` | Nucleus sampling parameter |
| `MAX_TOKENS` | `512` | Maximum tokens to generate |
The service supports any vLLM-compatible model:
- `meta-llama/Llama-2-7b-chat-hf` (default)
- `meta-llama/Llama-2-13b-chat-hf`
- `mistralai/Mistral-7B-Instruct-v0.1`
- `tiiuae/falcon-7b-instruct`
- `HuggingFaceH4/zephyr-7b-beta`

To change models, update `MODEL_NAME` in the deployment YAML.
cd services/tests
pip install -r requirements.txt
# Run all tests
pytest test_db.py -v
# Run specific test categories
pytest test_db.py::TestDatabase -v # Database tests
pytest test_db.py::TestRabbitMQ -v # Message queue tests
pytest test_db.py::TestAPIServer -v # API endpoint tests
pytest test_db.py::TestEndToEnd -v # Integration tests
# Run with coverage
pytest test_db.py --cov --cov-report=html

- Database Tests: Connection, CRUD operations, error handling
- RabbitMQ Tests: Queue operations, message durability
- API Tests: Endpoint validation, status checking
- Integration Tests: End-to-end workflow
- Performance Tests: Concurrent request handling
See services/tests/README.md for detailed testing documentation.
# API Server
kubectl logs -f deployment/api-server
# LLM Service (inference worker)
kubectl logs -f deployment/inference
# Database
kubectl logs -f deployment/db
# RabbitMQ
kubectl logs -f deployment/rabbitmq

# Port forward to access management interface
kubectl port-forward svc/rabbitmq-service 15672:15672
# Open browser to http://localhost:15672
# Default credentials: guest / guest

- Inference Latency: Time from queue pickup to result storage
- Queue Depth: Number of pending tasks in RabbitMQ
- GPU Utilization: Monitor with `nvidia-smi` in LLM pods
- Error Rate: Failed inference attempts
- Throughput: Requests processed per second
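Queue depth can be read from RabbitMQ's management API (the `/api/queues` endpoint on the port-forwarded management port 15672). The queue name `inference` below is a placeholder for whatever name your deployment uses:

```python
import json
import urllib.request

def queue_depth(payload: bytes, queue_name: str) -> int:
    """Extract the pending-message count for one queue from a
    GET /api/queues response body (RabbitMQ management API)."""
    for q in json.loads(payload):
        if q["name"] == queue_name:
            return q["messages"]
    raise KeyError(queue_name)

# Live usage (with `kubectl port-forward` running, credentials guest:guest):
#   req = urllib.request.Request(
#       "http://localhost:15672/api/queues",
#       headers={"Authorization": "Basic Z3Vlc3Q6Z3Vlc3Q="},  # base64("guest:guest")
#   )
#   depth = queue_depth(urllib.request.urlopen(req).read(), "inference")
# Offline demo with a canned response:
sample = json.dumps([{"name": "inference", "messages": 7}]).encode()
print(queue_depth(sample, "inference"))  # 7
```

A depth that grows steadily under load is the signal to add inference workers (see Scaling below).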
Scale inference workers to handle more load:
# Scale to 3 workers
kubectl scale deployment inference --replicas=3
# Scale API servers
kubectl scale deployment api-server --replicas=5
# Verify
kubectl get deployments

| Workers | Concurrent Users | GPU Type | Expected Latency |
|---|---|---|---|
| 1 | 10-20 | T4 | 2-5s |
| 3 | 30-60 | T4 | 2-5s |
| 5 | 50-100 | T4 | 2-5s |
| 1 | 40-80 | A100 | 1-2s |
Per Worker (Llama-2-7b-chat-hf):
- GPU: 1x NVIDIA T4 (16GB VRAM) or better
- RAM: 16GB
- CPU: 4 cores
- Storage: 20GB (for model cache)
Larger Models:
- Llama-2-13b: 1x A100 (40GB) or 2x T4 with tensor parallelism
- Llama-2-70b: 4x A100 (40GB) or 8x A100 (80GB)
Problem: Model takes too long to load or fails
Solutions:
- First run downloads ~14GB model (2-5 minutes)
- Use persistent volumes for model cache
- Check HuggingFace token for gated models
- Verify GPU availability:
kubectl exec -it <pod> -- nvidia-smi
Problem: OOM errors in LLM service
Solutions:
```yaml
# Reduce GPU memory utilization
- name: GPU_MEMORY_UTILIZATION
  value: "0.85"  # Try 0.8, 0.7, etc.

# Reduce max sequence length
- name: MAX_MODEL_LEN
  value: "1024"  # Down from 2048

# Use a smaller model
- name: MODEL_NAME
  value: "meta-llama/Llama-2-7b-chat-hf"  # Instead of 13b
```

Problem: Services can't communicate
Solutions:
# Check all pods are running
kubectl get pods
# Verify services
kubectl get svc
# Test connectivity from pod
kubectl exec -it <api-pod> -- ping rabbitmq-service
kubectl exec -it <api-pod> -- ping db-service
# Check logs for errors
kubectl logs <pod-name>

Problem: Inference takes too long
Solutions:
- Check GPU utilization: Should be >80%
- Increase `GPU_MEMORY_UTILIZATION` for more parallel requests
- Use a larger GPU (A100 vs. T4)
- Enable tensor parallelism for multi-GPU
- Check queue depth - may need more workers
| Error | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | GPU VRAM exhausted | Reduce batch size or model size |
| `Connection refused` | Service not ready | Wait for pods to be Running |
| `Model not found` | Invalid model name | Check HuggingFace model ID |
| `Token required` | Gated model | Add HuggingFace token |
```
llm-inference-service/
├── services/
│   ├── api-server/              # FastAPI gateway
│   │   ├── src/main.py          # API implementation
│   │   ├── config/config.yaml   # Configuration
│   │   ├── deployment/          # K8s manifests
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   │
│   ├── llm-service/             # vLLM inference worker
│   │   ├── src/main.py          # Inference implementation
│   │   ├── config/config.yaml   # Model configuration
│   │   ├── deployment/          # K8s manifests
│   │   ├── Dockerfile           # CUDA-enabled
│   │   ├── requirements.txt
│   │   └── README.md
│   │
│   ├── db-service/              # PostgreSQL
│   │   ├── deployment-db.yaml
│   │   └── service-db.yaml
│   │
│   ├── pub-sub-service/         # RabbitMQ
│   │   └── deployment/
│   │       ├── deployment-pub-sub.yaml
│   │       └── service-pub-sub.yaml
│   │
│   └── tests/                   # Test suite
│       ├── test_db.py           # All tests
│       ├── requirements.txt
│       └── README.md
│
├── scripts/
│   ├── deploy_gke.sh            # Deployment automation
│   └── push_images.sh           # Image build/push
│
├── example_client.py            # Example Python client
├── IMPLEMENTATION_NOTES.md      # Technical details
├── IMPLEMENTATION_SUMMARY.md    # Implementation overview
├── README.md                    # This file
└── LICENSE
```
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Clone your fork
git clone https://github.com/yourusername/llm-inference-service.git
# Install dependencies
pip install -r services/llm-service/requirements.txt
pip install -r services/tests/requirements.txt
# Run tests
pytest services/tests/test_db.py -v
# Make changes and test

This project is licensed under the MIT License - see the LICENSE file for details.
- vLLM - High-performance LLM inference engine
- Meta AI - Llama-2 language model
- HuggingFace - Model hosting and transformers library
- FastAPI - Modern Python web framework
- RabbitMQ - Reliable message broker
For issues and questions:
- Issues: GitHub Issues
- Documentation: See `IMPLEMENTATION_NOTES.md` for technical details
- Tests: See `services/tests/README.md` for the testing guide
- ✅ vLLM integration
- ✅ Llama-2-7b-chat-hf support
- ✅ Kubernetes deployment
- ✅ Comprehensive testing
- ✅ Full documentation
- Streaming responses (Server-Sent Events)
- Authentication and rate limiting
- Prometheus metrics export
- Grafana dashboards
- Auto-scaling based on queue depth
- Multi-model support
- LoRA adapter deployment
- Cost tracking per request
- Admin dashboard
- Multi-modal support (vision, audio)
Built with ❤️ for scalable AI inference