A production-ready, scalable inference service for Large Language Models (LLMs) built on Kubernetes. This service leverages vLLM for high-performance inference and implements an asynchronous, queue-based architecture for handling concurrent requests efficiently.
- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- Deployment
- Usage
- Configuration
- Testing
- Monitoring
- Scaling
- Troubleshooting
- Contributing
- License
This project provides a cloud-native LLM inference service designed for production workloads. It handles the complexities of serving large language models at scale, including:
- High-Performance Inference: Powered by vLLM for up to 24x higher throughput than standard HuggingFace Transformers
- Asynchronous Processing: Non-blocking request handling via RabbitMQ
- Horizontal Scalability: Add workers to handle increased load
- GPU Optimization: Efficient GPU memory management with PagedAttention
- Production Ready: Comprehensive error handling, logging, and testing
Default Model: Llama-2-7b-chat-hf (easily configurable to other models)
The system consists of four microservices deployed on Kubernetes:
```
                ┌─────────────┐
                │   Client    │
                └──────┬──────┘
                       │ HTTP
                       ▼
┌─────────────────┐      ┌────────────────┐      ┌─────────────┐
│   API Server    │─────▶│    RabbitMQ    │─────▶│ LLM Service │
│   (FastAPI)     │      │ (Queue/Broker) │      │   (vLLM)    │
└────────┬────────┘      └────────────────┘      └──────┬──────┘
         │                                              │
         │               ┌────────────────┐             │
         └──────────────▶│   PostgreSQL   │◀────────────┘
                         │   (Database)   │
                         └────────────────┘
```
| Component | Technology | Purpose |
|---|---|---|
| API Server | FastAPI | Receives user requests, returns job IDs, queries status |
| Message Queue | RabbitMQ | Distributes inference tasks to workers asynchronously |
| LLM Service | vLLM + CUDA | Processes inference requests using GPU-accelerated models |
| Database | PostgreSQL | Stores job results and maintains request history |
- Client → POST request to API Server with prompt
- API Server → Generates job ID, publishes task to RabbitMQ
- API Server → Returns job ID to client immediately
- LLM Service → Consumes task from queue, runs inference
- LLM Service → Stores result in PostgreSQL
- Client → Polls API Server with job ID to retrieve result
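The flow above can be sketched end to end in a single process. This is an illustrative stand-in only, not the service's actual code: `queue.Queue` plays RabbitMQ and a plain dict plays PostgreSQL.

```python
import queue
import threading
import uuid

# In-process stand-ins for RabbitMQ (task queue) and PostgreSQL (results table).
tasks: "queue.Queue" = queue.Queue()
results: dict = {}

def submit(prompt: str) -> str:
    """API server: generate a job ID, publish the task, return immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "pending"}
    tasks.put((job_id, prompt))
    return job_id

def worker() -> None:
    """LLM service: consume one task, run 'inference', store the result."""
    job_id, prompt = tasks.get()
    results[job_id] = {"status": "completed", "result": f"echo: {prompt}"}

def get_status(job_id: str) -> dict:
    """API server: the status endpoint the client polls."""
    return results.get(job_id, {"status": "not_found"})

job = submit("Write a haiku about clouds")  # returns before any inference runs
t = threading.Thread(target=worker)
t.start()
t.join()
print(get_status(job)["status"])  # completed
```

The key property is step 3: `submit` returns before any inference happens, so the API server never blocks on the GPU.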
- 24x Faster: vLLM with PagedAttention vs. standard transformers
- GPU Accelerated: NVIDIA CUDA 11.8 support
- Batching: Continuous batching for optimal throughput
- Memory Efficient: 2x reduction in GPU memory usage
- Async Processing: Non-blocking request handling
- Auto-Retry: Automatic reconnection and retry logic
- Error Handling: Comprehensive exception handling
- Message Durability: No data loss on worker failure
- Horizontally Scalable: Add workers as needed
- Configurable: Environment variables and YAML config
- Well Tested: Comprehensive test suite included
- Documented: Complete setup and usage documentation
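The auto-retry behaviour can be pictured as an exponential-backoff loop. The helper below is a hedged sketch, not the service's actual reconnection code:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff on ConnectionError."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Demo: a connection that fails twice, then succeeds.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker not ready")
    return "connected"

result = with_retries(flaky_connect, base_delay=0.01)
print(result, "after", calls["n"], "attempts")  # connected after 3 attempts
```

Message durability, by contrast, comes from RabbitMQ itself: with durable queues and per-message acknowledgements, a task whose worker dies is redelivered to another worker.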
Before deployment, ensure you have:
- Google Cloud Platform (GCP) account with billing enabled
- gcloud CLI installed and authenticated
- kubectl installed and configured
- Docker installed for building images
- GPU Quota on GCP (at least 1x NVIDIA T4 or better)
For local testing and development:
- Python 3.9+
- CUDA 11.8+ (for GPU inference)
- PostgreSQL
- RabbitMQ
- 16GB+ RAM
- NVIDIA GPU with 16GB+ VRAM
git clone https://github.com/yourusername/llm-inference-service.git
cd llm-inference-service

# Create GKE cluster with GPU nodes
gcloud container clusters create llm-cluster \
--region us-central1 \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-t4,count=1 \
--num-nodes 3
# Authenticate
gcloud container clusters get-credentials llm-cluster \
--region us-central1 \
--project your-project-id

# Install the NVIDIA device plugin for GPU scheduling
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

# Deploy database
kubectl apply -f services/db-service/deployment-db.yaml
kubectl apply -f services/db-service/service-db.yaml
# Deploy RabbitMQ
kubectl apply -f services/pub-sub-service/deployment/deployment-pub-sub.yaml
kubectl apply -f services/pub-sub-service/deployment/service-pub-sub.yaml
# Wait for services to be ready
kubectl wait --for=condition=ready pod -l app=db --timeout=300s
kubectl wait --for=condition=ready pod -l app=rabbitmq --timeout=300s
# Deploy LLM service
kubectl apply -f services/llm-service/deployment/deployment-llm-service.yaml
kubectl apply -f services/llm-service/deployment/service-llm-service.yaml
# Deploy API server
kubectl apply -f services/api-server/deployment/deployment-api-server.yaml
kubectl apply -f services/api-server/deployment/service-api-server.yaml
kubectl apply -f services/api-server/deployment/ingress-api-server.yaml

# Check all pods are running
kubectl get pods
# Expected output:
# NAME READY STATUS RESTARTS AGE
# api-server-xxx 1/1 Running 0 2m
# db-xxx 1/1 Running 0 5m
# inference-xxx 1/1 Running 0 3m
# rabbitmq-xxx 1/1 Running 0 5m
# Get API server external IP
kubectl get svc api-server

# Submit a request
export API_URL=$(kubectl get svc api-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST http://${API_URL}:8000/chat \
-H "Content-Type: application/json" \
-d '{"text": "Explain quantum computing in simple terms"}'
# Output: {"job_id": "550e8400-e29b-41d4-a716-446655440000"}
# Check status
curl http://${API_URL}:8000/status/550e8400-e29b-41d4-a716-446655440000

If you need to customize the services:
# Build LLM service
cd services/llm-service
docker build -t your-dockerhub-username/llm-service:latest .
docker push your-dockerhub-username/llm-service:latest
# Build API server
cd ../api-server
docker build -t your-dockerhub-username/api-server:latest .
docker push your-dockerhub-username/api-server:latest
# Update deployment files with your image names
# Then apply the deployments

Llama-2 requires a HuggingFace access token:
- Request access at HuggingFace Llama-2
- Get your token from HuggingFace Settings
- Create Kubernetes secret:
```
kubectl create secret generic huggingface-secret \
  --from-literal=token=hf_your_token_here
```

- Uncomment the `HF_TOKEN` section in `services/llm-service/deployment/deployment-llm-service.yaml`
Use the provided script for automated deployment:
```
./scripts/deploy_gke.sh
```

```python
import requests
import time

API_URL = "http://your-api-server-ip:8000"

def submit_request(prompt: str) -> str:
    """Submit inference request and return job ID."""
    response = requests.post(
        f"{API_URL}/chat",
        json={"text": prompt},
    )
    return response.json()["job_id"]

def get_result(job_id: str, timeout: int = 120) -> str:
    """Poll for result until completed or timeout."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        response = requests.get(f"{API_URL}/status/{job_id}")
        data = response.json()
        if data["status"] == "completed":
            return data["result"]
        print("Processing...", end="\r")
        time.sleep(2)
    raise TimeoutError("Request timed out")

# Usage
job_id = submit_request("Write a haiku about clouds")
print(f"Job ID: {job_id}")
result = get_result(job_id)
print(f"Result: {result}")
```

Use the provided example client:
# Basic usage
python example_client.py --prompt "Explain machine learning"
# Custom endpoint
python example_client.py \
--url http://your-ip:8000 \
--prompt "Write a story about AI" \
--max-wait 300
# Submit without waiting
python example_client.py --prompt "Hello" --no-wait

# Submit request
curl -X POST http://api-url:8000/chat \
-H "Content-Type: application/json" \
-d '{"text": "What is the meaning of life?"}'
# Check status
curl http://api-url:8000/status/{job_id}
# View API documentation
curl http://api-url:8000/docs

Edit `services/llm-service/config/config.yaml`:
```yaml
model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  tensor_parallel_size: 1
  max_model_len: 2048
  gpu_memory_utilization: 0.9

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 512
```

Configure via deployment YAML or environment:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `meta-llama/Llama-2-7b-chat-hf` | HuggingFace model ID |
| `TENSOR_PARALLEL_SIZE` | `1` | Number of GPUs for model parallelism |
| `MAX_MODEL_LEN` | `2048` | Maximum sequence length |
| `GPU_MEMORY_UTILIZATION` | `0.9` | GPU memory usage (0.0-1.0) |
| `TEMPERATURE` | `0.7` | Sampling temperature |
| `TOP_P` | `0.9` | Nucleus sampling parameter |
| `MAX_TOKENS` | `512` | Maximum tokens to generate |
The service supports any vLLM-compatible model:
- `meta-llama/Llama-2-7b-chat-hf` (default)
- `meta-llama/Llama-2-13b-chat-hf`
- `mistralai/Mistral-7B-Instruct-v0.1`
- `tiiuae/falcon-7b-instruct`
- `HuggingFaceH4/zephyr-7b-beta`

To change models, update `MODEL_NAME` in the deployment YAML.
cd services/tests
pip install -r requirements.txt
# Run all tests
pytest test_db.py -v
# Run specific test categories
pytest test_db.py::TestDatabase -v # Database tests
pytest test_db.py::TestRabbitMQ -v # Message queue tests
pytest test_db.py::TestAPIServer -v # API endpoint tests
pytest test_db.py::TestEndToEnd -v # Integration tests
# Run with coverage
pytest test_db.py --cov --cov-report=html

- Database Tests: Connection, CRUD operations, error handling
- RabbitMQ Tests: Queue operations, message durability
- API Tests: Endpoint validation, status checking
- Integration Tests: End-to-end workflow
- Performance Tests: Concurrent request handling
See services/tests/README.md for detailed testing documentation.
# API Server
kubectl logs -f deployment/api-server
# LLM Service (inference worker)
kubectl logs -f deployment/inference
# Database
kubectl logs -f deployment/db
# RabbitMQ
kubectl logs -f deployment/rabbitmq

# Port forward to access management interface
kubectl port-forward svc/rabbitmq-service 15672:15672
# Open browser to http://localhost:15672
# Default credentials: guest / guest

- Inference Latency: Time from queue pickup to result storage
- Queue Depth: Number of pending tasks in RabbitMQ
- GPU Utilization: Monitor with `nvidia-smi` in LLM pods
- Error Rate: Failed inference attempts
- Throughput: Requests processed per second
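Queue depth can be read from RabbitMQ's management API (the `/api/queues` endpoint on the port-forwarded management port 15672). The queue name `inference` below is a placeholder for whatever name your deployment uses:

```python
import json
import urllib.request

def queue_depth(payload: bytes, queue_name: str) -> int:
    """Extract the pending-message count for one queue from a
    GET /api/queues response body (RabbitMQ management API)."""
    for q in json.loads(payload):
        if q["name"] == queue_name:
            return q["messages"]
    raise KeyError(queue_name)

# Live usage (with `kubectl port-forward` running, credentials guest:guest):
#   req = urllib.request.Request(
#       "http://localhost:15672/api/queues",
#       headers={"Authorization": "Basic Z3Vlc3Q6Z3Vlc3Q="},  # base64("guest:guest")
#   )
#   depth = queue_depth(urllib.request.urlopen(req).read(), "inference")
# Offline demo with a canned response:
sample = json.dumps([{"name": "inference", "messages": 7}]).encode()
print(queue_depth(sample, "inference"))  # 7
```

A depth that grows steadily under load is the signal to add inference workers (see Scaling below).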
Scale inference workers to handle more load:
# Scale to 3 workers
kubectl scale deployment inference --replicas=3
# Scale API servers
kubectl scale deployment api-server --replicas=5
# Verify
kubectl get deployments

| Workers | Concurrent Users | GPU Type | Expected Latency |
|---|---|---|---|
| 1 | 10-20 | T4 | 2-5s |
| 3 | 30-60 | T4 | 2-5s |
| 5 | 50-100 | T4 | 2-5s |
| 1 | 40-80 | A100 | 1-2s |
Per Worker (Llama-2-7b-chat-hf):
- GPU: 1x NVIDIA T4 (16GB VRAM) or better
- RAM: 16GB
- CPU: 4 cores
- Storage: 20GB (for model cache)
Larger Models:
- Llama-2-13b: 1x A100 (40GB) or 2x T4 with tensor parallelism
- Llama-2-70b: 4x A100 (40GB) or 8x A100 (80GB)
Problem: Model takes too long to load or fails
Solutions:
- First run downloads ~14GB model (2-5 minutes)
- Use persistent volumes for model cache
- Check HuggingFace token for gated models
- Verify GPU availability:
kubectl exec -it <pod> -- nvidia-smi
Problem: OOM errors in LLM service
Solutions:
```yaml
# Reduce GPU memory utilization
- name: GPU_MEMORY_UTILIZATION
  value: "0.85"  # Try 0.8, 0.7, etc.

# Reduce max sequence length
- name: MAX_MODEL_LEN
  value: "1024"  # Down from 2048

# Use a smaller model
- name: MODEL_NAME
  value: "meta-llama/Llama-2-7b-chat-hf"  # Instead of 13b
```

Problem: Services can't communicate
Solutions:
# Check all pods are running
kubectl get pods
# Verify services
kubectl get svc
# Test connectivity from pod
kubectl exec -it <api-pod> -- ping rabbitmq-service
kubectl exec -it <api-pod> -- ping db-service
# Check logs for errors
kubectl logs <pod-name>

Problem: Inference takes too long
Solutions:
- Check GPU utilization: Should be >80%
- Increase `GPU_MEMORY_UTILIZATION` for more parallel requests
- Use a larger GPU (A100 vs. T4)
- Enable tensor parallelism for multi-GPU
- Check queue depth - may need more workers
| Error | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | GPU VRAM exhausted | Reduce batch size or model size |
| `Connection refused` | Service not ready | Wait for pods to be Running |
| `Model not found` | Invalid model name | Check HuggingFace model ID |
| `Token required` | Gated model | Add HuggingFace token |
```
llm-inference-service/
├── services/
│   ├── api-server/              # FastAPI gateway
│   │   ├── src/main.py          # API implementation
│   │   ├── config/config.yaml   # Configuration
│   │   ├── deployment/          # K8s manifests
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   │
│   ├── llm-service/             # vLLM inference worker
│   │   ├── src/main.py          # Inference implementation
│   │   ├── config/config.yaml   # Model configuration
│   │   ├── deployment/          # K8s manifests
│   │   ├── Dockerfile           # CUDA-enabled
│   │   ├── requirements.txt
│   │   └── README.md
│   │
│   ├── db-service/              # PostgreSQL
│   │   ├── deployment-db.yaml
│   │   └── service-db.yaml
│   │
│   ├── pub-sub-service/         # RabbitMQ
│   │   └── deployment/
│   │       ├── deployment-pub-sub.yaml
│   │       └── service-pub-sub.yaml
│   │
│   └── tests/                   # Test suite
│       ├── test_db.py           # All tests
│       ├── requirements.txt
│       └── README.md
│
├── scripts/
│   ├── deploy_gke.sh            # Deployment automation
│   └── push_images.sh           # Image build/push
│
├── example_client.py            # Example Python client
├── IMPLEMENTATION_NOTES.md      # Technical details
├── IMPLEMENTATION_SUMMARY.md    # Implementation overview
├── README.md                    # This file
└── LICENSE
```
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Clone your fork
git clone https://github.com/yourusername/llm-inference-service.git
# Install dependencies
pip install -r services/llm-service/requirements.txt
pip install -r services/tests/requirements.txt
# Run tests
pytest services/tests/test_db.py -v
# Make changes and test

This project is licensed under the MIT License - see the LICENSE file for details.
- vLLM - High-performance LLM inference engine
- Meta AI - Llama-2 language model
- HuggingFace - Model hosting and transformers library
- FastAPI - Modern Python web framework
- RabbitMQ - Reliable message broker
For issues and questions:
- Issues: GitHub Issues
- Documentation: See `IMPLEMENTATION_NOTES.md` for technical details
- Tests: See `services/tests/README.md` for the testing guide
- ✅ vLLM integration
- ✅ Llama-2-7b-chat-hf support
- ✅ Kubernetes deployment
- ✅ Comprehensive testing
- ✅ Full documentation
- Streaming responses (Server-Sent Events)
- Authentication and rate limiting
- Prometheus metrics export
- Grafana dashboards
- Auto-scaling based on queue depth
- Multi-model support
- LoRA adapter deployment
- Cost tracking per request
- Admin dashboard
- Multi-modal support (vision, audio)
Built with ❤️ for scalable AI inference