onlinerj/LLM-Inference-Service
πŸš€ Distributed LLM Inference Service

A production-ready, scalable inference service for Large Language Models (LLMs) built on Kubernetes. This service leverages vLLM for high-performance inference and implements an asynchronous, queue-based architecture for handling concurrent requests efficiently.





🎯 Overview

This project provides a cloud-native LLM inference service designed for production workloads. It handles the complexities of serving large language models at scale, including:

  • High-Performance Inference: Powered by vLLM, with up to 24x higher throughput than standard Hugging Face Transformers serving
  • Asynchronous Processing: Non-blocking request handling via RabbitMQ
  • Horizontal Scalability: Add workers to handle increased load
  • GPU Optimization: Efficient GPU memory management with PagedAttention
  • Production Ready: Comprehensive error handling, logging, and testing

Default Model: Llama-2-7b-chat-hf (easily configurable to other models)


πŸ—οΈ Architecture

The system consists of four microservices deployed on Kubernetes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client    β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ HTTP
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   API Server    │─────►│   RabbitMQ   │─────►│ LLM Service β”‚
β”‚   (FastAPI)     β”‚      β”‚(Queue/Broker)β”‚      β”‚   (vLLM)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
          β”‚                                             β”‚
          β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
          └─────────────►│  PostgreSQL  β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚  (Database)  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Responsibilities

Component        Technology    Purpose
API Server       FastAPI       Receives user requests, returns job IDs, queries status
Message Queue    RabbitMQ      Distributes inference tasks to workers asynchronously
LLM Service      vLLM + CUDA   Processes inference requests using GPU-accelerated models
Database         PostgreSQL    Stores job results and maintains request history

Request Flow

  1. Client β†’ POST request to API Server with prompt
  2. API Server β†’ Generates job ID, publishes task to RabbitMQ
  3. API Server β†’ Returns job ID to client immediately
  4. LLM Service β†’ Consumes task from queue, runs inference
  5. LLM Service β†’ Stores result in PostgreSQL
  6. Client β†’ Polls API Server with job ID to retrieve result
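The six steps above can be sketched end to end in a few lines of Python. This is a toy, single-process model of the flow: an in-process queue.Queue stands in for RabbitMQ, a dict stands in for PostgreSQL, and a placeholder function stands in for the vLLM generate call. The function names (submit, worker, get_status) are illustrative, not the service's actual API:

```python
import json
import queue
import uuid

task_queue = queue.Queue()   # stand-in for RabbitMQ
results = {}                 # stand-in for PostgreSQL

def submit(prompt: str) -> str:
    """API server: generate a job ID, enqueue the task, return immediately."""
    job_id = str(uuid.uuid4())
    task_queue.put(json.dumps({"job_id": job_id, "text": prompt}))
    results[job_id] = {"status": "pending", "result": None}
    return job_id

def run_inference(text: str) -> str:
    """Placeholder for the GPU-backed vLLM call."""
    return f"echo: {text}"

def worker():
    """LLM service: consume one task, run inference, store the result."""
    task = json.loads(task_queue.get())
    results[task["job_id"]] = {"status": "completed",
                               "result": run_inference(task["text"])}

def get_status(job_id: str) -> dict:
    """API server: the status endpoint the client polls."""
    return results[job_id]

job = submit("hello")
worker()
print(get_status(job)["status"])  # β†’ completed
```

The key property this models is that submit returns before inference runs, so the API server never blocks on the GPU; the real system gets the same decoupling from RabbitMQ's durable queues.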

✨ Features

Performance

  • ⚑ Up to 24x Faster: vLLM with PagedAttention vs. standard transformers serving
  • πŸ”₯ GPU Accelerated: NVIDIA CUDA 11.8 support
  • πŸ“Š Batching: Continuous batching for optimal throughput
  • πŸ’Ύ Memory Efficient: up to ~2x reduction in GPU memory usage

Reliability

  • πŸ”„ Async Processing: Non-blocking request handling
  • ♻️ Auto-Retry: Automatic reconnection and retry logic
  • πŸ›‘οΈ Error Handling: Comprehensive exception handling
  • βœ… Message Durability: No data loss on worker failure

Operations

  • πŸ“ˆ Horizontally Scalable: Add workers as needed
  • πŸŽ›οΈ Configurable: Environment variables and YAML config
  • πŸ“ Well Tested: Comprehensive test suite included
  • πŸ“š Documented: Complete setup and usage documentation

πŸ“¦ Prerequisites

Before deployment, ensure you have:

  • Google Cloud Platform (GCP) account with billing enabled
  • gcloud CLI installed and authenticated
  • kubectl installed and configured
  • Docker installed for building images
  • GPU Quota on GCP (at least 1x NVIDIA T4 or better)

Local Development Prerequisites

For local testing and development:

  • Python 3.9+
  • CUDA 11.8+ (for GPU inference)
  • PostgreSQL
  • RabbitMQ
  • 16GB+ RAM
  • NVIDIA GPU with 16GB+ VRAM

πŸš€ Quick Start

1. Clone the Repository

git clone https://github.com/yourusername/llm-inference-service.git
cd llm-inference-service

2. Set Up GKE Cluster

# Create GKE cluster with GPU nodes
gcloud container clusters create llm-cluster \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 3

# Authenticate
gcloud container clusters get-credentials llm-cluster \
  --region us-central1 \
  --project your-project-id

3. Install GPU Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

4. Deploy Services

# Deploy database
kubectl apply -f services/db-service/deployment-db.yaml
kubectl apply -f services/db-service/service-db.yaml

# Deploy RabbitMQ
kubectl apply -f services/pub-sub-service/deployment/deployment-pub-sub.yaml
kubectl apply -f services/pub-sub-service/deployment/service-pub-sub.yaml

# Wait for services to be ready
kubectl wait --for=condition=ready pod -l app=db --timeout=300s
kubectl wait --for=condition=ready pod -l app=rabbitmq --timeout=300s

# Deploy LLM service
kubectl apply -f services/llm-service/deployment/deployment-llm-service.yaml
kubectl apply -f services/llm-service/deployment/service-llm-service.yaml

# Deploy API server
kubectl apply -f services/api-server/deployment/deployment-api-server.yaml
kubectl apply -f services/api-server/deployment/service-api-server.yaml
kubectl apply -f services/api-server/deployment/ingress-api-server.yaml

5. Verify Deployment

# Check all pods are running
kubectl get pods

# Expected output:
# NAME                        READY   STATUS    RESTARTS   AGE
# api-server-xxx              1/1     Running   0          2m
# db-xxx                      1/1     Running   0          5m
# inference-xxx               1/1     Running   0          3m
# rabbitmq-xxx                1/1     Running   0          5m

# Get API server external IP
kubectl get svc api-server

6. Test the Service

# Submit a request
export API_URL=$(kubectl get svc api-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://${API_URL}:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "Explain quantum computing in simple terms"}'

# Output: {"job_id": "550e8400-e29b-41d4-a716-446655440000"}

# Check status
curl http://${API_URL}:8000/status/550e8400-e29b-41d4-a716-446655440000

πŸ”§ Deployment

Building Custom Images

If you need to customize the services:

# Build LLM service
cd services/llm-service
docker build -t your-dockerhub-username/llm-service:latest .
docker push your-dockerhub-username/llm-service:latest

# Build API server
cd ../api-server
docker build -t your-dockerhub-username/api-server:latest .
docker push your-dockerhub-username/api-server:latest

# Update deployment files with your image names
# Then apply the deployments

Using Gated Models (Llama-2)

Llama-2 requires a HuggingFace access token:

  1. Request access on the Llama-2 model page on HuggingFace
  2. Generate an access token in your HuggingFace account settings
  3. Create Kubernetes secret:
kubectl create secret generic huggingface-secret \
  --from-literal=token=hf_your_token_here
  4. Uncomment the HF_TOKEN section in services/llm-service/deployment/deployment-llm-service.yaml
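For reference, wiring the secret into the worker's environment typically uses the standard Kubernetes secretKeyRef pattern shown below. The exact field layout in deployment-llm-service.yaml may differ; treat this as an illustrative sketch, not the file's literal contents:

```yaml
# In the LLM service container spec:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: huggingface-secret   # the secret created in step 3
        key: token
```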

Deployment Script

Use the provided script for automated deployment:

./scripts/deploy_gke.sh

πŸ’» Usage

Python Client Example

import requests
import time

API_URL = "http://your-api-server-ip:8000"

def submit_request(prompt: str) -> str:
    """Submit inference request and return job ID."""
    response = requests.post(
        f"{API_URL}/chat",
        json={"text": prompt}
    )
    return response.json()["job_id"]

def get_result(job_id: str, timeout: int = 120) -> str:
    """Poll for result until completed or timeout."""
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        response = requests.get(f"{API_URL}/status/{job_id}")
        data = response.json()
        
        if data["status"] == "completed":
            return data["result"]
        
        print("Processing...", end="\r")
        time.sleep(2)
    
    raise TimeoutError("Request timed out")

# Usage
job_id = submit_request("Write a haiku about clouds")
print(f"Job ID: {job_id}")

result = get_result(job_id)
print(f"Result: {result}")

Command Line Interface

Use the provided example client:

# Basic usage
python example_client.py --prompt "Explain machine learning"

# Custom endpoint
python example_client.py \
  --url http://your-ip:8000 \
  --prompt "Write a story about AI" \
  --max-wait 300

# Submit without waiting
python example_client.py --prompt "Hello" --no-wait

cURL Examples

# Submit request
curl -X POST http://api-url:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the meaning of life?"}'

# Check status
curl http://api-url:8000/status/{job_id}

# View API documentation
curl http://api-url:8000/docs

βš™οΈ Configuration

Model Configuration

Edit services/llm-service/config/config.yaml:

model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  tensor_parallel_size: 1
  max_model_len: 2048
  gpu_memory_utilization: 0.9

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 512

Environment Variables

Configure via deployment YAML or environment:

Variable                 Default                         Description
MODEL_NAME               meta-llama/Llama-2-7b-chat-hf   HuggingFace model ID
TENSOR_PARALLEL_SIZE     1                               Number of GPUs for model parallelism
MAX_MODEL_LEN            2048                            Maximum sequence length
GPU_MEMORY_UTILIZATION   0.9                             GPU memory usage (0.0-1.0)
TEMPERATURE              0.7                             Sampling temperature
TOP_P                    0.9                             Nucleus sampling parameter
MAX_TOKENS               512                             Maximum tokens to generate
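Since every setting has a default that an environment variable can override, the worker's startup code presumably resolves them with logic like the following. The helper name and the precedence order (env var wins over default) are assumptions for illustration, not the service's actual implementation:

```python
import os

# Defaults mirroring config.yaml; environment variables take precedence.
DEFAULTS = {
    "MODEL_NAME": "meta-llama/Llama-2-7b-chat-hf",
    "TENSOR_PARALLEL_SIZE": 1,
    "MAX_MODEL_LEN": 2048,
    "GPU_MEMORY_UTILIZATION": 0.9,
    "TEMPERATURE": 0.7,
    "TOP_P": 0.9,
    "MAX_TOKENS": 512,
}

def setting(name: str):
    """Return the env-var override if set, cast to the default's type."""
    default = DEFAULTS[name]
    raw = os.environ.get(name)
    if raw is None:
        return default
    return type(default)(raw)

os.environ["MAX_MODEL_LEN"] = "1024"
print(setting("MAX_MODEL_LEN"))   # β†’ 1024 (int, overridden)
print(setting("TEMPERATURE"))     # β†’ 0.7 (default)
```

Casting through the default's type keeps overrides consistent: "1024" from a deployment YAML becomes an int, "0.85" a float.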

Supported Models

The service supports any vLLM-compatible model:

  • meta-llama/Llama-2-7b-chat-hf βœ… (default)
  • meta-llama/Llama-2-13b-chat-hf
  • mistralai/Mistral-7B-Instruct-v0.1
  • tiiuae/falcon-7b-instruct
  • HuggingFaceH4/zephyr-7b-beta

To change models, update MODEL_NAME in the deployment YAML.


πŸ§ͺ Testing

Run Test Suite

cd services/tests
pip install -r requirements.txt

# Run all tests
pytest test_db.py -v

# Run specific test categories
pytest test_db.py::TestDatabase -v      # Database tests
pytest test_db.py::TestRabbitMQ -v      # Message queue tests
pytest test_db.py::TestAPIServer -v     # API endpoint tests
pytest test_db.py::TestEndToEnd -v      # Integration tests

# Run with coverage
pytest test_db.py --cov --cov-report=html

Test Categories

  • Database Tests: Connection, CRUD operations, error handling
  • RabbitMQ Tests: Queue operations, message durability
  • API Tests: Endpoint validation, status checking
  • Integration Tests: End-to-end workflow
  • Performance Tests: Concurrent request handling

See services/tests/README.md for detailed testing documentation.


πŸ“Š Monitoring

View Logs

# API Server
kubectl logs -f deployment/api-server

# LLM Service (inference worker)
kubectl logs -f deployment/inference

# Database
kubectl logs -f deployment/db

# RabbitMQ
kubectl logs -f deployment/rabbitmq

RabbitMQ Management UI

# Port forward to access management interface
kubectl port-forward svc/rabbitmq-service 15672:15672

# Open browser to http://localhost:15672
# Default credentials: guest / guest

Key Metrics to Monitor

  • Inference Latency: Time from queue pickup to result storage
  • Queue Depth: Number of pending tasks in RabbitMQ
  • GPU Utilization: Monitor with nvidia-smi in LLM pods
  • Error Rate: Failed inference attempts
  • Throughput: Requests processed per second

πŸ“ˆ Scaling

Horizontal Scaling

Scale inference workers to handle more load:

# Scale to 3 workers
kubectl scale deployment inference --replicas=3

# Scale API servers
kubectl scale deployment api-server --replicas=5

# Verify
kubectl get deployments

Performance Guidelines

Workers   Concurrent Users   GPU Type   Expected Latency
1         10-20              T4         2-5s
3         30-60              T4         2-5s
5         50-100             T4         2-5s
1         40-80              A100       1-2s

Resource Requirements

Per Worker (Llama-2-7b-chat-hf):

  • GPU: 1x NVIDIA T4 (16GB VRAM) or better
  • RAM: 16GB
  • CPU: 4 cores
  • Storage: 20GB (for model cache)

Larger Models:

  • Llama-2-13b: 1x A100 (40GB) or 2x T4 with tensor parallelism
  • Llama-2-70b: 4x A100 (40GB) or 8x A100 (80GB)
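These sizings follow from a back-of-the-envelope rule: fp16 weights take about 2 bytes per parameter, and vLLM needs headroom on top of that for the KV cache (which is what gpu_memory_utilization apportions). A rough estimator, assuming fp16 weights and ignoring activation and cache overhead:

```python
def fp16_weight_gib(params_billion: float) -> float:
    """Approximate GiB needed just for fp16 model weights (2 bytes/param)."""
    return params_billion * 1e9 * 2 / 2**30

for name, size_b in [("Llama-2-7b", 7), ("Llama-2-13b", 13), ("Llama-2-70b", 70)]:
    print(f"{name}: ~{fp16_weight_gib(size_b):.0f} GiB of weights")
# Llama-2-7b:  ~13 GiB -> fits a 16GB T4, with little KV-cache headroom
# Llama-2-13b: ~24 GiB -> needs an A100-40GB, or 2x T4 via tensor parallelism
# Llama-2-70b: ~130 GiB -> needs multiple A100s
```

The tight fit of 7b on a T4 is also why lowering GPU_MEMORY_UTILIZATION or MAX_MODEL_LEN (see Troubleshooting) is the first remedy for OOM errors.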

πŸ” Troubleshooting

Model Loading Issues

Problem: Model takes too long to load or fails

Solutions:

  • Expect the first run to download the ~14GB model (typically 2-5 minutes)
  • Use persistent volumes for model cache
  • Check HuggingFace token for gated models
  • Verify GPU availability: kubectl exec -it <pod> -- nvidia-smi

Out of Memory Errors

Problem: OOM errors in LLM service

Solutions:

# Reduce GPU memory utilization
- name: GPU_MEMORY_UTILIZATION
  value: "0.85"  # Try 0.8, 0.7, etc.

# Reduce max sequence length
- name: MAX_MODEL_LEN
  value: "1024"  # Down from 2048

# Use smaller model
- name: MODEL_NAME
  value: "meta-llama/Llama-2-7b-chat-hf"  # Instead of 13b

Connection Timeouts

Problem: Services can't communicate

Solutions:

# Check all pods are running
kubectl get pods

# Verify services
kubectl get svc

# Test connectivity from pod
kubectl exec -it <api-pod> -- ping rabbitmq-service
kubectl exec -it <api-pod> -- ping db-service

# Check logs for errors
kubectl logs <pod-name>

Slow Inference

Problem: Inference takes too long

Solutions:

  • Check GPU utilization: Should be >80%
  • Increase GPU_MEMORY_UTILIZATION for more parallel requests
  • Use larger GPU (A100 vs T4)
  • Enable tensor parallelism for multi-GPU
  • Check queue depth - may need more workers

Common Error Messages

Error                Cause                Solution
CUDA out of memory   GPU VRAM exhausted   Reduce batch size or model size
Connection refused   Service not ready    Wait for pods to be Running
Model not found      Invalid model name   Check HuggingFace model ID
Token required       Gated model          Add HuggingFace token

πŸ“ Project Structure

llm-inference-service/
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ api-server/               # FastAPI gateway
β”‚   β”‚   β”œβ”€β”€ src/main.py          # API implementation
β”‚   β”‚   β”œβ”€β”€ config/config.yaml   # Configuration
β”‚   β”‚   β”œβ”€β”€ deployment/          # K8s manifests
β”‚   β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”‚   └── requirements.txt
β”‚   β”‚
β”‚   β”œβ”€β”€ llm-service/             # vLLM inference worker
β”‚   β”‚   β”œβ”€β”€ src/main.py          # Inference implementation
β”‚   β”‚   β”œβ”€β”€ config/config.yaml   # Model configuration
β”‚   β”‚   β”œβ”€β”€ deployment/          # K8s manifests
β”‚   β”‚   β”œβ”€β”€ Dockerfile           # CUDA-enabled
β”‚   β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”‚   └── README.md
β”‚   β”‚
β”‚   β”œβ”€β”€ db-service/              # PostgreSQL
β”‚   β”‚   β”œβ”€β”€ deployment-db.yaml
β”‚   β”‚   └── service-db.yaml
β”‚   β”‚
β”‚   β”œβ”€β”€ pub-sub-service/         # RabbitMQ
β”‚   β”‚   └── deployment/
β”‚   β”‚       β”œβ”€β”€ deployment-pub-sub.yaml
β”‚   β”‚       └── service-pub-sub.yaml
β”‚   β”‚
β”‚   └── tests/                   # Test suite
β”‚       β”œβ”€β”€ test_db.py           # All tests
β”‚       β”œβ”€β”€ requirements.txt
β”‚       └── README.md
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ deploy_gke.sh           # Deployment automation
β”‚   └── push_images.sh          # Image build/push
β”‚
β”œβ”€β”€ example_client.py            # Example Python client
β”œβ”€β”€ IMPLEMENTATION_NOTES.md      # Technical details
β”œβ”€β”€ IMPLEMENTATION_SUMMARY.md    # Implementation overview
β”œβ”€β”€ README.md                    # This file
└── LICENSE

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone your fork
git clone https://github.com/yourusername/llm-inference-service.git

# Install dependencies
pip install -r services/llm-service/requirements.txt
pip install -r services/tests/requirements.txt

# Run tests
pytest services/tests/test_db.py -v

# Make changes and test

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • vLLM - High-performance LLM inference engine
  • Meta AI - Llama-2 language model
  • HuggingFace - Model hosting and transformers library
  • FastAPI - Modern Python web framework
  • RabbitMQ - Reliable message broker

πŸ“ž Support

For issues and questions:

  • Issues: GitHub Issues
  • Documentation: See IMPLEMENTATION_NOTES.md for technical details
  • Tests: See services/tests/README.md for testing guide

🎯 Roadmap

Current Version (v1.0)

  • βœ… vLLM integration
  • βœ… Llama-2-7b-chat-hf support
  • βœ… Kubernetes deployment
  • βœ… Comprehensive testing
  • βœ… Full documentation

Planned Features (v1.1)

  • Streaming responses (Server-Sent Events)
  • Authentication and rate limiting
  • Prometheus metrics export
  • Grafana dashboards
  • Auto-scaling based on queue depth

Future Enhancements (v2.0)

  • Multi-model support
  • LoRA adapter deployment
  • Cost tracking per request
  • Admin dashboard
  • Multi-modal support (vision, audio)

Built with ❀️ for scalable AI inference
