InferenceForge

Production-grade LLM inference platform on Kubernetes. OpenAI-compatible API, request queuing, Prometheus metrics, and a Helm chart that switches between CPU (free, local) and GPU (AWS EKS) by swapping one values file.


What it does

Most "deploy an LLM" tutorials stop at ollama run or a single Docker container. InferenceForge builds the production layer on top: the queue, the rate limiter, the metrics, the Helm chart, and the GPU node configuration.

                         ┌──────────────────────────────┐
Client (OpenAI SDK)  ──► │     Gateway (FastAPI)         │
                         │                              │
                         │  ┌─────────────────────┐    │
                         │  │  Rate limiter        │    │
                         │  │  60 req/min per IP   │    │
                         │  └──────────┬──────────┘    │
                         │             │                │
                         │  ┌──────────▼──────────┐    │
                         │  │  Request queue       │    │
                         │  │  asyncio.Semaphore   │    │
                         │  └──────────┬──────────┘    │
                         │             │                │
                         │  ┌──────────▼──────────┐    │
                         │  │  Prometheus metrics  │    │
                         │  │  /metrics endpoint   │    │
                         │  └─────────────────────┘    │
                         └──────────────┬───────────────┘
                                        │
                         ┌──────────────▼───────────────┐
                         │      Model Backend            │
                         │                              │
                         │  Local:  Ollama (CPU)        │
                         │  AWS:    vLLM (NVIDIA T4)    │
                         └──────────────────────────────┘
                                        │
                         ┌──────────────▼───────────────┐
                         │   Prometheus → Grafana        │
                         │   12 panels: latency,         │
                         │   throughput, queue depth,    │
                         │   GPU utilisation             │
                         └──────────────────────────────┘
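
The request queue in the diagram is labelled asyncio.Semaphore. A minimal sketch of that pattern follows, assuming the gateway caps concurrent backend calls and tracks how many requests are waiting. The class name RequestQueue, its attributes, and the concurrency limit of 2 are hypothetical, not taken from gateway/queue.py.

```python
import asyncio


class RequestQueue:
    """Bounded-concurrency queue: at most `max_concurrent` handlers run at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self.waiting = 0  # requests blocked on the semaphore (queue depth)
        self.active = 0   # requests currently holding a slot

    async def run(self, handler):
        """Wait for a free slot, then await `handler()` and return its result."""
        self.waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self.waiting -= 1  # also runs if the wait is cancelled
        self.active += 1
        try:
            return await handler()
        finally:
            self.active -= 1
            self._slots.release()


async def demo() -> tuple[int, int]:
    queue = RequestQueue(max_concurrent=2)  # hypothetical limit
    peak_active = 0
    peak_waiting = 0

    async def fake_inference() -> str:
        nonlocal peak_active, peak_waiting
        peak_active = max(peak_active, queue.active)
        peak_waiting = max(peak_waiting, queue.waiting)
        await asyncio.sleep(0.01)  # stand-in for a backend call
        return "ok"

    # Five concurrent requests compete for two slots.
    await asyncio.gather(*(queue.run(fake_inference) for _ in range(5)))
    return peak_active, peak_waiting


peak_active, peak_waiting = asyncio.run(demo())
```

In a design like this, the waiting and active counters are the natural sources for gauges such as inferenceforge_queue_depth and inferenceforge_active_requests.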

One-flag switch: local → GPU

The entire difference between running on a MacBook and running on a T4 GPU is which values file you pass to helm install:

# Local — CPU, free, runs in Docker
helm install inferenceforge ./chart -f chart/values-local.yaml

# AWS EKS — NVIDIA T4 spot (~$0.15/hr)
helm install inferenceforge ./chart -f chart/values-gpu.yaml

| Setting | Local | GPU (EKS) |
|---|---|---|
| model.backend | ollama | vllm |
| model.name | gemma3:1b | mistralai/Mistral-7B |
| gpu.enabled | false | true |
| gateway.replicas | 1 | 2 |
| autoscaling.enabled | false | true |

Same gateway image. Same manifests. Same Prometheus config.
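
Helm layers a values file passed with -f over the chart's values.yaml: nested maps are merged key by key, and scalars in the later file win. The sketch below imitates that merge in Python to show how a GPU profile might override chart defaults. The defaults dictionary and the gateway.port key are assumptions for illustration; only the five settings listed above come from this README.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested mappings are merged key by key; any other value (scalar or list)
    in `override` replaces the one in `base`.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults (stand-in for chart/values.yaml).
defaults = {
    "model": {"backend": "ollama", "name": "gemma3:1b"},
    "gpu": {"enabled": False},
    "gateway": {"replicas": 1, "port": 8082},  # "port" is an assumed extra key
    "autoscaling": {"enabled": False},
}

# Overrides matching the GPU (EKS) settings listed above.
gpu_profile = {
    "model": {"backend": "vllm", "name": "mistralai/Mistral-7B"},
    "gpu": {"enabled": True},
    "gateway": {"replicas": 2},
    "autoscaling": {"enabled": True},
}

effective = merge_values(defaults, gpu_profile)
```

Keys the profile does not mention (here the assumed gateway.port) keep their default, which is why a profile file only needs to list what differs.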


Quick start (local, no Kubernetes)

Prerequisites: Python 3.11+, Ollama

git clone https://github.com/lmenta/inferenceforge
cd inferenceforge

# 1. Start Ollama and pull a model
ollama serve &
ollama pull gemma3:1b

# 2. Install gateway dependencies
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn httpx pydantic pydantic-settings prometheus-client

# 3. Start the gateway
PYTHONPATH=. MODEL_BACKEND_URL=http://localhost:11434 MODEL_NAME=gemma3:1b \
  uvicorn gateway.main:app --port 8082

# 4. Send a request
curl -s -X POST http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is Kubernetes?"}]}' | python3 -m json.tool

# 5. Check metrics
curl -s http://localhost:8082/metrics | grep inferenceforge_

Full Kubernetes deploy (local cluster with k3d)

Prerequisites: Docker, k3d, kubectl, helm, Ollama

make local-up

This single command:

  1. Creates a k3d cluster (lightweight K8s in Docker)
  2. Builds the gateway Docker image and imports it into the cluster
  3. Pulls the TinyLlama model via Ollama
  4. Deploys all Kubernetes manifests (gateway, model, HPA)
  5. Installs Prometheus + Grafana via Helm
  6. Opens Grafana at http://localhost:30300 (admin/admin)

Other Makefile targets:

make test        # health check + inference test
make metrics     # print current Prometheus metrics
make load-test   # run Locust load test (20 users, 60s)
make logs        # tail gateway logs
make local-down  # tear down the cluster

Deploy on AWS (GPU)

Prerequisites: AWS account, Terraform, kubectl, helm

# 1. Provision EKS cluster + GPU node group (g4dn.xlarge spot)
cd infra
terraform init
terraform apply

# 2. Configure kubectl
aws eks update-kubeconfig --name inferenceforge-dev --region eu-west-2

# 3. Install NVIDIA device plugin (registers nvidia.com/gpu as a K8s resource)
make nvidia-plugin

# 4. Deploy GPU stack
make eks-deploy

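Cost: g4dn.xlarge spot ≈ $0.15/hr (spot prices vary by region and over time). The GPU node group scales to 0 when idle, so you pay for GPU capacity only while requests are coming in. The EKS control plane is billed separately.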
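
As a rough worked example of what scale-to-zero saves, the arithmetic below compares an always-on GPU node with one that runs only during working hours. It uses the approximate $0.15/hr figure quoted above; the usage patterns are hypothetical, and the result covers GPU node time only (no control plane, storage, or data transfer).

```python
SPOT_PRICE_PER_HOUR = 0.15  # approximate g4dn.xlarge spot price quoted in this README


def gpu_cost(hours_per_day: float, days: int = 30, nodes: int = 1) -> float:
    """Estimated GPU node spend in USD for the given usage pattern."""
    return round(SPOT_PRICE_PER_HOUR * hours_per_day * days * nodes, 2)


# One node running around the clock for 30 days.
always_on = gpu_cost(hours_per_day=24)

# One node running 8 hours a day on 22 working days (hypothetical pattern).
office_hours = gpu_cost(hours_per_day=8, days=22)

print(f"always on:    ${always_on:.2f}")
print(f"office hours: ${office_hours:.2f}")
```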


API

The gateway exposes an OpenAI-compatible interface. Point any OpenAI client at it by overriding base_url:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gemma3:1b",
    messages=[{"role": "user", "content": "Explain LLM quantisation."}]
)
print(response.choices[0].message.content)

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Inference, streaming and non-streaming |
| GET | /health | Liveness probe |
| GET | /ready | Readiness probe (checks backend is reachable) |
| GET | /metrics | Prometheus metrics |
| GET | /queue/status | Real-time queue depth |
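
For the streaming mode of /v1/chat/completions, OpenAI-style responses arrive as server-sent events: one `data: {json}` line per chunk, ending with `data: [DONE]`. The sketch below shows how a client might reassemble the text, assuming the gateway follows that format exactly. The function name iter_stream_text and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style `data: ...` stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the OpenAI chunk shape (not captured from the gateway).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Kubernetes "}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"is an orchestrator."}}]}',
    "",
    "data: [DONE]",
]

answer = "".join(iter_stream_text(sample))
```

With the official openai package, passing stream=True to client.chat.completions.create does this parsing for you; the sketch only shows what travels over the wire.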

Kubernetes GPU scheduling

Three things must be in place for a pod to land on a GPU node:

# 1. Tolerate the GPU taint (the taint itself is what keeps CPU workloads off GPU nodes)
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

# 2. Pin to GPU node type
nodeSelector:
  node.kubernetes.io/instance-type: g4dn.xlarge

# 3. Request the GPU — K8s finds a free slot and exclusively assigns it
resources:
  limits:
    nvidia.com/gpu: "1"

The NVIDIA Device Plugin DaemonSet (deployed via make nvidia-plugin) runs on every GPU node and registers nvidia.com/gpu as a schedulable resource. Without it, Kubernetes doesn't know GPUs exist.
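
To make the three conditions concrete, here is a deliberately simplified model of the scheduler's filtering step. It is a sketch, not the real kube-scheduler logic: every taint is treated as blocking, resources are reduced to the GPU count, and the dictionaries mimic the manifest fields above rather than real API objects.

```python
def tolerates(pod: dict, taint: dict) -> bool:
    """True if any pod toleration matches the taint (simplified rules)."""
    for tol in pod.get("tolerations", []):
        key_ok = tol.get("key") in (None, taint["key"])
        effect_ok = tol.get("effect") in (None, taint["effect"])
        if tol.get("operator", "Equal") == "Exists":
            value_ok = True  # Exists ignores the taint value
        else:
            value_ok = tol.get("value") == taint.get("value")
        if key_ok and effect_ok and value_ok:
            return True
    return False


def fits(pod: dict, node: dict) -> bool:
    """Check the three conditions: taints tolerated, labels match, GPU available."""
    if not all(tolerates(pod, taint) for taint in node.get("taints", [])):
        return False
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in pod.get("nodeSelector", {}).items()):
        return False
    limits = pod.get("resources", {}).get("limits", {})
    wanted = int(limits.get("nvidia.com/gpu", 0))
    return wanted <= node.get("allocatable", {}).get("nvidia.com/gpu", 0)


gpu_node = {
    "labels": {"node.kubernetes.io/instance-type": "g4dn.xlarge"},
    "taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
    "allocatable": {"nvidia.com/gpu": 1},  # advertised by the device plugin
}
# Same node before the device plugin has registered the GPU resource.
node_without_plugin = {**gpu_node, "allocatable": {}}

model_pod = {
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ],
    "nodeSelector": {"node.kubernetes.io/instance-type": "g4dn.xlarge"},
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}
gateway_pod = {}  # no toleration, no GPU request

model_on_gpu_node = fits(model_pod, gpu_node)
model_without_plugin = fits(model_pod, node_without_plugin)
gateway_on_gpu_node = fits(gateway_pod, gpu_node)
```

The second case is the one the device plugin addresses: the pod tolerates the taint and matches the label, but the node advertises no nvidia.com/gpu, so the request cannot be satisfied.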


Autoscaling

The HPA scales the gateway, not the model backend (GPU cold start is too slow for reactive scaling).

Two metrics drive scaling; the HPA evaluates both and acts on whichever calls for more replicas:

  • CPU utilisation — standard K8s metric
  • inferenceforge_queue_depth — custom Prometheus metric via prometheus-adapter

Scaling on queue depth fires earlier and more predictably than CPU alone. A slow model fills the queue before it stresses the gateway's CPU.

The model backend uses Karpenter: when a GPU pod can't be scheduled (no node exists), Karpenter provisions one. Slower than HPA (~2-3 min), but GPU nodes are expensive — you only want them when actually needed.
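
The HPA's documented rule is desiredReplicas = ceil(currentReplicas × currentValue / targetValue), computed per metric, with the largest result winning and small deviations (10% by default) ignored. The sketch below applies that rule to the two gateway metrics. The target values (70% CPU, 5 queued requests per pod) and the replica bounds are hypothetical, not read from the chart.

```python
import math


def desired_replicas(current: int, value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Replica count one metric asks for, using the HPA scaling formula."""
    ratio = value / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no change
    return math.ceil(current * ratio)


def hpa_decision(current: int, metrics: dict, min_replicas: int,
                 max_replicas: int) -> int:
    """Take the largest per-metric proposal, clamped to the replica bounds."""
    proposals = [desired_replicas(current, value, target)
                 for value, target in metrics.values()]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Slow model: CPU is calm (40% vs 70% target) but the queue is backing up
# (12 waiting per pod vs a target of 5). All targets are assumed values.
under_load = hpa_decision(
    current=2,
    metrics={"cpu": (40, 70), "inferenceforge_queue_depth": (12, 5)},
    min_replicas=2,
    max_replicas=6,
)

# Quiet period: both metrics are below target, bounded by min_replicas.
quiet = hpa_decision(
    current=2,
    metrics={"cpu": (40, 70), "inferenceforge_queue_depth": (1, 5)},
    min_replicas=2,
    max_replicas=6,
)
```

In the first case CPU alone would leave the gateway at 2 replicas, while queue depth asks for 5, which illustrates why the queue metric reacts earlier for a slow backend.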


Prometheus metrics

| Metric | Type | Description |
|---|---|---|
| inferenceforge_requests_total | Counter | Request count by model and status |
| inferenceforge_request_duration_seconds | Histogram | Request latency; p50/p95/p99 are computed from the buckets |
| inferenceforge_queue_depth | Gauge | Waiting requests; also drives the HPA |
| inferenceforge_tokens_total | Counter | Tokens generated (throughput + cost estimation) |
| inferenceforge_active_requests | Gauge | Concurrent in-flight requests |
| inferenceforge_backend_errors_total | Counter | Backend failures by error type |

Latency buckets are non-uniform (0.1s → 60s) because LLM inference is slow and highly variable. For this workload, resolution in the multi-second range (for example a 5s bucket) is more informative than extra sub-second buckets.
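
To see how bucket boundaries shape the percentiles you can read back, the sketch below sorts sample latencies into cumulative buckets and estimates a quantile by linear interpolation inside the matching bucket, which is the approach Prometheus's histogram_quantile takes. The bucket bounds and sample latencies are assumptions; this README only states the 0.1s to 60s range.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in seconds (only the 0.1s-60s range is documented).
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]


def bucket_counts(latencies, bounds=BUCKETS):
    """Cumulative counts per upper bound, like Prometheus `le` buckets (+Inf last)."""
    counts = [0] * (len(bounds) + 1)
    for seconds in latencies:
        counts[bisect_left(bounds, seconds)] += 1  # first bound >= value
    cumulative, total = [], 0
    for count in counts:
        total += count
        cumulative.append(total)
    return cumulative


def estimate_quantile(q, cumulative, bounds=BUCKETS):
    """Estimate quantile `q` by linear interpolation within its bucket."""
    rank = q * cumulative[-1]
    for i, count in enumerate(cumulative):
        if count >= rank:
            if i == len(bounds):
                return bounds[-1]  # +Inf bucket: report the highest finite bound
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            if count == below:
                return lower  # empty bucket, nothing to interpolate
            return lower + (bounds[i] - lower) * (rank - below) / (count - below)
    return bounds[-1]


# Hypothetical request latencies in seconds.
latencies = [0.8, 1.2, 1.9, 2.2, 3.1, 3.4, 4.0, 4.8, 7.5, 22.0]
cumulative = bucket_counts(latencies)
p50 = estimate_quantile(0.50, cumulative)
p95 = estimate_quantile(0.95, cumulative)
```

The estimate can only be as fine as the bucket it lands in, so bounds should be densest where most requests fall; for these samples that is between 1s and 10s.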

Import k8s/monitoring/grafana-dashboard.json to get the full 12-panel dashboard.


Project structure

inferenceforge/
├── gateway/
│   ├── main.py          # FastAPI app — queue, rate limiter, OpenAI adapter
│   ├── queue.py         # asyncio.Semaphore-based request queue
│   ├── metrics.py       # Prometheus metric definitions
│   ├── config.py        # Settings (pydantic-settings)
│   └── Dockerfile
├── chart/               # Helm chart
│   ├── values.yaml
│   ├── values-local.yaml
│   ├── values-gpu.yaml
│   └── templates/
│       ├── configmap.yaml
│       ├── gateway/     # deployment, service, hpa
│       └── model/       # deployment, service
├── k8s/                 # Raw Kubernetes manifests
│   ├── gateway/
│   ├── model/
│   ├── monitoring/      # Prometheus ServiceMonitor + Grafana dashboard
│   └── nvidia-device-plugin.yaml
├── infra/               # Terraform — EKS + GPU node group + Karpenter
│   ├── main.tf
│   └── variables.tf
├── Makefile             # local-up, local-down, test, load-test, eks-deploy
└── ARCHITECTURE.md      # Design decisions with rationale
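
The tree places the rate limiter in gateway/main.py, and the architecture diagram gives its limit as 60 requests per minute per IP. One common way to enforce such a limit is a sliding window per client, sketched below. This is an assumption about the technique, not a copy of the gateway's code; the class name and the injectable clock are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int = 60, window_seconds: float = 60.0,
                 clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # client IP -> accepted timestamps

    def allow(self, client_ip: str) -> bool:
        """Record and accept the request, or reject it if over the limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Demo with a fake clock and a small limit so the behaviour is easy to follow.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=3, window_seconds=60.0,
                               clock=lambda: fake_now[0])

first_burst = [limiter.allow("10.0.0.1") for _ in range(4)]
other_client = limiter.allow("10.0.0.2")  # limits are tracked per IP

fake_now[0] = 61.0  # the earlier requests have aged out
after_window = limiter.allow("10.0.0.1")
```

A gateway would typically answer a rejected request with HTTP 429. Note that state here lives in one process, so with gateway.replicas set to 2 each replica would count separately unless the counters are shared.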

See also

ARCHITECTURE.md — detailed design rationale: why the gateway and model backend are separate, how request flow works, autoscaling strategy, and the metrics design.
