InferenceForge

Production-grade LLM inference platform on Kubernetes. It provides an OpenAI-compatible API, request queuing, Prometheus metrics, and a Helm chart that switches between CPU (free, local) and GPU (AWS EKS) by swapping a single values file.

What it does

Most "deploy an LLM" tutorials stop at `ollama run` or a single Docker container. InferenceForge builds the production layer on top: the queue, the rate limiter, the metrics, the Helm chart, and the GPU node configuration.

                         ┌──────────────────────────────┐
Client (OpenAI SDK)  ──► │     Gateway (FastAPI)        │
                         │                              │
                         │  ┌──────────────────────┐    │
                         │  │  Rate limiter        │    │
                         │  │  60 req/min per IP   │    │
                         │  └──────────┬───────────┘    │
                         │             │                │
                         │  ┌──────────▼───────────┐    │
                         │  │  Request queue       │    │
                         │  │  asyncio.Semaphore   │    │
                         │  └──────────┬───────────┘    │
                         │             │                │
                         │  ┌──────────▼───────────┐    │
                         │  │  Prometheus metrics  │    │
                         │  │  /metrics endpoint   │    │
                         │  └──────────────────────┘    │
                         └──────────────┬───────────────┘
                                        │
                         ┌──────────────▼───────────────┐
                         │      Model Backend           │
                         │                              │
                         │  Local:  Ollama (CPU)        │
                         │  AWS:    vLLM (NVIDIA T4)    │
                         └──────────────┬───────────────┘
                                        │
                         ┌──────────────▼───────────────┐
                         │   Prometheus → Grafana       │
                         │   12 panels: latency,        │
                         │   throughput, queue depth,   │
                         │   GPU utilisation            │
                         └──────────────────────────────┘
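
The rate-limiter box in the diagram can be sketched as a per-client sliding window. This is a minimal illustration, not the gateway's actual code: the class name, the injected `clock` parameter, and the choice of a sliding window (rather than, say, a token bucket) are all assumptions. Only the 60 requests per minute per IP figure comes from the diagram.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key.

    Hypothetical sketch; the real gateway may use a different algorithm.
    """

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would typically answer with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter()
print(limiter.allow("203.0.113.7"))
```

In a FastAPI app, a check like this would usually run in middleware or a dependency, keyed on the client address.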

One-flag switch: local → GPU

The entire difference between running on a MacBook and running on a T4 GPU is which of two values files you pass to Helm:

# Local — CPU, free, runs in Docker
helm install inferenceforge ./chart -f chart/values-local.yaml

# AWS EKS — NVIDIA T4 spot (~$0.15/hr)
helm install inferenceforge ./chart -f chart/values-gpu.yaml

| Setting | Local | GPU (EKS) |
|---|---|---|
| `model.backend` | ollama | vllm |
| `model.name` | gemma3:1b | mistralai/Mistral-7B |
| `gpu.enabled` | false | true |
| `gateway.replicas` | 1 | 2 |
| `autoscaling.enabled` | false | true |

Same gateway image. Same manifests. Same Prometheus config.
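
To make the override mechanism concrete, here is a rough Python model of how Helm layers a `-f` file over the chart defaults: nested maps merge key by key, and scalars and lists are replaced. The `defaults` dictionary is an assumption for illustration (the real `values.yaml` may hold different defaults), and the sketch ignores Helm's rule that a `null` override deletes a key.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return `base` with `override` layered on top; nested maps merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)  # scalars and lists are replaced wholesale
    return merged


# Hypothetical chart defaults; only the keys in the table above come from the README.
defaults = {
    "model": {"backend": "ollama", "name": "gemma3:1b"},
    "gpu": {"enabled": False},
    "gateway": {"replicas": 1},
    "autoscaling": {"enabled": False},
}

# The GPU column of the table, as it might appear in values-gpu.yaml.
gpu_overrides = {
    "model": {"backend": "vllm", "name": "mistralai/Mistral-7B"},
    "gpu": {"enabled": True},
    "gateway": {"replicas": 2},
    "autoscaling": {"enabled": True},
}

effective = merge_values(defaults, gpu_overrides)
print(effective["model"]["backend"], effective["gateway"]["replicas"])
```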

Quick start (local, no Kubernetes)

Prerequisites: Python 3.11+, Ollama

git clone https://github.com/lmenta/inferenceforge
cd inferenceforge

# 1. Start Ollama and pull a model
ollama serve &
ollama pull gemma3:1b

# 2. Install gateway dependencies
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn httpx pydantic pydantic-settings prometheus-client

# 3. Start the gateway
PYTHONPATH=. MODEL_BACKEND_URL=http://localhost:11434 MODEL_NAME=gemma3:1b \
  uvicorn gateway.main:app --port 8082

# 4. Send a request
curl -s -X POST http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is Kubernetes?"}]}' | python3 -m json.tool

# 5. Check metrics
curl -s http://localhost:8082/metrics | grep inferenceforge_
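
Step 5 can also be done from Python when you want the values rather than raw text. The sketch below parses Prometheus text exposition output and keeps only the `inferenceforge_` series. The sample text, including the `model` and `status` label names, is made up for illustration, and the parser assumes samples carry no trailing timestamp.

```python
def parse_metrics(text: str, prefix: str = "inferenceforge_") -> dict:
    """Map each matching series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated field (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


# Hypothetical /metrics output, not captured from a real run.
sample = """\
# HELP inferenceforge_queue_depth Waiting requests
# TYPE inferenceforge_queue_depth gauge
inferenceforge_queue_depth 3.0
inferenceforge_requests_total{model="gemma3:1b",status="200"} 42.0
python_gc_objects_collected_total{generation="0"} 128.0
"""

for series, value in parse_metrics(sample).items():
    print(series, value)
```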

Full Kubernetes deploy (local cluster with k3d)

Prerequisites: Docker, k3d, kubectl, helm, Ollama

make local-up

This single command:

1. Creates a k3d cluster (lightweight K8s in Docker)
2. Builds the gateway Docker image and imports it into the cluster
3. Pulls the TinyLlama model via Ollama
4. Deploys all Kubernetes manifests (gateway, model, HPA)
5. Installs Prometheus + Grafana via Helm
6. Opens Grafana at http://localhost:30300 (admin/admin)

make test        # health check + inference test
make metrics     # print current Prometheus metrics
make load-test   # run Locust load test (20 users, 60s)
make logs        # tail gateway logs
make local-down  # tear down the cluster

Deploy on AWS (GPU)

Prerequisites: AWS account, Terraform, kubectl, helm

# 1. Provision EKS cluster + GPU node group (g4dn.xlarge spot)
cd infra
terraform init
terraform apply

# 2. Configure kubectl
aws eks update-kubeconfig --name inferenceforge-dev --region eu-west-2

# 3. Install NVIDIA device plugin (registers nvidia.com/gpu as a K8s resource)
make nvidia-plugin

# 4. Deploy GPU stack
make eks-deploy

Cost: g4dn.xlarge spot ≈ $0.15/hr. The node group scales to 0 when idle, so GPU instances are billed only while requests are coming in. The EKS control plane is billed separately.
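
As a quick back-of-the-envelope check, the quoted spot rate can be turned into a monthly GPU figure. The usage patterns below are hypothetical. Spot prices fluctuate, and this ignores everything except the GPU instance itself.

```python
SPOT_RATE_USD_PER_HOUR = 0.15  # approximate figure quoted above; real spot prices vary


def gpu_cost(hours_active: float, rate: float = SPOT_RATE_USD_PER_HOUR) -> float:
    """GPU instance cost in USD for the hours a node was actually running."""
    return round(hours_active * rate, 2)


# Hypothetical usage patterns.
print(gpu_cost(4 * 22))   # 4 hours a day over 22 working days
print(gpu_cost(24 * 30))  # always on for a 30-day month
```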

API

The gateway exposes an OpenAI-compatible interface. Drop it in as a replacement for any OpenAI client:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gemma3:1b",
    messages=[{"role": "user", "content": "Explain LLM quantisation."}]
)
print(response.choices[0].message.content)
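
The chat completions endpoint also serves streaming responses. Assuming the gateway follows OpenAI's server-sent-events framing (`data: {json}` lines ending with `data: [DONE]`), which the README does not state explicitly, the text can be reassembled as sketched below. The function name and the sample lines are illustrative only.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style streaming lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hypothetical stream, shaped like OpenAI chat completion chunks.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With the OpenAI SDK, passing `stream=True` to `client.chat.completions.create` performs this parsing for you and yields chunk objects instead.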

| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Inference (streaming and non-streaming) |
| GET | `/health` | Liveness probe |
| GET | `/ready` | Readiness probe (checks backend is reachable) |
| GET | `/metrics` | Prometheus metrics |
| GET | `/queue/status` | Real-time queue depth |
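
The queue depth reported by `/queue/status` comes from a semaphore-based queue. A minimal sketch of that idea follows. The class name, the `max_concurrent` default, and the `waiting`/`active` counters are assumptions rather than the contents of `gateway/queue.py`.

```python
import asyncio


class RequestQueue:
    """Cap concurrent backend calls with a semaphore and count waiters (hypothetical)."""

    def __init__(self, max_concurrent: int = 2):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.waiting = 0  # what a queue-depth gauge would report
        self.active = 0   # what an active-requests gauge would report

    async def run(self, coro_fn, *args):
        self.waiting += 1
        try:
            await self._sem.acquire()  # suspends here while all slots are busy
        finally:
            self.waiting -= 1
        self.active += 1
        try:
            return await coro_fn(*args)
        finally:
            self.active -= 1
            self._sem.release()


async def fake_backend(i: int) -> int:
    """Stand-in for a slow model call."""
    await asyncio.sleep(0.01)
    return i


async def main():
    queue = RequestQueue(max_concurrent=2)
    tasks = [asyncio.create_task(queue.run(fake_backend, i)) for i in range(5)]
    await asyncio.sleep(0)  # let every task reach the semaphore
    snapshot = (queue.active, queue.waiting)
    results = await asyncio.gather(*tasks)
    return snapshot, results


print(asyncio.run(main()))
```

The snapshot taken while the backend is busy shows how many requests hold a slot and how many are queued behind them.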

Kubernetes GPU scheduling

Three things must be in place for a pod to land on a GPU node:

# 1. Tolerate the GPU taint (the taint itself is what keeps CPU workloads off GPU nodes)
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

# 2. Pin to GPU node type
nodeSelector:
  node.kubernetes.io/instance-type: g4dn.xlarge

# 3. Request the GPU — K8s finds a free slot and exclusively assigns it
resources:
  limits:
    nvidia.com/gpu: "1"

The NVIDIA Device Plugin DaemonSet (deployed via `make nvidia-plugin`) runs on every GPU node and registers `nvidia.com/gpu` as a schedulable resource. Without it, Kubernetes doesn't know GPUs exist.
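
The three conditions can be modelled as a simplified scheduling predicate. This is a toy model for intuition, not how kube-scheduler is implemented: the dictionary shapes and the `free` field are invented here, and real scheduling also considers `NoExecute` taints, affinity, and resource requests.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes toleration matching."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return toleration.get("key") in (None, "", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod: dict, node: dict) -> bool:
    # 1. Every NoSchedule taint on the node must be tolerated by the pod.
    for taint in node.get("taints", []):
        if taint["effect"] == "NoSchedule" and not any(
            tolerates(t, taint) for t in pod.get("tolerations", [])
        ):
            return False
    # 2. Every nodeSelector entry must match a node label.
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in pod.get("nodeSelector", {}).items()):
        return False
    # 3. Requested extended resources must fit in what the node still has free.
    free = node.get("free", {})
    return all(free.get(res, 0) >= int(qty) for res, qty in pod.get("limits", {}).items())


gpu_node = {
    "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
    "labels": {"node.kubernetes.io/instance-type": "g4dn.xlarge"},
    "free": {"nvidia.com/gpu": 1},  # advertised only once the device plugin runs
}
gpu_pod = {
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
    "nodeSelector": {"node.kubernetes.io/instance-type": "g4dn.xlarge"},
    "limits": {"nvidia.com/gpu": "1"},
}
cpu_pod = {}  # no toleration, so the taint keeps it off the GPU node

print(can_schedule(gpu_pod, gpu_node), can_schedule(cpu_pod, gpu_node))
```

Setting `free` to an empty dictionary mimics a node without the device plugin: the GPU pod no longer fits because the resource is never advertised.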

Autoscaling

The HPA scales the gateway, not the model backend (GPU cold start is too slow for reactive scaling).

Two metrics drive scaling simultaneously:

- CPU utilisation: standard K8s metric
- `inferenceforge_queue_depth`: custom Prometheus metric via prometheus-adapter

Scaling on queue depth fires earlier and more predictably than CPU alone. A slow model fills the queue before it stresses the gateway's CPU.
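
When an HPA has several metrics, it computes a replica proposal for each one using `ceil(currentReplicas * currentValue / targetValue)` and acts on the largest. The sketch below applies that documented formula. The target values (70% CPU, queue depth of 5) and the replica bounds are hypothetical, and the HPA's tolerance band and stabilisation window are left out.

```python
import math


def desired_replicas(current: int, metrics: list, min_r: int = 1, max_r: int = 10) -> int:
    """Largest per-metric proposal, clamped to [min_r, max_r].

    `metrics` is a list of (current_value, target_value) pairs.
    """
    proposals = [math.ceil(current * value / target) for value, target in metrics]
    return max(min_r, min(max_r, max(proposals)))


# Hypothetical readings with 2 gateway replicas:
# CPU is at 40% against a 70% target, queue depth is 12 against a target of 5.
print(desired_replicas(2, [(40, 70), (12, 5)]))
```

Here CPU alone would keep the gateway at 2 replicas, while the queue-depth metric asks for more. That is the "fires earlier" behaviour described above.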

The model backend uses Karpenter: when a GPU pod can't be scheduled (no node exists), Karpenter provisions one. Slower than HPA (~2-3 min), but GPU nodes are expensive — you only want them when actually needed.

Prometheus metrics

| Metric | Type | Description |
|---|---|---|
| `inferenceforge_requests_total` | Counter | Request count by model and status |
| `inferenceforge_request_duration_seconds` | Histogram | Request latency; p50/p95/p99 are derived from the buckets at query time |
| `inferenceforge_queue_depth` | Gauge | Waiting requests; also drives the HPA |
| `inferenceforge_tokens_total` | Counter | Tokens generated (throughput + cost estimation) |
| `inferenceforge_active_requests` | Gauge | Concurrent in-flight requests |
| `inferenceforge_backend_errors_total` | Counter | Backend failures by error type |

Latency buckets are non-uniform (0.1s → 60s) because LLM inference is slow and unpredictable. A 5s bucket tells you more than a 1s bucket for this workload.
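
To see why bucket placement matters, the sketch below sorts a handful of latencies into non-uniform buckets and estimates percentiles by linear interpolation inside the matching bucket, which is roughly what PromQL's `histogram_quantile` does. The bucket bounds and latencies are hypothetical, since the README only gives the 0.1s to 60s range.

```python
import bisect

# Hypothetical upper bounds in seconds; the real gateway may use different ones.
BUCKETS = [0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60]


def bucket_counts(latencies):
    """Per-bucket counts; the final slot is the +Inf bucket."""
    counts = [0] * (len(BUCKETS) + 1)
    for value in latencies:
        # First bucket whose upper bound is >= value (Prometheus `le` semantics).
        counts[bisect.bisect_left(BUCKETS, value)] += 1
    return counts


def estimate_quantile(q, counts):
    """Interpolate linearly inside the bucket that contains the q-th observation."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= rank:
            if i == len(BUCKETS):
                return BUCKETS[-1]  # beyond the last finite bound
            lower = BUCKETS[i - 1] if i > 0 else 0.0
            return lower + (BUCKETS[i] - lower) * (rank - cumulative) / count
        cumulative += count
    return BUCKETS[-1]


latencies = [0.3, 0.8, 1.2, 2.0, 2.2, 3.5, 4.0, 4.8, 7.0, 22.0]
counts = bucket_counts(latencies)
print(counts)
print(estimate_quantile(0.5, counts), estimate_quantile(0.95, counts))
```

The estimates are only as precise as the bucket they fall in, so bounds should be densest where most requests land.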

Import `k8s/monitoring/grafana-dashboard.json` into Grafana to get the full 12-panel dashboard.

Project structure

inferenceforge/
├── gateway/
│   ├── main.py          # FastAPI app — queue, rate limiter, OpenAI adapter
│   ├── queue.py         # asyncio.Semaphore-based request queue
│   ├── metrics.py       # Prometheus metric definitions
│   ├── config.py        # Settings (pydantic-settings)
│   └── Dockerfile
├── chart/               # Helm chart
│   ├── values.yaml
│   ├── values-local.yaml
│   ├── values-gpu.yaml
│   └── templates/
│       ├── configmap.yaml
│       ├── gateway/     # deployment, service, hpa
│       └── model/       # deployment, service
├── k8s/                 # Raw Kubernetes manifests
│   ├── gateway/
│   ├── model/
│   ├── monitoring/      # Prometheus ServiceMonitor + Grafana dashboard
│   └── nvidia-device-plugin.yaml
├── infra/               # Terraform — EKS + GPU node group + Karpenter
│   ├── main.tf
│   └── variables.tf
├── Makefile             # local-up, local-down, test, load-test, eks-deploy
└── ARCHITECTURE.md      # Design decisions with rationale
