Workloads in code. Endpoints in minutes.
Declare an AI workload, call ensure(), and get back an
OpenAI-compatible endpoint you can hit right away.
One Rust core. Native Python & TypeScript packages.
Status: early development. A single Rust core plus a stable C ABI and thin native bindings per language. Python and TypeScript ship first; Go and Java follow over the same C ABI. The published packages (
inferencekeyon PyPI,@inferencekey/sdkon npm) are not out yet β today you build from this repo.
InferenceKey is an AI infrastructure platform: run every model, project and workload from one place, keep spend under control, and scale on demand. This repo is the SDK β an optional way to drive the platform from code instead of the dashboard.
ref = mgmt.ensure(WorkloadSpec(name="support-bot", slug="support-bot",
model="meta-llama/Llama-3.1-8B-Instruct", backend=Backend.VLLM,
command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"))
out = data.endpoint(ref.workload_slug, api_key="ik_live_...").generate_text(prompt="Hi")
print(out.text)You declare a workload, the SDK ensures it exists on the platform, and you call the resulting OpenAI-compatible endpoint. That's the whole loop.
The SDK is an add-on β the platform works fully from the dashboard without it.
The SDK is a thin door onto a platform built for teams scaling AI β not fighting
infra. Behind the ensure() call:
| Capability | What it gives you | |
|---|---|---|
| πΈ | Spend control | Know exactly where your AI spend goes β per machine, project, model and team. Budgets and alerts, no surprises at the end of the month. |
| π | Scale on demand | Scale AI without rebuilding your stack. Demanding workloads β text & audio streaming, image analysis, content generation, knowledge-grounded apps β served in real time. |
| π | Swap engines, one URL | Run vLLM, SGLang, Ollama or llama.cpp behind a single endpoint. Swap the engine as you scale; the URL your code calls never changes. |
| βοΈ | Smart rules | Run compute only when it makes sense β start machines when there's work, stop them when there's none. Fixed, scheduled or autoscaling, per workload. |
And for the SDK itself:
- OpenAI-compatible β workloads expose the OpenAI chat/embeddings API. Point existing code at them, no new client to learn.
- Open source β Apache-2.0. One audited Rust core, native packages β read it, vendor it, trust it.
- Secure by design β two tokens, least privilege: a leaked inference key can never reconfigure your infrastructure.
You declare with the control plane (ManagementClient, an ik_sdk_ token)
and call with the data plane (DataClient, an ik_live_ token). The two
planes never share a path, a client, or a token.
flowchart LR
subgraph you["Your code"]
M["ManagementClient<br/>ik_sdk_ Β· control"]
D["DataClient<br/>ik_live_ Β· data"]
end
M -->|"control plane Β· /api"| P["InferenceKey<br/>Manager"]
P -->|"reconciles onto"| W["Workers<br/>vLLM Β· SGLang Β· Ollama Β· llama.cpp"]
D -->|"data plane Β· /endpoint/.../v1"| W
classDef brand fill:#0c1013,stroke:#1fd4c8,color:#9af0e9;
classDef mgr fill:#07090b,stroke:#1fd4c8,color:#ffffff;
class M,D brand;
class P,W mgr;
ensure() is idempotent on the slug, so you can run it on every deploy and it
converges instead of duplicating. β Architecture
The code that creates workloads is never the code that calls them β enforced server-side, and again client-side with fast, typed wrong-token errors.
| Token | Plane | Client | Scope | Inference? | Provision? |
|---|---|---|---|---|---|
ik_sdk_β¦ |
Control | ManagementClient |
One project | β | β |
ik_live_β¦ |
Data | DataClient endpoints |
Per workload | β | β |
A data key is passed per workload, so one app can drive many workloads, each with its own key β a leaked key blasts only a single workload's radius. β Tokens
1 Β· Get two tokens in the dashboard: a
control token (ik_sdk_) and a data token (ik_live_). β Tokens quickstart
2 Β· Set your environment:
export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export INFERENCEKEY_SDK_TOKEN="ik_sdk_..." # control plane
export INFERENCEKEY_API_KEY="ik_live_..." # data plane (default)3 Β· Ensure the workload, then call it:
Python
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend
# Control plane: provision/reconcile the workload (ik_sdk_ token).
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
name="support-bot",
slug="support-bot",
model="meta-llama/Llama-3.1-8B-Instruct",
backend=Backend.VLLM,
command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
)) # on_drift defaults to RECONCILE
# Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text) # generated text
print(out.model) # model that served the requestTypeScript
import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";
// Control plane: provision/reconcile the workload (ik_sdk_ token).
const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
name: "support-bot",
slug: "support-bot",
model: "meta-llama/Llama-3.1-8B-Instruct",
backend: Backend.Vllm,
command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});
// Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint(ref.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE });
const out = await ep.generateText({ prompt: "Hola", temperature: 0.2, maxTokens: 300 });
console.log(out.text); // generated textβ Full walkthrough: Quickstart Β·
runnable code in examples/
Pick the serving engine with the backend field. The platform handles placement β
a WorkloadSpec has no provider and no min_vram_gb.
Backend |
Wire | When to use |
|---|---|---|
OLLAMA / Ollama |
ollama |
Quick local-style serving and broad GGUF coverage. Simplest to stand up. |
VLLM / Vllm |
vllm |
High-throughput production text generation. You control the launch command. |
VLLM_OMNI / VllmOmni |
vllm-omni |
vLLM for multimodal / omni models. Same command-driven config. |
SGLANG / Sglang |
sglang |
Structured / programmatic generation on the SGLang runtime. |
LLAMACPP / Llamacpp |
llamacpp |
Prebuilt llama-server for GGUF models. Strong on AMD/ROCm and Apple Silicon. |
task_type defaults to text2text and spans 12 modalities (text, embeddings,
images, audio, reranking, classification, reward, β¦); execution_policy is
fixed, scheduled or autoscaling.
β Backends & policies
Runnable, self-contained examples live in examples/ β each one is
a folder you copy, set a few env vars, and run. Every example follows the same
shape: ensure() β wait until ready β call the endpoint.
gguf-llamacpp-private-amdβ serve a GGUF model withllamacppon a private AMD/ROCm worker.
inferencekey-sdk/
βββ core/ inferencekey-core β all logic + transport (reqwest/SSE).
β Pure domain (enums, spec, drift, wire, sse) + pipelines
β (ensure / generate_text / embed). One source of truth.
βββ capi/ inferencekey-capi β stable C ABI (extern "C") over the core,
β for FFI consumers (cgo, JNI/FFM, β¦). cbindgen β inferencekey.h.
βββ bindings/
βββ python/ pyo3 + maturin β the `inferencekey` PyPI package.
βββ node/ napi-rs β the `@inferencekey/sdk` npm package.
Behaviour lives in the Rust core, so every language behaves identically; the bindings are thin shells that marshal types and map errors to each language's idioms. β Architecture
cargo build # core + capi + bindings (Rust side)
cargo test # core unit tests
# Python wheel: (cd bindings/python && maturin build --release)
# Node addon: (cd bindings/node && npx napi build --release)
# C header: (cd capi && cbindgen --config cbindgen.toml --output include/inferencekey.h)| Area | Status |
|---|---|
| Rust core + C ABI | β Working |
Python binding (inferencekey) |
β Shipping β build from source today |
TypeScript binding (@inferencekey/sdk) |
β Shipping β build from source today |
| PyPI / npm publish | π Not yet published |
| Go / Java bindings | π Planned (over the same C ABI) |
This is early development: the wire format and the public surface may still change. Track progress and read the canonical reference at docs.inferencekey.com.
Everything in this README is a summary of the full docs, kept in manual sync:
- Quickstart β tokens, your first
ensure(), your first call. - Guides β authentication, workloads by policy / worker / modality, use cases.
- Reference β architecture, tokens, OnDrift, backends & policies, wire format, common errors.
- API reference β full Python & TypeScript surface (Go & Java coming soon).
New to InferenceKey? Open the dashboard Β· Learn more at inferencekey.com Β· Read the docs.