Skip to content

inferencekey/inferencekey-sdk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

24 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

InferenceKey SDK β€” Workloads in code. Endpoints in minutes.

InferenceKey SDK

Workloads in code. Endpoints in minutes.
Declare an AI workload, call ensure(), and get back an OpenAI-compatible endpoint you can hit right away.
One Rust core. Native Python & TypeScript packages.

Docs Open dashboard License: Apache-2.0

Python β€” shipping TypeScript β€” shipping Go β€” coming soon Java β€” coming soon

Status: early development. A single Rust core plus a stable C ABI and thin native bindings per language. Python and TypeScript ship first; Go and Java follow over the same C ABI. The published packages (inferencekey on PyPI, @inferencekey/sdk on npm) are not out yet β€” today you build from this repo.


What is this?

InferenceKey is an AI infrastructure platform: run every model, project and workload from one place, keep spend under control, and scale on demand. This repo is the SDK β€” an optional way to drive the platform from code instead of the dashboard.

ref = mgmt.ensure(WorkloadSpec(name="support-bot", slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct", backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"))

out = data.endpoint(ref.workload_slug, api_key="ik_live_...").generate_text(prompt="Hi")
print(out.text)

You declare a workload, the SDK ensures it exists on the platform, and you call the resulting OpenAI-compatible endpoint. That's the whole loop.

The SDK is an add-on β€” the platform works fully from the dashboard without it.

Why InferenceKey

The SDK is a thin door onto a platform built for teams scaling AI β€” not fighting infra. Behind the ensure() call:

Capability What it gives you
πŸ’Έ Spend control Know exactly where your AI spend goes β€” per machine, project, model and team. Budgets and alerts, no surprises at the end of the month.
πŸ“ˆ Scale on demand Scale AI without rebuilding your stack. Demanding workloads β€” text & audio streaming, image analysis, content generation, knowledge-grounded apps β€” served in real time.
πŸ”Œ Swap engines, one URL Run vLLM, SGLang, Ollama or llama.cpp behind a single endpoint. Swap the engine as you scale; the URL your code calls never changes.
βš™οΈ Smart rules Run compute only when it makes sense β€” start machines when there's work, stop them when there's none. Fixed, scheduled or autoscaling, per workload.

And for the SDK itself:

  • OpenAI-compatible β€” workloads expose the OpenAI chat/embeddings API. Point existing code at them, no new client to learn.
  • Open source β€” Apache-2.0. One audited Rust core, native packages β€” read it, vendor it, trust it.
  • Secure by design β€” two tokens, least privilege: a leaked inference key can never reconfigure your infrastructure.

How it fits together

You declare with the control plane (ManagementClient, an ik_sdk_ token) and call with the data plane (DataClient, an ik_live_ token). The two planes never share a path, a client, or a token.

flowchart LR
  subgraph you["Your code"]
    M["ManagementClient<br/>ik_sdk_ Β· control"]
    D["DataClient<br/>ik_live_ Β· data"]
  end
  M -->|"control plane Β· /api"| P["InferenceKey<br/>Manager"]
  P -->|"reconciles onto"| W["Workers<br/>vLLM Β· SGLang Β· Ollama Β· llama.cpp"]
  D -->|"data plane Β· /endpoint/.../v1"| W

  classDef brand fill:#0c1013,stroke:#1fd4c8,color:#9af0e9;
  classDef mgr fill:#07090b,stroke:#1fd4c8,color:#ffffff;
  class M,D brand;
  class P,W mgr;
Loading

ensure() is idempotent on the slug, so you can run it on every deploy and it converges instead of duplicating. β†’ Architecture

Two tokens, least privilege

The code that creates workloads is never the code that calls them β€” enforced server-side, and again client-side with fast, typed wrong-token errors.

Token Plane Client Scope Inference? Provision?
ik_sdk_… Control ManagementClient One project ❌ βœ…
ik_live_… Data DataClient endpoints Per workload βœ… ❌

A data key is passed per workload, so one app can drive many workloads, each with its own key β€” a leaked key blasts only a single workload's radius. β†’ Tokens

First result in under 5 minutes

1 Β· Get two tokens in the dashboard: a control token (ik_sdk_) and a data token (ik_live_). β†’ Tokens quickstart

2 Β· Set your environment:

export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export INFERENCEKEY_SDK_TOKEN="ik_sdk_..."   # control plane
export INFERENCEKEY_API_KEY="ik_live_..."    # data plane (default)

3 Β· Ensure the workload, then call it:

Python
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

# Control plane: provision/reconcile the workload (ik_sdk_ token).
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))  # on_drift defaults to RECONCILE

# Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)

print(out.text)   # generated text
print(out.model)  # model that served the request
TypeScript
import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

// Control plane: provision/reconcile the workload (ik_sdk_ token).
const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});

// Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint(ref.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE });
const out = await ep.generateText({ prompt: "Hola", temperature: 0.2, maxTokens: 300 });

console.log(out.text); // generated text

β†’ Full walkthrough: Quickstart Β· runnable code in examples/

Backends

Pick the serving engine with the backend field. The platform handles placement β€” a WorkloadSpec has no provider and no min_vram_gb.

Backend Wire When to use
OLLAMA / Ollama ollama Quick local-style serving and broad GGUF coverage. Simplest to stand up.
VLLM / Vllm vllm High-throughput production text generation. You control the launch command.
VLLM_OMNI / VllmOmni vllm-omni vLLM for multimodal / omni models. Same command-driven config.
SGLANG / Sglang sglang Structured / programmatic generation on the SGLang runtime.
LLAMACPP / Llamacpp llamacpp Prebuilt llama-server for GGUF models. Strong on AMD/ROCm and Apple Silicon.

task_type defaults to text2text and spans 12 modalities (text, embeddings, images, audio, reranking, classification, reward, …); execution_policy is fixed, scheduled or autoscaling. β†’ Backends & policies

Examples

Runnable, self-contained examples live in examples/ β€” each one is a folder you copy, set a few env vars, and run. Every example follows the same shape: ensure() β†’ wait until ready β†’ call the endpoint.

Architecture β€” one core, many bindings

inferencekey-sdk/
β”œβ”€β”€ core/        inferencekey-core β€” all logic + transport (reqwest/SSE).
β”‚                Pure domain (enums, spec, drift, wire, sse) + pipelines
β”‚                (ensure / generate_text / embed). One source of truth.
β”œβ”€β”€ capi/        inferencekey-capi β€” stable C ABI (extern "C") over the core,
β”‚                for FFI consumers (cgo, JNI/FFM, …). cbindgen β†’ inferencekey.h.
└── bindings/
    β”œβ”€β”€ python/  pyo3 + maturin β†’ the `inferencekey` PyPI package.
    └── node/    napi-rs β†’ the `@inferencekey/sdk` npm package.

Behaviour lives in the Rust core, so every language behaves identically; the bindings are thin shells that marshal types and map errors to each language's idioms. β†’ Architecture

Building

cargo build            # core + capi + bindings (Rust side)
cargo test             # core unit tests

# Python wheel:  (cd bindings/python && maturin build --release)
# Node addon:    (cd bindings/node   && npx napi build --release)
# C header:      (cd capi && cbindgen --config cbindgen.toml --output include/inferencekey.h)

Project status

Area Status
Rust core + C ABI βœ… Working
Python binding (inferencekey) βœ… Shipping β€” build from source today
TypeScript binding (@inferencekey/sdk) βœ… Shipping β€” build from source today
PyPI / npm publish πŸ”œ Not yet published
Go / Java bindings πŸ”œ Planned (over the same C ABI)

This is early development: the wire format and the public surface may still change. Track progress and read the canonical reference at docs.inferencekey.com.

Documentation

Everything in this README is a summary of the full docs, kept in manual sync:

  • Quickstart β€” tokens, your first ensure(), your first call.
  • Guides β€” authentication, workloads by policy / worker / modality, use cases.
  • Reference β€” architecture, tokens, OnDrift, backends & policies, wire format, common errors.
  • API reference β€” full Python & TypeScript surface (Go & Java coming soon).

License

Apache-2.0.


New to InferenceKey? Open the dashboard Β· Learn more at inferencekey.com Β· Read the docs.

About

AI that grows with your organisation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors