shiftgate is an intelligent routing layer that automatically selects the right LoRA adapter for each task in your local agent loop.
shiftgate does not manage weights. It stores adapter metadata only — no downloading, caching, or loading LoRA files. You start Ollama or vLLM with your models and adapters loaded; shiftgate embeds each query, picks the best task cluster, and tells the backend which adapter to use.
shiftgate run requires a running inference backend. Routing-only commands (shiftgate route, shiftgate init) work without one. To generate text, Ollama (localhost:11434) or vLLM (localhost:8000) must already be running with your adapters loaded.
Instead of hardcoding which adapter to use, shiftgate matches your query against a catalog of task clusters using cosine similarity — then routes to the best-fit LoRA adapter on that backend.
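To make the matching step concrete, here is a minimal sketch of cosine-similarity routing in plain Python. It is an illustration, not shiftgate's implementation: the function names, the task names, and the three-dimensional toy vectors are all assumptions, and real embeddings come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def best_task(query_vec, centroids):
    """Return (task_name, score) for the centroid most similar to the query."""
    scored = {name: cosine_similarity(query_vec, vec) for name, vec in centroids.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Toy vectors standing in for task centroids (hypothetical task names).
centroids = {
    "python-codegen": [0.9, 0.1, 0.0],
    "sql-generation": [0.1, 0.9, 0.0],
}
task, score = best_task([0.8, 0.2, 0.1], centroids)
print(task, round(score, 3))
```

The query vector points mostly along the first axis, so the first centroid wins; the router would then look up the adapter registered for that task.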
Requires Python 3.10+ and a running Ollama or vLLM instance for inference.
uv tool install shiftgate
# or: pip install shiftgate

vLLM (example: load adapters with --lora-modules):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B \
--enable-lora \
--lora-modules python-lora=/path/to/python-lora

Ollama (example: create a model that bundles base + adapter, then serve):
ollama create python-lora-ollama -f my-python-lora.Modelfile
ollama serve

Creates ~/.shiftgate/ and computes task embeddings (one-time model download for routing):
shiftgate init

Pick the option that matches your setup (see Bring Your Own Models for details):
# Option 1 — adapter already loaded in vLLM
shiftgate adapter add python-lora --runtime python-lora --tags python --base meta-llama/Meta-Llama-3-8B
# Option 2 — adapter already loaded in Ollama
shiftgate adapter add python-lora --runtime python-lora-ollama --tags python --base llama3
# Option 3 — metadata-only (catalogue a HuggingFace repo; no weights downloaded)
shiftgate adapter add teknium/python-lora --tags python --base llama3

# Route only: shows the decision, no inference
shiftgate route "write a python sorting function"
# Route + run through your backend
shiftgate run "write a python sorting function"

Essential commands: init · adapter add · route · run · doctor · serve

shiftgate serve exposes the router as a drop-in OpenAI-compatible endpoint. Any client that speaks the OpenAI API can point at it and get auto-routing by passing model="auto".
# Start the proxy (defaults to http://127.0.0.1:9000)
shiftgate serve

# Use it from any OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="not-needed")
client.chat.completions.create(
    model="auto",  # ← shiftgate picks the right adapter
    messages=[{"role": "user", "content": "write a sql query"}],
)

When model="auto", shiftgate routes the request to the best adapter and rewrites model to that adapter's backend name before forwarding upstream. The response carries an X-Shiftgate-Route: <adapter_id> (<score>) header so you can see what was chosen. Passing any other model id bypasses routing and forwards the request verbatim. Streaming (stream: true) is piped straight through via SSE.
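A client that wants to log routing decisions could split that header value into its two parts. The sketch below assumes the value has exactly the `<adapter_id> (<score>)` shape described above; `parse_route_header` is a hypothetical helper, not part of shiftgate, and the sample score is illustrative only.

```python
import re

# Assumed value shape, taken from the description above: "<adapter_id> (<score>)".
_ROUTE_RE = re.compile(r"^(?P<adapter>.+?)\s*\((?P<score>[^)]*)\)\s*$")


def parse_route_header(value):
    """Split an X-Shiftgate-Route value into (adapter_id, raw_score_text)."""
    cleaned = value.strip()
    match = _ROUTE_RE.match(cleaned)
    if match is None:
        return cleaned, None  # unexpected shape: hand back the raw value
    return match.group("adapter"), match.group("score")


adapter, score = parse_route_header("python-lora (0.91)")
print(adapter, score)
```

The score is kept as text because the README does not say whether it is a fraction or a percentage. With the official openai Python client, response headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers`.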
shiftgate serve --port 9000 --host 127.0.0.1 --backend auto # backend: auto | ollama | vllm | cerebras

The bind address defaults to 127.0.0.1 (localhost only). Pass --host 0.0.0.0 to expose it on your network.
Point each tool's OpenAI base URL at the proxy and use model="auto":
# Cursor → Settings → Models → Override OpenAI Base URL
http://localhost:9000/v1
# Aider
aider --openai-api-base http://localhost:9000/v1 --openai-api-key not-needed --model auto

# LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:9000/v1",
    api_key="not-needed",
    model="auto",
)

shiftgate run "write a python sorting function"
╭────────────────────────── Routing Decision ──────────────────────────╮
│ Query "write a python sorting function" │
│ Matched Task Python Code Generation ████████████████░░ 91.2% │
│ Adapter python-lora [meta-llama/Meta-Llama-3-8B] │
│ Backend vllm │
╰──────────────────────────────────────────────────────────────────────╯
Running via vllm…
────────────────────────────────── Response ──────────────────────────────────
def sort_array(arr):
    """Return a sorted copy using Python's Timsort."""
    return sorted(arr)
───────────────────────────────────────────────────────────────────────────────
Inference: 6204 ms · Total: 6246 ms
Use shiftgate route "<query>" --explain to see the full decision tree — top task matches, similarity scores, and why an adapter was chosen.
Run a full health check anytime something feels off:
shiftgate doctor

shiftgate doctor checks:
| Check | What it tells you |
|---|---|
| Embedder | Whether the routing embedding model loads and produces vectors |
| Backend | Whether Ollama (localhost:11434) or vLLM (localhost:8000) is reachable |
| Task embeddings | Whether all task clusters have computed centroids (shiftgate init) |
| Adapter runtime availability | For each registered adapter: linked status and whether it is loaded in the backend |
| Unlinked task clusters | Task clusters with no adapter wired — routing will match the task but cannot run inference |
Runtime adapter verification runs automatically when you register a backend-loaded adapter:
shiftgate adapter add python-lora --runtime python-lora --tags python --base llama3
# Backend: vllm ✓ verified ← adapter found in the running backend
# Backend: vllm ⚠ runtime 'python-lora' not loaded — did you pass --lora-modules?
# Backend: not running (verification skipped)

Backend detection is automatic. shiftgate run, shiftgate status, and shiftgate doctor probe Ollama first, then vLLM. No config file required.
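The probe order can be pictured with a short sketch. It assumes a plain TCP connection test against the default ports named in this README; how shiftgate actually probes is not documented here, and `tcp_reachable` and `detect_backend` are hypothetical names.

```python
import socket

# Default ports as listed in this README; the probing approach is an assumption.
LOCAL_BACKENDS = [("ollama", "localhost", 11434), ("vllm", "localhost", 8000)]


def tcp_reachable(host, port, timeout=0.5):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def detect_backend(probe=tcp_reachable):
    """Return the first reachable backend name in priority order, or None."""
    for name, host, port in LOCAL_BACKENDS:
        if probe(host, port):
            return name
    return None


print(detect_backend())
```

An open port only shows that something is listening, so a more careful check would likely call an HTTP endpoint on each backend. The `probe` parameter is injectable so the ordering can be exercised without real servers.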
User query
│
▼
┌──────────────────────────────────────────────────┐
│ shiftgate CLI │
│ shiftgate route / shiftgate run │
└────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Router │
│ │
│ 1. Embed query (fastembed BAAI/bge-small-en) │
│ 2. Cosine similarity vs task centroids │
│ 3. top-K tasks → walk preferred_adapters list │
│ 4. Return RoutingTrace │
└──────────┬───────────────────────┬───────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌────────────────────────────┐
│ Task Registry │ │ Adapter Registry │
│ ~/.shiftgate/ │ │ ~/.shiftgate/adapters.json│
│ tasks.json │ │ │
│ (10 defaults) │ │ Add via: │
└─────────────────┘ │ shiftgate adapter add │
└────────────┬───────────────┘
│
▼
┌────────────────────────────────┐
│ BackendRouter │
│ │
│ Ollama (localhost:11434) │
│ vLLM (localhost:8000) │
│ Auto-detected at runtime │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ Feedback Loop │
│ ~/.shiftgate/traces.jsonl │
│ shiftgate feedback accept │
│ shiftgate feedback stats │
└────────────────────────────────┘
When a backend is active, shiftgate filters candidate adapters to only those actually loaded on that backend. Switch from vLLM to Cerebras and shiftgate automatically picks Cerebras-compatible adapters — no re-registration needed. (When you run shiftgate route with no backend running, no filtering is applied, so you still see the full routing preview.)
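The selection and filtering behaviour described above can be sketched as follows. This is an assumption-laden illustration: `select_adapter`, the data shapes, and the task names are hypothetical, and shiftgate may filter on runtime names rather than adapter ids.

```python
def select_adapter(ranked_tasks, preferred_adapters, loaded=None):
    """Pick the first usable preferred adapter, walking tasks best-first.

    ranked_tasks: task names, best match first (the top-K list).
    preferred_adapters: task name -> ordered list of adapter ids.
    loaded: adapter ids available on the active backend, or None when no
            backend is running (in that case nothing is filtered out).
    """
    for task in ranked_tasks:
        for adapter_id in preferred_adapters.get(task, []):
            if loaded is None or adapter_id in loaded:
                return task, adapter_id
    return None  # matched tasks exist, but no adapter can serve them


ranked = ["python-codegen", "general"]
preferred = {"python-codegen": ["python-lora"], "general": ["llama3.1-8b"]}

print(select_adapter(ranked, preferred))                          # no backend: full preview
print(select_adapter(ranked, preferred, loaded={"llama3.1-8b"}))  # only one adapter loaded
```

Passing `loaded=None` mirrors the preview case for shiftgate route without a backend, while a concrete set mirrors the filtered case.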
shiftgate is a routing layer. You load weights into Ollama or vLLM first, then register what you loaded so shiftgate can route to it.
You can also catalogue adapters you have not loaded yet (Option 3) — useful for shiftgate route, but shiftgate run will not produce output until the adapter is available in a running backend.
Start vLLM with your adapters:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B \
--enable-lora \
--lora-modules sql-lora=/path/to/sql-lora

Register using the --lora-modules key as --runtime:
shiftgate adapter add sql-lora --runtime sql-lora --tags sql --base meta-llama/Meta-Llama-3-8B

shiftgate sends "model": "<runtime_name>" in each /v1/chat/completions request.
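For reference, a request of that shape can be built with the standard library alone. This is a sketch of the wire format under stated assumptions, not shiftgate's code: `build_chat_request` is a hypothetical helper, and the request is constructed but never sent.

```python
import json
import urllib.request


def build_chat_request(base_url, runtime_name, prompt):
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {
        "model": runtime_name,  # the --lora-modules key registered via --runtime
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000", "sql-lora", "write a sql query")
print(req.full_url)
print(req.data.decode("utf-8"))
```

If the `model` value does not match a name passed to --lora-modules, vLLM has no adapter to select, which is why the runtime name must be registered exactly as loaded.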
Create a Modelfile that bundles your base model and adapter:
# my-sql-lora.Modelfile
FROM llama3
ADAPTER /path/to/sql-lora.safetensors

ollama create sql-lora-ollama -f my-sql-lora.Modelfile
ollama serve

Register using the Ollama model name as --runtime:
shiftgate adapter add sql-lora --runtime sql-lora-ollama --tags sql --base llama3

shiftgate passes runtime_name (or falls back to id) as the Ollama model name.
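The fallback rule amounts to one line of logic. The sketch below assumes an adapter entry can be read as a dictionary with `id` and an optional `runtime_name`; the real entry type and the helper name `ollama_model_name` are assumptions.

```python
def ollama_model_name(adapter):
    """Prefer the registered runtime name; otherwise fall back to the adapter id."""
    return adapter.get("runtime_name") or adapter["id"]


print(ollama_model_name({"id": "sql-lora", "runtime_name": "sql-lora-ollama"}))
print(ollama_model_name({"id": "teknium/sql-lora"}))
```

An adapter registered without --runtime therefore resolves to its id, which only works if Ollama happens to have a model under that exact name.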
Catalogue an adapter without downloading weights — metadata only:
shiftgate adapter add teknium/sql-lora --tags sql --base llama3

You can also record a local path for your own reference (shiftgate still does not load the file):
shiftgate adapter add sql-lora --local /models/sql-lora --tags sql --base llama3

This is useful for exploring routing decisions before your backend is set up. To run inference, load the adapter in vLLM or Ollama and re-register with --runtime.
shiftgate also supports Cerebras as a cloud fallback. It uses Cerebras' OpenAI-compatible API and authenticates with a bearer token from the CEREBRAS_API_KEY environment variable (or the --cerebras-key global flag).
export CEREBRAS_API_KEY=csk-...
shiftgate adapter add llama3.1-8b --runtime llama3.1-8b --tags general --base llama3.1
shiftgate run "write a python sorting function"

shiftgate auto-detects backends in the order Ollama → vLLM → Cerebras, so local backends always win and Cerebras is used only when no local backend is running.

Honest status: shiftgate routes to Cerebras' base-model inference today. When Cerebras Multi-LoRA goes public, register your adapter with --runtime <cerebras-lora-id> and it just works, with no shiftgate update needed.
- Fork this repo.
- Publish your adapter to HuggingFace and open a PR that documents it in a Community Adapters section (or add it to your local registry with shiftgate adapter add).
- The adapter registry ships empty by design; adapters are user-managed via ~/.shiftgate/adapters.json.
To add a task cluster that better matches your domain, run shiftgate task add interactively or edit ~/.shiftgate/tasks.json and add validation_examples that represent real queries your users ask. Run shiftgate init to recompute centroids.
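To see why representative examples matter, it helps to picture how a centroid could be derived from them. The sketch assumes a centroid is the element-wise mean of the example embeddings, which this README does not confirm; `centroid` is a hypothetical helper and the two-dimensional vectors are toy data.

```python
def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one example embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings standing in for three validation examples of one task.
example_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
task_centroid = centroid(example_vectors)
print(task_centroid)
```

Under that assumption, examples that drift away from real user queries pull the centroid away from them too, and routing scores for that task drop.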
~/.shiftgate/
├── adapters.json # your registered adapters
├── tasks.json # task clusters (copied from defaults on first init)
├── traces.jsonl # append-only routing trace log
└── embeddings_cache.npy # cached centroids — delete to force re-embedding
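Because traces.jsonl is line-delimited JSON, it can be analysed with a few lines of Python. The sketch below computes a per-adapter acceptance rate; the field names `adapter_id` and `feedback` and the values "accept" and "reject" are hypothetical, since the trace schema is not documented in this README.

```python
import json
from collections import defaultdict


def acceptance_rates(lines):
    """Map adapter id -> fraction of rated traces marked 'accept'."""
    accepted = defaultdict(int)
    rated = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        trace = json.loads(line)
        verdict = trace.get("feedback")  # hypothetical field name
        if verdict not in ("accept", "reject"):
            continue  # unrated traces do not count
        adapter_id = trace["adapter_id"]  # hypothetical field name
        rated[adapter_id] += 1
        if verdict == "accept":
            accepted[adapter_id] += 1
    return {adapter_id: accepted[adapter_id] / rated[adapter_id] for adapter_id in rated}


sample = [
    '{"adapter_id": "python-lora", "feedback": "accept"}',
    '{"adapter_id": "python-lora", "feedback": "reject"}',
    '{"adapter_id": "sql-lora", "feedback": "accept"}',
    '{"adapter_id": "sql-lora"}',
]
print(acceptance_rates(sample))
```

An open file object can be passed in place of `sample`, since iterating over a file yields one line at a time.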
| Version | Focus |
|---|---|
| v0.1 | Single base model, multi-adapter routing ← current |
| v0.2 | Feedback loop + adapter scoring (auto-demote bad adapters) |
| v0.3 | Multi-model routing (route to different base models per task) |
| v1.0 | Community registry + web UI |
# Clone and install in editable mode with all dev dependencies
git clone https://github.com/shiftgate-ai/shiftgate
cd shiftgate
uv sync --extra dev # creates .venv, installs shiftgate + dev deps
# Run tests (no GPU needed — tests use synthetic embeddings)
uv run pytest
# Run the demo inside the venv
uv run shiftgate demo

Note: uv sync reads pyproject.toml and resolves a locked environment. There is no need to run pip install manually. Activate the venv with .venv/Scripts/activate (Windows) or source .venv/bin/activate (macOS/Linux) if you want the shiftgate command on your PATH without the uv run prefix.
Releases are managed through a CI release workflow (e.g. GitHub Actions).
No manual PyPI API token management is required for normal releases.
The recommended flow:
- Bump the version in pyproject.toml (version = "x.y.z").
- Open a PR, get it reviewed and merged.
- Tag the commit: git tag vx.y.z && git push origin vx.y.z.
- The CI workflow builds the wheel with uv build and publishes to PyPI using Trusted Publishing (OIDC), so no stored API token is needed.
For a one-off manual publish (maintainers only):
uv build # produces dist/shiftgate-x.y.z-py3-none-any.whl
uv publish # authenticates via OIDC or a scoped PyPI token

shiftgate/
├── cli.py # Typer CLI — all user commands
├── registry/
│ ├── schemas.py # Pydantic models: AdapterEntry, TaskCluster, RoutingTrace
│ ├── adapter_registry.py
│ └── task_registry.py
├── router/
│ ├── embedder.py # fastembed wrapper (CPU, singleton)
│ ├── matcher.py # cosine similarity, top-K, adapter selection
│ └── router.py # orchestrates embed → match → trace
├── runtime/
│ └── backend.py # OllamaBackend, VLLMBackend, BackendRouter
├── feedback/
│ └── loop.py # trace persistence, accept/reject, scoring
└── utils/
└── display.py # Rich panels, tables, animations
| Command | Description |
|---|---|
| shiftgate init | First-time setup: initialise ~/.shiftgate/, compute task embeddings |
| shiftgate route "<query>" | Route a query and show the decision (no inference) |
| shiftgate route "<query>" --explain | Full decision tree: task scores, candidates, selection reason |
| shiftgate run "<query>" | Route + run via Ollama or vLLM |
| shiftgate serve [--port 9000] [--host …] [--backend …] | Run an OpenAI-compatible auto-routing proxy |
| shiftgate doctor | Full health check: embedder, backend, adapters, task embeddings |
| shiftgate adapter add <hf_repo> [--tags …] [--base …] | Register adapter from HuggingFace (metadata only) |
| shiftgate adapter add <id> --local <path> [--tags …] | Register a local adapter path |
| shiftgate adapter add <id> --runtime <name> [--tags …] | Register a backend-loaded adapter by its runtime name |
| shiftgate adapter list | Table of all registered adapters |
| shiftgate adapter remove <id> | Remove an adapter |
| shiftgate task list | Table of all task clusters |
| shiftgate task add | Interactively add a new task cluster |
| shiftgate feedback accept | Mark last routing as good |
| shiftgate feedback reject | Mark last routing as bad |
| shiftgate feedback stats | Adapter acceptance rate table |
| shiftgate status | Backend connectivity + registry summary |
| shiftgate demo | Animated demo with fake routing traces |
- LORAUTER — Effective LoRA Adapter Routing using Task Representations (Dhasade et al., EPFL, 2026). shiftgate's task-level semantic routing is inspired by this work; it is not a reimplementation of the paper's full algorithm.
MIT. See LICENSE.
