The agent runtime that slashes token spend for autonomous AI research.
AI Scientist costs $15/paper. AI-Researcher burns 2-5M tokens per task. AutoResearch hits context limits by iteration 60.
NightShift fixes the engine, not the agent. Run any AI research pipeline overnight on your laptop. Pay in time, not tokens.
Every autonomous research system has the same architecture:
[Read 100 papers] --> [Send everything to LLM] --> [Get output] --> [Repeat]
This is O(n) token cost where n = input size. We found five structural waste patterns across AI Scientist, AI-Researcher, and AutoResearch that account for 60-80% of all token spend:
| # | Pattern | Example | Waste |
|---|---|---|---|
| 1 | History accumulation | Full conversation replayed every turn, never compressed | 500K-1.5M tokens/session |
| 2 | Redundant context | Same code/templates/fewshot re-sent with every call | 2M tokens across 50 ideas |
| 3 | Raw data in prompts | 500K-2M JSON agent logs dumped into LLM input | 95% unnecessary |
| 4 | Uniform model routing | Frontier model used for simple JSON queries | 10-20x overspend |
| 5 | Zero cost awareness | No tracking, no budgets, no ROI measurement | Unoptimizable |
NightShift is a runtime layer that sits between your agent and the LLM API, eliminating these patterns automatically.
Without NightShift:
Agent sends 10M tokens to API --> $15/task
With NightShift:
Agent sends 10M tokens
--> Progressive compression (heavy local reduction before the API)
--> Confidence gate (skip API if local models handle it)
--> Content deduplication (don't re-send what hasn't changed)
--> Budget-optimal scheduling (spend $ where ROI is highest)
--> API receives 1.5K tokens --> $1.50/task
Illustrative example; actual reduction depends on the workload. No benchmark numbers are published yet (see Benchmarks).
Local small models (sub-300M, run on any laptop) process raw input through stages:
10M tokens --> Parse (Granite-Docling 258M)
2M tokens --> Extract (GLiNER 90M)
100K tokens --> Embed + Cluster (Jina v5 Nano 239M)
20K tokens --> Summarize (T5-small 60M)
4K tokens --> Rank + Deduplicate (MiniLM 22M)
1.5K tokens -> API
Each stage is free (local inference). The API only sees the final distillate.
Not every decision needs the API. The router estimates local confidence before calling out:
- High confidence locally (entity extraction, classification, dedup) --> skip API
- Low confidence (synthesis, hypothesis generation, strategic decisions) --> call API
- Ambiguous (novelty evaluation, quality judgment) --> call API with compressed context
Skips the API entirely for the calls local models can handle confidently.
Tracks what content has already been sent to the API. Never re-sends:
- Static templates, fewshot examples, code that hasn't changed
- Conversation history that's already been summarized
- Documents that have already been processed
Uses content hashing + diff-based updates.
Replaces unbounded conversation history with:
- Active window: Last N turns in full detail
- Compressed archive: Older turns summarized periodically
- Persistent knowledge: Facts extracted and stored in vector DB
Prevents the context accumulation that crashes AutoResearch at iteration 60.
Every API call is tracked:
{
"call_id": "synth_042",
"cost_usd": 0.45,
"input_tokens": 1847,
"new_insights": 4,
"cost_per_insight": 0.112,
"confidence_before": 0.6,
"confidence_after": 0.85
}UCB1 bandit algorithm allocates budget optimally across:
- Explore: New research directions (high variance)
- Deepen: Expand promising findings (medium variance)
- Synthesize: Connect and output (low variance)
Research doesn't start from zero. Cross-session memory:
- Facts, entities, and relations persist across runs
- New research queries existing knowledge first
- Repeated work reuses prior results instead of recomputing from scratch
┌──────────────────────────────────────────────────┐
│ YOUR AGENT │
│ (AI Scientist, custom, any agent) │
└──────────────────────┬───────────────────────────┘
│ raw LLM calls
┌──────────────────────▼───────────────────────────┐
│ NIGHTSHIFT RUNTIME │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────┐ │
│ │ Progressive │ │ Confidence │ │ Content │ │
│ │ Compression │ │ Gate │ │ Dedup │ │
│ └──────┬──────┘ └──────┬───────┘ └────┬────┘ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ┌───────────────────────▼──────────────────────┐│
│ │ Token Economics Engine ││
│ │ Budget tracker | Bandit scheduler | ROI log ││
│ └───────────────────────┬──────────────────────┘│
│ │ │
│ ┌───────────────────────▼──────────────────────┐│
│ │ Sliding Window History Manager ││
│ │ Active window | Compressed archive | KG ││
│ └───────────────────────┬──────────────────────┘│
│ │ │
│ ┌───────────────────────▼──────────────────────┐│
│ │ Local Model Pool ││
│ │ Load/unload on demand | ONNX/GGUF inference ││
│ │ Granite-Docling | GLiNER | Jina | T5 | MiniLM││
│ └──────────────────────────────────────────────┘│
└──────────────────────┬───────────────────────────┘
│ optimized calls only
┌──────────────────────▼───────────────────────────┐
│ LLM API │
│ (Claude, GPT-4, Gemini, local) │
└──────────────────────────────────────────────────┘
Benchmarks are not published yet. A harness under benchmarks/ is planned to compare
NightShift against AI Scientist, AI-Researcher, and AutoResearch on tokens and cost per
task; results will be added here once those runs are complete. Until then, treat the
figures in this README as illustrative of the mechanism, not measured results.
pip install nightshift-runtime
# Wrap any LLM client
from nightshift import NightShift
engine = NightShift(
local_models="auto", # downloads sub-300M models on first run
api_budget="$5.00", # hard cap
knowledge_db="./research", # persistent across sessions
)
# Drop-in replacement for your LLM calls
response = engine.complete(
messages=messages,
model="claude-sonnet-4-20250514",
compress=True, # enable progressive compression
gate=True, # enable confidence gating
)
# Or wrap an existing agent
engine.wrap(your_agent_function, budget="$10", duration="overnight")nightshift/
├── nightshift/
│ ├── __init__.py
│ ├── engine.py # Core runtime orchestrator
│ ├── compression/
│ │ ├── pipeline.py # Progressive compression stages
│ │ ├── dedup.py # Content deduplication
│ │ └── models.py # Local model pool manager
│ ├── routing/
│ │ ├── confidence.py # Confidence-gated router
│ │ ├── gate.py # API vs local decision logic
│ │ └── model_select.py # Task-appropriate model selection
│ ├── history/
│ │ ├── window.py # Sliding window manager
│ │ ├── compressor.py # History summarization
│ │ └── knowledge.py # Persistent knowledge graph
│ ├── economics/
│ │ ├── tracker.py # Token cost/value tracking
│ │ ├── bandit.py # UCB1 budget allocation
│ │ └── reports.py # ROI analysis and logging
│ └── agents/
│ ├── research.py # Reference research agent
│ ├── pdf.py # Reference PDF analysis agent
│ └── base.py # Base agent class
├── tests/
├── benchmarks/
│ ├── vs_ai_scientist.py
│ ├── vs_ai_researcher.py
│ └── vs_autoresearch.py
├── docs/
│ ├── architecture.md
│ ├── waste-analysis.md # The 5 patterns paper
│ └── token-economics.md # Formal framework
├── pyproject.toml
├── LICENSE # MIT
└── README.md
This project is accompanied by a research paper:
"Token Economics: Cost-Optimal Autonomous AI Research via Progressive Compression and Confidence-Gated Routing"
Key contributions:
- Formal framework for token economics in multi-model agent systems
- Progressive compression pipeline for large local token reduction before the API
- Confidence-gated routing that avoids the API calls local models can handle
- UCB1-based budget-optimal exploration with convergence guarantees
- A methodology for evaluating research quality per dollar across local and API models
Because the best research happens while you sleep. Set your budget, point it at a problem, go to bed. Wake up to results.
NightShift is MIT licensed and open to contributions. See CONTRIBUTING.md.
Santo -- I build the infrastructure most engineers import.
MIT