Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.
40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.
cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.
```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

cascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It is based on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models.
Use cascadeflow for:
- Cost Optimization. Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.
- Cost Control and Transparency. Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.
- Low Latency & Speed Optimization. Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving 2-10x latency reduction overall (use the `PRESET_ULTRA_FAST` preset).
- Multi-Provider Flexibility. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and Hugging Face with automatic provider detection and zero vendor lock-in. Optional LiteLLM integration for 100+ additional providers.
- Edge & Local-Hosted AI Deployment. Get the best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed.
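Spending caps like the ones mentioned above can be approximated in a few lines even without a framework. The sketch below is a plain-Python illustration of the idea only; the `BudgetGuard` class and its method names are hypothetical and not part of the cascadeflow API.

```python
class BudgetGuard:
    """Hypothetical sketch: tracks cumulative spend and refuses
    calls that would exceed a configurable cap."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a completed call's cost."""
        self.spent_usd += cost_usd

    def allows(self, estimated_cost_usd: float) -> bool:
        """Return True if the estimated call still fits the budget."""
        return self.spent_usd + estimated_cost_usd <= self.max_usd


guard = BudgetGuard(max_usd=0.01)
guard.charge(0.004)
print(guard.allows(0.005))  # True: 0.004 + 0.005 <= 0.01
print(guard.allows(0.007))  # False: would exceed the cap
```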
ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper
cascadeflow uses speculative execution with quality validation:
- Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
- Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
- Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
- Learns patterns to optimize future cascading decisions and domain specific routing
Zero configuration. Works with YOUR existing models (7 providers currently supported).
In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.
Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
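Conceptually, the four steps above reduce to a simple loop. The sketch below is an illustrative model of speculative execution with a quality gate, not cascadeflow's actual implementation; the toy models and the `validate` check are stand-ins.

```python
def cascade(query, models, validate):
    """Try models from cheapest to most expensive; return the first
    response that passes validation, else the last response."""
    response = None
    for model in models:            # ordered cheap -> expensive
        response = model(query)     # speculative draft
        if validate(response):      # quality gate
            return response         # accept: no escalation needed
    return response                 # flagship answer as final fallback

# Toy stand-ins: a "small" model that only knows short facts,
# and a "large" model that always answers.
small = lambda q: "4" if q == "What's 2+2?" else ""
large = lambda q: f"Detailed answer to: {q}"
validate = lambda r: len(r) > 0     # e.g. a completeness check

print(cascade("What's 2+2?", [small, large], validate))    # handled by small
print(cascade("Explain monads", [small, large], validate)) # escalated to large
```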
```
┌─────────────────────────────────────────────────────────────────┐
│                        cascadeflow Stack                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Cascade Agent                       │  │
│  │                                                           │  │
│  │  Orchestrates the entire cascade execution                │  │
│  │  • Query routing & model selection                        │  │
│  │  • Drafter -> Verifier coordination                       │  │
│  │  • Cost tracking & telemetry                              │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      Domain Pipeline                      │  │
│  │                                                           │  │
│  │  Automatic domain classification                          │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)          │  │
│  │  • Optional ML semantic classification                    │  │
│  │  • Domain-optimized pipelines & model selection           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Quality Validation Engine                 │  │
│  │                                                           │  │
│  │  Multi-dimensional quality checks                         │  │
│  │  • Length validation (too short/verbose)                  │  │
│  │  • Confidence scoring (logprobs analysis)                 │  │
│  │  • Format validation (JSON, structured output)            │  │
│  │  • Semantic alignment (intent matching)                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              Cascading Engine (<2ms overhead)             │  │
│  │                                                           │  │
│  │  Smart model escalation strategy                          │  │
│  │  • Try cheap models first (speculative execution)         │  │
│  │  • Validate quality instantly                             │  │
│  │  • Escalate only when needed                              │  │
│  │  • Automatic retry & fallback                             │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                Provider Abstraction Layer                 │  │
│  │                                                           │  │
│  │  Unified interface for 7+ providers                       │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                     │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
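The Quality Validation Engine's length and format checks can be illustrated with plain Python. The function below is a hypothetical sketch, not the library's validator, and the thresholds are arbitrary.

```python
import json

def validate_response(text: str, min_words: int = 3, max_words: int = 300,
                      expect_json: bool = False) -> bool:
    """Illustrative multi-dimensional quality gate: length bounds plus
    an optional structured-output (JSON) format check."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False                  # too short or too verbose
    if expect_json:
        try:
            json.loads(text)          # format check for structured output
        except ValueError:
            return False
    return True

print(validate_response("Paris is the capital of France."))  # True
print(validate_response("Paris"))                            # False (too short)
print(validate_response('{"city": "Paris"}', min_words=1, expect_json=True))  # True
```

A real engine would add confidence scoring over logprobs and semantic alignment, as the diagram shows, but the accept/reject shape is the same.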
```bash
pip install cascadeflow[all]
```

```python
from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try the cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])

# Run a query - automatically routes to the optimal model
result = await agent.run("What's the capital of France?")

print(f"Answer: {result.content}")
print(f"Model used: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
```

💡 Optional: Use ML-based Semantic Quality Validation
For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.
Step 1: Install the optional ML package:
```bash
pip install cascadeflow[ml]  # Adds semantic similarity via FastEmbed (~80MB model)
```

Step 2: Use semantic quality validation:
```python
from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize the semantic checker (downloads the model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7,    # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."
result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")
```

What you get:
- 🎯 Semantic similarity scoring (query ↔ response alignment)
- 🛡️ Optional toxicity detection
- 🔄 Automatic model download and caching
- 🚀 Fast inference (~100ms per check)
Full example: See semantic_quality_domain_detection.py
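Under the hood, semantic alignment scoring reduces to cosine similarity between embedding vectors. The sketch below uses hand-made 3-d vectors in place of real embeddings to show how a similarity threshold of 0.5 separates on-topic from off-topic responses; it is a conceptual illustration, not the checker's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings" standing in for real model output.
query_vec = [0.9, 0.1, 0.0]
on_topic = [0.8, 0.2, 0.1]    # similar direction -> high score
off_topic = [0.0, 0.1, 0.9]   # different direction -> low score

threshold = 0.5
print(cosine_similarity(query_vec, on_topic) > threshold)   # True
print(cosine_similarity(query_vec, off_topic) > threshold)  # False
```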
⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without it, since GPT-5 is only called when needed (typically for 20-30% of requests).
📖 Learn more: Python Documentation | Quickstart Guide | Providers Guide
```bash
npm install @cascadeflow/core
```

```typescript
import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');

console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);
```

💡 Optional: ML-based Semantic Quality Validation
For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.
Step 1: Install the optional ML packages:
```bash
npm install @cascadeflow/ml @xenova/transformers
```

Step 2: Enable semantic validation in your cascade:
```typescript
import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,              // Traditional confidence threshold
    requireMinimumTokens: 5,      // Minimum response length
    useSemanticValidation: true,  // Enable ML validation
    semanticThreshold: 0.5,       // 50% minimum similarity
  },
});

// Responses are now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');
```

Step 3: Or use semantic validation directly:
```typescript
import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );
  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}
```

What you get:
- 🎯 Query-response semantic alignment detection
- 🚫 Off-topic response filtering
- 📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
- ⚡ Fast CPU inference (~50-100ms with caching)
- 🔄 Request-scoped caching (50% latency reduction)
- 🌐 Works in Node.js, Browser, and Edge Functions
Example: semantic-quality.ts
📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples | Browser/Edge Guide
Migrate in 5 minutes from a direct provider integration to cascadeflow, and gain cost savings with full cost control and transparency.
Before (cost: $0.000113, latency: 850ms):

```python
# Using the expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)
```

After (cost: $0.000007, latency: 234ms):
```python
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])
result = await agent.run("What's 2+2?")
```

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
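The savings figures follow directly from the per-query numbers quoted above; the snippet below reproduces the arithmetic.

```python
# Per-query numbers from the before/after comparison above
before_cost, after_cost = 0.000113, 0.000007
before_ms, after_ms = 850, 234

saved = before_cost - after_cost          # absolute savings per query
reduction_pct = saved / before_cost * 100 # percentage reduction
speedup = before_ms / after_ms            # latency improvement factor

print(f"Saved: ${saved:.6f} ({reduction_pct:.0f}% reduction), {speedup:.1f}x faster")
# → Saved: $0.000106 (94% reduction), 3.6x faster
```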
📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization
Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!
- Open n8n
- Go to Settings → Community Nodes
- Search for `@cascadeflow/n8n-nodes-cascadeflow`
- Click Install
Create a workflow:
Manual Trigger → cascadeflow Node → Set Node
Configure the cascadeflow node:
- Draft Model: `gpt-4o-mini` ($0.000375)
- Verifier Model: `gpt-4o` ($0.00625)
- Message: your prompt
- Output: Full Metrics
Result: 40-85% cost savings in your n8n workflows!
Features:
- ✅ Visual workflow integration
- ✅ Multi-provider support
- ✅ Cost tracking in workflow
- ✅ Tool calling support
- ✅ Easy debugging with metrics
🔌 Learn more: n8n Integration Guide | n8n Documentation
Basic Examples - Get started quickly
| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup with OpenAI models | View |
| Preset Usage | Use built-in presets for quick setup | View |
| Multi-Provider | Mix multiple AI providers in one cascade | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | View |
| Tool Execution | Function calling and tool usage | View |
| Streaming Text | Stream responses from cascade agents | View |
| Cost Tracking | Track and analyze costs across queries | View |
Advanced Examples - Production & customization
| Example | Description | Link |
|---|---|---|
| Production Patterns | Best practices for production deployments | View |
| FastAPI Integration | Integrate cascades with FastAPI | View |
| Streaming Tools | Stream tool calls and responses | View |
| Batch Processing | Process multiple queries efficiently | View |
| Multi-Step Cascade | Build complex multi-step cascades | View |
| Edge Device | Run cascades on edge devices with local models | View |
| vLLM Example | Use vLLM for local model deployment | View |
| Custom Cascade | Build custom cascade strategies | View |
| Custom Validation | Implement custom quality validators | View |
| User Budget Tracking | Per-user budget enforcement and tracking | View |
| User Profile Usage | User-specific routing and configurations | View |
| Rate Limiting | Implement rate limiting for cascades | View |
| Guardrails | Add safety and content guardrails | View |
| Cost Forecasting | Forecast costs and detect anomalies | View |
| Semantic Quality Detection | ML-based domain and quality detection | View |
| Profile Database Integration | Integrate user profiles with databases | View |
Basic Examples - Get started quickly
| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup (Node.js) | View |
| Tool Calling | Function calling with tools (Node.js) | View |
| Multi-Provider | Mix providers in TypeScript (Node.js) | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | View |
| Cost Tracking | Track and analyze costs across queries | View |
| Semantic Quality | ML-based semantic validation with embeddings | View |
| Streaming | Stream responses in TypeScript | View |
Advanced Examples - Production & edge deployment
| Example | Description | Link |
|---|---|---|
| Production Patterns | Production best practices (Node.js) | View |
| Browser/Edge | Vercel Edge runtime example | View |
📂 View All Python Examples → | View All TypeScript Examples →
Getting Started - Core concepts and basics
| Guide | Description | Link |
|---|---|---|
| Quickstart | Get started with cascadeflow in 5 minutes | Read |
| Providers Guide | Configure and use different AI providers | Read |
| Presets Guide | Using and creating custom presets | Read |
| Streaming Guide | Stream responses from cascade agents | Read |
| Tools Guide | Function calling and tool usage | Read |
| Cost Tracking | Track and analyze API costs | Read |
Advanced Topics - Production, customization & integrations
| Guide | Description | Link |
|---|---|---|
| Production Guide | Best practices for production deployments | Read |
| Performance Guide | Optimize cascade performance and latency | Read |
| Custom Cascade | Build custom cascade strategies | Read |
| Custom Validation | Implement custom quality validators | Read |
| Edge Device | Deploy cascades on edge devices | Read |
| Browser Cascading | Run cascades in the browser/edge | Read |
| FastAPI Integration | Integrate with FastAPI applications | Read |
| n8n Integration | Use cascadeflow in n8n workflows | Read |
| Feature | Benefit |
|---|---|
| 🎯 Speculative Cascading | Tries cheap models first, escalates intelligently |
| 💰 40-85% Cost Savings | Research-backed, proven in production |
| ⚡ 2-10x Faster | Small models respond in <50ms vs 500-2000ms |
| ⚡ Low Latency | Sub-2ms framework overhead, negligible performance impact |
| 🔄 Mix Any Providers | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) |
| 👤 User Profile System | Per-user budgets, tier-aware routing, enforcement callbacks |
| ✅ Quality Validation | Automatic checks + semantic similarity (optional ML, ~80MB, CPU) |
| 🎨 Cascading Policies | Domain-specific pipelines, multi-step validation strategies |
| 🧠 Domain Understanding | Auto-detects code/medical/legal/math/structured data, routes to specialists |
| 🤖 Drafter/Validator Pattern | 20-60% savings for agent/tool systems |
| 🔧 Tool Calling Support | Universal format, works across all providers |
| 📊 Cost Tracking | Built-in analytics + OpenTelemetry export (vendor-neutral) |
| 🚀 3-Line Integration | Zero architecture changes needed |
| 🏭 Production Ready | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection |
MIT License. See the LICENSE file.
Free for commercial use. Attribution appreciated but not required.
We ❤️ contributions!
📝 Contributing Guide - Python & TypeScript development setup
- Cascade Profiler - Analyzes your AI API logs to calculate cost savings potential and generate optimized cascadeflow configurations automatically
- User Tier Management - Cost controls and limits per user tier with advanced routing
- Semantic Quality Validators - Optional lightweight local quality scoring (200MB CPU model, no external API calls)
- Code Complexity Detection - Dynamic cascading based on task complexity analysis
- Domain Aware Cascading - Multi-stage pipelines tailored to specific domains
- Benchmark Reports - Automated performance and cost benchmarking
- 📖 GitHub Discussions - Searchable Q&A
- 🐛 GitHub Issues - Bug reports & feature requests
- 📧 Email Support - Direct support
If you use cascadeflow in your research or project, please cite:
```bibtex
@software{cascadeflow2025,
  author    = {Lemony Inc., Sascha Buehrle and Contributors},
  title     = {cascadeflow: Smart AI model cascading for cost optimization},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/lemony-ai/cascadeflow}
}
```

Ready to cut your AI costs by 40-85%?

```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions
Built with ❤️ by Lemony Inc. and the cascadeflow Community
One cascade. Hundreds of specialists.
New York | Zurich
⭐ Star us on GitHub if cascadeflow helps you save money!