A hierarchical router that answers each prompt with the cheapest, lowest-energy engine that can do the job — sending trivial queries to deterministic/local handlers and reserving large LLMs only for prompts that genuinely need them. The result: lower latency, lower cost, and less compute/carbon per query.
🔗 Live demo · Frontend on Vercel · Backend on Google Cloud Run
Most apps send every prompt to a large model, even "what is the capital of France?" EcoPrompt asks a different question first: what is the smallest engine that can answer this correctly? — then routes accordingly, and measures the energy and cost it saved.
Live metrics dashboard — over a 25-prompt sample, 96% of traffic was answered without a paid cloud LLM:
The route-distribution and per-route latency/energy charts make the core idea visible: cheap local tiers handle the bulk of traffic, while the heavy groq route is rare but accounts for nearly all the latency and energy — exactly the cost EcoPrompt is built to avoid.
Routing in action — each response is tagged with the route that answered it:
[Screenshots: Local knowledge + code generation | Grounded web search for fresh facts]
A simple definition is answered by the local KB-reasoned tier; a code request is served by the local model; and a real-time question ("who won the champions league?") escalates to the web search tier with cited sources.
Each incoming prompt is scored for complexity and pushed through a cascade of routes, cheapest first. It only escalates to a paid LLM when the cheaper tiers can't answer confidently.
| Tier | Route | Engine | Cost / Energy |
|---|---|---|---|
| 1 | `deterministic` | Rule/lookup engine (math, geography, exact facts) | ~0 |
| 2 | `kb_reasoned_local` / `rag_local` | Local knowledge base + lightweight RAG retrieval | ~0 |
| 3 | `template_engine` | Code-template responder for common programming asks | ~0 |
| 4 | `local` | Groq Llama 3.1 8B Instant (small, fast) | low |
| 5 | `groq` | Groq Llama 3 70B (heavier reasoning) | higher |
| 6 | `web` | Gemini grounded web search (fresh / real-time facts) | highest |
A response from a cheaper tier is sanity-checked (entity coverage, weak-answer and truncation detection); if it looks weak, EcoPrompt escalates to the next tier instead of returning a bad answer.
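The cheapest-first cascade with escalation on weak answers can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the `Tier` dataclass, the `looks_weak` heuristic, and the toy handlers are all hypothetical stand-ins, and the real sanity checks (entity coverage, weak-answer and truncation detection) are richer than the two conditions shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical handler type: returns an answer, or None if the tier cannot answer.
Handler = Callable[[str], Optional[str]]


@dataclass
class Tier:
    route: str        # route label attached to the response
    handler: Handler  # engine behind this tier


def looks_weak(answer: Optional[str], min_chars: int = 3) -> bool:
    """Illustrative stand-in for the sanity checks described above."""
    if answer is None:
        return True
    text = answer.strip()
    if len(text) < min_chars:
        return True
    # Crude truncation heuristic: the answer trails off with an ellipsis.
    return text.endswith("...")


def route_prompt(prompt: str, tiers: list[Tier]) -> tuple[str, str]:
    """Try tiers cheapest-first; return (route, answer) from the first acceptable one."""
    for tier in tiers:
        answer = tier.handler(prompt)
        if not looks_weak(answer):
            return tier.route, answer
    raise RuntimeError("no tier produced an acceptable answer")


# Toy handlers standing in for the real engines.
def deterministic(prompt: str) -> Optional[str]:
    facts = {"what is the capital of france?": "Paris"}
    return facts.get(prompt.strip().lower())


def big_model(prompt: str) -> Optional[str]:
    return f"[LLM answer to: {prompt}]"


TIERS = [Tier("deterministic", deterministic), Tier("groq", big_model)]

print(route_prompt("What is the capital of France?", TIERS))
print(route_prompt("Explain CAP theorem trade-offs", TIERS))
```

The first prompt is answered by the `deterministic` tier; the second falls through to `groq` because the cheaper handler returns `None`. Ordering the list by cost is what makes "escalate" simply mean "move to the next element".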
The local tiers are backed by curated KB modules under kb/ — geography, math, science (physics / chemistry / biology), history, programming, and high-level concepts — plus a small RAG engine (rag_engine.py) for semantic lookup. These answer a large share of everyday prompts with zero LLM calls.
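As a rough illustration of how a local tier can answer with zero LLM calls, the sketch below matches a prompt against KB entries by keyword overlap. It is an assumption-laden toy: the `KB` dictionary, `tokenize`, and `kb_lookup` are hypothetical names, and plain token overlap is swapped in here for whatever semantic lookup `rag_engine.py` actually performs.

```python
import re

# Tiny illustrative stop-word list; the project's real KB tokenizer may differ.
STOP_WORDS = {"the", "a", "an", "is", "of", "what", "in"}


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


# Hypothetical KB entries keyed by keywords; the real modules live under kb/.
KB = {
    "capital france": "Paris is the capital of France.",
    "boiling point water": "Water boils at 100 °C at sea-level pressure.",
}


def kb_lookup(prompt: str, min_overlap: float = 0.5):
    """Return the best-matching KB answer, or None so the router can escalate."""
    query = tokenize(prompt)
    best_answer, best_score = None, 0.0
    for key, answer in KB.items():
        key_tokens = tokenize(key)
        # Fraction of the entry's keywords that appear in the prompt.
        score = len(query & key_tokens) / len(key_tokens)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= min_overlap else None


print(kb_lookup("What is the capital of France?"))
print(kb_lookup("Who won the champions league?"))
```

Returning `None` below the threshold, rather than the least-bad match, is what lets the router escalate instead of returning a weak answer.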
Every request records latency, estimated energy (kWh), and estimated cost per route, exposed at /metrics and visualized in the dashboard. Baselines used for comparison:
- GPT-4o: ~$4.00 / 1M tokens
- Groq Llama 3 70B: ~$0.70 / 1M tokens
- Electricity: ₹8.00 / kWh (India avg)
A note on the numbers (honesty matters). The energy and CO₂ figures are estimates, not hardware measurements. Energy is modeled as `latency × assumed power draw`, and cost is derived from the published per-token prices above. They're meant to illustrate the relative savings of routing cheap-first, not to be billed against. The one thing measured directly is the cloud-avoidance rate (the share of prompts answered without a paid LLM call), which is the metric that actually drives the savings.
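The estimates described above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names, the 300 W power draw, the choice of `groq` and `web` as the "paid" routes, and the sample route log are all assumptions made for the example, while the prices and electricity rate are the baselines listed above.

```python
# Baselines from the list above.
PRICE_USD_PER_MTOK = {"gpt-4o": 4.00, "groq-llama3-70b": 0.70}
ELECTRICITY_INR_PER_KWH = 8.00

# Assumption: which routes count as a paid cloud LLM call.
PAID_ROUTES = {"groq", "web"}


def energy_kwh(latency_s: float, power_w: float) -> float:
    """Energy = latency x assumed power draw, converted from watt-seconds to kWh."""
    return latency_s * power_w / 3_600_000  # 1 kWh = 3.6e6 joules


def token_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost derived from a published per-million-token price."""
    return tokens / 1_000_000 * price_per_mtok


def cloud_avoidance_rate(routes: list[str]) -> float:
    """Share of prompts answered without a paid LLM call."""
    avoided = sum(1 for r in routes if r not in PAID_ROUTES)
    return avoided / len(routes)


# A 2-second request at a hypothetical 300 W draw.
e = energy_kwh(2.0, 300.0)
print(f"{e:.6f} kWh, about ₹{e * ELECTRICITY_INR_PER_KWH:.6f} of electricity")

# 1,000 tokens priced at each baseline.
for model, price in PRICE_USD_PER_MTOK.items():
    print(model, f"${token_cost_usd(1000, price):.4f}")

# Made-up route log (not the real sample): 1 paid route out of 25 prompts.
sample = ["deterministic"] * 24 + ["groq"]
print(f"cloud avoidance: {cloud_avoidance_rate(sample):.0%}")
```

Because the energy figure scales linearly with the assumed power draw, changing that one constant rescales every estimate; the avoidance rate does not depend on it at all, which is why it is the more trustworthy number.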
- Backend: Python, FastAPI, Uvicorn · Groq (Llama 3.1 8B / Llama 3 70B) · Google Gemini (grounded search) · custom deterministic + RAG engines
- Frontend: React + Vite + Tailwind CSS · Recharts (metrics dashboard) · react-markdown + syntax highlighting · Axios
| Method | Endpoint | Description |
|---|---|---|
| POST | `/generate` | Route a prompt and return the answer + chosen route |
| POST | `/generate-stream` | Same, streamed token-by-token |
| GET | `/metrics` | Aggregate latency / energy / cost per route |
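A client call to `/generate` might look like the sketch below, using only the standard library. The request schema is not documented here, so the `prompt` field name and the JSON response are assumptions; check `main.py` for the actual request and response models. The base URL assumes a backend running locally.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local backend; adjust for a deployed instance


def build_generate_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /generate request. The 'prompt' field name is an assumption."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> dict:
    """Send the request and decode the JSON reply (requires a running backend)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=30) as resp:
        return json.load(resp)


# Inspect the request without sending it.
req = build_generate_request("What is the capital of France?")
print(req.get_method(), req.full_url)
```

Calling `generate("What is the capital of France?")` against a running backend should return the answer together with the chosen route, per the endpoint description above.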
```bash
pip install -r requirements.txt
cp .env.example .env   # then fill in your keys
uvicorn main:app --reload
```

Backend runs at http://localhost:8000.
```bash
cd frontend
npm install
npm run dev
```

Frontend runs at http://localhost:5173.
See .env.example. You'll need:
- `GROQ_API_KEY`: from console.groq.com (powers the local/large LLM tiers)
- `GEMINI_API_KEY`: from Google AI Studio (powers the grounded web-search tier)
Offline unit tests cover the routing-decision logic (complexity scoring, the simple-prompt fast path, token budgeting, source-host matching, energy estimate) and the KB tokenizer. They make no API calls.
```bash
python -m unittest discover -s tests -v
```

CI runs them on every push via GitHub Actions (see the badge above).
```
main.py              FastAPI app: routing cascade, metrics, streaming
deterministic.py     Tier-1 rule/lookup engine
kb/                  Knowledge-base lookups + RAG engine (geography, math, science, …)
tests/               Offline unit tests for routing + tokenizer logic
diagrams/            Architecture diagrams (.png/.svg/.mmd)
frontend/            Vite + React + Tailwind UI and metrics dashboard
.github/workflows/   CI (runs the test suite)
```
- Pluggable model backends beyond Groq/Gemini
- Configurable routing policy / thresholds
- Per-user energy & cost reports
Built by K Jayarama Das.




