Open source evaluation framework: accuracy + cost + latency + hallucination #1675

vignesh2027 · 2026-06-08T05:39:12Z

Hey OpenAI Evals community!

OpenAI Evals is fantastic for task-specific evaluation. For teams who also need production metrics alongside task accuracy, I built a complementary open source framework.

What it adds beyond task accuracy:

Cost per 1K tokens (from real token counts, not estimates)
Latency p50/p95/p99 (async parallel, realistic concurrency)
Hallucination Rate (linguistic signal analysis, no judge needed)
Reasoning Quality (CoT depth score 1-10)
Accuracy (MMLU + TruthfulQA + custom benchmark support)
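For readers who want to see what the first two metrics involve, here is a minimal sketch of how cost per 1K tokens and latency percentiles could be derived from logged measurements. The function names and input shapes are assumptions for illustration, not the framework's actual API, and the percentile method (linear interpolation via the standard library) is one reasonable choice among several.

```python
import statistics


def cost_per_1k_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """USD per 1,000 tokens, computed from measured totals (hypothetical helper)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1000


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from per-request latencies (hypothetical helper).

    statistics.quantiles with n=100 returns 99 cut points; index k-1 is the
    k-th percentile. method="inclusive" treats the sample as the full
    population, so the results never fall outside the observed range.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tail percentiles such as p99 are only meaningful with enough samples; with a few dozen requests the p99 value is effectively the slowest request.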

Key insight from running this:
GPT-4o-mini vs Gemini Flash: 78.4% vs 76.8% accuracy, but $0.0003 vs $0.0001 per 1K tokens. For production at scale, that 1.6-point accuracy gap rarely justifies the 3x cost difference.
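The trade-off in those figures can be checked with a few lines of arithmetic. This sketch uses only the four numbers quoted above; the variable names and the per-million-token projection are illustrative additions.

```python
# Figures quoted in the post (accuracy in percent, cost in USD per 1K tokens).
acc_gpt4o_mini, acc_gemini_flash = 78.4, 76.8
cost_gpt4o_mini, cost_gemini_flash = 0.0003, 0.0001

# Accuracy gap in percentage points, rounded to one decimal place.
gap_points = round(acc_gpt4o_mini - acc_gemini_flash, 1)

# How many times more expensive the pricier model is per token.
cost_ratio = cost_gpt4o_mini / cost_gemini_flash

# Extra spend per 1M tokens when choosing the pricier model (1M = 1000 x 1K).
extra_usd_per_million = (cost_gpt4o_mini - cost_gemini_flash) * 1000

print(f"accuracy gap: {gap_points} points")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"extra cost per 1M tokens: ${extra_usd_per_million:.2f}")
```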

Live demo (no API key needed): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

71 tests, 82% coverage, full CI/CD. Open source, free forever.
Task evaluation (OpenAI Evals) + production metrics (this) = complete evaluation stack.

richardchen874-sys · 2026-06-14T05:49:25Z

This is a useful direction. For production AI apps, accuracy alone is rarely enough — cost, latency, hallucination rate, and stability under real usage matter a lot.

One metric I’d also consider adding is cost-per-successful-task, not just cost per 1K tokens.

For example, a cheaper model may look better on token price, but if it needs more retries, longer prompts, or fallback calls, the real cost can be higher.
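To make that concrete, here is a rough sketch of how cost-per-successful-task could be computed. The `TaskRun` record, the field names, and the dollar amounts are all hypothetical, and "retry rate" is taken here to mean the share of tasks that needed more than one call, which is only one possible definition.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One task attempt chain (hypothetical record shape)."""
    succeeded: bool
    call_costs_usd: list[float]  # one entry per API call, including retries and fallbacks


def cost_per_successful_task(runs: list[TaskRun]) -> float:
    """Total spend across all runs divided by the number of successful tasks."""
    total = sum(sum(r.call_costs_usd) for r in runs)
    successes = sum(1 for r in runs if r.succeeded)
    # No successes means the cost of a success is unbounded.
    return total / successes if successes else float("inf")


def retry_rate(runs: list[TaskRun]) -> float:
    """Fraction of tasks that needed more than one call (assumed definition)."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if len(r.call_costs_usd) > 1) / len(runs)


# Made-up numbers: the cheaper model costs half as much per call but retries and fails more.
cheap = [
    TaskRun(True, [0.001, 0.001, 0.001]),
    TaskRun(False, [0.001]),
    TaskRun(True, [0.001, 0.001]),
]
premium = [TaskRun(True, [0.002]) for _ in range(3)]

print(cost_per_successful_task(cheap), cost_per_successful_task(premium))
```

In this invented scenario the model with the lower per-call price ends up with the higher cost per successful task, which is the effect described above.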

I’m especially interested in comparing OpenAI-compatible models across:

accuracy
latency
cost per task
retry rate
hallucination rate
stability under longer sessions

This kind of evaluation would be very useful for small AI SaaS teams and indie builders choosing between premium models and lower-cost alternatives like DeepSeek / BytePlus.
