ALMA API Cost Testing

A Python test harness for comparing real-world API costs and response quality across major AI models, built to inform model selection for ALMA — a conversational AI companion app.

Why This Exists

Choosing the right model for a conversational AI companion involves tradeoffs between cost, quality, and tone. This repo runs real multi-turn conversations against multiple models and produces reports that make those tradeoffs visible with hard numbers.

What It Does

  1. Runs scripted multi-turn conversations against 15+ models across 5 providers
  2. Tracks token usage and calculates real dollar costs per conversation
  3. Captures response quality via both LLM-as-judge and human rating
  4. Generates markdown reports with cost/quality comparisons (convertible to PDF/Word)

Models Tested

| Provider    | Model            | Input / Output (per 1M tokens) |
|-------------|------------------|--------------------------------|
| Anthropic   | Claude Opus 4    | $15.00 / $75.00 |
| Anthropic   | Claude Sonnet 4  | $3.00 / $15.00 |
| Anthropic   | Claude Haiku 3.5 | $0.80 / $4.00 |
| OpenAI      | GPT-4o           | $2.50 / $10.00 |
| OpenAI      | GPT-4o-mini      | $0.15 / $0.60 |
| OpenAI      | GPT-4.1          | $2.00 / $8.00 |
| OpenAI      | GPT-4.1-mini     | $0.40 / $1.60 |
| OpenAI      | GPT-4.1-nano     | $0.10 / $0.40 |
| Google      | Gemini 2.5 Pro   | $1.25 / $10.00 |
| Google      | Gemini 2.5 Flash | $0.15 / $0.60 |
| Together AI | Llama 3.3 70B    | $0.88 / $0.88 |
| Together AI | Llama 3 8B       | $0.10 / $0.10 |
| Together AI | Qwen 2.5 7B      | $0.30 / $0.30 |

Note: Pricing shown is approximate as of April 2025, sourced from each provider's public pricing page (Anthropic, OpenAI, Google, Together AI). Pricing changes frequently — verify against provider pages and update config/models.yaml before each test run.
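
For reference, per-1M-token pricing turns into a per-conversation figure as sketched below. This is a minimal illustration, not the actual implementation (that lives in src/cost.py and may be structured differently), and the token counts in the example are hypothetical:

def conversation_cost(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one conversation from total token counts and per-1M-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost

# Example with hypothetical token counts: an 8-turn chat that accumulates
# ~6,000 input and ~2,000 output tokens on Claude Sonnet 4 ($3.00 / $15.00):
# conversation_cost(6_000, 2_000, 3.00, 15.00) = 0.018 + 0.030 ≈ $0.05,
# roughly the Sonnet 4 per-conversation average measured below.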

Cost Per Conversation (Real Test Results)

Based on a full test run across 4 multi-turn ALMA conversation scenarios (casual chat, emotional support, daily check-in, deep conversation). Quality was rated by Claude Sonnet 4 acting as LLM judge on a 1-10 scale.

| Model | Avg Cost/Convo | Quality | Monthly @ 10K DAU | Monthly @ 100K DAU |
|-------|----------------|---------|-------------------|--------------------|
| GPT-4.1 Nano | $0.0009 | 8.5/10 | $850 | $8.5K |
| Llama 3 8B | $0.0009 | 7.8/10 | $855 | $8.5K |
| GPT-4o Mini | $0.0012 | 8.0/10 | $1.1K | $11.1K |
| Gemini 2.5 Flash | $0.0019 | 8.8/10 | $1.7K | $17.1K |
| Qwen 2.5 7B | $0.0029 | 6.5/10 | $2.6K | $25.9K |
| GPT-4.1 Mini | $0.0037 | 8.5/10 | $3.3K | $33.2K |
| Gemini 2.5 Pro | $0.0087 | 9.0/10 | $7.9K | $78.6K |
| Llama 3.3 70B | $0.0089 | 8.2/10 | $8.0K | $80.3K |
| Claude Haiku 3.5 | $0.01 | 8.8/10 | $10.5K | $105.4K |
| GPT-4o | $0.02 | 8.0/10 | $17.8K | $178.2K |
| GPT-4.1 | $0.02 | 8.8/10 | $20.3K | $203.0K |
| Claude Sonnet 4 | $0.05 | 8.8/10 | $40.6K | $406.1K |
| Claude Opus 4 | $0.23 | 9.0/10 | $210.1K | $2.1M |

Projections assume 3 conversations per user per day, 30 days/month. Last run: April 2026.
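
The projection arithmetic is a straight multiplication; here is a minimal sketch (the table figures were presumably computed from unrounded per-conversation costs, so this won't reproduce every row to the dollar):

def monthly_cost(cost_per_convo: float, dau: int,
                 convos_per_user_per_day: int = 3, days_per_month: int = 30) -> float:
    """Project monthly spend from a measured per-conversation cost."""
    return cost_per_convo * convos_per_user_per_day * days_per_month * dau

# Gemini 2.5 Flash at ~$0.0019/conversation and 10K DAU:
# monthly_cost(0.0019, 10_000) = 0.0019 * 3 * 30 * 10_000 = $1,710/month,
# i.e. the $1.7K shown in the table.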

Key findings: GPT-4.1 Nano offers the best value (94% of the top quality score at roughly 11% of the cost of Gemini 2.5 Pro, the cheapest top-scoring model). Gemini 2.5 Flash is the sweet spot for quality-conscious budgets (8.8/10 at $0.002/convo). Claude Opus 4 and Gemini 2.5 Pro tie for highest quality (9.0/10) but at very different price points. See results/latest/final_report.md for the full analysis.

Project Structure

alma-api-test/
├── config/
│   ├── models.yaml          # Model definitions and pricing
│   ├── scenarios/            # Conversation test scripts
│   │   ├── casual_chat.yaml
│   │   ├── emotional_support.yaml
│   │   ├── daily_checkin.yaml
│   │   └── deep_conversation.yaml
│   ├── system_prompts/       # System prompt variations
│   │   └── alma_default.txt
│   └── defaults.yaml         # Default test parameters
├── src/
│   ├── clients/              # API client wrappers per provider
│   │   ├── base.py
│   │   ├── anthropic.py
│   │   ├── openai_client.py
│   │   ├── google.py
│   │   └── together.py
│   ├── runner.py             # Test execution engine
│   ├── cost.py               # Cost calculator
│   ├── judge.py              # LLM-as-judge quality scorer
│   ├── judge_cli.py          # Judge CLI entry point
│   └── report.py             # Markdown report generator
├── results/                  # Test run outputs (JSON + markdown)
├── tests/                    # Unit tests
├── .env.example              # Required API keys
├── pyproject.toml
├── CLAUDE.md
└── README.md

Quick Start

# 1. Clone and install
git clone <repo-url>
cd alma-api-test
uv sync --extra dev

# 2. Set up API keys
cp .env.example .env
# Edit .env with your keys

# 3. Run a test
uv run python -m src.runner --scenario casual_chat --models claude-sonnet-4 gpt-4o

# 4. Generate report
uv run python -m src.report --run-dir results/latest
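
The .env file needs one API key per provider you plan to test. The variable names below follow each SDK's usual conventions and are an assumption here; check .env.example for the names this repo actually reads:

ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
GOOGLE_API_KEY=...
TOGETHER_API_KEY=...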

Configuration

Test Parameters (adjustable between runs)

| Parameter | Description | Default |
|-----------|-------------|---------|
| model | Which model(s) to test | all |
| temperature | Sampling temperature | 0.7 |
| max_tokens | Max response tokens | 1024 |
| system_prompt | ALMA's personality prompt | see defaults.yaml |
| scenario | Conversation script to run | casual_chat |
| turns | Number of conversation turns | 8 |
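
These defaults live in config/defaults.yaml; a file along the following lines covers the parameters above (the exact schema may differ, so treat this as an illustrative sketch):

model: all
temperature: 0.7
max_tokens: 1024
system_prompt: config/system_prompts/alma_default.txt
scenario: casual_chat
turns: 8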

Adding a Scenario

Create a YAML file in config/scenarios/ following this structure:

name: casual_chat
description: "Casual daily conversation with ALMA"
turns:
  - role: user
    content: "Hey ALMA, how's it going?"
  - role: user
    content: "I had a rough day at work today..."
  # ... more turns
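
Then pass the new scenario's name to the runner via the --scenario flag shown in Quick Start, e.g. uv run python -m src.runner --scenario my_scenario (my_scenario is a placeholder, assuming the runner resolves scenarios by name).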

Reports

Reports are generated as markdown files in results/ and include:

  • Cost per conversation by model
  • Cost projections at various user scales (1K, 10K, 100K daily active users)
  • Token usage breakdown (input vs output)
  • Quality scores (LLM judge + human ratings)
  • Recommendations based on cost/quality ratio
