A Python test harness for comparing real-world API costs and response quality across major AI models, built to inform model selection for ALMA — a conversational AI companion app.
Choosing the right model for a conversational AI companion involves tradeoffs between cost, quality, and tone. This repo runs real multi-turn conversations against multiple models and produces reports that make those tradeoffs visible with hard numbers.
- Runs scripted multi-turn conversations against 13 models across 4 providers
- Tracks token usage and calculates real dollar costs per conversation
- Captures response quality via both LLM-as-judge and human rating
- Generates markdown reports with cost/quality comparisons (convertible to PDF/Word)
| Provider | Model | Pricing (Input / Output per 1M tokens) |
|---|---|---|
| Anthropic | Claude Opus 4 | $15.00 / $75.00 |
| Anthropic | Claude Sonnet 4 | $3.00 / $15.00 |
| Anthropic | Claude Haiku 3.5 | $0.80 / $4.00 |
| OpenAI | GPT-4o | $2.50 / $10.00 |
| OpenAI | GPT-4o-mini | $0.15 / $0.60 |
| OpenAI | GPT-4.1 | $2.00 / $8.00 |
| OpenAI | GPT-4.1-mini | $0.40 / $1.60 |
| OpenAI | GPT-4.1-nano | $0.10 / $0.40 |
| Google | Gemini 2.5 Pro | $1.25 / $10.00 |
| Google | Gemini 2.5 Flash | $0.15 / $0.60 |
| Together AI | Llama 3.3 70B | $0.88 / $0.88 |
| Together AI | Llama 3 8B | $0.10 / $0.10 |
| Together AI | Qwen 2.5 7B | $0.30 / $0.30 |
Note: Pricing shown is approximate as of April 2025, sourced from each provider's public pricing page (Anthropic, OpenAI, Google, Together AI). Pricing changes frequently; verify against the provider pages and update `config/models.yaml` before each test run.
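Per-conversation cost is simply token usage priced at the per-million rates above. Here is a minimal sketch of that arithmetic; the model keys, token counts, and function are illustrative only, while the harness's actual calculation lives in `src/cost.py` and reads its rates from `config/models.yaml`:

```python
# Illustrative sketch only; see src/cost.py and config/models.yaml for the
# values and logic the harness actually uses. Model keys here are made up.
PRICING_PER_1M = {
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}

def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation from total input/output tokens."""
    rates = PRICING_PER_1M[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# e.g. an 8-turn chat with ~4,000 input and ~1,500 output tokens
print(f"${conversation_cost('gpt-4.1-nano', 4_000, 1_500):.4f}")  # $0.0010
```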
The results below are based on a full test run across 4 multi-turn ALMA conversation scenarios (casual chat, emotional support, daily check-in, deep conversation). Quality was rated by Claude Sonnet 4 acting as LLM judge, on a 1-10 scale.
| Model | Avg Cost/Convo | Quality | Monthly @ 10K DAU | Monthly @ 100K DAU |
|---|---|---|---|---|
| GPT-4.1 Nano | $0.0009 | 8.5/10 | $850 | $8.5K |
| Llama 3 8B | $0.0009 | 7.8/10 | $855 | $8.5K |
| GPT-4o Mini | $0.0012 | 8.0/10 | $1.1K | $11.1K |
| Gemini 2.5 Flash | $0.0019 | 8.8/10 | $1.7K | $17.1K |
| Qwen 2.5 7B | $0.0029 | 6.5/10 | $2.6K | $25.9K |
| GPT-4.1 Mini | $0.0037 | 8.5/10 | $3.3K | $33.2K |
| Gemini 2.5 Pro | $0.0087 | 9.0/10 | $7.9K | $78.6K |
| Llama 3.3 70B | $0.0089 | 8.2/10 | $8.0K | $80.3K |
| Claude Haiku 3.5 | $0.01 | 8.8/10 | $10.5K | $105.4K |
| GPT-4o | $0.02 | 8.0/10 | $17.8K | $178.2K |
| GPT-4.1 | $0.02 | 8.8/10 | $20.3K | $203.0K |
| Claude Sonnet 4 | $0.05 | 8.8/10 | $40.6K | $406.1K |
| Claude Opus 4 | $0.23 | 9.0/10 | $210.1K | $2.1M |
Projections assume 3 conversations per user per day, 30 days/month. Last run: April 2026.
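The monthly columns follow directly from those assumptions: cost per conversation × DAU × 3 conversations/day × 30 days. A quick sketch of the arithmetic (the function name is illustrative):

```python
def monthly_cost(cost_per_convo: float, dau: int,
                 convos_per_day: int = 3, days: int = 30) -> float:
    """Projected monthly spend under the assumptions stated above."""
    return cost_per_convo * dau * convos_per_day * days

# GPT-4.1 Nano at ~$0.0009/conversation and 10K DAU
print(f"${monthly_cost(0.0009, 10_000):,.0f}")  # ~$810; the table's $850 presumably uses the unrounded cost
```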
Key findings: GPT-4.1 Nano offers the best value (94% of top quality at 11% of cost). Gemini 2.5 Flash is the sweet spot for quality-conscious budgets (8.8/10 at $0.002/convo). Claude Opus 4 and Gemini 2.5 Pro tie for highest quality (9.0/10) but at very different price points. See `results/latest/final_report.md` for the full analysis.
```
alma-api-test/
├── config/
│ ├── models.yaml # Model definitions and pricing
│ ├── scenarios/ # Conversation test scripts
│ │ ├── casual_chat.yaml
│ │ ├── emotional_support.yaml
│ │ ├── daily_checkin.yaml
│ │ └── deep_conversation.yaml
│ ├── system_prompts/ # System prompt variations
│ │ └── alma_default.txt
│ └── defaults.yaml # Default test parameters
├── src/
│ ├── clients/ # API client wrappers per provider
│ │ ├── base.py
│ │ ├── anthropic.py
│ │ ├── openai_client.py
│ │ ├── google.py
│ │ └── together.py
│ ├── runner.py # Test execution engine
│ ├── cost.py # Cost calculator
│ ├── judge.py # LLM-as-judge quality scorer
│ ├── judge_cli.py # Judge CLI entry point
│ └── report.py # Markdown report generator
├── results/ # Test run outputs (JSON + markdown)
├── tests/ # Unit tests
├── .env.example # Required API keys
├── pyproject.toml
├── CLAUDE.md
└── README.md
```
```bash
# 1. Clone and install
git clone <repo-url>
cd alma-api-test
uv sync --extra dev
# 2. Set up API keys
cp .env.example .env
# Edit .env with your keys
# 3. Run a test
uv run python -m src.runner --scenario casual_chat --models claude-sonnet-4 gpt-4o
# 4. Generate report
uv run python -m src.report --run-dir results/latest
```

| Parameter | Description | Default |
|---|---|---|
| `model` | Which model(s) to test | all |
| `temperature` | Sampling temperature | 0.7 |
| `max_tokens` | Max response tokens | 1024 |
| `system_prompt` | ALMA's personality prompt | see `defaults.yaml` |
| `scenario` | Conversation script to run | `casual_chat` |
| `turns` | Number of conversation turns | 8 |
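For a concrete picture of how these parameters reach a provider, here is a minimal sketch of a single call using the OpenAI SDK. It is not the harness's actual client code (the real per-provider wrappers live in `src/clients/`), and the model name and message contents are just examples:

```python
# Illustrative only; the real per-provider wrappers live in src/clients/.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment (see .env.example)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    temperature=0.7,   # `temperature` default
    max_tokens=1024,   # `max_tokens` default
    messages=[
        {"role": "system",
         "content": open("config/system_prompts/alma_default.txt").read()},  # `system_prompt`
        {"role": "user", "content": "Hey ALMA, how's it going?"},  # first scenario turn
    ],
)

# Token usage per turn is what feeds the cost calculation.
print(response.usage.prompt_tokens, response.usage.completion_tokens)
print(response.choices[0].message.content)
```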
Create a YAML file in `config/scenarios/` following this structure:
```yaml
name: casual_chat
description: "Casual daily conversation with ALMA"
turns:
  - role: user
    content: "Hey ALMA, how's it going?"
  - role: user
    content: "I had a rough day at work today..."
  # ... more turns
```

Reports are generated as markdown files in `results/` and include:
- Cost per conversation by model
- Cost projections at various user scales (1K, 10K, 100K daily active users)
- Token usage breakdown (input vs output)
- Quality scores (LLM judge + human ratings)
- Recommendations based on cost/quality ratio
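The LLM-judge scores come from `src/judge.py`. As a rough sketch of the approach (the prompt wording, rubric, score parsing, and model ID below are illustrative assumptions, not the harness's actual implementation), a judge call with Claude Sonnet 4 might look like this:

```python
# Illustrative sketch of LLM-as-judge scoring; see src/judge.py for the real logic.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY (see .env.example)

def judge_quality(transcript: str) -> int:
    """Ask the judge model for a 1-10 quality score for one conversation."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; check Anthropic's docs
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Rate the assistant's responses in this conversation for warmth, "
                "coherence, and helpfulness on a 1-10 scale. Reply with only the number.\n\n"
                + transcript
            ),
        }],
    )
    return int(message.content[0].text.strip())
```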