A Python test harness for comparing real-world API costs and response quality across major AI models, built to inform model selection for ALMA — a conversational AI companion app.
Choosing the right model for a conversational AI companion involves tradeoffs between cost, quality, and tone. This repo runs real multi-turn conversations against multiple models and produces reports that make those tradeoffs visible with hard numbers.
- Runs scripted multi-turn conversations against 13 models across 4 providers
- Tracks token usage and calculates real dollar costs per conversation
- Captures response quality via both LLM-as-judge and human rating
- Generates markdown reports with cost/quality comparisons (convertible to PDF/Word)
| Provider | Model | Pricing (Input / Output per 1M tokens) |
|---|---|---|
| Anthropic | Claude Opus 4 | $15.00 / $75.00 |
| Anthropic | Claude Sonnet 4 | $3.00 / $15.00 |
| Anthropic | Claude Haiku 3.5 | $0.80 / $4.00 |
| OpenAI | GPT-4o | $2.50 / $10.00 |
| OpenAI | GPT-4o-mini | $0.15 / $0.60 |
| OpenAI | GPT-4.1 | $2.00 / $8.00 |
| OpenAI | GPT-4.1-mini | $0.40 / $1.60 |
| OpenAI | GPT-4.1-nano | $0.10 / $0.40 |
| Google | Gemini 2.5 Pro | $1.25 / $10.00 |
| Google | Gemini 2.5 Flash | $0.15 / $0.60 |
| Together AI | Llama 3.3 70B | $0.88 / $0.88 |
| Together AI | Llama 3 8B | $0.10 / $0.10 |
| Together AI | Qwen 2.5 7B | $0.30 / $0.30 |
Note: Pricing shown is approximate as of April 2025, sourced from each provider's public pricing page (Anthropic, OpenAI, Google, Together AI). Pricing changes frequently; verify against the provider pages and update `config/models.yaml` before each test run.
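Per-conversation cost is simply token usage priced at the per-million rates above. Here is a minimal sketch of that arithmetic; the model keys, token counts, and function are illustrative only, while the harness's actual calculation lives in `src/cost.py` and reads its rates from `config/models.yaml`:

```python
# Illustrative sketch only; see src/cost.py and config/models.yaml for the
# values and logic the harness actually uses. Model keys here are made up.
PRICING_PER_1M = {
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
}

def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation from total input/output tokens."""
    rates = PRICING_PER_1M[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# e.g. an 8-turn chat with ~4,000 input and ~1,500 output tokens
print(f"${conversation_cost('gpt-4.1-nano', 4_000, 1_500):.4f}")  # $0.0010
```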
The results below are based on a full test run across 4 multi-turn ALMA conversation scenarios (casual chat, emotional support, daily check-in, deep conversation). Quality was rated by Claude Sonnet 4 acting as LLM judge, on a 1-10 scale.
| Model | Avg Cost/Convo | Quality | Monthly @ 10K DAU | Monthly @ 100K DAU |
|---|---|---|---|---|
| GPT-4.1 Nano | $0.0009 | 8.5/10 | $850 | $8.5K |
| Llama 3 8B | $0.0009 | 7.8/10 | $855 | $8.5K |
| GPT-4o Mini | $0.0012 | 8.0/10 | $1.1K | $11.1K |
| Gemini 2.5 Flash | $0.0019 | 8.8/10 | $1.7K | $17.1K |
| Qwen 2.5 7B | $0.0029 | 6.5/10 | $2.6K | $25.9K |
| GPT-4.1 Mini | $0.0037 | 8.5/10 | $3.3K | $33.2K |
| Gemini 2.5 Pro | $0.0087 | 9.0/10 | $7.9K | $78.6K |
| Llama 3.3 70B | $0.0089 | 8.2/10 | $8.0K | $80.3K |
| Claude Haiku 3.5 | $0.01 | 8.8/10 | $10.5K | $105.4K |
| GPT-4o | $0.02 | 8.0/10 | $17.8K | $178.2K |
| GPT-4.1 | $0.02 | 8.8/10 | $20.3K | $203.0K |
| Claude Sonnet 4 | $0.05 | 8.8/10 | $40.6K | $406.1K |
| Claude Opus 4 | $0.23 | 9.0/10 | $210.1K | $2.1M |
Projections assume 3 conversations per user per day, 30 days/month. Last run: April 2026.
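The monthly columns follow directly from those assumptions: cost per conversation × DAU × 3 conversations/day × 30 days. A quick sketch of the arithmetic (the function name is illustrative):

```python
def monthly_cost(cost_per_convo: float, dau: int,
                 convos_per_day: int = 3, days: int = 30) -> float:
    """Projected monthly spend under the assumptions stated above."""
    return cost_per_convo * dau * convos_per_day * days

# GPT-4.1 Nano at ~$0.0009/conversation and 10K DAU
print(f"${monthly_cost(0.0009, 10_000):,.0f}")  # ~$810; the table's $850 presumably uses the unrounded cost
```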
Key findings: GPT-4.1 Nano offers the best value (94% of top quality at 11% of cost). Gemini 2.5 Flash is the sweet spot for quality-conscious budgets (8.8/10 at $0.002/convo). Claude Opus 4 and Gemini 2.5 Pro tie for highest quality (9.0/10) but at very different price points. See `results/latest/final_report.md` for the full analysis.
```
alma-api-test/
├── config/
│ ├── models.yaml # Model definitions and pricing
│ ├── scenarios/ # Conversation test scripts
│ │ ├── casual_chat.yaml
│ │ ├── emotional_support.yaml
│ │ ├── daily_checkin.yaml
│ │ └── deep_conversation.yaml
│ ├── system_prompts/ # System prompt variations
│ │ └── alma_default.txt
│ └── defaults.yaml # Default test parameters
├── src/
│ ├── clients/ # API client wrappers per provider
│ │ ├── base.py
│ │ ├── anthropic.py
│ │ ├── openai_client.py
│ │ ├── google.py
│ │ └── together.py
│ ├── runner.py # Test execution engine
│ ├── cost.py # Cost calculator
│ ├── judge.py # LLM-as-judge quality scorer
│ ├── judge_cli.py # Judge CLI entry point
│ └── report.py # Markdown report generator
├── results/ # Test run outputs (JSON + markdown)
├── tests/ # Unit tests
├── .env.example # Required API keys
├── pyproject.toml
├── CLAUDE.md
└── README.md
```
```bash
# 1. Clone and install
git clone <repo-url>
cd alma-api-test
uv sync --extra dev
# 2. Set up API keys
cp .env.example .env
# Edit .env with your keys
# 3. Run a test
uv run python -m src.runner --scenario casual_chat --models claude-sonnet-4 gpt-4o
# 4. Generate report
uv run python -m src.report --run-dir results/latest
```

| Parameter | Description | Default |
|---|---|---|
| `model` | Which model(s) to test | all |
| `temperature` | Sampling temperature | 0.7 |
| `max_tokens` | Max response tokens | 1024 |
| `system_prompt` | ALMA's personality prompt | see `defaults.yaml` |
| `scenario` | Conversation script to run | `casual_chat` |
| `turns` | Number of conversation turns | 8 |
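For a concrete picture of how these parameters reach a provider, here is a minimal sketch of a single call using the OpenAI SDK. It is not the harness's actual client code (the real per-provider wrappers live in `src/clients/`), and the model name and message contents are just examples:

```python
# Illustrative only; the real per-provider wrappers live in src/clients/.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment (see .env.example)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    temperature=0.7,   # `temperature` default
    max_tokens=1024,   # `max_tokens` default
    messages=[
        {"role": "system",
         "content": open("config/system_prompts/alma_default.txt").read()},  # `system_prompt`
        {"role": "user", "content": "Hey ALMA, how's it going?"},  # first scenario turn
    ],
)

# Token usage per turn is what feeds the cost calculation.
print(response.usage.prompt_tokens, response.usage.completion_tokens)
print(response.choices[0].message.content)
```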
Create a YAML file in `config/scenarios/` following this structure:
```yaml
name: casual_chat
description: "Casual daily conversation with ALMA"
turns:
  - role: user
    content: "Hey ALMA, how's it going?"
  - role: user
    content: "I had a rough day at work today..."
  # ... more turns
```

Reports are generated as markdown files in `results/` and include:
- Cost per conversation by model
- Cost projections at various user scales (1K, 10K, 100K daily active users)
- Token usage breakdown (input vs output)
- Quality scores (LLM judge + human ratings)
- Recommendations based on cost/quality ratio
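The LLM-judge scores come from `src/judge.py`. As a rough sketch of the approach (the prompt wording, rubric, score parsing, and model ID below are illustrative assumptions, not the harness's actual implementation), a judge call with Claude Sonnet 4 might look like this:

```python
# Illustrative sketch of LLM-as-judge scoring; see src/judge.py for the real logic.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY (see .env.example)

def judge_quality(transcript: str) -> int:
    """Ask the judge model for a 1-10 quality score for one conversation."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; check Anthropic's docs
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Rate the assistant's responses in this conversation for warmth, "
                "coherence, and helpfulness on a 1-10 scale. Reply with only the number.\n\n"
                + transcript
            ),
        }],
    )
    return int(message.content[0].text.strip())
```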