# 📓 The GenAI Revolution Cookbook

**Title:** How to Choose an AI Model for Your App: Speed, Cost, Reliability

**Description:** Quickly choose the right LLM with a practical framework: compare accuracy, context limits, latency, token cost, and risk tolerance today.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself!*



Choosing the right model isn't about hype; it's about fit. This guide gives you a practical framework to pick a model that matches your app's speed, cost, and reliability constraints. For a deeper dive into when a smaller model might actually outperform a large one, and how to weigh cost versus latency tradeoffs, see our analysis on [small vs large language models](/article/small-language-models-vs-large-language-models-when-to-use-each-2).

## Define Your Requirements First

Before you compare models, nail down what success looks like for your application.

**Latency budget:** Is this a chatbot where users expect sub-second responses, or a batch summarization job that can tolerate minutes? Set hard ceilings for time-to-first-token (TTFT) and total latency.

**Context length:** How much text do you need to fit in a single prompt? Set a safe ceiling lower than "max supported." Long-context models often degrade well before their advertised limit; treat 60–80% of the advertised window as usable. If you want to understand why this happens and how to mitigate it, check out our guide on [context rot and how LLMs 'forget' as their memory grows](/article/context-rot-why-llms-forget-as-their-memory-grows-3).

**Output structure:** Do you need JSON, citations, or free-form text? Models with strong instruction-following (e.g., GPT-4o, Claude 3.5 Sonnet) handle structured output better than older or smaller models.

**Accuracy floor:** Define the minimum acceptable quality. If you're extracting invoice line items, 95% precision might be table stakes. If you're drafting marketing copy, you can tolerate more variance.

**Cost ceiling:** Estimate tokens per request (input + output) and multiply by your expected volume. This informs cost projections and whether you must compress, retrieve, or re-architect. For strategies to further reduce spend and boost responsiveness, consider implementing [semantic caching with Redis Vector](/article/semantic-cache-llm-how-to-implement-with-redis-vector-to-cut-costs-6) to cache near-duplicate prompts and optimize LLM usage.
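One way to make these requirements concrete is to write them down as a small, checkable spec. The sketch below is illustrative only: the `Requirements` fields, the `violations` helper, and the default 0.7 usable-context factor (picked from the 60–80% range above) are assumptions for this example, not recommendations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical container for the constraints described above."""
    max_p95_latency_s: float     # total latency ceiling at p95, in seconds
    prompt_tokens: int           # largest prompt you expect to send
    min_accuracy: float          # accuracy floor on your own eval set (0 to 1)
    max_monthly_cost_usd: float  # cost ceiling


def usable_context(advertised_limit: int, safety_factor: float = 0.7) -> int:
    """Treat only a fraction of the advertised context window as usable."""
    return int(advertised_limit * safety_factor)


def monthly_cost(cost_per_call_usd: float, calls_per_month: int) -> float:
    """Project monthly spend from a measured per-call cost."""
    return cost_per_call_usd * calls_per_month


def violations(req: Requirements, advertised_limit: int, p95_latency_s: float,
               accuracy: float, cost_per_call_usd: float,
               calls_per_month: int) -> List[str]:
    """Return the violated constraints; an empty list means the model fits."""
    problems = []
    if req.prompt_tokens > usable_context(advertised_limit):
        problems.append("prompt exceeds usable context")
    if p95_latency_s > req.max_p95_latency_s:
        problems.append("p95 latency above ceiling")
    if accuracy < req.min_accuracy:
        problems.append("accuracy below floor")
    if monthly_cost(cost_per_call_usd, calls_per_month) > req.max_monthly_cost_usd:
        problems.append("projected cost above ceiling")
    return problems
```

Feeding measured numbers for each candidate model into `violations` turns the comparison into a pass/fail list rather than a judgment call.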

## Benchmark on Your Data, Not Demos

Public leaderboards (MMLU, HumanEval) measure general capability, not your task. A model that excels at coding benchmarks may struggle with your domain-specific extraction or summarization.

**Use your data:** Pull a representative sample of real inputs and expected outputs. If you don't have labeled data yet, create a small synthetic set that mirrors production edge cases (long documents, ambiguous queries, multilingual text).

**Measure what matters:** Track task-specific metrics: F1 for extraction, ROUGE for summarization, exact match for structured output. Don't rely solely on vibes or cherry-picked examples.

**Automate evaluation:** Tools like Promptfoo, Ragas, or custom scripts let you run hundreds of test cases in minutes. Version your prompts and datasets so you can reproduce results as models update.
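A custom evaluation script can be very small. The sketch below scores exact match and a set-based F1 over a labeled sample. `fake_model` is a hypothetical stand-in for whatever client call you actually use, and the two test cases are made up for illustration.

```python
from typing import Callable, Dict, List, Set


def f1_score(predicted: Set[str], expected: Set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall over extracted items."""
    if not predicted or not expected:
        return 1.0 if predicted == expected else 0.0
    true_pos = len(predicted & expected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)


def evaluate(cases: List[Dict], run_model: Callable[[str], Set[str]]) -> Dict[str, float]:
    """Run every case through the model and average the metrics."""
    if not cases:
        raise ValueError("at least one test case is required")
    exact, f1_total = 0, 0.0
    for case in cases:
        predicted = run_model(case["input"])
        exact += int(predicted == case["expected"])
        f1_total += f1_score(predicted, case["expected"])
    return {"exact_match": exact / len(cases), "f1": f1_total / len(cases)}


def fake_model(prompt: str) -> Set[str]:
    """Hypothetical model call that returns the extracted line items."""
    return {"widget", "gadget"} if "invoice A" in prompt else {"bolt"}


cases = [
    {"input": "invoice A", "expected": {"widget", "gadget"}},
    {"input": "invoice B", "expected": {"bolt", "nut"}},
]
print(evaluate(cases, fake_model))
```

Keeping `cases` in a versioned file alongside the prompt makes it possible to rerun the same comparison whenever a model or prompt changes.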

## Measure Effective Cost per Correct Output

Per-token pricing is a trap. A cheaper model that produces twice as many tokens or needs two retries can cost more than a pricier model that nails it in one shot.

**Calculate cost per correct:** Multiply tokens per request by price per token to get cost per call, then divide by your measured accuracy. If Model A costs $0.01 per call at 90% accuracy, its effective cost is about $0.011 per correct output. If Model B costs $0.005 at 70% accuracy, it's about $0.007 per correct output. That is cheaper per correct result, but 30% of its outputs are wrong, so you'll need fallback logic or human review, and those costs count too.

**Account for retries and fallbacks:** If your pipeline retries failed requests or escalates to a larger model, factor those costs into your total. A 95% success rate with no retries beats 80% with expensive escalation.

**Price long context separately:** Some providers charge more per token beyond a threshold (e.g., 128k tokens). If your use case pushes context limits, test whether chunking or retrieval is cheaper than paying the long-context premium.
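The arithmetic above fits in a few lines of code. This is a minimal sketch using the Model A and Model B figures from the example. The `cost_with_escalation` helper is an assumption added for illustration: it models a single retry on a stronger model and assumes every failure of the cheap model is detected.

```python
def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Expected spend to obtain one correct output, ignoring retries."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_call / accuracy


def cost_with_escalation(cheap_cost: float, cheap_accuracy: float,
                         strong_cost: float) -> float:
    """Expected cost per request when cheap-model failures are retried
    once on a stronger model (assumes failures are always detected)."""
    return cheap_cost + (1 - cheap_accuracy) * strong_cost


model_a = cost_per_correct(0.01, 0.90)    # about 0.0111 USD
model_b = cost_per_correct(0.005, 0.70)   # about 0.0071 USD
escalated = cost_with_escalation(0.005, 0.70, 0.01)  # about 0.008 USD

print(f"Model A: {model_a:.4f}  Model B: {model_b:.4f}  B then A: {escalated:.4f}")
```

With these numbers, routing to Model B first and escalating failures to Model A still costs less per request than calling Model A every time, but only if failures can be detected cheaply.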

Here's a Python script to estimate tokens and cost per call for a batch of requests:

In [None]:
# Baseline Token Calculator: Estimate tokens and cost per call for LLM usage

import logging
from typing import List, Dict, Any

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

try:
    import tiktoken
except ImportError:
    raise ImportError(
        "tiktoken is required. Install with: pip install tiktoken"
    )


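def estimate_tokens(text: str, model_name: str = "gpt-3.5-turbo") -> int:
    """Count tokens in text using the tokenizer tiktoken maps to model_name."""
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))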

def predict_cost_per_call(
    tokens_in: int,
    tokens_out: int,
    model_pricing: Dict[str, Any]
) -> float:
    # Prices in model_pricing are quoted in USD per 1K tokens.
    price_in = model_pricing["price_in"] / 1000  # USD per input token
    price_out = model_pricing["price_out"] / 1000  # USD per output token
    return tokens_in * price_in + tokens_out * price_out

def batch_estimate(
    inputs: List[str],
    outputs: List[str],
    model_name: str,
    model_pricing: Dict[str, Any]
) -> Dict[str, Any]:
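    # Raise explicit errors instead of assert, which is skipped under python -O.
    if len(inputs) != len(outputs):
        raise ValueError("Inputs and outputs must be the same length")
    if not inputs:
        raise ValueError("At least one input/output pair is required")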
    token_ins = [estimate_tokens(inp, model_name) for inp in inputs]
    token_outs = [estimate_tokens(out, model_name) for out in outputs]
    costs = [
        predict_cost_per_call(ti, to, model_pricing)
        for ti, to in zip(token_ins, token_outs)
    ]
    summary = {
        "avg_tokens_in": sum(token_ins) / len(token_ins),
        "avg_tokens_out": sum(token_outs) / len(token_outs),
        "avg_cost_per_call": sum(costs) / len(costs),
        "max_cost_per_call": max(costs),
        "min_cost_per_call": min(costs),
        "total_cost": sum(costs),
        "calls": len(inputs),
    }
    logging.info(f"Batch estimate summary: {summary}")
    return summary

if __name__ == "__main__":
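    # Illustrative prices in USD per 1K tokens (GPT-4o mini list price at launch).
    # Verify against the provider's current pricing page before relying on them.
    gpt4o_mini_pricing = {
        "price_in": 0.00015,
        "price_out": 0.0006,
    }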

    example_inputs = [
        "Summarize the following Slack thread about Q2 sales targets.",
        "Extract all table data from the attached PDF and cite sources.",
    ]
    example_outputs = [
        '{"summary": "Q2 sales targets increased by 10%...", "details": "...", "citations": ["doc1.pdf"]}',
        '{"tables": [{"name": "Revenue", "rows": [...] }], "citations": ["page 3", "page 7"]}',
    ]

    summary = batch_estimate(
        example_inputs,
        example_outputs,
        model_name="gpt-4o-mini",
        model_pricing=gpt4o_mini_pricing,
    )

    print("Token and Cost Estimate Summary:")
    for k, v in summary.items():
        print(f"{k}: {v}")

## Test Latency Under Load

Advertised latency numbers assume ideal conditions. Real-world performance depends on server load, request batching, and network variability.

**Measure TTFT and tail latency:** Time-to-first-token (TTFT) dominates perceived responsiveness in streaming apps. Track p50, p95, and p99 latencies under realistic load. A model with a 200 ms p50 but a 2 s p99 will still feel slow on one request in a hundred.

**Simulate production traffic:** Use load testing tools (Locust, k6) to send concurrent requests at your expected QPS. Watch for throughput degradation and queue buildup.

**Compare hosted vs self-hosted:** Hosted APIs (OpenAI, Anthropic) abstract infrastructure but add network latency. Self-hosted (vLLM, TGI) gives you control but requires tuning batch size, KV cache, and quantization.
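Both measurements can be prototyped without a load-testing tool. The sketch below computes nearest-rank percentiles from recorded latencies and times TTFT for any iterable of streamed chunks. `fake_stream` is a hypothetical stand-in for a real streaming response; swap in your client's stream iterator.

```python
import math
import time
from typing import Iterable, Iterator, List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, total_seconds) for an iterable of streamed chunks."""
    start = time.perf_counter()
    ttft = None
    for _ in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total


def fake_stream() -> Iterator[str]:
    """Hypothetical streaming response that yields a token every 10 ms."""
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield token


ttft, total = time_stream(fake_stream())
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
```

Collect the `total` values from many concurrent requests and pass them to `percentile` with 50, 95, and 99 to get the tail figures discussed above.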

## Check for Determinism and Reproducibility

If you need auditability (compliance, debugging, A/B tests), you need reproducible outputs.

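**Set temperature to 0:** This makes decoding effectively greedy and removes most sampling randomness. Most models will then return identical outputs for identical prompts, though hosted providers can still show minor variance from batching and hardware-level nondeterminism. Where the API offers a seed parameter, set it as well.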

**Log prompts and responses:** Store the exact prompt, model version, and response for every request. If a model update changes behavior, you can diff outputs and catch regressions.

**Version your prompts:** Treat prompts like code. Use Git or a prompt management tool to track changes and roll back if quality drops.
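A log record for this purpose does not need much. The following sketch builds a JSON-serializable record per request and uses content hashes to detect when the same prompt starts producing a different response. The field names and the `outputs_changed` helper are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_log_record(prompt: str, response: str, model_version: str,
                     prompt_version: str, temperature: float = 0.0) -> dict:
    """Assemble a JSON-serializable record for one request."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "temperature": temperature,
        "prompt": prompt,
        "response": response,
        # Hashes make it cheap to group identical prompts and spot changed outputs.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }


def outputs_changed(old: dict, new: dict) -> bool:
    """True when the same prompt produced a different response."""
    return (old["prompt_sha256"] == new["prompt_sha256"]
            and old["response_sha256"] != new["response_sha256"])


before = build_log_record("Summarize: ...", "Summary v1", "gpt-4o-2024-08-06", "v1")
after = build_log_record("Summarize: ...", "Summary v2", "gpt-4o-2024-11-20", "v1")
print(json.dumps(before, indent=2))
print("regression candidate:", outputs_changed(before, after))
```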

## Plan for Routing and Escalation

No single model is optimal for every request. Route simple queries to fast, cheap models and escalate complex ones to larger models.

**Use confidence signals:** If your model returns a confidence score or you can estimate it from output structure (e.g., presence of citations, JSON validity), route low-confidence requests to a stronger model.

**Implement tiered routing:** Start with a small model (GPT-4o mini, Llama 3.1 8B). If it fails validation or returns low confidence, retry with a larger model (GPT-4o, Claude 3.5 Sonnet).

**Monitor escalation rate:** As a rule of thumb, if more than about 20% of requests escalate, your small model is underperforming. Fine-tune it, adjust prompts, or switch to a larger base model.
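Tiered routing can be sketched as a loop over models ordered from cheapest to strongest, using JSON validity as the confidence signal. Everything named here is hypothetical: `small_model` and `large_model` stand in for real client calls, and the required keys are just an example schema.

```python
import json
from typing import Callable, List, Tuple


def is_valid(output: str, required_keys: List[str]) -> bool:
    """Validation signal: output parses as a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def route(prompt: str,
          tiers: List[Tuple[str, Callable[[str], str]]],
          required_keys: List[str]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, output) for the first valid output."""
    last_name, last_output = "", ""
    for tier_name, call_model in tiers:
        last_name, last_output = tier_name, call_model(prompt)
        if is_valid(last_output, required_keys):
            return last_name, last_output
    # Nothing validated: return the last attempt and let the caller decide.
    return last_name, last_output


def small_model(prompt: str) -> str:
    """Hypothetical cheap model that answers in prose, so validation fails."""
    return "Sure! Here is the summary you asked for..."


def large_model(prompt: str) -> str:
    """Hypothetical stronger model that returns the requested JSON."""
    return '{"summary": "Q2 targets rose.", "citations": ["doc1.pdf"]}'


tiers = [("small", small_model), ("large", large_model)]
name, output = route("Summarize the thread.", tiers, ["summary", "citations"])
print(name, output)
```

Counting how often `route` returns a tier other than the first gives you the escalation rate to monitor.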

## Evaluate Open-Source vs Hosted Tradeoffs

Open-source models (Llama, Mistral, Qwen) give you control and replace per-token fees with infrastructure costs, but they require infrastructure and tuning.

**When to self-host:** High volume (millions of requests/month), strict data residency, or need for custom fine-tuning. You'll pay for GPUs, engineering time, and monitoring.

**When to use hosted APIs:** Low to moderate volume, rapid prototyping, or lack of ML infrastructure. You trade control for simplicity and predictable pricing.

**Quantization and throughput:** Self-hosted models can be quantized (8-bit, 4-bit) to fit smaller GPUs and increase throughput. Test whether quantization degrades accuracy on your task before deploying.
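A rough break-even estimate helps decide when volume justifies self-hosting. The sketch below is a simplification under stated assumptions: self-hosting is modeled as a fixed monthly cost (GPUs plus engineering time) and an optional marginal cost per request, and the dollar figures are invented for illustration.

```python
def breakeven_requests(fixed_monthly_usd: float,
                       hosted_cost_per_request: float,
                       self_hosted_marginal_cost: float = 0.0) -> float:
    """Monthly request volume above which self-hosting is cheaper than a hosted API."""
    saving_per_request = hosted_cost_per_request - self_hosted_marginal_cost
    if saving_per_request <= 0:
        return float("inf")  # self-hosting never pays off at these prices
    return fixed_monthly_usd / saving_per_request


# Illustrative numbers only: 6,000 USD/month fixed cost, 0.002 USD per hosted call.
volume = breakeven_requests(fixed_monthly_usd=6000.0, hosted_cost_per_request=0.002)
print(f"Break-even at roughly {volume:,.0f} requests per month")
```

Under these made-up inputs the break-even lands in the millions of requests per month, which is consistent with the volume guidance above.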

## Monitor in Production

Model behavior drifts as input distributions shift and providers update models.

**Track accuracy over time:** Run a subset of production requests through your eval pipeline daily. If accuracy drops, investigate prompt drift, model updates, or data distribution changes.

**Alert on latency and cost spikes:** Set thresholds for p95 latency and daily spend. If either spikes, check for traffic surges, model degradation, or upstream API issues.

**Version model endpoints:** When a provider updates a model (e.g., `gpt-4o-2024-08-06` → `gpt-4o-2024-11-20`), test the new version in staging before switching production traffic.
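The alerting rules above reduce to a few threshold comparisons. This is a minimal sketch: the metric names, the thresholds, and the 0.03 allowed accuracy drop are assumptions chosen for the example, and a real setup would feed these from your metrics store.

```python
from typing import Dict, List


def check_alerts(metrics: Dict[str, float], thresholds: Dict[str, float],
                 baseline_accuracy: float,
                 max_accuracy_drop: float = 0.03) -> List[str]:
    """Compare today's metrics against thresholds and return triggered alerts."""
    alerts = []
    if metrics["p95_latency_s"] > thresholds["p95_latency_s"]:
        alerts.append("p95 latency above threshold")
    if metrics["daily_spend_usd"] > thresholds["daily_spend_usd"]:
        alerts.append("daily spend above threshold")
    if baseline_accuracy - metrics["accuracy"] > max_accuracy_drop:
        alerts.append("accuracy dropped versus baseline")
    return alerts


thresholds = {"p95_latency_s": 2.0, "daily_spend_usd": 150.0}
today = {"p95_latency_s": 2.4, "daily_spend_usd": 120.0, "accuracy": 0.90}
print(check_alerts(today, thresholds, baseline_accuracy=0.95))
```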

## Negotiate Volume and Caching

If you're spending thousands per month, you have leverage.

**Ask for volume discounts:** OpenAI, Anthropic, and others offer tiered pricing for high-volume customers. Negotiate before you scale.

**Enable prompt caching:** Some providers (Anthropic, OpenAI) cache repeated prompt prefixes and charge less for cached tokens. If your prompts share a long system message or context, this can cut the cost of the cached input tokens by 50% or more, depending on the provider.

**Batch requests:** If latency permits, submit requests through a provider's asynchronous batch endpoint where one is offered. This reduces per-request overhead and is often priced below real-time calls.
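To see what prompt caching is worth for your traffic, estimate input cost with and without a discount on the shared prefix. The sketch below is a simplified model: the 50% discount and the per-1K price are illustrative assumptions, and it ignores any cache-write surcharge and the first uncached call.

```python
def input_cost(prefix_tokens: int, unique_tokens: int, price_per_1k: float,
               cached_discount: float = 0.0) -> float:
    """Input cost in USD for one call when prefix_tokens are billed at a discount."""
    if not 0 <= cached_discount <= 1:
        raise ValueError("cached_discount must be between 0 and 1")
    cached_price = price_per_1k * (1 - cached_discount)
    return (prefix_tokens * cached_price + unique_tokens * price_per_1k) / 1000


# Illustrative: 4,000-token shared system prompt plus 500 unique tokens per request.
full = input_cost(4000, 500, price_per_1k=0.00015)
cached = input_cost(4000, 500, price_per_1k=0.00015, cached_discount=0.5)
print(f"uncached: {full:.6f} USD, cached: {cached:.6f} USD, "
      f"saving: {1 - cached / full:.0%}")
```

The saving on the whole call is smaller than the discount on cached tokens, because the unique portion of each prompt is still billed at full price.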

## Procurement and Compliance Notes

If you're in a regulated industry or enterprise, model selection intersects with legal and compliance requirements.

**Data residency:** Some providers offer regional endpoints (EU, US) to comply with GDPR or other regulations. Verify where your data is processed and stored.

**SLAs and uptime:** Check what uptime your provider actually commits to; a figure such as 99.9% may apply only to certain plans or enterprise agreements. If downtime costs you revenue, negotiate SLAs or architect fallbacks (multiple providers, self-hosted backup).

**Audit trails:** Log every request and response with timestamps, user IDs, and model versions. This is non-negotiable for compliance in finance, healthcare, and legal.

## Beware Hidden Serving Costs

Self-hosting isn't just GPU rental. Factor in:

- **Engineering time:** Setting up vLLM, TGI, or Ray Serve, tuning batch sizes, and debugging OOM errors.
- **Monitoring and observability:** Prometheus, Grafana, and custom dashboards to track latency, throughput, and GPU utilization.
- **Model updates:** Downloading new weights, re-quantizing, and re-benchmarking every few months.

If your team is small, hosted APIs often cost less than the fully loaded cost of self-hosting.

## Final Checklist

Before you commit to a model:

- Define latency, context, accuracy, and cost requirements
- Benchmark on your data with task-specific metrics
- Calculate effective cost per correct output, including retries
- Test latency under realistic load (p95, p99)
- Verify determinism if you need reproducibility
- Plan tiered routing for cost and quality optimization
- Decide hosted vs self-hosted based on volume and control needs
- Set up monitoring for accuracy, latency, and cost drift
- Negotiate volume pricing and enable caching if applicable
- Document compliance and audit requirements

Model selection is not a one-time decision. As your application scales, your data shifts, and new models launch, revisit this framework to ensure you're still optimizing for the right tradeoffs.