# 📓 The GenAI Revolution Cookbook

**Title:** Small Language Models vs Large Language Models: When to Use Each

**Description:** A builder's decision guide with accuracy trade-offs, cost-per-call math, and latency SLAs—know exactly when SLMs beat LLMs.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Why This Matters

Your voice agent misses turn-taking cues when p95 latency hits 450ms, even with bigger GPUs. Your customer-facing chatbot racks up $12,000/month in API costs despite handling only 50 concurrent users. Your compliance team blocks your LLM deployment because PII leaves the VPC.

These failures share a root cause: **batching-driven tail latency and concurrency economics make large language models the wrong default for real-time, cost-sensitive, or privacy-constrained workloads**. Small language models (SLMs), typically 1B–13B parameters, can deliver sub-200ms p95 latency, predictable per-request costs, and on-premise deployment, often without sacrificing task performance on structured outputs, Q&A, and extraction.

This explainer focuses on **one core tradeoff**: how batching and concurrency shape the latency-cost curve, and the decision rule you can codify to route requests correctly. You'll learn the queueing mechanism behind tail latency spikes, a formula to calculate cost-per-validated-response under different concurrency levels, and a threshold-based policy you can enforce in CI/CD.

---

## How It Works

### 1. Batching Increases Queueing Delay and p95 Latency

Large models achieve cost efficiency by batching requests: processing 8–32 prompts in one forward pass amortizes the cost of reading the model weights from GPU memory across multiple outputs (the KV cache, by contrast, grows with every request in the batch). But batching introduces queueing delay: each request waits for the batch to fill or a timeout to expire before inference starts.

At low load (1–10 requests/sec), queues stay shallow and p95 latency remains close to the median. As the arrival rate approaches what the batch can absorb, requests queue longer and p95 latency diverges sharply from p50. As an illustration, a 70B model with batch size 16 and 50ms per-token decode time can see p95 jump from roughly 180ms to 600ms when load exceeds 20 requests/sec.

SLMs can serve requests individually or in small batches (1–4) because their weights and KV cache fit comfortably in GPU memory and each forward pass is cheap, so there is little pressure to fill large batches. This keeps queueing delay near zero and p95 latency predictable, typically under 200ms even at moderate concurrency.
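The queueing mechanism can be illustrated with a toy simulation. This is a minimal sketch rather than a model of any real serving stack: it assumes Poisson arrivals, a single server, a constant service time per batch, and a dispatch rule of "batch full or oldest request timed out". The function names and every number (400ms service time, 20ms timeout) are hypothetical.

```python
import math
import random

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[rank - 1]

def simulate(rate_per_s, batch_size=16, service_ms=400.0, timeout_ms=20.0, n=2000, seed=0):
    """Return per-request latencies (ms) for one batching server (toy model)."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(rate_per_s) * 1000.0  # Poisson arrivals, in ms
        arrivals.append(t)

    latencies, server_free, i = [], 0.0, 0
    while i < n:
        # Dispatch when the batch fills or the oldest request times out,
        # but never before the previous batch has finished.
        fill_idx = i + batch_size - 1
        fill_time = arrivals[fill_idx] if fill_idx < n else math.inf
        start = max(server_free, min(fill_time, arrivals[i] + timeout_ms))
        j = i
        while j < n and j - i < batch_size and arrivals[j] <= start:
            j += 1
        finish = start + service_ms  # assumption: constant time per batch
        latencies.extend(finish - arrivals[k] for k in range(i, j))
        server_free = finish
        i = j
    return latencies

# Capacity in this toy setup is batch_size / service time = 40 requests/sec.
for rate in (5, 20, 50):
    lat = simulate(rate)
    print(f"{rate:>3} req/s  p50={percentile(lat, 50):8.0f} ms  p95={percentile(lat, 95):8.0f} ms")
```

Below capacity, waiting time is bounded by one batch's service time. Once the arrival rate exceeds capacity, the backlog grows and the gap between p50 and p95 widens with it.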

### 2. Parameter Count Drives Memory Bandwidth and KV Cache Size

A 70B model in FP16 requires ~140GB of weights plus a KV cache proportional to context length and batch size. Serving 2048-token contexts in a batch of 16 consumes an additional ~50GB, though the exact figure depends heavily on the attention layout (grouped-query attention shrinks it several-fold). This memory footprint forces multi-GPU setups, which adds communication between tensor-parallel shards and 10–30ms per forward pass.

A 7B model fits entirely in a single A10G (24GB), eliminating inter-GPU communication. Decode latency drops to 15–25ms per token, and p95 stays under 150ms for contexts up to 4096 tokens at concurrency ≤10.
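The memory arithmetic above can be reproduced with two small helpers. This is a sketch under stated assumptions: 2 bytes per value (FP16), and a hypothetical 70B-class shape of 80 layers with 128-dimensional heads, shown once with 64 KV heads (full multi-head attention) and once with 8 KV heads (grouped-query attention). Real models differ, so substitute the values from your model's config.

```python
def weight_bytes(n_params, bytes_per_param=2):
    """Memory for model weights; 2 bytes per parameter corresponds to FP16."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, batch_size, bytes_per_value=2):
    """Memory for keys and values across all layers, tokens, and sequences."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_len * batch_size

GB = 10**9

# Weights: 70B and 7B parameters in FP16.
print(f"70B weights: {weight_bytes(70 * 10**9) / GB:.1f} GB")
print(f" 7B weights: {weight_bytes(7 * 10**9) / GB:.1f} GB")

# KV cache for 2048-token contexts at batch size 16 (hypothetical shapes).
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, context_len=2048, batch_size=16)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=2048, batch_size=16)
print(f"KV cache, 64 KV heads: {mha / GB:.1f} GB")
print(f"KV cache,  8 KV heads: {gqa / GB:.1f} GB")
```

With these assumed shapes the cache comes to roughly 86GB with full multi-head attention and roughly 11GB with 8 KV heads, which brackets the ~50GB figure quoted above and shows how sensitive the estimate is to the attention layout.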

### 3. Cost Scales with Batch Utilization, Not Just Tokens

API pricing is quoted per million tokens, but your true cost is **cost-per-validated-response**: the cost of input and output tokens, inflated by retries from schema violations and by guardrail rejects. Large models require high batch utilization (≥80%) to justify their per-instance cost; at low concurrency, you pay for idle capacity.

SLMs have lower per-instance costs and achieve break-even at concurrency as low as 2–5 requests/sec. For workloads with bursty traffic or strict latency SLAs, SLMs deliver lower cost-per-validated-response because you avoid paying for unused batch slots.
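The idle-capacity argument can be made concrete for self-hosted instances. The sketch below assumes a flat hourly instance price and a steady request rate; the function names and the dollar figures are hypothetical placeholders, not quotes from any provider.

```python
def self_hosted_cost_per_request(instance_usd_per_hour, requests_per_sec):
    """Hourly instance price spread over the requests actually served."""
    return instance_usd_per_hour / (requests_per_sec * 3600)

def break_even_rps(instance_usd_per_hour, alt_cost_per_request):
    """Request rate at which self-hosting matches an alternative per-request cost."""
    return instance_usd_per_hour / (alt_cost_per_request * 3600)

SMALL_USD_PER_HOUR = 1.20   # hypothetical single-GPU SLM instance
LARGE_USD_PER_HOUR = 12.00  # hypothetical multi-GPU LLM instance

for rps in (1, 2, 5, 20):
    small = self_hosted_cost_per_request(SMALL_USD_PER_HOUR, rps)
    large = self_hosted_cost_per_request(LARGE_USD_PER_HOUR, rps)
    print(f"{rps:>2} req/s  small=${small:.6f}  large=${large:.6f} per request")

# Rate needed for each instance to reach a hypothetical $0.0001 per request.
print(f"small break-even: {break_even_rps(SMALL_USD_PER_HOUR, 0.0001):.1f} req/s")
print(f"large break-even: {break_even_rps(LARGE_USD_PER_HOUR, 0.0001):.1f} req/s")
```

Because the hourly price is fixed, per-request cost falls inversely with traffic, so the more expensive instance needs proportionally more sustained load before it pays for itself.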

### 4. Schema Constraints Reduce Output Variance

Output variability comes mainly from the sampling settings rather than from model size alone: a higher temperature and a looser top-p spread probability over more candidate tokens. For structured tasks (JSON extraction, SQL generation, form filling), that variability increases the rate of schema violations and retries.

SLMs with temperature=0 and constrained decoding (regex or grammar-based sampling) produce deterministic outputs that pass schema validation on the first attempt 95%+ of the time. This reduces retries, lowers cost-per-validated-response, and tightens p95 latency.
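To see how first-attempt pass rate feeds into retries, here is a small sketch. It assumes each attempt passes validation independently with the same probability, which holds only if a retry actually changes something (a reworded prompt or a sampled decode); a retry at temperature=0 with an identical prompt would simply repeat the same failure. The function name is hypothetical.

```python
def expected_attempts(first_pass_rate):
    """Mean attempts until one output validates, assuming independent attempts."""
    if not 0.0 < first_pass_rate <= 1.0:
        raise ValueError("first_pass_rate must be in (0, 1]")
    return 1.0 / first_pass_rate

for rate in (0.80, 0.90, 0.95, 0.99):
    print(f"pass rate {rate:.0%}: {expected_attempts(rate):.3f} attempts per validated response")
```

Under this assumption, every attempt beyond the first is paid for in both tokens and latency, which is why a few points of pass rate show up directly in cost-per-validated-response.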

---

## What You Should Do

### 1. Measure p95 Latency Over 5-Minute Windows

Track p95 latency (ms) in sliding 5-minute windows with sample size ≥100 requests. If your SLA is ≤200ms and p95 exceeds it, batching or model size is the bottleneck. Use SLMs as the default and escalate to larger models only when task-specific evaluation scores justify the latency cost.
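A sliding-window p95 with a minimum sample size can be tracked as follows. This is a minimal in-memory sketch: the class name is hypothetical, timestamps are assumed to arrive in non-decreasing order, and the percentile uses the nearest-rank definition. A production setup would more likely rely on the histogram facilities of its metrics system.

```python
from collections import deque

class SlidingP95:
    """p95 latency over a sliding time window, with a minimum sample size."""

    def __init__(self, window_s=300, min_samples=100):
        self.window_s = window_s
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp_s, latency_ms), oldest first

    def record(self, timestamp_s, latency_ms):
        self._samples.append((timestamp_s, latency_ms))
        self._evict(timestamp_s)

    def _evict(self, now_s):
        # Drop samples that fall outside the window ending at now_s.
        while self._samples and self._samples[0][0] <= now_s - self.window_s:
            self._samples.popleft()

    def p95(self, now_s):
        """Return p95 in ms, or None if the window holds too few samples."""
        self._evict(now_s)
        n = len(self._samples)
        if n < self.min_samples:
            return None
        ordered = sorted(latency for _, latency in self._samples)
        rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), nearest-rank
        return ordered[rank - 1]

def breaches_sla(p95_ms, sla_ms=200):
    """True only when there is enough data and p95 exceeds the SLA."""
    return p95_ms is not None and p95_ms > sla_ms
```

Returning `None` below the sample threshold keeps a quiet window from triggering an escalation on a handful of slow requests.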

### 2. Calculate Cost-Per-Validated-Response

Use this formula to compare models under your actual traffic:

```python
# Prices are per token; retry_rate and reject_rate are fractions of requests (0.05 = 5%).
cost_per_response = (input_tokens * price_in + output_tokens * price_out) * (1 + retry_rate + reject_rate)
```

Use your provider's current published prices. Measure retry_rate (schema validation failures) and reject_rate (guardrail blocks) from production logs over the past 7 days. If SLM cost-per-response is ≤50% of LLM cost at your target concurrency, default to SLMs.
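As a worked example of the formula and the 50% rule, the sketch below takes prices per million tokens, as they are usually quoted. Every number here (token counts, prices, retry and reject rates) is a hypothetical placeholder; replace them with your own measurements.

```python
def cost_per_validated_response(input_tokens, output_tokens,
                                usd_per_m_in, usd_per_m_out,
                                retry_rate, reject_rate):
    """Token cost of one call, inflated by retries and guardrail rejects."""
    base = (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000
    return base * (1 + retry_rate + reject_rate)

def default_to_slm(slm_cost, llm_cost, max_ratio=0.5):
    """Apply the rule: default to the SLM if it costs at most 50% of the LLM."""
    return slm_cost <= max_ratio * llm_cost

# Hypothetical workload: 800 input tokens and 200 output tokens per request.
slm = cost_per_validated_response(800, 200, usd_per_m_in=0.20, usd_per_m_out=0.40,
                                  retry_rate=0.03, reject_rate=0.01)
llm = cost_per_validated_response(800, 200, usd_per_m_in=3.00, usd_per_m_out=15.00,
                                  retry_rate=0.08, reject_rate=0.02)

print(f"SLM: ${slm:.6f} per validated response")
print(f"LLM: ${llm:.6f} per validated response")
print("Default to SLM:", default_to_slm(slm, llm))
```

The `(1 + retry_rate + reject_rate)` multiplier treats each retry or reject as one extra full-price call, which is a reasonable approximation while both rates are small.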

### 3. Enforce Schema Validation as the Acceptance Gate

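Set temperature=0 for greedy, deterministic decoding (top-p has no effect at temperature 0); if your task needs sampling, keep the temperature low and top-p between 0.8 and 0.9. Use constrained decoding libraries (e.g., Outlines, Guidance) to enforce JSON schemas or regex patterns during generation. Validate outputs against your schema before returning them to the user. If the validation pass rate drops below 95%, tighten the decoding constraints, lower top-p if you are sampling, or switch to a model fine-tuned on your schema examples.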
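The acceptance gate itself can be sketched with the standard library alone. The field names and types below are a hypothetical invoice schema, and the check covers only required keys and basic types; a full JSON Schema validator or a constrained-decoding library would replace it in practice.

```python
import json

# Hypothetical schema: required field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float), "currency": str}

def validate_output(raw_text, required=REQUIRED_FIELDS):
    """Return the parsed object if it satisfies the schema, else None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in required.items():
        if key not in data:
            return None
        value = data[key]
        # bool is a subclass of int in Python, so reject it unless asked for.
        if isinstance(value, bool) and expected_type is not bool:
            return None
        if not isinstance(value, expected_type):
            return None
    return data

def pass_rate(raw_outputs):
    """Fraction of raw model outputs that pass validation."""
    if not raw_outputs:
        return 0.0
    passed = sum(1 for raw in raw_outputs if validate_output(raw) is not None)
    return passed / len(raw_outputs)

samples = [
    '{"invoice_id": "A-17", "total": 12.5, "currency": "USD"}',
    '{"invoice_id": "A-18", "total": "12.5", "currency": "USD"}',  # wrong type
    'Sure! Here is the JSON you asked for...',                      # not JSON
]
rate = pass_rate(samples)
print(f"validation pass rate: {rate:.0%}, meets 95% gate: {rate >= 0.95}")
```

Logging the pass rate per model and per prompt version gives you the retry_rate input for the cost formula and an early signal when a prompt change degrades structure.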

### 4. Route Based on Thresholds, Not Defaults

Implement a routing policy in your API gateway or orchestration layer:

- **Default**: Route to SLM (1B–7B) for all requests.
- **Escalate to SLM-13B**: If the 13B model's eval_score on a fixed test set (measured weekly) beats the default SLM by ≥5 percentage points on your KPI (e.g., F1, exact match, user acceptance rate).
- **Escalate to LLM (70B+)**: Only if SLM-13B eval_score is ≥10 points below target **and** p95 latency SLA can relax to ≥500ms **and** cost-per-response increase is justified by revenue impact.

Encode these thresholds in CI/CD so model swaps trigger only when metrics cross decision boundaries.
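One way to encode the policy above is a pure function over measured metrics, which is easy to unit-test in CI. This is a sketch of one reading of the rules: the field names, tier labels, and the choice to check the LLM conditions first are assumptions, and scores are taken to be in percentage points.

```python
from dataclasses import dataclass

@dataclass
class RoutingMetrics:
    default_slm_eval: float    # eval score of the default SLM (1B-7B)
    slm13b_eval: float         # eval score of the 13B SLM on the same test set
    target_eval: float         # score the product needs to hit
    p95_sla_ms: float          # latency SLA the workload can tolerate
    llm_cost_justified: bool   # business sign-off on the higher cost-per-response

def choose_tier(m):
    """Return the model tier implied by the thresholds in the routing policy."""
    # Escalate to the LLM only when all three conditions hold.
    if (m.target_eval - m.slm13b_eval >= 10
            and m.p95_sla_ms >= 500
            and m.llm_cost_justified):
        return "llm-70b+"
    # Escalate to the 13B SLM when it beats the default by >= 5 points.
    if m.slm13b_eval - m.default_slm_eval >= 5:
        return "slm-13b"
    return "slm-default"

print(choose_tier(RoutingMetrics(80, 82, 85, 200, False)))
print(choose_tier(RoutingMetrics(70, 78, 85, 200, False)))
print(choose_tier(RoutingMetrics(60, 70, 85, 500, True)))
```

Keeping the decision in a single function means a CI job can recompute the tier from the weekly eval run and fail the pipeline if the deployed model no longer matches it.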

---

## Conclusion – Key Takeaways

**Core insight**: Batching-driven queueing delay and concurrency economics make SLMs the correct default for real-time, cost-sensitive workloads. Large models justify their latency and cost only when task complexity demands capabilities SLMs cannot match — and you've measured that gap with production metrics.

**Decision rule**: If your p95 latency target is ≤200ms, concurrency is <20 requests/sec, and privacy or cost constraints exist, default to SLMs. Escalate only when evaluation scores improve ≥5 points on your KPI and you can relax latency SLAs accordingly.

**When to care**:
- Your p95 latency exceeds SLA by >50ms
- Cost-per-validated-response is >2× your target unit economics
- Compliance requires on-premise inference or that PII never leaves your VPC
- Retry rates from schema violations exceed 10%

For deeper dives into related tradeoffs, see [Quantization Tradeoffs: When 8-bit Holds](/article/quantization-tradeoffs), [Retrieval Quality Over Model Size in RAG](/article/retrieval-quality-over-model-size), and [Routing Policies in CI/CD](/article/routing-policies-in-cicd).