Continuum records AI workflow runs and replays them in CI, diffing each step's output against the stored baseline. If any step output changes, verification fails, preventing silent LLM drift from reaching production systems.
AI outputs change over time. Models update. Temperature mistakes happen. Prompt tweaks slip in. Silent drift breaks production systems — and you only notice when users complain.
Continuum catches that. Store a run once. Replay it from its stored recipe. Diff the outputs. If anything changed, verification fails. No guessing. No "it worked on my machine." CI-friendly exit codes and one command: verify-all.
Clone the repository and run the example pipeline.
npm install
npm run build
npx tsx examples/invoice-processor/pipeline.ts
Verify deterministic behavior:
node dist/cli/index.js verify-all --strict
Expected result:
- ✓ invoice1 PASS
- ✓ invoice2 PASS
- ✓ invoice3 PASS
This ensures developers can reproduce the example quickly.
This example simulates an AI invoice extraction pipeline.
Steps:
- Raw invoice text is loaded.
- The LLM extracts structured fields.
- The output is parsed into JSON.
- Continuum records the run.
verify-all --strict replays the workflow and checks for drift.
If any phase output changes, verification fails.
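The JSON-parse phase above can be sketched as follows. The InvoiceFields shape and the parseInvoice helper are illustrative, not Continuum's actual API; the point is that the parse step enforces types, so a drifted string amount fails loudly instead of flowing downstream.

```typescript
// Hypothetical sketch of the JSON_Parse phase: the raw LLM completion is
// parsed and type-checked before being recorded as a step output.
type InvoiceFields = {
  vendor: string;
  amount: number; // strict: must be a number, not "72.00"
  currency: string;
  due_date: string;
};

function parseInvoice(rawCompletion: string): InvoiceFields {
  const parsed = JSON.parse(rawCompletion);
  if (typeof parsed.amount !== "number") {
    // A model or prompt change turned the amount into a string: reject it.
    throw new Error(`amount drifted to type ${typeof parsed.amount}`);
  }
  return parsed as InvoiceFields;
}
```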
Problem
AI pipelines can silently drift when prompts change or models update.
For example, an invoice extraction system may originally output:
amount: 72
After a prompt tweak or model update it might return:
amount: "72.00"
This small change can break accounting pipelines or validation logic.
Continuum prevents this by replaying workflow runs and detecting drift.
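The kind of comparison that catches this must be type-sensitive, so that 72 and "72.00" are not treated as equal. A minimal sketch (the diffOutputs helper is hypothetical, not Continuum's implementation):

```typescript
// Recursively compare stored vs. replayed outputs with strict equality,
// collecting a drift entry for every path whose value or type changed.
function diffOutputs(
  stored: Record<string, unknown>,
  current: Record<string, unknown>,
  prefix = ""
): string[] {
  const drift: string[] = [];
  const keys = new Set([...Object.keys(stored), ...Object.keys(current)]);
  for (const key of keys) {
    const path = prefix ? `${prefix}.${key}` : key;
    const a = stored[key];
    const b = current[key];
    if (typeof a === "object" && a !== null && typeof b === "object" && b !== null) {
      drift.push(...diffOutputs(a as Record<string, unknown>, b as Record<string, unknown>, path));
    } else if (a !== b) {
      // Strict inequality flags both value changes and type changes.
      drift.push(`${path}: stored ${JSON.stringify(a)} !== current ${JSON.stringify(b)}`);
    }
  }
  return drift;
}
```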
Example Workflow
[ Input: Raw Invoice ]
↓
[ Continuum Runner ] → { Phase: LLM_Call }
↓ → { Phase: JSON_Parse }
↓
[ Stored Run ]
↕
[ verify-all --strict ]
↓
(Replay & Diff)
↓
Exit 0 → Success
Exit 1 → Drift Detected
Run the Example
Run the invoice processor pipeline:
npx tsx examples/invoice-processor/pipeline.ts
Then verify deterministic behavior:
node dist/cli/index.js verify-all --strict
Expected result:
- ✓ invoice1 PASS
- ✓ invoice2 PASS
- ✓ invoice3 PASS
Simulating Drift
Edit the prompt inside:
examples/invoice-processor/pipeline.ts
Change:
Extract invoice fields carefully.
to:
Extract invoice fields strictly in JSON.
Run again:
npx tsx examples/invoice-processor/pipeline.ts
node dist/cli/index.js verify-all --strict
Expected result:
- Drift detected
- verify-all failed
Continuum detects the change and fails verification before corrupted data reaches production systems.
Example drift detected by Continuum:
Drift detected in phase: json_parse
Path: json_parse.amount
Stored: 72
Current: "72.00"
Path: json_parse.vendor
Stored: "Acme Industrial Supply"
Current: "Acme Industrial"
This demonstrates the exact failure developers care about.
With the mock provider, to see drift you must also simulate a model change: in src/llm/MockProvider.ts, make the invoice response return amount: "72.00" (string) instead of 72 for the invoice prompt, then run verify-all (without re-running the pipeline). With OpenAI (OPENAI_API_KEY set), changing the prompt alone can produce different output and trigger drift.
Continuum runs in CI by recording workflow runs and then replaying them. The repository includes a GitHub Actions workflow (.github/workflows/continuum-verify.yml) that:
- Checks out the repo, installs dependencies, and builds the project.
- Runs the example invoice pipeline so that runs are written to ./runs.
- Runs verify-all --strict to replay every stored run and diff outputs.
If any step output has changed since the run was recorded, the job fails. The runs/ directory is in .gitignore, so each CI run starts with a clean slate: the pipeline creates the runs, and verification confirms they replay identically. This prevents silent LLM output drift from reaching production.
In traditional software, 1 + 1 always equals 2. In LLM-integrated systems, the logic layer is non-deterministic. A subtle model update or a system prompt change—like adding "be concise"—can shift a structured extraction from an integer 72 to a string "72.00".
While this seems trivial, it is a silent failure. It won't trigger a standard crash; instead, it injects corrupted data into downstream accounting systems, database schemas, or automated payment triggers. By the time the drift is discovered, the production state is already compromised.
Deterministic Replay shifts the detection of these failures to the CI stage. By recording the granular phases of a successful run (raw tokens, JSON parsing, memory writes), Continuum creates a baseline of "known-good" behavior. Verification ensures that any deviation in the model's reasoning or output format is caught before deployment, treating LLM outputs as strictly as unit tests.
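The verification step described above can be sketched as a loop that replays each recorded phase and asserts its fresh output against the stored baseline, exactly like a unit-test assertion. Names here are illustrative, not Continuum's internals:

```typescript
import { deepStrictEqual } from "node:assert";

type PhaseFn = (input: unknown) => unknown;

// Replay each phase in order; every output must deep-equal the baseline.
// deepStrictEqual is type-sensitive, so 72 !== "72.00".
function verifyRun(
  phases: Record<string, PhaseFn>,
  input: unknown,
  baseline: Record<string, unknown>
): boolean {
  let current: unknown = input;
  for (const [name, fn] of Object.entries(phases)) {
    current = fn(current);
    try {
      deepStrictEqual(current, baseline[name]);
    } catch {
      return false; // drift detected in this phase
    }
  }
  return true;
}
```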
Continuum can run inside CI to prevent AI workflow drift from reaching production.
Example GitHub Actions workflow:
name: Continuum Verification
on: [push]
jobs:
verify:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm install
- run: npm run build
- run: npx tsx examples/invoice-processor/pipeline.ts
- run: node dist/cli/index.js verify-all --strict
If any workflow output changes, the CI job fails.
This prevents silent AI drift from reaching production systems.
[ Agent Framework (LangGraph / CrewAI / Custom) ]
↓
[ Continuum Adapter (Context Injection) ]
↓
[ Deterministic Kernel (Memory + Replay) ]
↓
[ Checkpoint-Based Storage ]
Key principle: AI is a consumer, not the brain. The kernel is deterministic.
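The deterministic-kernel principle can be illustrated with a pure replay function over checkpoints. The Checkpoint shape is an assumption for illustration; what matters is that no model call appears on the restore path, so the same checkpoints always yield the same state:

```typescript
type Checkpoint = { step: number; memory: Record<string, unknown> };

// Deterministic fold over checkpoints: later writes win, no LLM in the loop.
function restore(checkpoints: Checkpoint[]): Record<string, unknown> {
  return checkpoints.reduce<Record<string, unknown>>(
    (memory, cp) => ({ ...memory, ...cp.memory }),
    {}
  );
}
```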
Use Continuum in GitHub Actions to fail the build when any stored AI run has drifted.
name: AI Drift Check
on: [push, pull_request]
jobs:
verify-ai:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: "20"
cache: "npm"
- name: Install and build
run: |
npm ci
npm run build
- name: Verify AI runs
run: node dist/cli/index.js verify-all --strict
If any run in ./runs no longer matches a re-execution from its stored recipe, the job fails. No silent drift.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Run once │ ──► │ Store run │ ──► │ CI: verify │
│ (invoice- │ │ (./runs/) │ │ verify-all │
│ demo) │ └─────────────┘ └──────┬──────┘
└─────────────┘ │
pass ──► exit 0 exit 1 ◄── drift
The demo that proves the value: structured extraction from messy text, then verify that it never silently changes.
- Run
  node dist/cli/index.js invoice-demo
  Extracts vendor, amount, currency, due_date from sample invoice text. Stores the run under ./runs.
- Verify
  node dist/cli/index.js verify <runId> --strict
  Replays the run from its stored recipe. Output matches → PASS, exit 0.
- Tamper — Open runs/<runId>.json, change "amount": 72 to "amount": 99 in stepOutputs.json_parse, save.
- Verify again — Same command. Replay still returns 72. Stored says 99. FAIL, exit 1. Drift reported: Path: json_parse.amount, Stored vs Current.
If your model (or a bad deploy) ever extracts the wrong amount, CI fails. That’s the guard.
Stored run — Each run is saved as a JSON file in ./runs with a recipe (task, provider, model, temperature) and stepOutputs. Replay re-executes from the recipe and diffs against stepOutputs.
Verify — continuum verify <runId> --strict replays one run. continuum verify-all --strict replays every run in ./runs. Any mismatch → exit 1.
Recipe — Execution metadata (provider, model, temperature, task). Replay uses only the recipe and stored input; no CLI overrides. Historically faithful.
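Putting the terms above together, a stored run file can be modeled roughly as follows. The field names (recipe, stepOutputs, task, provider, model, temperature) come from the description above; the exact on-disk layout is an assumption for illustration:

```typescript
// Illustrative model of a run file in ./runs: replay re-executes from
// `recipe` and diffs the fresh outputs against `stepOutputs`.
interface StoredRun {
  recipe: {
    task: string;
    provider: string;
    model: string;
    temperature: number;
  };
  stepOutputs: Record<string, unknown>;
}

const example: StoredRun = {
  recipe: { task: "invoice-demo", provider: "mock", model: "mock-1", temperature: 0 },
  stepOutputs: {
    json_parse: { vendor: "Acme Industrial Supply", amount: 72 },
  },
};
```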
- continuum llm-demo — Weather-style LLM call → JSON parse → memory write (mock or OpenAI).
- continuum demo — 4-step agent demo with optional crash/recovery.
- continuum replay <runId> — Replay with full diff output.
- continuum diff <runIdA> <runIdB> — Compare two stored runs.
- Infra engineers shipping LLM-backed features
- Teams that need to catch silent output drift in CI
- Anyone extracting structured data (invoices, tickets, etc.) and unwilling to risk wrong numbers in production
- docs/README — Documentation index
- What We Guarantee — Determinism contract
- How This Fails — Failure modes
- What We Don't Do — Non-goals
Continuum Non-Commercial Source License v1.0.
Commercial use requires separate permission from the author.
See CONTRIBUTING.md for development setup and architecture decisions.
Primary author: Mohammed Al-Hajri. Developed with AI assistance under human direction, review, and validation.

