Mofa1245/Continuum

Continuum

Continuum records AI workflow runs and replays them in CI—failing verification when outputs drift.

Demo

[Demo GIF: Continuum detects silent LLM drift in CI by replaying workflows and diffing outputs.]

[Badge: Continuum CI]

What Continuum Does

Continuum records AI workflow runs and replays them in CI. If any step output changes, verification fails. This prevents silent LLM output drift from reaching production systems.


Why This Exists

AI outputs change over time. Models update. Temperature mistakes happen. Prompt tweaks slip in. Silent drift breaks production systems — and you only notice when users complain.

Continuum catches that. Store a run once. Replay it from its stored recipe. Diff the outputs. If anything changed, verification fails. No guessing. No "it worked on my machine." CI-friendly exit codes and one command: verify-all.


Quick Start

Clone the repository and run the example pipeline.

npm install
npm run build
npx tsx examples/invoice-processor/pipeline.ts

Verify deterministic behavior:

node dist/cli/index.js verify-all --strict

Expected result:

  • ✓ invoice1 PASS
  • ✓ invoice2 PASS
  • ✓ invoice3 PASS

A passing result confirms the example replays deterministically in your environment.


Real Example: AI Invoice Processing

This example simulates an AI invoice extraction pipeline.

Steps:

  1. Raw invoice text is loaded.
  2. The LLM extracts structured fields.
  3. The output is parsed into JSON.
  4. Continuum records the run.
  5. verify-all --strict replays the workflow and checks for drift.

If any phase output changes, verification fails.
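The five steps above can be sketched framework-free. Everything below is illustrative: the function and field names (runPipeline, stepOutputs, the mock extraction) are assumptions for the sketch, not Continuum's actual API.

```typescript
// Minimal record-then-verify sketch. The "LLM" is a deterministic mock;
// names like stepOutputs are illustrative, not Continuum's real API.
type StepOutputs = Record<string, unknown>;

// A fake extraction step standing in for an LLM call.
function llmExtract(_invoiceText: string): string {
  return JSON.stringify({ vendor: "Acme Industrial Supply", amount: 72 });
}

function runPipeline(invoiceText: string): StepOutputs {
  const raw = llmExtract(invoiceText);          // phase: LLM_Call
  const parsed = JSON.parse(raw);               // phase: JSON_Parse
  return { llm_call: raw, json_parse: parsed }; // one recorded output per phase
}

// Record once, then replay and diff — any mismatch means drift.
const stored = runPipeline("Invoice #1 ... total due: 72 USD");
const replayed = runPipeline("Invoice #1 ... total due: 72 USD");
const drifted = JSON.stringify(stored) !== JSON.stringify(replayed);
console.log(drifted ? "drift detected" : "all phases match"); // all phases match
```

The real tool persists the stored outputs to disk (./runs) and performs a structural diff per phase, but the pass/fail decision is this same comparison.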

Problem

AI pipelines can silently drift when prompts change or models update.

For example, an invoice extraction system may originally output:

  • amount: 72

After a prompt tweak or model update it might return:

  • amount: "72.00"

This small change can break accounting pipelines or validation logic.
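To make the breakage concrete — a hedged sketch, not code from this repo — here is what downstream arithmetic does when it receives the drifted string instead of the number:

```typescript
// Downstream accounting code that assumes `amount` is a number.
// `extracted` simulates parsed LLM output; the field's type is the only change.
function addLateFee(extracted: { amount: any }): any {
  return extracted.amount + 1.5; // numeric add — or silent string concatenation
}

console.log(addLateFee({ amount: 72 }));      // 73.5 (correct)
console.log(addLateFee({ amount: "72.00" })); // "72.001.5" (corrupted, no crash)
```

Nothing throws, no test fails: the corrupted value flows onward, which is exactly the silent failure mode described above.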

Continuum prevents this by replaying workflow runs and detecting drift.

Example Workflow

[ Input: Raw Invoice ]
          ↓
[ Continuum Runner ] → { Phase: LLM_Call }
          ↓          → { Phase: JSON_Parse }
          ↓
       [ Stored Run ]
            ↕
[ verify-all --strict ]
            ↓
      (Replay & Diff)
            ↓
Exit 0 → Success
Exit 1 → Drift Detected

Run the Example

Run the invoice processor pipeline:

npx tsx examples/invoice-processor/pipeline.ts

Then verify deterministic behavior:

node dist/cli/index.js verify-all --strict

Expected result:

  • ✓ invoice1 PASS
  • ✓ invoice2 PASS
  • ✓ invoice3 PASS

Simulating Drift

Edit the prompt inside:

examples/invoice-processor/pipeline.ts

Change:

Extract invoice fields carefully.

to:

Extract invoice fields strictly in JSON.

Run again:

npx tsx examples/invoice-processor/pipeline.ts
node dist/cli/index.js verify-all --strict

Expected result:

  • Drift detected
  • verify-all failed

Continuum detects the change and fails verification before corrupted data reaches production systems.

Example drift detected by Continuum:

Drift detected in phase: json_parse

Path: json_parse.amount
  Stored: 72
  Current: "72.00"

Path: json_parse.vendor
  Stored: "Acme Industrial Supply"
  Current: "Acme Industrial"

This demonstrates the exact failure developers care about.


Drift Proof

With the mock provider, to see drift you must also simulate a model change: in src/llm/MockProvider.ts, make the invoice response return amount: "72.00" (string) instead of 72 for the invoice prompt, then run verify-all (without re-running the pipeline). With OpenAI (OPENAI_API_KEY set), changing the prompt alone can produce different output and trigger drift.


Using Continuum in CI

Continuum runs in CI by recording workflow runs and then replaying them. The repository includes a GitHub Actions workflow (.github/workflows/continuum-verify.yml) that:

  1. Checks out the repo, installs dependencies, and builds the project.
  2. Runs the example invoice pipeline so that runs are written to ./runs.
  3. Runs verify-all --strict to replay every stored run and diff outputs.

If any step output has changed since the run was recorded, the job fails. The runs/ directory is in .gitignore, so each CI run starts with a clean slate: the pipeline creates the runs, and verification confirms they replay identically. This prevents silent LLM output drift from reaching production.


Why LLM Drift Breaks Production Systems

In traditional software, 1 + 1 always equals 2. In LLM-integrated systems, the logic layer is non-deterministic. A subtle model update or a system prompt change—like adding "be concise"—can shift a structured extraction from an integer 72 to a string "72.00".

While this seems trivial, it is a silent failure. It won't trigger a standard crash; instead, it injects corrupted data into downstream accounting systems, database schemas, or automated payment triggers. By the time the drift is discovered, the production state is already compromised.

Deterministic Replay shifts the detection of these failures to the CI stage. By recording the granular phases of a successful run (raw tokens, JSON parsing, memory writes), Continuum creates a baseline of "known-good" behavior. Verification ensures that any deviation in the model's reasoning or output format is caught before deployment, treating LLM outputs as strictly as unit tests.
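The "as strictly as unit tests" claim can be made literal. A minimal sketch using Node's built-in assert module — the baseline object is hypothetical, standing in for a stored run:

```typescript
import { deepStrictEqual } from "node:assert";

// Hypothetical stored baseline from a known-good run.
const baseline = { vendor: "Acme Industrial Supply", amount: 72 };

// A replay that drifted: same value on the surface, different type.
const current = { vendor: "Acme Industrial Supply", amount: "72.00" };

try {
  deepStrictEqual(current, baseline); // strict: 72 !== "72.00"
  console.log("PASS");
} catch {
  console.log("FAIL: drift detected"); // the kind of mismatch verify-all --strict reports
}
```

deepStrictEqual is type-sensitive, so the integer-to-string shift that loose equality would wave through fails immediately.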


CI: Automatic Drift Detection

Continuum can run inside CI to prevent AI workflow drift from reaching production.

Example GitHub Actions workflow:

name: Continuum Verification

on: [push]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm run build
      - run: npx tsx examples/invoice-processor/pipeline.ts
      - run: node dist/cli/index.js verify-all --strict

If any workflow output changes, the CI job fails.

This prevents silent AI drift from reaching production systems.


GitHub Action

Continuum can run automatically in CI using GitHub Actions.

Example:

name: Verify AI Workflows

on: [push]

jobs:
  continuum:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: continuum-ai/verify@v1

Architecture

[ Agent Framework (LangGraph / CrewAI / Custom) ]
                    ↓
[ Continuum Adapter (Context Injection) ]
                    ↓
[ Deterministic Kernel (Memory + Replay) ]
                    ↓
[ Checkpoint-Based Storage ]

Key principle: AI is a consumer, not the brain. The kernel is deterministic.


CI: Drop-In Drift Check

Use Continuum in GitHub Actions to fail the build when any stored AI run has drifted.

name: AI Drift Check

on: [push, pull_request]

jobs:
  verify-ai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      - name: Install and build
        run: |
          npm ci
          npm run build

      - name: Verify AI runs
        run: node dist/cli/index.js verify-all --strict

If any run in ./runs no longer matches a re-execution from its stored recipe, the job fails. No silent drift.

    ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
    │  Run once   │ ──► │  Store run  │ ──► │  CI: verify  │
    │ (invoice-   │     │  (./runs/)  │     │  verify-all  │
    │  demo)      │     └─────────────┘     └──────┬───────┘
    └─────────────┘                                │
                                    pass ──► exit 0   exit 1 ◄── drift

Invoice Demo (What the GIF Shows)

The demo that proves the value: structured extraction from messy text, then verify that it never silently changes.

  1. Run node dist/cli/index.js invoice-demo
    Extracts vendor, amount, currency, due_date from sample invoice text. Stores the run under ./runs.

  2. Verify node dist/cli/index.js verify <runId> --strict
    Replays the run from its stored recipe. Output matches → PASS, exit 0.

  3. Tamper — Open runs/<runId>.json, change "amount": 72 to "amount": 99 in stepOutputs.json_parse, save.

  4. Verify again — Same command. Replay still returns 72. Stored says 99. FAIL, exit 1. Drift reported: Path: json_parse.amount, Stored vs Current.

If your model (or a bad deploy) ever extracts the wrong amount, CI fails. That’s the guard.


Core Concepts

Stored run — Each run is saved as a JSON file in ./runs with a recipe (task, provider, model, temperature) and stepOutputs. Replay re-executes from the recipe and diffs against stepOutputs.

Verify — continuum verify <runId> --strict replays one run. continuum verify-all --strict replays every run in ./runs. Any mismatch → exit 1.

Recipe — Execution metadata (provider, model, temperature, task). Replay uses only the recipe and stored input; no CLI overrides. Historically faithful.
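Putting these concepts together, a stored run file might look like the following. The exact schema is an assumption; only the field names mentioned above (recipe, stepOutputs, task, provider, model, temperature) come from this README.

```typescript
// Hypothetical shape of runs/<runId>.json; the schema is inferred from the
// descriptions above, not copied from Continuum's source.
interface StoredRun {
  recipe: {
    task: string;
    provider: string;    // e.g. "mock" or "openai"
    model: string;
    temperature: number; // replay reuses this — no CLI overrides
  };
  stepOutputs: Record<string, unknown>; // one entry per recorded phase
}

const example: StoredRun = {
  recipe: { task: "invoice-extract", provider: "mock", model: "mock-1", temperature: 0 },
  stepOutputs: {
    llm_call: '{"vendor":"Acme Industrial Supply","amount":72}',
    json_parse: { vendor: "Acme Industrial Supply", amount: 72 },
  },
};
console.log(example.recipe.provider); // mock
```

Replay re-executes the task from example.recipe alone and diffs the fresh outputs against example.stepOutputs.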


Other Commands

  • continuum llm-demo — Weather-style LLM call → JSON parse → memory write (mock or OpenAI).
  • continuum demo — 4-step agent demo with optional crash/recovery.
  • continuum replay <runId> — Replay with full diff output.
  • continuum diff <runIdA> <runIdB> — Compare two stored runs.

Who This Is For

  • Infra engineers shipping LLM-backed features
  • Teams that need to catch silent output drift in CI
  • Anyone extracting structured data (invoices, tickets, etc.) and unwilling to risk wrong numbers in production

Documentation


License

Continuum Non-Commercial Source License v1.0.
Commercial use requires separate permission from the author.


Contributing

See CONTRIBUTING.md for development setup and architecture decisions.


Primary author: Mohammed Al-Hajri. Developed with AI assistance under human direction, review, and validation.
